## Credit risk model

The goal of this project is to use the available information to predict which clients are most likely to default on their loans.

## Required frameworks

In [3]:
!pip install polars



In [4]:
!pip install lightgbm



In [5]:
!pip install --upgrade pandas



In [6]:
!pip install --upgrade dask lightgbm

Collecting dask
  Downloading dask-2024.5.1-py3-none-any.whl.metadata (3.8 kB)
Collecting lightgbm
  Downloading lightgbm-4.3.0-py3-none-manylinux_2_28_x86_64.whl.metadata (19 kB)
Downloading dask-2024.5.1-py3-none-any.whl (1.2 MB)
Downloading lightgbm-4.3.0-py3-none-manylinux_2_28_x86_64.whl (3.1 MB)
Installing collected packages: lightgbm, dask
  Attempting uninstall: lightgbm
    Found existing installation: lightgbm 4.2.0
    Uninstalling lightgbm-4.2.0:
      Successfully uninstalled lightgbm-4.2.0
  Attempting uninstall: dask
    Found existing installation: dask 2024.4.1
    Uninstalling dask-2024.4.1:
      Successfully uninstalled dask-2024.4.1
ERROR: pip's dependency resolver does not currently take into accoun

## Loading the data

In [15]:
import dask
import os
import polars as pl
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score 

#dataPath = "C:/Users/astri/OneDrive/Documents/M2 DS/PROJETS PERSO/KAGGLE/"
dataPath = "/kaggle/input/home-credit-credit-risk-model-stability/"

In [16]:
# Set the correct data path
data_path = "C:/Users/astri/OneDrive/Documents/M2 DS/PROJETS PERSO/KAGGLE/csv_files/train/"

# Check if the file exists
file_path = os.path.join(data_path, "train_base.csv")
if os.path.exists(file_path):
    # Load the CSV file
    train_basetable = pl.read_csv(file_path)
    print("File loaded successfully!")
else:
    print("File not found. Please check the file path.")


Data path: C:/Users/astri/OneDrive/Documents/M2 DS/PROJETS PERSO/KAGGLE/csv_files/train/
File not found. Please check the file path.
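The "File not found" message above appears because the cell points at a local Windows folder while the notebook also defines a Kaggle input path. One way to avoid editing paths by hand is to try several candidate roots and keep the first one that exists. This is only a sketch: `first_existing_dir` and `candidate_roots` are hypothetical names introduced for illustration, and the two paths are simply the ones that already appear in the cells above.

```python
from pathlib import Path
from typing import Iterable, Optional


def first_existing_dir(candidates: Iterable[str]) -> Optional[Path]:
    """Return the first candidate that exists as a directory, else None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_dir():
            return path
    return None


# Candidate roots copied from the cells above: Kaggle first, then the local copy.
candidate_roots = [
    "/kaggle/input/home-credit-credit-risk-model-stability/",
    "C:/Users/astri/OneDrive/Documents/M2 DS/PROJETS PERSO/KAGGLE/",
]

data_root = first_existing_dir(candidate_roots)
if data_root is None:
    print("No data directory found. Please check the candidate paths.")
else:
    print(f"Using data root: {data_root}")
```

The same notebook can then run on Kaggle or locally, provided one of the listed folders exists.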


The following code snippets make it easy to load the data.

In [18]:
def set_table_dtypes(df: pl.DataFrame) -> pl.DataFrame:
    # implement here all desired dtypes for tables
    # the following is just an example
    for col in df.columns:
        # last letter of column name will help you determine the type
        if col[-1] in ("P", "A"):
            df = df.with_columns(pl.col(col).cast(pl.Float64).alias(col))

    return df

def convert_strings(df: pd.DataFrame) -> pd.DataFrame:
    for col in df.columns:  
        if df[col].dtype.name in ['object', 'string']:
            df[col] = df[col].astype("string").astype('category')
            current_categories = df[col].cat.categories
            new_categories = current_categories.to_list() + ["Unknown"]
            new_dtype = pd.CategoricalDtype(categories=new_categories, ordered=True)
            df[col] = df[col].astype(new_dtype)
    return df


train_basetable = pl.read_csv(dataPath + "csv_files/train/train_base.csv")
train_static = pl.concat(
    [
        pl.read_csv(dataPath + "csv_files/train/train_static_0_0.csv").pipe(set_table_dtypes),
        pl.read_csv(dataPath + "csv_files/train/train_static_0_1.csv").pipe(set_table_dtypes),
    ],
    how="vertical_relaxed",
)
train_static_cb = pl.read_csv(dataPath + "csv_files/train/train_static_cb_0.csv").pipe(set_table_dtypes)
train_person_1 = pl.read_csv(dataPath + "csv_files/train/train_person_1.csv").pipe(set_table_dtypes) 
train_credit_bureau_b_2 = pl.read_csv(dataPath + "csv_files/train/train_credit_bureau_b_2.csv").pipe(set_table_dtypes) 


In [21]:
test_basetable = pl.read_csv(dataPath + "csv_files/test/test_base.csv")
test_static = pl.concat(
    [
        pl.read_csv(dataPath + "csv_files/test/test_static_0_0.csv").pipe(set_table_dtypes),
        pl.read_csv(dataPath + "csv_files/test/test_static_0_1.csv").pipe(set_table_dtypes),
        pl.read_csv(dataPath + "csv_files/test/test_static_0_2.csv").pipe(set_table_dtypes),
    ],
    how="vertical_relaxed",
)
test_static_cb = pl.read_csv(dataPath + "csv_files/test/test_static_cb_0.csv").pipe(set_table_dtypes)
test_person_1 = pl.read_csv(dataPath + "csv_files/test/test_person_1.csv").pipe(set_table_dtypes) 
test_credit_bureau_b_2 = pl.read_csv(dataPath + "csv_files/test/test_credit_bureau_b_2.csv").pipe(set_table_dtypes) 

In [22]:
## Data processing

In [23]:
# We need to use aggregation functions in tables with depth > 1, so tables that contain num_group1 column or 
# also num_group2 column.
train_person_1_feats_1 = train_person_1.group_by("case_id").agg(
    pl.col("mainoccupationinc_384A").max().alias("mainoccupationinc_384A_max"),
    (pl.col("incometype_1044T") == "SELFEMPLOYED").max().alias("mainoccupationinc_384A_any_selfemployed")
)

# Here num_group1=0 has special meaning, it is the person who applied for the loan.
train_person_1_feats_2 = train_person_1.select(["case_id", "num_group1", "housetype_905L"]).filter(
    pl.col("num_group1") == 0
).drop("num_group1").rename({"housetype_905L": "person_housetype"})

# Here we have num_group1 and num_group2, so we need to aggregate again.
train_credit_bureau_b_2_feats = train_credit_bureau_b_2.group_by("case_id").agg(
    pl.col("pmts_pmtsoverdue_635A").max().alias("pmts_pmtsoverdue_635A_max"),
    (pl.col("pmts_dpdvalue_108P") > 31).max().alias("pmts_dpdvalue_108P_over31")
)

# In this example we only process A-type and M-type columns, so we need to select them.
selected_static_cols = []
for col in train_static.columns:
    if col[-1] in ("A", "M"):
        selected_static_cols.append(col)
print(selected_static_cols)

selected_static_cb_cols = []
for col in train_static_cb.columns:
    if col[-1] in ("A", "M"):
        selected_static_cb_cols.append(col)
print(selected_static_cb_cols)

# Join all tables together.
data = train_basetable.join(
    train_static.select(["case_id"]+selected_static_cols), how="left", on="case_id"
).join(
    train_static_cb.select(["case_id"]+selected_static_cb_cols), how="left", on="case_id"
).join(
    train_person_1_feats_1, how="left", on="case_id"
).join(
    train_person_1_feats_2, how="left", on="case_id"
).join(
    train_credit_bureau_b_2_feats, how="left", on="case_id"
)

['amtinstpaidbefduel24m_4187115A', 'annuity_780A', 'annuitynextmonth_57A', 'avginstallast24m_3658937A', 'avglnamtstart24m_4525187A', 'avgoutstandbalancel6m_4187114A', 'avgpmtlast12m_4525200A', 'credamount_770A', 'currdebt_22A', 'currdebtcredtyperange_828A', 'disbursedcredamount_1113A', 'downpmt_116A', 'inittransactionamount_650A', 'lastapprcommoditycat_1041M', 'lastapprcommoditytypec_5251766M', 'lastapprcredamount_781A', 'lastcancelreason_561M', 'lastotherinc_902A', 'lastotherlnsexpense_631A', 'lastrejectcommoditycat_161M', 'lastrejectcommodtypec_5251769M', 'lastrejectcredamount_222A', 'lastrejectreason_759M', 'lastrejectreasonclient_4145040M', 'maininc_215A', 'maxannuity_159A', 'maxannuity_4075009A', 'maxdebt4_972A', 'maxinstallast24m_3658928A', 'maxlnamtstart6m_4525199A', 'maxoutstandbalancel12m_4187113A', 'maxpmtlast3m_4525190A', 'previouscontdistrict_112M', 'price_1097A', 'sumoutstandtotal_3546847A', 'sumoutstandtotalest_4493215A', 'totaldebt_9A', 'totalsettled_863A', 'totinstallas

In [24]:
test_person_1_feats_1 = test_person_1.group_by("case_id").agg(
    pl.col("mainoccupationinc_384A").max().alias("mainoccupationinc_384A_max"),
    (pl.col("incometype_1044T") == "SELFEMPLOYED").max().alias("mainoccupationinc_384A_any_selfemployed")
)

test_person_1_feats_2 = test_person_1.select(["case_id", "num_group1", "housetype_905L"]).filter(
    pl.col("num_group1") == 0
).drop("num_group1").rename({"housetype_905L": "person_housetype"})

test_credit_bureau_b_2_feats = test_credit_bureau_b_2.group_by("case_id").agg(
    pl.col("pmts_pmtsoverdue_635A").max().alias("pmts_pmtsoverdue_635A_max"),
    (pl.col("pmts_dpdvalue_108P") > 31).max().alias("pmts_dpdvalue_108P_over31")
)

data_submission = test_basetable.join(
    test_static.select(["case_id"]+selected_static_cols), how="left", on="case_id"
).join(
    test_static_cb.select(["case_id"]+selected_static_cb_cols), how="left", on="case_id"
).join(
    test_person_1_feats_1, how="left", on="case_id"
).join(
    test_person_1_feats_2, how="left", on="case_id"
).join(
    test_credit_bureau_b_2_feats, how="left", on="case_id"
)

In [25]:
case_ids = data["case_id"].unique().shuffle(seed=1)
case_ids_train, case_ids_test = train_test_split(case_ids, train_size=0.6, random_state=1)
case_ids_valid, case_ids_test = train_test_split(case_ids_test, train_size=0.5, random_state=1)

cols_pred = []
for col in data.columns:
    if col[-1].isupper() and col[:-1].islower():
        cols_pred.append(col)

print(cols_pred)

def from_polars_to_pandas(case_ids) -> tuple[pd.DataFrame, pd.DataFrame, pd.Series]:
    return (
        data.filter(pl.col("case_id").is_in(case_ids))[["case_id", "WEEK_NUM", "target"]].to_pandas(),
        data.filter(pl.col("case_id").is_in(case_ids))[cols_pred].to_pandas(),
        data.filter(pl.col("case_id").is_in(case_ids))["target"].to_pandas()
    )

base_train, X_train, y_train = from_polars_to_pandas(case_ids_train)
base_valid, X_valid, y_valid = from_polars_to_pandas(case_ids_valid)
base_test, X_test, y_test = from_polars_to_pandas(case_ids_test)

for df in [X_train, X_valid, X_test]:
    df = convert_strings(df)
    

['amtinstpaidbefduel24m_4187115A', 'annuity_780A', 'annuitynextmonth_57A', 'avginstallast24m_3658937A', 'avglnamtstart24m_4525187A', 'avgoutstandbalancel6m_4187114A', 'avgpmtlast12m_4525200A', 'credamount_770A', 'currdebt_22A', 'currdebtcredtyperange_828A', 'disbursedcredamount_1113A', 'downpmt_116A', 'inittransactionamount_650A', 'lastapprcommoditycat_1041M', 'lastapprcommoditytypec_5251766M', 'lastapprcredamount_781A', 'lastcancelreason_561M', 'lastotherinc_902A', 'lastotherlnsexpense_631A', 'lastrejectcommoditycat_161M', 'lastrejectcommodtypec_5251769M', 'lastrejectcredamount_222A', 'lastrejectreason_759M', 'lastrejectreasonclient_4145040M', 'maininc_215A', 'maxannuity_159A', 'maxannuity_4075009A', 'maxdebt4_972A', 'maxinstallast24m_3658928A', 'maxlnamtstart6m_4525199A', 'maxoutstandbalancel12m_4187113A', 'maxpmtlast3m_4525190A', 'previouscontdistrict_112M', 'price_1097A', 'sumoutstandtotal_3546847A', 'sumoutstandtotalest_4493215A', 'totaldebt_9A', 'totalsettled_863A', 'totinstallas

In [29]:
print(f"Train: {X_train.shape}")
print(f"Valid: {X_valid.shape}")
print(f"Test: {X_test.shape}")

Train: (915995, 48)
Valid: (305332, 48)
Test: (305332, 48)


The two datasets X_train and base_train are used together to train a model. X_train contains the features the model is trained on, while base_train contains the target labels (the values to predict) along with the associated scores.

In [123]:
X_train

Unnamed: 0,amtinstpaidbefduel24m_4187115A,annuity_780A,annuitynextmonth_57A,avginstallast24m_3658937A,avglnamtstart24m_4525187A,avgoutstandbalancel6m_4187114A,avgpmtlast12m_4525200A,credamount_770A,currdebt_22A,currdebtcredtyperange_828A,...,totinstallast1m_4525188A,description_5085714M,education_1103M,education_88M,maritalst_385M,maritalst_893M,pmtaverage_3A,pmtaverage_4527227A,pmtaverage_4955615A,pmtssum_45A
0,,3390.199951,0.000000,,,,,44000.0,0.000000,0.000,...,,-1.0,-1.0,-1.0,-1.0,-1.0,,,,
1,,9568.600586,0.000000,,,,,100000.0,0.000000,0.000,...,,-1.0,-1.0,-1.0,-1.0,-1.0,,,,
2,,5109.600098,0.000000,,,,,80000.0,0.000000,0.000,...,,-1.0,-1.0,-1.0,-1.0,-1.0,,,,
3,,2581.000000,0.000000,,,,,28000.0,0.000000,0.000,...,,-1.0,-1.0,-1.0,-1.0,-1.0,,,,
4,,2400.000000,0.000000,,,,,40000.0,0.000000,0.000,...,,-1.0,-1.0,-1.0,-1.0,-1.0,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
305327,119089.992188,4138.399902,0.000000,5671.000000,,20909.320312,6878.800293,40000.0,0.000000,0.000,...,12445.384766,0.0,2.0,3.0,0.0,3.0,,,16097.200195,
305328,0.000000,4747.200195,0.000000,,,,,60000.0,0.000000,0.000,...,,0.0,1.0,0.0,0.0,1.0,,,,
305329,335469.250000,7088.600098,7216.000000,13376.600586,,109874.585938,14549.000000,100000.0,87968.875000,87968.875,...,7216.000000,0.0,2.0,3.0,0.0,3.0,,,20508.201172,
305330,169487.718750,4960.800293,2717.199951,7369.000000,,12492.797852,10033.200195,60000.0,7647.200195,0.000,...,2717.199951,0.0,2.0,3.0,0.0,3.0,,,,


In [128]:
X_train.describe()

Unnamed: 0,amtinstpaidbefduel24m_4187115A,annuity_780A,annuitynextmonth_57A,avginstallast24m_3658937A,avglnamtstart24m_4525187A,avgoutstandbalancel6m_4187114A,avgpmtlast12m_4525200A,credamount_770A,currdebt_22A,currdebtcredtyperange_828A,...,totinstallast1m_4525188A,description_5085714M,education_1103M,education_88M,maritalst_385M,maritalst_893M,pmtaverage_3A,pmtaverage_4527227A,pmtaverage_4955615A,pmtssum_45A
count,579122.0,915995.0,915994.0,541137.0,97373.0,411376.0,299622.0,915995.0,915994.0,915994.0,...,211305.0,915995.0,915995.0,915995.0,915995.0,915995.0,86118.0,68821.0,43159.0,343834.0
mean,56051.03,4039.998047,1438.738403,5399.34082,44660.792969,45983.77,6398.227051,49875.789062,19714.37,11017.32,...,10417.49707,0.845033,2.160383,2.904335,1.423946,2.902839,9314.902344,10052.182617,17616.445312,13232.382812
std,71688.48,3011.037109,2809.423584,6436.574707,44763.152344,64416.92,9204.316406,44182.878906,50906.23,36837.03,...,16187.276367,0.406428,1.061152,0.584711,1.260771,0.57729,5568.039062,5530.822266,6777.177734,18190.787109
min,0.0,83.0,0.0,0.0,0.0,-7588198.0,0.0,2000.0,0.0,0.0,...,0.222,-1.0,-1.0,-1.0,-1.0,-1.0,0.0,5.0,6.0,0.0
25%,7451.65,1968.200073,0.0,2529.800049,15685.799805,8719.109,2595.199951,19998.0,0.0,0.0,...,3314.600098,1.0,1.0,3.0,0.0,3.0,6590.600098,7192.200195,13649.400391,3167.616211
50%,29755.65,3151.600098,0.0,4071.600098,28460.800781,22783.38,4422.0,35190.0,0.0,0.0,...,6216.0,1.0,3.0,3.0,2.0,3.0,7305.899902,7553.0,15765.600586,8400.0
75%,76525.77,5231.399902,2038.800049,6553.0,56377.800781,55369.72,7516.200195,63984.0,13556.95,0.0,...,11697.600586,1.0,3.0,3.0,2.0,3.0,13027.474609,13464.400391,21829.5,17005.658203
max,1198913.0,91601.398438,71878.601562,496148.8125,513520.0,1131136.0,495910.40625,950000.0,1210629.0,1028338.0,...,794899.1875,1.0,4.0,4.0,5.0,5.0,145257.40625,205848.609375,99085.398438,476843.40625


In [126]:
base_train

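| | case_id | WEEK_NUM | target | score |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0.043662 |
| 1 | 2 | 0 | 0 | 0.048437 |
| 2 | 5 | 0 | 0 | 0.030461 |
| 3 | 6 | 0 | 0 | 0.072764 |
| 4 | 7 | 0 | 0 | 0.026860 |
| ... | ... | ... | ... | ... |
| 915990 | 2703449 | 91 | 0 | 0.076109 |
| 915991 | 2703450 | 91 | 0 | 0.005971 |
| 915992 | 2703452 | 91 | 0 | 0.024759 |
| 915993 | 2703453 | 91 | 0 | 0.005225 |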


In [134]:
base_train.isnull().sum()

case_id     0
WEEK_NUM    0
target      0
score       0
dtype: int64

In [136]:
base_train.describe()

| | case_id | WEEK_NUM | target | score |
|---|---|---|---|---|
| count | 915995.0 | 915995.0 | 915995.0 | 915995.0 |
| mean | 1285867.0 | 40.76614 | 0.03152 | 0.031487 |
| std | 718984.3 | 23.798236 | 0.174718 | 0.030791 |
| min | 0.0 | 0.0 | 0.0 | 0.001321 |
| 25% | 765812.5 | 23.0 | 0.0 | 0.01338 |
| 50% | 1357130.0 | 40.0 | 0.0 | 0.023322 |
| 75% | 1738910.0 | 55.0 | 0.0 | 0.039356 |
| max | 2703454.0 | 91.0 | 1.0 | 0.831328 |


Our data contain many missing values, but this is not a problem: the models we are going to use can handle missing values during training.
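To put a number on "many missing values", the share of missing entries per column can be derived from the `count` row of `X_train.describe()` shown earlier. The sketch below copies four of those counts by hand (it does not recompute them from the data), and the dictionary name `non_null_counts` is introduced only for illustration.

```python
# Non-null counts copied from the `count` row of X_train.describe() above.
n_rows = 915995
non_null_counts = {
    "annuity_780A": 915995,
    "amtinstpaidbefduel24m_4187115A": 579122,
    "avglnamtstart24m_4525187A": 97373,
    "pmtaverage_4955615A": 43159,
}

# Share of missing values = 1 - (non-null count / number of rows).
missing_share = {col: 1 - count / n_rows for col, count in non_null_counts.items()}

# Print the columns from most to least incomplete.
for col, share in sorted(missing_share.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{col:35s} {share:6.1%} missing")
```

On the full DataFrame, `X_train.isna().mean()` would give the same information for every column at once.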

Here is an explanation of some of the variables in these data: \
**case_id**: A unique identifier for each observation.\
**WEEK_NUM**: The week the observation refers to.\
**target**: The target variable, the value the model tries to predict.\
**score**: The score associated with each observation.

The variables in X_train represent client information such as marital status, education type, postal address, level of bank transactions, amount borrowed, interest rate, and many other parameters.

With this information, we could for example run a principal component analysis or a variable selection to see which variables most influence a client's score. We will not do so here, because it would require dropping the rows that contain missing values, which could bias our final model somewhat. Nothing prevents doing it, though.
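The cost of dropping incomplete rows before a PCA can be illustrated on synthetic data. The sketch below is not part of the original analysis: `pca_on_complete_rows` and `X_demo` are hypothetical names, the data are random, and the PCA is done with a plain NumPy SVD rather than any particular library.

```python
import numpy as np


def pca_on_complete_rows(X, n_components=2):
    """Drop rows containing NaN, standardize the rest, and run PCA via SVD.

    Returns the share of rows kept, the explained-variance ratios of the
    leading components, and the component vectors themselves.
    """
    X = np.asarray(X, dtype=float)
    complete = ~np.isnan(X).any(axis=1)          # rows without any missing value
    Xc = X[complete]
    Xc = (Xc - Xc.mean(axis=0)) / Xc.std(axis=0)  # standardize each column
    _, singular_values, vt = np.linalg.svd(Xc, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return complete.mean(), explained[:n_components], vt[:n_components]


# Synthetic data (assumption): 1000 rows, 5 features, about 20% of cells missing.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 5))
X_demo[rng.random(X_demo.shape) < 0.2] = np.nan

kept_share, explained_ratio, components = pca_on_complete_rows(X_demo)
print(f"Share of rows kept: {kept_share:.1%}")
print(f"Explained variance ratio: {explained_ratio}")
```

With five columns and 20% missing cells each, only about a third of the rows survive, which is the kind of data loss the paragraph above warns about.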

Now that the train, validation, and test data are loaded, we can process and use them.

In [108]:
# Identify the categorical columns and convert them to numeric codes
categorical_cols = X_train.select_dtypes(include=['category']).columns
print("Categorical columns:", categorical_cols)

# Convert the categorical columns to a shared set of categories
for col in categorical_cols:
    # Combine the categories of the three sets
    all_categories = pd.concat([X_train[col], X_valid[col], X_test[col]], axis=0).astype('category')
    
    # Redefine the columns with the shared categories
    X_train[col] = X_train[col].astype(pd.CategoricalDtype(categories=all_categories.cat.categories)).cat.codes
    X_valid[col] = X_valid[col].astype(pd.CategoricalDtype(categories=all_categories.cat.categories)).cat.codes
    X_test[col] = X_test[col].astype(pd.CategoricalDtype(categories=all_categories.cat.categories)).cat.codes

# Convert to float32
X_train = X_train.astype(np.float32)
X_valid = X_valid.astype(np.float32)
X_test = X_test.astype(np.float32)

Categorical columns: Index([], dtype='object')


## **MODELING**:

We will use a LightGBM model, a gradient boosting method that handles large volumes of data efficiently.\
First, we will use a model with baseline parameters chosen arbitrarily. Then we will use a grid search, a randomized search, or a Bayesian search to optimize the model.
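Of the three search strategies mentioned, the randomized search is the simplest to sketch: instead of training on every combination of a grid, it draws a fixed number of combinations at random. The snippet below only shows the sampling step, using the standard library. `search_space` and `sample_params` are hypothetical names and the values are illustrative; each sampled dictionary would then be passed to whatever training and scoring routine is used.

```python
import math
import random

# Illustrative search space (assumption): two candidate values per parameter.
search_space = {
    "max_depth": [5, 7],
    "num_leaves": [31, 63],
    "learning_rate": [0.05, 0.1],
    "feature_fraction": [0.8, 0.9],
    "bagging_fraction": [0.7, 0.8],
    "n_estimators": [500, 1000],
}


def sample_params(space, n_samples, seed=0):
    """Draw n_samples random parameter combinations from the search space."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(n_samples)
    ]


# An exhaustive grid search would train one model per combination.
grid_size = math.prod(len(values) for values in search_space.values())
candidates = sample_params(search_space, n_samples=5)

print(f"Full grid: {grid_size} combinations, randomized search: {len(candidates)}")
for params in candidates:
    print(params)
```

The trade-off is coverage against cost: with roughly 900,000 training rows, each extra combination means one more full LightGBM fit.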

In [None]:
import pandas as pd
import numpy as np
import lightgbm as lgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import ParameterGrid

lgb_train = lgb.Dataset(X_train, label=y_train)
lgb_valid = lgb.Dataset(X_valid, label=y_valid, reference=lgb_train)

params = {
    "boosting_type": "gbdt",
    "objective": "binary",
    "metric": "auc",
    "max_depth": 3,
    "num_leaves": 31,
    "learning_rate": 0.05,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "n_estimators": 1000,
    "verbose": -1,
}

gbm = lgb.train(
    params,
    lgb_train,
    valid_sets=lgb_valid,
    callbacks=[lgb.log_evaluation(50), lgb.early_stopping(10)]
)

In [None]:
# Parameters for a grid search
param_grid = {
    'boosting_type': ['gbdt'],
    'objective': ['binary'],
    'metric': ['auc'],
    'max_depth': [5, 7],
    'num_leaves': [31, 63],
    'learning_rate': [0.05, 0.1],
    'feature_fraction': [0.8, 0.9],
    'bagging_fraction': [0.7, 0.8],
    'bagging_freq': [5],
    'n_estimators': [500, 1000]
}

# Function to evaluate one parameter set
def evaluate_params(params, X_train, y_train, X_valid, y_valid, base_valid):
    lgb_train = lgb.Dataset(X_train, label=y_train)
    lgb_valid = lgb.Dataset(X_valid, label=y_valid, reference=lgb_train)
    
    gbm = lgb.train(
        params,
        lgb_train,
        valid_sets=lgb_valid,
        callbacks=[lgb.log_evaluation(50), lgb.early_stopping(10)]
    )
    
    y_pred = gbm.predict(X_valid, num_iteration=gbm.best_iteration)
    base_valid["score"] = y_pred
    
    return roc_auc_score(base_valid["target"], base_valid["score"])

# Search for the best parameters
best_score = float('-inf')
best_params = None

for params in ParameterGrid(param_grid):
    score = evaluate_params(params, X_train, y_train, X_valid, y_valid, base_valid)
    if score > best_score:
        best_score = score
        best_params = params

print("Best parameters found:")
print(best_params)

Best parameters found:
{'bagging_fraction': 0.7, 'bagging_freq': 5, 'boosting_type': 'gbdt', 'feature_fraction': 0.9, 'learning_rate': 0.1, 'max_depth': 5, 'metric': 'auc', 'n_estimators': 100, 'num_leaves': 31, 'objective': 'binary'}

{'bagging_fraction': 0.8, 'bagging_freq': 5, 'boosting_type': 'gbdt', 'feature_fraction': 0.9, 'learning_rate': 0.05, 'max_depth': 7, 'metric': 'auc', 'n_estimators': 500, 'num_leaves': 31, 'objective': 'binary'}


In [27]:
# Train the final model with the best parameters on the training set, with early stopping on the validation set
best_model = lgb.LGBMClassifier(**best_params)
best_model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], callbacks=[lgb.early_stopping(10), lgb.log_evaluation(50)])

# Predictions and evaluation
y_pred_train = best_model.predict_proba(X_train)[:, 1]
y_pred_valid = best_model.predict_proba(X_valid)[:, 1]
y_pred_test = best_model.predict_proba(X_test)[:, 1]

# AUC scores
base_train["score"] = y_pred_train
base_valid["score"] = y_pred_valid
base_test["score"] = y_pred_test

stability_score_train = roc_auc_score(base_train["target"], base_train["score"])
stability_score_valid = roc_auc_score(base_valid["target"], base_valid["score"])
stability_score_test = roc_auc_score(base_test["target"], base_test["score"])

print(f'The AUC score on the train set is: {stability_score_train}') 
print(f'The AUC score on the valid set is: {stability_score_valid}') 
print(f'The AUC score on the test set is: {stability_score_test}')

[LightGBM] [Info] Number of positive: 28872, number of negative: 887123
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.544660 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 8980
[LightGBM] [Info] Number of data points in the train set: 915995, number of used features: 48
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.031520 -> initscore=-3.425111
[LightGBM] [Info] Start training from score -3.425111
Training until validation scores don't improve for 10 rounds
[50]	valid_0's auc: 0.699502
[100]	valid_0's auc: 0.715798
[150]	valid_0's auc: 0.720191
Early stopping, best iteration is:
[179]	valid_0's auc: 0.72152
The AUC score on the train set is: 0.7719679254229231
The AUC score on the valid set is: 0.7215204701358409
The AUC score on the test set is: 0.7231673682933353


In [30]:
## Compute the metric used to evaluate our model.

def gini_stability(base, w_fallingrate=88.0, w_resstd=-0.5):
    gini_in_time = base.loc[:, ["WEEK_NUM", "target", "score"]]\
        .sort_values("WEEK_NUM")\
        .groupby("WEEK_NUM")[["target", "score"]]\
        .apply(lambda x: 2*roc_auc_score(x["target"], x["score"])-1).tolist()
    
    x = np.arange(len(gini_in_time))
    y = gini_in_time
    a, b = np.polyfit(x, y, 1)
    y_hat = a*x + b
    residuals = y - y_hat
    res_std = np.std(residuals)
    avg_gini = np.mean(gini_in_time)
    return avg_gini + w_fallingrate * min(0, a) + w_resstd * res_std

stability_score_train = gini_stability(base_train)
stability_score_valid = gini_stability(base_valid)
stability_score_test = gini_stability(base_test)

print(f'The stability score on the train set is: {stability_score_train}') 
print(f'The stability score on the valid set is: {stability_score_valid}') 
print(f'The stability score on the test set is: {stability_score_test}')  

The stability score on the train set is: 0.5185379029535647
The stability score on the valid set is: 0.4076619345459677
The stability score on the test set is: 0.4069504782503902
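The stability metric combines three terms: the mean weekly Gini, a penalty of 88 times the slope of the Gini trend when that slope is negative, and a penalty of 0.5 times the standard deviation of the residuals around the trend. To see how strongly the slope term weighs, the sketch below applies the same formula directly to a list of weekly Gini values. The function `stability_from_weekly_gini` and the two toy series are assumptions made for illustration, not results from the model.

```python
import numpy as np


def stability_from_weekly_gini(weekly_gini, w_fallingrate=88.0, w_resstd=-0.5):
    """Apply the stability formula to an already computed weekly Gini series."""
    weekly_gini = np.asarray(weekly_gini, dtype=float)
    weeks = np.arange(len(weekly_gini))
    # Linear trend of the Gini over time: gini ~ slope * week + intercept
    slope, intercept = np.polyfit(weeks, weekly_gini, 1)
    residuals = weekly_gini - (slope * weeks + intercept)
    return (
        weekly_gini.mean()
        + w_fallingrate * min(0.0, slope)   # only a falling trend is penalized
        + w_resstd * residuals.std()        # penalize volatility around the trend
    )


# Toy series (assumption): same mean Gini of 0.5, with and without a decline.
flat = stability_from_weekly_gini([0.5] * 11)
falling = stability_from_weekly_gini(np.linspace(0.6, 0.4, 11))

print(f"Flat Gini series:    {flat:.3f}")
print(f"Falling Gini series: {falling:.3f}")
```

Both series have a mean Gini of 0.5, but the second loses 0.02 per week, so its score drops by about 88 × 0.02 = 1.76. A model whose discrimination decays over the weeks is therefore punished far more than one that is merely noisy.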


In [46]:
## Use a Bayesian search

from bayes_opt import BayesianOptimization
# Evaluation function for the Bayesian search
def evaluate_params(max_depth, num_leaves, learning_rate, feature_fraction, bagging_fraction, bagging_freq, n_estimators):
    params = {
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'metric': 'auc',
        'max_depth': int(max_depth),
        'num_leaves': int(num_leaves),
        'learning_rate': learning_rate,
        'feature_fraction': feature_fraction,
        'bagging_fraction': bagging_fraction,
        'bagging_freq': int(bagging_freq),
        'n_estimators': int(n_estimators)
    }
    
    lgb_train = lgb.Dataset(X_train, label=y_train)
    lgb_valid = lgb.Dataset(X_valid, label=y_valid, reference=lgb_train)
    
    gbm = lgb.train(
        params,
        lgb_train,
        valid_sets=lgb_valid,
        callbacks=[lgb.log_evaluation(50), lgb.early_stopping(10)]
    )
    
    y_pred = gbm.predict(X_valid, num_iteration=gbm.best_iteration)
    score = roc_auc_score(base_valid["target"], y_pred)
    
    return score

# Define the bounds of the Bayesian search
pbounds = {
    'max_depth': (5, 7),
    'num_leaves': (31, 63),
    'learning_rate': (0.05, 0.1),
    'feature_fraction': (0.8, 0.9),
    'bagging_fraction': (0.7, 0.8),
    'n_estimators': (100, 1000),
    'bagging_freq': (5, 8)
}

# Initialization and Bayesian optimization
optimizer = BayesianOptimization(
    f=evaluate_params,
    pbounds=pbounds,
    random_state=42,
    verbose=2
)

optimizer.maximize(
    init_points=5,
    n_iter=10
)

# Best parameters and score
best_params = optimizer.max['params']
# The optimizer returns floats, so cast the integer-valued parameters back to int
best_params['max_depth'] = int(best_params['max_depth'])
best_params['num_leaves'] = int(best_params['num_leaves'])
best_params['n_estimators'] = int(best_params['n_estimators'])
best_params['bagging_freq'] = int(best_params['bagging_freq'])
best_score = optimizer.max['target']

print("Best parameters found by BayesianOptimization:")
print(best_params)

|   iter    |  target   | baggin... | baggin... | featur... | learni... | max_depth | n_esti... | num_le... |
-------------------------------------------------------------------------------------------------------------




[LightGBM] [Info] Number of positive: 28872, number of negative: 887123
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.346566 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 8980
[LightGBM] [Info] Number of data points in the train set: 915995, number of used features: 48
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.031520 -> initscore=-3.425111
[LightGBM] [Info] Start training from score -3.425111
Training until validation scores don't improve for 10 rounds
[50]	valid_0's auc: 0.711906
[100]	valid_0's auc: 0.722648
Early stopping, best iteration is:
[104]	valid_0's auc: 0.722763
| 1         | 0.7228    | 0.7375    | 7.852     | 0.8732    | 0.07993   | 5.312     | 240.4     | 32.86     |




[LightGBM] [Info] Number of positive: 28872, number of negative: 887123
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.189341 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 8980
[LightGBM] [Info] Number of data points in the train set: 915995, number of used features: 48
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.031520 -> initscore=-3.425111
[LightGBM] [Info] Start training from score -3.425111
Training until validation scores don't improve for 10 rounds
[50]	valid_0's auc: 0.695119
Early stopping, best iteration is:
[79]	valid_0's auc: 0.708467
| 2         | 0.7085    | 0.7866    | 6.803     | 0.8708    | 0.05103   | 6.94      | 849.2     | 37.79     |




[LightGBM] [Info] Number of positive: 28872, number of negative: 887123
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.189206 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 8980
[LightGBM] [Info] Number of data points in the train set: 915995, number of used features: 48
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.031520 -> initscore=-3.425111
[LightGBM] [Info] Start training from score -3.425111
Training until validation scores don't improve for 10 rounds
[50]	valid_0's auc: 0.709501
Early stopping, best iteration is:
[80]	valid_0's auc: 0.718499
| 3         | 0.7185    | 0.7182    | 5.55      | 0.8304    | 0.07624   | 5.864     | 362.1     | 50.58     |




[LightGBM] [Info] Number of positive: 28872, number of negative: 887123
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.191899 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 8980
[LightGBM] [Info] Number of data points in the train set: 915995, number of used features: 48
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.031520 -> initscore=-3.425111
[LightGBM] [Info] Start training from score -3.425111
Training until validation scores don't improve for 10 rounds
[50]	valid_0's auc: 0.703824
Early stopping, best iteration is:
[80]	valid_0's auc: 0.711608
| 4         | 0.7116    | 0.7139    | 5.876     | 0.8366    | 0.0728    | 6.57      | 279.7     | 47.46     |




[LightGBM] [Info] Number of positive: 28872, number of negative: 887123
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.452196 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 8980
[LightGBM] [Info] Number of data points in the train set: 915995, number of used features: 48
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.031520 -> initscore=-3.425111
[LightGBM] [Info] Start training from score -3.425111
Training until validation scores don't improve for 10 rounds
[50]	valid_0's auc: 0.702958
Early stopping, best iteration is:
[83]	valid_0's auc: 0.714905
| 5         | 0.7149    | 0.7592    | 5.139     | 0.8608    | 0.05853   | 5.13      | 954.0     | 61.9      |




[LightGBM] [Info] Number of positive: 28872, number of negative: 887123
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.188326 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 8980
[LightGBM] [Info] Number of data points in the train set: 915995, number of used features: 48
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.031520 -> initscore=-3.425111
[LightGBM] [Info] Start training from score -3.425111
Training until validation scores don't improve for 10 rounds
[50]	valid_0's auc: 0.714836
Early stopping, best iteration is:
[54]	valid_0's auc: 0.716729
| 6         | 0.7167    | 0.7237    | 7.019     | 0.883     | 0.09897   | 5.264     | 240.4     | 32.36     |




[LightGBM] [Info] Number of positive: 28872, number of negative: 887123
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.190505 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 8980
[LightGBM] [Info] Number of data points in the train set: 915995, number of used features: 48
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.031520 -> initscore=-3.425111
[LightGBM] [Info] Start training from score -3.425111
Training until validation scores don't improve for 10 rounds
[50]	valid_0's auc: 0.712272
Early stopping, best iteration is:
[82]	valid_0's auc: 0.72023
| 7         | 0.7202    | 0.7149    | 5.744     | 0.8774    | 0.07222   | 5.362     | 361.1     | 50.91     |




[LightGBM] [Info] Number of positive: 28872, number of negative: 887123
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.312094 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 8980
[LightGBM] [Info] Number of data points in the train set: 915995, number of used features: 48
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.031520 -> initscore=-3.425111
[LightGBM] [Info] Start training from score -3.425111
Training until validation scores don't improve for 10 rounds
[50]	valid_0's auc: 0.70384
Early stopping, best iteration is:
[81]	valid_0's auc: 0.711596
| 8         | 0.7116    | 0.7879    | 5.463     | 0.8665    | 0.0639    | 6.102     | 361.3     | 49.55     |




[LightGBM] [Info] Number of positive: 28872, number of negative: 887123
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.192523 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 8980
[LightGBM] [Info] Number of data points in the train set: 915995, number of used features: 48
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.031520 -> initscore=-3.425111
[LightGBM] [Info] Start training from score -3.425111
Training until validation scores don't improve for 10 rounds
[50]	valid_0's auc: 0.707133
Early stopping, best iteration is:
[79]	valid_0's auc: 0.713661
| 9         | 0.7137    | 0.72      | 7.838     | 0.8157    | 0.07432   | 6.257     | 240.2     | 33.55     |




[LightGBM] [Info] Number of positive: 28872, number of negative: 887123
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.190018 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 8980
[LightGBM] [Info] Number of data points in the train set: 915995, number of used features: 48
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.031520 -> initscore=-3.425111
[LightGBM] [Info] Start training from score -3.425111
Training until validation scores don't improve for 10 rounds
[50]	valid_0's auc: 0.702488
[100]	valid_0's auc: 0.715089
Early stopping, best iteration is:
[112]	valid_0's auc: 0.717
| 10        | 0.717     | 0.7556    | 7.711     | 0.8662    | 0.06153   | 6.932     | 456.5     | 31.89     |




[LightGBM] [Info] Number of positive: 28872, number of negative: 887123
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.189762 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 8980
[LightGBM] [Info] Number of data points in the train set: 915995, number of used features: 48
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.031520 -> initscore=-3.425111
[LightGBM] [Info] Start training from score -3.425111
Training until validation scores don't improve for 10 rounds
[50]	valid_0's auc: 0.702524
Early stopping, best iteration is:
[80]	valid_0's auc: 0.711023
| 11        | 0.711     | 0.7809    | 5.578     | 0.8851    | 0.08255   | 6.326     | 146.1     | 44.4      |




[LightGBM] [Info] Number of positive: 28872, number of negative: 887123
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.191076 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 8980
[LightGBM] [Info] Number of data points in the train set: 915995, number of used features: 48
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.031520 -> initscore=-3.425111
[LightGBM] [Info] Start training from score -3.425111
Training until validation scores don't improve for 10 rounds
[50]	valid_0's auc: 0.706833
Early stopping, best iteration is:
[70]	valid_0's auc: 0.712803
| 12        | 0.7128    | 0.771     | 6.894     | 0.8802    | 0.0837    | 6.69      | 294.0     | 61.72     |




[LightGBM] [Info] Number of positive: 28872, number of negative: 887123
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.187871 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 8980
[LightGBM] [Info] Number of data points in the train set: 915995, number of used features: 48
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.031520 -> initscore=-3.425111
[LightGBM] [Info] Start training from score -3.425111
Training until validation scores don't improve for 10 rounds
[50]	valid_0's auc: 0.707025
[100]	valid_0's auc: 0.714688
Early stopping, best iteration is:
[112]	valid_0's auc: 0.717846
| 13        | 0.7178    | 0.7119    | 7.851     | 0.8968    | 0.09767   | 6.85      | 312.8     | 44.49     |




[LightGBM] [Info] Number of positive: 28872, number of negative: 887123
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.191993 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 8980
[LightGBM] [Info] Number of data points in the train set: 915995, number of used features: 48
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.031520 -> initscore=-3.425111
[LightGBM] [Info] Start training from score -3.425111
Training until validation scores don't improve for 10 rounds
[50]	valid_0's auc: 0.706197
[100]	valid_0's auc: 0.71799
Early stopping, best iteration is:
[112]	valid_0's auc: 0.718577
| 14        | 0.7186    | 0.7976    | 7.1       | 0.8903    | 0.07223   | 5.816     | 316.4     | 36.73     |




[LightGBM] [Info] Number of positive: 28872, number of negative: 887123
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.189241 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 8980
[LightGBM] [Info] Number of data points in the train set: 915995, number of used features: 48
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.031520 -> initscore=-3.425111
[LightGBM] [Info] Start training from score -3.425111
Training until validation scores don't improve for 10 rounds
[50]	valid_0's auc: 0.709632
Early stopping, best iteration is:
[72]	valid_0's auc: 0.71719
| 15        | 0.7172    | 0.7428    | 5.982     | 0.8518    | 0.09907   | 5.034     | 901.5     | 47.62     |
Meilleurs paramètres trouvés par BayesianOptimization:
{'bagging_fraction': 0.7374540118847362, 'bagging_

Best parameters found by BayesianOptimization:
{'bagging_fraction': 0.7649629555677399, 'feature_fraction': 0.8870130839701755, 'learning_rate': 0.07796225729846441, 'max_depth': 5, 'n_estimators': 975, 'num_leaves': 48}
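One detail worth checking in these parameters: a binary tree of depth 5 can hold at most 2^5 = 32 leaves, so with max_depth = 5 the value num_leaves = 48 can never be reached and the depth limit is the binding constraint. The sketch below makes that check explicit for the four (max_depth, num_leaves) pairs used in this notebook. The helper name `effective_num_leaves` is hypothetical and not part of LightGBM; it only encodes the 2^depth bound under the usual depth convention.

```python
def effective_num_leaves(max_depth: int, num_leaves: int) -> int:
    """Upper bound on leaves per tree given both constraints.

    Hypothetical helper, not a LightGBM function: a binary tree of depth d
    has at most 2**d leaves, and max_depth <= 0 means "no depth limit".
    """
    if max_depth <= 0:
        return num_leaves
    return min(num_leaves, 2 ** max_depth)


# (max_depth, num_leaves) pairs taken from the parameter sets in this notebook
configs = {
    "BayesianOptimization": (5, 48),
    "RandomizedSearchCV": (7, 31),
    "hyperopt": (8, 43),
    "manual": (15, 48),
}

leaf_caps = {name: effective_num_leaves(d, n) for name, (d, n) in configs.items()}
for name, cap in leaf_caps.items():
    print(f"{name}: at most {cap} leaves per tree")
```

Only the BayesianOptimization configuration is limited by depth rather than by num_leaves, which may be worth keeping in mind when comparing it with the others.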

Using the best parameters

In [137]:
lgb_train = lgb.Dataset(X_train, label=y_train)
lgb_valid = lgb.Dataset(X_valid, label=y_valid, reference=lgb_train)

params = {
    "boosting_type": "gbdt",
    "objective": "binary",
    "metric": "auc",
    "max_depth": 5,
    "num_leaves": 48,
    "learning_rate": 0.07796225729846441,
    "feature_fraction": 0.8870130839701755, 
    "bagging_fraction": 0.7649629555677399,
    "bagging_freq": 5,
    "n_estimators": 975,
    "verbose": -1,
}

gbm = lgb.train(
    params,
    lgb_train,
    valid_sets=lgb_valid,
    callbacks=[lgb.log_evaluation(60), lgb.early_stopping(10)]
)

for base, X in [(base_train, X_train), (base_valid, X_valid), (base_test, X_test)]:
    y_pred = gbm.predict(X, num_iteration=gbm.best_iteration)
    base["score"] = y_pred

print(f'The AUC score on the train set is: {roc_auc_score(base_train["target"], base_train["score"])}') 
print(f'The AUC score on the valid set is: {roc_auc_score(base_valid["target"], base_valid["score"])}') 
print(f'The AUC score on the test set is: {roc_auc_score(base_test["target"], base_test["score"])}')  



Training until validation scores don't improve for 10 rounds
[60]	valid_0's auc: 0.713083
[120]	valid_0's auc: 0.721541
Early stopping, best iteration is:
[122]	valid_0's auc: 0.721824
The AUC score on the train set is: 0.7672716241168197
The AUC score on the valid set is: 0.7218236929890642
The AUC score on the test set is: 0.72278874702814


In [138]:
def gini_stability(base, w_fallingrate=88.0, w_resstd=-0.5):
    gini_in_time = base.loc[:, ["WEEK_NUM", "target", "score"]]\
        .sort_values("WEEK_NUM")\
        .groupby("WEEK_NUM")[["target", "score"]]\
        .apply(lambda x: 2*roc_auc_score(x["target"], x["score"])-1).tolist()
    
    x = np.arange(len(gini_in_time))
    y = gini_in_time
    a, b = np.polyfit(x, y, 1)
    y_hat = a*x + b
    residuals = y - y_hat
    res_std = np.std(residuals)
    avg_gini = np.mean(gini_in_time)
    return avg_gini + w_fallingrate * min(0, a) + w_resstd * res_std

stability_score_train = gini_stability(base_train)
stability_score_valid = gini_stability(base_valid)
stability_score_test = gini_stability(base_test)

print(f'The stability score on the train set is: {stability_score_train}') 
print(f'The stability score on the valid set is: {stability_score_valid}') 
print(f'The stability score on the test set is: {stability_score_test}')  

The stability score on the train set is: 0.5090086963195201
The stability score on the valid set is: 0.40621974237788144
The stability score on the test set is: 0.4032130762785661
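To see how the three terms of the stability score interact, here is a minimal sketch that applies the same arithmetic as `gini_stability` directly to a list of weekly Gini values. The function name `stability_from_weekly_gini` and the toy inputs are made up for illustration; the weights are the defaults used above.

```python
import numpy as np


def stability_from_weekly_gini(weekly_gini, w_fallingrate=88.0, w_resstd=-0.5):
    """Same arithmetic as gini_stability, starting from the weekly Gini values."""
    y = np.asarray(weekly_gini, dtype=float)
    x = np.arange(len(y))
    a, b = np.polyfit(x, y, 1)            # linear trend of Gini over the weeks
    residuals = y - (a * x + b)           # deviation from that trend
    return y.mean() + w_fallingrate * min(0.0, a) + w_resstd * residuals.std()


flat = stability_from_weekly_gini([0.5, 0.5, 0.5, 0.5])   # no trend, no volatility
falling = stability_from_weekly_gini([0.6, 0.5, 0.4])     # slope of -0.1 per week
rising = stability_from_weekly_gini([0.4, 0.5, 0.6])      # a positive slope is not rewarded
print(flat, falling, rising)
```

With a weight of 88, a decline of 0.1 Gini points per week subtracts about 8.8 from the score, so even a small downward drift over time dominates the average Gini, while an upward trend earns nothing because of the `min(0, a)` term.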


In [51]:
### randomized search

from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV

# Search distributions for RandomizedSearchCV
# (scipy's uniform(loc, scale) samples from [loc, loc + scale];
#  randint(low, high) excludes high)
param_dist = {
    'boosting_type': ['gbdt'],
    'objective': ['binary'],
    'metric': ['auc'],
    'max_depth': randint(5, 8),              # 5 to 7
    'num_leaves': randint(30, 90),           # 30 to 89
    'learning_rate': uniform(0.05, 0.9),     # [0.05, 0.95]
    'feature_fraction': uniform(0.8, 0.2),   # [0.8, 1.0]
    'bagging_fraction': uniform(0.01, 0.9),  # [0.01, 0.91]
    'bagging_freq': [5],
    'n_estimators': randint(100, 1001)       # 100 to 1000
}

# LightGBM model
lgb_model = lgb.LGBMClassifier()

# Randomized search
random_search = RandomizedSearchCV(
    estimator=lgb_model,
    param_distributions=param_dist,
    n_iter=10,
    scoring='roc_auc',
    cv=3,
    verbose=1,
    random_state=42,
    n_jobs=-1
)

# Fit the search; early stopping monitors the same fixed validation set in every fold
random_search.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], eval_metric='auc', callbacks=[lgb.early_stopping(10), lgb.log_evaluation(50)])

# Best parameters and score
best_params = random_search.best_params_
best_score = random_search.best_score_

print("Best parameters found by RandomizedSearchCV:")
print(best_params)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[LightGBM] [Info] Number of positive: 19248, number of negative: 591415
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.371937 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 9012
[LightGBM] [Info] Number of data points in the train set: 610663, number of used features: 48
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.031520 -> initscore=-3.425111
[LightGBM] [Info] Start training from score -3.425111
Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[12]	valid_0's auc: 0.678975
[LightGBM] [Info] Number of positive: 19248, number of negative: 591415
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.351865 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not

Best parameters found by RandomizedSearchCV:
{'bagging_fraction': 0.780839734811646, 'bagging_freq': 5, 'boosting_type': 'gbdt', 'feature_fraction': 0.8609227538346742, 'learning_rate': 0.1379049026057455, 'max_depth': 7, 'metric': 'auc', 'n_estimators': 554, 'num_leaves': 31, 'objective': 'binary'}
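The ranges actually sampled by this search depend on how scipy parameterizes its distributions: `uniform(loc, scale)` covers [loc, loc + scale], and `randint(low, high)` excludes `high`. The quick check below prints the effective bounds of the distributions used in `param_dist`. The last line is only an assumption about what a narrower learning-rate range could look like, not the author's intent.

```python
from scipy.stats import randint, uniform

# Continuous ranges: support() returns (loc, loc + scale)
lr_low, lr_high = uniform(0.05, 0.9).support()
bag_low, bag_high = uniform(0.01, 0.9).support()
feat_low, feat_high = uniform(0.8, 0.2).support()

# Integer range: the upper bound passed to randint is excluded
depth_low, depth_high = randint(5, 8).support()

print(f"learning_rate    in [{lr_low:.2f}, {lr_high:.2f}]")
print(f"bagging_fraction in [{bag_low:.2f}, {bag_high:.2f}]")
print(f"feature_fraction in [{feat_low:.2f}, {feat_high:.2f}]")
print(f"max_depth        in {{{int(depth_low)}, ..., {int(depth_high)}}}")

# Assumption for illustration: a learning rate restricted to [0.05, 0.20]
narrow_lr = uniform(0.05, 0.15)
narrow_low, narrow_high = narrow_lr.support()
```

A learning rate as high as 0.95 is unusually aggressive for boosting, which may explain why some candidates stop after only a dozen iterations.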

In [94]:
lgb_train = lgb.Dataset(X_train, label=y_train)
lgb_valid = lgb.Dataset(X_valid, label=y_valid, reference=lgb_train)

params = {
    "boosting_type": "gbdt",
    "objective": "binary",
    "metric": "auc",
    "max_depth": 7,
    "num_leaves": 31,
    "learning_rate": 0.1379049026057455,
    "feature_fraction": 0.8609227538346742, 
    "bagging_fraction": 0.780839734811646,
    "bagging_freq": 5,
    "n_estimators": 554,
    "verbose": -1,
}

gbm = lgb.train(
    params,
    lgb_train,
    valid_sets=lgb_valid,
    callbacks=[lgb.log_evaluation(60), lgb.early_stopping(10)]
)

for base, X in [(base_train, X_train), (base_valid, X_valid), (base_test, X_test)]:
    y_pred = gbm.predict(X, num_iteration=gbm.best_iteration)
    base["score"] = y_pred

print(f'The AUC score on the train set is: {roc_auc_score(base_train["target"], base_train["score"])}') 
print(f'The AUC score on the valid set is: {roc_auc_score(base_valid["target"], base_valid["score"])}') 
print(f'The AUC score on the test set is: {roc_auc_score(base_test["target"], base_test["score"])}')



[LightGBM] [Info] Number of positive: 28872, number of negative: 887123
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.213033 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 8980
[LightGBM] [Info] Number of data points in the train set: 915995, number of used features: 48
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.031520 -> initscore=-3.425111
[LightGBM] [Info] Start training from score -3.425111
Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[35]	valid_0's auc: 0.72252
The AUC score on the train set is: 0.7541449399822275
The AUC score on the valid set is: 0.7225202908527186
The AUC score on the test set is: 0.7212562105697082


In [95]:
def gini_stability(base, w_fallingrate=88.0, w_resstd=-0.5):
    gini_in_time = base.loc[:, ["WEEK_NUM", "target", "score"]]\
        .sort_values("WEEK_NUM")\
        .groupby("WEEK_NUM")[["target", "score"]]\
        .apply(lambda x: 2*roc_auc_score(x["target"], x["score"])-1).tolist()
    
    x = np.arange(len(gini_in_time))
    y = gini_in_time
    a, b = np.polyfit(x, y, 1)
    y_hat = a*x + b
    residuals = y - y_hat
    res_std = np.std(residuals)
    avg_gini = np.mean(gini_in_time)
    return avg_gini + w_fallingrate * min(0, a) + w_resstd * res_std

stability_score_train = gini_stability(base_train)
stability_score_valid = gini_stability(base_valid)
stability_score_test = gini_stability(base_test)

print(f'The stability score on the train set is: {stability_score_train}') 
print(f'The stability score on the valid set is: {stability_score_valid}') 
print(f'The stability score on the test set is: {stability_score_test}')  

The stability score on the train set is: 0.47920748162231436
The stability score on the valid set is: 0.4109371166357929
The stability score on the test set is: 0.40613899443705753


In [57]:
%pip install hyperopt

Note: you may need to restart the kernel to use updated packages.


In [140]:
## Another optimization method: hyperopt (TPE)

from hyperopt import hp, tpe, fmin, Trials, STATUS_OK
import lightgbm as lgb
from sklearn.metrics import roc_auc_score

def evaluate_params(params):
    # hp.quniform returns floats, so cast the integer-valued parameters
    params['max_depth'] = int(params['max_depth'])
    params['num_leaves'] = int(params['num_leaves'])

    lgb_train = lgb.Dataset(X_train, label=y_train)
    lgb_valid = lgb.Dataset(X_valid, label=y_valid, reference=lgb_train)

    gbm = lgb.train(
        params,
        lgb_train,
        valid_sets=lgb_valid,
        callbacks=[lgb.log_evaluation(100), lgb.early_stopping(10)]
    )

    y_pred = gbm.predict(X_valid, num_iteration=gbm.best_iteration)
    score = roc_auc_score(y_valid, y_pred)
    # fmin minimizes the loss, so return the negative AUC
    return {'loss': -score, 'status': STATUS_OK}

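# Note: bagging_freq is not set here, so it keeps its default of 0 and
# bagging_fraction has no effect during this search. The number of boosting
# rounds is not set either, so LightGBM's default of 100 applies.
space = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': hp.uniform('learning_rate', 0.01, 0.2),
    'max_depth': hp.quniform('max_depth', 3, 10, 1),
    'num_leaves': hp.quniform('num_leaves', 20, 100, 1),
    'feature_fraction': hp.uniform('feature_fraction', 0.5, 1.0),
    'bagging_fraction': hp.uniform('bagging_fraction', 0.5, 1.0)
}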

trials = Trials()
best = fmin(fn=evaluate_params,
            space=space,
            algo=tpe.suggest,
            max_evals=20,
            trials=trials,
            rstate=np.random.default_rng(42))

print(best)


[LightGBM] [Info] Number of positive: 28872, number of negative: 887123
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.203698 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 8980                     
[LightGBM] [Info] Number of data points in the train set: 915995, number of used features: 48
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.031520 -> initscore=-3.425111
[LightGBM] [Info] Start training from score -3.425111 
Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:                    
[27]	valid_0's auc: 0.708813
[LightGBM] [Info] Number of positive: 28872, number of negative: 887123          
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.162416 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, 
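The dictionary returned by `fmin` holds the raw sampled values, so parameters drawn with `hp.quniform` come back as floats and the fixed entries of the search space are not included. A minimal sketch of turning such a result into a full LightGBM parameter dictionary follows. The helper `finalize_params` is hypothetical, and the example values are assumed to be what the search returned, since they match the parameters used in the next cell.

```python
def finalize_params(best, fixed=None):
    """Merge a raw search result with fixed settings (hypothetical helper)."""
    int_keys = ("max_depth", "num_leaves")
    params = dict(fixed or {})
    for key, value in best.items():
        # Integer-valued hyperparameters must be cast before training
        params[key] = int(round(value)) if key in int_keys else float(value)
    return params


# Assumed search result (floats), matching the values used in the next cell
best_example = {
    "max_depth": 8.0,
    "num_leaves": 43.0,
    "learning_rate": 0.12572978832678067,
    "feature_fraction": 0.6084348379285371,
    "bagging_fraction": 0.9043917325109541,
}
fixed = {
    "boosting_type": "gbdt",
    "objective": "binary",
    "metric": "auc",
    "bagging_freq": 5,  # bagging_fraction is ignored while bagging_freq is 0
    "verbose": -1,
}

final_params = finalize_params(best_example, fixed)
print(final_params)
```

Keeping the fixed settings in a separate dictionary makes it harder to evaluate the search under one configuration (for example without bagging) and train the final model under another.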

In [143]:
lgb_train = lgb.Dataset(X_train, label=y_train)
lgb_valid = lgb.Dataset(X_valid, label=y_valid, reference=lgb_train)

params = {
    "boosting_type": "gbdt",
    "objective": "binary",
    "metric": "auc",
    "max_depth": 8,
    "num_leaves": 43,
    "learning_rate": 0.12572978832678067,
    "feature_fraction": 0.6084348379285371, 
    "bagging_fraction": 0.9043917325109541,
    "bagging_freq": 5,
    "n_estimators": 1000,
    "verbose": -1,
}
# Train with early stopping on the validation set
gbm = lgb.train(
    params,
    lgb_train,
    valid_sets=lgb_valid,
    callbacks=[lgb.log_evaluation(60), lgb.early_stopping(10)]
)

for base, X in [(base_train, X_train), (base_valid, X_valid), (base_test, X_test)]:
    y_pred = gbm.predict(X, num_iteration=gbm.best_iteration)
    base["score"] = y_pred

print(f'The AUC score on the train set is: {roc_auc_score(base_train["target"], base_train["score"])}') 
print(f'The AUC score on the valid set is: {roc_auc_score(base_valid["target"], base_valid["score"])}') 
print(f'The AUC score on the test set is: {roc_auc_score(base_test["target"], base_test["score"])}')  



Training until validation scores don't improve for 10 rounds
Early stopping, best iteration is:
[44]	valid_0's auc: 0.71927
The AUC score on the train set is: 0.7645672890821629
The AUC score on the valid set is: 0.7192696884405961
The AUC score on the test set is: 0.7164991741742833


In [144]:
def gini_stability(base, w_fallingrate=88.0, w_resstd=-0.5):
    gini_in_time = base.loc[:, ["WEEK_NUM", "target", "score"]]\
        .sort_values("WEEK_NUM")\
        .groupby("WEEK_NUM")[["target", "score"]]\
        .apply(lambda x: 2*roc_auc_score(x["target"], x["score"])-1).tolist()
    
    x = np.arange(len(gini_in_time))
    y = gini_in_time
    a, b = np.polyfit(x, y, 1)
    y_hat = a*x + b
    residuals = y - y_hat
    res_std = np.std(residuals)
    avg_gini = np.mean(gini_in_time)
    return avg_gini + w_fallingrate * min(0, a) + w_resstd * res_std

stability_score_train = gini_stability(base_train)
stability_score_valid = gini_stability(base_valid)
stability_score_test = gini_stability(base_test)

print(f'The stability score on the train set is: {stability_score_train}') 
print(f'The stability score on the valid set is: {stability_score_valid}') 
print(f'The stability score on the test set is: {stability_score_test}')  

The stability score on the train set is: 0.5022621276620299
The stability score on the valid set is: 0.4070058676578434
The stability score on the test set is: 0.39337574545092435


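With a manual search, we found the model below, which performs slightly better than those obtained with the optimization methods. This may be due to the choice of hyperparameter search space: the manual configuration uses max_depth=15, which lies outside the ranges sampled by RandomizedSearchCV (5 to 7) and hyperopt (3 to 10).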

In [148]:
## Best LightGBM model
lgb_train = lgb.Dataset(X_train, label=y_train)
lgb_valid = lgb.Dataset(X_valid, label=y_valid, reference=lgb_train)

params = {
    "boosting_type": "gbdt",
    "objective": "binary",
    "metric": "auc",
    "max_depth": 15,
    "num_leaves": 48,
    "learning_rate": 0.05,
    "feature_fraction": 0.9, 
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "n_estimators": 2000,
    "verbose": -1,
}
# Train with early stopping on the validation set
gbm = lgb.train(
    params,
    lgb_train,
    valid_sets=lgb_valid,
    callbacks=[lgb.log_evaluation(60), lgb.early_stopping(10)]
)

for base, X in [(base_train, X_train), (base_valid, X_valid), (base_test, X_test)]:
    y_pred = gbm.predict(X, num_iteration=gbm.best_iteration)
    base["score"] = y_pred

print(f'The AUC score on the train set is: {roc_auc_score(base_train["target"], base_train["score"])}') 
print(f'The AUC score on the valid set is: {roc_auc_score(base_valid["target"], base_valid["score"])}') 
print(f'The AUC score on the test set is: {roc_auc_score(base_test["target"], base_test["score"])}')  



Training until validation scores don't improve for 10 rounds
[60]	valid_0's auc: 0.712763
[120]	valid_0's auc: 0.722032
Early stopping, best iteration is:
[142]	valid_0's auc: 0.723386
The AUC score on the train set is: 0.7824174684316207
The AUC score on the valid set is: 0.7233862700684043
The AUC score on the test set is: 0.7240282962502375


In [146]:
def gini_stability(base, w_fallingrate=88.0, w_resstd=-0.5):
    gini_in_time = base.loc[:, ["WEEK_NUM", "target", "score"]]\
        .sort_values("WEEK_NUM")\
        .groupby("WEEK_NUM")[["target", "score"]]\
        .apply(lambda x: 2*roc_auc_score(x["target"], x["score"])-1).tolist()
    
    x = np.arange(len(gini_in_time))
    y = gini_in_time
    a, b = np.polyfit(x, y, 1)
    y_hat = a*x + b
    residuals = y - y_hat
    res_std = np.std(residuals)
    avg_gini = np.mean(gini_in_time)
    return avg_gini + w_fallingrate * min(0, a) + w_resstd * res_std

stability_score_train = gini_stability(base_train)
stability_score_valid = gini_stability(base_valid)
stability_score_test = gini_stability(base_test)

print(f'The stability score on the train set is: {stability_score_train}') 
print(f'The stability score on the valid set is: {stability_score_valid}') 
print(f'The stability score on the test set is: {stability_score_test}')  

The stability score on the train set is: 0.5378837797386191
The stability score on the valid set is: 0.41289640230069946
The stability score on the test set is: 0.4104966787746163


After all these optimization attempts, the last (manually tuned) model is the best, with a stability score of about 0.41 on the validation and test sets and 0.54 on the training set.
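The comparison can be made explicit by collecting the reported scores in one table. The sketch below copies the validation and test values from the cell outputs above, rounded to four decimals. The row labels assume that each training cell uses the parameters from the search shown directly before it.

```python
import pandas as pd

# Scores copied from the cell outputs above, rounded to four decimals
results = pd.DataFrame(
    [
        ("BayesianOptimization", 0.7218, 0.7228, 0.4062, 0.4032),
        ("RandomizedSearchCV", 0.7225, 0.7213, 0.4109, 0.4061),
        ("hyperopt", 0.7193, 0.7165, 0.4070, 0.3934),
        ("manual", 0.7234, 0.7240, 0.4129, 0.4105),
    ],
    columns=["search", "auc_valid", "auc_test", "stability_valid", "stability_test"],
).set_index("search")

best_by_stability = results["stability_test"].idxmax()
best_by_auc = results["auc_test"].idxmax()

print(results)
print("Best stability on the test set:", best_by_stability)
print("Best AUC on the test set:", best_by_auc)
```

The differences between configurations are small (under 0.01 in AUC and about 0.02 in stability), so they may fall within run-to-run variation.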

In [152]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Get the predicted scores for the test data
y_test_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)

# Note: base_test['score'] already holds this model's predictions (set in the
# training cell), not the true labels, so this is only a consistency check
# and the errors are 0 by construction.
mse = mean_squared_error(base_test['score'], y_test_pred)
rmse = np.sqrt(mse)  # avoids the `squared=False` argument, deprecated in recent scikit-learn
mae = mean_absolute_error(base_test['score'], y_test_pred)

# Display the results
print(f"Mean Squared Error (MSE) : {mse}") 
print(f"Root Mean Squared Error (RMSE) : {rmse}")
print(f"Mean Absolute Error (MAE) : {mae}")

Mean Squared Error (MSE) : 0.0
Root Mean Squared Error (RMSE) : 0.0
Mean Absolute Error (MAE) : 0.0
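To measure how far the predicted probabilities are from reality, the scores have to be compared with the observed labels (the `target` column) rather than with themselves. A minimal sketch on made-up data is shown below. The arrays are invented for illustration, and the Brier score is suggested here as one option, being the mean squared error between predicted probabilities and binary outcomes.

```python
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

# Made-up labels and predicted default probabilities, for illustration only
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
y_score = np.array([0.05, 0.10, 0.60, 0.20, 0.25, 0.02, 0.30, 0.70])

auc = roc_auc_score(y_true, y_score)        # ranking quality
brier = brier_score_loss(y_true, y_score)   # squared error of the probabilities

print(f"AUC: {auc:.3f}")
print(f"Brier score: {brier:.4f}")
```

In the notebook, the equivalent call would pass `base_test["target"]` and `base_test["score"]`.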


Next, we need to test other models on these data, such as gradient boosting or CatBoost, identify which ones predict best, and finally build a stacking or blending ensemble to improve performance.
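As a rough idea of what the blending step could look like, the sketch below averages the predicted probabilities of two models and compares AUCs. Everything here is hypothetical: the labels and scores are made up, the helper `blend` is not from any library, and equal weights are an arbitrary starting point.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def blend(predictions, weights=None):
    """Weighted average of several models' predicted probabilities."""
    stacked = np.vstack(predictions)
    return np.average(stacked, axis=0, weights=weights)


# Made-up validation labels and scores from two hypothetical models
y_valid_toy = np.array([0, 1, 0, 1, 0, 1])
scores_model_a = np.array([0.10, 0.80, 0.60, 0.40, 0.20, 0.70])
scores_model_b = np.array([0.30, 0.35, 0.20, 0.90, 0.40, 0.60])

blended = blend([scores_model_a, scores_model_b], weights=[0.5, 0.5])

for name, scores in [("A", scores_model_a), ("B", scores_model_b), ("blend", blended)]:
    print(name, round(roc_auc_score(y_valid_toy, scores), 3))
```

Blending tends to help when the models make different errors, as in this toy example. On real data the gain is not guaranteed, and the weights should be chosen on validation data rather than on the test set.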