# 02. Training and Final Model Selection

This notebook implements the complete validated workflow:
1. **Initial training**: compare 4 baseline models (Dummy, RF, XGB, LightGBM) on datasets V1 and V2 to choose the best dataset.
2. **Dataset selection & split**: freeze the Train/Val/Test split of the best dataset identified.
3. **LightGBM optimization**: tune LightGBM on that dataset only.
4. **Cross-validation & ensemble construction**: train 5 models via CV and combine them into an ensemble for robustness, without a global retraining.
5. **Test evaluation & saving**: evaluate the final ensemble on the held-out Test set, then save the model.

Sections 5 and 6 deviate from step 5: because notebook 3 does not work with the CV ensemble, a single LightGBM model is retrained with the tuned parameters, and that model is the one saved.

In [1]:
import pandas as pd
import numpy as np
import mlflow
import sys
import os
import joblib
import shutil

project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path: sys.path.append(project_root)

from src.model_utils import (
    get_train_val_test_split,
    train_dummy, 
    train_random_forest, 
    train_xgboost, 
    train_lightgbm,
    train_model_cv, 
    optimize_lightgbm,
    evaluate_model, 
    find_best_threshold
)

mlflow.set_tracking_uri("../mlruns")
mlflow.set_experiment("Credit_Scoring_Final_Workflow")

2025/12/13 20:28:11 INFO mlflow.tracking.fluent: Experiment with name 'Credit_Scoring_Final_Workflow' does not exist. Creating a new experiment.


<Experiment: artifact_location=('file:///c:/Users/aubin/Majeur IA/data '
 'analysis/credit-scoring/notebooks/../mlruns/466018668070655282'), creation_time=1765654091893, experiment_id='466018668070655282', last_update_time=1765654091893, lifecycle_stage='active', name='Credit_Scoring_Final_Workflow', tags={}>

In [2]:
# Load the prepared V1/V2 datasets
X_v1 = pd.read_pickle('../data/processed/X_prepared_v1.pkl')
y_v1 = pd.read_pickle('../data/processed/y_prepared_v1.pkl')
X_v2 = pd.read_pickle('../data/processed/X_prepared_v2.pkl')
y_v2 = pd.read_pickle('../data/processed/y_prepared_v2.pkl')

def clean_cols(df):
    """Replace every non-alphanumeric character in column names with "_" (modifies df in place)."""
    df.columns = ["".join(c if c.isalnum() else "_" for c in str(col)) for col in df.columns]
    return df

X_v1 = clean_cols(X_v1)
X_v2 = clean_cols(X_v2)
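
The notebook does not say why column names are cleaned. A likely motivation (an assumption) is that LightGBM rejects feature names containing special JSON characters. The sketch below applies the same character rule to a few hypothetical column names, invented for illustration, and shows one side effect worth checking: distinct raw names can collapse to the same cleaned name.

```python
def clean_name(name):
    """Replace every character that is not a letter or digit with "_"."""
    return "".join(c if c.isalnum() else "_" for c in str(name))

# Hypothetical column names, chosen only to illustrate the transformation
raw_names = [
    "EXT_SOURCE_1",
    "ORGANIZATION_TYPE_Business Entity Type 3",
    "NAME_FAMILY_STATUS_Single / not married",
    "a b",
    "a_b",
]
cleaned = [clean_name(n) for n in raw_names]

# Distinct raw names can collapse to the same cleaned name, so look for duplicates
duplicates = sorted({n for n in cleaned if cleaned.count(n) > 1})

for before, after in zip(raw_names, cleaned):
    print(f"{before!r} -> {after!r}")
print("duplicates:", duplicates)
```

Note that `str.isalnum()` is Unicode-aware, so accented letters are kept as-is. If the duplicate list is not empty on the real data, the colliding columns would need distinct names before training.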

## 1. Initial training

In [3]:
results = []

# Keep the split data so the winning dataset's split can be reused later
splits = {}

for name, X, y in [("v1", X_v1, y_v1), ("v2", X_v2, y_v2)]:
    print(f"Benchmarking Dataset {name}")
    Xt, yt, Xv, yv, Xte, yte = get_train_val_test_split(X, y)
    splits[name] = (Xt, yt, Xv, yv, Xte, yte)
    
    # Dummy
    _, m = train_dummy(Xt, yt, Xv, yv, Xte, yte, name)
    results.append({"Data": name, "Model": "Dummy", **m})
    
    # RF
    _, m = train_random_forest(Xt, yt, Xv, yv, Xte, yte, name)
    results.append({"Data": name, "Model": "RF", **m})
    
    # XGB
    _, m = train_xgboost(Xt, yt, Xv, yv, Xte, yte, name)
    results.append({"Data": name, "Model": "XGB", **m})
    
    # LightGBM (Baseline)
    _, m = train_lightgbm(Xt, yt, Xv, yv, Xte, yte, name)
    results.append({"Data": name, "Model": "LightGBM", **m})

df_res = pd.DataFrame(results).sort_values("val_best_cost")
display(df_res[["Data", "Model", "business_cost", "auc", "val_best_cost"]])

Benchmarking Dataset v1
Training Dummy_v1...
Best threshold found (Val): 0.01 (Cost: 37240)

Metrics (Test): {'auc': 0.5, 'recall': 0.0, 'f1': 0.0, 'accuracy': 0.9192663732737876, 'business_cost': np.int64(37240), 'val_best_cost': np.int64(37240)}
Training RandomForest_v1...
Best threshold found (Val): 0.49 (Cost: 24679)

Metrics (Test): {'auc': 0.751841252998915, 'recall': 0.64312567132116, 'f1': 0.2729966944032828, 'accuracy': 0.7234591454029093, 'business_cost': np.int64(24717), 'val_best_cost': np.int64(24679)}
Training XGBoost_v1...

Best threshold found (Val): 0.47 (Cost: 23317)

Metrics (Test): {'auc': 0.7811011886027459, 'recall': 0.6541353383458647, 'f1': 0.30628025397623687, 'accuracy': 0.7607691807401306, 'business_cost': np.int64(22627), 'val_best_cost': np.int64(23317)}
Training LightGBM_v1...
[LightGBM] [Info] Number of positive: 17377, number of negative: 197880
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.123890 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 20520
[LightGBM] [Info] Number of data points in the train set: 215257, number of used features: 275
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000
Training until validation scores don't improve for 50 rounds
[50]	valid_0's binary_logloss: 0.581632	valid_0's business_cost: 24417
[100]	valid_0's binary_logloss: 0.559113	valid_0's business_cost: 23567
[150]	valid_0's binary_logloss: 0.547297	valid_0's business_cost: 232



Metrics (Test): {'auc': 0.7818178650645198, 'recall': 0.685016111707841, 'f1': 0.29554538608584835, 'accuracy': 0.7363583150866088, 'business_cost': np.int64(22718), 'val_best_cost': np.int64(23141)}
Benchmarking Dataset v2
Training Dummy_v2...
Best threshold found (Val): 0.01 (Cost: 37230)

Metrics (Test): {'auc': 0.5, 'recall': 0.0, 'f1': 0.0, 'accuracy': 0.9192663732737876, 'business_cost': np.int64(37240), 'val_best_cost': np.int64(37230)}
Training RandomForest_v2...
Best threshold found (Val): 0.49 (Cost: 24370)

Metrics (Test): {'auc': 0.7544609364703311, 'recall': 0.6296992481203008, 'f1': 0.27576880108190743, 'accuracy': 0.7329763479090338, 'business_cost': np.int64(24728), 'val_best_cost': np.int64(24370)}
Training XGBoost_v2...

Best threshold found (Val): 0.47 (Cost: 23192)

Metrics (Test): {'auc': 0.777869056571474, 'recall': 0.6514500537056928, 'f1': 0.2959980478282089, 'accuracy': 0.7498211459665706, 'business_cost': np.int64(23222), 'val_best_cost': np.int64(23192)}
Training LightGBM_v2...
[LightGBM] [Info] Number of positive: 17377, number of negative: 197880
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.085694 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12549
[LightGBM] [Info] Number of data points in the train set: 215257, number of used features: 117
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000
Training until validation scores don't improve for 50 rounds
[50]	valid_0's binary_logloss: 0.580694	valid_0's business_cost: 24346
[100]	valid_0's binary_logloss: 0.558407	valid_0's business_cost: 23624
[150]	valid_0's binary_logloss: 0.546756	valid_0's business_cost: 23311



Metrics (Test): {'auc': 0.7789777125237858, 'recall': 0.6855531686358755, 'f1': 0.28369818868763197, 'accuracy': 0.7205107637609209, 'business_cost': np.int64(23431), 'val_best_cost': np.int64(23048)}


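|   | Data | Model | business_cost | auc | val_best_cost |
|---|------|-------|---------------|-----|---------------|
| 7 | v2 | LightGBM | 23431 | 0.778978 | 23048 |
| 3 | v1 | LightGBM | 22718 | 0.781818 | 23141 |
| 6 | v2 | XGB | 23222 | 0.777869 | 23192 |
| 2 | v1 | XGB | 22627 | 0.781101 | 23317 |
| 5 | v2 | RF | 24728 | 0.754461 | 24370 |
| 1 | v1 | RF | 24717 | 0.751841 | 24679 |
| 4 | v2 | Dummy | 37240 | 0.5 | 37230 |
| 0 | v1 | Dummy | 37240 | 0.5 | 37240 |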
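
The notebook never defines `business_cost`, but the logged Test metrics are consistent with a cost of 10 per false negative and 1 per false positive. The sketch below is a reconstruction under that assumption, not the code from `src.model_utils`: it rebuilds the confusion-matrix counts from the logged recall and accuracy and recomputes the cost for two models. The function name and the weights `fn_cost` and `fp_cost` are inferred, not taken from the source.

```python
def cost_from_metrics(n_total, n_pos, recall, accuracy, fn_cost=10, fp_cost=1):
    """Rebuild confusion-matrix counts from recall and accuracy, then price the errors."""
    tp = round(recall * n_pos)            # recall = TP / positives
    fn = n_pos - tp
    tn = round(accuracy * n_total) - tp   # accuracy = (TP + TN) / total
    fp = (n_total - n_pos) - tn
    return fn_cost * fn + fp_cost * fp

n_test = 46127  # Test set size, from the logged shapes

# The Dummy model has recall 0.0, so its accuracy equals the share of negatives
n_pos = n_test - round(0.9192663732737876 * n_test)

# Recall and accuracy copied from the logged Test metrics on dataset v1
rf_v1 = cost_from_metrics(n_test, n_pos, 0.64312567132116, 0.7234591454029093)
xgb_v1 = cost_from_metrics(n_test, n_pos, 0.6541353383458647, 0.7607691807401306)

print(n_pos, rf_v1, xgb_v1)
```

Both recomputed values match the `business_cost` column above (24717 for RF and 22627 for XGB on v1), and the Dummy cost of 37240 is 10 times the number of positives in the Test set.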


## 2. Selecting the Winning Dataset for Optimization
We keep the dataset on which LightGBM (the target model) reached the lowest validation business cost (`val_best_cost`).

In [4]:
lgbm_res = df_res[df_res["Model"] == "LightGBM"].sort_values("val_best_cost")
best_data_name = lgbm_res.iloc[0]["Data"]
print(f"Dataset selected for LightGBM optimization: {best_data_name}")

# Reuse the EXISTING splits
X_train, y_train, X_val, y_val, X_test, y_test = splits[best_data_name]
print(f"Shape Train: {X_train.shape}, Val: {X_val.shape}, Test: {X_test.shape}")

Dataset selected for LightGBM optimization: v2
Shape Train: (215257, 124), Val: (46126, 124), Test: (46127, 124)
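
As a quick sanity check on the split, the proportions can be recomputed from the logged shapes and from the class counts LightGBM printed for the training set. This is plain arithmetic on numbers already shown in the outputs. The variable names are illustrative.

```python
# Row counts taken from the logged shapes
sizes = {"train": 215257, "val": 46126, "test": 46127}
total = sum(sizes.values())
shares = {name: round(n / total, 4) for name, n in sizes.items()}

# Positive count in the training set, from the LightGBM log
train_positive_rate = round(17377 / sizes["train"], 4)

print(total, shares, train_positive_rate)
```

The shapes correspond to a 70/15/15 split, with roughly 8% positives in the training set.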


## 3. LightGBM Optimization with Optuna

In [5]:
best_params = optimize_lightgbm(X_train, y_train, X_val, y_val, n_trials=10)

final_params = best_params.copy()
final_params.update({
    "metric": "custom", "objective": "binary", "verbosity": -1,
    "boosting_type": "gbdt", "random_state": 42, "n_jobs": -1,
    "class_weight": "balanced", "n_estimators": 1000
})

[I 2025-12-13 20:31:26,714] A new study created in memory with name: no-name-203febb9-5cd4-4b8c-9432-5341587ebf5b
[I 2025-12-13 20:31:48,083] Trial 0 finished with value: 23212.0 and parameters: {'learning_rate': 0.14942003648617996, 'num_leaves': 167, 'max_depth': 8, 'min_child_samples': 81, 'min_split_gain': 0.16459778279698467, 'reg_alpha': 16.13110834403555, 'reg_lambda': 24.15329235983281}. Best is trial 0 with value: 23212.0.
[I 2025-12-13 20:32:06,316] Trial 1 finished with value: 23098.0 and parameters: {'learning_rate': 0.24156006546723113, 'num_leaves': 81, 'max_depth': 6, 'min_child_samples': 13, 'min_split_gain': 0.4180430648173412, 'reg_alpha': 21.21827513812108, 'reg_lambda': 20.26443030298551}. Best is trial 1 with value: 23098.0.
[I 2025-12-13 20:32:21,391] Trial 2 finished with value: 23286.0 and parameters: {'learning_rate': 0.23594889097981117, 'num_leaves': 45, 'max_depth': 11, 'min_child_samples': 81, 'min_split_gain': 0.6231616444495178, 'reg_alpha': 28.7863229985

Best params: {'learning_rate': 0.042259918925449495, 'num_leaves': 47, 'max_depth': 8, 'min_child_samples': 97, 'min_split_gain': 0.5002888403429622, 'reg_alpha': 4.299254375267902, 'reg_lambda': 14.348012168303342}


## 4. Cross-Validation & Construction of the Final Ensemble
We train one model per fold with 5-fold CV. The final ensemble averages the predictions of these 5 models (stored in a simplified wrapper class).
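
The wrapper class itself lives in `src.model_utils` and is not shown here. A minimal sketch of the averaging idea, under stated assumptions, could look like the following. The names `AveragingEnsemble` and `StubModel` are hypothetical, and for simplicity `predict_proba` returns a flat list of positive-class probabilities rather than the two-column array scikit-learn models return.

```python
class AveragingEnsemble:
    """Average the positive-class probabilities of several fitted models."""

    def __init__(self, models):
        self.models = list(models)

    def predict_proba(self, X):
        # One list of probabilities per model, then a mean per sample
        per_model = [m.predict_proba(X) for m in self.models]
        return [sum(sample) / len(self.models) for sample in zip(*per_model)]


class StubModel:
    """Stand-in for a fitted fold model: always returns fixed probabilities."""

    def __init__(self, probabilities):
        self.probabilities = probabilities

    def predict_proba(self, X):
        return self.probabilities[: len(X)]


# Two stub "fold models" scoring the same two samples
fold_models = [StubModel([0.25, 1.0]), StubModel([0.75, 0.5])]
ensemble = AveragingEnsemble(fold_models)
averaged = ensemble.predict_proba([["sample 1"], ["sample 2"]])
print(averaged)
```

Averaging keeps every fold model as trained, which is why no global retraining is needed, but the resulting object is not a single scikit-learn estimator.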

In [6]:
# Call our simplified CV function, which returns the ensemble and its Test metrics

ensemble_final, metrics_final = train_model_cv(
    X_train, y_train, X_val, y_val, X_test, y_test, 
    dataset_name=f"{best_data_name}_Ensemble_Final", 
    params=final_params
)

print("FINAL METRICS ON TEST (CV ensemble):", metrics_final)

Fold 1 finished (best_iteration=264).
Fold 2 finished (best_iteration=326).
Fold 3 finished (best_iteration=293).
Fold 4 finished (best_iteration=320).
Fold 5 finished (best_iteration=197).

Ensemble threshold calibration on X_val...
Metrics Test (Ensemble): {'auc': 0.7817818505991547, 'recall': 0.6769602577873255, 'f1': 0.29055494727136516, 'accuracy': 0.7331064235697097, 'business_cost': np.int64(23138)}

FINAL METRICS ON TEST (CV ensemble): {'auc': 0.7817818505991547, 'recall': 0.6769602577873255, 'f1': 0.29055494727136516, 'accuracy': 0.7331064235697097, 'business_cost': np.int64(23138)}
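
The logs mention a decision threshold calibrated on the validation set. The sketch below shows one way such a search could work: scan candidate thresholds and keep the one with the lowest business cost. It is not the implementation of `find_best_threshold` in `src.model_utils`. The cost weights (10 per false negative, 1 per false positive), the threshold grid, and the toy data are all assumptions made for illustration.

```python
def business_cost(y_true, y_pred, fn_cost=10, fp_cost=1):
    """Weighted error count: a missed default costs more than a refused good client."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return fn_cost * fn + fp_cost * fp


def best_threshold(y_true, proba, thresholds=None):
    """Return the (threshold, cost) pair with the lowest cost; ties keep the lowest threshold."""
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 100)]
    best_t, best_c = None, None
    for t in thresholds:
        preds = [1 if p >= t else 0 for p in proba]
        cost = business_cost(y_true, preds)
        if best_c is None or cost < best_c:
            best_t, best_c = t, cost
    return best_t, best_c


# Toy validation labels and predicted probabilities, invented for the example
y_val_toy = [0, 0, 0, 1, 0, 1, 0, 1]
proba_toy = [0.05, 0.20, 0.35, 0.40, 0.55, 0.60, 0.10, 0.90]
threshold, cost = best_threshold(y_val_toy, proba_toy)
print(threshold, cost)
```

Choosing the threshold on the validation set and reporting the cost on the Test set keeps the Test estimate free of threshold tuning.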


## 5. Training the final model without CV, because notebook 3 does not work with the CV ensemble

In [7]:
# Fall back to a single model with the best parameters, without CV, because the CV ensemble does not work in notebook 3
modele_final, metrics_final = train_lightgbm(
    X_train, y_train, X_val, y_val, X_test, y_test, 
    dataset_name=f"{best_data_name}_Ensemble_Final", 
    params=final_params
)

Training LightGBM_v2_Ensemble_Final...
Training until validation scores don't improve for 50 rounds
[50]	valid_0's business_cost: 24478
[100]	valid_0's business_cost: 23865
[150]	valid_0's business_cost: 23313
[200]	valid_0's business_cost: 23127
[250]	valid_0's business_cost: 22999
[300]	valid_0's business_cost: 22920
[350]	valid_0's business_cost: 22874
[400]	valid_0's business_cost: 22811
[450]	valid_0's business_cost: 22931
Early stopping, best iteration is:
[400]	valid_0's business_cost: 22811
Best threshold found (Val): 0.50 (Cost: 22811)

Metrics (Test): {'auc': 0.7814696260192563, 'recall': 0.6616541353383458, 'f1': 0.2956563474922006, 'accuracy': 0.7454852906107052, 'business_cost': np.int64(23080), 'val_best_cost': np.int64(22811)}


## 6. Saving

In [8]:
os.makedirs("../models", exist_ok=True)

joblib.dump(modele_final, "../models/best_model.pkl")

path_serving = "../models/final_model"
# Clear any previous export so the target directory is empty before saving
if os.path.exists(path_serving):
    shutil.rmtree(path_serving)
# Save in MLflow format as a scikit-learn model
mlflow.sklearn.save_model(modele_final, path_serving)
print("Final model saved and ready for Docker!")

Final model saved and ready for Docker!
