# 02. Training and Final Model Selection

This notebook implements the full validated workflow:
1. **Initial Training**: Compare 4 baseline models (Dummy, RF, XGB, LightGBM) on datasets V1 and V2 to pick the best dataset.
2. **Dataset Selection & Split**: Fix the Train/Val/Test split of the best dataset identified.
3. **LightGBM Optimization**: Tune LightGBM on that dataset only.
4. **Cross-Validation & Ensemble Construction**: Train 5 models via CV and combine them into an Ensemble for robustness, without a global retraining.
5. **Test Evaluation & Saving**: Evaluate on the held-out Test set (never seen during training), then save the final model. Because notebook 3 does not work with the CV Ensemble, the saved model is a single LightGBM retrained with the optimized parameters.

In [1]:
import pandas as pd
import numpy as np
import mlflow
import sys
import os
import joblib
import shutil

project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path: sys.path.append(project_root)

from src.model_utils import (
    get_train_val_test_split,
    train_dummy, 
    train_random_forest, 
    train_xgboost, 
    train_lightgbm,
    train_model_cv, 
    optimize_lightgbm,
    evaluate_model, 
    find_best_threshold
)

mlflow.set_tracking_uri("../mlruns")
mlflow.set_experiment("Credit_Scoring_Final_Workflow")

<Experiment: artifact_location=('file:///c:/Users/aubin/Majeur IA/data '
 'analysis/credit-scoring/notebooks/../mlruns/813913874702955039'), creation_time=1765642766148, experiment_id='813913874702955039', last_update_time=1765642766148, lifecycle_stage='active', name='Credit_Scoring_Final_Workflow', tags={'mlflow.experimentKind': 'custom_model_development'}>

In [2]:
# Load the prepared V1 and V2 datasets
X_v1 = pd.read_pickle('../data/processed/X_prepared_v1.pkl')
y_v1 = pd.read_pickle('../data/processed/y_prepared_v1.pkl')
X_v2 = pd.read_pickle('../data/processed/X_prepared_v2.pkl')
y_v2 = pd.read_pickle('../data/processed/y_prepared_v2.pkl')

def clean_cols(df):
    """Replace every non-alphanumeric character in column names with '_' (in place)."""
    df.columns = ["".join(c if c.isalnum() else "_" for c in str(col)) for col in df.columns]
    return df

X_v1 = clean_cols(X_v1)
X_v2 = clean_cols(X_v2)
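
LightGBM rejects feature names that contain special JSON characters, which is presumably why the column names are sanitized here. One side effect worth checking is that two distinct columns can collapse to the same sanitized name. The sketch below uses hypothetical column names and a hypothetical helper, `find_collisions`, to show how such duplicates could be detected before training.

```python
from collections import Counter

def sanitize(name):
    """Replace every non-alphanumeric character with an underscore."""
    return "".join(c if c.isalnum() else "_" for c in str(name))

def find_collisions(columns):
    """Return sanitized names that more than one original column maps to."""
    counts = Counter(sanitize(c) for c in columns)
    return sorted(name for name, n in counts.items() if n > 1)

# Hypothetical column names, for illustration only
cols = ["AMT CREDIT", "AMT_CREDIT", "EXT_SOURCE_1", "NAME:TYPE"]
print([sanitize(c) for c in cols])
print(find_collisions(cols))
```

An empty list from `find_collisions(X_v1.columns)` would mean the cleaning step did not merge any columns.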

## 1. Initial training

In [None]:
results = []

# Keep the split data so the winning dataset's split can be reused later
splits = {}

for name, X, y in [("v1", X_v1, y_v1), ("v2", X_v2, y_v2)]:
    print(f"Benchmarking Dataset {name}")
    Xt, yt, Xv, yv, Xte, yte = get_train_val_test_split(X, y)
    splits[name] = (Xt, yt, Xv, yv, Xte, yte)
    
    # Dummy
    _, m = train_dummy(Xt, yt, Xv, yv, Xte, yte, name)
    results.append({"Data": name, "Model": "Dummy", **m})
    
    # RF
    _, m = train_random_forest(Xt, yt, Xv, yv, Xte, yte, name)
    results.append({"Data": name, "Model": "RF", **m})
    
    # XGB
    _, m = train_xgboost(Xt, yt, Xv, yv, Xte, yte, name)
    results.append({"Data": name, "Model": "XGB", **m})
    
    # LightGBM (Baseline)
    _, m = train_lightgbm(Xt, yt, Xv, yv, Xte, yte, name)
    results.append({"Data": name, "Model": "LightGBM", **m})

df_res = pd.DataFrame(results).sort_values("val_best_cost")
display(df_res[["Data", "Model", "business_cost", "auc", "val_best_cost"]])

Benchmarking Dataset v1
Training LightGBM_v1...
[LightGBM] [Info] Number of positive: 17377, number of negative: 197880
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.035342 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 20520
[LightGBM] [Info] Number of data points in the train set: 215257, number of used features: 275
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000
Training until validation scores don't improve for 50 rounds
[50]	valid_0's binary_logloss: 0.581632	valid_0's business_cost: 24417
[100]	valid_0's binary_logloss: 0.559113	valid_0's business_cost: 23567
[150]	valid_0's binary_logloss: 0.547297	valid_0's business_cost: 23280
[200]	valid_0's binary_logloss: 0.538914	valid_0's business_cost: 23258
[250]	valid_0's binary_logloss: 0.5318



Metrics (Test): {'auc': 0.7818178650645198, 'recall': 0.685016111707841, 'f1': 0.29554538608584835, 'accuracy': 0.7363583150866088, 'business_cost': np.int64(22718), 'val_best_cost': np.int64(23141)}
Benchmarking Dataset v2
Training LightGBM_v2...
[LightGBM] [Info] Number of positive: 17377, number of negative: 197880
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.069702 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12549
[LightGBM] [Info] Number of data points in the train set: 215257, number of used features: 117
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=-0.000000
[LightGBM] [Info] Start training from score -0.000000
Training until validation scores don't improve for 50 rounds
[50]	valid_0's binary_logloss: 0.580694	valid_0's business_cost: 24346
[100]	valid_0's binary_logloss: 0.558407	valid_0's business_cost: 23624
[150]	valid_0's binary_logloss: 0.546756	valid



Metrics (Test): {'auc': 0.7789777125237858, 'recall': 0.6855531686358755, 'f1': 0.28369818868763197, 'accuracy': 0.7205107637609209, 'business_cost': np.int64(23431), 'val_best_cost': np.int64(23048)}


| | Data | Model | business_cost | auc | val_best_cost |
|---|---|---|---|---|---|
| 1 | v2 | LightGBM | 23431 | 0.778978 | 23048 |
| 0 | v1 | LightGBM | 22718 | 0.781818 | 23141 |


## 2. Selecting the Winning Dataset for Optimization
We keep the dataset on which LightGBM (the target model) reaches the lowest validation business cost (`val_best_cost`).

In [9]:
lgbm_res = df_res[df_res["Model"] == "LightGBM"].sort_values("val_best_cost")
best_data_name = lgbm_res.iloc[0]["Data"]
print(f"Dataset selected for LightGBM optimization: {best_data_name}")

# Reuse the EXISTING splits
X_train, y_train, X_val, y_val, X_test, y_test = splits[best_data_name]
print(f"Shape Train: {X_train.shape}, Val: {X_val.shape}, Test: {X_test.shape}")

Dataset selected for LightGBM optimization: v2
Shape Train: (215257, 124), Val: (46126, 124), Test: (46127, 124)
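
The shapes above are consistent with a 70/15/15 split of 307,510 rows. The internals of `get_train_val_test_split` are not shown in this notebook, so the following is only a sketch of one way such a split could be made while preserving the class ratio (stratification is an assumption here, as are the function name `stratified_split` and the toy labels).

```python
import numpy as np

def stratified_split(y, fractions=(0.70, 0.15, 0.15), seed=42):
    """Return index arrays (train, val, test) that preserve class ratios."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    parts = [[], [], []]
    for cls in np.unique(y):
        # Shuffle the indices of this class, then cut them by the fractions
        idx = rng.permutation(np.flatnonzero(y == cls))
        n_train = int(round(fractions[0] * len(idx)))
        n_val = int(round(fractions[1] * len(idx)))
        parts[0].append(idx[:n_train])
        parts[1].append(idx[n_train:n_train + n_val])
        parts[2].append(idx[n_train + n_val:])
    return tuple(np.sort(np.concatenate(p)) for p in parts)

# Toy labels with 8% positives (hypothetical data)
y = np.array([1] * 80 + [0] * 920)
train_idx, val_idx, test_idx = stratified_split(y)
print(len(train_idx), len(val_idx), len(test_idx))
print(y[train_idx].mean(), y[val_idx].mean(), y[test_idx].mean())
```

Splitting once and reusing the same indices for every model, as the `splits` dictionary does, keeps the comparison between models fair.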


## 3. LightGBM Optimization with Optuna

In [5]:
best_params = optimize_lightgbm(X_train, y_train, X_val, y_val, n_trials=10)

final_params = best_params.copy()
final_params.update({
    "metric": "custom", "objective": "binary", "verbosity": -1,
    "boosting_type": "gbdt", "random_state": 42, "n_jobs": -1,
    "class_weight": "balanced", "n_estimators": 1000
})

[I 2025-12-13 19:07:36,423] A new study created in memory with name: no-name-a33cd128-8386-47c3-ba34-fb5f3b0bd4b9
[I 2025-12-13 19:08:04,282] Trial 0 finished with value: 23053.0 and parameters: {'learning_rate': 0.20556559861622456, 'num_leaves': 219, 'max_depth': 4, 'min_child_samples': 89, 'min_split_gain': 0.08886159437542485, 'reg_alpha': 31.648969953576, 'reg_lambda': 20.60133891829057}. Best is trial 0 with value: 23053.0.


Best params: {'learning_rate': 0.20556559861622456, 'num_leaves': 219, 'max_depth': 4, 'min_child_samples': 89, 'min_split_gain': 0.08886159437542485, 'reg_alpha': 31.648969953576, 'reg_lambda': 20.60133891829057}
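
The final parameters set `"class_weight": "balanced"`. In LightGBM's scikit-learn wrapper this mode weights each class by `n_samples / (n_classes * np.bincount(y))`. The short calculation below applies that formula to the class counts printed in the training log (17377 positives, 197880 negatives) and shows that the weighted share of positives becomes one half, which matches the `pavg=0.500000` line in that log. The helper name `balanced_class_weights` is introduced here for illustration only.

```python
import numpy as np

def balanced_class_weights(class_counts):
    """weight_c = n_samples / (n_classes * count_c)."""
    counts = np.asarray(class_counts, dtype=float)
    return counts.sum() / (len(counts) * counts)

# Class counts from the training log: 197880 negatives, 17377 positives
counts = np.array([197880, 17377])
weights = balanced_class_weights(counts)

# Share of the total sample weight carried by the positive class
weighted_pos_share = weights[1] * counts[1] / np.sum(weights * counts)
print(weights)
print(weighted_pos_share)
```

With roughly 8% positives, each positive example ends up weighing about eleven times as much as a negative one.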


## 4. Cross-Validation & Construction of the Final Ensemble
We train 5 models, one per CV fold. The final Ensemble averages the predictions of these 5 models (wrapped in a simple class).
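
The wrapper class lives in `src.model_utils` and is not shown in this notebook. As a minimal sketch of the averaging idea, assuming each fold model exposes a scikit-learn style `predict_proba`, it could look like the following. `AveragingEnsemble` and `ConstantModel` are hypothetical names used only for this illustration.

```python
import numpy as np

class AveragingEnsemble:
    """Average the positive-class probabilities of several fitted models."""

    def __init__(self, models, threshold=0.5):
        self.models = list(models)
        self.threshold = threshold

    def predict_proba(self, X):
        # Mean of each model's P(y=1), returned in sklearn's two-column layout
        p1 = np.mean([m.predict_proba(X)[:, 1] for m in self.models], axis=0)
        return np.column_stack([1.0 - p1, p1])

    def predict(self, X):
        # Apply the decision threshold to the averaged probability
        return (self.predict_proba(X)[:, 1] >= self.threshold).astype(int)


class ConstantModel:
    """Stand-in for a fitted fold model (hypothetical, for illustration)."""

    def __init__(self, p):
        self.p = p

    def predict_proba(self, X):
        p1 = np.full(len(X), self.p)
        return np.column_stack([1.0 - p1, p1])


ensemble = AveragingEnsemble([ConstantModel(p) for p in (0.2, 0.4, 0.6)])
X_demo = np.zeros((3, 2))
print(ensemble.predict_proba(X_demo)[:, 1])
print(ensemble.predict(X_demo))
```

Exposing `predict` and `predict_proba` lets code that expects a single classifier call the ensemble in the same way.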

In [6]:
# Call our simplified CV function, which returns the Ensemble and its Test metrics

ensemble_final, metrics_final = train_model_cv(
    X_train, y_train, X_val, y_val, X_test, y_test, 
    dataset_name=f"{best_data_name}_Ensemble_Final", 
    params=final_params
)

print("FINAL TEST METRICS (CV Ensemble):", metrics_final)



Fold 1 done (best_iteration=100).
Fold 2 done (best_iteration=188).
Fold 3 done (best_iteration=214).
Fold 4 done (best_iteration=148).
Fold 5 done (best_iteration=176).

Calibrating Ensemble threshold on X_val...
Metrics Test (Ensemble): {'auc': 0.7861555974863765, 'recall': 0.680719656283566, 'f1': 0.30399328456649477, 'accuracy': 0.7483469551455764, 'business_cost': np.int64(22309)}


FileNotFoundError: [Errno 2] No such file or directory: 'c:\\Users\\aubin\\Majeur IA\\data analysis\\credit-scoring\\notebooks\\models\\LGBM_Ensemble_v1_Ensemble_Final.joblib'
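
The path in this error points to a `models` directory under `notebooks`, which suggests the target directory did not exist when the ensemble was written. A common guard is to create the parent directory before serializing. The sketch below shows that pattern with a hypothetical helper, `save_artifact`, and uses the standard-library `pickle` as a stand-in for `joblib.dump` so that it runs without extra dependencies.

```python
import pickle
import tempfile
from pathlib import Path

def save_artifact(obj, path):
    """Create the parent directory if needed, then serialize obj to path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(obj, f)  # stand-in for joblib.dump(obj, path)
    return path

# Demonstration in a temporary directory so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "models" / "demo_ensemble.pkl"
    saved = save_artifact({"n_models": 5}, target)
    with saved.open("rb") as f:
        restored = pickle.load(f)
    print(saved.name, restored)
```
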

## 5. Training the final model without CV, because notebook 3 does not work with the CV ensemble

In [7]:
# Fall back to a single model trained with the best parameters, since the CV ensemble does not work downstream
modele_final, metrics_final = train_lightgbm(
    X_train, y_train, X_val, y_val, X_test, y_test, 
    dataset_name=f"{best_data_name}_Ensemble_Final", 
    params=final_params
)

Training LightGBM_v1_Ensemble_Final...
Training until validation scores don't improve for 50 rounds
[50]	valid_0's business_cost: 23666
[100]	valid_0's business_cost: 23263
[150]	valid_0's business_cost: 23220
Early stopping, best iteration is:
[116]	valid_0's business_cost: 23053




Best threshold found (Val): 0.50 (Cost: 23053)

Metrics (Test): {'auc': 0.7828163973056544, 'recall': 0.6987110633727175, 'f1': 0.29283664397051373, 'accuracy': 0.7275565287142021, 'business_cost': np.int64(22665), 'val_best_cost': np.int64(23053)}
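
The definition of `business_cost` is in `src.model_utils` and is not shown here. The test metrics above do appear consistent with a cost of 10 per false negative and 1 per false positive: working back from recall and accuracy on the 46,127 test rows gives about 1,122 false negatives and 11,445 false positives, and 10 × 1122 + 11445 = 22665. Under that assumption, a threshold search on the validation set could be sketched as follows. The function names, the threshold grid, and the toy data are all hypothetical.

```python
import numpy as np

def business_cost(y_true, y_pred, fn_cost=10, fp_cost=1):
    """Total cost of errors: fn_cost per missed default, fp_cost per false alarm."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return int(fn_cost * fn + fp_cost * fp)

def best_threshold(y_true, proba, grid=None):
    """Return (threshold, cost) minimizing the business cost over a grid."""
    if grid is None:
        grid = np.round(np.arange(0.05, 1.0, 0.05), 2)
    costs = [business_cost(y_true, (proba >= t).astype(int)) for t in grid]
    best = int(np.argmin(costs))  # first minimum wins on ties
    return float(grid[best]), costs[best]

# Toy validation scores (hypothetical data)
y_val_demo = np.array([0, 0, 0, 0, 1, 0, 1, 1])
proba_demo = np.array([0.1, 0.2, 0.3, 0.45, 0.4, 0.6, 0.7, 0.9])
print(best_threshold(y_val_demo, proba_demo))
```

Because a missed default is weighted far more heavily than a false alarm, the search tends to favor thresholds that keep recall high at the expense of precision.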


## 6. Saving

In [8]:
os.makedirs("../models", exist_ok=True)

joblib.dump(modele_final, "../models/best_model.pkl")

path_serving = "../models/final_model"
# Remove any previous export so save_model writes to a clean directory
if os.path.exists(path_serving):
    shutil.rmtree(path_serving)
# Save in MLflow format as a scikit-learn model
mlflow.sklearn.save_model(modele_final, path_serving)
print("Final model saved and ready for Docker!")

Final model saved and ready for Docker!
