# üöÄ Mod√©lisation Avanc√©e - Home Credit Default Risk

Ce notebook explore des mod√®les avanc√©s avec optimisation d'hyperparam√®tres via **Optuna** :
- **XGBoost** : Gradient Boosting optimis√©
- **LightGBM** : Gradient Boosting ultra-rapide
- **MLP** : Multi-Layer Perceptron (r√©seau de neurones)

## Strat√©gie

1. **Pipelines flexibles** : Preprocessing + mod√®le
2. **Optimisation Optuna** : Recherche bay√©sienne d'hyperparam√®tres
3. **Tracking MLflow** : Versionnement et comparaison
4. **M√©trique m√©tier** : Co√ªt FN = 10x FP
5. **Validation crois√©e** : StratifiedKFold

## üì¶ Imports et Configuration

In [26]:
# Imports standards
import pandas as pd
import numpy as np
import os
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Sklearn
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    roc_auc_score, recall_score, f1_score, 
    make_scorer, confusion_matrix, classification_report
)

# Mod√®les
import xgboost as xgb
import lightgbm as lgb
from sklearn.neural_network import MLPClassifier

# Optimisation & Tracking
import optuna
from optuna.integration.mlflow import MLflowCallback
import mlflow
import mlflow.sklearn
import mlflow.xgboost
import mlflow.lightgbm

print("Imports reussis")
print(f"Versions:")
print(f"   - XGBoost:  {xgb.__version__}")
print(f"   - LightGBM: {lgb.__version__}")
print(f"   - Optuna:   {optuna.__version__}")
print(f"   - MLflow:   {mlflow.__version__}")

Imports reussis
Versions:
   - XGBoost:  3.2.0
   - LightGBM: 4.6.0
   - Optuna:   4.7.0
   - MLflow:   2.22.4


## üóÇÔ∏è Chargement des Donn√©es

In [27]:
# Charger les features engineered
df = pd.read_parquet('../data/features_engineered.parquet')

# S√©parer train et test
train = df[df['TARGET'].notna()].copy()
test = df[df['TARGET'].isna()].copy()

# Pr√©parer pour l'entra√Ænement
X_train = train.drop(['TARGET', 'SK_ID_CURR'], axis=1)
y_train = train['TARGET']

# Pour les pr√©dictions finales
X_test = test.drop(['TARGET', 'SK_ID_CURR'], axis=1)
test_ids = test['SK_ID_CURR']

print(f"Donnees chargees:")
print(f"   Train : {X_train.shape} ({len(y_train):,} echantillons)")
print(f"   Test  : {X_test.shape}")
print(f"\nDistribution des classes:")
print(f"   Classe 0: {(y_train == 0).sum():,} ({(y_train == 0).mean()*100:.1f}%)")
print(f"   Classe 1: {(y_train == 1).sum():,} ({(y_train == 1).mean()*100:.1f}%)")
print(f"   Ratio: 1:{(y_train == 0).sum() / (y_train == 1).sum():.1f}")

Donnees chargees:
   Train : (307507, 795) (307,507 echantillons)
   Test  : (48744, 795)

Distribution des classes:
   Classe 0: 282,682 (91.9%)
   Classe 1: 24,825 (8.1%)
   Ratio: 1:11.4


In [28]:
# Nettoyage des valeurs probl√©matiques
print("Nettoyage des donnees...")

X_train = X_train.replace([np.inf, -np.inf], np.nan).fillna(0)
X_test = X_test.replace([np.inf, -np.inf], np.nan).fillna(0)

print(f"Donnees nettoyees: {X_train.shape}")

Nettoyage des donnees...
Donnees nettoyees: (307507, 795)


In [29]:
# Train/Val split pour le test rapide
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, 
    test_size=0.2, 
    stratify=y_train, 
    random_state=42
)

print(f"Train subset: {X_tr.shape}")
print(f"Val subset:   {X_val.shape}")

Train subset: (246005, 795)
Val subset:   (61502, 795)


## ‚öôÔ∏è Configuration MLflow

In [30]:
from pathlib import Path

tracking_uri = Path.cwd().parent / 'mlruns'
mlflow.set_tracking_uri(tracking_uri.as_uri())
mlflow.set_experiment("Advanced Models - Optuna Optimization")

print(f"MLflow Tracking URI: {mlflow.get_tracking_uri()}")
print(f"Stockage: {tracking_uri}")
print(f"Experience: Advanced Models - Optuna Optimization")

# Fermer toute run active
if mlflow.active_run():
    mlflow.end_run()
    
print("MLflow configure")

MLflow Tracking URI: file:///e:/oc_mlops/mlruns
Stockage: e:\oc_mlops\mlruns
Experience: Advanced Models - Optuna Optimization
MLflow configure


## üìê M√©triques Personnalis√©es

In [31]:
def business_cost_scorer(y_true, y_pred):
    """
    Co√ªt m√©tier : FN (faux n√©gatif) co√ªte 10 fois plus cher que FP (faux positif)
    On retourne le n√©gatif du co√ªt pour maximiser (sklearn maximise les scores)
    """
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    cost = fp * 1 + fn * 10  # FN co√ªte 10x plus
    return -cost  # N√©gatif car on veut minimiser le co√ªt

# Configuration de la validation crois√©e
N_SPLITS = 3
RANDOM_STATE = 42

skf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=RANDOM_STATE)

# D√©finir les scorers
scoring = {
    'roc_auc': make_scorer(roc_auc_score, response_method='predict_proba'),
    'recall_minority': make_scorer(recall_score, pos_label=1, zero_division=0),
    'f1': make_scorer(f1_score, pos_label=1, zero_division=0),
    'business_cost': make_scorer(business_cost_scorer)
}

print(f"Metriques configurees:")
print(f"   - ROC-AUC")
print(f"   - Recall (classe minoritaire)")
print(f"   - F1-Score")
print(f"   - Cout metier (FN=10x FP)")

Metriques configurees:
   - ROC-AUC
   - Recall (classe minoritaire)
   - F1-Score
   - Cout metier (FN=10x FP)


In [32]:
# === DETECTION GPU/CPU AUTOMATIQUE ===
# Cette fonction detecte automatiquement la disponibilite du GPU
# Le code s'adaptera automatiquement selon le materiel disponible

def check_gpu_availability():
    """
    Detecte si un GPU est disponible pour XGBoost et LightGBM
    
    Returns:
        dict: Configuration GPU pour chaque framework
            - 'available': True si GPU detecte, False sinon
            - 'device': 'cuda'/'gpu' si disponible, 'cpu' sinon
    """
    gpu_config = {
        'xgboost': {'available': False, 'device': 'cpu'},
        'lightgbm': {'available': False, 'device': 'cpu'}
    }
    
    # Test XGBoost GPU
    print("Detection GPU pour XGBoost...")
    try:
        # Methode 1: Verifier via PyTorch si disponible
        import torch
        if torch.cuda.is_available():
            gpu_config['xgboost']['available'] = True
            gpu_config['xgboost']['device'] = 'cuda'
            gpu_name = torch.cuda.get_device_name(0)
            print(f"  -> GPU detecte: {gpu_name}")
    except ImportError:
        # Methode 2: Test direct avec XGBoost
        try:
            test_data = xgb.DMatrix(np.random.rand(10, 5), label=np.random.randint(0, 2, 10))
            xgb.train({'device': 'cuda', 'tree_method': 'hist'}, test_data, num_boost_round=1)
            gpu_config['xgboost']['available'] = True
            gpu_config['xgboost']['device'] = 'cuda'
            print("  -> GPU detecte")
        except Exception:
            print("  -> Pas de GPU, utilisation CPU")
    
    # Test LightGBM GPU
    print("\nDetection GPU pour LightGBM...")
    try:
        # LightGBM GPU necessite une compilation speciale
        test_data = lgb.Dataset(np.random.rand(10, 5), label=np.random.randint(0, 2, 10))
        lgb.train({'device': 'gpu', 'verbose': -1}, test_data, num_boost_round=1)
        gpu_config['lightgbm']['available'] = True
        gpu_config['lightgbm']['device'] = 'gpu'
        print("  -> GPU detecte")
    except Exception:
        print("  -> Pas de GPU, utilisation CPU")
    
    return gpu_config


# Executer la detection
print("="*80)
print("DETECTION DU MATERIEL DISPONIBLE")
print("="*80 + "\n")

GPU_CONFIG = check_gpu_availability()

print("\n" + "="*80)
print("CONFIGURATION FINALE:")
print("="*80)
print(f"XGBoost  : {GPU_CONFIG['xgboost']['device'].upper()}")
print(f"LightGBM : {GPU_CONFIG['lightgbm']['device'].upper()}")
print("="*80)

DETECTION DU MATERIEL DISPONIBLE

Detection GPU pour XGBoost...
  -> GPU detecte

Detection GPU pour LightGBM...
  -> GPU detecte

CONFIGURATION FINALE:
XGBoost  : CUDA
LightGBM : GPU


## Pipelines pour chaque modele

Chaque pipeline inclut :
1. **Scaler** : Normalisation des features
2. **Classifier** : Modele de classification
3. **Adaptation automatique GPU/CPU** : Parametres optimises selon le materiel disponible

### Points cles pour GPU :
- **XGBoost GPU** : Necessite `device='cuda'` + `tree_method='hist'`
- **LightGBM GPU** : Necessite `device='gpu'`
- **n_jobs** : RETIRE en mode GPU (conflit), utilise en mode CPU pour parallelisation
- **max_bin** : Augmente en mode GPU pour plus de calculs paralleles

In [33]:
def create_mlp_pipeline(params=None):
    """
    Pipeline MLP (Multi-Layer Perceptron) avec preprocessing
    
    Note: MLPClassifier de sklearn ne supporte pas le GPU
    Pour utiliser GPU avec reseaux de neurones, utiliser PyTorch ou TensorFlow
    
    Args:
        params: Dictionnaire d'hyperparametres (optionnel)
        
    Returns:
        Pipeline sklearn avec scaler + MLP
    """
    default_params = {
        'hidden_layer_sizes': (100, 50),
        'activation': 'relu',
        'solver': 'adam',
        'alpha': 0.0001,
        'learning_rate_init': 0.001,
        'max_iter': 200,
        'early_stopping': True,
        'validation_fraction': 0.1,
        'random_state': RANDOM_STATE,
        'verbose': False
    }
    
    if params:
        default_params.update(params)
    
    pipeline = Pipeline([
        ('scaler', RobustScaler()),  # RobustScaler pour MLP (plus robuste aux outliers)
        ('classifier', MLPClassifier(**default_params))
    ])
    
    return pipeline

print("Pipeline MLP cree")
print("  Mode: CPU (sklearn ne supporte pas GPU)")

Pipeline MLP cree
  Mode: CPU (sklearn ne supporte pas GPU)


#### 3. Pipeline MLP (CPU uniquement)

In [34]:
def create_lightgbm_pipeline(params=None):
    """
    Pipeline LightGBM avec adaptation automatique GPU/CPU
    
    Args:
        params: Dictionnaire d'hyperparametres (optionnel)
        
    Returns:
        Pipeline sklearn avec scaler + LightGBM
    """
    # Parametres par defaut
    default_params = {
        'n_estimators': 100,
        'max_depth': -1,
        'learning_rate': 0.1,
        'num_leaves': 31,
        'subsample': 0.8,
        'colsample_bytree': 0.8,
        'min_child_samples': 20,
        'reg_alpha': 0,
        'reg_lambda': 0,
        'random_state': RANDOM_STATE,
        'verbose': -1
    }
    
    # === ADAPTATION GPU/CPU AUTOMATIQUE ===
    if GPU_CONFIG['lightgbm']['available']:
        # Configuration GPU
        default_params['device'] = 'gpu'
        default_params['gpu_use_dp'] = False  # Single precision = plus rapide
        default_params['max_bin'] = 255  # Optimal pour GPU LightGBM
        # PAS de n_jobs avec GPU (conflit)
    else:
        # Configuration CPU
        default_params['device'] = 'cpu'
        default_params['n_jobs'] = -1  # Multiprocessing CPU
    
    # Merge avec parametres personnalises
    if params:
        # Si params force le device, adapter les parametres associes
        if 'device' in params:
            if params['device'] == 'gpu':
                params['gpu_use_dp'] = params.get('gpu_use_dp', False)
                params['max_bin'] = params.get('max_bin', 255)
                params.pop('n_jobs', None)  # Retirer n_jobs si present
            elif params['device'] == 'cpu':
                params.pop('gpu_use_dp', None)
                if 'n_jobs' not in params:
                    params['n_jobs'] = -1
        
        default_params.update(params)
    
    # Gestion du desequilibre des classes
    default_params['is_unbalance'] = True
    
    # Creation du pipeline
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', lgb.LGBMClassifier(**default_params))
    ])
    
    return pipeline

print("Pipeline LightGBM cree")
print(f"  Mode: {GPU_CONFIG['lightgbm']['device'].upper()}")

Pipeline LightGBM cree
  Mode: GPU


#### 2. Pipeline LightGBM

In [35]:
def create_xgboost_pipeline(params=None):
    """
    Pipeline XGBoost avec adaptation automatique GPU/CPU
    
    Args:
        params: Dictionnaire d'hyperparametres (optionnel)
        
    Returns:
        Pipeline sklearn avec scaler + XGBoost
    """
    # Parametres par defaut
    default_params = {
        'n_estimators': 100,
        'max_depth': 6,
        'learning_rate': 0.1,
        'subsample': 0.8,
        'colsample_bytree': 0.8,
        'min_child_weight': 1,
        'gamma': 0,
        'reg_alpha': 0,
        'reg_lambda': 1,
        'random_state': RANDOM_STATE,
        'use_label_encoder': False,
        'eval_metric': 'logloss'
    }
    
    # === ADAPTATION GPU/CPU AUTOMATIQUE ===
    if GPU_CONFIG['xgboost']['available']:
        # Configuration GPU
        default_params['device'] = 'cuda'
        default_params['tree_method'] = 'hist'  # OBLIGATOIRE pour GPU
        default_params['max_bin'] = 256  # Plus de bins = plus de calculs GPU
        default_params['grow_policy'] = 'depthwise'  # Algorithme optimal pour GPU
        # PAS de n_jobs avec GPU (conflit)
    else:
        # Configuration CPU
        default_params['device'] = 'cpu'
        default_params['n_jobs'] = -1  # Multiprocessing CPU
    
    # Merge avec parametres personnalises
    if params:
        # Si params force le device, adapter les parametres associes
        if 'device' in params:
            if params['device'] == 'cuda':
                params['tree_method'] = 'hist'
                params['max_bin'] = params.get('max_bin', 256)
                params['grow_policy'] = 'depthwise'
                params.pop('n_jobs', None)  # Retirer n_jobs si present
            elif params['device'] == 'cpu':
                params.pop('tree_method', None)
                params.pop('grow_policy', None)
                if 'n_jobs' not in params:
                    params['n_jobs'] = -1
        
        default_params.update(params)
    
    # Gestion du desequilibre des classes
    scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
    default_params['scale_pos_weight'] = scale_pos_weight
    
    # Creation du pipeline
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', xgb.XGBClassifier(**default_params))
    ])
    
    return pipeline

print("Pipeline XGBoost cree")
print(f"  Mode: {GPU_CONFIG['xgboost']['device'].upper()}")

Pipeline XGBoost cree
  Mode: CUDA


#### 1. Pipeline XGBoost

## üéØ Fonction d'Optimisation Optuna

Cette fonction sera utilis√©e comme objectif pour Optuna.
Elle effectue une validation crois√©e et retourne la m√©trique √† optimiser.

In [36]:
def optuna_objective(trial, model_type='xgboost', metric='roc_auc'):
    """
    Fonction objectif pour Optuna
    
    Args:
        trial: Trial Optuna
        model_type: 'xgboost', 'lightgbm' or 'mlp'
        metric: Metrique a optimiser ('roc_auc', 'f1', 'business_cost', etc.)
    
    Returns:
        Score moyen de validation croisee
    """
    
    # === HYPERPARAMETRES A OPTIMISER ===
    
    if model_type == 'xgboost':
        params = {
            'n_estimators': trial.suggest_int('n_estimators', 50, 300, step=50),
            'max_depth': trial.suggest_int('max_depth', 3, 10),
            'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
            'subsample': trial.suggest_float('subsample', 0.6, 1.0),
            'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
            'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
            'gamma': trial.suggest_float('gamma', 0, 5),
            'reg_alpha': trial.suggest_float('reg_alpha', 0, 10),
            'reg_lambda': trial.suggest_float('reg_lambda', 0, 10),
        }
        # Si GPU disponible, optimiser aussi max_bin
        if GPU_CONFIG['xgboost']['available']:
            params['max_bin'] = trial.suggest_int('max_bin', 128, 512, step=64)
        
        pipeline = create_xgboost_pipeline(params)
        
    elif model_type == 'lightgbm':
        params = {
            'n_estimators': trial.suggest_int('n_estimators', 50, 300, step=50),
            'max_depth': trial.suggest_int('max_depth', 3, 15),
            'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
            'num_leaves': trial.suggest_int('num_leaves', 20, 150),
            'subsample': trial.suggest_float('subsample', 0.6, 1.0),
            'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
            'min_child_samples': trial.suggest_int('min_child_samples', 5, 50),
            'reg_alpha': trial.suggest_float('reg_alpha', 0, 10),
            'reg_lambda': trial.suggest_float('reg_lambda', 0, 10),
        }
        pipeline = create_lightgbm_pipeline(params)
        
    elif model_type == 'mlp':
        # Configurer les couches cachees
        n_layers = trial.suggest_int('n_layers', 1, 3)
        hidden_layers = []
        for i in range(n_layers):
            hidden_layers.append(
                trial.suggest_int(f'n_units_l{i}', 50, 200, step=50)
            )
        
        params = {
            'hidden_layer_sizes': tuple(hidden_layers),
            'activation': trial.suggest_categorical('activation', ['relu', 'tanh']),
            'alpha': trial.suggest_float('alpha', 1e-5, 1e-1, log=True),
            'learning_rate_init': trial.suggest_float('learning_rate_init', 1e-4, 1e-2, log=True),
            'max_iter': 300,  # Augmente pour MLP
        }
        pipeline = create_mlp_pipeline(params)
    
    else:
        raise ValueError(f"Unknown model_type: {model_type}")
    
    # === VALIDATION CROISEE ===
    # IMPORTANT: n_jobs=1 pour eviter les conflits avec GPU
    # Le GPU sera utilise en interne par XGBoost/LightGBM
    try:
        cv_results = cross_validate(
            pipeline, 
            X_train, 
            y_train, 
            cv=skf, 
            scoring=scoring,
            n_jobs=1,  # CRITIQUE: 1 seul job pour compatibilite GPU
            return_train_score=False,
            error_score='raise'
        )
        
        # Retourner la metrique moyenne
        mean_score = np.mean(cv_results[f'test_{metric}'])
        
        # Pour business_cost, on veut minimiser (valeurs negatives)
        # Optuna maximise par defaut, donc on retourne tel quel
        return mean_score
        
    except Exception as e:
        print(f"Erreur dans le trial: {e}")
        # Retourner une tres mauvaise valeur en cas d'erreur
        return -np.inf if metric == 'business_cost' else 0.0


print("Fonction objectif Optuna definie")
print("\nNote importante:")
print("  - cross_validate utilise n_jobs=1")
print("  - Evite les conflits avec GPU")
print("  - Le GPU est utilise en interne par XGBoost/LightGBM")

Fonction objectif Optuna definie

Note importante:
  - cross_validate utilise n_jobs=1
  - Evite les conflits avec GPU
  - Le GPU est utilise en interne par XGBoost/LightGBM


## üî¨ Optimisation avec Optuna + MLflow

Fonction pour lancer l'optimisation d'un mod√®le avec tracking MLflow.

In [37]:
def optimize_model(model_type, n_trials=50, metric='roc_auc', timeout=None):
    """
    Optimise un mod√®le avec Optuna et track dans MLflow
    
    Args:
        model_type: 'xgboost', 'lightgbm' or 'mlp'
        n_trials: Nombre de trials Optuna
        metric: M√©trique √† optimiser
        timeout: Timeout en secondes (optionnel)
    
    Returns:
        best_params: Meilleurs hyperparam√®tres
        best_value: Meilleure valeur de la m√©trique
        study: Objet Study Optuna
    """
    
    print(f"\n{'='*80}")
    print(f"OPTIMISATION: {model_type.upper()}")
    print(f"{'='*80}")
    print(f"Metrique: {metric}")
    print(f"Trials: {n_trials}")
    print(f"CV: {N_SPLITS} folds")
    
    # Cr√©er une √©tude Optuna
    study_name = f"{model_type}_{metric}_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
    
    # Direction: maximize pour roc_auc, f1, recall
    # Pour business_cost (valeurs n√©gatives), on veut maximiser (moins n√©gatif = meilleur)
    direction = 'maximize'
    
    study = optuna.create_study(
        study_name=study_name,
        direction=direction,
        sampler=optuna.samplers.TPESampler(seed=RANDOM_STATE)
    )
    
    # Callback MLflow pour logger chaque trial
    mlflow_callback = MLflowCallback(
        tracking_uri=mlflow.get_tracking_uri(),
        metric_name=metric,
        create_experiment=False,
        mlflow_kwargs={
            "experiment_id": mlflow.get_experiment_by_name("Advanced Models - Optuna Optimization").experiment_id,
            "nested": True
        }
    )
    
    # Lancer l'optimisation avec parent run MLflow
    with mlflow.start_run(run_name=f"{model_type.upper()} - Optuna {n_trials} trials"):
        
        # Tags pour organisation
        mlflow.set_tags({
            "author": "Data Science Team",
            "project": "Home Credit Default Risk",
            "phase": "optimization",
            "model_type": model_type,
            "optimizer": "optuna",
            "framework": model_type if model_type != 'mlp' else 'sklearn',
            "environment": "development"
        })
        
        mlflow.set_tag("mlflow.note.content", f"""
OPTIMISATION HYPERPARAMETRES - {model_type.upper()}

Configuration:
- Optimiseur: Optuna (TPE Sampler)
- Nombre de trials: {n_trials}
- Metrique objectif: {metric}
- Validation: StratifiedKFold ({N_SPLITS} folds)
- Echantillons: {len(X_train):,}

Metrique metier:
- FN coute 10x plus que FP
- Objectif: minimiser le cout total

Date: {datetime.now().strftime('%Y-%m-%d %H:%M')}
        """)
        
        # Logger les param√®tres de configuration
        mlflow.log_param("model_type", model_type)
        mlflow.log_param("n_trials", n_trials)
        mlflow.log_param("metric", metric)
        mlflow.log_param("cv_folds", N_SPLITS)
        mlflow.log_param("n_samples", len(X_train))
        mlflow.log_param("n_features", X_train.shape[1])
        
        # Optimiser
        objective_fn = lambda trial: optuna_objective(trial, model_type, metric)
        
        study.optimize(
            objective_fn,
            n_trials=n_trials,
            timeout=timeout,
            callbacks=[mlflow_callback],
            show_progress_bar=True
        )
        
        # R√©sultats
        best_params = study.best_params
        best_value = study.best_value
        
        print(f"\n{'='*80}")
        print(f"OPTIMISATION TERMINEE")
        print(f"{'='*80}")
        print(f"Meilleur {metric}: {best_value:.4f}")
        print(f"\nMeilleurs hyperparametres:")
        for param, value in best_params.items():
            print(f"   {param:25s}: {value}")
        
        # Logger les meilleurs r√©sultats
        mlflow.log_metric(f"best_{metric}", best_value)
        for param, value in best_params.items():
            mlflow.log_param(f"best_{param}", value)
        
        # Entra√Æner le mod√®le final avec les meilleurs param√®tres
        print(f"\nEntrainement du modele final avec les meilleurs parametres...")
        
        if model_type == 'xgboost':
            best_pipeline = create_xgboost_pipeline(best_params)
        elif model_type == 'lightgbm':
            best_pipeline = create_lightgbm_pipeline(best_params)
        elif model_type == 'mlp':
            best_pipeline = create_mlp_pipeline(best_params)
        
        # Validation crois√©e finale
        final_cv = cross_validate(
            best_pipeline,
            X_train,
            y_train,
            cv=skf,
            scoring=scoring,
            n_jobs=1,
            return_train_score=False
        )
        
        # Logger toutes les m√©triques finales
        for metric_name in scoring.keys():
            scores = final_cv[f'test_{metric_name}']
            mean_val = np.mean(scores)
            std_val = np.std(scores)
            
            mlflow.log_metric(f"{metric_name}_mean", mean_val)
            mlflow.log_metric(f"{metric_name}_std", std_val)
            
            print(f"   {metric_name:20s}: {mean_val:.4f} (¬±{std_val:.4f})")
        
        # Entra√Æner sur toutes les donn√©es
        best_pipeline.fit(X_train, y_train)
        
        # Sauvegarder le mod√®le
        signature = mlflow.models.signature.infer_signature(
            X_train, 
            best_pipeline.predict_proba(X_train)[:, 1]
        )
        input_example = X_train.head(3)
        
        if model_type in ['xgboost', 'lightgbm']:
            # Sauvegarder avec le logger sp√©cifique
            if model_type == 'xgboost':
                mlflow.xgboost.log_model(
                    best_pipeline.named_steps['classifier'],
                    "model",
                    signature=signature,
                    input_example=best_pipeline.named_steps['scaler'].transform(input_example)
                )
            else:
                mlflow.lightgbm.log_model(
                    best_pipeline.named_steps['classifier'],
                    "model",
                    signature=signature,
                    input_example=best_pipeline.named_steps['scaler'].transform(input_example)
                )
        else:
            # MLP via sklearn
            mlflow.sklearn.log_model(
                best_pipeline,
                "model",
                signature=signature,
                input_example=input_example
            )
        
        # Sauvegarder l'√©tude Optuna
        import joblib
        study_path = f"optuna_study_{model_type}.pkl"
        joblib.dump(study, study_path)
        mlflow.log_artifact(study_path)
        os.remove(study_path)
        
        print(f"Modele et etude sauvegardes dans MLflow")
    
    return best_params, best_value, study


print("Fonction optimize_model() definie")

Fonction optimize_model() definie


## üöÄ Lancement des Optimisations

**‚ö†Ô∏è IMPORTANT:** Ajustez `n_trials` selon vos ressources :
- **Rapide** : 20-30 trials (~5-10 min par mod√®le)
- **Normal** : 50-100 trials (~15-30 min par mod√®le)
- **Complet** : 100-200 trials (~30-60 min par mod√®le)

Vous pouvez aussi utiliser `timeout` (en secondes) pour limiter le temps.

In [38]:
# Configuration d'optimisation
N_TRIALS = 30  # Ajustez selon votre temps disponible
OPTIMIZATION_METRIC = 'roc_auc'  # ou 'f1', 'recall_minority', 'business_cost'

# Stocker les r√©sultats
optimization_results = {}

print(f"Configuration:")
print(f"   Trials par modele: {N_TRIALS}")
print(f"   Metrique: {OPTIMIZATION_METRIC}")
print(f"   Temps estime: {N_TRIALS * 3 // 60}-{N_TRIALS * 5 // 60} min par modele")

Configuration:
   Trials par modele: 30
   Metrique: roc_auc
   Temps estime: 1-2 min par modele


### üî∑ XGBoost

In [39]:
# Optimiser XGBoost
xgb_params, xgb_score, xgb_study = optimize_model(
    model_type='xgboost',
    n_trials=N_TRIALS,
    metric=OPTIMIZATION_METRIC
)

optimization_results['xgboost'] = {
    'params': xgb_params,
    'score': xgb_score,
    'study': xgb_study
}

[32m[I 2026-02-11 09:21:16,833][0m A new study created in memory with name: xgboost_roc_auc_20260211_092116[0m



OPTIMISATION: XGBOOST
Metrique: roc_auc
Trials: 30
CV: 3 folds


  0%|          | 0/30 [00:00<?, ?it/s]

[32m[I 2026-02-11 09:22:40,096][0m Trial 0 finished with value: 0.765563182425239 and parameters: {'n_estimators': 150, 'max_depth': 10, 'learning_rate': 0.1205712628744377, 'subsample': 0.8394633936788146, 'colsample_bytree': 0.6624074561769746, 'min_child_weight': 2, 'gamma': 0.2904180608409973, 'reg_alpha': 8.661761457749352, 'reg_lambda': 6.011150117432088, 'max_bin': 384}. Best is trial 0 with value: 0.765563182425239.[0m
[32m[I 2026-02-11 09:23:39,891][0m Trial 1 finished with value: 0.7542774530943177 and parameters: {'n_estimators': 50, 'max_depth': 10, 'learning_rate': 0.16967533607196555, 'subsample': 0.6849356442713105, 'colsample_bytree': 0.6727299868828402, 'min_child_weight': 2, 'gamma': 1.5212112147976886, 'reg_alpha': 5.247564316322379, 'reg_lambda': 4.319450186421157, 'max_bin': 256}. Best is trial 0 with value: 0.765563182425239.[0m
[32m[I 2026-02-11 09:24:33,867][0m Trial 2 finished with value: 0.7672172927965977 and parameters: {'n_estimators': 200, 'max_dep



Downloading artifacts:   0%|          | 0/7 [00:00<?, ?it/s]

  "inputs": [
    [
      -1.3881618795034276,
      -0.7179102157733304,
      -0.6645376240045622,
      -0.5775338388214382,
      0.14212934945239714,
      -0.47809924638088896,
      -0.16607071806401402,
      -0.5056651144332955,
      -0.14944356532973233,
      1.506889195430464,
      0.5712186590186303,
      0.3798404716660183,
      0.5791516426282154,
      -0.455280390087387,
      0.0018033216760626552,
      0.468700579159064,
      -0.49899677178931173,
      0.04324482600534641,
      1.599373915711423,
      -0.24521694449987905,
      -1.2656787409538146,
      -0.10305921959839813,
      -0.06269285269710279,
      -0.6318129933579317,
      -0.12400503805664809,
      -0.23126821234121328,
      -0.205870046800924,
      -0.29120992932718553,
      -0.5472401405847414,
      -0.4678178605479511,
      -0.47652013951245636,
      -1.301587502329712,
      -1.007543063202894,
      -0.34518157593741267,
      0.003069721449246951,
      0.9609843022921758,
      1

Modele et etude sauvegardes dans MLflow


### üî∂ LightGBM

In [40]:
# Optimiser LightGBM
lgb_params, lgb_score, lgb_study = optimize_model(
    model_type='lightgbm',
    n_trials=N_TRIALS,
    metric=OPTIMIZATION_METRIC
)

optimization_results['lightgbm'] = {
    'params': lgb_params,
    'score': lgb_score,
    'study': lgb_study
}

[32m[I 2026-02-11 09:58:20,070][0m A new study created in memory with name: lightgbm_roc_auc_20260211_095820[0m



OPTIMISATION: LIGHTGBM
Metrique: roc_auc
Trials: 30
CV: 3 folds


  0%|          | 0/30 [00:00<?, ?it/s]

[32m[I 2026-02-11 10:00:01,335][0m Trial 0 finished with value: 0.7782723488990384 and parameters: {'n_estimators': 150, 'max_depth': 15, 'learning_rate': 0.1205712628744377, 'num_leaves': 98, 'subsample': 0.6624074561769746, 'colsample_bytree': 0.662397808134481, 'min_child_samples': 7, 'reg_alpha': 8.661761457749352, 'reg_lambda': 6.011150117432088}. Best is trial 0 with value: 0.7782723488990384.[0m
[32m[I 2026-02-11 10:01:04,716][0m Trial 1 finished with value: 0.7810293655472217 and parameters: {'n_estimators': 250, 'max_depth': 3, 'learning_rate': 0.2708160864249968, 'num_leaves': 129, 'subsample': 0.6849356442713105, 'colsample_bytree': 0.6727299868828402, 'min_child_samples': 13, 'reg_alpha': 3.0424224295953772, 'reg_lambda': 5.247564316322379}. Best is trial 1 with value: 0.7810293655472217.[0m
[32m[I 2026-02-11 10:02:16,968][0m Trial 2 finished with value: 0.7821710081241818 and parameters: {'n_estimators': 150, 'max_depth': 6, 'learning_rate': 0.08012737503998542, 'n



Downloading artifacts:   0%|          | 0/7 [00:00<?, ?it/s]

  "inputs": [
    [
      -1.3881618795034276,
      -0.7179102157733304,
      -0.6645376240045622,
      -0.5775338388214382,
      0.14212934945239714,
      -0.47809924638088896,
      -0.16607071806401402,
      -0.5056651144332955,
      -0.14944356532973233,
      1.506889195430464,
      0.5712186590186303,
      0.3798404716660183,
      0.5791516426282154,
      -0.455280390087387,
      0.0018033216760626552,
      0.468700579159064,
      -0.49899677178931173,
      0.04324482600534641,
      1.599373915711423,
      -0.24521694449987905,
      -1.2656787409538146,
      -0.10305921959839813,
      -0.06269285269710279,
      -0.6318129933579317,
      -0.12400503805664809,
      -0.23126821234121328,
      -0.205870046800924,
      -0.29120992932718553,
      -0.5472401405847414,
      -0.4678178605479511,
      -0.47652013951245636,
      -1.301587502329712,
      -1.007543063202894,
      -0.34518157593741267,
      0.003069721449246951,
      0.9609843022921758,
      1

Modele et etude sauvegardes dans MLflow


### Lancer l'optimisation avec undersampling

In [42]:
def optimize_model_undersampled(model_type, X_resampled, y_resampled, n_trials=50, metric='roc_auc', timeout=None):
    """
    Optimise un modele avec Optuna sur donnees sous-echantillonnees
    
    Args:
        model_type: 'xgboost', 'lightgbm' or 'mlp'
        X_resampled: Features sous-echantillonnees
        y_resampled: Target sous-echantillonnes
        n_trials: Nombre de trials Optuna
        metric: Metrique a optimiser
        timeout: Timeout en secondes (optionnel)
    
    Returns:
        best_params: Meilleurs hyperparametres
        best_value: Meilleure valeur de la metrique
        study: Objet Study Optuna
    """
    
    print(f"\n{'='*80}")
    print(f"OPTIMISATION UNDERSAMPLED: {model_type.upper()}")
    print(f"{'='*80}")
    print(f"Metrique: {metric}")
    print(f"Trials: {n_trials}")
    print(f"CV: {N_SPLITS} folds")
    print(f"Dataset: {len(X_resampled):,} echantillons (apres undersampling)")
    
    # Creer une etude Optuna
    study_name = f"{model_type}_undersampled_{metric}_{datetime.now().strftime('%Y%m%d_%H%M%S')}"
    
    direction = 'maximize'
    
    study = optuna.create_study(
        study_name=study_name,
        direction=direction,
        sampler=optuna.samplers.TPESampler(seed=RANDOM_STATE)
    )
    
    # Callback MLflow
    mlflow_callback = MLflowCallback(
        tracking_uri=mlflow.get_tracking_uri(),
        metric_name=metric,
        create_experiment=False,
        mlflow_kwargs={
            "experiment_id": mlflow.get_experiment_by_name("Advanced Models - Optuna Optimization").experiment_id,
            "nested": True
        }
    )
    
    # Lancer l'optimisation
    with mlflow.start_run(run_name=f"{model_type.upper()} - Undersampled - Optuna {n_trials} trials"):
        
        mlflow.set_tags({
            "author": "Data Science Team",
            "project": "Home Credit Default Risk",
            "phase": "optimization",
            "model_type": model_type,
            "optimizer": "optuna",
            "framework": model_type if model_type != 'mlp' else 'sklearn',
            "environment": "development",
            "sampling_strategy": "undersampling",
            "sampling_ratio": f"1:{1/sampling_strategy:.0f}"
        })
        
        mlflow.set_tag("mlflow.note.content", f"""
OPTIMISATION AVEC UNDERSAMPLING - {model_type.upper()}

Configuration:
- Optimiseur: Optuna (TPE Sampler)
- Nombre de trials: {n_trials}
- Metrique objectif: {metric}
- Validation: StratifiedKFold ({N_SPLITS} folds)
- Echantillons: {len(X_resampled):,} (apres undersampling)
- Sampling ratio: 1:{1/sampling_strategy:.0f}

Strategie:
- RandomUnderSampler de la classe majoritaire
- Objectif: ameliorer la detection des defauts (recall)

Date: {datetime.now().strftime('%Y-%m-%d %H:%M')}
        """)
        
        # Logger les parametres
        mlflow.log_param("model_type", model_type)
        mlflow.log_param("n_trials", n_trials)
        mlflow.log_param("metric", metric)
        mlflow.log_param("cv_folds", N_SPLITS)
        mlflow.log_param("n_samples_original", len(X_train))
        mlflow.log_param("n_samples_resampled", len(X_resampled))
        mlflow.log_param("sampling_strategy", "undersampling")
        mlflow.log_param("sampling_ratio", f"1:{1/sampling_strategy:.0f}")
        mlflow.log_param("n_features", X_resampled.shape[1])
        
        # Fonction objectif adaptee
        def objective_undersampled(trial):
            if model_type == 'lightgbm':
                params = {
                    'n_estimators': trial.suggest_int('n_estimators', 50, 300, step=50),
                    'max_depth': trial.suggest_int('max_depth', 3, 15),
                    'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
                    'num_leaves': trial.suggest_int('num_leaves', 20, 150),
                    'subsample': trial.suggest_float('subsample', 0.6, 1.0),
                    'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
                    'min_child_samples': trial.suggest_int('min_child_samples', 5, 50),
                    'reg_alpha': trial.suggest_float('reg_alpha', 0, 10),
                    'reg_lambda': trial.suggest_float('reg_lambda', 0, 10),
                }
                
                # Configuration GPU/CPU
                if GPU_CONFIG['lightgbm']['available']:
                    params['device'] = 'gpu'
                    params['gpu_use_dp'] = False
                    params['max_bin'] = 255
                else:
                    params['device'] = 'cpu'
                    params['n_jobs'] = -1
                
                # PAS de is_unbalance car donnees deja equilibrees
                params['random_state'] = RANDOM_STATE
                params['verbose'] = -1
                
                pipeline = Pipeline([
                    ('scaler', StandardScaler()),
                    ('classifier', lgb.LGBMClassifier(**params))
                ])
            else:
                raise ValueError(f"Seul LightGBM est supporte pour l'instant")
            
            try:
                cv_results = cross_validate(
                    pipeline, 
                    X_resampled, 
                    y_resampled, 
                    cv=skf, 
                    scoring=scoring,
                    n_jobs=1,
                    return_train_score=False,
                    error_score='raise'
                )
                
                mean_score = np.mean(cv_results[f'test_{metric}'])
                return mean_score
                
            except Exception as e:
                print(f"Erreur dans le trial: {e}")
                return -np.inf if metric == 'business_cost' else 0.0
        
        # Optimiser
        study.optimize(
            objective_undersampled,
            n_trials=n_trials,
            timeout=timeout,
            callbacks=[mlflow_callback],
            show_progress_bar=True
        )
        
        # Resultats
        best_params = study.best_params
        best_value = study.best_value
        
        print(f"\n{'='*80}")
        print(f"OPTIMISATION TERMINEE")
        print(f"{'='*80}")
        print(f"Meilleur {metric}: {best_value:.4f}")
        print(f"\nMeilleurs hyperparametres:")
        for param, value in best_params.items():
            print(f"   {param:25s}: {value}")
        
        # Logger les meilleurs resultats
        mlflow.log_metric(f"best_{metric}", best_value)
        for param, value in best_params.items():
            mlflow.log_param(f"best_{param}", value)
        
        # Entrainer le modele final
        print(f"\nEntrainement du modele final...")
        
        if model_type == 'lightgbm':
            final_params = best_params.copy()
            if GPU_CONFIG['lightgbm']['available']:
                final_params['device'] = 'gpu'
                final_params['gpu_use_dp'] = False
                final_params['max_bin'] = 255
            else:
                final_params['device'] = 'cpu'
                final_params['n_jobs'] = -1
            
            final_params['random_state'] = RANDOM_STATE
            final_params['verbose'] = -1
            
            best_pipeline = Pipeline([
                ('scaler', StandardScaler()),
                ('classifier', lgb.LGBMClassifier(**final_params))
            ])
        
        # Validation croisee finale
        final_cv = cross_validate(
            best_pipeline,
            X_resampled,
            y_resampled,
            cv=skf,
            scoring=scoring,
            n_jobs=1,
            return_train_score=False
        )
        
        # Logger toutes les metriques finales
        for metric_name in scoring.keys():
            scores = final_cv[f'test_{metric_name}']
            mean_val = np.mean(scores)
            std_val = np.std(scores)
            
            mlflow.log_metric(f"{metric_name}_mean", mean_val)
            mlflow.log_metric(f"{metric_name}_std", std_val)
            
            print(f"   {metric_name:20s}: {mean_val:.4f} (¬±{std_val:.4f})")
        
        # Entrainer sur toutes les donnees resamplees
        best_pipeline.fit(X_resampled, y_resampled)
        
        # Evaluer sur le dataset ORIGINAL (validation realiste)
        print(f"\n{'='*80}")
        print(f"EVALUATION SUR DATASET ORIGINAL (sans undersampling)")
        print(f"{'='*80}")
        
        y_pred_proba_original = best_pipeline.predict_proba(X_train)[:, 1]
        y_pred_original = best_pipeline.predict(X_train)
        
        auc_original = roc_auc_score(y_train, y_pred_proba_original)
        recall_original = recall_score(y_train, y_pred_original)
        f1_original = f1_score(y_train, y_pred_original)
        
        print(f"   ROC-AUC: {auc_original:.4f}")
        print(f"   Recall: {recall_original:.4f}")
        print(f"   F1-Score: {f1_original:.4f}")
        
        mlflow.log_metric("roc_auc_on_original", auc_original)
        mlflow.log_metric("recall_on_original", recall_original)
        mlflow.log_metric("f1_on_original", f1_original)
        
        # Sauvegarder le modele
        mlflow.sklearn.log_model(
            best_pipeline,
            "model",
            signature=mlflow.models.signature.infer_signature(X_resampled, y_pred_proba_original[:len(X_resampled)]),
            input_example=X_resampled[:1]
        )
        
        print(f"Modele sauvegarde dans MLflow")
    
    return best_params, best_value, study

print("Fonction optimize_model_undersampled() definie")

Fonction optimize_model_undersampled() definie


### Optimisation LightGBM avec Undersampling

In [44]:


from imblearn.under_sampling import RandomUnderSampler

# Configuration de l'undersampling
# Ratio 1:2 (2 exemples classe 0 pour 1 exemple classe 1)
sampling_strategy = 1  # 0.5 = 1:2, 1.0 = 1:1

rus = RandomUnderSampler(
    sampling_strategy=sampling_strategy,
    random_state=RANDOM_STATE
)

# Appliquer l'undersampling
X_train_resampled, y_train_resampled = rus.fit_resample(X_train, y_train)

print("Dataset original:")
print(f"   Classe 0: {(y_train == 0).sum():,} ({(y_train == 0).mean()*100:.1f}%)")
print(f"   Classe 1: {(y_train == 1).sum():,} ({(y_train == 1).mean()*100:.1f}%)")
print(f"   Total: {len(y_train):,} echantillons")

print(f"\nDataset apres undersampling (ratio 1:{1/sampling_strategy:.0f}):")
print(f"   Classe 0: {(y_train_resampled == 0).sum():,} ({(y_train_resampled == 0).mean()*100:.1f}%)")
print(f"   Classe 1: {(y_train_resampled == 1).sum():,} ({(y_train_resampled == 1).mean()*100:.1f}%)")
print(f"   Total: {len(y_train_resampled):,} echantillons")
print(f"\nReduction: {(1 - len(y_train_resampled)/len(y_train))*100:.1f}% des donnees")

Dataset original:
   Classe 0: 282,682 (91.9%)
   Classe 1: 24,825 (8.1%)
   Total: 307,507 echantillons

Dataset apres undersampling (ratio 1:1):
   Classe 0: 24,825 (50.0%)
   Classe 1: 24,825 (50.0%)
   Total: 49,650 echantillons

Reduction: 83.9% des donnees


In [45]:
# Optimiser LightGBM avec undersampling
lgb_under_params, lgb_under_score, lgb_under_study = optimize_model_undersampled(
    model_type='lightgbm',
    X_resampled=X_train_resampled,
    y_resampled=y_train_resampled,
    n_trials=N_TRIALS,
    metric=OPTIMIZATION_METRIC
)

# Stocker les resultats
optimization_results['lightgbm_undersampled'] = {
    'params': lgb_under_params,
    'score': lgb_under_score,
    'study': lgb_under_study
}

[32m[I 2026-02-11 14:16:01,204][0m A new study created in memory with name: lightgbm_undersampled_roc_auc_20260211_141601[0m



OPTIMISATION UNDERSAMPLED: LIGHTGBM
Metrique: roc_auc
Trials: 30
CV: 3 folds
Dataset: 49,650 echantillons (apres undersampling)


  0%|          | 0/30 [00:00<?, ?it/s]

[32m[I 2026-02-11 14:16:52,668][0m Trial 0 finished with value: 0.774858946766337 and parameters: {'n_estimators': 150, 'max_depth': 15, 'learning_rate': 0.1205712628744377, 'num_leaves': 98, 'subsample': 0.6624074561769746, 'colsample_bytree': 0.662397808134481, 'min_child_samples': 7, 'reg_alpha': 8.661761457749352, 'reg_lambda': 6.011150117432088}. Best is trial 0 with value: 0.774858946766337.[0m
[32m[I 2026-02-11 14:17:08,270][0m Trial 1 finished with value: 0.7770433444991071 and parameters: {'n_estimators': 250, 'max_depth': 3, 'learning_rate': 0.2708160864249968, 'num_leaves': 129, 'subsample': 0.6849356442713105, 'colsample_bytree': 0.6727299868828402, 'min_child_samples': 13, 'reg_alpha': 3.0424224295953772, 'reg_lambda': 5.247564316322379}. Best is trial 1 with value: 0.7770433444991071.[0m
[32m[I 2026-02-11 14:17:32,070][0m Trial 2 finished with value: 0.7780711603581566 and parameters: {'n_estimators': 150, 'max_depth': 6, 'learning_rate': 0.08012737503998542, 'num



Downloading artifacts:   0%|          | 0/7 [00:00<?, ?it/s]

  "dataframe_split": {
    "columns": [
      "CODE_GENDER",
      "FLAG_OWN_CAR",
      "FLAG_OWN_REALTY",
      "CNT_CHILDREN",
      "AMT_INCOME_TOTAL",
      "AMT_CREDIT",
      "AMT_ANNUITY",
      "AMT_GOODS_PRICE",
      "REGION_POPULATION_RELATIVE",
      "DAYS_BIRTH",
      "DAYS_EMPLOYED",
      "DAYS_REGISTRATION",
      "DAYS_ID_PUBLISH",
      "OWN_CAR_AGE",
      "FLAG_MOBIL",
      "FLAG_EMP_PHONE",
      "FLAG_WORK_PHONE",
      "FLAG_CONT_MOBILE",
      "FLAG_PHONE",
      "FLAG_EMAIL",
      "CNT_FAM_MEMBERS",
      "REGION_RATING_CLIENT",
      "REGION_RATING_CLIENT_W_CITY",
      "HOUR_APPR_PROCESS_START",
      "REG_REGION_NOT_LIVE_REGION",
      "REG_REGION_NOT_WORK_REGION",
      "LIVE_REGION_NOT_WORK_REGION",
      "REG_CITY_NOT_LIVE_CITY",
      "REG_CITY_NOT_WORK_CITY",
      "LIVE_CITY_NOT_WORK_CITY",
      "EXT_SOURCE_1",
      "EXT_SOURCE_2",
      "EXT_SOURCE_3",
      "APARTMENTS_AVG",
      "BASEMENTAREA_AVG",
      "YEARS_BEGINEXPLUATATION_AVG",
      "

Modele sauvegarde dans MLflow


## Approche alternative : Undersampling

### Pourquoi l'undersampling ?

Le dataset est tres desequilibre :
- Classe 0 (pas de defaut) : 91.9%
- Classe 1 (defaut) : 8.1%
- Ratio : 1:11.4

**Strategie d'undersampling :**
- Reduire le nombre d'exemples de la classe majoritaire (0)
- Equilibrer le dataset : ratio 1:1 ou 1:2
- Peut ameliorer la detection de la classe minoritaire (recall)

**Avantages :**
- Entrainement plus rapide (moins de donnees)
- Meilleure detection des defauts
- Reduit le biais vers la classe majoritaire

**Inconvenients :**
- Perte d'information (suppression de donnees)
- Risque de sous-representer la classe majoritaire
- Performance globale peut baisser

### üß† MLP (Multi-Layer Perceptron)

In [41]:
# Optimiser MLP
mlp_params, mlp_score, mlp_study = optimize_model(
    model_type='mlp',
    n_trials=N_TRIALS,
    metric=OPTIMIZATION_METRIC
)

optimization_results['mlp'] = {
    'params': mlp_params,
    'score': mlp_score,
    'study': mlp_study
}

[32m[I 2026-02-11 11:04:51,541][0m A new study created in memory with name: mlp_roc_auc_20260211_110451[0m



OPTIMISATION: MLP
Metrique: roc_auc
Trials: 30
CV: 3 folds


  0%|          | 0/30 [00:00<?, ?it/s]

[32m[I 2026-02-11 11:13:28,807][0m Trial 0 finished with value: 0.5193488428233669 and parameters: {'n_layers': 2, 'n_units_l0': 200, 'n_units_l1': 150, 'activation': 'relu', 'alpha': 4.207053950287933e-05, 'learning_rate_init': 0.00013066739238053285}. Best is trial 0 with value: 0.5193488428233669.[0m
[32m[I 2026-02-11 11:19:15,497][0m Trial 1 finished with value: 0.5187799977340202 and parameters: {'n_layers': 3, 'n_units_l0': 150, 'n_units_l1': 150, 'n_units_l2': 50, 'activation': 'relu', 'alpha': 7.068974950624602e-05, 'learning_rate_init': 0.0002310201887845295}. Best is trial 0 with value: 0.5193488428233669.[0m
[32m[I 2026-02-11 11:24:46,765][0m Trial 2 finished with value: 0.6009923010015823 and parameters: {'n_layers': 1, 'n_units_l0': 100, 'activation': 'relu', 'alpha': 0.0001461896279370495, 'learning_rate_init': 0.0016738085788752138}. Best is trial 2 with value: 0.6009923010015823.[0m
[32m[I 2026-02-11 11:27:45,762][0m Trial 3 finished with value: 0.64870560504

TypeError: MLPClassifier.__init__() got an unexpected keyword argument 'n_layers'

## üìä Comparaison des R√©sultats

In [None]:
# Cr√©er un tableau de comparaison
comparison_df = pd.DataFrame({
    'Model': list(optimization_results.keys()),
    f'Best {OPTIMIZATION_METRIC}': [
        results['score'] for results in optimization_results.values()
    ]
}).sort_values(f'Best {OPTIMIZATION_METRIC}', ascending=False)

print(f"\n{'='*80}")
print(f"COMPARAISON DES MODELES")
print(f"{'='*80}\n")
print(comparison_df.to_string(index=False))
print(f"\n{'='*80}")
print(f"MEILLEUR MODELE: {comparison_df.iloc[0]['Model'].upper()}")
print(f"   {OPTIMIZATION_METRIC}: {comparison_df.iloc[0][f'Best {OPTIMIZATION_METRIC}']:.4f}")
print(f"{'='*80}\n")

## üìà Visualisation des √âtudes Optuna

In [None]:
import matplotlib.pyplot as plt
from optuna.visualization import (
    plot_optimization_history,
    plot_param_importances,
    plot_slice
)

# Cr√©er les visualisations pour chaque mod√®le
for model_name, results in optimization_results.items():
    study = results['study']
    
    print(f"\n{'='*80}")
    print(f"VISUALISATIONS: {model_name.upper()}")
    print(f"{'='*80}\n")
    
    # 1. Historique d'optimisation
    fig = plot_optimization_history(study)
    fig.update_layout(title=f"{model_name.upper()} - Optimization History")
    fig.show()
    
    # 2. Importance des hyperparam√®tres
    try:
        fig = plot_param_importances(study)
        fig.update_layout(title=f"{model_name.upper()} - Hyperparameter Importances")
        fig.show()
    except Exception as e:
        print(f"Impossible de generer plot_param_importances: {e}")
    
    # 3. Distribution des hyperparam√®tres
    try:
        fig = plot_slice(study)
        fig.update_layout(title=f"{model_name.upper()} - Hyperparameter Slices")
        fig.show()
    except Exception as e:
        print(f"Impossible de generer plot_slice: {e}")

## üíæ Sauvegarder le Meilleur Mod√®le

Entra√Æner le meilleur mod√®le sur toutes les donn√©es et le sauvegarder.

In [None]:
# Identifier le meilleur mod√®le
best_model_name = comparison_df.iloc[0]['Model']
best_params = optimization_results[best_model_name]['params']

print(f"Entrainement du meilleur modele: {best_model_name.upper()}")

# Cr√©er le pipeline avec les meilleurs param√®tres
if best_model_name == 'xgboost':
    final_pipeline = create_xgboost_pipeline(best_params)
elif best_model_name == 'lightgbm':
    final_pipeline = create_lightgbm_pipeline(best_params)
elif best_model_name == 'mlp':
    final_pipeline = create_mlp_pipeline(best_params)

# Entra√Æner sur toutes les donn√©es
final_pipeline.fit(X_train, y_train)

# √âvaluer sur le train
y_pred_proba = final_pipeline.predict_proba(X_train)[:, 1]
y_pred = final_pipeline.predict(X_train)

train_auc = roc_auc_score(y_train, y_pred_proba)
train_f1 = f1_score(y_train, y_pred)
train_recall = recall_score(y_train, y_pred)

print(f"\nModele entraine")
print(f"\nPerformances sur le train:")
print(f"   ROC-AUC: {train_auc:.4f}")
print(f"   F1-Score: {train_f1:.4f}")
print(f"   Recall: {train_recall:.4f}")

# Sauvegarder localement
import joblib
model_filename = f'best_model_{best_model_name}.pkl'
joblib.dump(final_pipeline, model_filename)
print(f"\nModele sauvegarde: {model_filename}")

## üéØ Pr√©dictions sur le Test Set

G√©n√©rer les pr√©dictions pour la soumission Kaggle.

In [None]:
# Pr√©dictions sur le test
test_pred_proba = final_pipeline.predict_proba(X_test)[:, 1]

# Cr√©er le fichier de soumission
submission = pd.DataFrame({
    'SK_ID_CURR': test_ids,
    'TARGET': test_pred_proba
})

submission_filename = f'submission_{best_model_name}_{datetime.now().strftime("%Y%m%d_%H%M%S")}.csv'
submission.to_csv(submission_filename, index=False)

print(f"Predictions generees: {len(submission)} lignes")
print(f"Fichier cree: {submission_filename}")
print(f"\nStatistiques des predictions:")
print(submission['TARGET'].describe())

## üìã R√©sum√© Final

In [None]:
print(f"\n{'='*80}")
print(f"OPTIMISATION TERMINEE")
print(f"{'='*80}\n")

print(f"Configuration:")
print(f"   Metrique d'optimisation: {OPTIMIZATION_METRIC}")
print(f"   Trials par modele: {N_TRIALS}")
print(f"   Validation: {N_SPLITS}-fold StratifiedKFold")
print(f"   Echantillons train: {len(X_train):,}")
print(f"   Features: {X_train.shape[1]}")

print(f"\nResultats par modele:")
for model_name, results in optimization_results.items():
    print(f"   {model_name:10s}: {results['score']:.4f}")

print(f"\nMeilleur modele: {best_model_name.upper()}")
print(f"   Score: {optimization_results[best_model_name]['score']:.4f}")

print(f"\nFichiers generes:")
print(f"   - Modele: {model_filename}")
print(f"   - Soumission: {submission_filename}")

print(f"\nMLflow:")
print(f"   Tracking URI: {mlflow.get_tracking_uri()}")
print(f"   Experience: Advanced Models - Optuna Optimization")
print(f"   Visualisez avec: mlflow ui")

print(f"\n{'='*80}\n")

## üìö Prochaines √âtapes

### üîç Analyse Approfondie
- Analyser les feature importances
- √âtudier les pr√©dictions erron√©es (FP et FN)
- Cr√©er une matrice de confusion d√©taill√©e

### üéØ Am√©lioration
- **Feature Engineering** : Cr√©er de nouvelles features cibl√©es
- **Stacking/Blending** : Combiner les 3 mod√®les
- **Calibration** : Calibrer les probabilit√©s pr√©dites
- **Threshold Optimization** : Trouver le seuil optimal pour minimiser le co√ªt m√©tier

### üöÄ D√©ploiement
- Cr√©er une API FastAPI
- Containeriser avec Docker
- Tests de charge et monitoring
- Validation m√©tier

### üìä MLflow
Pour visualiser toutes vos exp√©riences:
```bash
cd /home/zmxw1768/Documents/oc_mlops
mlflow ui
```
Puis ouvrez: http://localhost:5000