# 🚀 Advanced Modeling - Home Credit Default Risk

This notebook explores advanced models whose hyperparameters are tuned with **Optuna**:
- **XGBoost**: optimized gradient boosting
- **LightGBM**: very fast gradient boosting
- **MLP**: Multi-Layer Perceptron (neural network)

## Strategy

1. **Flexible pipelines**: preprocessing + model
2. **Optuna optimization**: Bayesian hyperparameter search
3. **MLflow tracking**: versioning and comparison
4. **Business metric**: a false negative (FN) costs 10x a false positive (FP)
5. **Cross-validation**: StratifiedKFold
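
To make item 4 concrete, here is a minimal pure-Python sketch of the cost rule (1 per false positive, 10 per false negative). The function name `business_cost` and the toy labels are illustrative assumptions, not part of the notebook's code.

```python
def business_cost(y_true, y_pred, fn_weight=10, fp_weight=1):
    """Total cost of a set of binary predictions (lower is better)."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp_weight * fp + fn_weight * fn


# Toy labels (hypothetical): 4 good payers, 2 defaulters
y_true = [0, 0, 0, 0, 1, 1]
cautious = [1, 1, 0, 0, 1, 1]  # two false positives, no false negative
lenient = [0, 0, 0, 0, 1, 0]   # no false positive, one false negative

cost_cautious = business_cost(y_true, cautious)  # 2 * 1 + 0 * 10
cost_lenient = business_cost(y_true, lenient)    # 0 * 1 + 1 * 10
print(cost_cautious, cost_lenient)
```

Under this rule, a single missed defaulter costs more than several wrongly rejected applicants, so a model is rewarded for recall on the minority class.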

## 📦 Imports and Configuration

In [1]:
# Standard imports
import pandas as pd
import numpy as np
import os
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Sklearn
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    roc_auc_score, recall_score, f1_score,
    make_scorer, confusion_matrix, classification_report
)

# Models
import xgboost as xgb
import lightgbm as lgb
from sklearn.neural_network import MLPClassifier

# Optimization & tracking
import optuna
from optuna.integration.mlflow import MLflowCallback
import mlflow
import mlflow.sklearn
import mlflow.xgboost
import mlflow.lightgbm

print("✅ Imports successful")
print("📦 Versions:")
print(f"   - XGBoost:  {xgb.__version__}")
print(f"   - LightGBM: {lgb.__version__}")
print(f"   - Optuna:   {optuna.__version__}")
print(f"   - MLflow:   {mlflow.__version__}")

✅ Imports successful
📦 Versions:
   - XGBoost:  3.2.0
   - LightGBM: 4.6.0
   - Optuna:   4.7.0
   - MLflow:   2.22.4


## 🗂️ Loading the Data

In [2]:
# Load the engineered features
df = pd.read_parquet('../data/features_engineered.parquet')

# Split into train and test (test rows have no TARGET)
train = df[df['TARGET'].notna()].copy()
test = df[df['TARGET'].isna()].copy()

# Prepare the training data
X_train = train.drop(['TARGET', 'SK_ID_CURR'], axis=1)
y_train = train['TARGET'].astype(int)

# For the final predictions
X_test = test.drop(['TARGET', 'SK_ID_CURR'], axis=1)
test_ids = test['SK_ID_CURR']

print("✅ Data loaded:")
print(f"   Train : {X_train.shape} ({len(y_train):,} samples)")
print(f"   Test  : {X_test.shape}")
print("\n📊 Class distribution:")
print(f"   Class 0: {(y_train == 0).sum():,} ({(y_train == 0).mean()*100:.1f}%)")
print(f"   Class 1: {(y_train == 1).sum():,} ({(y_train == 1).mean()*100:.1f}%)")
print(f"   Ratio: 1:{(y_train == 0).sum() / (y_train == 1).sum():.1f}")

✅ Data loaded:
   Train : (307507, 795) (307,507 samples)
   Test  : (48744, 795)

📊 Class distribution:
   Class 0: 282,682 (91.9%)
   Class 1: 24,825 (8.1%)
   Ratio: 1:11.4
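
The ratio printed above is the same quantity the XGBoost pipeline later passes as `scale_pos_weight` (negatives divided by positives). As a quick check of the arithmetic, using only the two class counts from the output, and with variable names chosen here for illustration:

```python
# Class counts copied from the cell output above
n_negative = 282_682  # class 0
n_positive = 24_825   # class 1

# Number of negatives per positive, i.e. the value used for scale_pos_weight
imbalance_ratio = n_negative / n_positive

# Share of positives in the training set
positive_rate = n_positive / (n_negative + n_positive)

print(f"negatives per positive: {imbalance_ratio:.1f}")
print(f"positive rate: {positive_rate:.1%}")
```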


In [3]:
# Clean up problematic values
print("🧹 Cleaning data...")

# Replace infinities with NaN, then fill every missing value with 0
X_train = X_train.replace([np.inf, -np.inf], np.nan).fillna(0)
X_test = X_test.replace([np.inf, -np.inf], np.nan).fillna(0)

print(f"✅ Data cleaned: {X_train.shape}")

🧹 Cleaning data...
✅ Data cleaned: (307507, 795)


In [4]:
# Train/validation split for quick tests
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, 
    test_size=0.2, 
    stratify=y_train, 
    random_state=42
)

print(f"Train subset: {X_tr.shape}")
print(f"Val subset:   {X_val.shape}")

Train subset: (246005, 795)
Val subset:   (61502, 795)


## ⚙️ MLflow Configuration

In [5]:
# Local MLflow configuration
tracking_uri = os.path.abspath(os.path.join(os.getcwd(), '..', 'mlruns'))
mlflow.set_tracking_uri(f"file://{tracking_uri}")
mlflow.set_experiment("Advanced Models - Optuna Optimization")

print(f"🎯 MLflow Tracking URI: {mlflow.get_tracking_uri()}")
print(f"📁 Storage: {tracking_uri}")
print("🧪 Experiment: Advanced Models - Optuna Optimization")

# Close any active run
if mlflow.active_run():
    mlflow.end_run()

print("✅ MLflow configured")

🎯 MLflow Tracking URI: file:///home/zmxw1768/Documents/oc_mlops/mlruns
📁 Storage: /home/zmxw1768/Documents/oc_mlops/mlruns
🧪 Experiment: Advanced Models - Optuna Optimization
✅ MLflow configured


## 📐 Custom Metrics

In [6]:
def business_cost_scorer(y_true, y_pred):
    """
    Business cost: a false negative (FN) costs 10 times more than a false positive (FP).
    Returns the negated cost because sklearn scorers are maximized.
    """
    # labels=[0, 1] keeps the matrix 2x2 even if a fold contains a single class
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    cost = fp * 1 + fn * 10  # FN costs 10x more
    return -cost  # Negated: a higher score means a lower cost

# Cross-validation configuration
N_SPLITS = 3
RANDOM_STATE = 42

skf = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=RANDOM_STATE)

# Define the scorers
scoring = {
    'roc_auc': make_scorer(roc_auc_score, response_method='predict_proba'),
    'recall_minority': make_scorer(recall_score, pos_label=1, zero_division=0),
    'f1': make_scorer(f1_score, pos_label=1, zero_division=0),
    'business_cost': make_scorer(business_cost_scorer)
}

print("✅ Metrics configured:")
print("   - ROC-AUC")
print("   - Recall (minority class)")
print("   - F1-Score")
print("   - Business cost (FN=10x FP)")

✅ Metrics configured:
   - ROC-AUC
   - Recall (minority class)
   - F1-Score
   - Business cost (FN=10x FP)


## 🏗️ Pipelines for Each Model

Each pipeline includes:
1. **Scaler**: feature scaling (`StandardScaler` for the tree models, `RobustScaler` for the MLP)
2. **Classifier**: classification model
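
To illustrate why a median/IQR scaler is less sensitive to outliers than a mean/standard-deviation scaler, here is a small standard-library sketch. The helpers `standard_scale` and `robust_scale` are illustrative re-implementations written for this example (population standard deviation, quartiles with linear interpolation); they are not the scikit-learn classes themselves.

```python
import statistics


def standard_scale(values):
    """Center on the mean and divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]


feature = [1, 2, 3, 4, 100]  # toy feature with one extreme value
standard = standard_scale(feature)
robust = robust_scale(feature)

print([round(v, 2) for v in standard])
print(robust)
```

With the mean-based scaling, the single extreme value inflates the standard deviation and squeezes the four ordinary values into a narrow band; the median-based scaling keeps them spread out and leaves only the outlier far from zero.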

In [7]:
def create_xgboost_pipeline(params=None):
    """
    XGBoost pipeline with preprocessing

    Args:
        params: Dictionary of hyperparameters (optional)
    """
    default_params = {
        'n_estimators': 100,
        'max_depth': 6,
        'learning_rate': 0.1,
        'subsample': 0.8,
        'colsample_bytree': 0.8,
        'min_child_weight': 1,
        'gamma': 0,
        'reg_alpha': 0,
        'reg_lambda': 1,
        'random_state': RANDOM_STATE,
        'n_jobs': -1,
        'eval_metric': 'logloss'
    }

    if params:
        default_params.update(params)

    # Compute scale_pos_weight (negatives / positives) to handle class imbalance
    scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
    default_params['scale_pos_weight'] = scale_pos_weight

    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', xgb.XGBClassifier(**default_params))
    ])

    return pipeline


def create_lightgbm_pipeline(params=None):
    """
    LightGBM pipeline with preprocessing

    Args:
        params: Dictionary of hyperparameters (optional)
    """
    default_params = {
        'n_estimators': 100,
        'max_depth': -1,
        'learning_rate': 0.1,
        'num_leaves': 31,
        'subsample': 0.8,
        'subsample_freq': 1,  # bagging (subsample) only takes effect when this is > 0
        'colsample_bytree': 0.8,
        'min_child_samples': 20,
        'reg_alpha': 0,
        'reg_lambda': 0,
        'random_state': RANDOM_STATE,
        'n_jobs': -1,
        'verbose': -1
    }

    if params:
        default_params.update(params)

    # is_unbalance handles the class imbalance
    default_params['is_unbalance'] = True

    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', lgb.LGBMClassifier(**default_params))
    ])

    return pipeline


def create_mlp_pipeline(params=None):
    """
    MLP (Multi-Layer Perceptron) pipeline with preprocessing

    Args:
        params: Dictionary of hyperparameters (optional)
    """
    default_params = {
        'hidden_layer_sizes': (100, 50),
        'activation': 'relu',
        'solver': 'adam',
        'alpha': 0.0001,
        'learning_rate_init': 0.001,
        'max_iter': 200,
        'early_stopping': True,
        'validation_fraction': 0.1,
        'random_state': RANDOM_STATE,
        'verbose': False
    }

    if params:
        default_params.update(params)

    pipeline = Pipeline([
        ('scaler', RobustScaler()),  # RobustScaler for the MLP (more robust to outliers)
        ('classifier', MLPClassifier(**default_params))
    ])

    return pipeline


print("✅ Pipeline factory functions defined:")
print("   - create_xgboost_pipeline()")
print("   - create_lightgbm_pipeline()")
print("   - create_mlp_pipeline()")

✅ Pipeline factory functions defined:
   - create_xgboost_pipeline()
   - create_lightgbm_pipeline()
   - create_mlp_pipeline()


## 🎯 Optuna Objective Function

This function serves as the objective for Optuna.
It runs a cross-validation and returns the mean of the metric to optimize.
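
The idea of "stratified folds, then average the fold scores" can be sketched without any library. The code below is a simplified illustration, not scikit-learn's `StratifiedKFold` algorithm: `stratified_folds` and `cross_val_mean` are hypothetical helpers, and `score_fold` stands in for "fit on the training indices, score on the validation indices".

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=3, seed=42):
    """Assign sample indices to folds so each fold keeps similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across the folds
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


def cross_val_mean(labels, score_fold, n_splits=3):
    """Average score_fold(train_idx, val_idx) over the stratified folds."""
    folds = stratified_folds(labels, n_splits)
    scores = []
    for k, val_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        scores.append(score_fold(train_idx, val_idx))
    return sum(scores) / len(scores)


# Toy imbalanced labels: 9 positives among 99 samples
labels = [1] * 9 + [0] * 90
folds = stratified_folds(labels)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]

# Dummy "score": share of positives in the validation fold
mean_rate = cross_val_mean(labels, lambda tr, va: sum(labels[i] for i in va) / len(va))
print(positives_per_fold, round(mean_rate, 4))
```

Each fold receives the same number of positives, which is what keeps a metric such as recall on the minority class comparable from one fold to the next.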

In [8]:
def optuna_objective(trial, model_type='xgboost', metric='roc_auc'):
    """
    Objective function for Optuna

    Args:
        trial: Optuna trial
        model_type: 'xgboost', 'lightgbm' or 'mlp'
        metric: Metric to optimize ('roc_auc', 'f1', 'business_cost', etc.)

    Returns:
        Mean cross-validation score
    """

    # === HYPERPARAMETERS TO OPTIMIZE ===
    
    if model_type == 'xgboost':
        params = {
            'n_estimators': trial.suggest_int('n_estimators', 50, 300, step=50),
            'max_depth': trial.suggest_int('max_depth', 3, 10),
            'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
            'subsample': trial.suggest_float('subsample', 0.6, 1.0),
            'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
            'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
            'gamma': trial.suggest_float('gamma', 0, 5),
            'reg_alpha': trial.suggest_float('reg_alpha', 0, 10),
            'reg_lambda': trial.suggest_float('reg_lambda', 0, 10),
        }
        pipeline = create_xgboost_pipeline(params)
        
    elif model_type == 'lightgbm':
        params = {
            'n_estimators': trial.suggest_int('n_estimators', 50, 300, step=50),
            'max_depth': trial.suggest_int('max_depth', 3, 15),
            'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
            'num_leaves': trial.suggest_int('num_leaves', 20, 150),
            'subsample': trial.suggest_float('subsample', 0.6, 1.0),
            'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
            'min_child_samples': trial.suggest_int('min_child_samples', 5, 50),
            'reg_alpha': trial.suggest_float('reg_alpha', 0, 10),
            'reg_lambda': trial.suggest_float('reg_lambda', 0, 10),
        }
        pipeline = create_lightgbm_pipeline(params)
        
    elif model_type == 'mlp':
        # Configure the hidden layers
        n_layers = trial.suggest_int('n_layers', 1, 3)
        hidden_layers = []
        for i in range(n_layers):
            hidden_layers.append(
                trial.suggest_int(f'n_units_l{i}', 50, 200, step=50)
            )
        
        params = {
            'hidden_layer_sizes': tuple(hidden_layers),
            'activation': trial.suggest_categorical('activation', ['relu', 'tanh']),
            'alpha': trial.suggest_float('alpha', 1e-5, 1e-1, log=True),
            'learning_rate_init': trial.suggest_float('learning_rate_init', 1e-4, 1e-2, log=True),
            'max_iter': 300,  # Raised for the MLP
        }
        pipeline = create_mlp_pipeline(params)
    
    else:
        raise ValueError(f"Unknown model_type: {model_type}")
    
    # === CROSS-VALIDATION ===
    try:
        cv_results = cross_validate(
            pipeline,
            X_train,
            y_train,
            cv=skf,
            scoring=scoring,
            n_jobs=1,  # Important to avoid conflicts with Optuna
            return_train_score=False,
            error_score='raise'
        )

        # Return the mean of the chosen metric
        mean_score = np.mean(cv_results[f'test_{metric}'])

        # business_cost is a negated cost, so a higher value means a lower cost.
        # The study is created with direction='maximize', so the score is returned as is.
        return mean_score

    except Exception as e:
        print(f"⚠️ Error in the trial: {e}")
        # Return a very bad value if the trial fails
        return -np.inf if metric == 'business_cost' else 0.0


print("✅ Optuna objective function defined")

✅ Optuna objective function defined
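
Several search ranges above use `log=True` (for example `learning_rate` between 0.01 and 0.3), which means values are drawn on a logarithmic scale. The sketch below shows what that scale implies for the range alone; `sample_log_uniform` is a hypothetical helper and plain random sampling, not Optuna's TPE sampler.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(42)
draws = [sample_log_uniform(rng, 0.01, 0.3) for _ in range(10_000)]

# Share of draws below 0.05: log(5) / log(30), about 47% on a log scale,
# versus (0.05 - 0.01) / (0.3 - 0.01), about 14%, on a linear scale
share_below_005 = sum(d < 0.05 for d in draws) / len(draws)
expected_share = math.log(0.05 / 0.01) / math.log(0.3 / 0.01)
print(round(share_below_005, 3), round(expected_share, 3))
```

A log scale therefore spends far more of the budget on small learning rates and regularization strengths than a linear range would.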


## 🔬 Optimization with Optuna + MLflow

Function that runs the optimization of one model with MLflow tracking.

In [9]:
def optimize_model(model_type, n_trials=50, metric='roc_auc', timeout=None):
    """
    Optimize a model with Optuna and track it in MLflow

    Args:
        model_type: 'xgboost', 'lightgbm' or 'mlp'
        n_trials: Number of Optuna trials
        metric: Metric to optimize
        timeout: Timeout in seconds (optional)

    Returns:
        best_params: Best hyperparameters
        best_value: Best value of the metric
        study: Optuna Study object
    """

    print(f"\n{'='*80}")
    print(f"🎯 OPTIMIZATION: {model_type.upper()}")
    print(f"{'='*80}")
    print(f"Metric: {metric}")
    print(f"Trials: {n_trials}")
    print(f"CV: {N_SPLITS} folds")

    # Create an Optuna study
    study_name = f"{model_type}_{metric}_{datetime.now().strftime('%Y%m%d_%H%M%S')}"

    # Direction: maximize for roc_auc, f1, recall
    # business_cost is a negated cost, so it is maximized too (less negative = better)
    direction = 'maximize'
    
    study = optuna.create_study(
        study_name=study_name,
        direction=direction,
        sampler=optuna.samplers.TPESampler(seed=RANDOM_STATE)
    )
    
    # MLflow callback that logs each trial
    mlflow_callback = MLflowCallback(
        tracking_uri=mlflow.get_tracking_uri(),
        metric_name=metric,
        create_experiment=False,
        mlflow_kwargs={
            "experiment_id": mlflow.get_experiment_by_name("Advanced Models - Optuna Optimization").experiment_id,
            "nested": True
        }
    )
    
    # Run the optimization inside a parent MLflow run
    with mlflow.start_run(run_name=f"{model_type.upper()} - Optuna {n_trials} trials"):

        # Tags for organization
        mlflow.set_tags({
            "author": "Data Science Team",
            "project": "Home Credit Default Risk",
            "phase": "optimization",
            "model_type": model_type,
            "optimizer": "optuna",
            "framework": model_type if model_type != 'mlp' else 'sklearn',
            "environment": "development"
        })

        mlflow.set_tag("mlflow.note.content", f"""
🎯 HYPERPARAMETER OPTIMIZATION - {model_type.upper()}

📊 Configuration:
- Optimizer: Optuna (TPE Sampler)
- Number of trials: {n_trials}
- Objective metric: {metric}
- Validation: StratifiedKFold ({N_SPLITS} folds)
- Samples: {len(X_train):,}

💰 Business metric:
- An FN costs 10x more than an FP
- Goal: minimize the total cost

📅 Date: {datetime.now().strftime('%Y-%m-%d %H:%M')}
        """)

        # Log the configuration parameters
        mlflow.log_param("model_type", model_type)
        mlflow.log_param("n_trials", n_trials)
        mlflow.log_param("metric", metric)
        mlflow.log_param("cv_folds", N_SPLITS)
        mlflow.log_param("n_samples", len(X_train))
        mlflow.log_param("n_features", X_train.shape[1])

        # Optimize
        objective_fn = lambda trial: optuna_objective(trial, model_type, metric)

        study.optimize(
            objective_fn,
            n_trials=n_trials,
            timeout=timeout,
            callbacks=[mlflow_callback],
            show_progress_bar=True
        )

        # Results
        best_params = study.best_params
        best_value = study.best_value

        print(f"\n{'='*80}")
        print("✅ OPTIMIZATION COMPLETE")
        print(f"{'='*80}")
        print(f"Best {metric}: {best_value:.4f}")
        print("\nBest hyperparameters:")
        for param, value in best_params.items():
            print(f"   {param:25s}: {value}")

        # Log the best results
        mlflow.log_metric(f"best_{metric}", best_value)
        for param, value in best_params.items():
            mlflow.log_param(f"best_{param}", value)

        # Train the final model with the best parameters
        print("\n📦 Training the final model with the best parameters...")

        if model_type == 'xgboost':
            best_pipeline = create_xgboost_pipeline(best_params)
        elif model_type == 'lightgbm':
            best_pipeline = create_lightgbm_pipeline(best_params)
        elif model_type == 'mlp':
            # best_params holds n_layers / n_units_l{i}, which MLPClassifier does
            # not accept: rebuild hidden_layer_sizes and restore max_iter=300
            mlp_params = {
                k: v for k, v in best_params.items()
                if k != 'n_layers' and not k.startswith('n_units_l')
            }
            mlp_params['hidden_layer_sizes'] = tuple(
                best_params[f'n_units_l{i}'] for i in range(best_params['n_layers'])
            )
            mlp_params['max_iter'] = 300
            best_pipeline = create_mlp_pipeline(mlp_params)
        
        # Final cross-validation
        final_cv = cross_validate(
            best_pipeline,
            X_train,
            y_train,
            cv=skf,
            scoring=scoring,
            n_jobs=1,
            return_train_score=False
        )

        # Log all the final metrics
        for metric_name in scoring.keys():
            scores = final_cv[f'test_{metric_name}']
            mean_val = np.mean(scores)
            std_val = np.std(scores)

            mlflow.log_metric(f"{metric_name}_mean", mean_val)
            mlflow.log_metric(f"{metric_name}_std", std_val)

            print(f"   {metric_name:20s}: {mean_val:.4f} (±{std_val:.4f})")
        
        # Train on all the data
        best_pipeline.fit(X_train, y_train)

        # Log the whole pipeline (scaler + classifier) so the stored model accepts
        # raw features; logging only the classifier would drop the fitted scaler
        input_example = X_train.head(3)
        signature = mlflow.models.signature.infer_signature(
            input_example,
            best_pipeline.predict_proba(input_example)
        )
        mlflow.sklearn.log_model(
            best_pipeline,
            "model",
            signature=signature,
            input_example=input_example,
            pyfunc_predict_fn="predict_proba"
        )

        # Save the Optuna study
        import joblib
        study_path = f"optuna_study_{model_type}.pkl"
        joblib.dump(study, study_path)
        mlflow.log_artifact(study_path)
        os.remove(study_path)

        print("✅ Model and study saved to MLflow")

    return best_params, best_value, study


print("✅ optimize_model() function defined")

✅ optimize_model() function defined


## 🚀 Running the Optimizations

**⚠️ IMPORTANT:** Adjust `n_trials` to your resources:
- **Quick**: 20-30 trials
- **Normal**: 50-100 trials
- **Full**: 100-200 trials

Run time depends heavily on the model and the hardware. In the XGBoost log below, the first trials take roughly 2 to 4 minutes each (3-fold CV on 307,507 rows and 795 features), so 30 trials take well over an hour.

You can also use `timeout` (in seconds) to cap the total time.
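
One way to budget `n_trials` or `timeout` is to measure a few trials first. The sketch below takes the four timestamps visible in the XGBoost log further down (study creation and the end of trials 0 to 2) and extrapolates linearly to 30 trials; the linear extrapolation is an assumption, since trial duration varies with the sampled hyperparameters.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S,%f"

# Timestamps copied from the XGBoost optimization log
timestamps = [
    "2026-02-10 16:13:23,814",  # study created
    "2026-02-10 16:17:38,418",  # trial 0 finished
    "2026-02-10 16:19:33,289",  # trial 1 finished
    "2026-02-10 16:21:30,931",  # trial 2 finished
]

times = [datetime.strptime(ts, FMT) for ts in timestamps]

# Duration of each trial = gap between consecutive log entries
durations = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
mean_seconds = sum(durations) / len(durations)

n_trials = 30
estimated_minutes = n_trials * mean_seconds / 60

print([round(d) for d in durations])
print(f"about {estimated_minutes:.0f} min for {n_trials} trials")
```

The same per-trial figure can be turned around: dividing a time budget in seconds by `mean_seconds` gives a rough number of trials that a given `timeout` would allow.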

In [10]:
# Optimization configuration
N_TRIALS = 30  # Adjust to the time you have available
OPTIMIZATION_METRIC = 'roc_auc'  # or 'f1', 'recall_minority', 'business_cost'

# Store the results
optimization_results = {}

print("🎯 Configuration:")
print(f"   Trials per model: {N_TRIALS}")
print(f"   Metric: {OPTIMIZATION_METRIC}")
# Rough estimate assuming 2 to 4 minutes per trial, as in the XGBoost log below
print(f"   Estimated time: {N_TRIALS * 2}-{N_TRIALS * 4} min per model")

🎯 Configuration:
   Trials per model: 30
   Metric: roc_auc
   Estimated time: 60-120 min per model


### 🔷 XGBoost

In [None]:
# Optimize XGBoost
xgb_params, xgb_score, xgb_study = optimize_model(
    model_type='xgboost',
    n_trials=N_TRIALS,
    metric=OPTIMIZATION_METRIC
)

optimization_results['xgboost'] = {
    'params': xgb_params,
    'score': xgb_score,
    'study': xgb_study
}

[I 2026-02-10 16:13:23,814] A new study created in memory with name: xgboost_roc_auc_20260210_161323



🎯 OPTIMIZATION: XGBOOST
Metric: roc_auc
Trials: 30
CV: 3 folds


  0%|          | 0/30 [00:00<?, ?it/s]

[I 2026-02-10 16:17:38,418] Trial 0 finished with value: 0.7657867335777139 and parameters: {'n_estimators': 150, 'max_depth': 10, 'learning_rate': 0.1205712628744377, 'subsample': 0.8394633936788146, 'colsample_bytree': 0.6624074561769746, 'min_child_weight': 2, 'gamma': 0.2904180608409973, 'reg_alpha': 8.661761457749352, 'reg_lambda': 6.011150117432088}. Best is trial 0 with value: 0.7657867335777139.
[I 2026-02-10 16:19:33,289] Trial 1 finished with value: 0.7820952300419332 and parameters: {'n_estimators': 250, 'max_depth': 3, 'learning_rate': 0.2708160864249968, 'subsample': 0.9329770563201687, 'colsample_bytree': 0.6849356442713105, 'min_child_weight': 2, 'gamma': 0.9170225492671691, 'reg_alpha': 3.0424224295953772, 'reg_lambda': 5.247564316322379}. Best is trial 1 with value: 0.7820952300419332.
[I 2026-02-10 16:21:30,931] Trial 2 finished with value: 0.7799954538984601 and parameters: {'n_estimators': 150, 'max_depth': 5, 'learning_rate': 0.08

### 🔶 LightGBM

In [None]:
# Optimize LightGBM
lgb_params, lgb_score, lgb_study = optimize_model(
    model_type='lightgbm',
    n_trials=N_TRIALS,
    metric=OPTIMIZATION_METRIC
)

optimization_results['lightgbm'] = {
    'params': lgb_params,
    'score': lgb_score,
    'study': lgb_study
}

### 🧠 MLP (Multi-Layer Perceptron)

In [None]:
# Optimize the MLP
mlp_params, mlp_score, mlp_study = optimize_model(
    model_type='mlp',
    n_trials=N_TRIALS,
    metric=OPTIMIZATION_METRIC
)

optimization_results['mlp'] = {
    'params': mlp_params,
    'score': mlp_score,
    'study': mlp_study
}

## 📊 Comparing the Results

In [None]:
# Build a comparison table
comparison_df = pd.DataFrame({
    'Model': list(optimization_results.keys()),
    f'Best {OPTIMIZATION_METRIC}': [
        results['score'] for results in optimization_results.values()
    ]
}).sort_values(f'Best {OPTIMIZATION_METRIC}', ascending=False)

print(f"\n{'='*80}")
print("📈 MODEL COMPARISON")
print(f"{'='*80}\n")
print(comparison_df.to_string(index=False))
print(f"\n{'='*80}")
print(f"🏆 BEST MODEL: {comparison_df.iloc[0]['Model'].upper()}")
print(f"   {OPTIMIZATION_METRIC}: {comparison_df.iloc[0][f'Best {OPTIMIZATION_METRIC}']:.4f}")
print(f"{'='*80}\n")

## 📈 Visualizing the Optuna Studies

In [None]:
# optuna.visualization returns Plotly figures
from optuna.visualization import (
    plot_optimization_history,
    plot_param_importances,
    plot_slice
)

# Create the visualizations for each model
for model_name, results in optimization_results.items():
    study = results['study']

    print(f"\n{'='*80}")
    print(f"📊 VISUALIZATIONS: {model_name.upper()}")
    print(f"{'='*80}\n")

    # 1. Optimization history
    fig = plot_optimization_history(study)
    fig.update_layout(title=f"{model_name.upper()} - Optimization History")
    fig.show()

    # 2. Hyperparameter importances
    try:
        fig = plot_param_importances(study)
        fig.update_layout(title=f"{model_name.upper()} - Hyperparameter Importances")
        fig.show()
    except Exception as e:
        print(f"⚠️ Could not generate plot_param_importances: {e}")

    # 3. Hyperparameter distributions
    try:
        fig = plot_slice(study)
        fig.update_layout(title=f"{model_name.upper()} - Hyperparameter Slices")
        fig.show()
    except Exception as e:
        print(f"⚠️ Could not generate plot_slice: {e}")

## 💾 Saving the Best Model

Train the best model on all the data and save it.

In [None]:
# Identify the best model
best_model_name = comparison_df.iloc[0]['Model']
best_params = optimization_results[best_model_name]['params']

print(f"🏆 Training the best model: {best_model_name.upper()}")

# Build the pipeline with the best parameters
if best_model_name == 'xgboost':
    final_pipeline = create_xgboost_pipeline(best_params)
elif best_model_name == 'lightgbm':
    final_pipeline = create_lightgbm_pipeline(best_params)
elif best_model_name == 'mlp':
    # Rebuild hidden_layer_sizes from the Optuna parameters n_layers / n_units_l{i}
    mlp_params = {
        k: v for k, v in best_params.items()
        if k != 'n_layers' and not k.startswith('n_units_l')
    }
    mlp_params['hidden_layer_sizes'] = tuple(
        best_params[f'n_units_l{i}'] for i in range(best_params['n_layers'])
    )
    mlp_params['max_iter'] = 300  # Same value as during the search
    final_pipeline = create_mlp_pipeline(mlp_params)

# Train on all the data
final_pipeline.fit(X_train, y_train)

# Evaluate on the training set (optimistic: these rows were seen during fitting)
y_pred_proba = final_pipeline.predict_proba(X_train)[:, 1]
y_pred = final_pipeline.predict(X_train)

train_auc = roc_auc_score(y_train, y_pred_proba)
train_f1 = f1_score(y_train, y_pred)
train_recall = recall_score(y_train, y_pred)

print("\n✅ Model trained")
print("\n📊 Performance on the training set:")
print(f"   ROC-AUC: {train_auc:.4f}")
print(f"   F1-Score: {train_f1:.4f}")
print(f"   Recall: {train_recall:.4f}")

# Save locally
import joblib
model_filename = f'best_model_{best_model_name}.pkl'
joblib.dump(final_pipeline, model_filename)
print(f"\n💾 Model saved: {model_filename}")

## 🎯 Predictions on the Test Set

Generate the predictions for the Kaggle submission.

In [None]:
# Predictions on the test set
test_pred_proba = final_pipeline.predict_proba(X_test)[:, 1]

# Create the submission file
submission = pd.DataFrame({
    'SK_ID_CURR': test_ids,
    'TARGET': test_pred_proba
})

submission_filename = f'submission_{best_model_name}_{datetime.now().strftime("%Y%m%d_%H%M%S")}.csv'
submission.to_csv(submission_filename, index=False)

print(f"✅ Predictions generated: {len(submission)} rows")
print(f"💾 File created: {submission_filename}")
print("\n📊 Prediction statistics:")
print(submission['TARGET'].describe())

## 📋 Final Summary

In [None]:
print(f"\n{'='*80}")
print("✅ OPTIMIZATION COMPLETE")
print(f"{'='*80}\n")

print("🎯 Configuration:")
print(f"   Optimization metric: {OPTIMIZATION_METRIC}")
print(f"   Trials per model: {N_TRIALS}")
print(f"   Validation: {N_SPLITS}-fold StratifiedKFold")
print(f"   Training samples: {len(X_train):,}")
print(f"   Features: {X_train.shape[1]}")

print("\n📊 Results per model:")
for model_name, results in optimization_results.items():
    print(f"   {model_name:10s}: {results['score']:.4f}")

print(f"\n🏆 Best model: {best_model_name.upper()}")
print(f"   Score: {optimization_results[best_model_name]['score']:.4f}")

print("\n💾 Generated files:")
print(f"   - Model: {model_filename}")
print(f"   - Submission: {submission_filename}")

print("\n📁 MLflow:")
print(f"   Tracking URI: {mlflow.get_tracking_uri()}")
print("   Experiment: Advanced Models - Optuna Optimization")
print("   View with: mlflow ui")

print(f"\n{'='*80}\n")

## 📚 Next Steps

### 🔍 In-Depth Analysis
- Analyze the feature importances
- Study the misclassified cases (FP and FN)
- Build a detailed confusion matrix

### 🎯 Improvement
- **Feature Engineering**: create new targeted features
- **Stacking/Blending**: combine the 3 models
- **Calibration**: calibrate the predicted probabilities
- **Threshold Optimization**: find the decision threshold that minimizes the business cost
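
As a rough illustration of the threshold optimization item, the sketch below scans candidate thresholds and keeps the one with the lowest business cost (1 per FP, 10 per FN). The helper names, the toy scores, and the candidate grid are assumptions made for this example; in practice the scan would run on out-of-fold or validation predictions rather than on training predictions.

```python
def cost_at_threshold(y_true, scores, threshold, fn_weight=10, fp_weight=1):
    """Business cost when every score >= threshold is predicted as a default."""
    fp = sum(1 for t, s in zip(y_true, scores) if t == 0 and s >= threshold)
    fn = sum(1 for t, s in zip(y_true, scores) if t == 1 and s < threshold)
    return fp_weight * fp + fn_weight * fn


def best_threshold(y_true, scores, candidates):
    """Return the candidate threshold with the lowest business cost."""
    return min(candidates, key=lambda th: cost_at_threshold(y_true, scores, th))


# Toy validation data (hypothetical): 6 good payers, 2 defaulters
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.35, 0.80]

candidates = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
threshold = best_threshold(y_true, scores, candidates)

cost_default = cost_at_threshold(y_true, scores, 0.5)
cost_best = cost_at_threshold(y_true, scores, threshold)
print(threshold, cost_default, cost_best)
```

Because a false negative is ten times as expensive as a false positive, the cost-minimizing threshold on this toy data sits below the usual 0.5: it accepts two extra false positives to avoid missing a defaulter.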

### 🚀 Deployment
- Build a FastAPI API
- Containerize with Docker
- Load testing and monitoring
- Business validation

### 📊 MLflow
To view all your experiments:
```bash
cd /home/zmxw1768/Documents/oc_mlops
mlflow ui
```
Then open: http://localhost:5000