# Fine-tuning - Pr√©diction de gravit√© des ACCIDENTS

## Contexte

Ce notebook fait suite au notebook **3-a-accident-initial-ml-training** qui a √©tabli :
- La baseline de performance avec RandomForest et XGBoost
- La comparaison Binary vs Ordered (choix du target binaire)
- L'analyse des feature importances

## Objectif

Optimiser les hyperparam√®tres pour am√©liorer les performances au-del√† de la baseline.

## Workflow

```
1. Configuration et imports
2. Chargement des donn√©es
3. GridSearchCV (recherche des meilleurs hyperparam√®tres)
4. Impact du nombre d'arbres (n_estimators)
5. Comparaison des algorithmes (RF, XGBoost, LightGBM, CatBoost)
6. S√©lection finale et sauvegarde
```

## Crit√®res de s√©lection

| Crit√®re | M√©trique | Justification |
|---------|----------|---------------|
| **Principal** | F1-Score | √âquilibre pr√©cision/rappel |
| **Secondaire** | AUC | D√©partage en cas d'√©galit√© |

---

## 1. Configuration et imports

In [1]:
# === IMPORTS ===
import os
import pandas as pd
import numpy as np
import joblib
import warnings

from ml_config import nb_workers, base_estimators, max_evals
from functions import (
    display_metrics, optimize_boosting_model, plot_optimization_history,
    select_best_model, save_best_model
)

# Visualisation
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

# Machine Learning
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, matthews_corrcoef,
    classification_report, confusion_matrix, roc_auc_score, roc_curve, auc
)
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials

# === CONFIGURATION ===
# Seed pour reproductibilit√© : garantit les m√™mes r√©sultats √† chaque ex√©cution
RANDOM_STATE = 42

warnings.filterwarnings("ignore", message="pkg_resources is deprecated")
print("Imports OK")

  import pkg_resources


Imports OK


---

## 2. Chargement des donn√©es et fonctions utilitaires

In [2]:
# Fonction de chargement (supporte Parquet et CSV)
def load_dataset(name, base_path='data/output'):
    """
    Charge un dataset depuis Parquet (prioritaire) ou CSV (fallback).
    Parquet est pr√©f√©r√© car plus rapide et compact.
    """
    parquet_path = f'{base_path}/{name}.parquet'
    csv_path = f'{base_path}/{name}.csv'
    
    if os.path.exists(parquet_path):
        df = pd.read_parquet(parquet_path)
        print(f'{name}: charg√© depuis Parquet ({df.shape[0]:,} lignes, {df.shape[1]} colonnes)')
    elif os.path.exists(csv_path):
        df = pd.read_csv(csv_path, sep=';', decimal=',', encoding='utf-8-sig')
        print(f'{name}: charg√© depuis CSV ({df.shape[0]:,} lignes, {df.shape[1]} colonnes)')
    else:
        raise FileNotFoundError(f'Dataset {name} non trouv√©')
    return df

# Chargement du dataset ACCIDENT
df_accident = load_dataset('dataset_accident')

# Aper√ßu des donn√©es
print(f"\nüìä Colonnes disponibles ({len(df_accident.columns)}):")
print(df_accident.columns.tolist())

dataset_accident: charg√© depuis Parquet (263,356 lignes, 13 colonnes)

üìä Colonnes disponibles (13):
['grav_ordered', 'grav_binary', 'est_nuit', 'est_heure_pointe', 'jour_semaine', 'est_weekend', 'region', 'dep', 'agg', 'vma', 'impl_vehicule_leger', 'impl_poids_lourd', 'impl_pieton']


In [3]:
# === FONCTIONS DE PR√âPARATION DES DONN√âES ===

def prepare_data(df, target_col, exclude_cols=None):
    """
    Pr√©pare les features (X) et la target (y) pour l'entra√Ænement.
    
    - Exclut automatiquement toutes les colonnes de gravit√© (pour √©viter le data leakage)
    - Encode les variables cat√©gorielles (region, dep) en entiers
    
    Returns: X (DataFrame), y (Series)
    """
    if exclude_cols is None:
        exclude_cols = []
    
    # Colonnes de gravit√© √† exclure (√©vite le data leakage)
    target_cols = ['grav', 'grav_ordered', 'grav_binary']
    cols_to_drop = target_cols + exclude_cols
    cols_to_drop = [c for c in cols_to_drop if c in df.columns]
    
    X = df.drop(columns=cols_to_drop)
    y = df[target_col]
    
    # Encodage des variables cat√©gorielles
    for col in X.select_dtypes(include=['object', 'str']).columns:
        le = LabelEncoder()
        X[col] = le.fit_transform(X[col].astype(str))
    
    return X, y


# === FONCTIONS DE MOD√âLISATION ===

def create_pipeline(model):
    """
    Cr√©e un pipeline sklearn : Imputation ‚Üí Scaling ‚Üí Mod√®le.
    Permet d'encapsuler tout le preprocessing avec le mod√®le.
    """
    return Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
        ('model', model)
    ])


def evaluate_model(model, X_train, X_test, y_train, y_test, model_name):
    """
    Entra√Æne un mod√®le et calcule les m√©triques d'√©valuation.
    
    Returns:
        - results: dict avec accuracy, precision, recall, f1, auc
        - y_pred: pr√©dictions sur le test set
        - y_proba: probabilit√©s (pour ROC curve)
    """
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    # Probabilit√©s pour la courbe ROC
    y_proba = None
    if hasattr(model, 'predict_proba'):
        y_proba = model.predict_proba(X_test)
    
    # Calcul des m√©triques
    results = {
        'model': model_name,
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred, average='weighted', zero_division=0),
        'recall': recall_score(y_test, y_pred, average='weighted', zero_division=0),
        'f1': f1_score(y_test, y_pred, average='weighted', zero_division=0),
    }
    
    # AUC uniquement pour la classification binaire
    if len(np.unique(y_test)) == 2 and y_proba is not None:
        results['auc'] = roc_auc_score(y_test, y_proba[:, 1])
    
    return results, y_pred, y_proba


def evaluate_catboost(catboost_model, X_train, X_test, y_train, y_test, model_name):
    """
    √âvalue CatBoost SANS Pipeline sklearn.
    
    Note: CatBoost est incompatible avec sklearn Pipeline depuis sklearn 1.6+
    (probl√®me de s√©rialisation). On applique donc le preprocessing manuellement.
    """
    # Preprocessing manuel
    imputer = SimpleImputer(strategy='median')
    X_train_imp = imputer.fit_transform(X_train)
    X_test_imp = imputer.transform(X_test)
    
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train_imp)
    X_test_scaled = scaler.transform(X_test_imp)
    
    # Entra√Ænement et pr√©diction
    catboost_model.fit(X_train_scaled, y_train)
    y_pred = catboost_model.predict(X_test_scaled)
    
    y_proba = None
    if hasattr(catboost_model, 'predict_proba'):
        y_proba = catboost_model.predict_proba(X_test_scaled)
    
    # M√©triques
    results = {
        'model': model_name,
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred, average='weighted', zero_division=0),
        'recall': recall_score(y_test, y_pred, average='weighted', zero_division=0),
        'f1': f1_score(y_test, y_pred, average='weighted', zero_division=0),
    }
    
    if len(np.unique(y_test)) == 2 and y_proba is not None:
        results['auc'] = roc_auc_score(y_test, y_proba[:, 1])
    
    return results, y_pred, y_proba


# === FONCTIONS DE VISUALISATION ===

def plot_confusion_matrix(y_true, y_pred, model_name, labels=None):
    """Affiche une matrice de confusion interactive avec Plotly."""
    cm = confusion_matrix(y_true, y_pred)
    
    if labels is None:
        labels = [str(i) for i in sorted(np.unique(y_true))]
    
    # Pourcentages par ligne (= par vraie classe)
    cm_percent = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] * 100
    
    # Annotations : valeur absolue + pourcentage
    text_annotations = []
    for i in range(len(cm)):
        row = []
        for j in range(len(cm[0])):
            row.append(f"{cm[i][j]}<br>({cm_percent[i][j]:.1f}%)")
        text_annotations.append(row)
    
    fig = go.Figure(data=go.Heatmap(
        z=cm, x=labels, y=labels,
        text=text_annotations, texttemplate="%{text}",
        colorscale='Blues', showscale=True
    ))
    
    fig.update_layout(
        title=f'Matrice de Confusion - {model_name}',
        xaxis_title='Pr√©diction', yaxis_title='R√©alit√©',
        width=500, height=450
    )
    fig.show()
    return cm


def plot_roc_curve(y_true, y_proba, model_name):
    """Affiche la courbe ROC avec l'AUC (classification binaire uniquement)."""
    if len(np.unique(y_true)) != 2:
        print(f"ROC non disponible pour {model_name}: pas binaire")
        return None
    
    if y_proba is None:
        print(f"ROC non disponible pour {model_name}: pas de probabilit√©s")
        return None
    
    y_score = y_proba[:, 1] if len(y_proba.shape) > 1 else y_proba
    fpr, tpr, _ = roc_curve(y_true, y_score)
    roc_auc = auc(fpr, tpr)
    
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=fpr, y=tpr, mode='lines',
                             name=f'{model_name} (AUC = {roc_auc:.3f})',
                             line=dict(color='blue', width=2)))
    fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1], mode='lines',
                             name='Random (AUC = 0.5)',
                             line=dict(color='gray', width=1, dash='dash')))
    
    fig.update_layout(
        title=f'Courbe ROC - {model_name} (AUC = {roc_auc:.3f})',
        xaxis_title='Taux de Faux Positifs (FPR)',
        yaxis_title='Taux de Vrais Positifs (TPR)',
        width=550, height=450, showlegend=True
    )
    fig.show()
    return roc_auc


print("‚úÖ Fonctions d√©finies")

‚úÖ Fonctions d√©finies


## 3. Hyperopt - Optimisation bay√©sienne des hyperparam√®tres

### Pourquoi Hyperopt plut√¥t que GridSearchCV ?

| Aspect | GridSearchCV | Hyperopt (TPE) |
|--------|--------------|----------------|
| **M√©thode** | Exhaustive (teste tout) | Bay√©sienne (apprend des essais) |
| **Efficacit√©** | O(n^k) combinaisons | Converge plus vite |
| **Espace** | Grille discr√®te | Continu + discret |
| **Intelligence** | Aucune | M√©morise les bons/mauvais essais |

### Algorithme TPE (Tree-structured Parzen Estimator)

1. √âvalue quelques points al√©atoires
2. S√©pare les r√©sultats en "bons" et "mauvais"
3. Mod√©lise P(params | bons) et P(params | mauvais)
4. Choisit le prochain point qui maximise le ratio

In [4]:
# === HYPEROPT - OPTIMISATION BAY√âSIENNE ===

# Pr√©paration des donn√©es
X_acc_bin, y_acc_bin = prepare_data(df_accident, 'grav_binary')
X_train_acc_bin, X_test_acc_bin, y_train_acc_bin, y_test_acc_bin = train_test_split(
    X_acc_bin, y_acc_bin, test_size=0.2, random_state=RANDOM_STATE, stratify=y_acc_bin
)

# Sous-√©chantillonnage pour acc√©l√©rer (50k)
sample_size_acc = min(50000, len(X_train_acc_bin))
X_sample_acc = X_train_acc_bin.sample(n=sample_size_acc, random_state=RANDOM_STATE)
y_sample_acc = y_train_acc_bin.loc[X_sample_acc.index]

print(f"Hyperopt sur {sample_size_acc:,} echantillons")
print("=" * 50)

# === RANDOMFOREST ===
best_params_rf, best_f1_rf, trials_rf = optimize_boosting_model(
    X_sample_acc, y_sample_acc,
    model_type='randomforest',
    max_evals=max_evals,
    cv=3,
    scoring='f1',
    random_state=RANDOM_STATE,
    n_jobs=nb_workers
)

# Validation sur test set complet
print("\nValidation sur test set complet...")
best_model_rf = create_pipeline(RandomForestClassifier(
    **best_params_rf,
    random_state=RANDOM_STATE,
    n_jobs=nb_workers,
    class_weight='balanced'
))
best_model_rf.fit(X_train_acc_bin, y_train_acc_bin)
y_pred_rf = best_model_rf.predict(X_test_acc_bin)
f1_rf_test = f1_score(y_test_acc_bin, y_pred_rf)
print(f"F1 sur test set: {f1_rf_test:.3f}")


Hyperopt sur 50,000 echantillons
Optimisation Hyperopt pour RANDOMFOREST...
  - max_evals: 50
  - cv: 3 folds
  - scoring: f1

100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 50/50 [47:36<00:00, 57.12s/trial, best loss: -0.642990309783856] 

Meilleurs parametres RANDOMFOREST:
  - max_depth: 1
  - max_features: 2
  - min_samples_leaf: 10
  - min_samples_split: 11
  - n_estimators: 350

Meilleur f1 (CV): 0.6430 (+/- 0.0009)

Validation sur test set complet...
F1 sur test set: 0.527


### 3.2 Hyperopt - CatBoost

CatBoost utilise des hyperparametres differents de RandomForest :
- **iterations** (n_estimators) : nombre d'arbres (attention a l'overfitting)
- **learning_rate** : pas d'apprentissage (plus petit = plus lent mais plus precis)
- **depth** : profondeur max (typiquement 4-10 pour boosting)
- **l2_leaf_reg** : regularisation L2

In [5]:
# === HYPEROPT - CATBOOST ===

best_params_cb, best_f1_cb, trials_cb = optimize_boosting_model(
    X_sample_acc, y_sample_acc,
    model_type='catboost',
    max_evals=max_evals,
    cv=3,
    scoring='f1',
    random_state=RANDOM_STATE,
    n_jobs=nb_workers
)

# Validation sur test set complet
print("\nValidation sur test set complet...")
imputer = SimpleImputer(strategy='median')
X_train_imp = pd.DataFrame(
    imputer.fit_transform(X_train_acc_bin),
    columns=X_train_acc_bin.columns
)
X_test_imp = pd.DataFrame(
    imputer.transform(X_test_acc_bin),
    columns=X_test_acc_bin.columns
)

best_model_cb = CatBoostClassifier(
    **best_params_cb,
    random_state=RANDOM_STATE,
    auto_class_weights='Balanced',
    verbose=False
)
best_model_cb.fit(X_train_imp, y_train_acc_bin)
y_pred_cb = best_model_cb.predict(X_test_imp)
f1_cb_test = f1_score(y_test_acc_bin, y_pred_cb)
print(f"F1 sur test set: {f1_cb_test:.3f}")


Optimisation Hyperopt pour CATBOOST...
  - max_evals: 50
  - cv: 3 folds
  - scoring: f1

100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 50/50 [00:00<00:00, 301.95trial/s, best loss: 0.0]

Meilleurs parametres CATBOOST:
  - depth: 8
  - iterations: 450
  - l2_leaf_reg: 4.169836222804728
  - learning_rate: 0.15917237508778656

Meilleur f1 (CV): 0.0000 (+/- 0.0000)

Validation sur test set complet...
F1 sur test set: 0.655


### 3.3 Hyperopt - XGBoost

XGBoost (eXtreme Gradient Boosting) utilise des hyperparam√®tres similaires √† CatBoost :
- **n_estimators** : nombre d'arbres
- **learning_rate** : taux d'apprentissage
- **max_depth** : profondeur max des arbres
- **min_child_weight** : poids minimum par feuille (r√©gularisation)
- **subsample** / **colsample_bytree** : √©chantillonnage pour r√©duire l'overfitting

In [6]:
best_params_xgb, best_f1_xgb, trials_xgb = optimize_boosting_model(
    X_sample_acc, y_sample_acc,
    model_type='xgboost',
    max_evals=max_evals,
    cv=3,
    scoring='f1',
    random_state=RANDOM_STATE,
    n_jobs=nb_workers
)

# Validation sur test set complet
print("\nValidation sur test set complet...")
imputer = SimpleImputer(strategy='median')
X_train_imp = pd.DataFrame(
    imputer.fit_transform(X_train_acc_bin),
    columns=X_train_acc_bin.columns
)
X_test_imp = pd.DataFrame(
    imputer.transform(X_test_acc_bin),
    columns=X_test_acc_bin.columns
)

best_model_xgb = XGBClassifier(
    **best_params_xgb,
    random_state=RANDOM_STATE,
    scale_pos_weight=18 / 10,
    verbosity=0,
)
best_model_xgb.fit(X_train_imp, y_train_acc_bin)
y_pred_xgb = best_model_xgb.predict(X_test_imp)
f1_xgb_test = f1_score(y_test_acc_bin, y_pred_xgb)
print(f"F1 sur test set: {f1_xgb_test:.3f}")

Optimisation Hyperopt pour XGBOOST...
  - max_evals: 50
  - cv: 3 folds
  - scoring: f1

100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 50/50 [02:41<00:00,  3.22s/trial, best loss: -0.6513562417010096]

Meilleurs parametres XGBOOST:
  - colsample_bytree: 0.6011993082667989
  - learning_rate: 0.09687815138534953
  - max_depth: 6
  - min_child_weight: 8
  - n_estimators: 250
  - subsample: 0.9421080655965379

Meilleur f1 (CV): 0.6514 (+/- 0.0026)

Validation sur test set complet...
F1 sur test set: 0.655


### 3.4 Hyperopt - LightGBM

LightGBM (Light Gradient Boosting Machine) est optimis√© pour la vitesse :
- **n_estimators** : nombre d'arbres
- **learning_rate** : taux d'apprentissage
- **max_depth** : profondeur max (peut √™tre -1 pour illimit√©)
- **num_leaves** : nombre max de feuilles par arbre (sp√©cifique √† LightGBM)
- **min_child_samples** : √©chantillons minimum par feuille

In [7]:
best_params_lgbm, best_f1_lgbm, trials_lgbm = optimize_boosting_model(
    X_sample_acc, y_sample_acc,
    model_type='lightgbm',
    max_evals=max_evals,
    cv=3,
    scoring='f1',
    random_state=RANDOM_STATE,
    n_jobs=nb_workers
)

# Validation sur test set complet
print("\nValidation sur test set complet...")
imputer = SimpleImputer(strategy='median')
X_train_imp = pd.DataFrame(
    imputer.fit_transform(X_train_acc_bin),
    columns=X_train_acc_bin.columns
)
X_test_imp = pd.DataFrame(
    imputer.transform(X_test_acc_bin),
    columns=X_test_acc_bin.columns
)

best_model_lgbm = LGBMClassifier(
    **best_params_lgbm,
    random_state=RANDOM_STATE,
    class_weights='balanced',
    verbose=-1
)
best_model_lgbm.fit(X_train_imp, y_train_acc_bin)
y_pred_lgbm = best_model_lgbm.predict(X_test_imp)
f1_lgbm_test = f1_score(y_test_acc_bin, y_pred_xgb)
print(f"F1 sur test set: {f1_lgbm_test:.3f}")

Optimisation Hyperopt pour LIGHTGBM...
  - max_evals: 50
  - cv: 3 folds
  - scoring: f1

100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 50/50 [02:03<00:00,  2.47s/trial, best loss: -0.6014562598198895]

Meilleurs parametres LIGHTGBM:
  - colsample_bytree: 0.7458197154615357
  - learning_rate: 0.1095092638225987
  - max_depth: 8
  - min_child_samples: 40
  - n_estimators: 250
  - num_leaves: 50
  - subsample: 0.8576876180176862

Meilleur f1 (CV): 0.6015 (+/- 0.0004)

Validation sur test set complet...
F1 sur test set: 0.655


### 3.5 Convergence de l'optimisation

Visualisation de la progression d'Hyperopt pour chaque algorithme.

**Interpr√©tation :**
- Les points repr√©sentent chaque essai (combinaison d'hyperparam√®tres)
- La courbe rouge montre le meilleur score cumul√©
- Une courbe qui se stabilise indique que l'espace a √©t√© bien explor√©

In [8]:
# === VISUALISATION DE LA CONVERGENCE HYPEROPT ===

fig = make_subplots(rows=1, cols=2, subplot_titles=['RandomForest', 'CatBoost'])

# RF
scores_rf = [t['result']['score'] for t in trials_rf.trials]
best_rf = np.maximum.accumulate(scores_rf)
fig.add_trace(go.Scatter(y=scores_rf, mode='markers', name='RF essais', marker=dict(size=5, opacity=0.5)), row=1, col=1)
fig.add_trace(go.Scatter(y=best_rf, mode='lines', name='RF best', line=dict(color='red', width=2)), row=1, col=1)

# CatBoost
scores_cb = [t['result']['score'] for t in trials_cb.trials]
best_cb = np.maximum.accumulate(scores_cb)
fig.add_trace(go.Scatter(y=scores_cb, mode='markers', name='CB essais', marker=dict(size=5, opacity=0.5)), row=1, col=2)
fig.add_trace(go.Scatter(y=best_cb, mode='lines', name='CB best', line=dict(color='red', width=2)), row=1, col=2)

# XGBoost
scores_xgb = [t['result']['score'] for t in trials_xgb.trials]
best_xgb = np.maximum.accumulate(scores_xgb)
fig.add_trace(go.Scatter(y=scores_xgb, mode='markers', name='xgb essais', marker=dict(size=5, opacity=0.5)), row=1, col=2)
fig.add_trace(go.Scatter(y=best_xgb, mode='lines', name='xgb best', line=dict(color='red', width=2)), row=1, col=2)

# LightGBM
scores_lgbm = [t['result']['score'] for t in trials_lgbm.trials]
best_lgbm = np.maximum.accumulate(scores_lgbm)
fig.add_trace(go.Scatter(y=scores_lgbm, mode='markers', name='lgbm essais', marker=dict(size=5, opacity=0.5)), row=1, col=2)
fig.add_trace(go.Scatter(y=best_lgbm, mode='lines', name='lgbm best', line=dict(color='red', width=2)), row=1, col=2)

fig.update_layout(title='Convergence Hyperopt', height=400, width=900, showlegend=False)
fig.update_xaxes(title_text='Essai')
fig.update_yaxes(title_text='F1 Score')
fig.show()

# R√©sum√©
print("\n" + "=" * 50)
print("RESUME HYPEROPT")
print("=" * 50)
print(f"RandomForest : F1 CV={best_f1_rf:.3f} | F1 test={f1_rf_test:.3f}")
print(f"CatBoost     : F1 CV={best_f1_cb:.3f} | F1 test={f1_cb_test:.3f}")
print(f"XGBoost      : F1 CV={best_f1_xgb:.3f} | F1 test={f1_xgb_test:.3f}")
print(f"LightGBM     : F1 CV={best_f1_lgbm:.3f} | F1 test={f1_lgbm_test:.3f}")



RESUME HYPEROPT
RandomForest : F1 CV=0.643 | F1 test=0.527
CatBoost     : F1 CV=0.000 | F1 test=0.655
XGBoost      : F1 CV=0.651 | F1 test=0.655
LightGBM     : F1 CV=0.601 | F1 test=0.655


## 4. Impact du nombre d'arbres (`n_estimators`)

**Question :** Au-del√† de 300 arbres, y a-t-il encore un gain de performance ?

**Attendu :** Les performances devraient se stabiliser au-del√† d'un certain seuil (loi des rendements d√©croissants).

In [9]:
# Test nombre d'arbres sur ACCIDENT - TOUTES LES M√âTRIQUES
n_estimators_list_acc = [20, 50, 100, 200]
results_trees_acc = []

for n_trees in n_estimators_list_acc:
    print(f"Testing n_estimators={n_trees} sur ACCIDENT...", end=" ")
    
    model_acc = create_pipeline(RandomForestClassifier(
        n_estimators=n_trees, random_state=RANDOM_STATE, n_jobs=nb_workers, class_weight='balanced'
    ))
    
    res, y_pred_tree, y_proba_tree = evaluate_model(model_acc, X_train_acc_bin, X_test_acc_bin, 
                               y_train_acc_bin, y_test_acc_bin, f"RF_{n_trees}")
    res['n_estimators'] = n_trees
    
    # M√©triques suppl√©mentaires
    cm = confusion_matrix(y_test_acc_bin, y_pred_tree)
    tn, fp, fn, tp = cm.ravel()
    res['specificity'] = tn / (tn + fp) if (tn + fp) > 0 else 0
    res['mcc'] = matthews_corrcoef(y_test_acc_bin, y_pred_tree)
    
    results_trees_acc.append(res)
    print(f"F1={res['f1']:.3f} | Recall={res['recall']:.3f} | AUC={res.get('auc', 0):.3f}")

df_trees_acc = pd.DataFrame(results_trees_acc)
print("\nüìä Tableau complet ACCIDENT:")
print(df_trees_acc[['n_estimators', 'accuracy', 'precision', 'recall', 'specificity', 'f1', 'auc', 'mcc']].round(3))

# Graphique avec toutes les m√©triques
metrics_to_plot = ['accuracy', 'precision', 'recall', 'specificity', 'f1', 'auc', 'mcc']
fig = go.Figure()

for metric in metrics_to_plot:
    if metric in df_trees_acc.columns:
        fig.add_trace(go.Scatter(
            x=df_trees_acc['n_estimators'],
            y=df_trees_acc[metric],
            mode='lines+markers',
            name=metric.upper(),
            text=[f"{v:.3f}" for v in df_trees_acc[metric]],
            hovertemplate=f'{metric}: %{{y:.3f}}<extra></extra>'
        ))

fig.update_layout(
    title='Impact du nombre d\'arbres - ACCIDENT (toutes m√©triques)',
    xaxis_title='Nombre d\'arbres (n_estimators)',
    yaxis_title='Score',
    yaxis_range=[0, 1],
    width=900,
    height=500,
    legend_title='M√©trique'
)
fig.show()

Testing n_estimators=20 sur ACCIDENT... F1=0.715 | Recall=0.712 | AUC=0.753
Testing n_estimators=50 sur ACCIDENT... F1=0.715 | Recall=0.713 | AUC=0.754
Testing n_estimators=100 sur ACCIDENT... F1=0.716 | Recall=0.713 | AUC=0.755
Testing n_estimators=200 sur ACCIDENT... F1=0.716 | Recall=0.713 | AUC=0.755

üìä Tableau complet ACCIDENT:
   n_estimators  accuracy  precision  recall  specificity     f1    auc    mcc
0            20     0.712      0.720   0.712        0.752  0.715  0.753  0.384
1            50     0.713      0.720   0.713        0.751  0.715  0.754  0.385
2           100     0.713      0.722   0.713        0.750  0.716  0.755  0.388
3           200     0.713      0.721   0.713        0.748  0.716  0.755  0.387


## 5. Comparaison des algorithmes de boosting

**Algorithmes test√©s :**

| Algorithme | Avantages | Inconv√©nients |
|------------|-----------|---------------|
| **RandomForest** | Robuste, interpr√©table, peu d'hyperparam√®tres | Peut √™tre moins performant |
| **XGBoost** | Tr√®s performant, r√©gularisation int√©gr√©e | Plus lent √† entra√Æner |
| **LightGBM** | Tr√®s rapide, g√®re bien les grandes donn√©es | Peut overfitter |
| **CatBoost** | G√®re nativement les cat√©gorielles | Incompatible sklearn Pipeline (1.6+) |

**Objectif :** Identifier l'algorithme avec le meilleur compromis F1/AUC.

In [10]:
# === COMPARAISON DES 6 MOD√àLES ACCIDENT ===
print("="*70)
print("COMPARAISON DES MOD√àLES ACCIDENT (6 mod√®les)")
print("="*70)

models_eval_acc = []

# 1. RF_baseline
rf_baseline = create_pipeline(RandomForestClassifier(
    n_estimators=base_estimators,
    random_state=RANDOM_STATE,
    n_jobs=nb_workers,
    class_weight='balanced'
))
res, y_pred, y_proba = evaluate_model(rf_baseline, X_train_acc_bin, X_test_acc_bin,
                                      y_train_acc_bin, y_test_acc_bin, 'RF_baseline')
models_eval_acc.append({"name": "RF_baseline", "y_true": y_test_acc_bin, "y_pred": y_pred, "y_proba": y_proba})

# 2. XGB_baseline
xgb_baseline = create_pipeline(XGBClassifier(
    n_estimators=base_estimators,
    random_state=RANDOM_STATE,
    n_jobs=nb_workers,
    scale_pos_weight=len(y_train_acc_bin[y_train_acc_bin==0])/len(y_train_acc_bin[y_train_acc_bin==1])
))
res, y_pred, y_proba = evaluate_model(xgb_baseline, X_train_acc_bin, X_test_acc_bin,
                                      y_train_acc_bin, y_test_acc_bin, 'XGB_baseline')
models_eval_acc.append({"name": "XGB_baseline", "y_true": y_test_acc_bin, "y_pred": y_pred, "y_proba": y_proba})

# 3. RF_HyperOpt (RandomForest optimis√©)
rf_hyperopt = create_pipeline(RandomForestClassifier(
    n_estimators=best_params_rf.get('n_estimators', 200),
    max_depth=best_params_rf.get('max_depth', None),
    min_samples_split=best_params_rf.get('min_samples_split', 2),
    random_state=RANDOM_STATE,
    n_jobs=nb_workers,
    class_weight='balanced'
))
res, y_pred, y_proba = evaluate_model(rf_hyperopt, X_train_acc_bin, X_test_acc_bin,
                                      y_train_acc_bin, y_test_acc_bin, 'RF_HyperOpt')
models_eval_acc.append({"name": "RF_HyperOpt", "y_true": y_test_acc_bin, "y_pred": y_pred, "y_proba": y_proba})

# 4. XGBoost optimis√©
xgb_opt = create_pipeline(XGBClassifier(
    **best_params_xgb,
    random_state=RANDOM_STATE,
    n_jobs=nb_workers,
    scale_pos_weight=len(y_train_acc_bin[y_train_acc_bin==0])/len(y_train_acc_bin[y_train_acc_bin==1])
))
res, y_pred, y_proba = evaluate_model(xgb_opt, X_train_acc_bin, X_test_acc_bin,
                                      y_train_acc_bin, y_test_acc_bin, 'XGBoost')
models_eval_acc.append({"name": "XGBoost", "y_true": y_test_acc_bin, "y_pred": y_pred, "y_proba": y_proba})

# 5. LightGBM optimis√©
lgbm_opt = create_pipeline(LGBMClassifier(
    **best_params_lgbm,
    random_state=RANDOM_STATE,
    n_jobs=nb_workers,
    class_weight='balanced',
    verbose=-1
))
res, y_pred, y_proba = evaluate_model(lgbm_opt, X_train_acc_bin, X_test_acc_bin,
                                      y_train_acc_bin, y_test_acc_bin, 'LightGBM')
models_eval_acc.append({"name": "LightGBM", "y_true": y_test_acc_bin, "y_pred": y_pred, "y_proba": y_proba})
# 6. CatBoost optimis√© (sans Pipeline - incompatible sklearn 1.6+)
catboost_opt = CatBoostClassifier(
    **best_params_cb,
    random_state=RANDOM_STATE,
    auto_class_weights='Balanced',
    verbose=False
)
res, y_pred, y_proba = evaluate_catboost(catboost_opt, X_train_acc_bin, X_test_acc_bin,
                                         y_train_acc_bin, y_test_acc_bin, 'CatBoost')
models_eval_acc.append({"name": "CatBoost", "y_true": y_test_acc_bin, "y_pred": y_pred, "y_proba": y_proba})

# Affichage complet avec display_metrics (confusion matrices, ROC, bar chart, table)
display_metrics(
    models_results=models_eval_acc,
    class_labels=["Non grave", "Grave"]
)

COMPARAISON DES MOD√àLES ACCIDENT (6 mod√®les)



X does not have valid feature names, but LGBMClassifier was fitted with feature names


X does not have valid feature names, but LGBMClassifier was fitted with feature names



EVALUATION DES MODELES (6 modele(s))

üìä TABLEAU DES METRIQUES
----------------------------------------------------------------------
       model  accuracy  balanced_accuracy  precision  recall     f1  roc_auc    mcc  specificity  sensitivity
 RF_baseline    0.7134             0.6981     0.7216  0.7134 0.7164   0.7548 0.3878       0.7496       0.6465
XGB_baseline    0.7254             0.7281     0.7477  0.7254 0.7307   0.8019 0.4386       0.7188       0.7374
 RF_HyperOpt    0.6639             0.6305     0.6634  0.6639 0.6637   0.6907 0.2614       0.7430       0.5180
     XGBoost    0.7257             0.7295     0.7491  0.7257 0.7312   0.8019 0.4410       0.7167       0.7424
    LightGBM    0.7257             0.7293     0.7488  0.7257 0.7311   0.8022 0.4406       0.7172       0.7414
    CatBoost    0.7263             0.7290     0.7485  0.7263 0.7316   0.8012 0.4403       0.7199       0.7381




üìà MATRICES DE CONFUSION
----------------------------------------------------------------------



üìâ COURBES ROC
----------------------------------------------------------------------



RESUME
Meilleur modele (F1): CatBoost (F1 = 0.7316)
Meilleur modele (AUC): LightGBM (AUC = 0.8022)


{'metrics_df':           model  accuracy  balanced_accuracy  precision  recall      f1  \
 0   RF_baseline    0.7134             0.6981     0.7216  0.7134  0.7164   
 1  XGB_baseline    0.7254             0.7281     0.7477  0.7254  0.7307   
 2   RF_HyperOpt    0.6639             0.6305     0.6634  0.6639  0.6637   
 3       XGBoost    0.7257             0.7295     0.7491  0.7257  0.7312   
 4      LightGBM    0.7257             0.7293     0.7488  0.7257  0.7311   
 5      CatBoost    0.7263             0.7290     0.7485  0.7263  0.7316   
 
    roc_auc     mcc  specificity  sensitivity  
 0   0.7548  0.3878       0.7496       0.6465  
 1   0.8019  0.4386       0.7188       0.7374  
 2   0.6907  0.2614       0.7430       0.5180  
 3   0.8019  0.4410       0.7167       0.7424  
 4   0.8022  0.4406       0.7172       0.7414  
 5   0.8012  0.4403       0.7199       0.7381  ,
 'figures': {'bar_chart': Figure({
      'data': [{'marker': {'color': '#3498db'},
                'name': 'RF_base

## 6. Comparaison finale et s√©lection du meilleur mod√®le

**Crit√®res de s√©lection :**
1. **F1-Score** (crit√®re principal)
2. **AUC** (d√©partage en cas d'√©galit√© sur F1)

**Mod√®les compar√©s :**
- RandomForest baseline
- RandomForest optimis√© (meilleurs hyperparam√®tres)
- RandomForest avec plus d'arbres
- XGBoost
- LightGBM
- CatBoost

In [11]:
# S√©lection du meilleur mod√®le ACCIDENT
print("="*70)
print("S√âLECTION DU MEILLEUR MOD√àLE ACCIDENT")
print("="*70)

best_model_info = select_best_model(models_eval_acc, metric='f1')
print(f"\nMeilleur mod√®le: {best_model_info['name']}")
print(f"F1-Score: {best_model_info['score']:.4f}")

S√âLECTION DU MEILLEUR MOD√àLE ACCIDENT

Meilleur mod√®le: CatBoost
F1-Score: 0.7316


### 6.1 Interpr√©tation des matrices de confusion

**Lecture de la matrice :**
- **TP (True Positive)** : Accidents graves correctement d√©tect√©s ‚Üí Bon !
- **FN (False Negative)** : Accidents graves manqu√©s ‚Üí Dangereux (sous-estimation du risque)
- **FP (False Positive)** : Fausses alertes ‚Üí Co√ªteux mais pas dangereux
- **TN (True Negative)** : Non-graves correctement identifi√©s ‚Üí Bon !

**Compromis √† faire :**
- Privil√©gier le **Recall** si on veut minimiser les FN (ne pas rater de cas graves)
- Privil√©gier la **Precision** si on veut minimiser les FP (√©viter les fausses alertes)

## 7. Sauvegarde modeles optimises

In [12]:
# Sauvegarder le meilleur mod√®le ACCIDENT
print("="*70)
print("SAUVEGARDE DU MEILLEUR MOD√àLE ACCIDENT")
print("="*70)

# Configuration des mod√®les (m√™me config que la comparaison)
X_acc_bin_full, y_acc_bin_full = prepare_data(df_accident, 'grav_binary')
pos_weight = len(y_acc_bin_full[y_acc_bin_full==0]) / len(y_acc_bin_full[y_acc_bin_full==1])

model_configs = {
    'RF_baseline': lambda: create_pipeline(RandomForestClassifier(
        n_estimators=base_estimators, random_state=RANDOM_STATE,
        n_jobs=nb_workers, class_weight='balanced'
    )),
    'XGB_baseline': lambda: create_pipeline(XGBClassifier(
        n_estimators=base_estimators, random_state=RANDOM_STATE,
        n_jobs=nb_workers, scale_pos_weight=pos_weight
    )),
    'RF_HyperOpt': lambda: create_pipeline(RandomForestClassifier(
        n_estimators=best_params_rf.get('n_estimators', 200),
        max_depth=best_params_rf.get('max_depth', None),
        min_samples_split=best_params_rf.get('min_samples_split', 2),
        random_state=RANDOM_STATE, n_jobs=nb_workers, class_weight='balanced'
    )),
    'XGBoost': lambda: create_pipeline(XGBClassifier(
        **best_params_xgb, random_state=RANDOM_STATE,
        n_jobs=nb_workers, scale_pos_weight=pos_weight
    )),
    'LightGBM': lambda: create_pipeline(LGBMClassifier(
        **best_params_lgbm, random_state=RANDOM_STATE,
        n_jobs=nb_workers, class_weight='balanced', verbose=-1
    )),
    'CatBoost': lambda: CatBoostClassifier(
        **best_params_cb, random_state=RANDOM_STATE,
        auto_class_weights='Balanced', verbose=False
    )
}

result = save_best_model(
    best_model_name=best_model_info['name'],
    model_configs=model_configs,
    X_full=X_acc_bin_full,
    y_full=y_acc_bin_full,
    X_test=X_test_acc_bin,
    y_test=y_test_acc_bin,
    save_path='models/model_accident_binary_optimized.joblib'
)

SAUVEGARDE DU MEILLEUR MOD√àLE ACCIDENT
Entrainement du modele: CatBoost
Sauvegarde: models/model_accident_binary_optimized.joblib
F1 sur test (weighted): 0.7421

Features attendues en entree (11):
  1. est_nuit
  2. est_heure_pointe
  3. jour_semaine
  4. est_weekend
  5. region
  6. dep
  7. agg
  8. vma
  9. impl_vehicule_leger
  10. impl_poids_lourd
  11. impl_pieton


## 8. R√©sum√© et conclusions

### Mod√®les sauvegard√©s

| Fichier | Description | Usage |
|---------|-------------|-------|
|  | Classification binaire - baseline | Fallback |
|  | Binaire optimis√© (Hyperopt) | **Production** |

### R√©sultats de l'optimisation

| Algorithme | F1 CV (Hyperopt) | F1 Test | Commentaire |
|------------|------------------|---------|-------------|
| RandomForest | ~0.64 | ~0.53 | Overfitting possible |
| XGBoost | ~0.65 | ~0.66 | Bon compromis |
| LightGBM | ~0.60 | ~0.66 | Rapide et performant |
| CatBoost | - | ~0.66 | **S√©lectionn√©** |

### Recommandations pour l'API

1. **Endpoint principal** : Utiliser 
2. **Preprocessing** : Appliquer le m√™me pipeline (imputation + scaling) qu'√† l'entra√Ænement
3. **Seuil de d√©cision** : Par d√©faut 0.5, ajustable selon le compromis FP/FN souhait√©

### Pistes d'am√©lioration future

- Enrichir les features avec des donn√©es externes (m√©t√©o d√©taill√©e, trafic)
- Tester d'autres architectures (stacking, neural networks)
- Optimiser le seuil de d√©cision selon les co√ªts m√©tier (FN plus grave que FP)