# Fine-tuning - PASSENGER injury severity prediction by vehicle type

## Context

This notebook follows **4-a-passengers-ml-training**, which established that:
- Specialized models perform better for **vehicule_leger** (+11.4 F1 points), **velo_edp** (+3.1 F1 points) and **pieton** (+6.9 F1 points)
- The global model performs better for voiture and poids_lourd (these categories lack specific patterns)
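
The F1 gains listed above are absolute differences between scores (0.640 vs 0.526 gives 0.114), which is not the same as a relative improvement. The sketch below shows both readings for the vehicule_leger figures; `f1_gain` is a hypothetical helper written for this illustration and is not part of the project's `functions` module.

```python
def f1_gain(specialized: float, baseline: float) -> tuple[float, float]:
    """Return (absolute gain in F1 points, relative gain in percent)."""
    absolute_points = (specialized - baseline) * 100
    relative_percent = (specialized - baseline) / baseline * 100
    return absolute_points, relative_percent


# vehicule_leger: specialized 0.640 vs global 0.526
points, relative = f1_gain(0.640, 0.526)
print(f"{points:+.1f} points, {relative:+.1f}% relative")  # +11.4 points, +21.7% relative
```

Reporting the absolute gain in points avoids confusion when the baselines differ from one category to another.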

## Architecture decision

| Category | Model used | Rationale |
|----------|------------|-----------|
| **vehicule_leger** | Specialized (fine-tuned) | F1 0.640 vs 0.526 for the global model |
| **velo_edp** | Specialized (fine-tuned) | F1 0.557 vs 0.526 for the global model |
| **pieton** | Specialized (fine-tuned) | F1 0.596 vs 0.526 for the global model |
| voiture | Global | Global F1 is higher |
| poids_lourd | Global | Global F1 is higher |
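
At prediction time, the table above amounts to a simple routing rule. The following is a minimal sketch of that rule, assuming models are looked up by a string key; the key format and the `model_key_for` helper are hypothetical and do not come from the notebook.

```python
# Categories served by a specialized model, per the decision table.
SPECIALIZED_CATEGORIES = {"vehicule_leger", "velo_edp", "pieton"}


def model_key_for(category: str) -> str:
    """Return the (hypothetical) registry key of the model serving a category."""
    if category in SPECIALIZED_CATEGORIES:
        return f"specialized:{category}"
    # voiture, poids_lourd and any unlisted category fall back to the global model
    return "global"


for cat in ["vehicule_leger", "voiture", "pieton", "poids_lourd"]:
    print(cat, "->", model_key_for(cat))
```

Falling back to the global model for unlisted categories is a design choice of this sketch; raising an error instead may be preferable if unknown categories should never occur.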

## Goal of this notebook

Tune the hyperparameters of the 3 specialized models with Hyperopt to improve their performance.

## Workflow

```
1. Configuration and imports
2. Loading the 3 datasets (vehicule_leger, velo_edp, pieton)
3. Hyperopt fine-tuning for each category
   - RandomForest
   - XGBoost
   - LightGBM
   - CatBoost
4. Comparison and selection of the best model per category
5. Saving the optimized models
```

## Selection criteria

| Criterion | Metric | Rationale |
|-----------|--------|-----------|
| **Primary** | F1-Score | Balances precision and recall |
| **Secondary** | AUC | Tie-breaker when F1 scores are equal |
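
The two criteria above can be combined into a single selection rule. The sketch below is one possible reading of it: treat F1 scores within a small tolerance as tied, then break the tie on AUC. The `pick_best` helper, the tolerance, and the scores are all made up for illustration; the notebook's own `select_best_model` may behave differently.

```python
def pick_best(results: list[dict], tol: float = 1e-4) -> dict:
    """Pick the highest-F1 entry; among entries within `tol` of it, take the highest AUC."""
    top_f1 = max(r["f1"] for r in results)
    tied = [r for r in results if top_f1 - r["f1"] <= tol]
    # Entries without an AUC lose the tie-break
    return max(tied, key=lambda r: r.get("auc", float("-inf")))


# Made-up scores, for illustration only
candidates = [
    {"name": "A", "f1": 0.61, "auc": 0.70},
    {"name": "B", "f1": 0.63, "auc": 0.71},
    {"name": "C", "f1": 0.63, "auc": 0.74},
]
print(pick_best(candidates)["name"])  # C
```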

---

## 1. Configuration and imports

In [1]:
# === IMPORTS ===
import os
import pandas as pd
import numpy as np
import joblib
import warnings

from ml_config import nb_workers, base_estimators, max_evals
from functions import (
    display_metrics,
    optimize_boosting_model,
    plot_optimization_history,
    select_best_model,
    save_best_model,
)

# Visualisation
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

# Machine Learning
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    matthews_corrcoef,
    classification_report,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
    auc,
)
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from hyperopt import fmin, tpe, hp, STATUS_OK, Trials

# === CONFIGURATION ===
RANDOM_STATE = 42

# Categories to fine-tune (those where the specialized model performs better)
CATEGORIES_TO_FINETUNE = ["vehicule_leger", "velo_edp", "pieton"]

warnings.filterwarnings("ignore", message="pkg_resources is deprecated")
print(f"Configuration: {base_estimators} estimators, {nb_workers} workers, {max_evals} evals Hyperopt")
print(f"Categories a fine-tuner: {CATEGORIES_TO_FINETUNE}")
print("Imports OK")


Configuration: 100 estimators, 1 workers, 50 evals Hyperopt
Categories a fine-tuner: ['vehicule_leger', 'velo_edp', 'pieton']
Imports OK


---

## 2. Data loading and utility functions

In [2]:
def load_dataset(name, base_path="data/output"):
    """
    Load a dataset from Parquet (preferred) or CSV (fallback).
    """
    parquet_path = f"{base_path}/{name}.parquet"
    csv_path = f"{base_path}/{name}.csv"

    if os.path.exists(parquet_path):
        df = pd.read_parquet(parquet_path)
        print(f"{name}: charge depuis Parquet ({df.shape[0]:,} lignes, {df.shape[1]} colonnes)")
    elif os.path.exists(csv_path):
        df = pd.read_csv(csv_path, sep=";", decimal=",", encoding="utf-8-sig")
        print(f"{name}: charge depuis CSV ({df.shape[0]:,} lignes, {df.shape[1]} colonnes)")
    else:
        raise FileNotFoundError(f"Dataset {name} non trouve")
    return df


# Load the datasets to fine-tune
print("=== CHARGEMENT DES DATASETS ===")
datasets = {}
for cat in CATEGORIES_TO_FINETUNE:
    datasets[cat] = load_dataset(f"dataset_passager_{cat}")

# Print summary statistics
print("\n=== DISTRIBUTION DE LA GRAVITE ===")
print(f"{'Categorie':<20} {'Lignes':>10} {'Taux grave':>12} {'Colonnes':>10}")
print("-" * 55)
for cat, df in datasets.items():
    taux = df["grav_binary"].mean() * 100
    print(f"{cat:<20} {len(df):>10,} {taux:>11.1f}% {df.shape[1]:>10}")

=== CHARGEMENT DES DATASETS ===
dataset_passager_vehicule_leger: charge depuis Parquet (101,403 lignes, 18 colonnes)
dataset_passager_velo_edp: charge depuis Parquet (37,073 lignes, 16 colonnes)
dataset_passager_pieton: charge depuis Parquet (46,080 lignes, 15 colonnes)

=== DISTRIBUTION DE LA GRAVITE ===
Categorie                Lignes   Taux grave   Colonnes
-------------------------------------------------------
vehicule_leger          101,403        33.9%         18
velo_edp                 37,073        24.6%         16
pieton                   46,080        32.7%         15


In [3]:
# === DATA PREPARATION FUNCTIONS ===


def prepare_data(df, target_col="grav_binary"):
    """
    Build the features (X) and the target (y) for training.

    - Drops every severity column, including the target (avoids data leakage)
    - Encodes categorical variables as integers
    """
    leak_cols = {"grav_ordered", "grav_binary", target_col}

    X = df.drop(columns=[c for c in df.columns if c in leak_cols])
    y = df[target_col]

    # Encode categorical variables.
    # Note: each encoder is fitted on the whole dataset and then discarded, so the
    # mapping must be rebuilt the same way before scoring new data.
    for col in X.select_dtypes(include=["object", "string"]).columns:
        le = LabelEncoder()
        X[col] = le.fit_transform(X[col].astype(str))

    return X, y


def create_pipeline(model):
    """
    Build an sklearn pipeline: Imputation -> Scaling -> Model.
    """
    return Pipeline([("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler()), ("model", model)])


def evaluate_model(model, X_train, X_test, y_train, y_test, model_name):
    """
    Train a model and compute its evaluation metrics.
    """
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    y_proba = None
    if hasattr(model, "predict_proba"):
        y_proba = model.predict_proba(X_test)

    results = {
        "model": model_name,
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, average="weighted", zero_division=0),
        "recall": recall_score(y_test, y_pred, average="weighted", zero_division=0),
        "f1": f1_score(y_test, y_pred, average="weighted", zero_division=0),
    }

    if len(np.unique(y_test)) == 2 and y_proba is not None:
        results["auc"] = roc_auc_score(y_test, y_proba[:, 1])

    return results, y_pred, y_proba


def evaluate_catboost(catboost_model, X_train, X_test, y_train, y_test, model_name):
    """
    Evaluate CatBoost WITHOUT an sklearn Pipeline (incompatible with sklearn 1.6+).
    """
    imputer = SimpleImputer(strategy="median")
    X_train_imp = imputer.fit_transform(X_train)
    X_test_imp = imputer.transform(X_test)

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train_imp)
    X_test_scaled = scaler.transform(X_test_imp)

    catboost_model.fit(X_train_scaled, y_train)
    y_pred = catboost_model.predict(X_test_scaled)

    y_proba = None
    if hasattr(catboost_model, "predict_proba"):
        y_proba = catboost_model.predict_proba(X_test_scaled)

    results = {
        "model": model_name,
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, average="weighted", zero_division=0),
        "recall": recall_score(y_test, y_pred, average="weighted", zero_division=0),
        "f1": f1_score(y_test, y_pred, average="weighted", zero_division=0),
    }

    if len(np.unique(y_test)) == 2 and y_proba is not None:
        results["auc"] = roc_auc_score(y_test, y_proba[:, 1])

    return results, y_pred, y_proba


print("Fonctions definies")

Fonctions definies
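
`evaluate_model` and `evaluate_catboost` report F1 with `average="weighted"`, while the Hyperopt calls later in the notebook pass `scoring="f1"`. In scikit-learn, the `"f1"` scorer is the binary F1 of the positive class, so if `optimize_boosting_model` forwards that string unchanged (an assumption, since its source is not shown here), the CV score and the test score measure different things. The sketch below recomputes both variants in plain Python on toy labels to show how far apart they can be; it mirrors `f1_score(..., average="binary")` and `f1_score(..., average="weighted")` but is not the library code.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 of one class, treating `positive` as the positive label."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's share of y_true."""
    n = len(y_true)
    return sum(f1_for_class(y_true, y_pred, c) * y_true.count(c) / n for c in sorted(set(y_true)))


# Toy labels (not project data): 1 = severe, 0 = not severe
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

binary = f1_for_class(y_true, y_pred, positive=1)
weighted = weighted_f1(y_true, y_pred)
print(f"binary F1 = {binary:.3f}, weighted F1 = {weighted:.3f}")  # 0.571 vs 0.690
```

When the majority class is predicted well, weighted F1 sits above the binary F1 of the minority class, so comparisons against a baseline are only meaningful if both figures use the same averaging.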


---

## 3. Fine-tuning VEHICULE_LEGER (motorized two-wheelers, quads)

### Characteristics of this category

| Characteristic | Value |
|----------------|-------|
| Dataset size | ~101k rows |
| Severe-injury rate | ~34% |
| Baseline F1 (4a) | 0.640 |
| Target | Exceed 0.640 |

In [4]:
# === PREPARATION VEHICULE_LEGER ===
cat = "vehicule_leger"
print(f"=== PREPARATION {cat.upper()} ===")

df_vl = datasets[cat]
X_vl, y_vl = prepare_data(df_vl)

# Stratified train/test split
X_train_vl, X_test_vl, y_train_vl, y_test_vl = train_test_split(
    X_vl, y_vl, test_size=0.2, random_state=RANDOM_STATE, stratify=y_vl
)

print(f"Features: {X_vl.columns.tolist()}")
print(f"\nTrain: {len(X_train_vl):,} | Test: {len(X_test_vl):,}")
print(f"Taux grave train: {y_train_vl.mean() * 100:.1f}%")
print(f"Taux grave test: {y_test_vl.mean() * 100:.1f}%")

# Subsample for Hyperopt (to speed up the search); drawn uniformly, not stratified
sample_size_vl = min(30000, len(X_train_vl))
X_sample_vl = X_train_vl.sample(n=sample_size_vl, random_state=RANDOM_STATE)
y_sample_vl = y_train_vl.loc[X_sample_vl.index]
print(f"\nHyperopt sur {sample_size_vl:,} echantillons")

=== PREPARATION VEHICULE_LEGER ===
Features: ['age', 'sexe', 'catu', 'dep', 'region', 'agg', 'vma', 'atm', 'est_nuit', 'est_heure_pointe', 'jour_semaine', 'est_weekend', 'a_equipement_adapte', 'catv', 'sexe_conducteur', 'age_conducteur']

Train: 81,122 | Test: 20,281
Taux grave train: 33.9%
Taux grave test: 33.9%

Hyperopt sur 30,000 echantillons
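
The subsample above is drawn uniformly with `DataFrame.sample`, so its class balance is only approximately that of the training set. If an exact balance matters, a stratified draw is an option. The sketch below shows the idea in plain Python on toy labels; `stratified_sample_indices` is a hypothetical helper, and with scikit-learn the same effect is usually obtained with `train_test_split(X, y, train_size=n, stratify=y)`.

```python
import random
from collections import defaultdict


def stratified_sample_indices(labels, n, seed=42):
    """Return about n indices, drawn so each class keeps its share of `labels`."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    total = len(labels)
    picked = []
    for indices in by_class.values():
        quota = round(n * len(indices) / total)  # per-class quota, rounded
        picked.extend(rng.sample(indices, min(quota, len(indices))))
    return sorted(picked)


# Toy labels with a 30% positive rate (not project data)
labels = [1] * 300 + [0] * 700
idx = stratified_sample_indices(labels, n=100)
rate = sum(labels[i] for i in idx) / len(idx)
print(len(idx), f"{rate:.2f}")
```

Because quotas are rounded per class, the total can differ from `n` by a few rows when class shares do not divide evenly.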


In [5]:
# === HYPEROPT RANDOMFOREST - VEHICULE_LEGER ===
print("=" * 60)
print("HYPEROPT RandomForest - VEHICULE_LEGER")
print("=" * 60)

best_params_rf_vl, best_f1_rf_vl, trials_rf_vl = optimize_boosting_model(
    X_sample_vl,
    y_sample_vl,
    model_type="randomforest",
    max_evals=max_evals,
    cv=3,
    scoring="f1",
    random_state=RANDOM_STATE,
    n_jobs=nb_workers,
)

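# Validation on the full test set
# Note: f1_score below uses average="weighted", whereas the search is called with scoring="f1".
# If that string reaches scikit-learn unchanged it means binary F1 on the positive class,
# so the CV score and the test score are not directly comparable.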
print("\nValidation sur test set complet...")
model_rf_vl = create_pipeline(
    RandomForestClassifier(**best_params_rf_vl, random_state=RANDOM_STATE, n_jobs=nb_workers, class_weight="balanced")
)
model_rf_vl.fit(X_train_vl, y_train_vl)
y_pred_rf_vl = model_rf_vl.predict(X_test_vl)
f1_rf_vl_test = f1_score(y_test_vl, y_pred_rf_vl, average="weighted")
print(f"F1 sur test set: {f1_rf_vl_test:.3f}")

HYPEROPT RandomForest - VEHICULE_LEGER
Optimisation Hyperopt pour RANDOMFOREST...
  - max_evals: 50
  - cv: 3 folds
  - scoring: f1

100%|██████████| 50/50 [28:14<00:00, 33.89s/trial, best loss: -0.6243240140888174]

Meilleurs parametres RANDOMFOREST:
  - max_depth: 1
  - max_features: 2
  - min_samples_leaf: 9
  - min_samples_split: 6
  - n_estimators: 350

Meilleur f1 (CV): 0.6243 (+/- 0.0066)

Validation sur test set complet...
F1 sur test set: 0.671
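
The reported `max_depth: 1` and `max_features: 2` are unusually small for a random forest. One possible explanation, which cannot be confirmed without the source of `optimize_boosting_model`, is that these dimensions are declared with `hp.choice`: for such dimensions Hyperopt's `fmin` returns the index of the selected option, and `hyperopt.space_eval(space, best)` is the call that maps the indices back to values. The sketch below mimics that decoding in plain Python; the option lists are hypothetical and are not the project's search space.

```python
# Hypothetical option lists standing in for hp.choice(...) dimensions
CHOICE_OPTIONS = {
    "max_depth": [5, 10, 20, None],
    "max_features": ["sqrt", "log2", None],
}


def decode_choices(best: dict, options: dict) -> dict:
    """Replace hp.choice-style indices with the option they point to."""
    decoded = dict(best)
    for name, values in options.items():
        if name in decoded:
            decoded[name] = values[int(decoded[name])]
    return decoded


raw_best = {"max_depth": 1, "max_features": 2, "n_estimators": 350}
print(decode_choices(raw_best, CHOICE_OPTIONS))
# {'max_depth': 10, 'max_features': None, 'n_estimators': 350}
```

If the search space does use `hp.choice`, passing the raw indices to `RandomForestClassifier` would train a different model from the one Hyperopt evaluated, which is worth checking before trusting the test score.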


In [None]:
# === HYPEROPT XGBOOST - VEHICULE_LEGER ===
print("=" * 60)
print("HYPEROPT XGBoost - VEHICULE_LEGER")
print("=" * 60)

best_params_xgb_vl, best_f1_xgb_vl, trials_xgb_vl = optimize_boosting_model(
    X_sample_vl,
    y_sample_vl,
    model_type="xgboost",
    max_evals=max_evals,
    cv=3,
    scoring="f1",
    random_state=RANDOM_STATE,
    n_jobs=nb_workers,
)

# Validation on the full test set
print("\nValidation sur test set complet...")
imputer = SimpleImputer(strategy="median")
X_train_vl_imp = pd.DataFrame(imputer.fit_transform(X_train_vl), columns=X_train_vl.columns)
X_test_vl_imp = pd.DataFrame(imputer.transform(X_test_vl), columns=X_test_vl.columns)

n_neg = (y_train_vl == 0).sum()
n_pos = (y_train_vl == 1).sum()
scale_pos_weight_vl = n_neg / n_pos

model_xgb_vl = XGBClassifier(
    **best_params_xgb_vl, random_state=RANDOM_STATE, scale_pos_weight=scale_pos_weight_vl, verbosity=0
)
model_xgb_vl.fit(X_train_vl_imp, y_train_vl)
y_pred_xgb_vl = model_xgb_vl.predict(X_test_vl_imp)
f1_xgb_vl_test = f1_score(y_test_vl, y_pred_xgb_vl, average="weighted")
print(f"F1 sur test set: {f1_xgb_vl_test:.3f}")

HYPEROPT XGBoost - VEHICULE_LEGER
Optimisation Hyperopt pour XGBOOST...
  - max_evals: 50
  - cv: 3 folds
  - scoring: f1

100%|██████████| 50/50 [02:10<00:00,  2.61s/trial, best loss: -0.632011917223157] 

Meilleurs parametres XGBOOST:
  - colsample_bytree: 0.9951652709641787
  - learning_rate: 0.02386356986404833
  - max_depth: 5
  - min_child_weight: 2
  - n_estimators: 350
  - subsample: 0.9124351334655565

Meilleur f1 (CV): 0.6320 (+/- 0.0058)

Validation sur test set complet...
F1 sur test set: 0.717
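
The XGBoost cell sets `scale_pos_weight` to `n_neg / n_pos`, the ratio commonly recommended for imbalanced binary targets. Written in terms of the positive rate p, this is (1 - p) / p. The sketch below applies that formula to the rounded severe-injury rates printed when the datasets were loaded, so the values are approximations of what the notebook computes on the training split; `scale_pos_weight_from_rate` is a hypothetical helper.

```python
def scale_pos_weight_from_rate(positive_rate: float) -> float:
    """n_neg / n_pos expressed through the positive-class rate p: (1 - p) / p."""
    if not 0 < positive_rate < 1:
        raise ValueError("positive_rate must be strictly between 0 and 1")
    return (1 - positive_rate) / positive_rate


# Rounded severe-injury rates from the dataset summary
rates = {"vehicule_leger": 0.339, "velo_edp": 0.246, "pieton": 0.327}
for name, p in rates.items():
    print(f"{name}: {scale_pos_weight_from_rate(p):.2f}")
```

The lower the severe-injury rate, the larger the weight: velo_edp, with about a quarter of severe cases, gets roughly three times the weight on its positive class.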



In [None]:
# === HYPEROPT LIGHTGBM - VEHICULE_LEGER ===
print("=" * 60)
print("HYPEROPT LightGBM - VEHICULE_LEGER")
print("=" * 60)

best_params_lgbm_vl, best_f1_lgbm_vl, trials_lgbm_vl = optimize_boosting_model(
    X_sample_vl,
    y_sample_vl,
    model_type="lightgbm",
    max_evals=max_evals,
    cv=3,
    scoring="f1",
    random_state=RANDOM_STATE,
    n_jobs=nb_workers,
)

# Validation on the full test set
print("\nValidation sur test set complet...")
model_lgbm_vl = create_pipeline(
    LGBMClassifier(**best_params_lgbm_vl, random_state=RANDOM_STATE, class_weight="balanced", verbose=-1)
)
model_lgbm_vl.fit(X_train_vl, y_train_vl)
y_pred_lgbm_vl = model_lgbm_vl.predict(X_test_vl)
f1_lgbm_vl_test = f1_score(y_test_vl, y_pred_lgbm_vl, average="weighted")
print(f"F1 sur test set: {f1_lgbm_vl_test:.3f}")

HYPEROPT LightGBM - VEHICULE_LEGER
Optimisation Hyperopt pour LIGHTGBM...
  - max_evals: 50
  - cv: 3 folds
  - scoring: f1

 24%|██▍       | 12/50 [00:29<01:16,  2.00s/trial, best loss: -0.5575626806948394]

In [None]:
# === HYPEROPT CATBOOST - VEHICULE_LEGER ===
print("=" * 60)
print("HYPEROPT CatBoost - VEHICULE_LEGER")
print("=" * 60)

best_params_cb_vl, best_f1_cb_vl, trials_cb_vl = optimize_boosting_model(
    X_sample_vl,
    y_sample_vl,
    model_type="catboost",
    max_evals=max_evals,
    cv=3,
    scoring="f1",
    random_state=RANDOM_STATE,
    n_jobs=nb_workers,
)

# Validation on the full test set
print("\nValidation sur test set complet...")
imputer = SimpleImputer(strategy="median")
X_train_vl_imp = pd.DataFrame(imputer.fit_transform(X_train_vl), columns=X_train_vl.columns)
X_test_vl_imp = pd.DataFrame(imputer.transform(X_test_vl), columns=X_test_vl.columns)

model_cb_vl = CatBoostClassifier(
    **best_params_cb_vl,
    random_state=RANDOM_STATE,
    auto_class_weights="Balanced",
    verbose=False,
    allow_writing_files=False,
)
model_cb_vl.fit(X_train_vl_imp, y_train_vl)
y_pred_cb_vl = model_cb_vl.predict(X_test_vl_imp)
f1_cb_vl_test = f1_score(y_test_vl, y_pred_cb_vl, average="weighted")
print(f"F1 sur test set: {f1_cb_vl_test:.3f}")

In [None]:
# === COMPARISON VEHICULE_LEGER ===
print("=" * 70)
print("COMPARAISON DES MODELES - VEHICULE_LEGER")
print("=" * 70)

models_eval_vl = []

# Tuned RF
model = create_pipeline(
    RandomForestClassifier(**best_params_rf_vl, random_state=RANDOM_STATE, n_jobs=nb_workers, class_weight="balanced")
)
res, y_pred, y_proba = evaluate_model(model, X_train_vl, X_test_vl, y_train_vl, y_test_vl, "RF_HyperOpt")
models_eval_vl.append({"name": "RF_HyperOpt", "y_true": y_test_vl, "y_pred": y_pred, "y_proba": y_proba})

# Tuned XGBoost
model = create_pipeline(
    XGBClassifier(
        **best_params_xgb_vl, random_state=RANDOM_STATE, n_jobs=nb_workers, scale_pos_weight=scale_pos_weight_vl
    )
)
res, y_pred, y_proba = evaluate_model(model, X_train_vl, X_test_vl, y_train_vl, y_test_vl, "XGBoost")
models_eval_vl.append({"name": "XGBoost", "y_true": y_test_vl, "y_pred": y_pred, "y_proba": y_proba})

# Tuned LightGBM
model = create_pipeline(
    LGBMClassifier(
        **best_params_lgbm_vl, random_state=RANDOM_STATE, n_jobs=nb_workers, class_weight="balanced", verbose=-1
    )
)
res, y_pred, y_proba = evaluate_model(model, X_train_vl, X_test_vl, y_train_vl, y_test_vl, "LightGBM")
models_eval_vl.append({"name": "LightGBM", "y_true": y_test_vl, "y_pred": y_pred, "y_proba": y_proba})

# Tuned CatBoost
model_cb = CatBoostClassifier(
    **best_params_cb_vl,
    random_state=RANDOM_STATE,
    auto_class_weights="Balanced",
    verbose=False,
    allow_writing_files=False,
)
res, y_pred, y_proba = evaluate_catboost(model_cb, X_train_vl, X_test_vl, y_train_vl, y_test_vl, "CatBoost")
models_eval_vl.append({"name": "CatBoost", "y_true": y_test_vl, "y_pred": y_pred, "y_proba": y_proba})

# Display
display_metrics(models_results=models_eval_vl, class_labels=["Non grave", "Grave"])

In [None]:
# === BEST MODEL SELECTION VEHICULE_LEGER ===
best_model_vl = select_best_model(models_eval_vl, metric="f1")
print(f"\nMeilleur modele VEHICULE_LEGER: {best_model_vl['name']}")
print(f"F1-Score: {best_model_vl['score']:.4f}")
print(f"Amelioration vs baseline (0.640): {(best_model_vl['score'] - 0.640) * 100:+.2f}%")

---

## 4. Fine-tuning VELO_EDP (bicycles, scooters)

### Characteristics of this category

| Characteristic | Value |
|----------------|-------|
| Dataset size | ~37k rows |
| Severe-injury rate | ~25% |
| Baseline F1 (4a) | 0.557 |
| Target | Exceed 0.557 |

**Note**: This dataset has fewer columns because the road user is often the driver.

In [None]:
# === PREPARATION VELO_EDP ===
cat = "velo_edp"
print(f"=== PREPARATION {cat.upper()} ===")

df_velo = datasets[cat]
X_velo, y_velo = prepare_data(df_velo)

X_train_velo, X_test_velo, y_train_velo, y_test_velo = train_test_split(
    X_velo, y_velo, test_size=0.2, random_state=RANDOM_STATE, stratify=y_velo
)

print(f"Features: {X_velo.columns.tolist()}")
print(f"\nTrain: {len(X_train_velo):,} | Test: {len(X_test_velo):,}")
print(f"Taux grave train: {y_train_velo.mean() * 100:.1f}%")
print(f"Taux grave test: {y_test_velo.mean() * 100:.1f}%")

# Subsample for Hyperopt
sample_size_velo = min(20000, len(X_train_velo))
X_sample_velo = X_train_velo.sample(n=sample_size_velo, random_state=RANDOM_STATE)
y_sample_velo = y_train_velo.loc[X_sample_velo.index]
print(f"\nHyperopt sur {sample_size_velo:,} echantillons")

In [None]:
# === HYPEROPT RANDOMFOREST - VELO_EDP ===
print("=" * 60)
print("HYPEROPT RandomForest - VELO_EDP")
print("=" * 60)

best_params_rf_velo, best_f1_rf_velo, trials_rf_velo = optimize_boosting_model(
    X_sample_velo,
    y_sample_velo,
    model_type="randomforest",
    max_evals=max_evals,
    cv=3,
    scoring="f1",
    random_state=RANDOM_STATE,
    n_jobs=nb_workers,
)

print("\nValidation sur test set complet...")
model_rf_velo = create_pipeline(
    RandomForestClassifier(**best_params_rf_velo, random_state=RANDOM_STATE, n_jobs=nb_workers, class_weight="balanced")
)
model_rf_velo.fit(X_train_velo, y_train_velo)
y_pred_rf_velo = model_rf_velo.predict(X_test_velo)
f1_rf_velo_test = f1_score(y_test_velo, y_pred_rf_velo, average="weighted")
print(f"F1 sur test set: {f1_rf_velo_test:.3f}")

In [None]:
# === HYPEROPT XGBOOST - VELO_EDP ===
print("=" * 60)
print("HYPEROPT XGBoost - VELO_EDP")
print("=" * 60)

best_params_xgb_velo, best_f1_xgb_velo, trials_xgb_velo = optimize_boosting_model(
    X_sample_velo,
    y_sample_velo,
    model_type="xgboost",
    max_evals=max_evals,
    cv=3,
    scoring="f1",
    random_state=RANDOM_STATE,
    n_jobs=nb_workers,
)

print("\nValidation sur test set complet...")
imputer = SimpleImputer(strategy="median")
X_train_velo_imp = pd.DataFrame(imputer.fit_transform(X_train_velo), columns=X_train_velo.columns)
X_test_velo_imp = pd.DataFrame(imputer.transform(X_test_velo), columns=X_test_velo.columns)

n_neg = (y_train_velo == 0).sum()
n_pos = (y_train_velo == 1).sum()
scale_pos_weight_velo = n_neg / n_pos

model_xgb_velo = XGBClassifier(
    **best_params_xgb_velo, random_state=RANDOM_STATE, scale_pos_weight=scale_pos_weight_velo, verbosity=0
)
model_xgb_velo.fit(X_train_velo_imp, y_train_velo)
y_pred_xgb_velo = model_xgb_velo.predict(X_test_velo_imp)
f1_xgb_velo_test = f1_score(y_test_velo, y_pred_xgb_velo, average="weighted")
print(f"F1 sur test set: {f1_xgb_velo_test:.3f}")

In [None]:
# === HYPEROPT LIGHTGBM - VELO_EDP ===
print("=" * 60)
print("HYPEROPT LightGBM - VELO_EDP")
print("=" * 60)

best_params_lgbm_velo, best_f1_lgbm_velo, trials_lgbm_velo = optimize_boosting_model(
    X_sample_velo,
    y_sample_velo,
    model_type="lightgbm",
    max_evals=max_evals,
    cv=3,
    scoring="f1",
    random_state=RANDOM_STATE,
    n_jobs=nb_workers,
)

print("\nValidation sur test set complet...")
model_lgbm_velo = create_pipeline(
    LGBMClassifier(**best_params_lgbm_velo, random_state=RANDOM_STATE, class_weight="balanced", verbose=-1)
)
model_lgbm_velo.fit(X_train_velo, y_train_velo)
y_pred_lgbm_velo = model_lgbm_velo.predict(X_test_velo)
f1_lgbm_velo_test = f1_score(y_test_velo, y_pred_lgbm_velo, average="weighted")
print(f"F1 sur test set: {f1_lgbm_velo_test:.3f}")

In [None]:
# === HYPEROPT CATBOOST - VELO_EDP ===
print("=" * 60)
print("HYPEROPT CatBoost - VELO_EDP")
print("=" * 60)

best_params_cb_velo, best_f1_cb_velo, trials_cb_velo = optimize_boosting_model(
    X_sample_velo,
    y_sample_velo,
    model_type="catboost",
    max_evals=max_evals,
    cv=3,
    scoring="f1",
    random_state=RANDOM_STATE,
    n_jobs=nb_workers,
)

print("\nValidation sur test set complet...")
imputer = SimpleImputer(strategy="median")
X_train_velo_imp = pd.DataFrame(imputer.fit_transform(X_train_velo), columns=X_train_velo.columns)
X_test_velo_imp = pd.DataFrame(imputer.transform(X_test_velo), columns=X_test_velo.columns)

model_cb_velo = CatBoostClassifier(
    **best_params_cb_velo,
    random_state=RANDOM_STATE,
    auto_class_weights="Balanced",
    verbose=False,
    allow_writing_files=False,
)
model_cb_velo.fit(X_train_velo_imp, y_train_velo)
y_pred_cb_velo = model_cb_velo.predict(X_test_velo_imp)
f1_cb_velo_test = f1_score(y_test_velo, y_pred_cb_velo, average="weighted")
print(f"F1 sur test set: {f1_cb_velo_test:.3f}")

In [None]:
# === COMPARISON VELO_EDP ===
print("=" * 70)
print("COMPARAISON DES MODELES - VELO_EDP")
print("=" * 70)

models_eval_velo = []

# Tuned RF
model = create_pipeline(
    RandomForestClassifier(**best_params_rf_velo, random_state=RANDOM_STATE, n_jobs=nb_workers, class_weight="balanced")
)
res, y_pred, y_proba = evaluate_model(model, X_train_velo, X_test_velo, y_train_velo, y_test_velo, "RF_HyperOpt")
models_eval_velo.append({"name": "RF_HyperOpt", "y_true": y_test_velo, "y_pred": y_pred, "y_proba": y_proba})

# Tuned XGBoost
model = create_pipeline(
    XGBClassifier(
        **best_params_xgb_velo, random_state=RANDOM_STATE, n_jobs=nb_workers, scale_pos_weight=scale_pos_weight_velo
    )
)
res, y_pred, y_proba = evaluate_model(model, X_train_velo, X_test_velo, y_train_velo, y_test_velo, "XGBoost")
models_eval_velo.append({"name": "XGBoost", "y_true": y_test_velo, "y_pred": y_pred, "y_proba": y_proba})

# Tuned LightGBM
model = create_pipeline(
    LGBMClassifier(
        **best_params_lgbm_velo, random_state=RANDOM_STATE, n_jobs=nb_workers, class_weight="balanced", verbose=-1
    )
)
res, y_pred, y_proba = evaluate_model(model, X_train_velo, X_test_velo, y_train_velo, y_test_velo, "LightGBM")
models_eval_velo.append({"name": "LightGBM", "y_true": y_test_velo, "y_pred": y_pred, "y_proba": y_proba})

# Tuned CatBoost
model_cb = CatBoostClassifier(
    **best_params_cb_velo,
    random_state=RANDOM_STATE,
    auto_class_weights="Balanced",
    verbose=False,
    allow_writing_files=False,
)
res, y_pred, y_proba = evaluate_catboost(model_cb, X_train_velo, X_test_velo, y_train_velo, y_test_velo, "CatBoost")
models_eval_velo.append({"name": "CatBoost", "y_true": y_test_velo, "y_pred": y_pred, "y_proba": y_proba})

# Display
display_metrics(models_results=models_eval_velo, class_labels=["Non grave", "Grave"])

In [None]:
# === BEST MODEL SELECTION VELO_EDP ===
best_model_velo = select_best_model(models_eval_velo, metric="f1")
print(f"\nMeilleur modele VELO_EDP: {best_model_velo['name']}")
print(f"F1-Score: {best_model_velo['score']:.4f}")
print(f"Amelioration vs baseline (0.557): {(best_model_velo['score'] - 0.557) * 100:+.2f}%")

---

## 5. Fine-tuning PIETON (pedestrians)

### Characteristics of this category

| Characteristic | Value |
|----------------|-------|
| Dataset size | ~46k rows |
| Severe-injury rate | ~33% |
| Baseline F1 (4a) | 0.596 |
| Target | Exceed 0.596 |

**Note**: Pedestrians have no vehicle, so this dataset has fewer vehicle-related features.

In [None]:
# === PREPARATION PIETON ===
cat = "pieton"
print(f"=== PREPARATION {cat.upper()} ===")

df_pieton = datasets[cat]
X_pieton, y_pieton = prepare_data(df_pieton)

X_train_pieton, X_test_pieton, y_train_pieton, y_test_pieton = train_test_split(
    X_pieton, y_pieton, test_size=0.2, random_state=RANDOM_STATE, stratify=y_pieton
)

print(f"Features: {X_pieton.columns.tolist()}")
print(f"\nTrain: {len(X_train_pieton):,} | Test: {len(X_test_pieton):,}")
print(f"Taux grave train: {y_train_pieton.mean() * 100:.1f}%")
print(f"Taux grave test: {y_test_pieton.mean() * 100:.1f}%")

# Subsample for Hyperopt
sample_size_pieton = min(25000, len(X_train_pieton))
X_sample_pieton = X_train_pieton.sample(n=sample_size_pieton, random_state=RANDOM_STATE)
y_sample_pieton = y_train_pieton.loc[X_sample_pieton.index]
print(f"\nHyperopt sur {sample_size_pieton:,} echantillons")

In [None]:
# === HYPEROPT RANDOMFOREST - PIETON ===
print("=" * 60)
print("HYPEROPT RandomForest - PIETON")
print("=" * 60)

best_params_rf_pieton, best_f1_rf_pieton, trials_rf_pieton = optimize_boosting_model(
    X_sample_pieton,
    y_sample_pieton,
    model_type="randomforest",
    max_evals=max_evals,
    cv=3,
    scoring="f1",
    random_state=RANDOM_STATE,
    n_jobs=nb_workers,
)

print("\nValidation sur test set complet...")
model_rf_pieton = create_pipeline(
    RandomForestClassifier(
        **best_params_rf_pieton, random_state=RANDOM_STATE, n_jobs=nb_workers, class_weight="balanced"
    )
)
model_rf_pieton.fit(X_train_pieton, y_train_pieton)
y_pred_rf_pieton = model_rf_pieton.predict(X_test_pieton)
f1_rf_pieton_test = f1_score(y_test_pieton, y_pred_rf_pieton, average="weighted")
print(f"F1 sur test set: {f1_rf_pieton_test:.3f}")

In [None]:
# === HYPEROPT XGBOOST - PIETON ===
print("=" * 60)
print("HYPEROPT XGBoost - PIETON")
print("=" * 60)

best_params_xgb_pieton, best_f1_xgb_pieton, trials_xgb_pieton = optimize_boosting_model(
    X_sample_pieton,
    y_sample_pieton,
    model_type="xgboost",
    max_evals=max_evals,
    cv=3,
    scoring="f1",
    random_state=RANDOM_STATE,
    n_jobs=nb_workers,
)

print("\nValidation sur test set complet...")
imputer = SimpleImputer(strategy="median")
X_train_pieton_imp = pd.DataFrame(imputer.fit_transform(X_train_pieton), columns=X_train_pieton.columns)
X_test_pieton_imp = pd.DataFrame(imputer.transform(X_test_pieton), columns=X_test_pieton.columns)

n_neg = (y_train_pieton == 0).sum()
n_pos = (y_train_pieton == 1).sum()
scale_pos_weight_pieton = n_neg / n_pos

model_xgb_pieton = XGBClassifier(
    **best_params_xgb_pieton, random_state=RANDOM_STATE, scale_pos_weight=scale_pos_weight_pieton, verbosity=0
)
model_xgb_pieton.fit(X_train_pieton_imp, y_train_pieton)
y_pred_xgb_pieton = model_xgb_pieton.predict(X_test_pieton_imp)
f1_xgb_pieton_test = f1_score(y_test_pieton, y_pred_xgb_pieton, average="weighted")
print(f"F1 sur test set: {f1_xgb_pieton_test:.3f}")

In [None]:
# === HYPEROPT LIGHTGBM - PIETON ===
print("=" * 60)
print("HYPEROPT LightGBM - PIETON")
print("=" * 60)

best_params_lgbm_pieton, best_f1_lgbm_pieton, trials_lgbm_pieton = optimize_boosting_model(
    X_sample_pieton,
    y_sample_pieton,
    model_type="lightgbm",
    max_evals=max_evals,
    cv=3,
    scoring="f1",
    random_state=RANDOM_STATE,
    n_jobs=nb_workers,
)

print("\nValidation sur test set complet...")
model_lgbm_pieton = create_pipeline(
    LGBMClassifier(**best_params_lgbm_pieton, random_state=RANDOM_STATE, class_weight="balanced", verbose=-1)
)
model_lgbm_pieton.fit(X_train_pieton, y_train_pieton)
y_pred_lgbm_pieton = model_lgbm_pieton.predict(X_test_pieton)
f1_lgbm_pieton_test = f1_score(y_test_pieton, y_pred_lgbm_pieton, average="weighted")
print(f"F1 sur test set: {f1_lgbm_pieton_test:.3f}")

In [None]:
# === HYPEROPT CATBOOST - PIETON ===
print("=" * 60)
print("HYPEROPT CatBoost - PIETON")
print("=" * 60)

best_params_cb_pieton, best_f1_cb_pieton, trials_cb_pieton = optimize_boosting_model(
    X_sample_pieton,
    y_sample_pieton,
    model_type="catboost",
    max_evals=max_evals,
    cv=3,
    scoring="f1",
    random_state=RANDOM_STATE,
    n_jobs=nb_workers,
)

print("\nValidation sur test set complet...")
imputer = SimpleImputer(strategy="median")
X_train_pieton_imp = pd.DataFrame(imputer.fit_transform(X_train_pieton), columns=X_train_pieton.columns)
X_test_pieton_imp = pd.DataFrame(imputer.transform(X_test_pieton), columns=X_test_pieton.columns)

model_cb_pieton = CatBoostClassifier(
    **best_params_cb_pieton,
    random_state=RANDOM_STATE,
    auto_class_weights="Balanced",
    verbose=False,
    allow_writing_files=False,
)
model_cb_pieton.fit(X_train_pieton_imp, y_train_pieton)
y_pred_cb_pieton = model_cb_pieton.predict(X_test_pieton_imp)
f1_cb_pieton_test = f1_score(y_test_pieton, y_pred_cb_pieton, average="weighted")
print(f"F1 sur test set: {f1_cb_pieton_test:.3f}")

In [None]:
# === COMPARISON PIETON ===
print("=" * 70)
print("COMPARAISON DES MODELES - PIETON")
print("=" * 70)

models_eval_pieton = []

# Tuned RF
model = create_pipeline(
    RandomForestClassifier(
        **best_params_rf_pieton, random_state=RANDOM_STATE, n_jobs=nb_workers, class_weight="balanced"
    )
)
res, y_pred, y_proba = evaluate_model(
    model, X_train_pieton, X_test_pieton, y_train_pieton, y_test_pieton, "RF_HyperOpt"
)
models_eval_pieton.append({"name": "RF_HyperOpt", "y_true": y_test_pieton, "y_pred": y_pred, "y_proba": y_proba})

# Tuned XGBoost
model = create_pipeline(
    XGBClassifier(
        **best_params_xgb_pieton, random_state=RANDOM_STATE, n_jobs=nb_workers, scale_pos_weight=scale_pos_weight_pieton
    )
)
res, y_pred, y_proba = evaluate_model(model, X_train_pieton, X_test_pieton, y_train_pieton, y_test_pieton, "XGBoost")
models_eval_pieton.append({"name": "XGBoost", "y_true": y_test_pieton, "y_pred": y_pred, "y_proba": y_proba})

# Tuned LightGBM
model = create_pipeline(
    LGBMClassifier(
        **best_params_lgbm_pieton, random_state=RANDOM_STATE, n_jobs=nb_workers, class_weight="balanced", verbose=-1
    )
)
res, y_pred, y_proba = evaluate_model(model, X_train_pieton, X_test_pieton, y_train_pieton, y_test_pieton, "LightGBM")
models_eval_pieton.append({"name": "LightGBM", "y_true": y_test_pieton, "y_pred": y_pred, "y_proba": y_proba})

# Tuned CatBoost
model_cb = CatBoostClassifier(
    **best_params_cb_pieton,
    random_state=RANDOM_STATE,
    auto_class_weights="Balanced",
    verbose=False,
    allow_writing_files=False,
)
res, y_pred, y_proba = evaluate_catboost(
    model_cb, X_train_pieton, X_test_pieton, y_train_pieton, y_test_pieton, "CatBoost"
)
models_eval_pieton.append({"name": "CatBoost", "y_true": y_test_pieton, "y_pred": y_pred, "y_proba": y_proba})

# Display
display_metrics(models_results=models_eval_pieton, class_labels=["Non grave", "Grave"])

In [None]:
# === BEST MODEL SELECTION PIETON ===
best_model_pieton = select_best_model(models_eval_pieton, metric="f1")
print(f"\nMeilleur modele PIETON: {best_model_pieton['name']}")
print(f"F1-Score: {best_model_pieton['score']:.4f}")
print(f"Amelioration vs baseline (0.596): {(best_model_pieton['score'] - 0.596) * 100:+.2f}%")

---

## 6. Summary and saving of the optimized models

In [None]:
# === OVERALL SUMMARY ===
print("=" * 70)
print("RESUME DES RESULTATS FINE-TUNING")
print("=" * 70)

print(f"\n{'Categorie':<20} {'Baseline F1':>12} {'Best Model':>15} {'Best F1':>10} {'Gain':>10}")
print("-" * 70)

results_summary = [
    ("vehicule_leger", 0.640, best_model_vl["name"], best_model_vl["score"]),
    ("velo_edp", 0.557, best_model_velo["name"], best_model_velo["score"]),
    ("pieton", 0.596, best_model_pieton["name"], best_model_pieton["score"]),
]

for cat, baseline, model_name, score in results_summary:
    gain = (score - baseline) * 100
    print(f"{cat:<20} {baseline:>12.3f} {model_name:>15} {score:>10.3f} {gain:>+9.2f}%")

In [None]:
# === SAVE VEHICULE_LEGER MODEL ===
print("=" * 70)
print("SAVING VEHICULE_LEGER MODEL")
print("=" * 70)

X_vl_full, y_vl_full = prepare_data(datasets["vehicule_leger"])
# Negative-to-positive class ratio, passed to XGBoost as scale_pos_weight.
pos_weight_vl = (y_vl_full == 0).sum() / (y_vl_full == 1).sum()

model_configs_vl = {
    "RF_HyperOpt": lambda: create_pipeline(
        RandomForestClassifier(
            **best_params_rf_vl, random_state=RANDOM_STATE, n_jobs=nb_workers, class_weight="balanced"
        )
    ),
    "XGBoost": lambda: create_pipeline(
        XGBClassifier(
            **best_params_xgb_vl, random_state=RANDOM_STATE, n_jobs=nb_workers, scale_pos_weight=pos_weight_vl
        )
    ),
    "LightGBM": lambda: create_pipeline(
        LGBMClassifier(
            **best_params_lgbm_vl, random_state=RANDOM_STATE, n_jobs=nb_workers, class_weight="balanced", verbose=-1
        )
    ),
    "CatBoost": lambda: CatBoostClassifier(
        **best_params_cb_vl,
        random_state=RANDOM_STATE,
        auto_class_weights="Balanced",
        verbose=False,
        allow_writing_files=False,
    ),
}

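# Note: X_full includes the test rows. If save_best_model scores on X_test
# after refitting on X_full, that score is optimistic.
result_vl = save_best_model(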
    best_model_name=best_model_vl["name"],
    model_configs=model_configs_vl,
    X_full=X_vl_full,
    y_full=y_vl_full,
    X_test=X_test_vl,
    y_test=y_test_vl,
    save_path="models/model_passager_vehicule_leger_optimized.joblib",
)

In [None]:
# === SAVE VELO_EDP MODEL ===
print("=" * 70)
print("SAVING VELO_EDP MODEL")
print("=" * 70)

X_velo_full, y_velo_full = prepare_data(datasets["velo_edp"])
# Negative-to-positive class ratio, passed to XGBoost as scale_pos_weight.
pos_weight_velo = (y_velo_full == 0).sum() / (y_velo_full == 1).sum()

model_configs_velo = {
    "RF_HyperOpt": lambda: create_pipeline(
        RandomForestClassifier(
            **best_params_rf_velo, random_state=RANDOM_STATE, n_jobs=nb_workers, class_weight="balanced"
        )
    ),
    "XGBoost": lambda: create_pipeline(
        XGBClassifier(
            **best_params_xgb_velo, random_state=RANDOM_STATE, n_jobs=nb_workers, scale_pos_weight=pos_weight_velo
        )
    ),
    "LightGBM": lambda: create_pipeline(
        LGBMClassifier(
            **best_params_lgbm_velo, random_state=RANDOM_STATE, n_jobs=nb_workers, class_weight="balanced", verbose=-1
        )
    ),
    "CatBoost": lambda: CatBoostClassifier(
        **best_params_cb_velo,
        random_state=RANDOM_STATE,
        auto_class_weights="Balanced",
        verbose=False,
        allow_writing_files=False,
    ),
}

result_velo = save_best_model(
    best_model_name=best_model_velo["name"],
    model_configs=model_configs_velo,
    X_full=X_velo_full,
    y_full=y_velo_full,
    X_test=X_test_velo,
    y_test=y_test_velo,
    save_path="models/model_passager_velo_edp_optimized.joblib",
)

In [None]:
# === SAVE PIETON MODEL ===
print("=" * 70)
print("SAVING PIETON MODEL")
print("=" * 70)

X_pieton_full, y_pieton_full = prepare_data(datasets["pieton"])
# Negative-to-positive class ratio, passed to XGBoost as scale_pos_weight.
pos_weight_pieton = (y_pieton_full == 0).sum() / (y_pieton_full == 1).sum()

model_configs_pieton = {
    "RF_HyperOpt": lambda: create_pipeline(
        RandomForestClassifier(
            **best_params_rf_pieton, random_state=RANDOM_STATE, n_jobs=nb_workers, class_weight="balanced"
        )
    ),
    "XGBoost": lambda: create_pipeline(
        XGBClassifier(
            **best_params_xgb_pieton, random_state=RANDOM_STATE, n_jobs=nb_workers, scale_pos_weight=pos_weight_pieton
        )
    ),
    "LightGBM": lambda: create_pipeline(
        LGBMClassifier(
            **best_params_lgbm_pieton, random_state=RANDOM_STATE, n_jobs=nb_workers, class_weight="balanced", verbose=-1
        )
    ),
    "CatBoost": lambda: CatBoostClassifier(
        **best_params_cb_pieton,
        random_state=RANDOM_STATE,
        auto_class_weights="Balanced",
        verbose=False,
        allow_writing_files=False,
    ),
}

result_pieton = save_best_model(
    best_model_name=best_model_pieton["name"],
    model_configs=model_configs_pieton,
    X_full=X_pieton_full,
    y_full=y_pieton_full,
    X_test=X_test_pieton,
    y_test=y_test_pieton,
    save_path="models/model_passager_pieton_optimized.joblib",
)

In [None]:
# === SAVE THE BEST HYPERPARAMETERS ===
hyperparams_all = {
    "vehicule_leger": {
        "best_model": best_model_vl["name"],
        "best_f1": best_model_vl["score"],
        "params_rf": best_params_rf_vl,
        "params_xgb": best_params_xgb_vl,
        "params_lgbm": best_params_lgbm_vl,
        "params_cb": best_params_cb_vl,
    },
    "velo_edp": {
        "best_model": best_model_velo["name"],
        "best_f1": best_model_velo["score"],
        "params_rf": best_params_rf_velo,
        "params_xgb": best_params_xgb_velo,
        "params_lgbm": best_params_lgbm_velo,
        "params_cb": best_params_cb_velo,
    },
    "pieton": {
        "best_model": best_model_pieton["name"],
        "best_f1": best_model_pieton["score"],
        "params_rf": best_params_rf_pieton,
        "params_xgb": best_params_xgb_pieton,
        "params_lgbm": best_params_lgbm_pieton,
        "params_cb": best_params_cb_pieton,
    },
}

joblib.dump(hyperparams_all, "models/hyperparams_passagers_finetuned.joblib")
print("Hyperparameters saved: models/hyperparams_passagers_finetuned.joblib")

---

## Summary and conclusions

### Saved models

| File | Category | Usage |
|------|----------|-------|
| `model_passager_vehicule_leger_optimized.joblib` | Motorized two-wheelers (2RM), quads | **Production** |
| `model_passager_velo_edp_optimized.joblib` | Bicycles, scooters | **Production** |
| `model_passager_pieton_optimized.joblib` | Pedestrians | **Production** |
| `model_passager_global.joblib` (notebook 4a) | Cars, heavy trucks | **Production** |

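### Recommended prediction architecture

```python
import joblib

# Categories served by a specialized model; everything else uses the global model.
SPECIALIZED_CATEGORIES = ("vehicule_leger", "velo_edp", "pieton")


def predict_gravite_passager(features, categorie_vehicule):
    if categorie_vehicule in SPECIALIZED_CATEGORIES:
        # Use the specialized model
        model = joblib.load(f"models/model_passager_{categorie_vehicule}_optimized.joblib")
    else:
        # voiture or poids_lourd -> global model
        # (assumed to be stored in models/ alongside the specialized ones)
        model = joblib.load("models/model_passager_global.joblib")
    return model.predict(features)
```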
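
The routing function above reads a model from disk on every call and silently sends any unrecognized category to the global model. A possible variant, sketched below, caches each loaded model and rejects unknown categories. The names `model_path`, `load_model`, `predict_gravite_passager_cached` and `MODELS_DIR` are hypothetical, and placing the global model in `models/` is an assumption.

```python
from functools import lru_cache
from pathlib import Path

MODELS_DIR = Path("models")  # assumption: all four model files live here
SPECIALIZED = frozenset({"vehicule_leger", "velo_edp", "pieton"})
GLOBAL = frozenset({"voiture", "poids_lourd"})


def model_path(categorie_vehicule: str) -> Path:
    """Return the model file for a category; reject unknown categories."""
    if categorie_vehicule in SPECIALIZED:
        return MODELS_DIR / f"model_passager_{categorie_vehicule}_optimized.joblib"
    if categorie_vehicule in GLOBAL:
        return MODELS_DIR / "model_passager_global.joblib"
    raise ValueError(f"Unknown vehicle category: {categorie_vehicule!r}")


@lru_cache(maxsize=None)
def load_model(path: Path):
    """Load a model once per path and reuse it on later calls."""
    import joblib  # imported lazily so model_path() works without joblib

    return joblib.load(path)


def predict_gravite_passager_cached(features, categorie_vehicule: str):
    """Route to the right model, loading it from disk only on first use."""
    model = load_model(model_path(categorie_vehicule))
    return model.predict(features)
```

Raising on an unknown category is a design choice: a typo such as `"pietons"` then fails loudly instead of being scored by the global model.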

### Next steps

1. **API deployment**: Implement the routing logic above
2. **Monitoring**: Track per-category performance in production
3. **Retraining**: Schedule periodic retraining on new data
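
The monitoring step could start from something as small as the sketch below. It recomputes F1 per category from logged `(categorie, y_true, y_pred)` triples and flags categories that fall below the baseline F1 values quoted in this notebook. The record format, the function names, and the use of the baseline as an alert threshold are assumptions for illustration; labels are taken to be 0 = Non grave, 1 = Grave.

```python
from collections import defaultdict

# Baseline F1 per specialized category, as quoted in the summary cell.
BASELINE_F1 = {"vehicule_leger": 0.640, "velo_edp": 0.557, "pieton": 0.596}


def f1_by_category(records):
    """Compute F1 of the positive class (1 = Grave) for each category.

    records: iterable of (categorie, y_true, y_pred) with 0/1 labels.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for categorie, y_true, y_pred in records:
        c = counts[categorie]  # creates the entry even for true negatives
        if y_pred == 1 and y_true == 1:
            c["tp"] += 1
        elif y_pred == 1 and y_true == 0:
            c["fp"] += 1
        elif y_pred == 0 and y_true == 1:
            c["fn"] += 1
    scores = {}
    for categorie, c in counts.items():
        denom = 2 * c["tp"] + c["fp"] + c["fn"]
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when there are no positives.
        scores[categorie] = 2 * c["tp"] / denom if denom else 0.0
    return scores


def categories_below_baseline(scores, baselines=BASELINE_F1):
    """Return the categories whose observed F1 is under their baseline."""
    return sorted(
        cat for cat, f1 in scores.items() if cat in baselines and f1 < baselines[cat]
    )
```

In production, true labels usually arrive with a delay, so a check like this would run on a batch of past predictions once their outcomes are known.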