# CY Tech 2026 — Prediction de la sinistralité (baseline robuste)
Objectif: prédire une **prime** (coût attendu) via une approche **2-parties** :
- **Fréquence** : $p(x)=P(	ext{sinistre}>0 \mid x)$
- **Gravité** : $m(x)=E[	ext{montant}\mid 	ext{sinistre}>0, x]$

Prime prédite : $\hat{prime}(x)=\hat{p}(x)\times\hat{m}(x)$

Ce notebook inclut :
- EDA rapide
- split **anti-fuite** (GroupKFold sur `id_client`)
- split **type production** (Group + temps basé sur `index` si pertinent)
- baseline CatBoost + hooks pour calibration & smearing

In [1]:
import numpy as np
import pandas as pd

TRAIN_PATH = "train.csv"
TEST_PATH  = "test.csv"
SUB_PATH   = "prime_pred_sandbox.csv"

train = pd.read_csv(TRAIN_PATH)
test  = pd.read_csv(TEST_PATH)
sub_ex = pd.read_csv(SUB_PATH)

print("Train:", train.shape)
print("Test :", test.shape)
print("Sample submission cols:", list(sub_ex.columns))
train.head()

Train: (50000, 33)
Test : (50000, 28)
Sample submission cols: ['index', 'pred']


Unnamed: 0,index,id_client,id_vehicule,id_contrat,bonus,type_contrat,duree_contrat,anciennete_info,freq_paiement,paiement,...,marque_vehicule,modele_vehicule,debut_vente_vehicule,fin_vente_vehicule,vitesse_vehicule,type_vehicule,prix_vehicule,poids_vehicule,nombre_sinistres,montant_sinistre
0,0,A00000001,V01,A00000001-V01,0.5,Maxi,29,9,Biannual,No,...,PEUGEOT,306,10,9,182,Tourism,20700,1210,0,0.0
1,1,A00000002,V01,A00000002-V01,0.5,Maxi,3,1,Biannual,No,...,MERCEDES BENZ,C220,4,2,229,Tourism,34250,1510,0,0.0
2,2,A00000003,V01,A00000003-V01,0.5,Maxi,2,2,Yearly,No,...,BMW,Z3,12,11,210,Tourism,28661,1270,0,0.0
3,3,A00000004,V01,A00000004-V01,0.5,Median2,22,1,Yearly,No,...,VOLKSWAGEN,GOLF,18,15,180,Tourism,14407,1020,0,0.0
4,4,A00000005,V01,A00000005-V01,0.5,Maxi,16,4,Biannual,No,...,RENAULT,LAGUNA,13,11,195,Tourism,16770,1230,0,0.0


In [2]:

# Colonnes
target_amount = "montant_sinistre"
target_freq   = "nombre_sinistres"  # ici c'est 0/1 (vérifié)
group_col     = "id_client"
id_col        = "index"

# Features communes train/test (on exclut l'id)
common_cols = [c for c in train.columns if c in test.columns]
feature_cols = [c for c in common_cols if c != id_col]

# Categorical
cat_cols = [c for c in feature_cols if train[c].dtype == "object"]

print("n_features:", len(feature_cols))
print("n_cat:", len(cat_cols))

n_features: 27
n_cat: 12


In [3]:

# Cibles
y_freq = (train[target_amount] > 0).astype(int)
y_cost = train[target_amount].astype(float)

print("Claim rate:", y_freq.mean())
print("Cost mean:", y_cost.mean(), "| max:", y_cost.max())

Claim rate: 0.05834
Cost mean: 103.369344 | max: 21826.96


In [4]:

# Missingness utile (on s'attend à beaucoup de NA sur sex_conducteur2)
miss = train[feature_cols].isna().mean().sort_values(ascending=False)
miss.head(20)

sex_conducteur2        0.66812
anciennete_vehicule    0.00002
bonus                  0.00000
anciennete_info        0.00000
freq_paiement          0.00000
paiement               0.00000
utilisation            0.00000
code_postal            0.00000
conducteur2            0.00000
type_contrat           0.00000
duree_contrat          0.00000
age_conducteur2        0.00000
age_conducteur1        0.00000
anciennete_permis1     0.00000
sex_conducteur1        0.00000
anciennete_permis2     0.00000
cylindre_vehicule      0.00000
din_vehicule           0.00000
essence_vehicule       0.00000
marque_vehicule        0.00000
dtype: float64

## Prétraitements "métier" simples
- Si `conducteur2 == "No"` : `age_conducteur2` et `anciennete_permis2` à NA (au lieu de 0)
- Valeurs 0 "techniques" : `poids_vehicule==0` et `cylindre_vehicule==0` -> NA
- CatBoost n'accepte pas NaN dans les variables catégorielles : on remplace par "NA"

In [5]:

def preprocess_for_catboost(df: pd.DataFrame, feature_cols: list[str], cat_cols: list[str]) -> pd.DataFrame:
    df = df.copy()

    if "conducteur2" in df.columns:
        mask_no = df["conducteur2"].astype(str).str.lower().eq("no")
        for col in ["age_conducteur2", "anciennete_permis2"]:
            if col in df.columns:
                df.loc[mask_no, col] = np.nan

    for col in ["poids_vehicule", "cylindre_vehicule"]:
        if col in df.columns:
            df.loc[df[col] == 0, col] = np.nan

    # Cat -> str + NA token
    for col in cat_cols:
        df[col] = df[col].where(df[col].notna(), "NA").astype(str)

    return df[feature_cols]

X_train = preprocess_for_catboost(train, feature_cols, cat_cols)
X_test  = preprocess_for_catboost(test,  feature_cols, cat_cols)

## Split 1 — Anti-fuite (GroupKFold sur `id_client`)
Recommandé si des clients apparaissent plusieurs fois (c'est le cas ici).

In [6]:

from sklearn.model_selection import GroupKFold

def group_kfold_splits(train_df: pd.DataFrame, n_splits=5):
    gkf = GroupKFold(n_splits=n_splits)
    groups = train_df[group_col].values
    for fold, (tr_idx, va_idx) in enumerate(gkf.split(train_df, y_freq, groups)):
        yield fold, tr_idx, va_idx

## Split 2 — "production" (Group + temps)
Si `index` reflète un ordre temporel (train=passé, test=futur), ce split est souvent plus proche du Kaggle private.

Principe:
- on associe à chaque `id_client` un **temps** (ex: max index du client)
- on trie les clients par ce temps
- on fait un split *forward-chaining* (train sur le passé, val sur une fenêtre du futur)

In [7]:

def group_time_splits(train_df: pd.DataFrame, n_splits=5):
    # temps du groupe = max index (modifiable: mean, last, etc.)
    grp_time = train_df.groupby(group_col)[id_col].max().sort_values()
    groups_sorted = grp_time.index.to_numpy()

    fold_sizes = np.full(n_splits, len(groups_sorted)//n_splits, dtype=int)
    fold_sizes[:len(groups_sorted)%n_splits] += 1

    current = 0
    folds = []
    for fs in fold_sizes:
        folds.append(groups_sorted[current:current+fs])
        current += fs

    # forward chaining
    for fold in range(1, n_splits):
        val_groups = folds[fold]
        train_groups = np.concatenate(folds[:fold])
        tr_idx = train_df.index[train_df[group_col].isin(train_groups)].to_numpy()
        va_idx = train_df.index[train_df[group_col].isin(val_groups)].to_numpy()
        yield fold, tr_idx, va_idx

# Exemple
next(group_time_splits(train, n_splits=5))

(1,
 array([    0,     1,     2, ..., 10007, 10008, 10009], shape=(10010,)),
 array([10010, 10011, 10012, ..., 20004, 20005, 20006], shape=(9997,)))

## Modèles — baseline CatBoost (freq + sev(log))
Hooks:
- calibration proba (isotonic / platt) sur les prédictions OOF
- smearing (Duan) pour corriger le biais du retour log

In [8]:

from catboost import CatBoostClassifier, CatBoostRegressor, Pool
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

def fit_predict_two_part(
    X_train, y_freq, y_cost, cat_cols,
    tr_idx, va_idx,
    seed=42,
    freq_params=None,
    sev_params=None,
    use_smearing=True
):
    freq_params = freq_params or dict(
        loss_function="Logloss",
        iterations=3000,
        learning_rate=0.03,
        depth=7,
        l2_leaf_reg=6,
        random_seed=seed,
        verbose=False,
        od_type="Iter",
        od_wait=200,
    )
    sev_params = sev_params or dict(
        loss_function="RMSE",
        iterations=6000,
        learning_rate=0.03,
        depth=8,
        l2_leaf_reg=8,
        random_seed=seed,
        verbose=False,
        od_type="Iter",
        od_wait=300,
    )

    X_tr, X_va = X_train.iloc[tr_idx], X_train.iloc[va_idx]
    y_trf, y_vaf = y_freq.iloc[tr_idx], y_freq.iloc[va_idx]

    # 1) Fréquence
    clf = CatBoostClassifier(**freq_params)
    clf.fit(Pool(X_tr, y_trf, cat_features=cat_cols),
            eval_set=Pool(X_va, y_vaf, cat_features=cat_cols),
            use_best_model=True)
    p_va = clf.predict_proba(Pool(X_va, cat_features=cat_cols))[:, 1]

    # 2) Gravité (sur sinistrés du train fold)
    pos_tr = y_trf.values == 1
    X_tr_pos = X_tr.loc[pos_tr]
    y_tr_sev = np.log1p(y_cost.iloc[tr_idx].values[pos_tr])

    reg = CatBoostRegressor(**sev_params)
    reg.fit(Pool(X_tr_pos, y_tr_sev, cat_features=cat_cols),
            eval_set=Pool(X_va, np.zeros(len(X_va)), cat_features=cat_cols),
            use_best_model=True)

    z_va = reg.predict(Pool(X_va, cat_features=cat_cols))
    m_va = np.expm1(z_va)

    # Smearing: corrige le biais exp()
    if use_smearing:
        z_tr_pos = reg.predict(Pool(X_tr_pos, cat_features=cat_cols))
        resid = y_tr_sev - z_tr_pos
        smear = float(np.mean(np.exp(resid)))
        m_va = smear * np.exp(z_va) - 1.0
        m_va = np.maximum(m_va, 0.0)

    prime_va = p_va * m_va
    return prime_va, clf, reg

In [9]:

# Choisis un splitter:
# SPLITTER = group_kfold_splits
SPLITTER = group_time_splits

oof = np.zeros(len(train))
fold_scores = []

for fold, tr_idx, va_idx in SPLITTER(train, n_splits=5):
    prime_va, _, _ = fit_predict_two_part(
        X_train, y_freq, y_cost, cat_cols,
        tr_idx, va_idx,
        seed=42+fold,
        use_smearing=True
    )
    oof[va_idx] = prime_va
    score = rmse(y_cost.iloc[va_idx], prime_va)
    fold_scores.append(score)
    print(f"Fold {fold} RMSE: {score:.4f}")

print("\nOOF RMSE:", rmse(y_cost, oof))
print("Fold mean ± std:", float(np.mean(fold_scores)), float(np.std(fold_scores)))
print("Baseline RMSE si pred=0:", rmse(y_cost, np.zeros_like(oof)))

Fold 1 RMSE: 492.7022
Fold 2 RMSE: 610.8778
Fold 3 RMSE: 546.0255
Fold 4 RMSE: 514.5740

OOF RMSE: 544.9952769107592
Fold mean ± std: 541.0449020356975 44.55101173346713
Baseline RMSE si pred=0: 555.106812072628


## Calibration de la fréquence (optionnel mais souvent rentable)
À appliquer en **cross-fit** :
- récupérer les proba OOF du modèle fréquence
- fitter un calibrateur (isotonic ou logistic)
- recalibrer les proba test

⚠️ Ne pas fitter le calibrateur sur les mêmes points que ceux qu’on évalue (sinon fuite).

In [10]:

from sklearn.isotonic import IsotonicRegression

# Exemple: calibration sur OOF (à adapter si tu stockes aussi les oof proba)
# Ici on illustre la mécanique si tu as `oof_proba`.
oof_proba = 544.9952769107592
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(oof_proba, y_freq)
p_calib = iso.transform(oof_proba)

TypeError: Input should have at least 1 dimension i.e. satisfy `len(x.shape) > 0`, got scalar `array(544.99527691)` instead.

## Entraînement final + submission
On refit sur *tout* le train, puis on prédit sur test, et on écrit un CSV au format de `prime_pred_sandbox.csv`.

In [None]:

# Refit final (à partir des paramètres qui marchent bien en CV)
freq_params = dict(
    loss_function="Logloss",
    iterations=4000,
    learning_rate=0.03,
    depth=7,
    l2_leaf_reg=6,
    random_seed=42,
    verbose=200,
    od_type="Iter",
    od_wait=300,
)

sev_params = dict(
    loss_function="RMSE",
    iterations=6000,
    learning_rate=0.03,
    depth=8,
    l2_leaf_reg=8,
    random_seed=42,
    verbose=200,
    od_type="Iter",
    od_wait=400,
)

# 1) fréquence
clf = CatBoostClassifier(**freq_params)
clf.fit(Pool(X_train, y_freq, cat_features=cat_cols))

p_test = clf.predict_proba(Pool(X_test, cat_features=cat_cols))[:, 1]

# 2) gravité
pos = y_freq.values == 1
y_sev = np.log1p(y_cost.values[pos])

reg = CatBoostRegressor(**sev_params)
reg.fit(Pool(X_train.loc[pos], y_sev, cat_features=cat_cols))

z_test = reg.predict(Pool(X_test, cat_features=cat_cols))

# smearing sur tout le train pos
z_tr = reg.predict(Pool(X_train.loc[pos], cat_features=cat_cols))
smear = float(np.mean(np.exp(y_sev - z_tr)))

m_test = smear * np.exp(z_test) - 1.0
m_test = np.maximum(m_test, 0.0)

prime_test = p_test * m_test

submission = pd.DataFrame({
    "index": test[id_col].astype(int),
    "pred": prime_test.astype(float)
})

submission.to_csv("submission.csv", index=False)
submission.head()

0:	learn: 0.6566592	total: 55.3ms	remaining: 3m 41s
200:	learn: 0.2104486	total: 16.3s	remaining: 5m 8s
400:	learn: 0.2055354	total: 34.3s	remaining: 5m 7s
600:	learn: 0.2013752	total: 52.5s	remaining: 4m 57s
800:	learn: 0.1960677	total: 1m 10s	remaining: 4m 43s
1000:	learn: 0.1912715	total: 1m 29s	remaining: 4m 28s
1200:	learn: 0.1871842	total: 1m 48s	remaining: 4m 12s
1400:	learn: 0.1831123	total: 2m 7s	remaining: 3m 56s
1600:	learn: 0.1794798	total: 2m 26s	remaining: 3m 39s
1800:	learn: 0.1755710	total: 2m 45s	remaining: 3m 21s
2000:	learn: 0.1721246	total: 3m 3s	remaining: 3m 3s
2200:	learn: 0.1685516	total: 3m 22s	remaining: 2m 45s
2400:	learn: 0.1651896	total: 3m 41s	remaining: 2m 27s
2600:	learn: 0.1618124	total: 4m	remaining: 2m 9s
2800:	learn: 0.1583717	total: 4m 18s	remaining: 1m 50s
3000:	learn: 0.1546821	total: 4m 37s	remaining: 1m 32s
3200:	learn: 0.1511783	total: 4m 56s	remaining: 1m 14s
3400:	learn: 0.1481290	total: 5m 15s	remaining: 55.6s
3600:	learn: 0.1448397	total: 5

Unnamed: 0,index,pred
0,50000,46.621052
1,50001,36.674278
2,50002,212.121064
3,50003,115.535895
4,50004,26.412856


: 