# Titanic - 0.80861 with the Group Survival Trick

**Approach:** Titanic passengers did not travel alone. By exploiting the links between passengers (same ticket, same family), the survival outcomes known in the training set can be propagated to the test set.

**Pipeline:**
1. Classic feature engineering (Title, FamilySize, etc.)
2. Group Survival Trick (ticket groups + family groups)
3. Ensemble of 6 conservative models (to limit overfitting on 891 rows)
4. Simple average (no weight optimization)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostClassifier
import warnings
warnings.filterwarnings('ignore')

SEED = 42
N_FOLDS = 10
np.random.seed(SEED)

train = pd.read_csv('/kaggle/input/titanic/train.csv')
test = pd.read_csv('/kaggle/input/titanic/test.csv')
print(f'Train: {train.shape}, Test: {test.shape}')

## Quick EDA

The two dominant factors are sex and class. Their interaction shows where the easy cases and the hard cases are.

In [None]:
pivot = train.pivot_table(values='Survived', index='Sex', columns='Pclass', aggfunc='mean')

fig, ax = plt.subplots(figsize=(7, 3))
sns.heatmap(pivot, annot=True, fmt='.1%', cmap='RdYlGn', ax=ax,
            xticklabels=['1st', '2nd', '3rd'])
ax.set_title('Survival rate: Sex x Class')
plt.tight_layout()
plt.show()

# The "easy" cases: women in 1st/2nd class (~95%), men in 2nd/3rd class (~14%)
# The "hard" cases: women in 3rd class (50%), men in 1st class (37%)
# -> These are the cases where the Group Survival Trick makes the difference

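## Feature Engineering

All imputations and label-derived statistics are computed on the training set only, to avoid data leakage. The one exception is `TicketGroupSize`, which counts tickets across train and test but uses no labels.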
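
The train-only rule can be illustrated on toy data. This is a minimal sketch, not the notebook's own code: the frames `toy_train` and `toy_test`, their values, and the helper `impute_age` are made up for illustration. The medians are computed once on the training frame and then applied unchanged to both frames, with the global training median as a fallback for titles never seen in training.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for train/test (values are invented for illustration).
toy_train = pd.DataFrame({
    'Title': ['Mr', 'Mr', 'Miss', 'Miss', 'Mr'],
    'Age':   [30.0, 40.0, 20.0, np.nan, np.nan],
})
toy_test = pd.DataFrame({
    'Title': ['Mr', 'Miss', 'Rare'],
    'Age':   [np.nan, np.nan, np.nan],
})

# Statistics are computed on the training frame only.
age_by_title = toy_train.groupby('Title')['Age'].median()
global_median = toy_train['Age'].median()

def impute_age(df):
    """Fill missing ages with the train median of the title, then the global train median."""
    by_title = df['Title'].map(age_by_title)          # NaN for titles unseen in train
    filled = df['Age'].fillna(by_title).fillna(global_median)
    return df.assign(Age=filled)

toy_train_imp = impute_age(toy_train)
toy_test_imp = impute_age(toy_test)
print(toy_test_imp)
```

The `Rare` row in `toy_test` has no title median in the training frame, so it receives the global training median. No statistic is ever recomputed from the test frame.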

In [None]:
# Title extraction
title_map = {
    'Mr': 'Mr', 'Miss': 'Miss', 'Mrs': 'Mrs', 'Master': 'Master',
    'Dr': 'Rare', 'Rev': 'Rare', 'Col': 'Rare', 'Major': 'Rare',
    'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs', 'Countess': 'Rare',
    'Lady': 'Rare', 'Sir': 'Rare', 'Don': 'Rare', 'Dona': 'Rare',
    'Jonkheer': 'Rare', 'Capt': 'Rare'
}
for df in [train, test]:
    df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
    # Unmapped or missing titles fall back to 'Rare'
    df['Title'] = df['Title'].map(title_map).fillna('Rare')

# Age imputation using the median age per title (train only)
age_by_title = train.groupby('Title')['Age'].median()
for df in [train, test]:
    for title in df['Title'].unique():
        mask = (df['Age'].isnull()) & (df['Title'] == title)
        df.loc[mask, 'Age'] = age_by_title.get(title, train['Age'].median())

# Embarked and Fare imputation (train only)
embarked_mode = train['Embarked'].mode()[0]
fare_by_class = train.groupby('Pclass')['Fare'].median()
for df in [train, test]:
    df['Embarked'] = df['Embarked'].fillna(embarked_mode)
    for pclass in [1, 2, 3]:
        mask = (df['Fare'].isnull()) & (df['Pclass'] == pclass)
        df.loc[mask, 'Fare'] = fare_by_class[pclass]

# Base features
for df in [train, test]:
    df['Sex_enc'] = (df['Sex'] == 'male').astype(int)
    df['Title_enc'] = df['Title'].map({'Master': 0, 'Miss': 1, 'Mr': 2, 'Mrs': 3, 'Rare': 4}).fillna(4).astype(int)
    df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
    df['IsAlone'] = (df['FamilySize'] == 1).astype(int)
    df['HasCabin'] = df['Cabin'].notna().astype(int)
    df['IsChild'] = (df['Age'] <= 12).astype(int)
    df['Age_Pclass'] = df['Age'] * df['Pclass']
    df['LogFare'] = np.log1p(df['Fare'])
    df['Embarked_enc'] = df['Embarked'].map({'C': 0, 'Q': 1, 'S': 2}).fillna(2).astype(int)
    df['Surname'] = df['Name'].str.split(',').str[0]

print('14 base features OK')

## Group Survival Trick

The idea: families and groups (same ticket) tend to survive or die together. If the survival of the other group members is known from the training set, that gives a strong signal for predicting the members in the test set.

Three levels of signal:
1. **Ticket Group**: passengers sharing the same ticket number
2. **Family Group**: same surname + same family size
3. **Sex-specific**: survival rate of women/children vs. men within each ticket group

For the training set, the rate is computed **leave-one-out** (the group rate excluding the passenger themselves), so that a passenger's own label never leaks into their feature. Passengers with no other known group member get the sentinel value -1.
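
To make the leave-one-out idea concrete, here is a small worked example on invented data (the `toy` frame and the `LooRate` / `NaiveRate` column names are assumptions for illustration only). It compares the naive group mean, which includes the passenger's own label, with the leave-one-out rate.

```python
import pandas as pd

# Six invented passengers on three tickets.
toy = pd.DataFrame({
    'Ticket':   ['A', 'A', 'A', 'B', 'B', 'C'],
    'Survived': [1,   1,   0,   0,   1,   1],
})

grp = toy.groupby('Ticket')['Survived']
grp_sum = grp.transform('sum')      # survivors per ticket, repeated on each row
grp_cnt = grp.transform('count')    # passengers per ticket, repeated on each row

# Naive rate: includes the passenger's own outcome (leaks the label).
toy['NaiveRate'] = grp.transform('mean')

# Leave-one-out rate: subtract the passenger's own outcome before averaging.
# The clip only avoids a 0/0 for single-passenger tickets, which are then set to -1.
loo = (grp_sum - toy['Survived']) / (grp_cnt - 1).clip(lower=1)
toy['LooRate'] = loo.where(grp_cnt > 1, -1.0)

print(toy)
```

On ticket `C`, the naive rate is exactly the passenger's own label, whereas the leave-one-out rate correctly reports that no information is available.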

In [None]:
# --- Ticket Group Survival Rate ---
# Leave-one-out for the training set
train_ticket_sum = train.groupby('Ticket')['Survived'].transform('sum')
train_ticket_cnt = train.groupby('Ticket')['Survived'].transform('count')
train['TicketGroupSurvRate'] = np.where(
    train_ticket_cnt > 1,
    (train_ticket_sum - train['Survived']) / (train_ticket_cnt - 1),
    -1
)

# For the test set: group mean computed on the training set
ticket_surv_map = train.groupby('Ticket')['Survived'].mean()
test['TicketGroupSurvRate'] = test['Ticket'].map(ticket_surv_map).fillna(-1)

# Ticket group size (structural information, no labels involved)
combined_ticket_size = pd.concat([train['Ticket'], test['Ticket']]).value_counts()
for df in [train, test]:
    df['TicketGroupSize'] = df['Ticket'].map(combined_ticket_size)

# --- Family Group Survival Rate ---
for df in [train, test]:
    df['FamilyGroup'] = df['Surname'] + '_' + df['FamilySize'].astype(str)

family_surv_map = train.groupby('FamilyGroup')['Survived'].mean()
train_fg_sum = train.groupby('FamilyGroup')['Survived'].transform('sum')
train_fg_cnt = train.groupby('FamilyGroup')['Survived'].transform('count')
train['FamilyGroupSurvRate'] = np.where(
    train_fg_cnt > 1,
    (train_fg_sum - train['Survived']) / (train_fg_cnt - 1),
    -1
)
test['FamilyGroupSurvRate'] = test['FamilyGroup'].map(family_surv_map).fillna(-1)

print(f'Ticket group info: train {(train["TicketGroupSurvRate"]!=-1).sum()}/{len(train)}, '
      f'test {(test["TicketGroupSurvRate"]!=-1).sum()}/{len(test)}')
print(f'Family group info: train {(train["FamilyGroupSurvRate"]!=-1).sum()}/{len(train)}, '
      f'test {(test["FamilyGroupSurvRate"]!=-1).sum()}/{len(test)}')

In [None]:
# --- Sex-specific group survival ---
# Women/children in a group tend to survive together,
# men in a group tend to die together.

for df in [train, test]:
    df['IsWomanChild'] = ((df['Sex'] == 'female') | (df['Age'] <= 12)).astype(int)

wc_train = train[train['IsWomanChild'] == 1]
men_train = train[train['IsWomanChild'] == 0]

wc_ticket_surv = wc_train.groupby('Ticket')['Survived'].mean()
men_ticket_surv = men_train.groupby('Ticket')['Survived'].mean()

for df in [train, test]:
    df['TicketWCSurvRate'] = df['Ticket'].map(wc_ticket_surv).fillna(-1)
    df['TicketMenSurvRate'] = df['Ticket'].map(men_ticket_surv).fillna(-1)

# Leave-one-out for the training set (only the passenger's own-kind rate needs correcting)
for idx in train.index:
    ticket = train.loc[idx, 'Ticket']
    is_wc = train.loc[idx, 'IsWomanChild']
    if is_wc == 1:
        group = wc_train[(wc_train['Ticket'] == ticket) & (wc_train.index != idx)]
        train.loc[idx, 'TicketWCSurvRate'] = group['Survived'].mean() if len(group) > 0 else -1
    else:
        group = men_train[(men_train['Ticket'] == ticket) & (men_train.index != idx)]
        train.loc[idx, 'TicketMenSurvRate'] = group['Survived'].mean() if len(group) > 0 else -1

# Combined signal: best available source for each passenger
for df in [train, test]:
    df['GroupSurvSignal'] = np.where(
        df['IsWomanChild'] == 1,
        np.where(df['TicketWCSurvRate'] >= 0, df['TicketWCSurvRate'],
                 np.where(df['FamilyGroupSurvRate'] >= 0, df['FamilyGroupSurvRate'],
                          np.where(df['TicketGroupSurvRate'] >= 0, df['TicketGroupSurvRate'], -1))),
        np.where(df['TicketMenSurvRate'] >= 0, df['TicketMenSurvRate'],
                 np.where(df['FamilyGroupSurvRate'] >= 0, df['FamilyGroupSurvRate'],
                          np.where(df['TicketGroupSurvRate'] >= 0, df['TicketGroupSurvRate'], -1)))
    )
    df['HasGroupInfo'] = (df['GroupSurvSignal'] >= 0).astype(int)

# Sanity check: is the signal discriminative?
has = train[train['HasGroupInfo'] == 1]
high = has[has['GroupSurvSignal'] >= 0.5]
low = has[has['GroupSurvSignal'] < 0.5]
no_info = train[train['HasGroupInfo'] == 0]
print(f'\nSignal >= 0.5 -> survival = {high["Survived"].mean():.0%} (n={len(high)})')
print(f'Signal <  0.5 -> survival = {low["Survived"].mean():.0%} (n={len(low)})')
print(f'No signal     -> survival = {no_info["Survived"].mean():.0%} (n={len(no_info)})')
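
The row-by-row leave-one-out loop above can also be written without iteration. The following is a minimal sketch on invented data, using a hypothetical helper `loo_rate_within` that is not part of the notebook. Grouping on both `Ticket` and `IsWomanChild` gives, for each passenger, the survival rate of the other same-kind passengers on the ticket, which is the value the loop writes into `TicketWCSurvRate` for women/children and into `TicketMenSurvRate` for men.

```python
import pandas as pd

# Invented passengers: ticket A has two women/children and two men,
# ticket B has one of each.
toy = pd.DataFrame({
    'Ticket':       ['A', 'A', 'A', 'A', 'B', 'B'],
    'IsWomanChild': [1,   1,   0,   0,   1,   0],
    'Survived':     [1,   0,   0,   0,   1,   0],
})

def loo_rate_within(df, keys, target='Survived', missing=-1.0):
    """Leave-one-out mean of `target` among the other rows sharing `keys`."""
    grp = df.groupby(keys)[target]
    others_sum = grp.transform('sum') - df[target]   # exclude the row's own label
    others_cnt = grp.transform('count') - 1          # exclude the row itself
    rate = others_sum / others_cnt.clip(lower=1)     # clip avoids 0/0 for singletons
    return rate.where(others_cnt > 0, missing)       # no other member -> sentinel

toy['SameKindLooRate'] = loo_rate_within(toy, ['Ticket', 'IsWomanChild'])
print(toy)
```

On 891 rows the explicit loop is fast enough, so this is mainly a readability choice. The vectorized form scales better if the same trick is applied to a larger dataset.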

## Final features

In [None]:
features = [
    # Base
    'Pclass', 'Sex_enc', 'Age', 'SibSp', 'Parch', 'Fare',
    'Title_enc', 'FamilySize', 'IsAlone', 'HasCabin', 'IsChild',
    'Age_Pclass', 'LogFare', 'Embarked_enc',
    # Group Survival
    'TicketGroupSurvRate', 'TicketGroupSize', 'FamilyGroupSurvRate',
    'TicketWCSurvRate', 'TicketMenSurvRate',
    'GroupSurvSignal', 'HasGroupInfo', 'IsWomanChild',
]

X = train[features].values
y = train['Survived'].values.astype(int)
X_test = test[features].values
test_ids = test['PassengerId'].values

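# Count missing values instead of asserting their absence in the message
n_nan = int(np.isnan(X).sum() + np.isnan(X_test).sum())
print(f'{len(features)} features (14 base + 8 group), {n_nan} NaN')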

## Modeling

6 models with very conservative parameters (strong regularization, no Optuna tuning) to avoid overfitting on 891 rows. 10-fold CV for a stable estimate.

In [None]:
kf = StratifiedKFold(n_splits=N_FOLDS, shuffle=True, random_state=SEED)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_test_scaled = scaler.transform(X_test)

results = {}

from sklearn.base import clone

def train_model(name, model, X_tr, X_te, fit_kw=None):
    """Cross-validate `model`; store OOF and fold-averaged test probabilities in `results`."""
    oof = np.zeros(len(y))
    test_preds = np.zeros(len(test_ids))
    scores = []
    for ti, vi in kf.split(X_tr, y):
        m = clone(model)  # fresh, unfitted copy for each fold
        # fit_kw values may be callables (X, y, val_idx) -> value, e.g. to build an eval_set
        kw = {k: v(X_tr, y, vi) if callable(v) else v for k, v in (fit_kw or {}).items()}
        m.fit(X_tr[ti], y[ti], **kw)
        oof[vi] = m.predict_proba(X_tr[vi])[:, 1]
        test_preds += m.predict_proba(X_te)[:, 1] / N_FOLDS
        scores.append(accuracy_score(y[vi], m.predict(X_tr[vi])))
    acc = np.mean(scores)
    print(f'{name:4s}: {acc:.5f} +/- {np.std(scores):.5f}')
    results[name] = {'oof': oof, 'test': test_preds, 'acc': acc}

# LR, RF, SVM (scaled inputs for LR and SVM, raw inputs for RF)
for name, model, Xi, Xti in [
    ('LR', LogisticRegression(C=0.5, max_iter=1000, random_state=SEED), X_scaled, X_test_scaled),
    ('RF', RandomForestClassifier(n_estimators=300, max_depth=5, min_samples_split=10,
           min_samples_leaf=5, max_features='sqrt', random_state=SEED, n_jobs=-1), X, X_test),
    ('SVM', SVC(C=0.8, kernel='rbf', gamma='scale', probability=True, random_state=SEED), X_scaled, X_test_scaled),
]:
    train_model(name, model, Xi, Xti)

In [None]:
# LightGBM - conservative
lgb_params = {
    'objective': 'binary', 'metric': 'binary_logloss', 'boosting_type': 'gbdt',
    'num_leaves': 15, 'learning_rate': 0.05, 'feature_fraction': 0.7,
    'bagging_fraction': 0.7, 'bagging_freq': 5, 'min_child_samples': 30,
    'reg_alpha': 1.0, 'reg_lambda': 5.0, 'n_estimators': 300,
    'verbose': -1, 'random_state': SEED,
}
oof = np.zeros(len(y)); tp = np.zeros(len(test_ids)); ss = []
for ti, vi in kf.split(X, y):
    m = lgb.LGBMClassifier(**lgb_params)
    m.fit(X[ti], y[ti], eval_set=[(X[vi], y[vi])],
          callbacks=[lgb.early_stopping(30, verbose=False), lgb.log_evaluation(0)])
    oof[vi] = m.predict_proba(X[vi])[:, 1]
    tp += m.predict_proba(X_test)[:, 1] / N_FOLDS
    ss.append(accuracy_score(y[vi], m.predict(X[vi])))
print(f'LGB : {np.mean(ss):.5f} +/- {np.std(ss):.5f}')
results['LGB'] = {'oof': oof, 'test': tp, 'acc': np.mean(ss)}

In [None]:
# XGBoost - conservative
xgb_params = {
    'objective': 'binary:logistic', 'eval_metric': 'logloss',
    'max_depth': 3, 'learning_rate': 0.05, 'subsample': 0.7,
    'colsample_bytree': 0.7, 'min_child_weight': 10,
    'reg_alpha': 1.0, 'reg_lambda': 5.0, 'gamma': 0.5,
    'n_estimators': 300, 'random_state': SEED, 'verbosity': 0,
}
oof = np.zeros(len(y)); tp = np.zeros(len(test_ids)); ss = []
for ti, vi in kf.split(X, y):
    m = xgb.XGBClassifier(**xgb_params)
    m.fit(X[ti], y[ti], eval_set=[(X[vi], y[vi])], verbose=False)
    oof[vi] = m.predict_proba(X[vi])[:, 1]
    tp += m.predict_proba(X_test)[:, 1] / N_FOLDS
    ss.append(accuracy_score(y[vi], m.predict(X[vi])))
print(f'XGB : {np.mean(ss):.5f} +/- {np.std(ss):.5f}')
results['XGB'] = {'oof': oof, 'test': tp, 'acc': np.mean(ss)}

In [None]:
# CatBoost - conservative
oof = np.zeros(len(y)); tp = np.zeros(len(test_ids)); ss = []
for ti, vi in kf.split(X, y):
    m = CatBoostClassifier(iterations=300, depth=4, learning_rate=0.05,
        l2_leaf_reg=5.0, random_seed=SEED, verbose=0, early_stopping_rounds=30)
    m.fit(X[ti], y[ti], eval_set=(X[vi], y[vi]), verbose=0)
    oof[vi] = m.predict_proba(X[vi])[:, 1]
    tp += m.predict_proba(X_test)[:, 1] / N_FOLDS
    ss.append(accuracy_score(y[vi], m.predict(X[vi])))
print(f'CB  : {np.mean(ss):.5f} +/- {np.std(ss):.5f}')
results['CB'] = {'oof': oof, 'test': tp, 'acc': np.mean(ss)}

## Feature Importance

In [None]:
group_features = ['TicketGroupSurvRate', 'TicketGroupSize', 'FamilyGroupSurvRate',
                  'TicketWCSurvRate', 'TicketMenSurvRate', 'GroupSurvSignal',
                  'HasGroupInfo', 'IsWomanChild']

cb_full = CatBoostClassifier(iterations=300, depth=4, learning_rate=0.05,
    l2_leaf_reg=5.0, random_seed=SEED, verbose=0)
cb_full.fit(X, y)

imp = pd.DataFrame({'feature': features, 'importance': cb_full.feature_importances_})
imp = imp.sort_values('importance', ascending=True)

fig, ax = plt.subplots(figsize=(8, 7))
colors = ['#e74c3c' if f in group_features else '#3498db' for f in imp['feature']]
ax.barh(imp['feature'], imp['importance'], color=colors)
ax.set_title('Feature Importance (blue=base, red=group survival)')
plt.tight_layout()
plt.show()

group_pct = imp[imp['feature'].isin(group_features)]['importance'].sum() / imp['importance'].sum()
print(f'Group features account for {group_pct:.0%} of total importance')

## Ensemble & Submission

Simple average of the 6 models. No weight optimization, since fitted weights would overfit on 891 rows.
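
As a small illustration of what the simple average does, the sketch below averages invented probabilities from three hypothetical models and thresholds the result at 0.5. It also computes a majority vote on hard labels for comparison. The majority vote is not used in this notebook and is shown only to clarify the difference.

```python
import numpy as np

# Rows = passengers, columns = models (probabilities are invented).
probs = np.array([
    [0.90, 0.80, 0.70],
    [0.40, 0.60, 0.35],
    [0.20, 0.10, 0.30],
    [0.95, 0.45, 0.45],
])

# Simple average of probabilities, then threshold.
avg = probs.mean(axis=1)
soft_vote = (avg > 0.5).astype(int)

# Majority vote on each model's hard decision, for comparison only.
hard_vote = ((probs > 0.5).mean(axis=1) > 0.5).astype(int)

print('soft:', soft_vote, 'hard:', hard_vote)
```

The last passenger shows the difference: one very confident model outweighs two mildly negative ones under probability averaging, but loses under majority vote.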

In [None]:
all_oof = np.column_stack([results[m]['oof'] for m in results])
all_test = np.column_stack([results[m]['test'] for m in results])

ensemble_acc = accuracy_score(y, (all_oof.mean(axis=1) > 0.5).astype(int))
print(f'Ensemble CV: {ensemble_acc:.5f}')
for name in sorted(results, key=lambda x: results[x]['acc'], reverse=True):
    print(f'  {name}: {results[name]["acc"]:.5f}')

# Submission
final_preds = (all_test.mean(axis=1) > 0.5).astype(int)
submission = pd.DataFrame({'PassengerId': test_ids.astype(int), 'Survived': final_preds})
submission.to_csv('submission.csv', index=False)

print(f'\nsubmission.csv: {submission.shape}, survival rate = {submission.Survived.mean():.1%}')