# Classical ML Failure Modes — Student Lab (Titanic)

This lab focuses on failure modes: leakage, spurious correlations, and bad validation.

In [4]:
import os
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.linear_model import LogisticRegression

def check(name: str, cond: bool):
    if not cond:
        raise AssertionError(f'Failed: {name}')
    print(f'OK: {name}')

rng = np.random.default_rng(0)

## Section 0 — Load Titanic (Kaggle) with fallback

Expected path: `data/titanic/train.csv`

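If the file is missing, we generate a tiny synthetic dataset so the notebook still runs. The fallback has only 8 rows, so any metric computed on it is too noisy to interpret; it only confirms that the code path works.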

In [5]:
def load_titanic_or_synthetic():
    path = os.path.join(os.getcwd(), 'data', 'titanic', 'train.csv')
    if os.path.exists(path):
        df = pd.read_csv(path)
        return 'kaggle', df

    # synthetic fallback (schema resembles Titanic)
    df = pd.DataFrame({
        'Survived': [0,1,1,0,1,0,0,1],
        'Pclass': [3,1,3,3,2,3,2,1],
        'Sex': ['male','female','female','male','female','male','male','female'],
        'Age': [22, 38, 26, 35, 28, 2, 54, 19],
        'SibSp': [1,1,0,1,0,3,0,0],
        'Parch': [0,0,0,0,0,1,0,0],
        'Fare': [7.25, 71.3, 7.92, 53.1, 13.0, 21.1, 51.9, 30.0],
        'Embarked': ['S','C','S','S','S','S','S','C'],
    })
    return 'synthetic', df

mode, df = load_titanic_or_synthetic()
print('mode:', mode, 'rows:', len(df))
df.head()

mode: synthetic rows: 8


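|   | Survived | Pclass | Sex    | Age | SibSp | Parch | Fare  | Embarked |
|---|----------|--------|--------|-----|-------|-------|-------|----------|
| 0 | 0        | 3      | male   | 22  | 1     | 0     | 7.25  | S        |
| 1 | 1        | 1      | female | 38  | 1     | 0     | 71.3  | C        |
| 2 | 1        | 3      | female | 26  | 0     | 0     | 7.92  | S        |
| 3 | 0        | 3      | male   | 35  | 1     | 0     | 53.1  | S        |
| 4 | 1        | 2      | female | 28  | 0     | 0     | 13.0  | S        |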


## Section 1 — Baseline + Proper Validation

### Task 1.1: Minimal baseline features

Use only: Pclass, Sex, Age, Fare (simple).

# TODO:
- Create X/y
- Handle missing Age/Fare
- One-hot encode Sex

**Checkpoint:** Why do we start with a minimal baseline?

In [13]:
# TODO
y = df['Survived'].astype(int).values
X = df[['Pclass','Sex','Age','Fare']].copy()

# impute
X['Age'] = X['Age'].fillna(X['Age'].median())
X['Fare'] = X['Fare'].fillna(X['Fare'].median())

# one-hot
X = pd.get_dummies(X, columns=['Sex'], drop_first=True)

Xtr, Xva, ytr, yva = train_test_split(X.values, y, test_size=0.3, random_state=0, stratify=y)

clf = LogisticRegression(max_iter=2000)
clf.fit(Xtr, ytr)
pred = clf.predict(Xva)
proba = clf.predict_proba(Xva)[:,1]
print('acc', accuracy_score(yva, pred))
print('auc', roc_auc_score(yva, proba))

# We start with a minimal baseline because it is simple and quick to build. It gives a lower bound
# on performance that later models must beat, and it doubles as a sanity check on the pipeline.

acc 0.3333333333333333
auc 0.0
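
With `test_size=0.3` on the 8-row fallback, the split above leaves only 3 validation rows, so a single hold-out score says very little. One possible alternative is stratified k-fold cross-validation with the preprocessing wrapped in a pipeline, so imputation is fit on each training fold only. The sketch below is illustrative rather than part of the lab: `make_toy_titanic` is a hypothetical helper that generates random Titanic-like rows (not the real dataset), and the choice of 5 folds is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def make_toy_titanic(n=200, seed=0):
    """Hypothetical synthetic rows with a Titanic-like schema (not the real data)."""
    gen = np.random.default_rng(seed)
    sex = gen.choice(['male', 'female'], size=n)
    pclass = gen.choice([1, 2, 3], size=n)
    age = gen.normal(30, 12, size=n).clip(1, 80)
    age[gen.random(n) < 0.1] = np.nan  # roughly 10% missing ages
    fare = gen.gamma(2.0, 15.0, size=n)
    # Survival probability depends on sex and class (assumed toy relationship).
    logit = 1.5 * (sex == 'female') - 0.6 * (pclass - 2)
    survived = (gen.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
    return pd.DataFrame({'Survived': survived, 'Pclass': pclass,
                         'Sex': sex, 'Age': age, 'Fare': fare})


toy = make_toy_titanic()
X_cv = toy[['Pclass', 'Sex', 'Age', 'Fare']]
y_cv = toy['Survived'].to_numpy()

# Imputation and scaling live inside the pipeline, so they are refit per fold.
numeric = Pipeline([('impute', SimpleImputer(strategy='median')),
                    ('scale', StandardScaler())])
prep = ColumnTransformer([
    ('num', numeric, ['Pclass', 'Age', 'Fare']),
    ('cat', OneHotEncoder(drop='first'), ['Sex']),
])
model = Pipeline([('prep', prep), ('clf', LogisticRegression(max_iter=2000))])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_scores = cross_val_score(model, X_cv, y_cv, cv=cv, scoring='roc_auc')
print('auc per fold:', np.round(cv_scores, 3))
print('mean:', round(float(cv_scores.mean()), 3), 'std:', round(float(cv_scores.std()), 3))
```

Reporting the spread across folds alongside the mean makes it easier to tell a real improvement from split-to-split noise.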


## Section 2 — Leakage

### Task 2.1: Create a leaky feature

Intentionally create a feature that encodes the label (e.g., `leak = Survived`).
Train again and observe metric inflation.

**Checkpoint:** How can you detect leakage quickly in an interview?

In [14]:
X_leak = X.copy()
X_leak['leak'] = df['Survived'].values  # leaky on purpose

Xtr, Xva, ytr, yva = train_test_split(X_leak.values, y, test_size=0.3, random_state=0, stratify=y)
clf = LogisticRegression(max_iter=2000)
clf.fit(Xtr, ytr)
print('acc_with_leak', accuracy_score(yva, clf.predict(Xva)))

# Quick signs of leakage:
# - validation/test accuracy is suspiciously high, or higher than training accuracy
# - removing a single feature makes accuracy drop by a large margin
# - a feature is used in training but would not be available at prediction time


acc_with_leak 0.3333333333333333
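
One quick screen for the kind of leak built in this task is to score every feature on its own against the label: a column that separates the classes almost perfectly by itself deserves a closer look. The sketch below is a heuristic, not a complete leakage audit. `single_feature_auc` is a hypothetical helper, the demo columns are randomly generated, and the 0.99 cutoff is an arbitrary assumption.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score


def single_feature_auc(features: pd.DataFrame, target: np.ndarray) -> pd.Series:
    """AUC of each numeric column used alone as a score (hypothetical helper)."""
    scores = {}
    for col in features.columns:
        auc = roc_auc_score(target, features[col].astype(float))
        scores[col] = max(auc, 1 - auc)  # ignore the direction of the relationship
    return pd.Series(scores).sort_values(ascending=False)


demo_rng = np.random.default_rng(0)
n_rows = 500
target = demo_rng.integers(0, 2, size=n_rows)
demo = pd.DataFrame({
    'noise': demo_rng.normal(size=n_rows),                             # unrelated to the label
    'weak_signal': target + demo_rng.normal(scale=2.0, size=n_rows),   # honest, noisy signal
    'leak': target,                                                    # exact copy of the label
})

leak_report = single_feature_auc(demo, target)
print(leak_report)

# Assumed cutoff: flag anything that is nearly a perfect predictor on its own.
suspects = leak_report[leak_report > 0.99].index.tolist()
print('suspect features:', suspects)
```

A flagged column is only a candidate: the follow-up question is whether that value would actually be known at prediction time.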


## Section 3 — Spurious correlations + slice analysis

### Task 3.1: Evaluate slices

Compute accuracy by groups (Sex, Pclass).

# TODO:
- Make predictions on validation
- Report metrics by slice

**Interview Angle:** Why can overall accuracy hide severe subgroup failures?

In [16]:
# Refit baseline quickly (without leak)
Xbase = df[['Pclass','Sex','Age','Fare']].copy()
Xbase['Age'] = Xbase['Age'].fillna(Xbase['Age'].median())
Xbase['Fare'] = Xbase['Fare'].fillna(Xbase['Fare'].median())
Xbase = pd.get_dummies(Xbase, columns=['Sex'], drop_first=True)

idx = np.arange(len(df))
tr_idx, va_idx = train_test_split(idx, test_size=0.3, random_state=0, stratify=y)

clf = LogisticRegression(max_iter=2000)
clf.fit(Xbase.iloc[tr_idx].values, y[tr_idx])
pred = clf.predict(Xbase.iloc[va_idx].values)

va_df = df.iloc[va_idx].copy()
va_df['pred'] = pred
va_df['correct'] = (va_df['pred'].values == va_df['Survived'].values).astype(int)

print('overall_acc', va_df['correct'].mean())
print('acc_by_sex')
print(va_df.groupby('Sex')['correct'].mean())
print('acc_by_pclass')
print(va_df.groupby('Pclass')['correct'].mean())

# Overall accuracy averages over all rows, so the largest groups dominate it. Especially under
# imbalance, overall accuracy can be high while accuracy is still low for a few groups,
# because a less common group has too few rows to move the overall number.

overall_acc 0.3333333333333333
acc_by_sex
Sex
female    1.0
male      0.0
Name: correct, dtype: float64
acc_by_pclass
Pclass
1    1.0
2    0.0
3    0.0
Name: correct, dtype: float64
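
The slice accuracies above come from a handful of rows, so it helps to print the group size next to each accuracy. The sketch below shows one way to do that, on a made-up example where a high overall accuracy hides a subgroup that is always wrong. `slice_report` is a hypothetical helper and the `example` frame is invented for illustration.

```python
import pandas as pd


def slice_report(frame: pd.DataFrame, group_col: str, correct_col: str = 'correct') -> pd.DataFrame:
    """Row count and accuracy per group, worst group first (hypothetical helper)."""
    report = frame.groupby(group_col)[correct_col].agg(n='count', acc='mean')
    return report.sort_values('acc')


# Invented data: 8 of 10 predictions are right, but every 'female' row is wrong.
example = pd.DataFrame({
    'Sex': ['female'] * 2 + ['male'] * 8,
    'correct': [0, 0] + [1] * 8,
})

print('overall_acc', example['correct'].mean())
sex_report = slice_report(example, 'Sex')
print(sex_report)
```

Sorting by accuracy puts the weakest slice at the top, and the `n` column shows whether that slice is large enough for its accuracy to be trusted.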


## Section 4 — Dataset shift (toy simulation)

### Task 4.1: Simulate a shift in Fare distribution

Create a shifted validation set by multiplying Fare and see how performance changes.

**Checkpoint:** How would you monitor drift in production?

In [19]:
X_shift = Xbase.copy()
# toy shift
if 'Fare' in X_shift.columns:
    X_shift['Fare'] = X_shift['Fare'] * 3.0

pred_shift = clf.predict(X_shift.iloc[va_idx].values)
acc_shift = accuracy_score(y[va_idx], pred_shift)
print('acc_original', accuracy_score(y[va_idx], pred))
print('acc_shifted', acc_shift)
# In production, we can track the input data distribution over a moving window and compare it
# against a reference window (for example, the training data). If the two differ by more than
# a chosen threshold, we flag it as drift.


acc_original 0.3333333333333333
acc_shifted 0.3333333333333333
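
The window comparison described in the answer above can be done with several statistics. One commonly used option is the Population Stability Index (PSI), sketched below for a single numeric feature. This is an illustration under assumptions: `psi` is a hypothetical helper, the fares are randomly generated, and the bin count is arbitrary. Any alert threshold would need tuning per feature.

```python
import numpy as np


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two 1-D samples (hypothetical helper)."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior bin edges come from reference quantiles; the outer bins are open-ended,
    # so values outside the reference range still land in the first or last bin.
    inner_edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))[1:-1]
    ref_counts = np.bincount(np.searchsorted(inner_edges, reference, side='right'), minlength=n_bins)
    cur_counts = np.bincount(np.searchsorted(inner_edges, current, side='right'), minlength=n_bins)
    # Clip to avoid log(0) when a bin is empty in one of the windows.
    ref_frac = np.clip(ref_counts / len(reference), eps, None)
    cur_frac = np.clip(cur_counts / len(current), eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


drift_rng = np.random.default_rng(0)
fare_ref = drift_rng.gamma(2.0, 15.0, size=2000)    # reference window (synthetic fares)
fare_same = drift_rng.gamma(2.0, 15.0, size=2000)   # new window, same distribution
fare_shifted = fare_ref * 3.0                       # mirrors the x3 toy shift in this task

psi_same = psi(fare_ref, fare_same)
psi_shifted = psi(fare_ref, fare_shifted)
print('psi, same distribution:', round(psi_same, 4))
print('psi, fare x3          :', round(psi_shifted, 4))
```

The PSI stays near zero when both windows come from the same distribution and grows as the windows diverge, so it can flag an input shift even when accuracy on a small labelled sample does not move.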


---
## Submission Checklist
- Baseline trained + evaluated
- Leakage demo shown
- Slice metrics reported
- Shift simulation discussed
