# Debug Drill 04: The Overfitting Trap

**Symptom:** Your colleague's LightGBM model scores 0.99 AUC on the training data but only 0.58 AUC on the holdout test set.

**Your task:** Find the bug, fix it, and write a postmortem.

**Time:** 15 minutes

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import lightgbm as lgb
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load data
df = pd.read_csv('https://raw.githubusercontent.com/189investmentai/ml-foundations-interactive/main/shared/data/streamcart_customers.csv')

features = ['tenure_months', 'logins_last_30d', 'orders_last_30d', 
            'support_tickets_last_30d', 'nps_score']
X = df[features].fillna(0)
y = df['churn_30d']

In [None]:
# ===== COLLEAGUE'S CODE (CONTAINS BUG) =====

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train LightGBM with "optimized" parameters
model = lgb.LGBMClassifier(
    n_estimators=2000,      # Lots of trees!
    num_leaves=256,         # Very deep trees!
    max_depth=-1,           # No limit on depth!
    min_child_samples=1,    # Can fit single examples!
    learning_rate=0.3,      # Fast learning!
    random_state=42,
    verbose=-1
)

model.fit(X_train, y_train)

# Check performance
train_pred = model.predict_proba(X_train)[:, 1]
test_pred = model.predict_proba(X_test)[:, 1]

print(f"Training AUC: {roc_auc_score(y_train, train_pred):.4f}")
print(f"Test AUC: {roc_auc_score(y_test, test_pred):.4f}")

## Your Investigation

**Q1:** What's the gap between training and test AUC? What does this indicate?

In [None]:
# TODO: Calculate and interpret the gap
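
To make the gap concrete, here is a minimal sketch that computes AUC by hand on made-up toy scores and compares training against test. The `pairwise_auc` helper and the toy lists are hypothetical and only for illustration; on the real data you would keep using `roc_auc_score`. The 0.10 limit is borrowed from the Self-Check cell of this drill, not from any general rule.

```python
def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a win. Quadratic in the number of rows,
    so this is for illustration only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (made up): the model ranks the training rows perfectly
# but makes a ranking mistake on unseen rows.
toy_train_auc = pairwise_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
toy_test_auc = pairwise_auc([0, 1, 0, 1], [0.4, 0.3, 0.2, 0.6])
toy_gap = toy_train_auc - toy_test_auc

GAP_LIMIT = 0.10  # same threshold the Self-Check cell uses
print(f"train={toy_train_auc:.2f} test={toy_test_auc:.2f} gap={toy_gap:.2f}")
print("likely overfitting" if toy_gap > GAP_LIMIT else "gap looks acceptable")
```

A large positive gap means the model ranks rows it has already seen much better than new rows, which is the signature of memorization rather than learning.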


**Q2:** Which hyperparameters are causing overfitting? List at least 3 problematic settings.

In [None]:
# TODO: Identify the problematic parameters
# 1. 
# 2. 
# 3. 
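
One way to reason about tree capacity while reviewing the settings: a binary tree with a given number of leaves has a shallowest and a deepest possible shape. LightGBM grows trees leaf-wise, so without a depth cap a tree can be far from balanced. The sketch below is plain arithmetic; `depth_bounds` is a hypothetical helper, not a LightGBM function.

```python
import math


def depth_bounds(num_leaves, max_depth=-1):
    """Return (shallowest, deepest) possible depth for a binary tree.

    Depth is counted in splits from the root to a leaf.
    A perfectly balanced tree is the shallowest shape; a chain that
    splits one leaf at a time is the deepest. A positive max_depth
    caps the deepest shape, mirroring how max_depth=-1 means "no cap".
    """
    if num_leaves < 2:
        raise ValueError("a tree needs at least 2 leaves to have a split")
    shallowest = math.ceil(math.log2(num_leaves))
    deepest = num_leaves - 1
    if max_depth > 0:
        deepest = min(deepest, max_depth)
    return shallowest, deepest


print(depth_bounds(256))      # settings from the colleague's model
print(depth_bounds(31, 6))    # settings from the fixed model in Q3
```

With `num_leaves=256` and no depth cap, each tree can carve out very narrow regions of feature space, and `min_child_samples=1` lets a region contain a single row.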

## Fix the Bug

**Q3:** Retrain with proper regularization AND early stopping.

In [None]:
# Fix the model: use reasonable hyperparameters and early stopping

# Split into train/val/test.
# 20% is held out for test; 20% of the remaining 80% becomes validation,
# so the final proportions are 64% train / 16% validation / 20% test.
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.2, random_state=42)

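model_fixed = lgb.LGBMClassifier(
    n_estimators=1000,       # Upper bound only; early stopping picks the actual count
    num_leaves=31,           # LightGBM default, far fewer than 256
    max_depth=6,             # Cap the depth instead of leaving it unlimited
    min_child_samples=20,    # LightGBM default; every leaf needs at least 20 rows
    learning_rate=0.05,      # Smaller steps than 0.3, so each tree contributes less
    random_state=42,
    verbose=-1
)

# Early stopping: stop adding trees once the validation metric has not
# improved for 50 rounds. No eval_metric is set, so LightGBM monitors its
# default metric for binary classification (binary log loss), not AUC.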
model_fixed.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(stopping_rounds=50, verbose=False)]
)

# Check performance
train_pred = model_fixed.predict_proba(X_train)[:, 1]
val_pred = model_fixed.predict_proba(X_val)[:, 1]
test_pred = model_fixed.predict_proba(X_test)[:, 1]

print(f"Training AUC: {roc_auc_score(y_train, train_pred):.4f}")
print(f"Validation AUC: {roc_auc_score(y_val, val_pred):.4f}")
print(f"Test AUC: {roc_auc_score(y_test, test_pred):.4f}")
print(f"\nTrees used: {model_fixed.best_iteration_}")
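
The stopping rule behind `stopping_rounds=50` can be sketched in a few lines of plain Python. This is a simplified illustration of patience-based early stopping, not LightGBM's implementation; `best_round_with_early_stopping` and the score list are hypothetical, and the sketch assumes a higher-is-better score such as AUC (for a loss, the comparison flips).

```python
def best_round_with_early_stopping(val_scores, stopping_rounds):
    """Return (best_round, best_score) under a patience-based stopping rule.

    val_scores: one validation score per boosting round, higher is better.
    stopping_rounds: stop once this many rounds pass without a new best.
    Rounds are numbered from 1.
    """
    best_score = float("-inf")
    best_round = 0
    for round_idx, score in enumerate(val_scores, start=1):
        if score > best_score:
            best_score = score
            best_round = round_idx
        elif round_idx - best_round >= stopping_rounds:
            break  # patience exhausted; later rounds are never evaluated
    return best_round, best_score


# Made-up validation scores: improvement stalls after round 3.
toy_scores = [0.60, 0.65, 0.70, 0.69, 0.68, 0.71, 0.70]
print(best_round_with_early_stopping(toy_scores, stopping_rounds=2))
```

In this toy run, training stops at round 5, so the higher score at round 6 is never reached. That is the trade-off of a short patience window, and one reason a value like 50 is used rather than a handful of rounds.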

## Self-Check

In [None]:
# Verify the fix worked: small train/test gap and a usable test AUC
train_auc = roc_auc_score(y_train, train_pred)
test_auc = roc_auc_score(y_test, test_pred)
gap = abs(train_auc - test_auc)
assert gap < 0.10, f"Gap still too large: {gap:.4f}"
assert test_auc > 0.65, f"Test AUC too low: {test_auc:.4f}"
print("PASS: Overfitting controlled!")

## Postmortem

Write 3 bullets:
1. **Root cause:** 
2. **How we detected it:** 
3. **Prevention for next time:** 