# Solution: Debug Drill 01 - The Too-Good Model

**Bug:** Preprocessing leakage - StandardScaler was fit on ALL data before the train/test split.

---

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

np.random.seed(42)

In [None]:
DATA_URL = 'https://raw.githubusercontent.com/189investmentai/ml-foundations-interactive/main/shared/data/'
df = pd.read_csv(DATA_URL + 'streamcart_customers.csv')

def engineer_features(data):
    df_feat = data.copy()
    df_feat['orders_per_month'] = df_feat['orders_total'] / (df_feat['tenure_months'] + 1)
    df_feat['spend_per_order'] = df_feat['total_spend'] / (df_feat['orders_total'] + 1)
    df_feat['login_intensity'] = df_feat['logins_last_30d'] / 30
    df_feat['engagement_ratio'] = df_feat['logins_per_month_avg'] / (df_feat['orders_per_month'] + 0.1)
    # days_since_last_order already present in raw data
    df_feat['tickets_per_tenure'] = df_feat['support_tickets_total'] / (df_feat['tenure_months'] + 1)
    return df_feat

df_engineered = engineer_features(df)

feature_cols = [
    'tenure_months', 'orders_per_month', 'spend_per_order',
    'login_intensity', 'engagement_ratio', 'days_since_last_order',
    'tickets_per_tenure', 'avg_order_value'
]

## The Bug: Preprocessing Before Split

In [None]:
# BUGGY PIPELINE
X = df_engineered[feature_cols].fillna(0)
y = df_engineered['churn_30d']

# BUG: Scaler fit on ALL data before split
scaler_buggy = StandardScaler()
X_scaled_buggy = scaler_buggy.fit_transform(X)

X_train_buggy, X_test_buggy, y_train_buggy, y_test_buggy = train_test_split(
    X_scaled_buggy, y, test_size=0.2, random_state=42, stratify=y
)

model_buggy = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
model_buggy.fit(X_train_buggy, y_train_buggy)

auc_buggy = roc_auc_score(y_test_buggy, model_buggy.predict_proba(X_test_buggy)[:, 1])
print(f"Buggy AUC: {auc_buggy:.3f} (inflated)")

## The Fix: Split First, Then Preprocess

In [None]:
# FIXED PIPELINE
X_raw = df_engineered[feature_cols].fillna(0)
y = df_engineered['churn_30d']

# Step 1: Split FIRST
X_train_raw, X_test_raw, y_train_fixed, y_test_fixed = train_test_split(
    X_raw, y, test_size=0.2, random_state=42, stratify=y
)

# Step 2: Fit scaler on TRAIN ONLY
scaler_fixed = StandardScaler()
X_train_fixed = scaler_fixed.fit_transform(X_train_raw)  # fit + transform
X_test_fixed = scaler_fixed.transform(X_test_raw)        # transform only

# Step 3: Train and evaluate
model_fixed = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
model_fixed.fit(X_train_fixed, y_train_fixed)

auc_fixed = roc_auc_score(y_test_fixed, model_fixed.predict_proba(X_test_fixed)[:, 1])
print(f"Fixed AUC: {auc_fixed:.3f} (realistic)")

In [None]:
print("\n=== Comparison ===")
print(f"Buggy AUC:  {auc_buggy:.3f}")
print(f"Fixed AUC:  {auc_fixed:.3f}")
print(f"Difference: {auc_buggy - auc_fixed:.3f} points")
print(f"\nThe buggy pipeline inflated AUC by {(auc_buggy - auc_fixed) / auc_fixed * 100:.1f}%")

## Diagnosis

**The bug is in stage:** Preprocessing (Stage 3)

**The problem is:** StandardScaler was fit on ALL data before the train/test split, so the test set statistics leaked into the training process.

**This is called:** Preprocessing leakage (a type of data leakage)

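**Why it causes inflated AUC:** The scaler's mean and standard deviation were computed from every row, test rows included. The scaled test features are therefore centered and scaled using information that would not be available in production, so the test set is no longer a clean stand-in for unseen data and the reported AUC is optimistic. How large the gap is depends on the model and the data: tree ensembles such as random forests are largely insensitive to feature scaling, while distance-based and regularized linear models are more exposed.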
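To make the leaked statistics concrete, the sketch below fits one scaler on all rows and one on the training rows only, then compares what each learned. It uses synthetic data as a stand-in (an assumption, not the StreamCart dataset), and the names `X_demo`, `scaler_all`, and `scaler_train` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (assumption: not the real data)
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))

X_train_demo, X_test_demo = train_test_split(
    X_demo, test_size=0.2, random_state=42
)

# Buggy ordering: statistics come from train and test rows combined
scaler_all = StandardScaler().fit(X_demo)
# Fixed ordering: statistics come from the training rows only
scaler_train = StandardScaler().fit(X_train_demo)

print("mean  (all rows):  ", scaler_all.mean_)
print("mean  (train only):", scaler_train.mean_)
print("scale (all rows):  ", scaler_all.scale_)
print("scale (train only):", scaler_train.scale_)

# The same test rows land on different values depending on the scaler used
max_shift = np.abs(
    scaler_all.transform(X_test_demo) - scaler_train.transform(X_test_demo)
).max()
print(f"Largest difference in scaled test values: {max_shift:.4f}")
```

The two sets of statistics differ only because the first scaler saw the test rows. On a large, well-shuffled dataset the difference is usually small, which is part of why this kind of leakage is easy to miss.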

## Postmortem

### What happened:
- Model achieved an AUC of 0.97, which is unrealistically high for churn prediction
- This triggered investigation of the pipeline

### Root cause:
- Preprocessing leakage: StandardScaler was fit on all data before train/test split
- Test set statistics (mean, std) leaked into the preprocessing step
- The bug was at the Stage 3-4 boundary, not in the model itself
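One way to catch this root cause after the fact is to check how many rows a fitted scaler actually saw. The sketch below relies on StandardScaler's `n_samples_seen_` attribute; the helper `scaler_fit_on_train_only` and the synthetic arrays are hypothetical, introduced only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

def scaler_fit_on_train_only(scaler, n_train_rows):
    """Return True if the fitted scaler saw exactly n_train_rows rows.

    Hypothetical helper: a row-count heuristic, not proof of correctness.
    """
    return bool(np.all(scaler.n_samples_seen_ == n_train_rows))

# Synthetic stand-in data (assumption: not the real dataset)
rng = np.random.default_rng(0)
X_all = rng.normal(size=(500, 4))
X_train, X_test = X_all[:400], X_all[400:]

leaky = StandardScaler().fit(X_all)    # fit before the split
clean = StandardScaler().fit(X_train)  # fit on the training rows only

print(scaler_fit_on_train_only(leaky, len(X_train)))  # False
print(scaler_fit_on_train_only(clean, len(X_train)))  # True
```

The check only compares row counts, so a scaler fit on the wrong 400 rows would still pass. It is best treated as a cheap assertion next to the split, not a substitute for reviewing the pipeline order.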

### How to prevent:
- Always split BEFORE any preprocessing that "learns" from data
- Use sklearn Pipelines to enforce correct ordering
- Add a red flag alert for AUC > 0.90 on typical business problems
- Code review checklist: "Is any .fit() called before the split?"
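The Pipeline and red-flag items above can be sketched together. The example wraps the scaler and the model in a scikit-learn `Pipeline`, so that calling `fit` on the training split is the only place the scaler learns anything. The data comes from `make_classification` as a stand-in (an assumption), and `churn_pipeline` and `AUC_RED_FLAG` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

AUC_RED_FLAG = 0.90  # threshold taken from the prevention list

# Synthetic stand-in data (assumption: not the real dataset)
X_syn, y_syn = make_classification(n_samples=1000, n_features=8, random_state=42)

# Split FIRST, on raw features
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# The pipeline fits the scaler inside .fit(), on the training rows only
churn_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)),
])
churn_pipeline.fit(X_tr, y_tr)

# At predict time the pipeline applies transform only, never fit
auc = roc_auc_score(y_te, churn_pipeline.predict_proba(X_te)[:, 1])
print(f"Pipeline AUC: {auc:.3f}")

if auc > AUC_RED_FLAG:
    print(f"Red flag: AUC above {AUC_RED_FLAG:.2f}, review the pipeline for leakage")
```

Synthetic data of this kind is often easy to separate, so the red flag may fire here without any leakage. The threshold is a prompt to investigate, not a verdict. The same pipeline object can be passed to `cross_val_score`, which refits the scaler inside each fold.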