# ANSWER KEY: Debug Drill 01 - Data Leakage

**Bug:** Model uses features that wouldn't be available at prediction time:
- `has_cancel_reason` - only known AFTER the customer cancels
- `days_until_churn` - requires knowing the future churn date

**Key Lesson:** Always ask "Would I have this data at prediction time?"

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, precision_score

# Load data
df = pd.read_csv('https://raw.githubusercontent.com/189investmentai/ml-foundations-interactive/main/streamcart_customers.csv')

## The Bug (Colleague's Code)

In [None]:
# ===== BUGGY CODE =====
# These features are LEAKY:

df['has_cancel_reason'] = df['cancel_reason'].notna().astype(int)
# ^ LEAKAGE: cancel_reason only exists AFTER they cancel!

df['days_until_churn'] = pd.to_datetime(df['churn_date']).sub(
    pd.to_datetime(df['snapshot_date'])
).dt.days.fillna(999)
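# ^ LEAKAGE: We don't know churn_date at prediction time!
#   If churn_date is missing for customers who never churned, fillna(999)
#   tags every non-churner with the same sentinel value.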

features_buggy = [
    'tenure_months',
    'logins_last_30d',
    'support_tickets_last_30d',
    'has_cancel_reason',      # LEAKY
    'days_until_churn'        # LEAKY
]

## Why This Is Wrong

| Feature | Problem |
|---------|--------|
| `has_cancel_reason` | Only populated AFTER the customer cancels. At prediction time it is always 0 for active customers, so the signal the model learned disappears in production. |
| `days_until_churn` | Requires knowing the future churn date. We're trying to PREDICT churn, not use the answer! |

**The giveaway:** An AUC near 1.0 on a churn problem almost always signals leakage. Real-world churn models typically achieve 0.65-0.80.
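One way to see the leak directly is to score each feature on its own. The sketch below is a toy illustration, not part of the drill: the six rows are invented, and `single_feature_auc` is a hypothetical helper. It rebuilds the two leaky columns the same way the buggy cell does and compares them with an ordinary behavioural feature.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Invented toy rows; the column names mirror the drill's dataset.
toy = pd.DataFrame({
    "snapshot_date": ["2024-01-01"] * 6,
    "churn_date": ["2024-01-10", None, "2024-01-25", None, None, None],
    "cancel_reason": ["too expensive", None, "not using", None, None, None],
    "logins_last_30d": [2, 9, 12, 4, 15, 7],
    "churn_30d": [1, 0, 1, 0, 0, 0],
})

# Same derivations as the buggy cell.
toy["has_cancel_reason"] = toy["cancel_reason"].notna().astype(int)
toy["days_until_churn"] = (
    pd.to_datetime(toy["churn_date"]) - pd.to_datetime(toy["snapshot_date"])
).dt.days.fillna(999)


def single_feature_auc(feature, target):
    """AUC of one raw feature used as the score, ignoring direction."""
    auc = roc_auc_score(target, feature)
    return max(auc, 1 - auc)


aucs = {
    name: single_feature_auc(toy[name], toy["churn_30d"])
    for name in ["logins_last_30d", "has_cancel_reason", "days_until_churn"]
}
for name, auc in aucs.items():
    print(f"{name:20s} AUC = {auc:.3f}")
```

In this toy data both leaky columns separate churners from non-churners perfectly on their own, while `logins_last_30d` does not. A single feature that scores close to 1.0 by itself is a strong hint that it encodes the label.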

## The Fix

In [None]:
# ===== FIXED CODE =====
# Only use features available at prediction time

features_fixed = [
    'tenure_months',           # Known at prediction time
    'logins_last_30d',         # Historical - OK
    'support_tickets_last_30d' # Historical - OK
]

X = df[features_fixed].fillna(0)
y = df['churn_30d']

# stratify=y keeps the churn rate the same in the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model_fixed = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
model_fixed.fit(X_train, y_train)

y_proba = model_fixed.predict_proba(X_test)[:, 1]
auc_fixed = roc_auc_score(y_test, y_proba)

print(f"Fixed AUC: {auc_fixed:.3f}")
print("\nThis is realistic! The model has actual predictive power,")
print("not just the answer key hidden in the features.")

In [None]:
# Self-check
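# A plausible AUC range rules out obvious leakage; it does not prove there is none.
assert 0.55 < auc_fixed < 0.90, f"AUC {auc_fixed:.3f} is outside the plausible range"
print("PASS: AUC is in a plausible range, no obvious leakage.")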

## Completed Postmortem

### What happened:
- The model achieved 0.99 AUC by using features derived from future information (`cancel_reason`, `churn_date`)
- These features predict churn perfectly because they are records of the churn event itself

### Root cause:
- No "prediction time" audit of features
- Colleague included all available columns without asking "when would I know this?"

### How to prevent:
- For each feature, explicitly document: "Available at prediction time? Yes/No"
- Be suspicious of AUC > 0.90 on real business problems
- Have a second person review feature engineering before training
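The first prevention step can be made mechanical. The following is a minimal sketch, assuming a hand-maintained registry: `FEATURE_AUDIT` and `select_safe_features` are hypothetical names, not part of the drill. Every feature must be documented before it can reach training.

```python
# Hypothetical audit registry: feature name -> available at prediction time?
FEATURE_AUDIT = {
    "tenure_months": True,
    "logins_last_30d": True,
    "support_tickets_last_30d": True,
    "has_cancel_reason": False,   # only exists after cancellation
    "days_until_churn": False,    # derived from the future churn date
}


def select_safe_features(candidates, audit):
    """Return the candidates marked as available at prediction time.

    Raises ValueError if any candidate has no audit entry, so an
    undocumented feature cannot slip into training unnoticed.
    """
    undocumented = [name for name in candidates if name not in audit]
    if undocumented:
        raise ValueError(f"No audit entry for: {undocumented}")
    return [name for name in candidates if audit[name]]


candidates = [
    "tenure_months",
    "logins_last_30d",
    "support_tickets_last_30d",
    "has_cancel_reason",
    "days_until_churn",
]
print(select_safe_features(candidates, FEATURE_AUDIT))
```

Failing loudly on undocumented features is a deliberate choice here: it forces the "when would I know this?" question to be answered for every new column rather than only for the ones someone remembers to check.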