# Debug Drill 05: The Leaky Feature

**Symptom:** After adding "smart" engineered features, the model's AUC jumped from 0.72 to 0.98. The PM is thrilled, but a gain that large from feature engineering alone is a classic sign of data leakage.

**Your task:** Find the leaky feature, remove it, and write a postmortem.

**Time:** 15 minutes

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Load data
df = pd.read_csv('https://raw.githubusercontent.com/189investmentai/ml-foundations-interactive/main/shared/data/streamcart_customers.csv')

In [None]:
# ===== COLLEAGUE'S CODE (CONTAINS BUG) =====

# Original features
df_eng = df.copy()

# "Smart" feature engineering
eps = 0.01

# Ratio features (these are fine)
df_eng['orders_per_month'] = df['orders_last_30d'] / (df['tenure_months'] + eps)
df_eng['login_rate'] = df['logins_last_30d'] / 30

# Behavioral features (these are fine)
df_eng['support_heavy'] = (df['support_tickets_last_30d'] > 2).astype(int)
df_eng['low_engagement'] = (df['logins_last_30d'] < 5).astype(int)

# NEW "brilliant" feature (THIS IS THE BUG)
# "Days until cancellation" - colleague thought this captures urgency
df_eng['days_to_cancel'] = pd.to_datetime(df['churn_date']).sub(
    pd.to_datetime(df['snapshot_date'])
).dt.days.fillna(999)

# Another leaky feature - derived from future info
df_eng['will_contact_support'] = (df['cancel_reason'].notna()).astype(int)

features = [
    'tenure_months', 'logins_last_30d', 'orders_last_30d',
    'orders_per_month', 'login_rate', 'support_heavy', 'low_engagement',
    'days_to_cancel',        # LEAKY!
    'will_contact_support'   # LEAKY!
]

X = df_eng[features].fillna(0)
y = df_eng['churn_30d']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

test_pred = model.predict_proba(X_test)[:, 1]
print(f"Test AUC: {roc_auc_score(y_test, test_pred):.4f}")
print("\nWow! Amazing performance!")

## Your Investigation

**Q1:** Look at the feature coefficients. Which features have suspiciously large coefficients?

In [None]:
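# Print coefficients sorted by absolute magnitude (largest first).
# Caveat: the features are unscaled, so raw magnitudes are not directly comparable.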
for feat, coef in sorted(zip(features, model.coef_[0]), key=lambda x: -abs(x[1])):
    print(f"{feat:25} {coef:+.4f}")
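
A leaky feature on a large scale (such as a day count padded with 999) can carry a small raw coefficient and still dominate the model. One way to make coefficients comparable is to standardize the features before fitting. The sketch below uses synthetic data rather than the StreamCart file, and the names `honest` and `leaky_days` are hypothetical stand-ins, so treat it as an illustration of the idea and not as the drill's answer key.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 2000
y = rng.integers(0, 2, size=n)

# Hypothetical features: a weak honest signal, and a leaky one on a large scale
# (positives get a small day count, negatives get the 999 sentinel).
honest = rng.normal(size=n) + 0.3 * y
leaky = np.where(y == 1, rng.integers(1, 30, size=n), 999).astype(float)
X = np.column_stack([honest, leaky])
names = ["honest", "leaky_days"]

# Standardize so each coefficient is expressed per standard deviation of its feature.
X_std = StandardScaler().fit_transform(X)
model_std = LogisticRegression(max_iter=1000).fit(X_std, y)

std_coefs = dict(zip(names, model_std.coef_[0]))
for name, coef in sorted(std_coefs.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name:15} {coef:+.4f}")
```

On standardized inputs the leaky column should stand out clearly. The same pattern could be applied to `X_train` from the cell above, fitting the scaler on the training split only.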

**Q2:** For each suspicious feature, answer: "Would we know this value at prediction time?"

In [None]:
# TODO: Analyze each feature
# days_to_cancel: Would we know this at prediction time? 
# will_contact_support: Would we know this at prediction time? 
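
The prediction-time question is ultimately about where each value comes from, but a quick numeric screen can point at which features to question first. A feature that separates the classes almost perfectly on its own deserves scrutiny. The sketch below scores each feature alone with AUC on synthetic data. The helper `single_feature_auc` and the 0.95 cutoff are assumptions chosen for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 1000
churn = rng.integers(0, 2, size=n)

# Synthetic stand-in data: one mildly predictive feature and one leaky feature.
demo = pd.DataFrame({
    "logins": rng.poisson(10, size=n) - 2 * churn,
    "days_to_cancel": np.where(churn == 1, rng.integers(1, 30, size=n), 999),
    "churn": churn,
})

def single_feature_auc(frame, feature_cols, target_col):
    """AUC of each feature used alone as a score, folded so 0.5 means no signal."""
    scores = {}
    for col in feature_cols:
        auc = roc_auc_score(frame[target_col], frame[col])
        scores[col] = max(auc, 1 - auc)  # direction of the relationship does not matter
    return pd.Series(scores).sort_values(ascending=False)

aucs = single_feature_auc(demo, ["logins", "days_to_cancel"], "churn")
print(aucs)

# Hypothetical cutoff: anything above 0.95 on its own gets a manual review.
suspects = aucs[aucs > 0.95].index.tolist()
print("Review these first:", suspects)
```

A high single-feature AUC is a prompt for investigation, not proof of leakage. The deciding test remains whether the value would exist at prediction time.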

## Fix the Bug

**Q3:** Remove the leaky features and retrain.

In [None]:
# TODO: Create a clean feature list without leaky features
features_clean = [
    # TODO: Keep only safe features
]

X_clean = df_eng[features_clean].fillna(0)
X_train, X_test, y_train, y_test = train_test_split(X_clean, y, test_size=0.2, random_state=42)

model_clean = LogisticRegression(max_iter=1000)
model_clean.fit(X_train, y_train)

test_pred = model_clean.predict_proba(X_test)[:, 1]
print(f"Test AUC (clean): {roc_auc_score(y_test, test_pred):.4f}")
print("\nThis is the REAL performance.")

## Self-Check

In [None]:
# Verify the leaky features are gone and the AUC is back in a plausible range
auc_clean = roc_auc_score(y_test, test_pred)
assert 'days_to_cancel' not in features_clean, "Still using leaky feature: days_to_cancel"
assert 'will_contact_support' not in features_clean, "Still using leaky feature: will_contact_support"
assert auc_clean < 0.90, "AUC still suspiciously high"
assert auc_clean > 0.60, "AUC too low - check features"
print("PASS: Leakage removed!")

## Postmortem

Write 3 bullets:
1. **Root cause:** 
2. **How we detected it:** 
3. **Prevention for next time:** 
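
For the prevention bullet, one option is an automated guard that records which source columns each engineered feature reads and fails when any of them is only known after the snapshot. The sketch below is a minimal version of that idea. The denylist, the `lineage` mapping, and the helper `check_feature_lineage` are all hypothetical, built from the column names used in the colleague's cell.

```python
# Hypothetical denylist: source columns that are only populated after the snapshot.
POST_SNAPSHOT_COLUMNS = {"churn_date", "cancel_reason"}

def check_feature_lineage(lineage, denylist=POST_SNAPSHOT_COLUMNS):
    """Return the features whose source columns overlap the denylist."""
    return sorted(
        feature
        for feature, sources in lineage.items()
        if set(sources) & set(denylist)
    )

# Feature -> source columns, maintained by hand alongside the feature code.
lineage = {
    "orders_per_month": ["orders_last_30d", "tenure_months"],
    "login_rate": ["logins_last_30d"],
    "days_to_cancel": ["churn_date", "snapshot_date"],
    "will_contact_support": ["cancel_reason"],
}

flagged = check_feature_lineage(lineage)
print("Features built from post-snapshot columns:", flagged)
```

A check like this only works if the lineage mapping is kept current, so it complements code review of new features and does not replace it.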