# Solution: Debug Drill 02 - The Leaky Features

**Bugs:**
1. `churn_risk_score`: target leakage (the formula uses `churned` directly)
2. `customer_value`: temporal leakage (uses `total_spend`, which includes future purchases)

---

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

np.random.seed(42)

In [None]:
DATA_URL = 'https://raw.githubusercontent.com/189investmentai/ml-foundations-interactive/main/shared/data/'
df = pd.read_csv(DATA_URL + 'streamcart_customers.csv')

def create_features(data):
    df_feat = data.copy()
    df_feat['orders_per_month'] = df_feat['orders_total'] / (df_feat['tenure_months'] + 1)
    df_feat['login_frequency'] = df_feat['logins_last_30d'] / 30
    # BUG 1: Target leakage
    df_feat['churn_risk_score'] = df_feat['days_since_last_purchase'] / 30 + (1 - df_feat['churned']) * 0.5
    # BUG 2: Temporal leakage
    df_feat['customer_value'] = df_feat['total_spend']
    return df_feat

df_features = create_features(df)

## Buggy Model

In [None]:
feature_cols_buggy = [
    'tenure_months', 'orders_total', 'logins_last_30d',
    'orders_per_month', 'login_frequency',
    'churn_risk_score',   # BUG 1
    'customer_value',     # BUG 2
    'support_tickets_total', 'avg_order_value'
]

X_buggy = df_features[feature_cols_buggy].fillna(0)
y = df_features['churned']

X_train_buggy, X_test_buggy, y_train, y_test = train_test_split(
    X_buggy, y, test_size=0.2, random_state=42, stratify=y
)

model_buggy = GradientBoostingClassifier(n_estimators=100, max_depth=5, random_state=42)
model_buggy.fit(X_train_buggy, y_train)

auc_buggy = roc_auc_score(y_test, model_buggy.predict_proba(X_test_buggy)[:, 1])
print(f"Buggy AUC: {auc_buggy:.3f} (inflated)")

## Investigation

In [None]:
print("=== Feature Correlations with Target ===")
correlations = df_features[feature_cols_buggy + ['churned']].corr()['churned'].drop('churned')
print(correlations.sort_values(key=abs, ascending=False).round(3))

print("\n=== Feature Importances ===")
for feat, imp in sorted(zip(feature_cols_buggy, model_buggy.feature_importances_), key=lambda x: -x[1]):
    flag = "⚠️ SUSPICIOUS" if imp > 0.2 else ""
    print(f"  {feat:25s}: {imp:.3f} {flag}")
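Correlation only captures linear relationships, so a complementary check is to score each feature on its own. The sketch below uses each feature's raw values as a ranking score and reports its single-feature AUC. The helper name `single_feature_auc` and the synthetic `demo` frame are assumptions for illustration and are not part of the drill's data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score


def single_feature_auc(frame, feature_cols, target_col):
    """Score each feature alone by using its raw values as the ranking score."""
    y = frame[target_col]
    scores = {}
    for col in feature_cols:
        auc = roc_auc_score(y, frame[col].fillna(0))
        # A feature that ranks the classes in reverse is just as informative,
        # so fold values below 0.5 back above it.
        scores[col] = max(auc, 1 - auc)
    return pd.Series(scores).sort_values(ascending=False)


# Synthetic stand-in data (hypothetical, not the StreamCart file).
rng = np.random.default_rng(0)
demo = pd.DataFrame({"churned": rng.integers(0, 2, size=500)})
demo["noise"] = rng.normal(size=500)
# Mimics the BUG 1 term: the label is baked into the feature.
demo["leaky"] = (1 - demo["churned"]) * 0.5 + rng.normal(scale=0.05, size=500)

demo_auc = single_feature_auc(demo, ["noise", "leaky"], "churned")
print(demo_auc.round(3))
```

A single feature that separates the classes almost perfectly by itself is worth a manual review before it goes into a model.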

## Fixed Model

In [None]:
feature_cols_fixed = [
    'tenure_months', 'orders_total', 'logins_last_30d',
    'orders_per_month', 'login_frequency',
    # 'churn_risk_score',  # REMOVED - target leakage
    # 'customer_value',    # REMOVED - temporal leakage
    'support_tickets_total', 'avg_order_value'
]

X_fixed = df_features[feature_cols_fixed].fillna(0)

X_train_fixed, X_test_fixed, y_train_fixed, y_test_fixed = train_test_split(
    X_fixed, y, test_size=0.2, random_state=42, stratify=y
)

model_fixed = GradientBoostingClassifier(n_estimators=100, max_depth=5, random_state=42)
model_fixed.fit(X_train_fixed, y_train_fixed)

auc_fixed = roc_auc_score(y_test_fixed, model_fixed.predict_proba(X_test_fixed)[:, 1])
print(f"Fixed AUC: {auc_fixed:.3f} (realistic)")

In [None]:
print("\n=== Comparison ===")
print(f"Buggy AUC:  {auc_buggy:.3f}")
print(f"Fixed AUC:  {auc_fixed:.3f}")
print(f"Difference: {auc_buggy - auc_fixed:.3f} points")
print(f"\nThe leaky features inflated AUC by {(auc_buggy - auc_fixed) / auc_fixed * 100:.0f}%")

## Diagnosis

### Bug 1 - churn_risk_score
- **Type of leakage:** Target leakage
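- **Why it's wrong:** The term `(1 - churned) * 0.5` puts the target inside the feature: every customer who did not churn gets an extra 0.5 added to their score. The model can read the label back out of the feature, which is circular, and `churned` is not known at prediction time.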
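One way to confirm this kind of leak, sketched here under assumptions, is to recompute the feature after flipping the labels and see whether it changes. `buggy_score`, `recency_score`, and `uses_target` are hypothetical helper names, and the toy frame only borrows the notebook's column names.

```python
import pandas as pd


def buggy_score(data):
    # Same formula as BUG 1 in create_features.
    return data["days_since_last_purchase"] / 30 + (1 - data["churned"]) * 0.5


def recency_score(data):
    # Target-free alternative (an assumption; the notebook's fix
    # simply drops the feature instead).
    return data["days_since_last_purchase"] / 30


def uses_target(feature_fn, data, target_col="churned"):
    """Return True if the feature changes when the binary labels are flipped."""
    flipped = data.assign(**{target_col: 1 - data[target_col]})
    return not feature_fn(data).equals(feature_fn(flipped))


toy = pd.DataFrame({
    "days_since_last_purchase": [15, 60, 90],
    "churned": [0, 1, 1],
})
print(uses_target(buggy_score, toy))    # True
print(uses_target(recency_score, toy))  # False
```

A feature that is truly independent of the target produces identical values no matter how the labels are set, so any change after the flip points to a leak.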

### Bug 2 - customer_value
- **Type of leakage:** Temporal leakage
- **Why it's wrong:** `total_spend` includes purchases made after the prediction date. Customers who don't churn keep buying (higher value), while customers who churn stop buying (lower value), so the feature encodes the outcome.
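If order-level data were available, a point-in-time version of this feature could be computed instead. The sketch below assumes a hypothetical `orders` table with `customer_id`, `order_date`, and `amount` columns and a hypothetical prediction `cutoff`; none of these appear in the drill's dataset.

```python
import pandas as pd


def spend_as_of(orders, cutoff):
    """Sum each customer's order amounts dated strictly before the cutoff."""
    past = orders[orders["order_date"] < cutoff]
    return past.groupby("customer_id")["amount"].sum().rename("spend_to_date")


# Hypothetical order history for two customers.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2],
    "order_date": pd.to_datetime(
        ["2024-01-10", "2024-03-05", "2024-01-20", "2024-02-25"]
    ),
    "amount": [20.0, 35.0, 50.0, 10.0],
})
cutoff = pd.Timestamp("2024-02-01")

spend = spend_as_of(orders, cutoff)
print(spend)
```

With this cutoff, customer 1 is credited with 20.0 and customer 2 with 50.0; the later orders are excluded because they would not have been known on the prediction date. Customers with no orders before the cutoff do not appear in the result and would need to be filled with 0 when joined back.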

## Postmortem

### What happened:
- The buggy model achieved 0.96 AUC, far above the typical 0.70-0.85 for churn models
- Feature importance showed two features dominating

### Root cause:
- Bug 1: `churn_risk_score` used `churned` target in its calculation (target leakage)
- Bug 2: `customer_value` used lifetime `total_spend` including future purchases (temporal leakage)

### How to prevent:
- Add an automated check: any feature whose absolute correlation with the target exceeds 0.7 triggers a review (leaks can be negatively correlated too)
- For every feature, apply the Timeline Test: "Would I have this at prediction time?"
- Never derive features from the target variable
- Code review checklist: search for target column name in feature engineering code
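Two of these checks are easy to automate. The following is a minimal sketch, not a complete guard: `flag_high_correlation` and `lines_mentioning` are hypothetical helper names, the 0.7 threshold is the one suggested above, and the demo frame and code snippet are synthetic.

```python
import numpy as np
import pandas as pd


def flag_high_correlation(frame, feature_cols, target_col, threshold=0.7):
    """Return features whose absolute Pearson correlation with the target exceeds threshold."""
    corr = frame[feature_cols].corrwith(frame[target_col]).abs()
    return corr[corr > threshold].sort_values(ascending=False)


def lines_mentioning(source_text, target_col):
    """Crude text search: 1-based line numbers of source lines containing the target name."""
    return [
        i
        for i, line in enumerate(source_text.splitlines(), start=1)
        if target_col in line
    ]


# Synthetic frame (hypothetical): one harmless feature, one with the label baked in.
rng = np.random.default_rng(1)
check = pd.DataFrame({"churned": rng.integers(0, 2, size=400)})
check["noise"] = rng.normal(size=400)
check["leaky"] = (1 - check["churned"]) * 0.5 + rng.normal(scale=0.05, size=400)

flagged = flag_high_correlation(check, ["noise", "leaky"], "churned")
print(flagged.round(3))

# Feature-engineering code passed in as plain text, e.g. read from a .py file.
snippet = (
    "df_feat['login_frequency'] = df_feat['logins_last_30d'] / 30\n"
    "df_feat['churn_risk_score'] = (1 - df_feat['churned']) * 0.5\n"
)
hits = lines_mentioning(snippet, "churned")
print(hits)
```

Neither check is sufficient alone: the correlation test misses non-linear leaks, and the text search misses a target that arrives under another name. They are cheap enough to run on every feature change as a first filter before the Timeline Test.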