# Debug Drill: The Leaky Features

**Scenario:**
Your team built a churn prediction model. A colleague is proud of the feature engineering.

"I created some amazing features!" they say. "The model gets 0.96 AUC!"

You're suspicious. Typical churn models get 0.70-0.85 AUC.

**Your Task:**
1. Run the pipeline
2. Find the leaky features (there are TWO bugs)
3. Fix them
4. Write a 3-bullet postmortem

---

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)

In [None]:
# Load data
DATA_URL = 'https://raw.githubusercontent.com/189investmentai/ml-foundations-interactive/main/shared/data/'

df = pd.read_csv(DATA_URL + 'streamcart_customers.csv')
print(f"Loaded {len(df):,} customers")
print(f"Churn rate: {df['churn_30d'].mean():.1%}")

In [None]:
# ===== COLLEAGUE'S FEATURE ENGINEERING (CONTAINS BUGS) =====

def create_features(data):
    """Engineer features for churn prediction."""
    df_feat = data.copy()
    
    # Basic features (these are fine; the self-assignments below are no-ops that just list the raw columns used)
    df_feat['tenure_months'] = df_feat['tenure_months']
    df_feat['orders_total'] = df_feat['orders_total']
    df_feat['logins_last_30d'] = df_feat['logins_last_30d']
    
    # Derived features (some are fine, some are buggy)
    df_feat['orders_per_month'] = df_feat['orders_total'] / (df_feat['tenure_months'] + 1)
    df_feat['login_frequency'] = df_feat['logins_last_30d'] / 30
    
    # "Sophisticated" features (colleague is proud of these)
    # BUG 1: This feature is derived from the target!
    df_feat['churn_risk_score'] = (
        df_feat['days_since_last_order'] / 30 +
        (1 - df_feat['churn_30d']) * 0.5  # <-- Uses the target directly!
    )
    
    # BUG 2: This feature uses future information!
    # "Lifetime value" includes ALL purchases, even ones AFTER the prediction date
    df_feat['customer_value'] = df_feat['total_spend']  # <-- Includes future spend
    
    # More features (these are fine)
    df_feat['support_tickets'] = df_feat['support_tickets_total']
    df_feat['avg_order_value'] = df_feat['avg_order_value']
    
    return df_feat

df_features = create_features(df)
print("Features created!")

In [None]:
# Define feature columns
feature_cols = [
    'tenure_months',
    'orders_total', 
    'logins_last_30d',
    'orders_per_month',
    'login_frequency',
    'churn_risk_score',   # <-- BUG 1
    'customer_value',     # <-- BUG 2
    'support_tickets',
    'avg_order_value'
]

X = df_features[feature_cols].fillna(0)
y = df_features['churn_30d']

# Stratified random 80/20 split (note: random, not time-based)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Train: {len(X_train):,}, Test: {len(X_test):,}")

In [None]:
# Train model
model = GradientBoostingClassifier(n_estimators=100, max_depth=5, random_state=42)
model.fit(X_train, y_train)

# Evaluate
y_pred_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred_proba)

print(f"Test AUC: {auc:.3f}")
print("\n🎉 Colleague: 'See? 0.96 AUC! This is our best model ever!'")
print("\n🤔 You: 'That seems... too good. Let me check the features.'")

---

## Your Investigation

0.96 AUC for churn is suspiciously high. Find the leaky features.

### Step 1: Check feature correlations with target

In [None]:
# TODO: Check correlations between features and target
# Suspicious: any feature with |correlation| > 0.5 with the target (strong negative correlations count too)

print("=== Feature Correlations with Target (churn_30d) ===")
correlations = df_features[feature_cols + ['churn_30d']].corr()['churn_30d'].drop('churn_30d')
print(correlations.sort_values(key=abs, ascending=False).round(3))

print("\n🔍 Which features have suspiciously high correlation?")

In [None]:
# Check feature importances
print("=== Feature Importances ===")
importance_df = pd.DataFrame({
    'feature': feature_cols,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

print(importance_df.to_string(index=False))

print("\n🔍 Which features dominate the model? Are they legitimate?")
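Correlations and importances can be complemented by a drop-one-feature ablation: retrain with each feature removed and see how far the test AUC falls. The sketch below is a minimal illustration on synthetic stand-in data, not the StreamCart table; `drop_one_auc`, `honest_feature`, `noise_feature`, and `leaky_feature` are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def drop_one_auc(X, y, seed=42):
    """Retrain once per removed feature and return the test AUC of each run."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y
    )
    results = {}
    for dropped in [None] + list(X.columns):
        cols = [c for c in X.columns if c != dropped]
        clf = GradientBoostingClassifier(
            n_estimators=50, max_depth=3, random_state=seed
        )
        clf.fit(X_tr[cols], y_tr)
        score = roc_auc_score(y_te, clf.predict_proba(X_te[cols])[:, 1])
        label = "(all features)" if dropped is None else f"without {dropped}"
        results[label] = score
    return pd.Series(results)


# Synthetic stand-in data (assumption): one weak honest signal, one pure
# noise column, and one column that is mostly the target plus noise.
rng = np.random.default_rng(0)
n = 2000
y_demo = pd.Series(rng.integers(0, 2, size=n), name="churn_30d")
X_demo = pd.DataFrame({
    "honest_feature": rng.normal(size=n) + 0.5 * y_demo,
    "noise_feature": rng.normal(size=n),
    "leaky_feature": rng.normal(size=n) + 4.0 * y_demo,
})

ablation = drop_one_auc(X_demo, y_demo)
print(ablation.round(3))
```

If removing a single feature takes the AUC from near-perfect down to a typical range, that feature deserves the closest look. In the drill, the same function could be called as `drop_one_auc(X, y)`.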

### Step 2: Investigate the suspicious features

In [None]:
# TODO: Look at how churn_risk_score is calculated
print("=== Investigating churn_risk_score ===")
print("\nFormula:")
print("  churn_risk_score = days_since_last_order/30 + (1 - churn_30d) * 0.5")

print("\n🤔 Wait... '(1 - churn_30d)' means:")
print("  - If churn_30d=1 (yes churn): adds 0")
print("  - If churn_30d=0 (no churn): adds 0.5")
print("\n❌ This feature USES THE TARGET DIRECTLY!")
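To make the leak concrete, the formula can be inverted: subtracting the legitimate term leaves `0.5 * (1 - churn_30d)`, from which the label falls out exactly. The sketch below uses a small synthetic frame (an assumption, with the same column names as the drill) so it runs without the real data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in rows (assumption): only the two columns the formula needs
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "days_since_last_order": rng.integers(0, 120, size=1000),
    "churn_30d": rng.integers(0, 2, size=1000),
})

# Colleague's formula, copied from create_features
demo["churn_risk_score"] = (
    demo["days_since_last_order"] / 30 + (1 - demo["churn_30d"]) * 0.5
)

# Remove the legitimate part; what remains is 0.5 * (1 - churn_30d)
residual = demo["churn_risk_score"] - demo["days_since_last_order"] / 30
recovered = (1 - (residual / 0.5).round()).astype(int)

match_rate = (recovered == demo["churn_30d"]).mean()
print(f"Share of labels recovered from the feature: {match_rate:.1%}")
```

Any feature from which the label can be reconstructed this way is target leakage, no matter how the rest of the formula is dressed up.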

In [None]:
# TODO: Think about customer_value
print("=== Investigating customer_value ===")
print("\nFormula:")
print("  customer_value = total_spend")

print("\n🤔 Question: Does 'total_spend' include purchases AFTER the prediction date?")
print("\nIf we're predicting churn on Jan 1st:")
print("  - total_spend might include purchases from Feb, March, April...")
print("  - A customer who didn't churn kept buying → higher total_spend")
print("  - A customer who churned stopped buying → lower total_spend")

print("\n❌ This feature LEAKS FUTURE INFORMATION!")
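The mechanism described above can be simulated. In the sketch below, spend before the prediction date is deliberately generated independently of churn, and only retained customers keep buying afterwards. Both choices are simplifying assumptions for illustration, and `spend_as_of_date` is a hypothetical column name.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 2000
churn = rng.integers(0, 2, size=n)

# Spend before the prediction date: same distribution for everyone (assumption)
spend_before = rng.gamma(shape=2.0, scale=50.0, size=n)
# Spend after the prediction date: only customers who stayed keep buying
spend_after = np.where(churn == 0, rng.gamma(shape=2.0, scale=50.0, size=n), 0.0)

demo = pd.DataFrame({
    "churn_30d": churn,
    "spend_as_of_date": spend_before,
    "total_spend": spend_before + spend_after,
})

# Lower spend should mean higher churn risk, so score with the negated value
auc_as_of = roc_auc_score(demo["churn_30d"], -demo["spend_as_of_date"])
auc_total = roc_auc_score(demo["churn_30d"], -demo["total_spend"])

print(f"AUC using spend as of the prediction date: {auc_as_of:.3f}")
print(f"AUC using all-time total_spend:            {auc_total:.3f}")
```

Here the as-of-date feature carries no signal by construction, yet the all-time total looks predictive purely because it contains post-prediction purchases.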

In [None]:
# Write your diagnosis:

diagnosis = """
YOUR DIAGNOSIS HERE:

Bug 1 - churn_risk_score:
- Type of leakage: _______________
- Why it's wrong: _______________

Bug 2 - customer_value:
- Type of leakage: _______________  
- Why it's wrong: _______________

"""
print(diagnosis)

### Step 3: Fix the feature engineering

In [None]:
# TODO: Remove the leaky features and retrain

# Fixed feature list - remove the two buggy features
# Uncomment and complete:

# feature_cols_fixed = [
#     'tenure_months',
#     'orders_total', 
#     'logins_last_30d',
#     'orders_per_month',
#     'login_frequency',
#     # 'churn_risk_score',  # REMOVED - target leakage
#     # 'customer_value',    # REMOVED - future leakage
#     'support_tickets',
#     'avg_order_value'
# ]
#
# X_fixed = df_features[feature_cols_fixed].fillna(0)
# X_train_fixed, X_test_fixed, y_train_fixed, y_test_fixed = train_test_split(
#     X_fixed, y, test_size=0.2, random_state=42, stratify=y
# )
#
# model_fixed = GradientBoostingClassifier(n_estimators=100, max_depth=5, random_state=42)
# model_fixed.fit(X_train_fixed, y_train_fixed)
#
# y_pred_proba_fixed = model_fixed.predict_proba(X_test_fixed)[:, 1]
# auc_fixed = roc_auc_score(y_test_fixed, y_pred_proba_fixed)
#
# print(f"Fixed AUC: {auc_fixed:.3f}")
# print(f"\nComparison:")
# print(f"  Buggy AUC:  {auc:.3f} (inflated by leakage)")
# print(f"  Fixed AUC:  {auc_fixed:.3f} (realistic)")
# print(f"  Difference: {auc - auc_fixed:.3f} points")

In [None]:
# ============================================
# SELF-CHECK: Did you fix both bugs?
# ============================================

# Uncomment after fixing:
#
# # Fixed AUC should be much lower (realistic)
# assert auc_fixed < 0.90, "Fixed AUC should be below 0.90 (realistic for churn)"
# assert auc_fixed < auc, "Fixed AUC should be lower than buggy AUC"
# 
# # Check that leaky features are removed
# assert 'churn_risk_score' not in feature_cols_fixed, "Remove churn_risk_score!"
# assert 'customer_value' not in feature_cols_fixed, "Remove customer_value!"
#
# print("✓ Both leaky features removed!")
# print(f"✓ AUC dropped from {auc:.3f} to {auc_fixed:.3f}")
# print("✓ The fixed model is realistic and will work in production.")

### Step 4: Write your postmortem

In [None]:
postmortem = """
## Postmortem: The Leaky Features

### What happened:
- (Your answer: What symptoms indicated a problem?)

### Root cause:
- Bug 1: (Type of leakage and which feature)
- Bug 2: (Type of leakage and which feature)

### How to prevent:
- (Your answer: What checks would catch this?)

"""

print(postmortem)

---

## ✅ Drill Complete!

**Key lessons:**

1. **Target leakage:** Never use the target variable (or anything derived from it) as a feature. The model "learns" to rely on information it won't have at prediction time, so the offline score will not carry over to production.

2. **Temporal leakage:** Features must be computed using ONLY data from before the prediction time. "Lifetime value" that includes future purchases is cheating.

3. **Red flags:**
   - AUC > 0.90 on a business problem (too good to be true)
   - One feature with >50% importance
   - Feature correlation >0.8 with target

4. **The Timeline Test:** For every feature, ask: "At the moment I make this prediction, would I have this exact value?"
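The three red flags in lesson 3 are easy to turn into an automated check. The helper below is one possible sketch: `leakage_red_flags` is a hypothetical function, the default thresholds simply mirror the numbers listed above, and the example inputs are made-up values rather than results from this drill.

```python
import pandas as pd


def leakage_red_flags(importances, correlations, test_auc,
                      max_importance=0.5, max_abs_corr=0.8, max_auc=0.90):
    """Return a list of warning strings, one per red flag that fires."""
    flags = []
    if test_auc > max_auc:
        flags.append(f"test AUC {test_auc:.2f} exceeds {max_auc:.2f}")
    for name, value in importances.items():
        if value > max_importance:
            flags.append(
                f"{name}: importance {value:.2f} exceeds {max_importance:.2f}"
            )
    for name, value in correlations.items():
        # Use the absolute value so strong negative correlations are caught too
        if abs(value) > max_abs_corr:
            flags.append(
                f"{name}: |correlation| {abs(value):.2f} exceeds {max_abs_corr:.2f}"
            )
    return flags


# Hypothetical numbers, for illustration only
importances = pd.Series(
    {"tenure_months": 0.10, "mystery_score": 0.75, "logins_last_30d": 0.15}
)
correlations = pd.Series(
    {"tenure_months": -0.12, "mystery_score": -0.85, "logins_last_30d": -0.20}
)

flags = leakage_red_flags(importances, correlations, test_auc=0.96)
for warning in flags:
    print("WARNING:", warning)
```

A clean result does not prove the absence of leakage; the thresholds only catch blatant cases, so the Timeline Test still has to be applied feature by feature.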

---

## Types of Leakage Found

| Feature | Leakage Type | Why It's Wrong |
|---------|-------------|----------------|
| `churn_risk_score` | Target leakage | Formula includes `churn_30d` (the target) directly |
| `customer_value` | Temporal leakage | `total_spend` includes future purchases |

---

## Bonus: How to fix customer_value properly

In [None]:
# If you wanted a "value" feature, compute it as of the prediction date:

print("WRONG:")
print("  customer_value = total_spend  # Includes all-time purchases")

print("\nRIGHT:")
print("  customer_value_as_of_date = sum(purchases WHERE date < prediction_date)")
print("  ")
print("  # In pandas:")
print("  # purchases_df[purchases_df['date'] < prediction_date].groupby('customer_id')['amount'].sum()")
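The pandas one-liner printed above can be made runnable on a toy purchase log. This is a sketch under assumptions: the `purchases` and `cutoffs` tables, their column names, and the `spend_as_of` helper are hypothetical, since the drill's dataset does not ship a transaction-level table.

```python
import pandas as pd


def spend_as_of(purchases, cutoffs):
    """Sum each customer's purchase amounts strictly before their cutoff date."""
    merged = purchases.merge(cutoffs, on="customer_id", how="inner")
    past = merged[merged["date"] < merged["prediction_date"]]
    totals = past.groupby("customer_id")["amount"].sum()
    # Customers with no purchases before the cutoff get 0.0 rather than NaN
    totals = totals.reindex(cutoffs["customer_id"], fill_value=0.0)
    return totals.rename("spend_as_of_date")


# Toy purchase log (made-up values)
purchases = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "date": pd.to_datetime([
        "2023-11-05", "2023-12-20", "2024-02-01",
        "2023-12-31", "2024-01-01", "2024-03-15",
    ]),
    "amount": [20.0, 30.0, 99.0, 10.0, 50.0, 40.0],
})

# One prediction date per customer; here everyone is scored on Jan 1st
cutoffs = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "prediction_date": pd.to_datetime(["2024-01-01"] * 3),
})

result = spend_as_of(purchases, cutoffs)
print(result)
```

The strict `<` comparison excludes purchases made on the prediction date itself, which matches the filter in the printed snippet. Whether same-day events should count depends on when during the day the prediction is actually made.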

In [None]:
# The general pattern for point-in-time features:

print("=== Point-in-Time Feature Pattern ===")
print("""
def compute_feature(customer_id, prediction_date, events_df):
    # Filter to BEFORE prediction date
    past_events = events_df[
        (events_df['customer_id'] == customer_id) &
        (events_df['date'] < prediction_date)  # CRITICAL
    ]
    
    # Compute aggregation on filtered data
    return past_events['amount'].sum()
""")