# Module 1: Is This Even an ML Problem?

**Goal:** Learn to frame ML problems correctly and avoid the most common mistake (data leakage)

**Time:** ~20 minutes

**What you'll do:**
1. Explore the StreamCart dataset
2. Practice the 7-line framing template
3. Identify data leakage in features
4. Define a proper churn label

---

## Setup

Run this cell to load the data. It needs only pandas and numpy; Part 4 additionally uses scikit-learn.

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

DATA_URL = 'https://raw.githubusercontent.com/189investmentai/ml-foundations-interactive/main/shared/data/streamcart_customers.csv'

def generate_streamcart_data(n=5000):
    """Generate synthetic StreamCart data if remote URL is unavailable."""
    np.random.seed(42)
    
    signup_dates = [datetime(2023, 1, 1) + timedelta(days=np.random.randint(0, 500)) for _ in range(n)]
    snapshot_date = datetime(2024, 6, 1)
    tenure = [(snapshot_date - d).days // 30 for d in signup_dates]
    
    churn_prob = np.clip(0.1 + 0.3 * (np.array(tenure) < 3) + 0.2 * np.random.random(n), 0, 0.6)
    churned = np.random.random(n) < churn_prob
    
    cancel_dates = [snapshot_date + timedelta(days=np.random.randint(1, 45)) if c else None for c in churned]
    
    df = pd.DataFrame({
        'customer_id': range(1, n+1),
        'signup_date': signup_dates,
        'snapshot_date': [snapshot_date] * n,
        'subscription_status': ['canceled' if c else 'active' for c in churned],
        'churn_date': cancel_dates,
        'cancel_reason': [np.random.choice(['too_expensive', 'not_using', 'competitor', 'other']) if c else None for c in churned],
        'tenure_months': tenure,
        'logins_last_7d': np.random.poisson(3, n),
        'logins_last_30d': np.random.poisson(12, n),
        'support_tickets_last_30d': np.random.poisson(0.5, n),
        'items_skipped_last_3_boxes': np.random.poisson(1, n),
        'nps_score': [np.random.choice([None] + list(range(1, 11)), p=[0.3] + [0.07]*10) for _ in range(n)],
        'plan_type': np.random.choice(['basic', 'premium', 'family'], n),
        'avg_order_value': np.random.exponential(45, n),
    })
    
    # Create 30-day churn label
    df['churn_30d'] = (
        (df['subscription_status'] == 'canceled') & 
        (pd.to_datetime(df['churn_date']) <= snapshot_date + timedelta(days=30))
    ).astype(int)
    
    return df

# Try to load from remote, fall back to synthetic data
try:
    df = pd.read_csv(DATA_URL)
    print(f"Loaded {len(df):,} customers from remote")
except Exception as e:
    print(f"Remote data unavailable ({type(e).__name__}). Generating synthetic data...")
    df = generate_streamcart_data(5000)
    print(f"Generated {len(df):,} synthetic customers")

print(f"Columns: {len(df.columns)}")
df.head()

## Part 1: Explore the Data

Before building any model, understand what you're working with.

In [None]:
# What's in this dataset?
print("=== Column Types ===")
print(df.dtypes)
print("\n=== Basic Stats ===")
df.describe()

In [None]:
# The target variable: churn_30d
print("=== Churn Distribution ===")
print(df['churn_30d'].value_counts())
print(f"\nChurn rate: {df['churn_30d'].mean():.1%}")

### Quick Check

**Question:** About what percentage of customers churned in the next 30 days?

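This is your **baseline**. With a churn rate of about 11%, a model that predicts "no churn" for everyone would be right ~89% of the time (the exact figure depends on the churn rate printed above). But it would flag zero churners, so it is useless for the business.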
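
To see why accuracy alone misleads here, the sketch below scores an always-"no churn" predictor on a toy label vector. The 11% churn rate and the variable names are illustrative assumptions, not values taken from the StreamCart data.

```python
import numpy as np

# Toy labels: 100 customers, 11 of whom churn (assumed rate, for illustration only)
y_true = np.array([1] * 11 + [0] * 89)

# A "model" that predicts no churn for everyone
y_pred = np.zeros_like(y_true)

# Accuracy: share of customers whose prediction matches the label
accuracy = (y_pred == y_true).mean()
# Recall: share of actual churners the model flags
recall = y_pred[y_true == 1].mean()

print(f"Accuracy: {accuracy:.0%}")          # 89%
print(f"Recall on churners: {recall:.0%}")  # 0%
```

High accuracy and zero recall can coexist whenever the positive class is rare, which is why the later parts of this module score the model on the customers it ranks highest instead.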

---

## Part 2: The 7-Line Framing Template

Before writing any model code, fill out this template. If you can't, you're not ready to build.

| Line | Question | Your Answer |
|------|----------|-------------|
| 1. Problem | What business outcome are we trying to improve? | |
| 2. Action | What will we DO with the prediction? | |
| 3. Prediction | What exactly does the model output? | |
| 4. Label | How do we define this in historical data? | |
| 5. Features | What info is available at prediction time? | |
| 6. Metric | How do we measure if the model helps? | |
| 7. Constraints | What limits exist in production? | |

In [None]:
# TODO: Fill out the framing for StreamCart's churn problem
#
# Context: StreamCart's retention team can call 500 customers per week.
# They want to prioritize customers most likely to cancel.

problem_framing = {
    "problem": "????",      # What business metric are we improving?
    "action": "????",       # What will the retention team DO?
    "prediction": "????",   # What does the model output? (probability? score? yes/no?)
    "label": "????",        # How is churn defined in the data?
    "features": "????",     # List 3-5 features that would be available
    "metric": "????",       # How do we know if the model is good? (hint: 500/week capacity)
    "constraints": "????"   # Any production limits?
}

# Uncomment to see your answers:
# for k, v in problem_framing.items():
#     print(f"{k}: {v}")

In [None]:
# ============================================
# SELF-CHECK: Run this to validate your framing
# ============================================

def check_framing(framing):
    errors = []
    
    # Check problem
    if 'churn' not in framing['problem'].lower() and 'cancel' not in framing['problem'].lower() and 'retain' not in framing['problem'].lower():
        errors.append("Problem should mention churn, cancellation, or retention")
    
    # Check action
    if framing['action'] == "????" or len(framing['action']) < 10:
        errors.append("Action should describe what the team will DO with predictions")
    
    # Check metric mentions capacity
    if '500' not in framing['metric'] and 'precision' not in framing['metric'].lower() and 'top' not in framing['metric'].lower():
        errors.append("Metric should account for the 500/week capacity (hint: precision@500)")
    
    # Check features don't include leakage
    leaky_terms = ['cancel_reason', 'churn_date', 'cancel_date']
    for term in leaky_terms:
        if term in framing['features'].lower():
            errors.append(f"'{term}' is leakage! It only exists AFTER someone churns.")
    
    if errors:
        print("❌ Issues found:")
        for e in errors:
            print(f"   - {e}")
    else:
        print("✓ Framing looks good!")
    
    return len(errors) == 0

check_framing(problem_framing)

---

## Part 3: Spotting Data Leakage

This is the **#1 mistake** in applied ML. Leakage means using information that wouldn't be available at prediction time.
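
One cheap way to catch this kind of leakage is to look for columns that are filled in only for customers who churned. The sketch below is a heuristic run on a made-up five-row frame; the helper name `columns_filled_only_for_positives` is hypothetical, and it will not catch every leak (a column can leak without having any missing values).

```python
import pandas as pd

def columns_filled_only_for_positives(frame, label_col):
    """Return columns that have a value for at least one label == 1 row
    and are null for every label == 0 row."""
    positives = frame[frame[label_col] == 1]
    negatives = frame[frame[label_col] == 0]
    suspects = []
    for col in frame.columns:
        if col == label_col:
            continue
        filled_for_pos = positives[col].notna().any()
        empty_for_neg = negatives[col].isna().all()
        if filled_for_pos and empty_for_neg:
            suspects.append(col)
    return suspects

# Invented rows for illustration only
toy = pd.DataFrame({
    "logins_last_30d": [12, 3, 9, 1, 15],
    "cancel_reason": [None, "too_expensive", None, "not_using", None],
    "churn_30d": [0, 1, 0, 1, 0],
})

suspects = columns_filled_only_for_positives(toy, "churn_30d")
print(suspects)  # ['cancel_reason']
```

A column flagged this way is not proof of leakage, but it is a strong prompt to ask when that value gets recorded.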

In [None]:
# Look at all columns
print("=== All Columns ===")
for col in df.columns:
    print(f"  {col}")

In [None]:
# TODO: Which columns would cause DATA LEAKAGE if used as features?
#
# Remember: At prediction time (snapshot_date), we're predicting if they'll
# churn in the NEXT 30 days. Any info that only exists AFTER they churn is leakage.

leaky_columns = [
    # "???",  # List columns that would leak
    # "???",
]

safe_columns = [
    # "???",  # List columns that are safe to use
    # "???",
]

In [None]:
# ============================================
# SELF-CHECK: Verify your leakage detection
# ============================================

# subscription_status reflects the outcome itself, so it leaks too
KNOWN_LEAKY = {'churn_30d', 'churn_date', 'cancel_reason', 'subscription_status'}
KNOWN_SAFE = {'tenure_months', 'logins_last_7d', 'logins_last_30d', 
              'support_tickets_last_30d', 'items_skipped_last_3_boxes',
              'nps_score', 'plan_type', 'avg_order_value'}

your_leaky = set(leaky_columns)
your_safe = set(safe_columns)

# Check leaky
if KNOWN_LEAKY.issubset(your_leaky):
    print("✓ Correctly identified leaky columns!")
else:
    missing = KNOWN_LEAKY - your_leaky
    print(f"❌ Missed leaky columns: {missing}")
    print("   Hint: These only exist AFTER someone churns")

# Check safe
if len(your_safe) >= 5:
    print(f"✓ Identified {len(your_safe)} safe features")
else:
    print(f"⚠️  Only {len(your_safe)} safe features. Look for behavioral features.")

---

## Part 4: Why Leakage Is Dangerous

Let's see what happens if you accidentally use a leaky feature.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Split data
X = df[['tenure_months', 'logins_last_30d', 'support_tickets_last_30d']].fillna(0)
y = df['churn_30d']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train model (no leakage)
model_clean = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42)
model_clean.fit(X_train, y_train)

clean_auc = roc_auc_score(y_test, model_clean.predict_proba(X_test)[:, 1])
print(f"AUC without leakage: {clean_auc:.3f}")
print("This is realistic performance.")

In [None]:
# Now let's "accidentally" include a leaky feature.
# has_cancel_reason flags whether cancel_reason is filled in,
# which only happens for customers who canceled.

df['has_cancel_reason'] = df['cancel_reason'].notna().astype(int)

X_leaky = df[['tenure_months', 'logins_last_30d', 'support_tickets_last_30d', 'has_cancel_reason']].fillna(0)

X_train_l, X_test_l, y_train_l, y_test_l = train_test_split(X_leaky, y, test_size=0.2, random_state=42)

model_leaky = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42)
model_leaky.fit(X_train_l, y_train_l)

leaky_auc = roc_auc_score(y_test_l, model_leaky.predict_proba(X_test_l)[:, 1])
print(f"AUC with leakage: {leaky_auc:.3f}")
print("\n⚠️  This is suspiciously perfect! The model is 'cheating'.")
print("   In production, has_cancel_reason would always be 0 (they haven't canceled yet).")

In [None]:
# === Business Metric: Precision@500 ===
# The retention team can call 500 customers/week.
# How many actual churners are in the model's top 500 predictions?

def precision_at_k(y_true, y_scores, k=500):
    """Calculate precision at k: what fraction of top-k predictions are true positives."""
    top_k_idx = np.argsort(y_scores)[-k:][::-1]
    return y_true.iloc[top_k_idx].mean()

# Calculate for the clean model (no leakage)
y_scores_clean = model_clean.predict_proba(X_test)[:, 1]
p500_clean = precision_at_k(y_test, y_scores_clean, k=min(500, len(y_test)))
baseline = y_test.mean()

print("=== Business Metric: Precision@500 ===")
print(f"Precision@500: {p500_clean:.1%}")
print(f"Baseline (random selection): {baseline:.1%}")
print(f"Lift: {p500_clean / baseline:.1f}x better than random")
print(f"\nIn 500 calls:")
print(f"  - With model: ~{int(500 * p500_clean)} actual churners")
print(f"  - Random: ~{int(500 * baseline)} actual churners")
print(f"  - Extra saves: ~{int(500 * (p500_clean - baseline))} more churners reached")
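
To make the precision@k arithmetic concrete, here is a hand-checkable toy case with ten customers and a call capacity of 3. The scores and outcomes are invented for illustration.

```python
import numpy as np

# Invented scores and outcomes for ten customers (1 = churned)
scores = np.array([0.91, 0.15, 0.78, 0.40, 0.66, 0.05, 0.83, 0.30, 0.52, 0.22])
churned = np.array([1, 0, 0, 0, 1, 0, 1, 0, 0, 0])

k = 3
# Indices of the k highest-scoring customers
top_k = np.argsort(scores)[-k:]
precision_at_3 = churned[top_k].mean()
base_rate = churned.mean()

print(f"Top-{k} customers: {sorted(top_k.tolist())}")  # [0, 2, 6]
print(f"Precision@{k}: {precision_at_3:.2f}")          # 0.67
print(f"Base rate: {base_rate:.2f}")                   # 0.30
print(f"Lift: {precision_at_3 / base_rate:.1f}x")      # 2.2x
```

Two of the three customers called actually churned, against a base rate of three in ten, so ranking by score roughly doubles the hit rate in this toy case.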

### The Lesson

If your AUC is above 0.95, **be suspicious**. Real-world churn models typically achieve 0.70-0.85.

A model with leakage will:
- Look amazing in development
- Fail completely in production
- Waste months of work
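
One way to act on that suspicion is to score each feature on its own: a single column that separates churners almost perfectly deserves a closer look. The sketch below uses synthetic data and a hypothetical helper, `flag_suspicious_features`; the 0.95 cutoff mirrors the rule of thumb above and is an assumption, not a hard threshold. It also assumes numeric columns with no missing values.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def flag_suspicious_features(X, y, threshold=0.95):
    """Return {column: AUC} for columns whose single-feature AUC exceeds threshold."""
    flagged = {}
    for col in X.columns:
        auc = roc_auc_score(y, X[col])
        # A feature can separate the classes in either direction, so fold around 0.5
        auc = max(auc, 1 - auc)
        if auc > threshold:
            flagged[col] = float(round(auc, 3))
    return flagged

rng = np.random.default_rng(0)
n = 1000
label = pd.Series(rng.integers(0, 2, n))  # synthetic 0/1 labels
features = pd.DataFrame({
    "logins_last_30d": rng.poisson(12, n),   # unrelated to the label by construction
    "has_cancel_reason": label.to_numpy(),   # a copy of the label: an obvious leak
})

flagged = flag_suspicious_features(features, label)
print(flagged)  # {'has_cancel_reason': 1.0}
```

A clean scan does not prove the absence of leakage, since several weaker leaky columns can combine, but a flagged column is worth tracing back to when its value is recorded.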

---

## Part 5: Defining the Label Properly

The label `churn_30d` is already defined. But let's understand what goes into defining it correctly.

In [None]:
# What's in our label?
print("snapshot_date:", df['snapshot_date'].iloc[0])
print("\nFor churners:")
churners = df[df['churn_30d'] == 1][['customer_id', 'snapshot_date', 'churn_date']].head(5)
print(churners)

print("\nFor non-churners:")
stayers = df[df['churn_30d'] == 0][['customer_id', 'snapshot_date', 'churn_date']].head(5)
print(stayers)

In [None]:
# TODO: If you had to define churn_30d from raw data, how would you do it?
#
# Complete this function:

def define_churn_label(df, snapshot_date, window_days=30):
    """
    Define churn label: Did the customer cancel within window_days after snapshot_date?
    
    Parameters:
    -----------
    df : DataFrame with 'churn_date' column (or None if still active)
    snapshot_date : The point-in-time we're predicting FROM (string or datetime)
    window_days : How many days to look forward
    
    Returns:
    --------
    Series with 1 (churned) or 0 (stayed)
    """
    # TODO: Implement this
    # Hint: churn = 1 if churn_date is between snapshot_date and snapshot_date + window_days
    # Hint: Use pd.to_datetime() to convert dates, timedelta for date math
    
    pass  # Replace with your implementation


# ============================================
# SELF-CHECK: Test your function
# ============================================

# Run this after implementing define_churn_label above
if 'define_churn_label' in dir() and callable(define_churn_label):
    try:
        # Get snapshot date from data (or use default)
        if 'snapshot_date' in df.columns:
            test_snapshot = pd.to_datetime(df['snapshot_date'].iloc[0])
        else:
            test_snapshot = pd.Timestamp('2024-06-01')
        
        my_labels = define_churn_label(df, test_snapshot, 30)
        
        if my_labels is None:
            print("⚠️  Your function returned None. Make sure to return a Series.")
        else:
            match_rate = (my_labels == df['churn_30d']).mean()
            print(f"Your labels match ground truth: {match_rate:.1%}")
            
            if match_rate > 0.95:
                print("✓ Excellent! Your label logic is correct.")
            elif match_rate > 0.80:
                print("⚠️  Close! Check your date comparison logic.")
                print("   Hint: Make sure you're comparing dates, not strings.")
            else:
                print("❌ Labels don't match. Review the logic.")
                print("   Expected: churn = 1 if churn_date is between snapshot and snapshot + 30 days")
    except Exception as e:
        print(f"❌ Error testing function: {e}")
        print("   Make sure your function handles date conversions properly.")
else:
    print("Implement define_churn_label above, then run this cell to test.")

---

## 📝 Final Exercise: Explain It

Your VP asks: "Why can't we just predict churn using their subscription status?"

Write a 3-4 sentence response explaining why that's circular reasoning.

### Rubric (self-assess against this)

A strong answer should:
1. **Explain the circularity** - subscription_status IS the outcome, not a predictor
2. **Clarify the timing problem** - you want to predict BEFORE they cancel, not after
3. **Give a concrete example** - what prediction time looks like vs. using the outcome

### Example Response (reveal AFTER writing yours)

<details>
<summary>Click to reveal example</summary>

"Subscription status tells us who already canceled, but we need to predict who WILL cancel before it happens, so we can intervene. Using status would be like predicting yesterday's weather using today's newspaper. Instead, we use behavioral signals (login frequency, support tickets, skipped items) measured BEFORE the cancellation decision, so our predictions are actionable."

</details>

See the full grading rubric in `rubrics/explain_it_rubric.md`

In [None]:
# Write your response here (as a comment or string):

vp_response = """
YOUR RESPONSE HERE
"""

print(vp_response)

---

## ✅ Module 1 Complete!

**What you learned:**
- The 7-line framing template
- How to identify data leakage
- Why leakage creates fake performance
- How to define labels properly

**Key takeaway:** The most important ML skill isn't coding. It's asking "Would I have this data at prediction time?"

**Next:** [Module 2: Your First Prediction Model →](./module_02_logistic_regression.ipynb)