# Capstone Project: StreamCart Retention System

**Scenario:** StreamCart's retention team can call 500 customers per week. Your job is to identify WHICH 500 customers to call.

**Your Deliverables:**
1. Problem framing document
2. Feature engineering pipeline
3. Trained model with proper evaluation
4. Business recommendation

**Runtime:** ~15 minutes on free Colab

---

In [None]:
# Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, precision_score, recall_score
import warnings
warnings.filterwarnings('ignore')

# Install lightgbm if needed
try:
    import lightgbm as lgb
except ImportError:
    !pip install lightgbm -q
    import lightgbm as lgb

print("Setup complete!")

In [None]:
# Load data
DATA_URL = "https://raw.githubusercontent.com/189investmentai/ml-foundations-interactive/main/"

customers = pd.read_csv(DATA_URL + "streamcart_customers.csv")
events = pd.read_csv(DATA_URL + "streamcart_events.csv")

print(f"Customers: {len(customers):,} rows")
print(f"Events: {len(events):,} rows")
print(f"\nChurn rate: {customers['churn_30d'].mean():.1%}")

---

# Part 1: Problem Framing (20 points)

Before touching any models, define the problem clearly.

## 1.1 The Business Context

Read the scenario carefully:

> The retention team can make **500 outbound calls per week** to offer personalized discounts. Currently they call at random. Historical data shows that when the team calls a customer who was going to churn, they save that customer **30% of the time**. Each saved customer is worth **$200** in future revenue. Each call costs **$15** in labor.

In [None]:
# TODO: Complete the 7-line framing template

framing = """
=== ML PROBLEM FRAMING ===

1. Business Goal: [What outcome does StreamCart want?]
   YOUR ANSWER: 

2. ML Task Type: [Classification / Regression / Ranking / Clustering]
   YOUR ANSWER: 

3. Target (y): [What exactly are we predicting? Be specific about time window]
   YOUR ANSWER: 

4. Prediction Point: [When do we make predictions?]
   YOUR ANSWER: 

5. Features (X): [What data is available BEFORE prediction point?]
   YOUR ANSWER: 

6. Success Metric: [How do we measure model quality given the 500-call constraint?]
   YOUR ANSWER: 

7. Business Action: [What happens with predictions?]
   YOUR ANSWER: 

"""
print(framing)

## 1.2 Baseline Calculation

Before building any model, calculate what random selection would achieve.
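
It also helps to know the break-even hit rate: the share of called customers who must be real churners for the calls to pay for themselves. The sketch below uses only the figures from the scenario in 1.1 (30% save rate, $200 per save, $15 per call); the variable `break_even_precision` is an illustrative name, not part of the assignment.

```python
# Scenario figures from section 1.1
save_rate = 0.30       # share of called churners who are saved
value_per_save = 200   # dollars of future revenue per saved customer
cost_per_call = 15     # dollars of labor per call

# Expected value of calling a customer who churns with probability p:
#   p * save_rate * value_per_save - cost_per_call
# Setting this to zero and solving for p gives the break-even hit rate.
break_even_precision = cost_per_call / (save_rate * value_per_save)

print(f"Break-even precision: {break_even_precision:.0%}")
```

Any call list whose churner share is below this rate loses money on average, which is a useful reference point when you compare the random baseline with your models.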

In [None]:
# TODO: Calculate the baseline (random selection)

base_churn_rate = customers['churn_30d'].mean()
calls_per_week = 500
save_rate = 0.30  # 30% of called churners are saved
value_per_save = 200  # $200 per saved customer
cost_per_call = 15  # $15 per call

# Random selection baseline
# TODO: Calculate expected churners reached with random 500 calls
expected_churners_random = None  # YOUR CODE HERE

# TODO: Calculate expected saves (churners reached * save_rate)
expected_saves_random = None  # YOUR CODE HERE

# TODO: Calculate net value (saves * value - calls * cost)
net_value_random = None  # YOUR CODE HERE

print("=== RANDOM SELECTION BASELINE ===")
print(f"Churn rate: {base_churn_rate:.1%}")
print(f"Expected churners in 500 random calls: {expected_churners_random:.0f}")
print(f"Expected saves: {expected_saves_random:.0f}")
print(f"Net value: ${net_value_random:,.0f}")

In [None]:
# Self-check: Baseline calculation
assert expected_churners_random is not None, "Calculate expected_churners_random"
assert 30 < expected_churners_random < 100, f"Check your calculation: {expected_churners_random}"
print("✓ Part 1.2 baseline calculation complete")

---

# Part 2: Feature Engineering (20 points)

Create meaningful features from raw data. Remember: **no leakage!**
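
One common way to keep event-based features leakage-free is to fix a prediction date and aggregate only events that happened before it. The sketch below uses a toy event log; the column names (`customer_id`, `event_time`, `event_type`) and the cutoff date are assumptions for illustration and may not match the actual schema of `streamcart_events.csv`.

```python
import pandas as pd

# Toy event log; these column names are assumptions, not the real schema.
toy_events = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "event_time": pd.to_datetime([
        "2024-01-05", "2024-01-20", "2024-02-10",
        "2024-01-15", "2024-02-03",
    ]),
    "event_type": ["login", "order", "login", "login", "cancel"],
})

# Hypothetical prediction point: features may only use events before this date.
prediction_date = pd.Timestamp("2024-02-01")

# Drop everything at or after the prediction point, then count per customer.
past_events = toy_events[toy_events["event_time"] < prediction_date]
events_before_cutoff = (
    past_events.groupby("customer_id").size().rename("events_before_cutoff")
)

print(events_before_cutoff)
```

Note that the `cancel` event for customer 2 falls after the cutoff, so it never reaches the feature table.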

In [None]:
# Explore the data first
print("=== CUSTOMER DATA ===")
print(customers.columns.tolist())
print("\nSample:")
customers.head()

In [None]:
# Check for potential leakage columns
print("=== LEAKAGE CHECK ===")
print("\nColumns to AVOID as features (contain future info):")
leaky_columns = ['churn_date', 'cancel_reason', 'churn_30d']  # churn_30d is the target: use it only as y
for col in leaky_columns:
    if col in customers.columns:
        print(f"  - {col}")

## 2.1 Create Derived Features

Create at least 3 meaningful features. Ideas:
- **Ratio features:** orders per month, logins per month
- **Change features:** recent activity vs historical
- **Interaction features:** combinations that might signal risk
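
The three feature types above can be sketched on a toy table. The column `logins_prev_30d` is hypothetical (the real dataset may not have it), and the feature names `login_trend` and `tickets_while_inactive` are only examples of what you might build.

```python
import pandas as pd

toy = pd.DataFrame({
    "tenure_months": [1, 12, 24],
    "orders_last_30d": [2, 3, 0],
    "logins_last_30d": [10, 4, 0],
    "logins_prev_30d": [8, 8, 6],          # hypothetical column
    "support_tickets_last_30d": [0, 2, 3],
})

# Ratio feature: +1 in the denominator avoids division by zero for new accounts.
toy["orders_per_month"] = toy["orders_last_30d"] / (toy["tenure_months"] + 1)

# Change feature: recent logins relative to the previous period
# (values below 1 mean activity is dropping).
toy["login_trend"] = (toy["logins_last_30d"] + 1) / (toy["logins_prev_30d"] + 1)

# Interaction feature: support tickets raised by customers with no recent logins.
toy["tickets_while_inactive"] = (
    toy["support_tickets_last_30d"] * (toy["logins_last_30d"] == 0)
)

print(toy[["orders_per_month", "login_trend", "tickets_while_inactive"]])
```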

In [None]:
# TODO: Create your feature engineering pipeline

df = customers.copy()

# Feature 1: Orders per month (ratio feature)
# TODO: Create orders_per_month = orders_last_30d / (tenure_months + 1)
df['orders_per_month'] = None  # YOUR CODE HERE

# Feature 2: Support intensity (ratio feature)
# TODO: Create support_intensity = support_tickets_last_30d / (tenure_months + 1)
df['support_intensity'] = None  # YOUR CODE HERE

# Feature 3: YOUR CHOICE - create at least one more meaningful feature
# Ideas: engagement_score, login_trend, nps_risk, etc.
# TODO: Create your feature
df['your_feature'] = None  # YOUR CODE HERE - rename this column!

print("Engineered features created:")
print(df[['orders_per_month', 'support_intensity', 'your_feature']].describe())

In [None]:
# Self-check: Feature engineering
assert df['orders_per_month'].notna().all(), "orders_per_month has NaN values"
assert df['support_intensity'].notna().all(), "support_intensity has NaN values"
print("✓ Part 2.1 feature engineering complete")

## 2.2 Leakage Audit

For each feature, confirm it would be available at prediction time.
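
A name-based check can back up the manual audit. The sketch below compares a candidate feature list against the leaky columns named earlier; `candidate_features` is a made-up list with one deliberately leaky entry.

```python
# Columns flagged as leaky earlier in the notebook
leaky_columns = {"churn_date", "cancel_reason", "churn_30d"}

# Example candidate list; the last entry is deliberately leaky for illustration.
candidate_features = ["tenure_months", "logins_last_30d", "cancel_reason"]

leaked = sorted(leaky_columns.intersection(candidate_features))
safe_features = [f for f in candidate_features if f not in leaky_columns]

print("Leaky features found:", leaked)
print("Safe features:", safe_features)
```

This only catches columns you already know are leaky. A derived feature built from future data under a new name will pass the check, so the written audit is still needed.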

In [None]:
# TODO: Complete the leakage audit

leakage_audit = """
=== FEATURE LEAKAGE AUDIT ===

For each feature, answer: "Would I have this at prediction time?"

| Feature | Source | Available at prediction? | Safe? |
|---------|--------|-------------------------|-------|
| tenure_months | Account age | Yes - historical | ✓ |
| logins_last_30d | Activity logs | Yes - historical | ✓ |
| orders_per_month | Derived from orders + tenure | Yes | ✓ |
| support_intensity | Derived from tickets + tenure | Yes | ✓ |
| your_feature | TODO: document source | TODO | TODO |

Confirm: None of my features use churn_date, cancel_reason, or future data.
"""
print(leakage_audit)

## 2.3 Prepare Final Feature Set

In [None]:
# TODO: Define your final feature list

# Base features (from raw data)
base_features = [
    'tenure_months',
    'logins_last_30d',
    'orders_last_30d',
    'support_tickets_last_30d',
    'nps_score'
]

# Engineered features (your creations)
engineered_features = [
    'orders_per_month',
    'support_intensity',
    # TODO: Add your custom feature name here
]

all_features = base_features + engineered_features

# Prepare X and y
X = df[all_features].fillna(0)
y = df['churn_30d']

print(f"Features: {len(all_features)}")
print(f"Samples: {len(X):,}")
print(f"Target distribution: {y.mean():.1%} churn")

---

# Part 3: Model Training (20 points)

Train at least two models: a simple baseline and an advanced model.

## 3.1 Train/Validation/Test Split

In [None]:
# TODO: Create proper train/val/test splits

# First split: 80% train+val, 20% test (holdout)
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

# Second split: 75% train, 25% val of the remaining 80% (60% / 20% / 20% overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

print(f"Training set: {len(X_train):,} samples")
print(f"Validation set: {len(X_val):,} samples")
print(f"Test set: {len(X_test):,} samples (HOLDOUT - don't touch until final eval!)")
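
To see what the two-step split produces, here is a sketch on synthetic data (1,000 rows with a 10% positive rate, both assumed values). It repeats the same two `train_test_split` calls and prints the size and class balance of each part.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1,000 rows, 10% positive class (assumed values).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 100 + [0] * 900)

# Step 1: hold out 20% as the test set.
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=42, stratify=y_demo
)
# Step 2: split the remaining 80% into 75% train and 25% validation.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42, stratify=y_tmp
)

for name, part in [("train", y_tr), ("val", y_va), ("test", y_te)]:
    print(f"{name}: {len(part)} rows, {part.mean():.1%} positive")
```

Because both calls use `stratify`, each part keeps roughly the same churn rate as the full dataset, which matters when the positive class is rare.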

## 3.2 Model 1: Logistic Regression Baseline

In [None]:
# TODO: Train logistic regression baseline

# Scale features for logistic regression
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)

# Train model
model_lr = LogisticRegression(random_state=42, max_iter=1000)
model_lr.fit(X_train_scaled, y_train)

# Evaluate on validation
probs_lr_val = model_lr.predict_proba(X_val_scaled)[:, 1]
auc_lr = roc_auc_score(y_val, probs_lr_val)

print(f"Logistic Regression Validation AUC: {auc_lr:.3f}")
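
The cell above fits the scaler on the training data only and reuses it for validation. An alternative that makes this harder to get wrong is to wrap both steps in a scikit-learn pipeline. The sketch below runs on synthetic data, so the AUC it prints says nothing about the StreamCart dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: features on very different scales,
# with the label driven by the first feature plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 4)) * [1, 10, 100, 1000]
y_demo = (X_demo[:, 0] + rng.normal(scale=1.0, size=2000) > 1).astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

# fit() learns the scaling from X_tr only; predict_proba() reuses it on X_va.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_tr, y_tr)

auc_demo = roc_auc_score(y_va, pipe.predict_proba(X_va)[:, 1])
print(f"Validation AUC on synthetic data: {auc_demo:.3f}")
```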

In [None]:
# Interpret coefficients
coef_df = pd.DataFrame({
    'feature': all_features,
    'coefficient': model_lr.coef_[0]
}).sort_values('coefficient', key=abs, ascending=False)

print("\nFeature Importance (Logistic Regression):")
for _, row in coef_df.iterrows():
    direction = "↑ churn" if row['coefficient'] > 0 else "↓ churn"
    print(f"  {row['feature']}: {row['coefficient']:.3f} ({direction})")
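
Because the features were standardized, each coefficient is the change in log-odds of churn for a one-standard-deviation increase in that feature. Exponentiating turns it into an odds ratio, which is often easier to explain. The coefficients in this sketch are hypothetical, not results from this notebook.

```python
import math

# Hypothetical coefficients on standardized features, for illustration only.
example_coefs = {"support_intensity": 0.40, "logins_last_30d": -0.70}

# exp(coefficient) is the multiplicative change in churn odds
# for a one-standard-deviation increase in that feature.
odds_ratios = {name: math.exp(coef) for name, coef in example_coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: coefficient {example_coefs[name]:+.2f} -> odds ratio {ratio:.2f}")
```

An odds ratio above 1 means higher values of the feature go with higher churn odds; below 1 means the opposite.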

## 3.3 Model 2: LightGBM with Early Stopping

In [None]:
# TODO: Train LightGBM with proper regularization and early stopping

model_lgb = lgb.LGBMClassifier(
    n_estimators=500,
    num_leaves=31,
    max_depth=6,
    min_child_samples=20,
    learning_rate=0.05,
    random_state=42,
    verbose=-1
)

# Train with early stopping
model_lgb.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    callbacks=[lgb.early_stopping(stopping_rounds=50, verbose=False)]
)

# Evaluate on validation
probs_lgb_val = model_lgb.predict_proba(X_val)[:, 1]
auc_lgb = roc_auc_score(y_val, probs_lgb_val)

print(f"LightGBM Validation AUC: {auc_lgb:.3f}")
print(f"Trees used: {model_lgb.best_iteration_}")

In [None]:
# Check for overfitting
probs_lgb_train = model_lgb.predict_proba(X_train)[:, 1]
auc_lgb_train = roc_auc_score(y_train, probs_lgb_train)

print("\nOverfitting check:")
print(f"  Training AUC: {auc_lgb_train:.3f}")
print(f"  Validation AUC: {auc_lgb:.3f}")
print(f"  Gap: {auc_lgb_train - auc_lgb:.3f}")

if auc_lgb_train - auc_lgb > 0.05:
    print("  ⚠️ Possible overfitting - consider more regularization")
else:
    print("  ✓ Gap is acceptable")

In [None]:
# Self-check: Models trained
assert auc_lr > 0.55, f"Logistic regression AUC too low: {auc_lr}"
assert auc_lgb > 0.55, f"LightGBM AUC too low: {auc_lgb}"
print("✓ Part 3 models trained successfully")

---

# Part 4: Evaluation & Deployment (20 points)

Evaluate models on the **test set** using business-relevant metrics.

## 4.1 Final Evaluation on Test Set

In [None]:
# Get predictions on test set
X_test_scaled = scaler.transform(X_test)

probs_lr_test = model_lr.predict_proba(X_test_scaled)[:, 1]
probs_lgb_test = model_lgb.predict_proba(X_test)[:, 1]

# AUC comparison
auc_lr_test = roc_auc_score(y_test, probs_lr_test)
auc_lgb_test = roc_auc_score(y_test, probs_lgb_test)

print("=== TEST SET RESULTS ===")
print(f"Logistic Regression AUC: {auc_lr_test:.3f}")
print(f"LightGBM AUC: {auc_lgb_test:.3f}")

## 4.2 Precision@500 (The Business Metric)

In [None]:
# TODO: Calculate Precision@500 for both models

K = 500  # Retention team capacity

def precision_at_k(y_true, y_proba, k):
    """Calculate precision in top k predictions."""
    top_k_idx = np.argsort(y_proba)[::-1][:k]
    return y_true.iloc[top_k_idx].mean()

# The 500-call capacity applies to the full customer base, so call the same share of the test set
k_test = int(K * len(y_test) / len(y))

precision_lr = precision_at_k(y_test, probs_lr_test, k_test)
precision_lgb = precision_at_k(y_test, probs_lgb_test, k_test)
precision_random = y_test.mean()  # baseline

print(f"\n=== PRECISION@{k_test} (scaled from @500) ===")
print(f"Random baseline: {precision_random:.1%}")
print(f"Logistic Regression: {precision_lr:.1%}")
print(f"LightGBM: {precision_lgb:.1%}")
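
A small worked example can make Precision@K concrete. The labels and scores below are made up; the function is the same `precision_at_k` defined above, repeated so the block runs on its own.

```python
import numpy as np
import pandas as pd

def precision_at_k(y_true, y_proba, k):
    """Share of actual positives among the k highest-scored rows."""
    top_k_idx = np.argsort(y_proba)[::-1][:k]
    return y_true.iloc[top_k_idx].mean()

# Toy labels and scores (made up for illustration).
y_toy = pd.Series([1, 0, 1, 0, 0, 0, 1, 0, 0, 0])
scores_toy = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])

p_at_2 = precision_at_k(y_toy, scores_toy, k=2)   # top 2 rows contain 1 churner
p_at_3 = precision_at_k(y_toy, scores_toy, k=3)   # top 3 rows contain 2 churners
base_rate = y_toy.mean()                          # 3 churners in 10 rows

print(f"Precision@2: {p_at_2:.2f}, lift {p_at_2 / base_rate:.2f}x")
print(f"Precision@3: {p_at_3:.2f}, lift {p_at_3 / base_rate:.2f}x")
```

Precision@K depends on K: the same ranking can look better or worse at a different cutoff, so always report it at the capacity the business will actually use.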

In [None]:
# Calculate LIFT
lift_lr = precision_lr / precision_random
lift_lgb = precision_lgb / precision_random

print("\n=== LIFT ===")
print(f"Logistic Regression: {lift_lr:.1f}x better than random")
print(f"LightGBM: {lift_lgb:.1f}x better than random")

## 4.3 Business Impact Calculation

In [None]:
# TODO: Calculate business impact for your best model

# Use the better model (in practice, pick it on validation results and keep the test set for the final report)
best_precision = max(precision_lr, precision_lgb)
best_model = "LightGBM" if precision_lgb > precision_lr else "Logistic Regression"

# Scale back to the full 500 calls (assumes test-set precision holds across the full customer base)
expected_churners_model = 500 * best_precision
expected_saves_model = expected_churners_model * save_rate
net_value_model = (expected_saves_model * value_per_save) - (500 * cost_per_call)

# Improvement over random
value_improvement = net_value_model - net_value_random

print(f"\n=== BUSINESS IMPACT ({best_model}) ===")
print(f"Expected churners in top 500: {expected_churners_model:.0f}")
print(f"Expected saves: {expected_saves_model:.0f}")
print(f"Net value per week: ${net_value_model:,.0f}")
print(f"\nImprovement over random: ${value_improvement:,.0f}/week")
print(f"Annual impact: ${value_improvement * 52:,.0f}/year")
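
The dollar figures above depend heavily on the 30% save rate, which is a historical estimate. A quick sensitivity check shows how much the result moves if that estimate is off. In this sketch the precision of 0.40 and the alternative save rates are hypothetical; only the $200, $15, and 500-call figures come from the scenario.

```python
value_per_save = 200   # from the scenario
cost_per_call = 15     # from the scenario
calls = 500            # weekly capacity from the scenario

assumed_precision = 0.40  # hypothetical Precision@500, not a result from this notebook

def weekly_net_value(precision, save_rate):
    """Expected weekly net value of calling `calls` customers."""
    saves = calls * precision * save_rate
    return saves * value_per_save - calls * cost_per_call

# Net value under a pessimistic, expected, and optimistic save rate.
sensitivity = {rate: weekly_net_value(assumed_precision, rate)
               for rate in (0.20, 0.30, 0.40)}

for rate, value in sensitivity.items():
    print(f"save rate {rate:.0%}: net value ${value:,.0f}/week")
```

Reporting a range like this alongside the headline number gives stakeholders a sense of how robust the recommendation is.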

In [None]:
# Self-check: Evaluation complete
assert best_precision > precision_random, "Model should beat random baseline"
assert value_improvement > 0, "Model should create positive business value"
print("✓ Part 4 evaluation complete")

## 4.4 Model Comparison Summary

In [None]:
# Create comparison table
comparison = pd.DataFrame({
    'Model': ['Random', 'Logistic Regression', 'LightGBM'],
    'AUC': [0.50, auc_lr_test, auc_lgb_test],
    f'Precision@{k_test}': [precision_random, precision_lr, precision_lgb],
    'Lift': [1.0, lift_lr, lift_lgb],
    'Interpretable': ['N/A', 'Yes', 'Limited']
})

print("\n=== MODEL COMPARISON ===")
print(comparison.to_string(index=False))

---

# Part 5: Communication (20 points)

Write a clear recommendation for stakeholders.

## 5.1 PM Update

Write an update of about 200 words (150-250) for the VP of Customer Success. Include:
- What you built
- Key results (in business terms)
- Recommendation
- Next steps

In [None]:
# TODO: Write your PM update (replace the template below)

pm_update = """
=== WEEKLY UPDATE: CHURN PREDICTION MODEL ===

Hi [VP Name],

[WHAT WE BUILT]
TODO: Describe what you built in 1-2 sentences. Avoid jargon.

[KEY RESULTS]
TODO: Share the business impact in concrete numbers:
- How many more churners will the team reach?
- What's the expected value improvement?
- How does this compare to current random selection?

[RECOMMENDATION]
TODO: State your recommendation clearly:
- Which model should we use?
- What trade-offs should stakeholders be aware of?

[NEXT STEPS]
TODO: Propose 2-3 concrete next steps:
- Pilot test?
- Integration with call system?
- Monitoring plan?

Let me know if you have questions.

[Your Name]
"""

print(pm_update)
print(f"\nWord count: {len(pm_update.split())} words")

## 5.2 Final Self-Assessment

In [None]:
# Final checklist
checklist = """
=== CAPSTONE CHECKLIST ===

Part 1: Problem Framing (20 pts)
[ ] Completed 7-line framing template
[ ] Calculated random baseline

Part 2: Feature Engineering (20 pts)
[ ] Created 3+ meaningful features
[ ] Completed leakage audit
[ ] No future data used

Part 3: Model Training (20 pts)
[ ] Trained baseline model (logistic regression)
[ ] Trained advanced model (LightGBM)
[ ] Used early stopping
[ ] Checked for overfitting

Part 4: Evaluation (20 pts)
[ ] Reported Precision@500
[ ] Calculated Lift
[ ] Computed business impact ($)
[ ] Created comparison table

Part 5: Communication (20 pts)
[ ] Wrote PM update (150-250 words)
[ ] Included business metrics
[ ] Made clear recommendation
[ ] Proposed next steps
"""
print(checklist)

---

# Bonus: Customer Segmentation (Optional)

Segment customers to enable differentiated marketing campaigns.

In [None]:
# BONUS: Customer segmentation
from sklearn.cluster import KMeans

# Cluster on behavioral features
cluster_features = ['tenure_months', 'logins_last_30d', 'orders_last_30d', 'support_tickets_last_30d']
X_cluster = df[cluster_features].fillna(0)
X_cluster_scaled = StandardScaler().fit_transform(X_cluster)

# Find optimal K using elbow method
inertias = []
for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_cluster_scaled)
    inertias.append(km.inertia_)

plt.figure(figsize=(8, 4))
plt.plot(range(2, 8), inertias, 'bo-')
plt.xlabel('K')
plt.ylabel('Inertia')
plt.title('Elbow Method')
plt.show()
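
Elbow plots can be ambiguous, so a second criterion is often useful. One option is the silhouette score from scikit-learn, where higher values indicate tighter, better-separated clusters. The sketch below uses synthetic blobs (an assumption for illustration), not the StreamCart features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: three well-separated blobs in 2D (assumed, for illustration).
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [10, 0], [0, 10]])
X_demo = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centers])

# Score each candidate K by the mean silhouette of its clustering.
silhouettes = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    silhouettes[k] = silhouette_score(X_demo, labels)

best_k = max(silhouettes, key=silhouettes.get)
print({k: round(float(s), 3) for k, s in silhouettes.items()})
print("Highest silhouette at K =", best_k)
```

On real customer data the clusters are rarely this clean, so treat both the elbow and the silhouette as guides and check that the resulting segments are interpretable.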

In [None]:
# Apply clustering with chosen K
k_chosen = 4  # Based on elbow
kmeans = KMeans(n_clusters=k_chosen, random_state=42, n_init=10)
df['segment'] = kmeans.fit_predict(X_cluster_scaled)

# Profile segments
segment_profiles = df.groupby('segment').agg({
    'tenure_months': 'mean',
    'logins_last_30d': 'mean',
    'orders_last_30d': 'mean',
    'support_tickets_last_30d': 'mean',
    'churn_30d': 'mean',
    'customer_id': 'count'
}).round(2)

segment_profiles.columns = ['Tenure', 'Logins', 'Orders', 'Tickets', 'Churn Rate', 'Count']
print("\n=== SEGMENT PROFILES ===")
print(segment_profiles)

---

## Congratulations!

You've completed the capstone project. You now have:

1. **A clear problem framing** that ties ML to business action
2. **Meaningful features** without data leakage
3. **Two trained models** with proper validation
4. **Business-oriented evaluation** (Precision@K, Lift, $ impact)
5. **A stakeholder-ready recommendation**

This workflow, from framing through evaluation to a stakeholder recommendation, mirrors how many applied ML projects run in industry. Well done!