# Module 2: Your First Prediction Model

**Goal:** Train a logistic regression model and interpret what it learned

**Time:** ~20 minutes

**What you'll do:**
1. Train logistic regression on churn data
2. Interpret the coefficients
3. Calculate precision@500 (our business metric)
4. Compare different customer profiles

---

## Setup

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score, precision_score, recall_score
import matplotlib.pyplot as plt

# Load data (fall back to the local copy if the remote file is unavailable)
try:
    df = pd.read_csv('https://raw.githubusercontent.com/189investmentai/ml-foundations-interactive/main/streamcart_customers.csv')
except Exception:
    df = pd.read_csv('../data/streamcart_customers.csv')

print(f"Loaded {len(df):,} customers")
print(f"Churn rate: {df['churn_30d'].mean():.1%}")

## Part 1: Prepare the Data

Select features that are:
- Available at prediction time (no leakage!)
- Likely predictive of churn
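
To see why the first criterion matters, here is a minimal sketch on synthetic data (not the StreamCart file). It assumes a hypothetical `leaky_feature` that is only recorded after a customer has already churned; every name and number in the block is made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 2000

# An honest signal that is known before the outcome
honest_feature = rng.normal(size=n)
churn_prob = 1 / (1 + np.exp(-(honest_feature - 1)))
y_demo = (rng.random(n) < churn_prob).astype(int)

# Hypothetical leaky feature: effectively a noisy copy of the label
leaky_feature = y_demo + rng.normal(scale=0.1, size=n)

def demo_auc(X_demo):
    """Fit on a train split and return the AUC on the held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.2, random_state=0
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

auc_honest = demo_auc(honest_feature.reshape(-1, 1))
auc_leaky = demo_auc(np.column_stack([honest_feature, leaky_feature]))

print(f"AUC without leak: {auc_honest:.3f}")
print(f"AUC with leak:    {auc_leaky:.3f}")
```

A near-perfect score from the leaky version would not survive in production, because the feature does not exist at the moment the prediction is needed.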

In [None]:
# Features we'll use (these are safe: no leakage)
features = [
    'tenure_months',
    'logins_last_7d',
    'logins_last_30d',
    'support_tickets_last_30d',
    'items_skipped_last_3_boxes',
    'nps_score'
]

# Handle missing values (NPS has some nulls)
X = df[features].fillna(df[features].median())
y = df['churn_30d']

print("Features:")
for f in features:
    print(f"  {f}: range [{X[f].min():.0f}, {X[f].max():.0f}]")

In [None]:
# Split into train and test (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set: {len(X_train):,} customers")
print(f"Test set: {len(X_test):,} customers")
print(f"\nTrain churn rate: {y_train.mean():.1%}")
print(f"Test churn rate: {y_test.mean():.1%}")

---

## Part 2: Train Logistic Regression

Logistic regression learns a weight (coefficient) for each feature.
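
To make the weights concrete, here is a minimal sketch on synthetic data (not the StreamCart file). It shows that the predicted probability is the sigmoid of the intercept plus the weighted sum of the features. The names `X_demo`, `y_demo`, and `demo_model` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data (an assumption for illustration, not the StreamCart data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
noise = rng.normal(scale=0.5, size=200)
y_demo = (X_demo[:, 0] - 0.5 * X_demo[:, 1] + noise > 0).astype(int)

demo_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Manual prediction: sigmoid(intercept + sum(weight * feature))
z = demo_model.intercept_[0] + X_demo @ demo_model.coef_[0]
manual_proba = 1 / (1 + np.exp(-z))

# Probability of the positive class as reported by scikit-learn
sklearn_proba = demo_model.predict_proba(X_demo)[:, 1]

print("Manual and sklearn probabilities match:",
      np.allclose(manual_proba, sklearn_proba))
```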

In [None]:
# TODO: Create and train a LogisticRegression model
#
# Hint: model = LogisticRegression(max_iter=1000)
#       model.fit(X_train, y_train)

model = None  # Replace with your code

# Uncomment when ready:
# model = LogisticRegression(max_iter=1000)
# model.fit(X_train, y_train)

In [None]:
# ============================================
# SELF-CHECK: Is the model trained?
# ============================================

assert model is not None, "Create the model first!"
assert hasattr(model, 'coef_'), "Model not trained. Did you call .fit()?"
print("✓ Model trained successfully!")

---

## Part 3: Interpret the Coefficients

This is where logistic regression shines: we can understand **what** the model learned.
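
One common way to read a coefficient is as an odds ratio: `exp(coefficient)` is the factor by which the odds of churn are multiplied for a one-unit increase in that feature, holding the others fixed. The sketch below uses hypothetical coefficient values chosen for illustration; they are not results from this lab.

```python
import numpy as np

# Hypothetical coefficients, chosen for illustration (not results from this lab)
example_coefs = {
    "support_tickets_last_30d": 0.30,
    "logins_last_7d": -0.20,
}

# exp(coefficient) = multiplicative change in the odds of churn
# for a one-unit increase in that feature, holding the others fixed
odds_ratios = {name: float(np.exp(coef)) for name, coef in example_coefs.items()}

for name, ratio in odds_ratios.items():
    pct_change = (ratio - 1) * 100
    print(f"{name:30} odds ratio {ratio:.2f} ({pct_change:+.0f}% odds per unit)")
```

An odds ratio above 1 means the feature pushes toward churn; below 1 means it pushes away from churn.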

In [None]:
# See what the model learned
print("=== Coefficients ===")
print("(Positive = increases churn probability, Negative = decreases churn probability)\n")

for feature, coef in sorted(zip(features, model.coef_[0]), key=lambda x: -abs(x[1])):
    direction = "↑ churn" if coef > 0 else "↓ churn"
    print(f"{feature:30} {coef:+.4f}  ({direction})")

In [None]:
# Visualize coefficients
coef_df = pd.DataFrame({
    'feature': features,
    'coefficient': model.coef_[0]
}).sort_values('coefficient')

plt.figure(figsize=(10, 5))
colors = ['green' if c < 0 else 'red' for c in coef_df['coefficient']]
plt.barh(coef_df['feature'], coef_df['coefficient'], color=colors)
plt.xlabel('Coefficient (positive = more churn)')
plt.title('What Predicts Churn?')
plt.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
plt.tight_layout()
plt.show()

### Questions to Answer:

1. Which feature has the strongest positive effect on churn?
2. Which feature is most protective against churn?
3. Do these directions make business sense?
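
A caveat when answering questions 1 and 2: raw coefficients depend on each feature's units, so a feature measured on a large scale can look weak even when it matters. The sketch below, on synthetic data with made-up feature names, suggests one way to compare strengths: fit the same model after `StandardScaler`, so each coefficient is per standard deviation. This is an optional check under those assumptions, not the approach the lab itself uses.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 5000

# Two synthetic features with equal real influence but very different units
small_scale = rng.normal(scale=1.0, size=n)
large_scale = rng.normal(scale=100.0, size=n)
logit = 1.0 * small_scale + 0.01 * large_scale
y_demo = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
X_demo = np.column_stack([small_scale, large_scale])

# Same model, fit once on raw units and once on standardized features
raw_model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
scaled_model = make_pipeline(
    StandardScaler(), LogisticRegression(max_iter=1000)
).fit(X_demo, y_demo)

raw_coefs = raw_model.coef_[0]
scaled_coefs = scaled_model.named_steps["logisticregression"].coef_[0]

print("Raw coefficients:         ", np.round(raw_coefs, 3))
print("Standardized coefficients:", np.round(scaled_coefs, 3))
```

On raw units the second feature's coefficient looks tiny; after standardizing, the two features should appear roughly equally important.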

In [None]:
# TODO: Answer in comments
#
# 1. Strongest positive (increases churn): ???
# 2. Most protective (decreases churn): ???
# 3. Does this make sense? ???

---

## Part 4: Get Predictions

Now let's use the model to predict churn probabilities.

In [None]:
# TODO: Get predicted probabilities for the test set
#
# Hint: predict_proba returns [P(no churn), P(churn)]
#       You want the second column (index 1)

y_pred_proba = None  # Replace with your code

# Uncomment when ready:
# y_pred_proba = model.predict_proba(X_test)[:, 1]

In [None]:
# ============================================
# SELF-CHECK: Are predictions valid?
# ============================================

assert y_pred_proba is not None, "Generate predictions first!"
assert len(y_pred_proba) == len(y_test), "Wrong number of predictions"
assert y_pred_proba.min() >= 0 and y_pred_proba.max() <= 1, "Probabilities should be 0-1"

print(f"✓ Generated {len(y_pred_proba):,} predictions")
print(f"  Range: {y_pred_proba.min():.2%} to {y_pred_proba.max():.2%}")
print(f"  Mean: {y_pred_proba.mean():.2%}")

In [None]:
# Distribution of predictions
plt.figure(figsize=(10, 4))
plt.hist(y_pred_proba, bins=50, edgecolor='black', alpha=0.7)
plt.xlabel('Predicted Churn Probability')
plt.ylabel('Number of Customers')
plt.title('Distribution of Predictions')
plt.axvline(x=y_test.mean(), color='red', linestyle='--', label=f'Base rate ({y_test.mean():.1%})')
plt.legend()
plt.show()

---

## Part 5: Evaluate with AUC

AUC tells us: how well does the model **rank** customers by churn risk?
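
The ranking interpretation can be checked directly: AUC equals the fraction of (churner, non-churner) pairs in which the churner receives the higher score. Here is a small sketch with toy labels and scores invented for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores, made up for illustration
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.10, 0.35, 0.40, 0.20, 0.80, 0.30, 0.50, 0.90])

pos = scores[y_true == 1]  # scores given to churners
neg = scores[y_true == 0]  # scores given to non-churners

# Compare every churner with every non-churner; ties count as half
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(f"Pairwise ranking estimate: {pairwise_auc:.4f}")
print(f"roc_auc_score:             {roc_auc_score(y_true, scores):.4f}")
```

Both lines print 0.8125 here: 13 of the 16 pairs are ranked correctly.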

In [None]:
# Calculate AUC
auc = roc_auc_score(y_test, y_pred_proba)

print(f"AUC: {auc:.3f}")
print(f"\nInterpretation:")
print(f"  0.50 = random guessing")
print(f"  0.70 = decent")
print(f"  0.80 = good")
print(f"  0.90+ = either excellent or suspicious (check for leakage!)")

---

## Part 6: The Business Metric - Precision@500

AUC is nice, but the retention team can only call **500 customers per week**.

The real question: Of the top 500 predictions, how many are actual churners?
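
Before working on the real data, the idea can be tried on a tiny example. The helper below, `precision_at_top_k`, is a hypothetical name used only in this sketch, and the labels and scores are invented.

```python
import numpy as np

def precision_at_top_k(y_true, scores, k):
    """Fraction of actual positives among the k highest-scoring rows."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    top_k = np.argsort(scores)[::-1][:k]  # indices sorted high-to-low, first k
    return y_true[top_k].mean()

# Toy data: 1 = churned, 0 = stayed
toy_labels = [1, 0, 1, 0, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.2, 0.1, 0.6, 0.3, 0.4]

toy_precision = precision_at_top_k(toy_labels, toy_scores, k=3)
print(f"Precision@3: {toy_precision:.1%}")
```

The three highest scores (0.9, 0.8, 0.7) belong to two churners and one non-churner, so precision@3 is 2/3.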

In [None]:
# TODO: Calculate Precision@500
#
# Steps:
# 1. Sort customers by predicted probability (highest first)
# 2. Take the top 500
# 3. Calculate what fraction actually churned

k = 500

# Get indices of top K predictions (highest probability)
top_k_indices = np.argsort(y_pred_proba)[-k:]  # argsort gives low-to-high, so take last k

# What fraction of top K actually churned?
precision_at_k = None  # Replace with your code

# Uncomment when ready:
# precision_at_k = y_test.iloc[top_k_indices].mean()

In [None]:
# ============================================
# SELF-CHECK: Compare to baseline
# ============================================

assert precision_at_k is not None, "Calculate precision@500 first!"

baseline = y_test.mean()  # Random selection would get this rate
lift = precision_at_k / baseline

print(f"=== Precision@{k} ===")
print(f"Model: {precision_at_k:.1%}")
print(f"Random baseline: {baseline:.1%}")
print(f"Lift: {lift:.1f}x")
print(f"\n→ The model finds {lift:.1f}x more churners than random targeting!")

assert precision_at_k > baseline, "Model should beat random!"
print("\n✓ Model is better than random!")

In [None]:
# How does precision change at different K?
ks = [100, 200, 300, 500, 750, 1000]
precisions = []

for k_val in ks:  # separate name so the loop does not overwrite k = 500
    top_k = np.argsort(y_pred_proba)[-k_val:]
    prec = y_test.iloc[top_k].mean()
    precisions.append(prec)
    print(f"Precision@{k_val}: {prec:.1%} (lift: {prec/baseline:.1f}x)")

plt.figure(figsize=(10, 5))
plt.plot(ks, precisions, 'o-', label='Model')
plt.axhline(y=baseline, color='red', linestyle='--', label=f'Random ({baseline:.1%})')
plt.xlabel('K (number of customers targeted)')
plt.ylabel('Precision (% churners in top K)')
plt.title('Precision at Different K')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

---

## Part 7: Compare Customer Profiles

Let's see how the model scores different types of customers.

In [None]:
# Create hypothetical customer profiles
profiles = pd.DataFrame([
    # High risk: new, inactive, complaining
    {'tenure_months': 2, 'logins_last_7d': 0, 'logins_last_30d': 2,
     'support_tickets_last_30d': 3, 'items_skipped_last_3_boxes': 2, 'nps_score': 4},
    
    # Low risk: veteran, engaged, happy
    {'tenure_months': 24, 'logins_last_7d': 5, 'logins_last_30d': 20,
     'support_tickets_last_30d': 0, 'items_skipped_last_3_boxes': 0, 'nps_score': 9},
    
    # Medium: average customer
    {'tenure_months': 8, 'logins_last_7d': 2, 'logins_last_30d': 8,
     'support_tickets_last_30d': 1, 'items_skipped_last_3_boxes': 1, 'nps_score': 7},
])

profile_names = ['High Risk (new, inactive)', 'Low Risk (veteran, engaged)', 'Average Customer']

# Predict (select columns in training order so they line up with the model)
profile_probs = model.predict_proba(profiles[features])[:, 1]

print("=== Customer Profile Predictions ===")
for name, prob in zip(profile_names, profile_probs):
    print(f"{name}: {prob:.1%} churn probability")

In [None]:
# Find actual high and low risk customers in test set
high_risk_idx = np.argmax(y_pred_proba)
low_risk_idx = np.argmin(y_pred_proba)

print("=== Real High Risk Customer ===")
print(X_test.iloc[high_risk_idx])
print(f"Predicted churn: {y_pred_proba[high_risk_idx]:.1%}")
print(f"Actually churned: {'Yes' if y_test.iloc[high_risk_idx] == 1 else 'No'}")

print("\n=== Real Low Risk Customer ===")
print(X_test.iloc[low_risk_idx])
print(f"Predicted churn: {y_pred_proba[low_risk_idx]:.1%}")
print(f"Actually churned: {'Yes' if y_test.iloc[low_risk_idx] == 1 else 'No'}")

---

## 📝 Final Exercise: Explain It

Your PM asks: "Why are we using such a simple model? Shouldn't we use AI?"

Write a 4-5 sentence response explaining why logistic regression is a good starting point.

In [None]:
# Write your response here:

pm_response = """
YOUR RESPONSE HERE
"""

print(pm_response)

---

## ✅ Module 2 Complete!

**What you learned:**
- How to train logistic regression
- How to interpret coefficients
- How to calculate precision@K (the business metric)
- How to compare customer profiles

**Key metrics from this lab:**

In [None]:
# Summary
print("=== Module 2 Summary ===")
print(f"Model: Logistic Regression")
print(f"Features: {len(features)}")
print(f"AUC: {auc:.3f}")
print(f"Precision@500: {precision_at_k:.1%}")
print(f"Lift@500: {lift:.1f}x")
print(f"\nTop predictor of churn: {features[np.argmax(model.coef_[0])]}")
print(f"Most protective factor: {features[np.argmin(model.coef_[0])]}")

**Next:** [Module 3: When Linear Isn't Enough →](./module_03_decision_trees.ipynb)