# Module 4: Logistic Regression for Churn Prediction

**Goal:** Build a logistic regression model to predict customer churn, then optimize the classification threshold for business impact.

**Prerequisites:** Module 3 (Linear Regression)

**Expected Runtime:** ~45 minutes

**Outputs:**
- Fitted logistic regression with probability interpretation
- Precision/recall/F1 tradeoff analysis
- Cost-optimized classification threshold
- Stakeholder summary of churn model results

---

## Setup

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_auc_score, roc_curve
)
import warnings
warnings.filterwarnings('ignore')

print("✓ Libraries loaded")

## 1. Load and Explore Data

In [None]:
DATA_URL = 'https://raw.githubusercontent.com/189investmentai/ml-foundations-interactive/main/shared/data/'

customers = pd.read_csv(DATA_URL + 'streamcart_customers.csv')
print(f"Loaded {len(customers)} customers")
customers.head()

In [None]:
# Check the target: churn_30d
churn_rate = customers['churn_30d'].mean()
print(f"\nChurn rate: {churn_rate:.1%}")
print(f"Churned: {customers['churn_30d'].sum()}")
print(f"Retained: {(1 - customers['churn_30d']).sum()}")

### Self-Check: Is this imbalanced?

As a rough rule of thumb, a churn rate below about 20% means the classes are imbalanced enough that accuracy alone will mislead. Keep this in mind when evaluating metrics.

In [None]:
# Select features for churn prediction
feature_cols = ['tenure_days', 'orders_total', 'total_spend', 'support_tickets_total', 'avg_order_value']

# Check if columns exist, create if needed
if 'tenure_days' not in customers.columns:
    customers['tenure_days'] = (pd.to_datetime('2024-01-01') - pd.to_datetime(customers['signup_date'])).dt.days
if 'avg_order_value' not in customers.columns:
    customers['avg_order_value'] = customers['total_spend'] / customers['orders_total'].replace(0, 1)

# Filter to available columns
available_features = [c for c in feature_cols if c in customers.columns]
print(f"Using features: {available_features}")

X = customers[available_features].fillna(0)
y = customers['churn_30d']

print(f"\nFeature matrix shape: {X.shape}")

## 2. Train-Test Split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(f"Training set: {len(X_train)} customers ({y_train.mean():.1%} churn)")
print(f"Test set: {len(X_test)} customers ({y_test.mean():.1%} churn)")

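**Why stratify?** Stratifying on `y` keeps the churn rate nearly identical in the train and test sets. This matters most for imbalanced data, where a purely random split can leave one set with noticeably fewer churners than the other.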
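To make the effect of `stratify` concrete, here is a minimal sketch on synthetic labels. The arrays `X_demo` and `y_demo` are made up for illustration (1,000 customers with a 10% churn rate) and are not the StreamCart data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels (hypothetical): 1,000 customers, exactly 10% churners
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

# Stratified split: class proportions are preserved in both partitions
_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Unstratified split: proportions depend on the random draw
_, _, y_tr_u, y_te_u = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

print(f"Stratified   train/test churn: {y_tr_s.mean():.1%} / {y_te_s.mean():.1%}")
print(f"Unstratified train/test churn: {y_tr_u.mean():.1%} / {y_te_u.mean():.1%}")
```

The stratified rates should match the overall 10% rate, while the unstratified rates can drift by a point or more depending on the seed.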

## 3. Baseline: Always Predict Majority Class

In [None]:
# What if we always predict "no churn"?
baseline_preds = np.zeros(len(y_test))

print("Baseline (always predict 'no churn'):")
print(f"  Accuracy: {accuracy_score(y_test, baseline_preds):.1%}")
print(f"  Precision: {precision_score(y_test, baseline_preds, zero_division=0):.1%}")
print(f"  Recall: {recall_score(y_test, baseline_preds):.1%}")
print(f"  F1: {f1_score(y_test, baseline_preds, zero_division=0):.3f}")

### Self-Check: The Accuracy Trap

Notice that baseline accuracy can be high when the churn rate is low, yet recall is 0%: the baseline catches zero churners.

**Lesson:** Don't trust accuracy alone for classification, especially when the classes are imbalanced.
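A small worked example of the trap, using made-up numbers rather than the StreamCart data: with 100 churners among 1,000 customers, predicting "no churn" for everyone is right 90% of the time and still finds none of the churners. The names `y_true_demo` and `always_no_churn` are illustrative only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical test set: 1,000 customers, 100 of whom churn (10%)
y_true_demo = np.array([1] * 100 + [0] * 900)
always_no_churn = np.zeros_like(y_true_demo)

acc_demo = accuracy_score(y_true_demo, always_no_churn)
rec_demo = recall_score(y_true_demo, always_no_churn, zero_division=0)

print(f"Accuracy: {acc_demo:.1%}")  # 90.0%
print(f"Recall:   {rec_demo:.1%}")  # 0.0%
```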

## 4. Fit Logistic Regression

In [None]:
# Fit logistic regression
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("✓ Model trained")
print(f"\nIntercept: {model.intercept_[0]:.4f}")
print("\nCoefficients:")
for name, coef in zip(available_features, model.coef_[0]):
    odds_ratio = np.exp(coef)
    print(f"  {name}: {coef:.4f} (odds ratio: {odds_ratio:.2f})")

### Interpreting Coefficients

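- **Positive coefficient:** A higher feature value increases the predicted probability of churn
- **Negative coefficient:** A higher feature value decreases the predicted probability of churn
- **Odds ratio:** `exp(coefficient)`. Each one-unit increase in the feature multiplies the odds of churn by this factor, holding the other features constant. Values above 1 raise the odds; values below 1 lower them

Because the features are not scaled, coefficient sizes depend on each feature's units (a day of tenure versus a dollar of spend), so compare magnitudes across features with care.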
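The link between a coefficient, its odds ratio, and the resulting probability can be sketched numerically. The coefficient and baseline log-odds below are hypothetical values chosen for illustration, not results from the fitted model, and `sigmoid` is a helper defined here.

```python
import numpy as np

def sigmoid(z):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values for illustration only (not fitted results)
coef_demo = 0.30          # coefficient for one feature
baseline_log_odds = -2.0  # log-odds of churn before the feature changes

odds_ratio_demo = np.exp(coef_demo)
p_before = sigmoid(baseline_log_odds)
p_after = sigmoid(baseline_log_odds + coef_demo)  # feature increased by one unit

print(f"Odds ratio: {odds_ratio_demo:.2f}")
print(f"Churn probability: {p_before:.3f} -> {p_after:.3f}")

# The odds p / (1 - p) scale by exactly the odds ratio; the probability does not
odds_before = p_before / (1 - p_before)
odds_after = p_after / (1 - p_after)
print(f"Odds multiplied by: {odds_after / odds_before:.2f}")
```

Note that the odds ratio is constant, but the change in probability depends on where the customer starts: the same coefficient moves a 50% customer more than a 2% customer.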

## 5. Get Probabilities

In [None]:
# Get probability of churn (class 1)
probabilities = model.predict_proba(X_test)[:, 1]

print("Probability distribution:")
print(f"  Min: {probabilities.min():.3f}")
print(f"  Max: {probabilities.max():.3f}")
print(f"  Mean: {probabilities.mean():.3f}")
print(f"  Median: {np.median(probabilities):.3f}")

In [None]:
# Visualize probability distributions by actual class
fig, ax = plt.subplots(figsize=(10, 5))

ax.hist(probabilities[y_test == 0], bins=30, alpha=0.6, label='Actual: Retained', color='#3b82f6')
ax.hist(probabilities[y_test == 1], bins=30, alpha=0.6, label='Actual: Churned', color='#ef4444')
ax.axvline(x=0.5, color='#f59e0b', linestyle='--', linewidth=2, label='Threshold = 0.5')

ax.set_xlabel('Predicted Probability of Churn')
ax.set_ylabel('Count')
ax.set_title('Probability Distributions by Actual Class')
ax.legend()
plt.tight_layout()
plt.show()

**What to look for:**
- Good model: Blue (retained) peaks LEFT, Red (churned) peaks RIGHT
- Perfect model: No overlap between distributions
- Poor model: Distributions overlap completely

## 6. Evaluate with Default Threshold (0.5)

In [None]:
# Default predictions (threshold = 0.5)
default_preds = model.predict(X_test)

print("Default threshold (0.5):")
print(f"  Accuracy: {accuracy_score(y_test, default_preds):.1%}")
print(f"  Precision: {precision_score(y_test, default_preds):.1%}")
print(f"  Recall: {recall_score(y_test, default_preds):.1%}")
print(f"  F1: {f1_score(y_test, default_preds):.3f}")
print(f"  AUC: {roc_auc_score(y_test, probabilities):.3f}")

In [None]:
# Confusion matrix
cm = confusion_matrix(y_test, default_preds)
tn, fp, fn, tp = cm.ravel()

print("\nConfusion Matrix:")
print(f"                 Predicted")
print(f"                 No    Yes")
print(f"Actual No      {tn:4d}  {fp:4d}  (True Neg / False Pos)")
print(f"Actual Yes     {fn:4d}  {tp:4d}  (False Neg / True Pos)")

## 7. TODO: Find the Optimal Threshold

The business context:
- **Cost of False Positive:** $50 (wasted retention offer)
- **Cost of False Negative:** $200 (lost customer lifetime value)

Your task: Find the threshold that minimizes total cost.

In [None]:
# Cost calculation function
def calculate_total_cost(y_true, y_pred, fp_cost=50, fn_cost=200):
    """
    Calculate total business cost from predictions.
    
    Args:
        y_true: Actual labels
        y_pred: Predicted labels
        fp_cost: Cost per false positive ($50 = wasted retention offer)
        fn_cost: Cost per false negative ($200 = lost customer)
    
    Returns:
        Total cost
    """
    # labels=[0, 1] guarantees a 2x2 matrix even if one class never appears
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    tn, fp, fn, tp = cm.ravel()
    
    # Total cost = FP cost + FN cost
    total_cost = fp * fp_cost + fn * fn_cost
    
    return total_cost

# Test the function
test_cost = calculate_total_cost(y_test, default_preds)
print(f"Test: Cost at threshold 0.5 = ${test_cost:,}")

In [None]:
# Sweep thresholds to find cost-optimal one
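# 0.10 to 0.90 inclusive in steps of 0.05; linspace avoids np.arange's float-step endpoint ambiguity
thresholds = np.round(np.linspace(0.10, 0.90, 17), 2)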
results = []

for thresh in thresholds:
    preds = (probabilities >= thresh).astype(int)
    
    # Calculate metrics for this threshold
    prec = precision_score(y_test, preds, zero_division=0)
    rec = recall_score(y_test, preds)
    f1 = f1_score(y_test, preds, zero_division=0)
    cost = calculate_total_cost(y_test, preds, fp_cost=50, fn_cost=200)
    
    results.append({
        'threshold': thresh,
        'precision': prec,
        'recall': rec,
        'f1': f1,
        'cost': cost
    })

results_df = pd.DataFrame(results)
results_df

In [None]:
# Find and print the optimal threshold
optimal_idx = results_df['cost'].idxmin()
optimal_row = results_df.loc[optimal_idx]

print(f"=== Optimal Threshold: {optimal_row['threshold']:.2f} ===")
print(f"At this threshold:")
print(f"  Precision: {optimal_row['precision']:.1%}")
print(f"  Recall: {optimal_row['recall']:.1%}")
print(f"  F1: {optimal_row['f1']:.3f}")
print(f"  Total Cost: ${optimal_row['cost']:,.0f}")

# Compare to default
default_cost = calculate_total_cost(y_test, default_preds)
print(f"\n💰 Savings vs default (0.5): ${default_cost - optimal_row['cost']:,.0f}")

### Self-Check: Threshold Intuition

- Is the optimal threshold above or below 0.5?
- Why does this make sense given that FN cost > FP cost?
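One way to reason about the second question, assuming the predicted probabilities are reasonably well calibrated and that correct predictions cost nothing (as in `calculate_total_cost`): flagging a customer with churn probability `p` costs `(1 - p) * fp_cost` in expectation, while not flagging costs `p * fn_cost`. Flagging is cheaper once `p` exceeds `fp_cost / (fp_cost + fn_cost)`. The helper name `break_even_threshold` is introduced here for illustration.

```python
def break_even_threshold(fp_cost, fn_cost):
    """Probability above which flagging a customer has lower expected cost.

    Flagging costs (1 - p) * fp_cost in expectation; not flagging costs
    p * fn_cost. The two are equal at p = fp_cost / (fp_cost + fn_cost).
    """
    return fp_cost / (fp_cost + fn_cost)

print(f"Break-even threshold: {break_even_threshold(50, 200):.2f}")  # 0.20
```

The threshold found by the sweep will not necessarily land exactly on this value, since the sweep uses a coarse grid and a finite test set, and logistic regression probabilities are only approximately calibrated.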

## 8. Visualize the Tradeoff

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Precision-Recall tradeoff
ax1 = axes[0]
ax1.plot(results_df['threshold'], results_df['precision'], 'b-', label='Precision', linewidth=2)
ax1.plot(results_df['threshold'], results_df['recall'], 'r-', label='Recall', linewidth=2)
ax1.plot(results_df['threshold'], results_df['f1'], 'g--', label='F1', linewidth=2)
ax1.axvline(x=0.5, color='gray', linestyle=':', alpha=0.5)
ax1.set_xlabel('Threshold')
ax1.set_ylabel('Score')
ax1.set_title('Precision-Recall Tradeoff')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Cost curve
ax2 = axes[1]
ax2.plot(results_df['threshold'], results_df['cost'], 'purple', linewidth=2)
min_cost_thresh = results_df.loc[results_df['cost'].idxmin(), 'threshold']
ax2.axvline(x=min_cost_thresh, color='green', linestyle='--', label=f'Optimal: {min_cost_thresh:.2f}')
ax2.set_xlabel('Threshold')
ax2.set_ylabel('Total Cost ($)')
ax2.set_title('Business Cost by Threshold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 9. ROC Curve

In [None]:
# ROC Curve
fpr, tpr, roc_thresholds = roc_curve(y_test, probabilities)
auc = roc_auc_score(y_test, probabilities)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, 'b-', linewidth=2, label=f'Model (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.5)')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate (Recall)')
plt.title('ROC Curve')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"\nAUC Interpretation:")
print(f"  Random guessing: 0.5")
print(f"  Perfect model: 1.0")
print(f"  Our model: {auc:.3f}")
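
AUC also has a ranking interpretation: it is the probability that a randomly chosen churner receives a higher score than a randomly chosen retained customer. The sketch below checks this on synthetic scores; the data and the names `y_demo_auc` and `scores_demo` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic scores (hypothetical): churners tend to score higher
y_demo_auc = np.array([0] * 400 + [1] * 100)
scores_demo = np.concatenate([
    rng.normal(0.3, 0.15, 400),  # retained
    rng.normal(0.5, 0.15, 100),  # churned
])

# Fraction of (churner, retained) pairs where the churner scores higher
pos = scores_demo[y_demo_auc == 1]
neg = scores_demo[y_demo_auc == 0]
pairwise = (pos[:, None] > neg[None, :]).mean()

print(f"Pairwise ranking rate: {pairwise:.3f}")
print(f"roc_auc_score:         {roc_auc_score(y_demo_auc, scores_demo):.3f}")
```

Because AUC depends only on how customers are ranked, it does not change when the classification threshold moves, which is why the cost analysis above is still needed.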

## 10. Final Evaluation with Optimal Threshold

In [None]:
# Apply optimal threshold
optimal_threshold = results_df.loc[results_df['cost'].idxmin(), 'threshold']
final_preds = (probabilities >= optimal_threshold).astype(int)

print(f"Final Model Performance (threshold = {optimal_threshold:.2f})")
print("=" * 50)
print(classification_report(y_test, final_preds, target_names=['Retained', 'Churned']))

In [None]:
# Business impact comparison
default_cost = calculate_total_cost(y_test, default_preds)
optimal_cost = calculate_total_cost(y_test, final_preds)
baseline_cost = calculate_total_cost(y_test, baseline_preds)

print("\nBusiness Impact Comparison:")
print(f"  Baseline (no model): ${baseline_cost:,.0f}")
print(f"  Default threshold (0.5): ${default_cost:,.0f}")
print(f"  Optimal threshold ({optimal_threshold:.2f}): ${optimal_cost:,.0f}")
print(f"\n  Savings vs baseline: ${baseline_cost - optimal_cost:,.0f}")
print(f"  Savings vs default: ${default_cost - optimal_cost:,.0f}")

## 11. Stakeholder Summary

### TODO: Write a 3-bullet summary (~100 words) for the retention team

Template:
- **What it does:** A model that predicts churn risk for each customer, using threshold ____ to balance costs.
- **Performance:** Of customers we flag, about ___% actually churn (precision). We catch ___% of all churners (recall).
- **Recommendation:** [How should the team use these predictions? What's the expected cost savings?]

**Your Summary:**

_[Write your summary here]_

In [None]:
# SELF-CHECK: Verify your threshold optimization is correct
# Run this after completing the cost optimization section

# Cost parameters (should match what you used above)
FP_COST = 50   # Cost of false positive (wasted retention offer)
FN_COST = 200  # Cost of false negative (lost customer)

# Check that the cost function works
assert calculate_total_cost(y_test, default_preds) > 0, "Cost function should return positive value"

# Check that you found an optimal threshold
assert 'optimal_threshold' in dir(), "Should have found optimal_threshold"
assert optimal_threshold != 0.5, "Optimal threshold should differ from default 0.5 (given FN costs 4x FP)"

# Check that optimal threshold is lower (since FN cost > FP cost)
assert optimal_threshold < 0.5, "When FN costs more than FP, optimal threshold should be below 0.5"

# Check cost savings
optimal_cost = calculate_total_cost(y_test, final_preds, FP_COST, FN_COST)
default_cost_check = calculate_total_cost(y_test, default_preds, FP_COST, FN_COST)
assert optimal_cost <= default_cost_check, "Optimal threshold should reduce or maintain costs"

print("✅ Self-check passed!")
print(f"   Optimal threshold: {optimal_threshold:.2f}")
print(f"   Default cost (0.5): ${default_cost_check:,}")
print(f"   Optimal cost: ${optimal_cost:,}")
print(f"   Savings: ${default_cost_check - optimal_cost:,}")

---

## Self-Assessment Checklist

- [ ] I understand why accuracy is misleading for imbalanced classes
- [ ] I can explain the precision-recall tradeoff
- [ ] I found the optimal threshold using business costs
- [ ] I can interpret logistic regression coefficients as odds ratios
- [ ] I wrote a clear stakeholder summary

## Next Steps

1. **Debug Drill:** Fix a classification model with threshold issues
2. **Module 5:** Decision Trees - see how non-linear boundaries work