# Module 3: When Linear Isn't Enough

**Goal:** Understand decision trees and random forests, and when they beat logistic regression

**Time:** ~20 minutes

**What you'll do:**
1. Train and visualize a decision tree
2. Train a random forest
3. Compare to logistic regression
4. Interpret feature importance

---

## Setup

In [None]:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv('https://raw.githubusercontent.com/189investmentai/ml-foundations-interactive/main/shared/data/streamcart_customers.csv')

# Prepare features
features = ['tenure_months', 'logins_last_7d', 'logins_last_30d',
            'support_tickets_last_30d', 'items_skipped_last_3_boxes', 'nps_score']

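X = df[features]
y = df['churn_30d']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fill missing values with medians computed on the training set only,
# so no information from the test set leaks into training
train_medians = X_train.median()
X_train = X_train.fillna(train_medians)
X_test = X_test.fillna(train_medians)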

print(f"Training: {len(X_train):,} | Test: {len(X_test):,}")
print(f"Churn rate: {y_train.mean():.1%}")

---

## Part 1: Train a Decision Tree

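A decision tree asks a sequence of yes/no questions (for example, whether `logins_last_7d` is below some threshold) and splits customers into groups with increasingly similar churn rates.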
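
To make "asking a yes/no question" concrete, here is a minimal sketch of how one candidate split can be scored with Gini impurity, the default criterion in scikit-learn's `DecisionTreeClassifier`. The toy arrays `logins_toy` and `churned_toy` and the helper names are made up for illustration; they are not from the StreamCart data.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of binary 0/1 labels: 2 * p * (1 - p)."""
    if len(labels) == 0:
        return 0.0
    p = np.mean(labels)
    return 2 * p * (1 - p)

def split_impurity(feature, labels, threshold):
    """Weighted Gini impurity after asking 'feature <= threshold?'."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical toy data: logins in the last 7 days, and whether the customer churned
logins_toy = np.array([0, 0, 1, 1, 2, 5, 6, 8])
churned_toy = np.array([1, 1, 1, 0, 0, 0, 0, 0])

print(f"Impurity before any split: {gini(churned_toy):.3f}")
for threshold in [0.5, 1.5, 3.5]:
    score = split_impurity(logins_toy, churned_toy, threshold)
    print(f"logins <= {threshold}: weighted impurity {score:.3f}")
```

The threshold with the lowest weighted impurity (here 1.5) is the question the tree would ask first on this toy feature. A real tree repeats this search over every feature and every candidate threshold at each node.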

In [None]:
# TODO: Train a decision tree with max_depth=3
#
# Why max_depth=3? Deeper trees tend to overfit, so start shallow.

tree_model = None  # Replace with your code

# Uncomment when ready:
# tree_model = DecisionTreeClassifier(max_depth=3, random_state=42)
# tree_model.fit(X_train, y_train)

In [None]:
# ============================================
# SELF-CHECK
# ============================================

assert tree_model is not None, "Create the tree model first!"
assert hasattr(tree_model, 'tree_'), "Model not trained: did you call .fit()?"
print(f"✓ Tree trained with {tree_model.tree_.node_count} nodes")

### Visualize the Tree

This is the beauty of decision trees: you can actually see the logic!

In [None]:
plt.figure(figsize=(20, 10))
plot_tree(
    tree_model,
    feature_names=features,
    class_names=['Stay', 'Churn'],
    filled=True,
    rounded=True,
    fontsize=10
)
plt.title('Churn Decision Tree')
plt.tight_layout()
plt.show()
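
If the plot is hard to read, the same rules can be printed as text with `sklearn.tree.export_text`. The sketch below uses synthetic stand-in data and placeholder feature names (`f0` to `f3`) so it runs on its own; in the notebook you would pass `tree_model` and `features` instead.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data (an assumption for this sketch, not the StreamCart data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
demo_names = ['f0', 'f1', 'f2', 'f3']  # placeholder feature names

demo_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
demo_tree.fit(X_demo, y_demo)

# One line per node: the split condition, or the predicted class at a leaf
rules_text = export_text(demo_tree, feature_names=demo_names)
print(rules_text)
```

Each indented line is one yes/no question, so the path from the top to any `class:` line is the full explanation for customers who land in that leaf.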

In [None]:
# TODO: Look at the tree and answer:
#
# 1. What feature does the tree split on FIRST? ___________
# 2. What's the threshold for that split? ___________
# 3. Does this make business sense? ___________

---

## Part 2: Train a Random Forest

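Random forests = many trees (100 here), each trained on a bootstrap sample of the rows and restricted to a random subset of features at each split, with their predictions averaged. Usually more accurate than a single tree, but less interpretable.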
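
To see what the forest is doing under the hood, here is a rough hand-rolled sketch: fit many trees on bootstrap samples and average their predicted probabilities. It uses synthetic data and is a simplification for illustration, not how you should train a forest in practice (use `RandomForestClassifier` for that).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for this sketch)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=6, n_informative=4, random_state=0
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
n_trees = 50
per_tree_probs = []
for i in range(n_trees):
    # Bootstrap: draw as many rows as the training set has, with replacement
    idx = rng.integers(0, len(Xtr), size=len(Xtr))
    # max_features='sqrt' makes each split consider a random subset of features
    one_tree = DecisionTreeClassifier(max_depth=5, max_features='sqrt', random_state=i)
    one_tree.fit(Xtr[idx], ytr[idx])
    per_tree_probs.append(one_tree.predict_proba(Xte)[:, 1])

# The "forest" prediction is the average of the individual trees' probabilities
forest_probs = np.mean(per_tree_probs, axis=0)

single_tree_auc = roc_auc_score(yte, per_tree_probs[0])
hand_forest_auc = roc_auc_score(yte, forest_probs)
print(f"First tree alone: {single_tree_auc:.3f}")
print(f"Average of {n_trees} trees: {hand_forest_auc:.3f}")
```

Individual trees make different mistakes because each one sees different rows and features, and averaging tends to cancel those mistakes out.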

In [None]:
# TODO: Train a random forest with 100 trees, max_depth=5

rf_model = None  # Replace with your code

# Uncomment when ready:
# rf_model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
# rf_model.fit(X_train, y_train)

In [None]:
# ============================================
# SELF-CHECK
# ============================================

assert rf_model is not None, "Create the random forest first!"
assert len(rf_model.estimators_) == 100, "Should have 100 trees"
print(f"✓ Random forest with {len(rf_model.estimators_)} trees trained")

---

## Part 3: Compare All Three Models

Let's train logistic regression too and compare all approaches.

In [None]:
# Train logistic regression for comparison
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train, y_train)

# Get predictions from each model
tree_probs = tree_model.predict_proba(X_test)[:, 1]
rf_probs = rf_model.predict_proba(X_test)[:, 1]
lr_probs = lr_model.predict_proba(X_test)[:, 1]

# Calculate AUC for each
tree_auc = roc_auc_score(y_test, tree_probs)
rf_auc = roc_auc_score(y_test, rf_probs)
lr_auc = roc_auc_score(y_test, lr_probs)

print("=== AUC Comparison ===")
print(f"Logistic Regression: {lr_auc:.3f}")
print(f"Single Decision Tree: {tree_auc:.3f}")
print(f"Random Forest:        {rf_auc:.3f}")
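# Pick the winner across all three models, not just two of them
best_name, best_auc = max(
    [('Logistic Regression', lr_auc), ('Single Decision Tree', tree_auc), ('Random Forest', rf_auc)],
    key=lambda pair: pair[1],
)
print(f"\nBest: {best_name}")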

In [None]:
# Calculate Precision@500 for each
k = 500
baseline = y_test.mean()

def precision_at_k(y_true, y_proba, k):
    top_k = np.argsort(y_proba)[-k:]
    return y_true.iloc[top_k].mean()

tree_prec = precision_at_k(y_test, tree_probs, k)
rf_prec = precision_at_k(y_test, rf_probs, k)
lr_prec = precision_at_k(y_test, lr_probs, k)

print(f"=== Precision@{k} ===")
print(f"Random baseline:      {baseline:.1%}")
print(f"Logistic Regression:  {lr_prec:.1%} (lift: {lr_prec/baseline:.1f}x)")
print(f"Single Decision Tree: {tree_prec:.1%} (lift: {tree_prec/baseline:.1f}x)")
print(f"Random Forest:        {rf_prec:.1%} (lift: {rf_prec/baseline:.1f}x)")
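
If Precision@k and lift feel abstract, a tiny worked example may help. The ten labels and scores below are invented for illustration, and the function repeats the same top-k logic used in the cell above.

```python
import numpy as np
import pandas as pd

def precision_at_k(y_true, y_proba, k):
    # Indices of the k highest scores, then the share of true churners among them
    top_k = np.argsort(y_proba)[-k:]
    return y_true.iloc[top_k].mean()

# Hypothetical data for 10 customers: 1 = churned
y_toy = pd.Series([1, 0, 1, 0, 0, 1, 0, 0, 0, 0])
toy_scores = np.array([0.9, 0.8, 0.7, 0.2, 0.1, 0.6, 0.3, 0.05, 0.4, 0.15])

toy_base_rate = y_toy.mean()                            # 3 churners out of 10
toy_p_at_3 = precision_at_k(y_toy, toy_scores, k=3)     # top 3 scores: 0.9, 0.8, 0.7
toy_lift = toy_p_at_3 / toy_base_rate

print(f"Base rate:   {toy_base_rate:.1%}")
print(f"Precision@3: {toy_p_at_3:.1%}")
print(f"Lift:        {toy_lift:.1f}x")
```

Two of the three highest-scored customers churned, so Precision@3 is about 67% against a 30% base rate: contacting the model's top 3 finds churners roughly 2.2 times as often as contacting 3 customers at random.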

### Key Question

Did the random forest beat logistic regression? By how much?

Often the improvement is small (0.01-0.03 AUC). Is that worth the loss of interpretability?
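
Before weighing that tradeoff, it helps to know whether a small AUC gap is real or just noise from the particular test set. One common approach is a bootstrap interval on the difference. The sketch below runs on synthetic data with hypothetical variable names; to apply it in the notebook, you would resample `y_test`, `rf_probs`, and `lr_probs` in the same way.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for this sketch)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=6, n_informative=4, random_state=0
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

lr_demo = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
rf_demo = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0).fit(Xtr, ytr)
lr_p = lr_demo.predict_proba(Xte)[:, 1]
rf_p = rf_demo.predict_proba(Xte)[:, 1]

rng = np.random.default_rng(0)
auc_diffs = []
for _ in range(500):
    # Resample test rows with replacement and recompute both AUCs on the same rows
    idx = rng.integers(0, len(yte), size=len(yte))
    if len(np.unique(yte[idx])) < 2:
        continue  # AUC is undefined if a resample contains only one class
    auc_diffs.append(roc_auc_score(yte[idx], rf_p[idx]) - roc_auc_score(yte[idx], lr_p[idx]))

ci_low, ci_high = np.percentile(auc_diffs, [2.5, 97.5])
print(f"RF minus LR AUC, 95% bootstrap interval: [{ci_low:.3f}, {ci_high:.3f}]")
```

If the interval comfortably excludes zero, the forest's advantage is probably real on this data; if it straddles zero, the simpler model may be the better choice.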

---

## Part 4: Feature Importance

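Random forests report how much each feature's splits reduced impurity, averaged across all trees. Treat this as a rough guide: impurity-based importance is computed on the training data and can favor features with many distinct values.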

In [None]:
# Get feature importance from random forest
importance_df = pd.DataFrame({
    'feature': features,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=True)

print("=== Feature Importance (Random Forest) ===")
for _, row in importance_df.iterrows():
    print(f"{row['feature']:30} {row['importance']:.3f}")

# Visualize
plt.figure(figsize=(10, 5))
plt.barh(importance_df['feature'], importance_df['importance'], color='steelblue')
plt.xlabel('Importance (Gini)')
plt.title('Which Features Matter Most?')
plt.tight_layout()
plt.show()
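
A useful cross-check on impurity-based importance is permutation importance: shuffle one feature at a time on held-out data and measure how much the score drops. The sketch below uses scikit-learn's `permutation_importance` on synthetic data; in the notebook you would pass `rf_model`, `X_test`, and `y_test` instead of the stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for this sketch)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)
rf_demo.fit(Xtr, ytr)

# Shuffle each feature 5 times on the test set and record the drop in AUC
perm_result = permutation_importance(
    rf_demo, Xte, yte, scoring='roc_auc', n_repeats=5, random_state=0
)

for i in perm_result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: gini={rf_demo.feature_importances_[i]:.3f}  "
          f"permutation={perm_result.importances_mean[i]:.3f} "
          f"+/- {perm_result.importances_std[i]:.3f}")
```

When the two rankings agree, you can be more confident in them. When they disagree, the permutation numbers are usually the safer guide because they are measured on data the model did not train on.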

### Compare to Logistic Regression Coefficients

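Feature importance (trees) vs coefficients (logistic regression) measure different things: importance reflects how much a feature's splits reduce impurity, while a coefficient is the change in log-odds per unit of the feature, so its size depends on the feature's units.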
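
One reason the two measures differ is that a logistic regression coefficient changes when you change a feature's units, even though the model's predictions do not. The sketch below illustrates this with made-up data: the same tenure feature expressed in months and in years. The simulated relationship and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
tenure_months_sim = rng.uniform(0, 36, n)

# Hypothetical ground truth: churn probability falls as tenure grows
p_churn = 1 / (1 + np.exp(-(1.0 - 0.1 * tenure_months_sim)))
churn_sim = rng.binomial(1, p_churn)

# C is set very large so regularization has almost no effect on the comparison
lr_months = LogisticRegression(C=1e6, max_iter=5000)
lr_months.fit(tenure_months_sim.reshape(-1, 1), churn_sim)

lr_years = LogisticRegression(C=1e6, max_iter=5000)
lr_years.fit((tenure_months_sim / 12).reshape(-1, 1), churn_sim)

coef_months = lr_months.coef_[0, 0]
coef_years = lr_years.coef_[0, 0]
print(f"Coefficient per month: {coef_months:.3f}")
print(f"Coefficient per year:  {coef_years:.3f}")
print(f"Ratio: {coef_years / coef_months:.1f}")
```

The per-year coefficient comes out roughly 12 times larger, purely because of the unit change. This is why raw coefficient magnitudes should be compared only after putting features on a common scale.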

In [None]:
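# Compare rankings
# Raw coefficients depend on each feature's units, so scale them by the
# feature's standard deviation before ranking them by magnitude
lr_importance = pd.DataFrame({
    'feature': features,
    'lr_coef_abs': np.abs(lr_model.coef_[0]) * X_train[features].std().values,
    'rf_importance': rf_model.feature_importances_
})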

lr_importance['lr_rank'] = lr_importance['lr_coef_abs'].rank(ascending=False)
lr_importance['rf_rank'] = lr_importance['rf_importance'].rank(ascending=False)

print("=== Feature Ranking Comparison ===")
print(lr_importance[['feature', 'lr_rank', 'rf_rank']].sort_values('rf_rank'))

---

## Part 5: Why Trees Find Interactions

Trees can find patterns like "High tenure AND support tickets = very high risk" automatically.
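
To see why this matters, consider a deliberately extreme synthetic case where the outcome depends only on the combination of two flags, not on either flag alone. The data and names below (`flag_a`, `flag_b`) are invented for illustration and are much cleaner than anything in real churn data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 4000
flag_a = rng.integers(0, 2, n)
flag_b = rng.integers(0, 2, n)

# Pure interaction: the label is 1 only when exactly one of the flags is set
label = (flag_a != flag_b).astype(int)
X_inter = np.column_stack([flag_a, flag_b])

Xtr, Xte, ytr, yte = train_test_split(X_inter, label, test_size=0.25, random_state=0)

lr_inter = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
tree_inter = DecisionTreeClassifier(max_depth=2, random_state=0).fit(Xtr, ytr)

lr_inter_auc = roc_auc_score(yte, lr_inter.predict_proba(Xte)[:, 1])
tree_inter_auc = roc_auc_score(yte, tree_inter.predict_proba(Xte)[:, 1])
print(f"Logistic regression AUC: {lr_inter_auc:.3f}")
print(f"Depth-2 tree AUC:        {tree_inter_auc:.3f}")
```

Logistic regression adds up one effect per feature, so without a hand-built interaction term it scores near chance here, while a depth-2 tree recovers the pattern by splitting on one flag and then the other. Real interactions are rarely this clean, which is one reason the gains on real data are often smaller.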

In [None]:
# Let's look at churn rates in different segments
df_analysis = df.copy()
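# include_lowest keeps tenure_months == 0 in 'New'; np.inf keeps very long tenures in 'Veteran'
df_analysis['tenure_bucket'] = pd.cut(
    df['tenure_months'], bins=[0, 6, 12, np.inf],
    labels=['New', 'Medium', 'Veteran'], include_lowest=True
)
df_analysis['has_tickets'] = (df['support_tickets_last_30d'] > 0).astype(int)

# Cross-tabulation (observed=True lists only the segments that occur in the data)
segment_churn = df_analysis.groupby(['tenure_bucket', 'has_tickets'], observed=True)['churn_30d'].agg(['mean', 'count'])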
segment_churn.columns = ['churn_rate', 'count']
segment_churn['churn_rate'] = segment_churn['churn_rate'].map('{:.1%}'.format)

print("=== Churn by Segment ===")
print(segment_churn)
print("\n💡 Notice: Veterans WITH tickets might have different risk than the average.")
print("   Trees find these interactions automatically!")

---

## Part 6: Overfitting Demo

What happens if we remove the depth limit?

In [None]:
# Train an unrestricted tree
tree_overfit = DecisionTreeClassifier(random_state=42)  # No max_depth!
tree_overfit.fit(X_train, y_train)

# Evaluate
train_auc = roc_auc_score(y_train, tree_overfit.predict_proba(X_train)[:, 1])
test_auc = roc_auc_score(y_test, tree_overfit.predict_proba(X_test)[:, 1])

print("=== Unrestricted Tree ===")
print(f"Number of leaves: {tree_overfit.get_n_leaves()}")
print(f"Train AUC: {train_auc:.3f}")
print(f"Test AUC:  {test_auc:.3f}")
print(f"\n⚠️  Gap of {train_auc - test_auc:.3f} = OVERFITTING!")
print(f"   The tree memorized {tree_overfit.get_n_leaves()} tiny groups.")
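
The unrestricted tree is one end of a spectrum. A quick way to see the whole picture is to sweep `max_depth` and watch the train and test scores pull apart. The sketch below uses noisy synthetic data and its own variable names, so it will not overwrite anything in the notebook; the exact numbers will differ on the StreamCart data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 20% label noise (an assumption for this sketch)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=6, n_informative=3, n_redundant=0,
    flip_y=0.2, random_state=0
)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

depth_results = []
for depth in [1, 3, 5, 10, None]:  # None means no depth limit
    sweep_tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(Xtr, ytr)
    tr_auc = roc_auc_score(ytr, sweep_tree.predict_proba(Xtr)[:, 1])
    te_auc = roc_auc_score(yte, sweep_tree.predict_proba(Xte)[:, 1])
    depth_results.append((depth, tr_auc, te_auc))
    print(f"max_depth={str(depth):>4}  train AUC={tr_auc:.3f}  "
          f"test AUC={te_auc:.3f}  gap={tr_auc - te_auc:.3f}")
```

Train AUC keeps climbing with depth because the tree can always carve out smaller groups, while test AUC typically levels off or falls. The depth where the test score stops improving is a reasonable starting point for `max_depth`.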

---

## 📝 Final Exercise: Explain It

The PM sees that random forest beats logistic regression and asks: "Why did it predict this customer would churn?"

Write a 4-5 sentence response explaining the interpretability tradeoff.

In [None]:
# Write your response:

pm_response = """
YOUR RESPONSE HERE
"""

print(pm_response)

---

## ✅ Module 3 Complete!

**What you learned:**
- How decision trees make splits
- Why random forests are more robust
- How to read feature importance
- The overfitting danger with unrestricted trees

**Key takeaway:** Trees find interactions automatically. A single shallow tree stays readable, but a random forest trades that interpretability for accuracy.

In [None]:
# Summary
print("=== Module 3 Summary ===")
print(f"\nModel Performance:")
print(f"  Logistic Regression: {lr_auc:.3f} AUC")
print(f"  Decision Tree:       {tree_auc:.3f} AUC")
print(f"  Random Forest:       {rf_auc:.3f} AUC")
print(f"\nTop Feature (RF): {features[np.argmax(rf_model.feature_importances_)]}")
print(f"\nOverfitting demo: Unrestricted tree had {train_auc - test_auc:.2f} train-test gap")

**Next:** [Module 4: Combining Many Weak Learners →](./module_04_boosting.ipynb)