# Random Forests

A random forest is a collection of decision trees that work together to make better predictions than any single tree could make on its own.

### Advantage
A single decision tree can overfit: it memorizes the training data too well and fails to generalize to new data. A random forest reduces this problem by building many decision trees, each one learning from a slightly different view of your data, and then combining their predictions.

### How it works
Each tree in the forest is trained on a random sample of your data (with replacement—meaning some rows might be used multiple times, others skipped). Additionally, at each node when the tree decides which feature to split on, it only considers a random subset of features to choose from. These two random choices—different data samples and different feature choices—mean each tree ends up being unique and makes different mistakes.

When you want to make a prediction on new data, all the trees vote. For classification, the class that most trees predict wins. For regression, you take the average of all the predictions. Because the trees make partly independent mistakes, this "wisdom of the crowd" effect lets many of their individual errors cancel out. The more the trees' errors are correlated, the smaller that benefit.
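
The two random choices and the vote described above can be sketched by hand with plain decision trees. This is a minimal illustration, not how scikit-learn implements its forest internally. The synthetic data from `make_classification` and all names ending in `_demo` or starting with `Xd_`/`yd_` are assumptions made for the example, not part of the churn notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the churn dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

rng = np.random.default_rng(0)
n_trees = 51  # odd number, so a 0/1 majority vote cannot tie
trees = []

for i in range(n_trees):
    # Random choice 1: bootstrap sample (rows drawn with replacement)
    rows = rng.integers(0, len(Xd_train), size=len(Xd_train))
    # Random choice 2: max_features='sqrt' makes the tree consider only a
    # random subset of features at every split
    t = DecisionTreeClassifier(max_features='sqrt', random_state=i)
    t.fit(Xd_train[rows], yd_train[rows])
    trees.append(t)

# Every tree votes; the majority class wins (labels here are 0/1)
votes = np.array([t.predict(Xd_test) for t in trees])  # shape (n_trees, n_test)
majority = (votes.mean(axis=0) > 0.5).astype(int)

single_acc = np.mean([(v == yd_test).mean() for v in votes])
vote_acc = (majority == yd_test).mean()
print(f"Average accuracy of one tree:  {single_acc:.3f}")
print(f"Accuracy of the majority vote: {vote_acc:.3f}")
```

One difference worth knowing: scikit-learn's `RandomForestClassifier` averages the trees' predicted class probabilities rather than counting hard votes, so its output can differ slightly from this sketch.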

# Random Forests: Why Ensembles Beat Single Trees
## Customer Churn Prediction

You just saw how decision trees work. They're interpretable and powerful, but they have a problem: **instability**.

Small changes in data → very different trees. This is where **Random Forests** come in.

**Today:** Understand why ensembles reduce variance and produce better predictions.

---

## Setup: Import Libraries

We'll use the same libraries as before, plus RandomForestClassifier from scikit-learn.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import seaborn as sns

# Set random seed for reproducibility
np.random.seed(42)

# Make plots look nice
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## Part 1: Load & Prepare Data

Same churn dataset as before. We'll use it to compare single trees vs. ensembles.

In [None]:
# Load the telecom churn dataset from GitHub
url = 'https://raw.githubusercontent.com/TUHHStartupEngineers/dat_sci_ss20/master/13/WA_Fn-UseC_-Telco-Customer-Churn.csv'
df = pd.read_csv(url)

print(f"Dataset shape: {df.shape}")
print(f"Churn rate: {(df['Churn'] == 'Yes').mean():.1%}")

### Prepare Data

Encode categorical features, convert target to binary, split train/test.

In [None]:
# Clean and prepare data
df_clean = df.copy()
df_clean = df_clean.drop(columns=['customerID'])

# Target: convert Yes/No to 1/0
y = (df_clean['Churn'] == 'Yes').astype(int)
X = df_clean.drop(columns=['Churn'])

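# Fix TotalCharges first: it can be read as text (blank entries), so it must be
# converted to numeric BEFORE the encoding loop, which would otherwise
# label-encode it like a category
X['TotalCharges'] = pd.to_numeric(X['TotalCharges'], errors='coerce')
X['TotalCharges'] = X['TotalCharges'].fillna(X['TotalCharges'].median())

# Encode the remaining non-numeric (categorical) features as integer codes
for col in X.select_dtypes(exclude='number').columns:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col])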

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")

---

## Part 2: Why Ensembles?

### The Problem with Single Trees

Decision trees have high **variance**: small changes in data → very different trees → unstable predictions.
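
One way to make this instability visible is to train pairs of models on two different bootstrap resamples of the same training set and count how often their predictions disagree. The sketch below does that for a single tree and, for contrast, a forest. It uses synthetic data from `make_classification`; the helper `disagreement` and the `_demo`/`Xd_`/`yd_` names are hypothetical and exist only for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy stand-in data (an assumption; not the churn dataset)
X_demo, y_demo = make_classification(
    n_samples=1500, n_features=20, n_informative=5, flip_y=0.1, random_state=1
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

def disagreement(make_model, n_pairs=5, seed=0):
    """Average share of test rows on which two models, trained on two
    different bootstrap resamples of the training set, predict differently."""
    rng = np.random.default_rng(seed)
    rates = []
    for _ in range(n_pairs):
        preds = []
        for _ in range(2):
            rows = rng.integers(0, len(Xd_train), size=len(Xd_train))
            model = make_model().fit(Xd_train[rows], yd_train[rows])
            preds.append(model.predict(Xd_test))
        rates.append(np.mean(preds[0] != preds[1]))
    return float(np.mean(rates))

tree_disagreement = disagreement(
    lambda: DecisionTreeClassifier(max_depth=5, random_state=0)
)
forest_disagreement = disagreement(
    lambda: RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
)

print(f"Two single trees disagree on {tree_disagreement:.1%} of test rows")
print(f"Two forests disagree on      {forest_disagreement:.1%} of test rows")
```

A lower disagreement rate means the model's predictions depend less on which particular rows happened to land in the training sample, which is what "lower variance" means in practice.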

**The Ensemble Idea:** Train *many* trees on slightly different versions of the data, then average their predictions. This reduces variance and improves generalization.
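
Why averaging helps, and why it eventually stops helping, can be checked with a toy simulation. For `B` estimators that each have variance `sigma**2` and pairwise correlation `rho`, the variance of their average is `rho * sigma**2 + (1 - rho) * sigma**2 / B`. The sketch below simulates correlated "prediction errors" with NumPy and compares the empirical variance with that formula. It is a simplified model under stated assumptions (equal variances, equal correlations, Gaussian errors), not a simulation of real trees.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n_trials = 1.0, 20_000

def variance_of_average(B, rho):
    """Empirical variance of the mean of B estimators, each with variance
    sigma**2 and pairwise correlation rho."""
    shared = rng.standard_normal((n_trials, 1))  # error common to all estimators
    own = rng.standard_normal((n_trials, B))     # error specific to each estimator
    errors = sigma * (np.sqrt(rho) * shared + np.sqrt(1 - rho) * own)
    return errors.mean(axis=1).var()

for rho in (0.0, 0.3, 0.7):
    for B in (1, 10, 100):
        empirical = variance_of_average(B, rho)
        theory = rho * sigma**2 + (1 - rho) * sigma**2 / B
        print(f"rho={rho:.1f}  B={B:3d}  empirical={empirical:.3f}  theory={theory:.3f}")
```

The second term shrinks as `B` grows, but the first term, `rho * sigma**2`, does not. That floor is one way to read the plateau seen later in this notebook, and it is why random forests also randomize the features considered at each split: less correlated trees lower the floor.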

---

## Part 3: Train & Compare

Let's train:
- 1 decision tree (to establish a baseline)
- Random forests with 50, 100, and 500 trees

Then compare test accuracy.

In [None]:
# Train a single decision tree for comparison
print("Training single decision tree...")
tree = DecisionTreeClassifier(max_depth=5, random_state=42)
tree.fit(X_train, y_train)
tree_acc = tree.score(X_test, y_test)

print(f"Single Tree Test Accuracy: {tree_acc:.3f}")

# Train random forests with different numbers of trees
results = {'model': ['Single Tree'], 'n_estimators': [1], 'test_accuracy': [tree_acc]}

for n_trees in [50, 100, 500]:
    print(f"\nTraining Random Forest with {n_trees} trees...")
    forest = RandomForestClassifier(n_estimators=n_trees, max_depth=5, random_state=42, n_jobs=-1)
    forest.fit(X_train, y_train)
    forest_acc = forest.score(X_test, y_test)
    print(f"  Test Accuracy: {forest_acc:.3f}")

    results['model'].append(f'Forest ({n_trees})')
    results['n_estimators'].append(n_trees)
    results['test_accuracy'].append(forest_acc)

results_df = pd.DataFrame(results)
print("\n" + "=" * 60)
print("COMPARISON:")
print(results_df.to_string(index=False))
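
A single train/test split gives one accuracy number per model, and small differences between them can be noise. A hedged way to check is k-fold cross-validation, which reports a mean and a spread. The sketch below uses synthetic data so that it runs on its own; the `_demo` names and the `cv_scores` dictionary are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; not the churn dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.05, random_state=2
)

# 5 stratified folds, shuffled once with a fixed seed for reproducibility
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    'Single Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Forest (100)': RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
}

cv_scores = {}
for name, model in models.items():
    # One accuracy value per fold
    cv_scores[name] = cross_val_score(model, X_demo, y_demo, cv=cv, scoring='accuracy')
    print(f"{name:12s} mean={cv_scores[name].mean():.3f}  std={cv_scores[name].std():.3f}")
```

In this notebook you could pass `X_train, y_train` instead of the synthetic arrays and keep the test set untouched for a final check.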

### Visualize the Comparison

How much does ensemble size matter?

In [None]:
# Plot the results
plt.figure(figsize=(10, 6))
colors = ['red'] + ['green'] * (len(results_df) - 1)
bars = plt.bar(range(len(results_df)), results_df['test_accuracy'], color=colors, alpha=0.7, edgecolor='black')
plt.xticks(range(len(results_df)), results_df['model'], rotation=15, ha='right')
plt.ylabel('Test Accuracy')
plt.title('Single Tree vs Random Forests: Accuracy Comparison')
plt.ylim([0.7, 0.85])

# Add value labels on bars
for i, bar in enumerate(bars):
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             f'{height:.3f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

print("\nWhat to look for: does the forest beat the single tree, and do extra trees stop helping at some point?")

---

## Part 4: The Accuracy Plateau

Adding more trees improves accuracy at first, but then it plateaus. Let's see this more clearly by training forests with many different sizes.

In [None]:
# Train forests with different numbers of trees to see the plateau
print("Training forests with increasing numbers of trees...")
n_estimators_range = [1, 5, 10, 20, 50, 100, 200, 500]
accuracies = []

for n in n_estimators_range:
    forest = RandomForestClassifier(n_estimators=n, max_depth=5, random_state=42, n_jobs=-1)
    forest.fit(X_train, y_train)
    acc = forest.score(X_test, y_test)
    accuracies.append(acc)
    print(f"  n_estimators={n:3d} → accuracy={acc:.4f}")

print("\nDone!")

### Plot the Plateau

Key questions for the plot: at what point does accuracy stop improving, and why does it plateau?

In [None]:
# Plot accuracy vs number of trees
plt.figure(figsize=(12, 6))
plt.plot(n_estimators_range, accuracies, marker='o', linewidth=2, markersize=8, color='steelblue')
plt.xscale('log')  # Log scale shows the diminishing returns more clearly
plt.xlabel('Number of Trees (log scale)')
plt.ylabel('Test Accuracy')
plt.title('Random Forest Accuracy vs Number of Trees')
plt.grid(True, alpha=0.3)

# Add annotations
plt.axhline(y=max(accuracies), color='r', linestyle='--', alpha=0.5, label='Best accuracy')
plt.legend()

plt.tight_layout()
plt.show()

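# Look up accuracies by forest size instead of by list position
acc_by_n = dict(zip(n_estimators_range, accuracies))

# Differences between accuracies are percentage points, not percent
print("\nAccuracy plateau:")
print(f"  With 1 tree: {acc_by_n[1]:.4f}")
print(f"  With 10 trees: {acc_by_n[10]:.4f} (change: {(acc_by_n[10]-acc_by_n[1])*100:+.2f} percentage points)")
print(f"  With 100 trees: {acc_by_n[100]:.4f} (change from 10: {(acc_by_n[100]-acc_by_n[10])*100:+.2f} percentage points)")
print(f"  With 500 trees: {acc_by_n[500]:.4f} (change from 100: {(acc_by_n[500]-acc_by_n[100])*100:+.2f} percentage points)")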
print("\n→ Diminishing returns: More trees help, but the gain gets smaller.")
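
The loop above refits a new forest for every size and scores it on the test set. A possible alternative, sketched here under assumptions, is to grow one forest incrementally with `warm_start=True` and track its out-of-bag (OOB) score, which is computed from the training rows each tree did not see. The synthetic data and the names `oob_forest` and `oob_by_size` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the churn dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, flip_y=0.05, random_state=3
)

# warm_start=True keeps the trees already built and only adds new ones on refit;
# oob_score=True scores each row using only trees that did not train on it
oob_forest = RandomForestClassifier(
    max_depth=5, oob_score=True, warm_start=True, random_state=42
)

oob_by_size = {}
for n in [50, 100, 200, 400]:
    oob_forest.set_params(n_estimators=n)
    oob_forest.fit(X_demo, y_demo)  # adds trees up to n, reusing the existing ones
    oob_by_size[n] = oob_forest.oob_score_
    print(f"n_estimators={n:3d} → OOB accuracy={oob_by_size[n]:.4f}")
```

Because the OOB score needs no held-out rows, the test set stays unused while you choose the number of trees. Very small forests are left out of the loop because some rows may not have any out-of-bag tree yet.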

---

## Part 5: Feature Importance Comparison

Do single trees and forests rely on the same features?

In [None]:
# Train a forest for feature importance comparison
forest_best = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42, n_jobs=-1)
forest_best.fit(X_train, y_train)

# Extract feature importance from both
tree_importance = tree.feature_importances_
forest_importance = forest_best.feature_importances_

# Create comparison dataframe
importance_df = pd.DataFrame({
    'feature': X.columns,
    'single_tree': tree_importance,
    'forest': forest_importance
}).sort_values('forest', ascending=False)

print("Feature Importance Comparison (Top 10):")
print(importance_df.head(10).to_string(index=False))

### Visualize Feature Importance Comparison

Do they rank features the same way?

In [None]:
# Plot side-by-side comparison
top_features = importance_df.head(10)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Single tree
ax1.barh(range(len(top_features)), top_features['single_tree'], color='coral', alpha=0.7)
ax1.set_yticks(range(len(top_features)))
ax1.set_yticklabels(top_features['feature'])
ax1.set_xlabel('Importance')
ax1.set_title("Single Decision Tree\n(forest's top 10 features)")
ax1.invert_yaxis()

# Forest
ax2.barh(range(len(top_features)), top_features['forest'], color='steelblue', alpha=0.7)
ax2.set_yticks(range(len(top_features)))
ax2.set_yticklabels(top_features['feature'])
ax2.set_xlabel('Importance')
ax2.set_title('Random Forest (100 trees)\nTop 10 Features')
ax2.invert_yaxis()

plt.tight_layout()
plt.show()

print("\nObservation: Do they rank the top features the same way?")
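# importance_df is sorted by forest importance, so look up each model's top feature explicitly
tree_top = importance_df.loc[importance_df['single_tree'].idxmax(), 'feature']
forest_top = importance_df.loc[importance_df['forest'].idxmax(), 'feature']
print(f"Single tree top feature: {tree_top}")
print(f"Forest top feature: {forest_top}")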
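
To see why a forest's ranking tends to be steadier than a single tree's, it can help to look inside the forest. In scikit-learn, a forest's `feature_importances_` is the normalized average of its trees' importances, and the fitted trees are available in `estimators_`. The sketch below, on synthetic data with hypothetical `_demo` names, shows how much the individual trees disagree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the churn dataset)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_informative=3, n_redundant=0, random_state=4
)

demo_forest = RandomForestClassifier(
    n_estimators=100, max_depth=5, random_state=42
).fit(X_demo, y_demo)

# One row of importances per tree: shape (n_trees, n_features)
per_tree = np.array([t.feature_importances_ for t in demo_forest.estimators_])
mean_imp = per_tree.mean(axis=0)
std_imp = per_tree.std(axis=0)

# Which feature each individual tree considers most important
top_per_tree = per_tree.argmax(axis=1)

for j in np.argsort(mean_imp)[::-1]:
    print(f"feature {j}: mean={mean_imp[j]:.3f}  std across trees={std_imp[j]:.3f}")
print(f"Distinct 'top' features across trees: {np.unique(top_per_tree).size}")
```

A large spread across trees means any one tree's ranking is partly an accident of its training sample; the forest's average smooths that out.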

---

## YOUR TURN: Pairs Practice (20 minutes)

Now you build your own ensemble and experiment with parameters.

### Task 1: Train & Compare (10 min)

- Train a single tree (max_depth=5) on the churn data
- Train a random forest with 100 trees
- Compare test accuracies
- **Does the forest beat the tree?**

### Task 2: See the Plateau (5 min)

- Train forests with n_estimators = 10, 50, 100, 200
- Plot accuracy for each
- **At what point do you see diminishing returns?**

### Task 3: Reflect (5 min)

- If you had to choose: single tree or forest? Why?
- When would you *not* use a forest?

---

## Student Practice: Code Along Below

In [None]:
# TASK 1: Train your own single tree and forest
# TODO: Train a DecisionTreeClassifier and a RandomForestClassifier
# Hint: Use max_depth=5 for the tree, n_estimators=100 for the forest

my_tree = None  # REPLACE THIS
my_forest = None  # REPLACE THIS

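# Compare against None explicitly: truth-testing an unfitted forest raises an error
if my_tree is not None and my_forest is not None:
    # Separate names so the earlier tree_acc / forest_acc are not overwritten
    my_tree_acc = my_tree.score(X_test, y_test)
    my_forest_acc = my_forest.score(X_test, y_test)

    print(f"Your Single Tree Accuracy: {my_tree_acc:.4f}")
    print(f"Your Forest Accuracy: {my_forest_acc:.4f}")
    print(f"\nDifference: {(my_forest_acc - my_tree_acc)*100:+.2f} percentage points")
    print(f"Forest wins: {my_forest_acc > my_tree_acc}")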
else:
    print("TODO: Train your models in this cell")

In [None]:
# TASK 2: Train forests with different n_estimators and see the plateau
# TODO: Train forests with 10, 50, 100, 200 trees and record accuracy

my_n_estimators = [10, 50, 100, 200]
my_accuracies = []

# YOUR CODE HERE:
# for n in my_n_estimators:
#     forest = RandomForestClassifier(n_estimators=n, max_depth=5, random_state=42)
#     forest.fit(X_train, y_train)
#     acc = forest.score(X_test, y_test)
#     my_accuracies.append(acc)

if my_accuracies:
    # Plot it
    plt.figure(figsize=(10, 6))
    plt.plot(my_n_estimators, my_accuracies, marker='o', linewidth=2, markersize=8)
    plt.xlabel('Number of Trees')
    plt.ylabel('Test Accuracy')
    plt.title('Your Forest: Accuracy vs Number of Trees')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

    # Print values
    for n, acc in zip(my_n_estimators, my_accuracies):
        print(f"n_estimators={n}: accuracy={acc:.4f}")
else:
    print("TODO: Train your forests and record accuracies")

## Reflection Questions (TASK 3)

**Discuss with your partner:**

1. Did your forest beat your single tree? By how much?
2. Where did you see diminishing returns? (How many trees before accuracy plateaued?)
3. If you had to pick one for production, which would you use? Why?
4. When might you *not* want to use a random forest?

In [None]:
# TASK 3: Reflection
# Type your thoughts as comments

# 1. Forest vs Single Tree:
#    YOUR ANSWER HERE

# 2. Diminishing returns at:
#    YOUR ANSWER HERE

# 3. For production, I would choose:
#    YOUR ANSWER HERE

# 4. When NOT to use forests:
#    YOUR ANSWER HERE

---

## Summary

**Key takeaways:**

1. **Single trees are unstable** → small data changes → very different trees
2. **Forests reduce variance** → many trees averaged → stable, accurate predictions
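3. **More trees help, then plateau** → diminishing returns kick in (often somewhere around 50-100 trees on a dataset of this size, but check the curve for your own data)
4. **Forests keep some interpretability** → feature importance tells you which features matter, though not how an individual prediction is made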
5. **Ensemble is a fundamental ML principle** → applies to many algorithms, not just trees

**Trade-off:** Forests are usually more accurate but less interpretable than single trees. You can't easily "see" how a forest decides, unlike a single tree.

**Next steps:** Gradient boosting (for example, XGBoost) takes the ensemble idea further by building trees sequentially, with each new tree fitted to correct the errors of the trees before it.