# Module 02: Random Forests

**Difficulty**: ⭐⭐⭐ Advanced

**Estimated Time**: 90 minutes

**Prerequisites**: 
- Module 00: Introduction to Ensemble Learning
- Module 01: Bagging and Bootstrap Aggregation
- Understanding of decision trees and Gini impurity

## Learning Objectives

By the end of this notebook, you will be able to:

1. Understand how Random Forests improve upon basic bagging with feature randomness
2. Tune Random Forest hyperparameters systematically
3. Extract and interpret feature importances
4. Handle imbalanced datasets using class weights
5. Compare Random Forests with single decision trees and basic bagging
6. Apply Random Forests to real-world classification and regression problems

---

## 1. Setup and Imports

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn models and utilities
from sklearn.datasets import (
    make_classification, 
    load_breast_cancer,
    fetch_california_housing
)
from sklearn.model_selection import (
    train_test_split,
    cross_val_score,
    GridSearchCV,
    RandomizedSearchCV
)
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import (
    RandomForestClassifier,
    RandomForestRegressor,
    BaggingClassifier
)
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
    roc_curve,
    mean_squared_error,
    r2_score
)
from sklearn.inspection import permutation_importance

# Configuration
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

# Set random seeds for reproducibility
np.random.seed(42)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

print("✅ Setup complete!")

## 2. Random Forests: Bagging + Feature Randomness

### What Makes Random Forests "Random"?

Random Forests extend basic bagging with an additional source of randomness:

**Standard Bagging**:
1. ✅ Bootstrap sampling (each tree sees a random sample of rows, drawn with replacement)
2. ❌ Use ALL features at each split

**Random Forest**:
1. ✅ Bootstrap sampling (each tree sees a random sample of rows, drawn with replacement)
2. ✅ **Feature randomness** (random subset of features at each split)

### Why Feature Randomness?

**Problem with Standard Bagging**:
- If there's one very strong predictor, most trees will use it for the first split
- Trees become correlated (similar structure)
- Correlated predictions don't average out errors as effectively

**Solution: Random Feature Selection**:
- At each node, consider only a random subset of $m$ features (typically $m = \sqrt{p}$ for classification, where $p$ is the total number of features)
- Forces trees to be more diverse
- Strong predictors don't dominate every tree
- Lower correlation between trees → better variance reduction
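
The "strong predictor" effect can be checked directly. The sketch below uses a hypothetical toy dataset in which only feature 0 carries signal, and counts how often each ensemble splits on that feature at the root. As an assumption for illustration, plain bagging is emulated with `RandomForestClassifier(max_features=None)` (all features considered at every split) rather than `BaggingClassifier`; `root_feature_share` is a helper name introduced here, not a scikit-learn function.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: feature 0 is a strong predictor, the other 15 are noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(600, 16))
y_demo = (X_demo[:, 0] + 0.3 * rng.normal(size=600) > 0).astype(int)


def root_feature_share(forest, feature_index=0):
    """Fraction of trees whose root node (node 0) splits on `feature_index`."""
    roots = np.array([tree.tree_.feature[0] for tree in forest.estimators_])
    return float(np.mean(roots == feature_index))


# Bagging-like forest: every feature is a candidate at every split
bagged_like = RandomForestClassifier(
    n_estimators=50, max_features=None, random_state=0
).fit(X_demo, y_demo)

# Random Forest: only sqrt(16) = 4 random candidates per split
forest = RandomForestClassifier(
    n_estimators=50, max_features='sqrt', random_state=0
).fit(X_demo, y_demo)

bagged_share = root_feature_share(bagged_like)
forest_share = root_feature_share(forest)
print(f"Root split on feature 0 (all features):  {bagged_share:.2f}")
print(f"Root split on feature 0 (sqrt features): {forest_share:.2f}")
```

With 16 features, `'sqrt'` offers 4 candidates per split, so feature 0 is available at the root in only about a quarter of the trees. The remaining trees are forced to start from a different feature, which is where the extra diversity comes from.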

### Mathematical Insight

Recall from Module 01:

$$\text{Ensemble Variance} = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$$

Where:
- $\rho$ = average correlation between trees
- $\sigma^2$ = variance of individual trees
- $B$ = number of trees

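**Key**: As $B$ grows, the second term shrinks toward zero, so $\rho\sigma^2$ is the floor that more trees cannot remove. By reducing $\rho$ (correlation), Random Forests lower that floor and achieve lower ensemble variance!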
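
To get a feel for the formula, here is a small worked example. The function name `ensemble_variance` and the chosen values of $\rho$, $\sigma^2$, and $B$ are illustrative assumptions, not measurements from this notebook.

```python
def ensemble_variance(rho, sigma2, n_trees):
    """Variance of the average of n_trees predictors with pairwise correlation rho."""
    return rho * sigma2 + (1.0 - rho) / n_trees * sigma2


# Illustrative values only: same per-tree variance, different correlation
high_corr = ensemble_variance(rho=0.6, sigma2=1.0, n_trees=100)  # 0.6 + 0.4/100
low_corr = ensemble_variance(rho=0.3, sigma2=1.0, n_trees=100)   # 0.3 + 0.7/100

print(f"rho=0.6, B=100: {high_corr:.3f}")
print(f"rho=0.3, B=100: {low_corr:.3f}")

# Adding trees only shrinks the second term; the rho * sigma2 floor remains
for n_trees in (10, 100, 1000):
    print(f"rho=0.6, B={n_trees}: {ensemble_variance(0.6, 1.0, n_trees):.4f}")
```

Halving $\rho$ from 0.6 to 0.3 roughly halves the ensemble variance (0.604 versus 0.307), while going from 100 to 1000 trees at $\rho = 0.6$ barely moves it.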

### Comparing Tree Correlation

In [None]:
# Create a synthetic dataset: 5 informative, 5 redundant, and 10 noise features
X, y = make_classification(
    n_samples=500,
    n_features=20,
    n_informative=5,
    n_redundant=5,
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Standard bagging (all features at each split)
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=10),
    n_estimators=20,
    max_features=1.0,  # All features
    random_state=42
)
bagging.fit(X_train, y_train)

# Random Forest (subset of features at each split)
rf = RandomForestClassifier(
    n_estimators=20,
    max_depth=10,
    max_features='sqrt',  # sqrt(20) ≈ 4 features per split
    random_state=42
)
rf.fit(X_train, y_train)

# Get predictions from all trees in both ensembles
bagging_preds = np.array([
    tree.predict(X_test) for tree in bagging.estimators_
])

rf_preds = np.array([
    tree.predict(X_test) for tree in rf.estimators_
])

# Calculate pairwise correlations between trees
def calculate_tree_correlations(predictions):
    """
    Calculate average pairwise correlation between tree predictions.
    
    Args:
        predictions: Array of shape (n_trees, n_samples)
    
    Returns:
        Average correlation coefficient
    """
    n_trees = predictions.shape[0]
    correlations = []
    
    for i in range(n_trees):
        for j in range(i + 1, n_trees):
            corr = np.corrcoef(predictions[i], predictions[j])[0, 1]
            correlations.append(corr)
    
    return np.mean(correlations)

bagging_corr = calculate_tree_correlations(bagging_preds)
rf_corr = calculate_tree_correlations(rf_preds)

print("🌲 Tree Correlation Analysis:")
print("=" * 50)
print(f"\nStandard Bagging (all features):")
print(f"  Average tree correlation: {bagging_corr:.4f}")
print(f"  Test accuracy: {bagging.score(X_test, y_test):.4f}")
print(f"\nRandom Forest (subset of features):")
print(f"  Average tree correlation: {rf_corr:.4f}")
print(f"  Test accuracy: {rf.score(X_test, y_test):.4f}")
print(f"\n✅ Random Forest reduces correlation by {(1 - rf_corr/bagging_corr)*100:.1f}%!")

# Visualize
fig, ax = plt.subplots(figsize=(8, 6))
methods = ['Standard\nBagging', 'Random\nForest']
correlations = [bagging_corr, rf_corr]
colors = ['lightcoral', 'lightgreen']

bars = ax.bar(methods, correlations, color=colors, edgecolor='black', linewidth=2)
ax.set_ylabel('Average Tree Correlation', fontsize=12)
ax.set_title('Tree Correlation: Bagging vs Random Forest', fontsize=14, fontweight='bold')
ax.set_ylim(0, max(correlations) * 1.2)
ax.grid(True, alpha=0.3, axis='y')

for bar, corr in zip(bars, correlations):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2, height + 0.01,
            f'{corr:.4f}', ha='center', va='bottom', fontweight='bold', fontsize=12)

plt.tight_layout()
plt.show()

### 🎯 Exercise 1: Feature Subset Size

Experiment with different values of `max_features`:

1. Train Random Forests with max_features = [1, 2, 5, 10, 20, 'sqrt', 'log2']
2. For each, calculate:
   - Average tree correlation
   - Test set accuracy
3. Plot correlation and accuracy vs max_features
4. What's the optimal value? Why?

In [None]:
# Your code here


## 3. Random Forest Hyperparameters

### Key Hyperparameters

Random Forests have many hyperparameters. Here are the most important ones:

#### 1. Number of Trees (`n_estimators`)
- **Default**: 100
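- **Effect**: More trees → more stable predictions (with diminishing returns)
- **Recommendation**: Start with 100-500; add more if the OOB or cross-validation score is still fluctuating. More trees reduce variance, not bias, so they will not fix underfitting
- **Note**: Unlike boosting, adding trees to an RF does not increase overfitting; the cost is training time and memory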

#### 2. Max Features (`max_features`)
- **Default**: 'sqrt' for classification, 1.0 (all features) for regression
- **Effect**: Controls tree diversity and correlation
- **Options**:
  - 'sqrt': $\sqrt{n\_features}$ (good default for classification)
  - 'log2': $\log_2(n\_features)$
  - int: Specific number of features
  - float: Fraction of features
- **Recommendation**: Start with 'sqrt', tune if needed
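
As a quick check of what each option means in practice, the sketch below fits tiny forests on a hypothetical 20-feature dataset and reads back the resolved number of candidate features from the fitted trees' `max_features_` attribute. The variable names (`X_mf`, `resolved`) are introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical dataset with 20 features, matching the earlier example's width
X_mf, y_mf = make_classification(n_samples=200, n_features=20, random_state=0)

resolved = {}
for setting in ['sqrt', 'log2', 5, 0.5, None]:
    model = RandomForestClassifier(
        n_estimators=5, max_features=setting, random_state=0
    ).fit(X_mf, y_mf)
    # Each fitted tree stores the integer number of features tried per split
    resolved[setting] = model.estimators_[0].max_features_

for setting, n_candidates in resolved.items():
    print(f"max_features={setting!r:>6} -> {n_candidates} features per split")
```

For 20 features, both `'sqrt'` and `'log2'` resolve to 4 candidates per split, `0.5` resolves to 10, and `None` uses all 20. On wider datasets `'log2'` becomes noticeably more aggressive than `'sqrt'`.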

#### 3. Tree Depth (`max_depth`)
- **Default**: None (fully grown trees)
- **Effect**: Controls individual tree complexity
- **Recommendation**: Usually leave as None, but limit if memory or training time is a concern

#### 4. Min Samples Split (`min_samples_split`)
- **Default**: 2
- **Effect**: Minimum samples required to split a node
- **Recommendation**: Increase (5-20) to prevent overfitting

#### 5. Min Samples Leaf (`min_samples_leaf`)
- **Default**: 1
- **Effect**: Minimum samples required in leaf nodes
- **Recommendation**: Increase (2-10) for smoother predictions

#### 6. Bootstrap (`bootstrap`)
- **Default**: True
- **Effect**: Whether to use bootstrap sampling
- **Recommendation**: Keep True in almost all cases (it's half of what makes it a Random Forest, and OOB scoring requires it)

#### 7. OOB Score (`oob_score`)
- **Default**: False
- **Effect**: Calculate out-of-bag score during training
- **Recommendation**: Set to True for a validation estimate that needs no separate hold-out set (requires `bootstrap=True`)
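
A minimal sketch of the OOB estimate, assuming the breast cancer dataset used elsewhere in this notebook. Each sample is scored only by the trees whose bootstrap sample did not contain it, so no rows need to be held out. The names `X_oob`, `y_oob`, and `rf_oob` are introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X_oob, y_oob = load_breast_cancer(return_X_y=True)

rf_oob = RandomForestClassifier(
    n_estimators=200,
    bootstrap=True,   # OOB scoring is only defined with bootstrap sampling
    oob_score=True,   # score each sample using trees that never saw it
    random_state=42
)
rf_oob.fit(X_oob, y_oob)

oob_accuracy = rf_oob.oob_score_
# Per-sample class probabilities averaged over out-of-bag trees only
oob_proba = rf_oob.oob_decision_function_

print(f"OOB accuracy: {oob_accuracy:.4f}")
print(f"OOB probability matrix shape: {oob_proba.shape}")
```

With too few trees, some samples may never be out-of-bag and scikit-learn warns about it, so a few hundred trees is a reasonable choice when relying on this score.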

#### 8. Class Weight (`class_weight`)
- **Default**: None
- **Effect**: Handle imbalanced datasets
- **Options**: 'balanced', 'balanced_subsample', or custom dict
- **Recommendation**: Use 'balanced' for imbalanced data
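
To make `'balanced'` less of a black box, the sketch below reproduces its weights on a hypothetical 90/10 label vector using `n_samples / (n_classes * np.bincount(y))`, the heuristic scikit-learn documents for this option, and compares the result with `compute_class_weight`. The array `y_toy` is made up for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 90 majority-class samples, 10 minority-class samples
y_toy = np.array([0] * 90 + [1] * 10)
classes = np.unique(y_toy)

# Weights as scikit-learn computes them for class_weight='balanced'
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_toy)

# The same heuristic written out by hand
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

for cls, w in zip(classes, weights):
    print(f"Class {cls}: weight {w:.4f}")
```

Here the minority class receives weight 5.0 and the majority class about 0.56, so each class contributes the same total weight to the impurity calculations.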

### Hyperparameter Tuning with GridSearchCV

In [None]:
# Load breast cancer dataset (binary classification)
data = load_breast_cancer()
X_cancer = data.data
y_cancer = data.target

X_train, X_test, y_train, y_test = train_test_split(
    X_cancer, y_cancer, test_size=0.3, random_state=42
)

print(f"Dataset: Breast Cancer Wisconsin")
print(f"Samples: {len(X_cancer)}")
print(f"Features: {len(data.feature_names)}")
print(f"Classes: {data.target_names}")
print(f"Class distribution: {np.bincount(y_cancer)}")

In [None]:
# Define hyperparameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_features': ['sqrt', 'log2', 0.5],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create base model
rf_base = RandomForestClassifier(random_state=42, n_jobs=-1)

# Grid search with cross-validation
print("\n🔍 Starting Grid Search...")
print(f"Total combinations to test: {np.prod([len(v) for v in param_grid.values()])}")

grid_search = GridSearchCV(
    estimator=rf_base,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

grid_search.fit(X_train, y_train)

print("\n✅ Grid Search Complete!")
print("\n🏆 Best Parameters:")
for param, value in grid_search.best_params_.items():
    print(f"  {param}: {value}")

print(f"\n📊 Best CV Score: {grid_search.best_score_:.4f}")
print(f"Test Score: {grid_search.score(X_test, y_test):.4f}")

### Randomized Search for Large Spaces

In [None]:
# RandomizedSearchCV is faster for large hyperparameter spaces
from scipy.stats import randint, uniform

# Define distributions for randomized search
param_distributions = {
    'n_estimators': randint(50, 500),
    'max_features': uniform(0.3, 0.7),  # Sample between 0.3 and 1.0
    'max_depth': [None, 5, 10, 15, 20, 30],
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'bootstrap': [True]
}

# Randomized search
print("\n🎲 Starting Randomized Search...")
random_search = RandomizedSearchCV(
    estimator=rf_base,
    param_distributions=param_distributions,
    n_iter=50,  # Try 50 random combinations
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=42,
    verbose=1
)

random_search.fit(X_train, y_train)

print("\n✅ Randomized Search Complete!")
print("\n🏆 Best Parameters:")
for param, value in random_search.best_params_.items():
    print(f"  {param}: {value}")

print(f"\n📊 Best CV Score: {random_search.best_score_:.4f}")
print(f"Test Score: {random_search.score(X_test, y_test):.4f}")

# Compare with grid search
print(f"\n⚡ Grid Search tested {len(grid_search.cv_results_['params'])} combinations")
print(f"⚡ Random Search tested {len(random_search.cv_results_['params'])} combinations")
print(f"\nBest scores: Grid={grid_search.best_score_:.4f}, Random={random_search.best_score_:.4f}")
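
A common stumbling block with the distributions above is SciPy's parameterisation: `uniform(loc, scale)` samples from `[loc, loc + scale]`, not `[low, high]`, and `randint(low, high)` excludes `high`. The sketch below draws samples to confirm the ranges; the variable names are introduced here for illustration.

```python
from scipy.stats import randint, uniform

# uniform(loc=0.3, scale=0.7) covers 0.3 up to 0.3 + 0.7 = 1.0
max_features_dist = uniform(loc=0.3, scale=0.7)
# randint(50, 500) yields integers from 50 to 499 inclusive
n_estimators_dist = randint(50, 500)

mf_samples = max_features_dist.rvs(size=1000, random_state=0)
ne_samples = n_estimators_dist.rvs(size=1000, random_state=0)

print(f"max_features samples: min={mf_samples.min():.3f}, max={mf_samples.max():.3f}")
print(f"n_estimators samples: min={ne_samples.min()}, max={ne_samples.max()}")
```

Writing `uniform(0.3, 1.0)` by mistake would sample values up to 1.3, which `RandomForestClassifier` rejects for a float `max_features`.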

### 🎯 Exercise 2: Custom Hyperparameter Tuning

Design your own hyperparameter search:

1. Create a different synthetic dataset with different properties (high dimensions, imbalanced, etc.)
2. Define a custom hyperparameter grid based on dataset characteristics
3. Use GridSearchCV or RandomizedSearchCV to find optimal parameters
4. Visualize how different hyperparameters affect performance
5. Compare tuned model with default Random Forest

In [None]:
# Your code here


## 4. Feature Importance

### Why Feature Importance?

Feature importance helps us:
1. **Understand model**: Which features drive predictions?
2. **Feature selection**: Remove unimportant features
3. **Domain insights**: Discover relationships in data
4. **Debug models**: Identify data leakage or spurious correlations

### Two Types of Feature Importance

#### 1. Impurity-Based (Default in sklearn)
- Measures average decrease in impurity (Gini/entropy) when feature is used
- **Pros**: Fast to compute (calculated during training)
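- **Cons**: Biased toward high-cardinality features (those with many unique values), and computed on training data, so it can credit features the trees overfit to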

#### 2. Permutation-Based
- Measures decrease in model score when feature values are randomly shuffled
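- **Pros**: Not biased toward high-cardinality features, can be computed on held-out data, works with any model
- **Cons**: Slower (requires multiple predictions), and can understate importance when features are strongly correlated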
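
The permutation idea is simple enough to write by hand. The following is a minimal sketch, not scikit-learn's implementation: `manual_permutation_importance` is a hypothetical helper that shuffles one column at a time on a held-out set and records the drop in accuracy.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_pi, y_pi = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_pi, y_pi, test_size=0.3, random_state=42
)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)


def manual_permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Mean and std of the accuracy drop when each column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_shuffled = X.copy()
            # Break the link between feature j and the target, keep its distribution
            X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
            drops[j, r] = baseline - model.score(X_shuffled, y)
    return drops.mean(axis=1), drops.std(axis=1)


perm_mean, perm_std = manual_permutation_importance(model, X_te, y_te)

# Impurity-based importances are normalised to sum to 1; permutation drops are not
print(f"Impurity importances sum: {model.feature_importances_.sum():.4f}")
print(f"Largest accuracy drop:    {perm_mean.max():.4f}")
```

Note the different scales: impurity-based values are shares of a total, while permutation values are absolute score drops and can even be slightly negative for uninformative features.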

In [None]:
# Train Random Forest for feature importance analysis
rf_importance = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    random_state=42,
    n_jobs=-1
)
rf_importance.fit(X_train, y_train)

# Get feature importances
feature_names = data.feature_names
importances = rf_importance.feature_importances_
std = np.std([
    tree.feature_importances_ for tree in rf_importance.estimators_
], axis=0)

# Create DataFrame for better visualization
importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': importances,
    'std': std
}).sort_values('importance', ascending=False)

print("\n📊 Top 10 Most Important Features (Impurity-Based):")
print(importance_df.head(10).to_string(index=False))

# Visualize
plt.figure(figsize=(10, 8))
top_features = importance_df.head(15)
plt.barh(range(len(top_features)), top_features['importance'], 
         xerr=top_features['std'], alpha=0.7, edgecolor='black')
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('Feature Importance', fontsize=12)
plt.title('Top 15 Features by Importance (with std across trees)', 
          fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

### Permutation Importance

In [None]:
# Calculate permutation importance on test set
perm_importance = permutation_importance(
    rf_importance, 
    X_test, 
    y_test,
    n_repeats=10,
    random_state=42,
    n_jobs=-1
)

# Create DataFrame
perm_importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': perm_importance.importances_mean,
    'std': perm_importance.importances_std
}).sort_values('importance', ascending=False)

print("\n📊 Top 10 Most Important Features (Permutation-Based):")
print(perm_importance_df.head(10).to_string(index=False))

# Compare both methods
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 8))

# Impurity-based
top_impurity = importance_df.head(10)
ax1.barh(range(len(top_impurity)), top_impurity['importance'], 
         alpha=0.7, color='skyblue', edgecolor='black')
ax1.set_yticks(range(len(top_impurity)))
ax1.set_yticklabels(top_impurity['feature'])
ax1.set_xlabel('Importance', fontsize=12)
ax1.set_title('Impurity-Based Importance', fontsize=13, fontweight='bold')
ax1.invert_yaxis()
ax1.grid(True, alpha=0.3, axis='x')

# Permutation-based
top_perm = perm_importance_df.head(10)
ax2.barh(range(len(top_perm)), top_perm['importance'], 
         xerr=top_perm['std'], alpha=0.7, color='lightcoral', edgecolor='black')
ax2.set_yticks(range(len(top_perm)))
ax2.set_yticklabels(top_perm['feature'])
ax2.set_xlabel('Importance (Accuracy Drop)', fontsize=12)
ax2.set_title('Permutation-Based Importance', fontsize=13, fontweight='bold')
ax2.invert_yaxis()
ax2.grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

print("\n💡 Note: Both methods may rank features differently!")
print("   Impurity-based: Shows what model uses internally")
print("   Permutation: Shows what actually affects predictions")

### 🎯 Exercise 3: Feature Selection

Use feature importance for feature selection:

1. Train a Random Forest on all features
2. Select top k features (try k = 5, 10, 15, 20)
3. Train new Random Forest on only selected features
4. Compare performance and training time
5. Plot accuracy vs number of features
6. Find optimal number of features

In [None]:
# Your code here


## 5. Handling Imbalanced Datasets

### The Imbalanced Data Problem

Many real-world datasets are imbalanced (e.g., fraud detection: 99% legitimate, 1% fraud).

**Problem**: Standard RF may ignore minority class to maximize overall accuracy.

**Solutions**:
1. **Class weights**: Penalize misclassification of minority class more
2. **Balanced sampling**: Ensure each bootstrap sample is balanced
3. **Adjust threshold**: Change decision threshold from 0.5
4. **Use appropriate metrics**: F1, AUC-ROC instead of accuracy
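
Threshold adjustment (solution 3) needs no retraining, only `predict_proba`. The sketch below assumes a synthetic 90/10 dataset and compares the default 0.5 cut-off with a hypothetical lower threshold of 0.3; the value 0.3 is an illustrative choice, not a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 90% class 0, 10% class 1)
X_thr, y_thr = make_classification(
    n_samples=1000, n_features=20, n_informative=15,
    weights=[0.9, 0.1], flip_y=0.05, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_thr, y_thr, test_size=0.3, random_state=42, stratify=y_thr
)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Column 1 holds the predicted probability of the minority class (label 1)
proba_minority = clf.predict_proba(X_te)[:, 1]

results = {}
for threshold in (0.5, 0.3):
    y_pred = (proba_minority >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_te, y_pred, zero_division=0),
        recall_score(y_te, y_pred),
    )
    print(f"threshold={threshold}: precision={results[threshold][0]:.3f}, "
          f"recall={results[threshold][1]:.3f}")
```

Lowering the threshold can only add positive predictions, so minority-class recall never decreases; the price is usually lower precision. In practice the threshold should be chosen on a validation set, not the test set.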

In [None]:
# Create imbalanced dataset (roughly 9:1 ratio)
X_imb, y_imb = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_classes=2,
    weights=[0.9, 0.1],  # 90% class 0, 10% class 1
    random_state=42,
    flip_y=0.05
)

X_train_imb, X_test_imb, y_train_imb, y_test_imb = train_test_split(
    X_imb, y_imb, test_size=0.3, random_state=42, stratify=y_imb
)

print(f"Training set distribution:")
print(f"  Class 0: {np.sum(y_train_imb == 0)} samples ({np.sum(y_train_imb == 0)/len(y_train_imb)*100:.1f}%)")
print(f"  Class 1: {np.sum(y_train_imb == 1)} samples ({np.sum(y_train_imb == 1)/len(y_train_imb)*100:.1f}%)")

# Standard RF (no balancing)
rf_standard = RandomForestClassifier(
    n_estimators=100,
    random_state=42
)
rf_standard.fit(X_train_imb, y_train_imb)
y_pred_standard = rf_standard.predict(X_test_imb)

# RF with balanced class weights
rf_balanced = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced',  # Automatically adjust weights
    random_state=42
)
rf_balanced.fit(X_train_imb, y_train_imb)
y_pred_balanced = rf_balanced.predict(X_test_imb)

# RF with balanced subsample
rf_balanced_sub = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced_subsample',  # Recompute balanced weights from each tree's bootstrap sample
    random_state=42
)
rf_balanced_sub.fit(X_train_imb, y_train_imb)
y_pred_balanced_sub = rf_balanced_sub.predict(X_test_imb)

# Compare
print("\n📊 Performance Comparison:")
print("=" * 70)

for name, y_pred in [("Standard", y_pred_standard), 
                      ("Balanced", y_pred_balanced),
                      ("Balanced Subsample", y_pred_balanced_sub)]:
    print(f"\n{name}:")
    print(classification_report(y_test_imb, y_pred, target_names=['Class 0', 'Class 1']))

# ROC curves
from sklearn.metrics import RocCurveDisplay

fig, ax = plt.subplots(figsize=(10, 8))

models = [
    ("Standard", rf_standard),
    ("Balanced", rf_balanced),
    ("Balanced Subsample", rf_balanced_sub)
]

for name, model in models:
    RocCurveDisplay.from_estimator(model, X_test_imb, y_test_imb, ax=ax, name=name)

ax.plot([0, 1], [0, 1], 'k--', label='Chance')
ax.set_xlabel('False Positive Rate', fontsize=12)
ax.set_ylabel('True Positive Rate', fontsize=12)
ax.set_title('ROC Curves: Handling Imbalanced Data', fontsize=14, fontweight='bold')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

### 🎯 Exercise 4: Extreme Imbalance

Create and handle an extremely imbalanced dataset:

1. Generate dataset with 100:1 imbalance ratio
2. Try different approaches:
   - Standard RF
   - Balanced class weights
   - Custom class weights {0: 1, 1: 50}
   - Adjusting decision threshold
3. Compare using precision-recall curves
4. Which approach works best?

In [None]:
# Your code here


## 6. Summary and Next Steps

### 🎓 Key Takeaways

1. **Random Forest = Bagging + Feature Randomness**:
   - Bootstrap sampling creates diversity in training data
   - Random feature selection reduces tree correlation
   - Lower correlation → better ensemble performance

2. **Key Hyperparameters**:
   - `n_estimators`: More trees stabilize predictions, with diminishing returns (100-500 typical)
   - `max_features`: 'sqrt' for classification, tune if needed
   - `min_samples_split/leaf`: Control tree complexity
   - `class_weight`: Handle imbalanced data

3. **Feature Importance**:
   - Impurity-based: Fast, built-in, biased toward high-cardinality features
   - Permutation-based: Slower, usable on held-out data, usually more reliable (but sensitive to correlated features)
   - Use for feature selection and model interpretation

4. **Imbalanced Data**:
   - Use `class_weight='balanced'` or 'balanced_subsample'
   - Evaluate with F1, AUC-ROC, precision-recall
   - Consider threshold adjustment

5. **Advantages of Random Forests**:
   - ✅ Robust to overfitting
   - ✅ Handles high-dimensional data
   - ✅ Provides feature importances
   - ✅ Minimal hyperparameter tuning needed
   - ✅ Parallelizable (fast training)

6. **Limitations**:
   - ❌ Less interpretable than single trees
   - ❌ Larger model size
   - ❌ Slower prediction than single trees
   - ❌ Can struggle with very imbalanced data

### 📚 What's Next?

- **Module 03**: AdaBoost (sequential ensemble that reduces bias)
- **Module 04**: Gradient Boosting (powerful sequential method)
- **Module 05**: XGBoost (optimized gradient boosting)

### 🎯 Practice Recommendations

1. Apply Random Forest to your own classification problem
2. Perform hyperparameter tuning with GridSearchCV
3. Analyze feature importances to gain insights
4. Compare with single decision tree and logistic regression
5. Try on a real Kaggle dataset

### 📖 Additional Resources

- **Original Paper**: Breiman, L. (2001). "Random Forests"
- **Sklearn Documentation**: https://scikit-learn.org/stable/modules/ensemble.html#forest
- **Interpretability**: "Interpretable Machine Learning" by Molnar
- **Feature Importance**: Strobl et al. (2007) on bias in Random Forest variable importance measures

---

**🚀 Ready to learn boosting? Let's move to Module 03: AdaBoost!**