# Module 02: Random Forests

**Difficulty**: ⭐⭐⭐ Advanced
**Estimated Time**: 75 minutes
**Prerequisites**: 
- Module 00: Introduction to Ensemble Learning
- Module 01: Bagging and Bootstrap
- Decision Trees fundamentals

## Learning Objectives

By the end of this notebook, you will be able to:
1. Understand how Random Forests extend bagging with feature randomness
2. Tune key hyperparameters for optimal performance
3. Extract and visualize feature importance
4. Use out-of-bag evaluation for model validation
5. Handle imbalanced datasets with Random Forests
6. Compare Random Forest performance with standard bagging

## Setup and Configuration

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Machine learning
from sklearn.datasets import make_classification, load_breast_cancer, load_wine
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix, 
    roc_auc_score, roc_curve
)
from sklearn.inspection import permutation_importance

# Configuration
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

print("Setup complete! All libraries imported successfully.")

## 1. What Makes Random Forests Special?

### From Bagging to Random Forests

Random Forests build on bagging with an additional layer of randomness:

**Standard Bagging**:
1. Create bootstrap samples (random rows)
2. Train a decision tree on each sample
3. Each tree considers ALL features at every split

**Random Forests**:
1. Create bootstrap samples (random rows)
2. Train a decision tree on each sample
3. **Each split considers a random subset of features** ← Key difference!
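
The difference in step 3 can be illustrated with a minimal sketch. It only shows how candidate features might be drawn at each split; it is not scikit-learn's internal implementation, and the names `n_features_demo` and `max_features_demo` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features_demo = 10
max_features_demo = int(np.sqrt(n_features_demo))  # 3 candidates per split

# Bagging: every split may choose among all features.
bagging_candidates = np.arange(n_features_demo)
print(f"bagging, every split: {bagging_candidates.tolist()}")

# Random Forest: every split draws a fresh subset without replacement.
rf_candidates_per_split = [
    rng.choice(n_features_demo, size=max_features_demo, replace=False)
    for _ in range(3)
]
for i, candidates in enumerate(rf_candidates_per_split, start=1):
    print(f"random forest, split {i}: {sorted(candidates.tolist())}")
```

Because a dominant feature is absent from many of these candidate sets, other features get a chance to define the split.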

### Why Feature Randomness Matters

**Problem with standard bagging**:
- If one feature is very strong, most trees will split on it first
- Trees become correlated (similar structure)
- Averaging correlated predictions provides less benefit

**Solution with Random Forests**:
- Force trees to consider different features
- Creates diverse trees with different structures
- Reduces correlation → better ensemble performance

**Mathematical insight**: 
- Variance of average of correlated variables: $\text{Var}(\bar{X}) = \rho\sigma^2 + \frac{1-\rho}{n}\sigma^2$
- Lower correlation ($\rho$) → lower variance → better predictions
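
As a sanity check on the formula, the sketch below simulates equicorrelated Gaussian variables and compares the empirical variance of their mean with the closed form. The Gaussian setup and the values of `n_vars`, `sigma2`, and `rho` are assumptions chosen for illustration; real tree predictions are not Gaussian.

```python
import numpy as np

def variance_of_mean(rho, sigma2, n):
    """Var of the mean of n variables with variance sigma2 and pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) / n * sigma2

rng = np.random.default_rng(0)
n_vars, sigma2, rho = 50, 1.0, 0.3

# Covariance matrix: sigma2 on the diagonal, rho * sigma2 elsewhere.
cov = sigma2 * (rho * np.ones((n_vars, n_vars)) + (1 - rho) * np.eye(n_vars))
draws = rng.multivariate_normal(np.zeros(n_vars), cov, size=20_000)

empirical = draws.mean(axis=1).var()
theoretical = variance_of_mean(rho, sigma2, n_vars)
print(f"empirical:   {empirical:.4f}")
print(f"theoretical: {theoretical:.4f}")

# As n grows the second term vanishes, so rho * sigma2 is a floor on the variance.
print(f"floor (n -> inf): {rho * sigma2:.4f}")
```

The floor is the reason adding trees alone cannot replace decorrelation: only a lower $\rho$ reduces the first term.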

In [None]:
# Demonstration: Bagging vs Random Forest tree correlation
# Create dataset with one dominant feature
X, y = make_classification(
    n_samples=500,
    n_features=10,
    n_informative=10,
    n_redundant=0,
    random_state=RANDOM_STATE
)

# Make first feature very strong by adding signal
X[:, 0] = X[:, 0] + 2 * y

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=RANDOM_STATE
)

# Standard Bagging (considers all features at each split)
bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=RANDOM_STATE),
    n_estimators=100,
    random_state=RANDOM_STATE
)
bagging_model.fit(X_train, y_train)

# Random Forest (considers sqrt(n_features) at each split)
rf_model = RandomForestClassifier(
    n_estimators=100,
    random_state=RANDOM_STATE
)
rf_model.fit(X_train, y_train)

# Get predictions from individual trees for correlation analysis
bagging_tree_preds = np.array([tree.predict(X_test) for tree in bagging_model.estimators_])
rf_tree_preds = np.array([tree.predict(X_test) for tree in rf_model.estimators_])

# Calculate average pairwise correlation
def calculate_avg_correlation(predictions):
    """Calculate average correlation between tree predictions."""
    corr_matrix = np.corrcoef(predictions)
    # Get upper triangle (excluding diagonal)
    upper_triangle = corr_matrix[np.triu_indices_from(corr_matrix, k=1)]
    return np.mean(upper_triangle)

bagging_corr = calculate_avg_correlation(bagging_tree_preds)
rf_corr = calculate_avg_correlation(rf_tree_preds)

print("Tree Correlation Analysis:")
print("=" * 50)
print(f"Bagging - Average tree correlation: {bagging_corr:.4f}")
print(f"Random Forest - Average tree correlation: {rf_corr:.4f}")
print(f"\nCorrelation reduction: {((bagging_corr - rf_corr) / bagging_corr * 100):.1f}%")
print("\nLower correlation means more diverse trees!")

# Compare performance
bagging_score = accuracy_score(y_test, bagging_model.predict(X_test))
rf_score = accuracy_score(y_test, rf_model.predict(X_test))

print("\nPerformance Comparison:")
print("=" * 50)
print(f"Bagging accuracy: {bagging_score:.4f}")
print(f"Random Forest accuracy: {rf_score:.4f}")
print(f"Improvement: {((rf_score - bagging_score) / bagging_score * 100):.2f}%")

## 2. Key Hyperparameters

### Understanding Random Forest Parameters

Random Forests have many tunable parameters. Here are the most important:

#### 2.1 Ensemble Parameters

**`n_estimators`**: Number of trees in the forest
- **Higher** → Better performance (but diminishing returns)
- **Higher** → Longer training time
- Typical values: 100-500
- Rule of thumb: Start with 100, increase if you have computational resources

**`max_features`**: Number of features to consider at each split
- **Lower** → More randomness, less correlation between trees
- **Higher** → Each tree is stronger individually
- Default for classification: `sqrt(n_features)`
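- Default for regression in scikit-learn: all features (`max_features=1.0`); `n_features / 3` is a common recommendation from the Random Forest literature, not the library default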
- Most important parameter for controlling tree diversity!

#### 2.2 Tree Parameters

**`max_depth`**: Maximum depth of each tree
- **Higher** → More complex trees, can capture intricate patterns
- **Higher** → Risk of overfitting
- Default: `None` (nodes are expanded until all leaves are pure or contain fewer than `min_samples_split` samples)
- Consider limiting if you have noisy data

**`min_samples_split`**: Minimum samples required to split a node
- **Higher** → Simpler trees, more regularization
- Default: 2
- Increase to prevent overfitting

**`min_samples_leaf`**: Minimum samples required in a leaf node
- **Higher** → Smoother decision boundaries
- Default: 1
- Useful for imbalanced datasets

#### 2.3 Special Parameters

**`oob_score`**: Use out-of-bag samples for validation
- Set to `True` to get free validation score
- No need for separate validation set!

**`class_weight`**: Handle imbalanced classes
- Set to `'balanced'` for automatic weighting
- Or provide custom weights
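
To connect these parameters to code, the sketch below prints the defaults reported by the installed scikit-learn version and then builds one configured forest. The specific values passed to `rf_sketch` are illustrative assumptions, not tuned recommendations.

```python
from sklearn.ensemble import RandomForestClassifier

# Inspect the defaults of the installed scikit-learn version.
defaults = RandomForestClassifier().get_params()
for name in [
    "n_estimators", "max_features", "max_depth",
    "min_samples_split", "min_samples_leaf", "oob_score", "class_weight",
]:
    print(f"{name:18s} = {defaults[name]!r}")

# One possible configuration that touches each parameter group (values are examples).
rf_sketch = RandomForestClassifier(
    n_estimators=200,         # ensemble size
    max_features="sqrt",      # feature randomness per split
    max_depth=10,             # limit tree complexity
    min_samples_leaf=2,       # smoother leaves
    oob_score=True,           # out-of-bag validation (requires bootstrap=True, the default)
    class_weight="balanced",  # reweight classes by inverse frequency
    random_state=0,
)
```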

In [None]:
# Hyperparameter tuning demonstration
# Load a real dataset for realistic tuning
cancer_data = load_breast_cancer()
X, y = cancer_data.data, cancer_data.target
feature_names = cancer_data.feature_names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=RANDOM_STATE
)

print(f"Dataset: {len(X_train)} training samples, {len(X_test)} test samples")
print(f"Features: {X.shape[1]}")
print(f"Classes: {np.unique(y)}")
print(f"Class distribution: {np.bincount(y)}")

In [None]:
# Effect of n_estimators
n_estimators_range = [10, 25, 50, 100, 200, 300, 500]
train_scores = []
test_scores = []

for n_est in n_estimators_range:
    rf = RandomForestClassifier(n_estimators=n_est, random_state=RANDOM_STATE, n_jobs=-1)
    rf.fit(X_train, y_train)
    
    train_scores.append(accuracy_score(y_train, rf.predict(X_train)))
    test_scores.append(accuracy_score(y_test, rf.predict(X_test)))

# Visualize
plt.figure(figsize=(10, 6))
plt.plot(n_estimators_range, train_scores, marker='o', label='Training Accuracy', linewidth=2)
plt.plot(n_estimators_range, test_scores, marker='s', label='Test Accuracy', linewidth=2)
plt.xlabel('Number of Trees (n_estimators)', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('Effect of Number of Trees on Performance', fontsize=14, fontweight='bold')
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

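# Find the first setting after which test accuracy changes by less than `tol`
tol = 0.01
stable_steps = np.flatnonzero(np.abs(np.diff(test_scores)) < tol)

print("\nObservations:")
if stable_steps.size:
    print(f"- Performance stabilizes around {n_estimators_range[stable_steps[0]]} trees")
else:
    print("- Test accuracy did not stabilize within the tested range")
print(f"- Best test accuracy: {max(test_scores):.4f}")
print("- Adding more trees after stabilization provides minimal benefit")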

In [None]:
# Effect of max_features
n_features = X_train.shape[1]
# Sort and de-duplicate so the line plot runs left to right without zigzagging
max_features_range = sorted({2, 4, 6, int(np.sqrt(n_features)), 10, 15, 20, n_features})
max_features_scores = []

for max_feat in max_features_range:
    rf = RandomForestClassifier(
        n_estimators=100, 
        max_features=min(max_feat, n_features),
        random_state=RANDOM_STATE,
        n_jobs=-1
    )
    # Use cross-validation for more robust estimates
    scores = cross_val_score(rf, X_train, y_train, cv=5)
    max_features_scores.append(scores.mean())

plt.figure(figsize=(10, 6))
plt.plot(max_features_range, max_features_scores, marker='o', linewidth=2, markersize=8)
plt.axvline(int(np.sqrt(n_features)), color='red', linestyle='--', 
            label=f'Default (sqrt={int(np.sqrt(n_features))})', linewidth=2)
plt.xlabel('Max Features per Split', fontsize=12)
plt.ylabel('Cross-Validation Accuracy', fontsize=12)
plt.title('Effect of Feature Randomness (max_features)', fontsize=14, fontweight='bold')
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

best_idx = np.argmax(max_features_scores)
print(f"\nBest max_features: {max_features_range[best_idx]}")
print(f"Default (sqrt): {int(np.sqrt(n_features))}")
print(f"\nThe default sqrt value is often near-optimal!")

In [None]:
# Grid search for best hyperparameters
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_features': ['sqrt', 'log2', 10],
    'max_depth': [10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create base model
rf_base = RandomForestClassifier(random_state=RANDOM_STATE, n_jobs=-1)

# Grid search with cross-validation
print("Running grid search... This may take a minute.")
grid_search = GridSearchCV(
    estimator=rf_base,
    param_grid=param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)
grid_search.fit(X_train, y_train)

print("\nBest Parameters:")
print("=" * 50)
for param, value in grid_search.best_params_.items():
    print(f"{param:20s}: {value}")

print(f"\nBest CV Score: {grid_search.best_score_:.4f}")
print(f"Test Score: {grid_search.score(X_test, y_test):.4f}")

## 3. Feature Importance

Random Forests provide powerful feature importance metrics, helping us understand which features drive predictions.

### Types of Feature Importance

#### 3.1 Impurity-Based Importance (Default)
- Measures average decrease in impurity (Gini or entropy) from splits on that feature
- Fast to compute (available after training)
- **Caveat**: Biased toward high-cardinality features

#### 3.2 Permutation Importance
- Measures performance drop when feature values are randomly shuffled
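- Less biased toward high-cardinality features, and can be computed on held-out data
- **Caveat**: Slower to compute (requires multiple predictions), and strongly correlated features can share or mask each other's importance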
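
The idea behind permutation importance can be written out by hand. The sketch below is a simplified illustration on a small synthetic dataset (the helper `manual_permutation_importance` and all `*_demo` names are hypothetical); in practice, `sklearn.inspection.permutation_importance` is the function to use.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# shuffle=False keeps the 3 informative features in columns 0-2; columns 3-5 are noise.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)
model_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

def manual_permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy when each column is shuffled independently."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the link between feature j and y
            drops[j, r] = baseline - model.score(X_perm, y)
    return drops.mean(axis=1)

importances_demo = manual_permutation_importance(model_demo, X_te, y_te)
for j, value in enumerate(importances_demo):
    print(f"feature {j}: mean accuracy drop = {value:.4f}")
```

Noise features should show drops close to zero, and small negative values can appear by chance.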

### Why Feature Importance Matters

1. **Model interpretation**: Understand what drives predictions
2. **Feature selection**: Identify and remove irrelevant features
3. **Domain validation**: Check if important features make sense
4. **Debugging**: Detect data leakage or problematic features

In [None]:
# Train Random Forest for feature importance analysis
rf_importance = RandomForestClassifier(
    n_estimators=200,
    random_state=RANDOM_STATE,
    n_jobs=-1
)
rf_importance.fit(X_train, y_train)

# Method 1: Impurity-based importance
impurity_importance = rf_importance.feature_importances_

# Method 2: Permutation importance
perm_importance = permutation_importance(
    rf_importance, X_test, y_test, 
    n_repeats=10, random_state=RANDOM_STATE, n_jobs=-1
)

# Create comparison DataFrame
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Impurity_Importance': impurity_importance,
    'Permutation_Importance': perm_importance.importances_mean,
    'Permutation_Std': perm_importance.importances_std
}).sort_values('Impurity_Importance', ascending=False)

print("Top 10 Most Important Features:")
print("=" * 80)
print(importance_df.head(10).to_string(index=False))

In [None]:
# Visualize feature importance
fig, axes = plt.subplots(1, 2, figsize=(16, 8))

# Plot 1: Impurity-based importance
top_features = importance_df.head(15)
axes[0].barh(range(len(top_features)), top_features['Impurity_Importance'], color='steelblue')
axes[0].set_yticks(range(len(top_features)))
axes[0].set_yticklabels(top_features['Feature'])
axes[0].invert_yaxis()
axes[0].set_xlabel('Importance Score', fontsize=11)
axes[0].set_title('Impurity-Based Feature Importance\n(Fast, may be biased)', 
                  fontsize=12, fontweight='bold')
axes[0].grid(axis='x', alpha=0.3)

# Plot 2: Permutation importance with error bars
top_perm = importance_df.sort_values('Permutation_Importance', ascending=False).head(15)
axes[1].barh(range(len(top_perm)), top_perm['Permutation_Importance'], 
             xerr=top_perm['Permutation_Std'], color='coral', capsize=3)
axes[1].set_yticks(range(len(top_perm)))
axes[1].set_yticklabels(top_perm['Feature'])
axes[1].invert_yaxis()
axes[1].set_xlabel('Importance Score', fontsize=11)
axes[1].set_title('Permutation Feature Importance\n(Slower, more reliable)', 
                  fontsize=12, fontweight='bold')
axes[1].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

print("\nInterpretation:")
print("- Compare the two rankings: broad agreement on the top features is a good sign")
print("- Permutation importance error bars show stability across shuffles")
print("- Features near zero under both methods are candidates for removal (validate before dropping)")

## 4. Out-of-Bag (OOB) Evaluation

### Free Validation Without a Validation Set!

**Key insight**: Each tree in a Random Forest is trained on a bootstrap sample (sampling with replacement). On average, each bootstrap sample contains about 63.2% of the distinct training samples ($1 - 1/e$ for large $n$).

**The remaining ~37% of samples** (out-of-bag samples) were never seen by that tree.

**OOB evaluation**:
1. For each sample, find all trees that didn't see it during training
2. Use only those trees to make a prediction
3. Compare predictions to true labels
4. Result: a nearly unbiased (often slightly pessimistic) performance estimate without a separate validation set!
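
The ~63% / ~37% split can be checked with a short simulation. This sketch only simulates the bootstrap bookkeeping (no trees are trained); `n_samples_demo` and `n_trees_demo` are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples_demo, n_trees_demo = 1000, 200

in_bag_fractions = []
oob_tree_counts = np.zeros(n_samples_demo, dtype=int)  # trees for which each sample is OOB

for _ in range(n_trees_demo):
    # Bootstrap: draw n row indices with replacement.
    bootstrap_idx = rng.integers(0, n_samples_demo, size=n_samples_demo)
    in_bag = np.zeros(n_samples_demo, dtype=bool)
    in_bag[bootstrap_idx] = True
    in_bag_fractions.append(in_bag.mean())
    oob_tree_counts += ~in_bag

print(f"mean in-bag fraction:  {np.mean(in_bag_fractions):.4f}")
print(f"theoretical 1 - 1/e:   {1 - 1 / np.e:.4f}")
print(f"OOB trees per sample:  min={oob_tree_counts.min()}, mean={oob_tree_counts.mean():.1f}")
```

With enough trees, every sample is out-of-bag for many of them, which is what makes the OOB prediction for each sample a small ensemble in its own right.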

### Benefits
- No need to split off validation data
- Get performance estimate during training
- Particularly useful for small datasets
- Almost as good as cross-validation but much faster

In [None]:
# Demonstrate OOB evaluation
# Train Random Forest with OOB scoring enabled
rf_oob = RandomForestClassifier(
    n_estimators=200,
    oob_score=True,  # Enable OOB evaluation
    random_state=RANDOM_STATE,
    n_jobs=-1
)
rf_oob.fit(X_train, y_train)

# Compare OOB score with test score
oob_score = rf_oob.oob_score_
test_score = rf_oob.score(X_test, y_test)
train_score = rf_oob.score(X_train, y_train)

print("Performance Comparison:")
print("=" * 50)
print(f"Training Score:   {train_score:.4f}")
print(f"OOB Score:        {oob_score:.4f}  ← Free validation!")
print(f"Test Score:       {test_score:.4f}")
print(f"\nOOB vs Test difference: {abs(oob_score - test_score):.4f}")
print("\nOOB score is very close to test score without using test data!")

In [None]:
# OOB score vs number of trees
# Shows how OOB estimate stabilizes with more trees
n_trees_range = range(10, 201, 10)
oob_scores = []

for n_trees in n_trees_range:
    rf = RandomForestClassifier(
        n_estimators=n_trees,
        oob_score=True,
        random_state=RANDOM_STATE,
        n_jobs=-1
    )
    rf.fit(X_train, y_train)
    oob_scores.append(rf.oob_score_)

plt.figure(figsize=(10, 6))
plt.plot(n_trees_range, oob_scores, marker='o', linewidth=2)
plt.axhline(test_score, color='red', linestyle='--', 
            label=f'Actual Test Score: {test_score:.4f}', linewidth=2)
plt.xlabel('Number of Trees', fontsize=12)
plt.ylabel('OOB Score', fontsize=12)
plt.title('OOB Score Stabilization', fontsize=14, fontweight='bold')
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nThe OOB score stabilizes as trees are added and should land near the held-out test score.")

## 5. Handling Imbalanced Data

### The Imbalanced Class Problem

**Problem**: When one class is much more common than another (e.g., fraud detection, disease diagnosis)
- Model may achieve high accuracy by always predicting majority class
- Fails to learn minority class patterns

### Random Forest Solutions

#### 5.1 Class Weighting
- Set `class_weight='balanced'`
- Automatically adjusts weights inversely proportional to class frequencies
- Formula: $w_i = \frac{n_{\text{samples}}}{n_{\text{classes}} \times n_{\text{samples in class } i}}$
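
The formula can be checked against scikit-learn's `compute_class_weight` helper. The 95/5 label vector `y_weights_demo` is made up for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 95 samples of class 0, 5 samples of class 1.
y_weights_demo = np.array([0] * 95 + [1] * 5)
classes_demo = np.unique(y_weights_demo)

# w_i = n_samples / (n_classes * n_samples_in_class_i)
manual_weights = len(y_weights_demo) / (len(classes_demo) * np.bincount(y_weights_demo))

sklearn_weights = compute_class_weight(
    class_weight="balanced", classes=classes_demo, y=y_weights_demo
)

print("manual: ", manual_weights)
print("sklearn:", sklearn_weights)
```

The minority class receives a weight 19 times larger than the majority class here, matching the 95:5 imbalance.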

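#### 5.2 Per-Tree Class Weighting (`balanced_subsample`)
- Set `class_weight='balanced_subsample'`
- Uses the same formula as `'balanced'`, but recomputes the weights from the bootstrap sample of each tree
- Does not change which rows are drawn; it only changes how they are weighted within each tree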

#### 5.3 Custom Sampling Strategy
- Manually oversample minority class or undersample majority class
- Use libraries like imbalanced-learn
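
A minimal sketch of manual oversampling using `sklearn.utils.resample` is shown below. The toy arrays are assumptions for illustration. Resample only the training split; oversampling before splitting leaks duplicated minority rows into the test set.

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(0)
X_os_demo = rng.normal(size=(100, 3))
y_os_demo = np.array([0] * 90 + [1] * 10)  # 90:10 imbalance

X_majority = X_os_demo[y_os_demo == 0]
X_minority = X_os_demo[y_os_demo == 1]

# Draw minority rows with replacement until both classes have the same size.
X_minority_up = resample(
    X_minority, replace=True, n_samples=len(X_majority), random_state=0
)

X_balanced = np.vstack([X_majority, X_minority_up])
y_balanced = np.concatenate([
    np.zeros(len(X_majority), dtype=int),
    np.ones(len(X_minority_up), dtype=int),
])

print("before:", np.bincount(y_os_demo))
print("after: ", np.bincount(y_balanced))
```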

In [None]:
# Create imbalanced dataset
X_imb, y_imb = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    weights=[0.95, 0.05],  # 95% class 0, 5% class 1
    flip_y=0.01,
    random_state=RANDOM_STATE
)

X_train_imb, X_test_imb, y_train_imb, y_test_imb = train_test_split(
    X_imb, y_imb, test_size=0.3, random_state=RANDOM_STATE, stratify=y_imb
)

print("Imbalanced Dataset:")
print("=" * 50)
print(f"Training class distribution: {np.bincount(y_train_imb)}")
print(f"Imbalance ratio: {np.bincount(y_train_imb)[0] / np.bincount(y_train_imb)[1]:.1f}:1")
print(f"\nMinority class: {np.bincount(y_train_imb)[1]} samples ({np.bincount(y_train_imb)[1]/len(y_train_imb)*100:.1f}%)")

In [None]:
# Compare different approaches
# 1. Standard Random Forest (no balancing)
rf_standard = RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE, n_jobs=-1)
rf_standard.fit(X_train_imb, y_train_imb)

# 2. Balanced class weights
rf_balanced = RandomForestClassifier(
    n_estimators=100, 
    class_weight='balanced',
    random_state=RANDOM_STATE, 
    n_jobs=-1
)
rf_balanced.fit(X_train_imb, y_train_imb)

# 3. Balanced subsample
rf_balanced_subsample = RandomForestClassifier(
    n_estimators=100, 
    class_weight='balanced_subsample',
    random_state=RANDOM_STATE, 
    n_jobs=-1
)
rf_balanced_subsample.fit(X_train_imb, y_train_imb)

# Evaluate all models
models = [
    ('Standard RF', rf_standard),
    ('Balanced Weights', rf_balanced),
    ('Balanced Subsample', rf_balanced_subsample)
]

print("\nPerformance on Imbalanced Data:")
print("=" * 80)

for name, model in models:
    y_pred = model.predict(X_test_imb)
    y_pred_proba = model.predict_proba(X_test_imb)[:, 1]
    
    # Calculate metrics
    accuracy = accuracy_score(y_test_imb, y_pred)
    roc_auc = roc_auc_score(y_test_imb, y_pred_proba)
    
    # Get per-class metrics from classification report
    report = classification_report(y_test_imb, y_pred, output_dict=True)
    minority_recall = report['1']['recall']
    minority_precision = report['1']['precision']
    
    print(f"\n{name}:")
    print(f"  Accuracy:           {accuracy:.4f}")
    print(f"  ROC-AUC:            {roc_auc:.4f}")
    print(f"  Minority Recall:    {minority_recall:.4f}  ← Can we find minority class?")
    print(f"  Minority Precision: {minority_precision:.4f}  ← Are predictions reliable?")

In [None]:
# Visualize ROC curves
plt.figure(figsize=(10, 8))

for name, model in models:
    y_pred_proba = model.predict_proba(X_test_imb)[:, 1]
    fpr, tpr, _ = roc_curve(y_test_imb, y_pred_proba)
    roc_auc = roc_auc_score(y_test_imb, y_pred_proba)
    
    plt.plot(fpr, tpr, linewidth=2, label=f'{name} (AUC = {roc_auc:.3f})')

plt.plot([0, 1], [0, 1], 'k--', linewidth=2, label='Random Classifier')
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('ROC Curves: Handling Imbalanced Data', fontsize=14, fontweight='bold')
plt.legend(fontsize=10, loc='lower right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nKey Insight:")
print("Class weighting usually trades some precision for higher minority recall; check the metrics above.")
print("Choose based on your use case:")
print("- High precision needed? Use standard or tune threshold")
print("- High recall needed? Use balanced weights")
print("- No option wins universally: compare them with cross-validation on your data")

## 6. Comparison: Bagging vs Random Forest

Let's solidify understanding by directly comparing standard bagging with Random Forests.

In [None]:
# Comprehensive comparison on wine dataset
wine_data = load_wine()
X_wine, y_wine = wine_data.data, wine_data.target

X_train_wine, X_test_wine, y_train_wine, y_test_wine = train_test_split(
    X_wine, y_wine, test_size=0.3, random_state=RANDOM_STATE
)

# Standard Bagging
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(random_state=RANDOM_STATE),
    n_estimators=100,
    random_state=RANDOM_STATE,
    n_jobs=-1
)

# Random Forest
rf = RandomForestClassifier(
    n_estimators=100,
    random_state=RANDOM_STATE,
    n_jobs=-1
)

# Compare with cross-validation
bagging_scores = cross_val_score(bagging, X_wine, y_wine, cv=10)
rf_scores = cross_val_score(rf, X_wine, y_wine, cv=10)

# Create comparison DataFrame
comparison = pd.DataFrame({
    'Method': ['Bagging', 'Random Forest'],
    'Mean Accuracy': [bagging_scores.mean(), rf_scores.mean()],
    'Std Accuracy': [bagging_scores.std(), rf_scores.std()],
    'Min Accuracy': [bagging_scores.min(), rf_scores.min()],
    'Max Accuracy': [bagging_scores.max(), rf_scores.max()]
})

print("Bagging vs Random Forest (10-Fold CV):")
print("=" * 80)
print(comparison.to_string(index=False))

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Box plot
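# Set tick labels separately: the `labels` keyword is deprecated in recent Matplotlib releases
axes[0].boxplot([bagging_scores, rf_scores])
axes[0].set_xticklabels(['Bagging', 'Random Forest'])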
axes[0].set_ylabel('Accuracy', fontsize=11)
axes[0].set_title('Cross-Validation Score Distribution', fontsize=12, fontweight='bold')
axes[0].grid(axis='y', alpha=0.3)

# Bar plot with error bars
x = np.arange(2)
means = [bagging_scores.mean(), rf_scores.mean()]
stds = [bagging_scores.std(), rf_scores.std()]
axes[1].bar(x, means, yerr=stds, capsize=5, alpha=0.7, 
            color=['steelblue', 'coral'], edgecolor='black')
axes[1].set_xticks(x)
axes[1].set_xticklabels(['Bagging', 'Random Forest'])
axes[1].set_ylabel('Accuracy', fontsize=11)
axes[1].set_title('Mean Accuracy with Std Dev', fontsize=12, fontweight='bold')
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\nConclusion:")
improvement = (rf_scores.mean() - bagging_scores.mean()) / bagging_scores.mean() * 100
print(f"Relative change in mean CV accuracy, Random Forest vs Bagging: {improvement:+.2f}%")
print("Feature randomness decorrelates trees, which often (not always) yields a better ensemble.")

## Exercises

Test your understanding of Random Forests.

### Exercise 1: Optimal max_features Investigation

Create a classification dataset with 50 features where:
- 10 features are highly informative
- 40 features are noise

Train Random Forests with different `max_features` values (1, 5, 10, 'sqrt', 'log2', all features) and compare:
1. Cross-validation accuracy
2. Training time
3. Tree correlation (use predictions to calculate correlation)

**Questions**:
- How does max_features affect tree diversity?
- What's the optimal value for this dataset?
- Why doesn't using all features give the best result?

In [None]:
# Your code here


### Exercise 2: Feature Selection with Random Forests

Use the breast cancer dataset and:
1. Train a Random Forest and get feature importances
2. Identify features with near-zero importance
3. Train a new Random Forest using only top N most important features
4. Compare performance and training time

**Find the optimal N** that gives best validation accuracy with fewest features.

Bonus: Plot the Pareto frontier showing accuracy vs number of features.

In [None]:
# Your code here


### Exercise 3: Severe Imbalance Challenge

Create an extremely imbalanced dataset (99:1 ratio) and implement three strategies:

1. **Approach 1**: Random Forest with `class_weight='balanced'`
2. **Approach 2**: Manual oversampling of minority class + standard Random Forest
3. **Approach 3**: Adjust decision threshold on predicted probabilities

For each approach, calculate and compare:
- Accuracy
- Precision and Recall for minority class
- F1-score
- ROC-AUC

**Which approach works best and why?**

In [None]:
# Your code here


### Exercise 4: OOB vs Cross-Validation

Compare OOB evaluation with k-fold cross-validation:

1. Use a dataset of your choice
2. Train Random Forest with OOB scoring
3. Perform 5-fold and 10-fold cross-validation
4. Measure and compare:
   - Performance estimates (how close are they?)
   - Computation time
   - Variance in estimates

**When would you prefer OOB over cross-validation?**

In [None]:
# Your code here


## Summary

### Key Concepts

1. **Random Forest = Bagging + Feature Randomness**
   - Each tree trained on bootstrap sample (random rows)
   - Each split considers random subset of features (random columns)
   - Creates diverse, decorrelated trees

2. **Critical Hyperparameters**:
   - `n_estimators`: More trees → better (but diminishing returns after ~100-200)
   - `max_features`: Controls diversity (sqrt is often optimal)
   - `max_depth`, `min_samples_split`: Control overfitting
   - `class_weight`: Handle imbalanced data

3. **Feature Importance**:
   - Impurity-based: Fast but potentially biased
   - Permutation-based: Slower but more reliable
   - Use for interpretation, feature selection, debugging

4. **Out-of-Bag Evaluation**:
   - Free validation estimate without separate validation set
   - Each sample validated on trees that didn't see it
   - Nearly as good as cross-validation, much faster

5. **Imbalanced Data**:
   - Use `class_weight='balanced'` or `'balanced_subsample'`
   - Focus on ROC-AUC and per-class metrics, not just accuracy
   - Consider threshold tuning for production deployment

### Strengths of Random Forests

- Excellent out-of-box performance (minimal tuning needed)
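- Needs little preprocessing (no feature scaling); note that scikit-learn expects numerically encoded features, and native missing-value support depends on the library version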
- Provides feature importance
- Resistant to overfitting (thanks to averaging)
- Parallelizable (fast training with n_jobs=-1)
- Works well for both classification and regression

### Limitations

- Can be slow on very large datasets
- Large model size (N trees to store)
- Less interpretable than single decision tree
- May not capture linear relationships as well as linear models
- Can struggle with extrapolation beyond training data range

### When to Use Random Forests

**Ideal for**:
- Tabular data with mixed feature types
- When you need good performance without extensive tuning
- Feature importance is valuable
- Baseline model before trying more complex methods

**Consider alternatives when**:
- Need maximum accuracy (try gradient boosting)
- Need interpretability (try single tree or linear model)
- Have very large datasets (try LightGBM/CatBoost)
- Working with images/text (try neural networks)

### What's Next?

In **Module 03: Boosting Fundamentals and AdaBoost**, we'll explore:
- Sequential ensemble learning (boosting)
- How boosting reduces bias instead of variance
- AdaBoost algorithm and adaptive sample weighting
- Comparison with bagging approaches

### Additional Resources

- **Original Paper**: "Random Forests" by Leo Breiman (2001)
- **Documentation**: [Scikit-learn Random Forest Guide](https://scikit-learn.org/stable/modules/ensemble.html#forest)
- **Book**: "The Elements of Statistical Learning" Chapter 15 (Hastie et al.)
- **Practical Guide**: [Tuning Random Forests](https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74)