# Random Forest: Complete Guide

## Table of Contents
1. [Introduction to Ensemble Methods](#1.-Introduction-to-Ensemble-Methods)
2. [Bootstrap Aggregating (Bagging)](#2.-Bootstrap-Aggregating-(Bagging))
3. [Random Forest Algorithm](#3.-Random-Forest-Algorithm)
4. [Out-of-Bag (OOB) Error Estimation](#4.-Out-of-Bag-(OOB)-Error-Estimation)
5. [Feature Importance](#5.-Feature-Importance)
6. [Implementation from Scratch](#6.-Implementation-from-Scratch)
7. [Scikit-learn Implementation](#7.-Scikit-learn-Implementation)
8. [Hyperparameter Tuning](#8.-Hyperparameter-Tuning)
9. [Comparison with Decision Tree](#9.-Comparison-with-Decision-Tree)
10. [Variable Importance Plots](#10.-Variable-Importance-Plots)
11. [Partial Dependence Plots](#11.-Partial-Dependence-Plots)
12. [Real-world Applications](#12.-Real-world-Applications)
13. [Practice Problems](#13.-Practice-Problems)

---

## 1. Introduction to Ensemble Methods

### What are Ensemble Methods?

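**Ensemble methods** combine multiple machine learning models into a single predictor that is usually more accurate than any of its members. The classic framing is that a group of "weak learners" can come together to form a "strong learner"; in bagging-style ensembles such as Random Forest, the members are instead fairly strong but high-variance models whose errors partly cancel when aggregated.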

### Why Use Ensemble Methods?

1. **Reduced Overfitting**: By averaging multiple models, we reduce variance
2. **Improved Accuracy**: Combined predictions are often more accurate
3. **Robustness**: Less sensitive to noise and outliers
4. **Stability**: More consistent predictions across different datasets

### Types of Ensemble Methods

1. **Bagging (Bootstrap Aggregating)**
   - Train models on random subsets of data
   - Example: Random Forest

2. **Boosting**
   - Train models sequentially, focusing on errors
   - Example: AdaBoost, Gradient Boosting, XGBoost

3. **Stacking**
   - Combine different types of models
   - Meta-learner makes final prediction

### The Wisdom of Crowds

Ensemble methods are based on the "wisdom of crowds" principle:
- Individual predictions may be noisy
- Aggregate prediction is more stable and accurate
- Works best when individual models are diverse
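
The effect can be checked with a small simulation. The sketch below assumes 25 hypothetical binary classifiers that are each correct 60% of the time and whose errors are fully independent. That is an idealization: real models are correlated, so the gain in practice is smaller.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup (illustrative only): 25 independent binary classifiers,
# each correct on 60% of cases
n_models, n_cases, p_correct = 25, 10_000, 0.6

# correct[m, c] is True when model m classifies case c correctly
correct = rng.random((n_models, n_cases)) < p_correct

# Average accuracy of a single model
individual_acc = correct.mean()

# The majority vote is right whenever more than half of the models are right
majority_acc = (correct.sum(axis=0) > n_models / 2).mean()

print(f"Average individual accuracy: {individual_acc:.3f}")
print(f"Majority-vote accuracy:      {majority_acc:.3f}")
```

The gap between the two numbers shrinks as the models' errors become more correlated, which is why diversity matters.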

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, BaggingClassifier
from sklearn.datasets import make_classification, make_regression, load_iris, load_breast_cancer, load_diabetes
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, mean_squared_error, r2_score
from sklearn.inspection import permutation_importance, PartialDependenceDisplay
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Random seed for reproducibility
np.random.seed(42)

## 2. Bootstrap Aggregating (Bagging)

### What is Bagging?

**Bagging** (Bootstrap Aggregating) is an ensemble technique that:
1. Creates multiple bootstrap samples from the training data
2. Trains a separate model on each sample
3. Aggregates predictions (voting for classification, averaging for regression)

### Bootstrap Sampling

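- Sample **with replacement** from the original dataset
- Each bootstrap sample has the same size as the original
- On average, a bootstrap sample contains ~63.2% of the distinct original instances; the rest of its rows are repeats
- The remaining ~36.8% of instances, never drawn for that sample, are its "out-of-bag" (OOB) samples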
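
The 63.2% figure follows from the chance that a given instance is missed by all $n$ draws, $(1 - 1/n)^n$, which tends to $e^{-1} \approx 0.368$ as $n$ grows. A quick numerical check (the helper name `expected_unique_fraction` is made up for this sketch):

```python
import math

def expected_unique_fraction(n: int) -> float:
    """Expected share of the n original instances that appear at least once
    in a bootstrap sample of size n."""
    return 1 - (1 - 1 / n) ** n

for n in (10, 100, 1_000, 100_000):
    print(f"n={n:>7}: {expected_unique_fraction(n):.4f}")

print(f"limit 1 - 1/e: {1 - math.exp(-1):.4f}")
```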

### Mathematical Foundation

For classification:
$$\hat{y} = \text{mode}(\hat{y}_1, \hat{y}_2, ..., \hat{y}_B)$$

For regression:
$$\hat{y} = \frac{1}{B} \sum_{b=1}^{B} \hat{y}_b$$

where $B$ is the number of bootstrap samples.
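
As a minimal sketch of the two aggregation rules, the arrays below hold made-up predictions from $B = 3$ models (one row per model); `majority_vote` is a hypothetical helper, not a library function.

```python
import numpy as np

# Hypothetical predictions from B = 3 models (rows) for a few samples (columns)
class_preds = np.array([[0, 1, 1],
                        [0, 1, 0],
                        [1, 1, 0]])
reg_preds = np.array([[1.0, 2.0],
                      [2.0, 4.0],
                      [3.0, 6.0]])

def majority_vote(preds):
    """Mode over models for each sample; ties go to the smallest label."""
    return np.array([np.bincount(col).argmax() for col in preds.T])

y_hat_class = majority_vote(class_preds)   # classification: mode over b = 1..B
y_hat_reg = reg_preds.mean(axis=0)         # regression: average over b = 1..B

print(y_hat_class)
print(y_hat_reg)
```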

In [None]:
# Demonstration of Bootstrap Sampling
def demonstrate_bootstrap():
    # Original dataset
    original_data = np.arange(1, 11)
    print("Original Data:", original_data)
    print("\nBootstrap Samples:")
    
    # Create 5 bootstrap samples
    for i in range(5):
        bootstrap_sample = np.random.choice(original_data, size=len(original_data), replace=True)
        unique_pct = len(np.unique(bootstrap_sample)) / len(original_data) * 100
        print(f"Sample {i+1}: {bootstrap_sample} (Unique: {unique_pct:.1f}%)")

demonstrate_bootstrap()

In [None]:
# Visualizing variance reduction through bagging
def visualize_bagging_variance_reduction():
    # Generate synthetic data
    X = np.linspace(0, 10, 100).reshape(-1, 1)
    y_true = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])
    
    fig, axes = plt.subplots(1, 3, figsize=(18, 5))
    
    # Single decision tree
    tree = DecisionTreeRegressor(max_depth=5, random_state=42)
    tree.fit(X, y_true)
    y_pred_tree = tree.predict(X)
    
    axes[0].scatter(X, y_true, alpha=0.5, label='Data')
    axes[0].plot(X, y_pred_tree, 'r-', linewidth=2, label='Single Tree')
    axes[0].set_title('Single Decision Tree\n(High Variance)', fontsize=12, fontweight='bold')
    axes[0].set_xlabel('X')
    axes[0].set_ylabel('y')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # Multiple trees (bagging)
    n_estimators = 10
    predictions = []
    
    for i in range(n_estimators):
        # Bootstrap sample
        indices = np.random.choice(len(X), size=len(X), replace=True)
        X_boot, y_boot = X[indices], y_true[indices]
        
        # Train tree
        tree = DecisionTreeRegressor(max_depth=5, random_state=i)
        tree.fit(X_boot, y_boot)
        y_pred = tree.predict(X)
        predictions.append(y_pred)
        
        # Plot individual tree
        axes[1].plot(X, y_pred, alpha=0.3, linewidth=1)
    
    axes[1].scatter(X, y_true, alpha=0.5, label='Data')
    axes[1].set_title(f'{n_estimators} Individual Trees\n(Bootstrap Samples)', fontsize=12, fontweight='bold')
    axes[1].set_xlabel('X')
    axes[1].set_ylabel('y')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    # Bagged prediction (average)
    y_pred_bagged = np.mean(predictions, axis=0)
    
    axes[2].scatter(X, y_true, alpha=0.5, label='Data')
    axes[2].plot(X, y_pred_bagged, 'g-', linewidth=2, label='Bagged Prediction')
    axes[2].set_title('Bagged Prediction (Average)\n(Reduced Variance)', fontsize=12, fontweight='bold')
    axes[2].set_xlabel('X')
    axes[2].set_ylabel('y')
    axes[2].legend()
    axes[2].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
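    # Compare residual variance on the training data.
    # Note: this is a rough in-sample proxy, not the estimator's variance
    # across training sets, so the "reduction" can be small or even negative.
    var_tree = np.var(y_pred_tree - y_true)
    var_bagged = np.var(y_pred_bagged - y_true)
    print("\nResidual Variance (in-sample):")
    print(f"Single Tree: {var_tree:.4f}")
    print(f"Bagged: {var_bagged:.4f}")
    print(f"Reduction: {(1 - var_bagged/var_tree)*100:.2f}%")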

visualize_bagging_variance_reduction()

## 3. Random Forest Algorithm

### What is Random Forest?

**Random Forest** is an ensemble method that extends bagging by adding random feature selection:
1. Create bootstrap samples (like bagging)
2. At each split, randomly select a subset of features
3. Choose the best split from the selected features only
4. Aggregate predictions from all trees

### Key Differences from Bagging

| Aspect | Bagging | Random Forest |
|--------|---------|---------------|
| Data Sampling | Bootstrap samples | Bootstrap samples |
| Feature Selection | All features | Random subset at each split |
| Tree Correlation | Higher | Lower (more diverse) |
| Variance Reduction | Good | Better |
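
Why lower correlation matters: for $B$ identically distributed predictors with variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$. Adding trees shrinks only the second term, so decorrelating the trees is what lowers the floor. The simulation below is a sketch that checks this with synthetic Gaussian "predictions" rather than real trees; building them from a shared component is an assumption made purely for illustration.

```python
import numpy as np

def variance_of_average(rho, B, n_draws=200_000, seed=0):
    """Empirical variance of the mean of B unit-variance variables
    with pairwise correlation rho."""
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal((n_draws, 1))   # component common to all predictors
    own = rng.standard_normal((n_draws, B))      # independent component per predictor
    preds = np.sqrt(rho) * shared + np.sqrt(1 - rho) * own
    return preds.mean(axis=1).var()

B = 10
for rho in (0.5, 0.1):
    theory = rho + (1 - rho) / B   # formula above with sigma^2 = 1
    print(f"rho={rho}: simulated={variance_of_average(rho, B):.3f}, theory={theory:.3f}")
```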

### Algorithm Steps

1. **For b = 1 to B:**
   - Draw a bootstrap sample of size n from training data
   - Grow a decision tree $T_b$ on this sample:
     - At each node, randomly select $m$ features from $p$ total features
     - Choose best split from these $m$ features only
     - Split the node and repeat until stopping criterion

2. **Prediction:**
   - Classification: Majority vote from all trees
   - Regression: Average prediction from all trees

### Hyperparameters

- **n_estimators**: Number of trees (typically 100-500)
- **max_features**: Number of features to consider at each split
  - Classification: $\sqrt{p}$ (scikit-learn default)
  - Regression: $p/3$ is the classic recommendation; scikit-learn defaults to all $p$ features
- **max_depth**: Maximum depth of trees
- **min_samples_split**: Minimum samples to split a node
- **min_samples_leaf**: Minimum samples in leaf node
- **bootstrap**: Whether to use bootstrap samples

In [None]:
# Visualizing Random Forest concept
def visualize_random_forest_concept():
    fig, axes = plt.subplots(2, 3, figsize=(18, 12))
    
    # Generate data
    X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                                n_informative=2, n_clusters_per_class=1,
                                random_state=42)
    
    # Create mesh for decision boundary
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.1),
                         np.arange(y_min, y_max, 0.1))
    
    # Train individual trees
    for idx in range(5):
        ax = axes[idx // 3, idx % 3]
        
        # Bootstrap sample
        indices = np.random.choice(len(X), size=len(X), replace=True)
        X_boot, y_boot = X[indices], y[indices]
        
        # Train tree
        tree = DecisionTreeClassifier(max_depth=5, max_features='sqrt', random_state=idx)
        tree.fit(X_boot, y_boot)
        
        # Plot decision boundary
        Z = tree.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        ax.contourf(xx, yy, Z, alpha=0.4, cmap='RdYlBu')
        ax.scatter(X_boot[:, 0], X_boot[:, 1], c=y_boot, cmap='RdYlBu', 
                   edgecolors='black', alpha=0.7)
        ax.set_title(f'Tree {idx+1}', fontsize=11, fontweight='bold')
        ax.set_xlabel('Feature 1')
        ax.set_ylabel('Feature 2')
    
    # Random Forest (ensemble)
    ax = axes[1, 2]
    rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
    rf.fit(X, y)
    
    Z = rf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ax.contourf(xx, yy, Z, alpha=0.4, cmap='RdYlBu')
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu', 
               edgecolors='black', alpha=0.7)
    ax.set_title('Random Forest\n(100 Trees Combined)', fontsize=11, fontweight='bold')
    ax.set_xlabel('Feature 1')
    ax.set_ylabel('Feature 2')
    
    plt.tight_layout()
    plt.show()

visualize_random_forest_concept()

## 4. Out-of-Bag (OOB) Error Estimation

### What is OOB Error?

**Out-of-Bag (OOB) error** is an estimate of a Random Forest's test error that requires no separate validation set.

### How it Works

1. For each tree, ~36.8% of the training samples are not in its bootstrap sample (its OOB samples)
2. Each tree predicts only its own OOB samples, which it never saw during training
3. For each data point, aggregate the predictions from the trees for which it was OOB
4. Calculate the error of these aggregated predictions against the true labels

### Advantages

- **Free validation**: No need for separate validation set
- **Efficient**: Uses all data for training and validation
- **Reliable**: Often similar to cross-validation error

### Mathematical Formulation

For each observation $i$:
$$\hat{y}_i^{OOB} = \text{aggregate}\{\hat{y}_b(x_i) : i \notin S_b\}$$

where $S_b$ is the bootstrap sample for tree $b$.

OOB Error:
$$\text{OOB Error} = \frac{1}{n} \sum_{i=1}^{n} L(y_i, \hat{y}_i^{OOB})$$
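
The formulas above can be sketched by hand with plain decision trees. This is an illustrative implementation under stated assumptions (50 trees, 0-1 loss, majority vote as the aggregate, binary labels 0 and 1), not scikit-learn's internal code; the variable names are chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
n, B = len(X), 50

# votes[i, c] counts how many trees voted class c for sample i while i was OOB
votes = np.zeros((n, 2))

for b in range(B):
    in_bag = rng.integers(0, n, size=n)            # bootstrap sample S_b (indices)
    oob = np.setdiff1d(np.arange(n), in_bag)       # samples with i not in S_b
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=b)
    tree.fit(X[in_bag], y[in_bag])
    preds = tree.predict(X[oob])                   # predict only unseen samples
    votes[oob, preds] += 1

# Aggregate by majority vote over the trees where each sample was OOB
has_oob = votes.sum(axis=1) > 0
y_oob = votes[has_oob].argmax(axis=1)
oob_error = np.mean(y_oob != y[has_oob])

print(f"Samples with at least one OOB vote: {has_oob.sum()} / {n}")
print(f"Manual OOB error: {oob_error:.4f}")
```

With only a handful of trees, some samples may never be out-of-bag, which is why the sketch masks them with `has_oob` before computing the error.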

In [None]:
# Demonstrating OOB Error Estimation
def demonstrate_oob_error():
    # Load data
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Train Random Forest with OOB score
    rf_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=42)
    rf_oob.fit(X_train, y_train)
    
    # Calculate scores
    oob_score = rf_oob.oob_score_
    train_score = rf_oob.score(X_train, y_train)
    test_score = rf_oob.score(X_test, y_test)
    
    print("Score Comparison:")
    print(f"Training Score: {train_score:.4f}")
    print(f"OOB Score: {oob_score:.4f}")
    print(f"Test Score: {test_score:.4f}")
    print(f"\nOOB vs Test Difference: {abs(oob_score - test_score):.4f}")
    
    # Visualize
    scores = [train_score, oob_score, test_score]
    labels = ['Training', 'OOB', 'Test']
    colors = ['#3498db', '#e74c3c', '#2ecc71']
    
    plt.figure(figsize=(10, 6))
    bars = plt.bar(labels, scores, color=colors, alpha=0.7, edgecolor='black')
    plt.ylabel('Accuracy Score', fontsize=12)
    plt.title('Comparison of Training, OOB, and Test Scores', fontsize=14, fontweight='bold')
    plt.ylim(0.9, 1.0)
    plt.grid(True, alpha=0.3, axis='y')
    
    # Add value labels
    for bar, score in zip(bars, scores):
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2., height,
                f'{score:.4f}', ha='center', va='bottom', fontsize=11, fontweight='bold')
    
    plt.tight_layout()
    plt.show()

demonstrate_oob_error()

In [None]:
# OOB Error vs Number of Trees
def plot_oob_error_vs_trees():
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
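    # Track errors (start at 15 trees: with very few trees some samples
    # are never out-of-bag, which makes the OOB estimate unreliable)
    n_trees = range(15, 201, 5)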
    oob_errors = []
    test_errors = []
    
    for n in n_trees:
        rf = RandomForestClassifier(n_estimators=n, oob_score=True, random_state=42)
        rf.fit(X_train, y_train)
        oob_errors.append(1 - rf.oob_score_)
        test_errors.append(1 - rf.score(X_test, y_test))
    
    # Plot
    plt.figure(figsize=(12, 6))
    plt.plot(n_trees, oob_errors, 'o-', label='OOB Error', linewidth=2, markersize=4)
    plt.plot(n_trees, test_errors, 's-', label='Test Error', linewidth=2, markersize=4)
    plt.xlabel('Number of Trees', fontsize=12)
    plt.ylabel('Error Rate', fontsize=12)
    plt.title('OOB Error vs Test Error (Number of Trees)', fontsize=14, fontweight='bold')
    plt.legend(fontsize=11)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    print(f"Final OOB Error: {oob_errors[-1]:.4f}")
    print(f"Final Test Error: {test_errors[-1]:.4f}")

plot_oob_error_vs_trees()

## 5. Feature Importance

### What is Feature Importance?

**Feature importance** measures the contribution of each feature to the model's predictions. Random Forests provide two main methods:

### 1. Mean Decrease in Impurity (MDI)

- Also called Gini importance
- Measures total reduction in node impurity by each feature
- Weighted by probability of reaching that node
- Fast to compute (available after training)

$$\text{Importance}(X_j) = \frac{1}{B} \sum_{b=1}^{B} \sum_{t \in T_b} p(t) \, \Delta i(t) \cdot \mathbb{1}(v(t) = X_j)$$

where:
- $p(t)$ is the fraction of training samples that reach node $t$
- $\Delta i(t)$ is the impurity decrease at node $t$
- $v(t)$ is the feature used for split at node $t$
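
To make the formula concrete, the sketch below recomputes MDI from the node arrays that scikit-learn exposes on each fitted tree (`tree_.impurity`, `tree_.weighted_n_node_samples`, and so on), weighting every split by the share of samples that reach it. The helper `mdi_from_tree` is hypothetical. One difference from the formula: scikit-learn also rescales each tree's importances to sum to 1, so the sketch does the same.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

def mdi_from_tree(fitted_tree, n_features):
    """Recompute one tree's normalized impurity-based importances."""
    t = fitted_tree.tree_
    importance = np.zeros(n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:   # leaf: no split, so no impurity decrease
            continue
        # Sample-weighted impurity decrease produced by this split
        decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
                    - t.weighted_n_node_samples[left] * t.impurity[left]
                    - t.weighted_n_node_samples[right] * t.impurity[right])
        importance[t.feature[node]] += decrease
    importance /= t.weighted_n_node_samples[0]   # counts -> share of samples per node
    return importance / importance.sum()         # normalize to sum to 1

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# Average over the B trees, as in the formula
manual = np.mean([mdi_from_tree(est, X.shape[1]) for est in rf.estimators_], axis=0)
print("Matches rf.feature_importances_:", np.allclose(manual, rf.feature_importances_))
```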

### 2. Mean Decrease in Accuracy (MDA)

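- Also called permutation importance
- Shuffle one feature's values and measure the resulting decrease in accuracy
- Less biased than MDI toward high-cardinality features, especially when evaluated on held-out data, but more expensive to compute
- Model-agnostic: works with any fitted estimator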
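
The shuffle-and-rescore idea is short enough to write by hand. The function below is a minimal sketch (the name `manual_permutation_importance` and its defaults are assumptions for this example); for real work, `sklearn.inspection.permutation_importance` does the same job with more options.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def manual_permutation_importance(model, X, y, n_repeats=5, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    drops = np.zeros((X.shape[1], n_repeats))
    for j in range(X.shape[1]):
        for r in range(n_repeats):
            X_perm = X.copy()
            # Break the link between feature j and the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            drops[j, r] = baseline - model.score(X_perm, y)
    return drops.mean(axis=1)

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Evaluate on held-out data so the scores reflect generalization
mda = manual_permutation_importance(model, X_te, y_te)
top = np.argsort(mda)[::-1][:5]
print("Top 5 feature indices by accuracy drop:", top)
```

Values near zero (or slightly negative) mean that shuffling the feature did not hurt held-out accuracy.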

### Advantages of Random Forest Feature Importance

- **Non-linear relationships**: Captures complex interactions
- **No assumptions**: Works with any type of features
- **Ranking**: Easy to rank features by importance
- **Feature selection**: Can use for dimensionality reduction

In [None]:
# Feature Importance Demonstration
def demonstrate_feature_importance():
    # Load data
    data = load_breast_cancer()
    X, y = data.data, data.target
    feature_names = data.feature_names
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Train Random Forest
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
    
    # Method 1: MDI (built-in)
    mdi_importance = rf.feature_importances_
    
    # Method 2: MDA (permutation)
    perm_importance = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
    mda_importance = perm_importance.importances_mean
    
    # Create DataFrame
    importance_df = pd.DataFrame({
        'Feature': feature_names,
        'MDI': mdi_importance,
        'MDA': mda_importance
    })
    
    # Sort and get top 15
    importance_df = importance_df.sort_values('MDI', ascending=False).head(15)
    
    # Plot comparison
    fig, axes = plt.subplots(1, 2, figsize=(16, 8))
    
    # MDI
    axes[0].barh(range(len(importance_df)), importance_df['MDI'], color='skyblue', edgecolor='black')
    axes[0].set_yticks(range(len(importance_df)))
    axes[0].set_yticklabels(importance_df['Feature'])
    axes[0].set_xlabel('Importance Score', fontsize=12)
    axes[0].set_title('Mean Decrease in Impurity (MDI)\nGini Importance', 
                      fontsize=13, fontweight='bold')
    axes[0].invert_yaxis()
    axes[0].grid(True, alpha=0.3, axis='x')
    
    # MDA (same 15 features as the MDI panel, re-sorted by MDA)
    importance_df_mda = importance_df.sort_values('MDA', ascending=False)
    axes[1].barh(range(len(importance_df_mda)), importance_df_mda['MDA'], 
                 color='lightcoral', edgecolor='black')
    axes[1].set_yticks(range(len(importance_df_mda)))
    axes[1].set_yticklabels(importance_df_mda['Feature'])
    axes[1].set_xlabel('Importance Score', fontsize=12)
    axes[1].set_title('Mean Decrease in Accuracy (MDA)\nPermutation Importance', 
                      fontsize=13, fontweight='bold')
    axes[1].invert_yaxis()
    axes[1].grid(True, alpha=0.3, axis='x')
    
    plt.tight_layout()
    plt.show()
    
    # Print top 10
    print("\nTop 10 Features by MDI:")
    print(importance_df[['Feature', 'MDI']].head(10).to_string(index=False))

demonstrate_feature_importance()

## 6. Implementation from Scratch

Let's implement a basic Random Forest classifier from scratch to understand the algorithm better.

In [None]:
# Simple Decision Tree Node
class Node:
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature      # Feature index for split
        self.threshold = threshold  # Threshold value for split
        self.left = left           # Left child node
        self.right = right         # Right child node
        self.value = value         # Value if leaf node

# Simple Decision Tree
class SimpleDecisionTree:
    def __init__(self, max_depth=10, min_samples_split=2, max_features=None):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.max_features = max_features
        self.root = None
    
    def fit(self, X, y):
        self.n_features = X.shape[1]
        self.root = self._grow_tree(X, y)
    
    def _grow_tree(self, X, y, depth=0):
        n_samples, n_features = X.shape
        n_classes = len(np.unique(y))
        
        # Stopping criteria
        if depth >= self.max_depth or n_classes == 1 or n_samples < self.min_samples_split:
            leaf_value = self._most_common_label(y)
            return Node(value=leaf_value)
        
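        # Find best split among a random subset of features
        # (use all features when max_features is None)
        k = n_features if self.max_features is None else min(self.max_features, n_features)
        feature_indices = np.random.choice(n_features, k, replace=False)
        best_feature, best_threshold = self._best_split(X, y, feature_indices)
        
        # No valid split found (e.g., the selected features are constant): make a leaf
        if best_feature is None:
            return Node(value=self._most_common_label(y))
        
        # Create child nodes
        left_indices = X[:, best_feature] < best_threshold
        right_indices = ~left_indices
        left = self._grow_tree(X[left_indices], y[left_indices], depth + 1)
        right = self._grow_tree(X[right_indices], y[right_indices], depth + 1)
        
        return Node(best_feature, best_threshold, left, right)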
    
    def _best_split(self, X, y, feature_indices):
        best_gini = float('inf')
        best_feature = None
        best_threshold = None
        
        for feature in feature_indices:
            thresholds = np.unique(X[:, feature])
            for threshold in thresholds:
                gini = self._gini_impurity(X[:, feature], y, threshold)
                if gini < best_gini:
                    best_gini = gini
                    best_feature = feature
                    best_threshold = threshold
        
        return best_feature, best_threshold
    
    def _gini_impurity(self, X_column, y, threshold):
        left_indices = X_column < threshold
        right_indices = ~left_indices
        
        n = len(y)
        n_left, n_right = np.sum(left_indices), np.sum(right_indices)
        
        if n_left == 0 or n_right == 0:
            return float('inf')
        
        gini_left = self._gini(y[left_indices])
        gini_right = self._gini(y[right_indices])
        
        weighted_gini = (n_left / n) * gini_left + (n_right / n) * gini_right
        return weighted_gini
    
    def _gini(self, y):
        proportions = np.bincount(y) / len(y)
        return 1 - np.sum(proportions ** 2)
    
    def _most_common_label(self, y):
        return np.bincount(y).argmax()
    
    def predict(self, X):
        return np.array([self._traverse_tree(x, self.root) for x in X])
    
    def _traverse_tree(self, x, node):
        if node.value is not None:
            return node.value
        
        if x[node.feature] < node.threshold:
            return self._traverse_tree(x, node.left)
        return self._traverse_tree(x, node.right)

In [None]:
# Simple Random Forest Implementation
class SimpleRandomForest:
    def __init__(self, n_estimators=100, max_depth=10, min_samples_split=2, max_features='sqrt'):
        self.n_estimators = n_estimators
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.max_features = max_features
        self.trees = []
    
    def fit(self, X, y):
        self.trees = []
        n_features = X.shape[1]
        
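        # Determine max_features (number of features tried at each split)
        if self.max_features == 'sqrt':
            max_features = max(1, int(np.sqrt(n_features)))
        elif self.max_features == 'log2':
            max_features = max(1, int(np.log2(n_features)))
        elif isinstance(self.max_features, int):
            max_features = min(self.max_features, n_features)
        else:
            max_features = n_features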
        
        # Train trees
        for _ in range(self.n_estimators):
            # Bootstrap sample
            indices = np.random.choice(len(X), size=len(X), replace=True)
            X_bootstrap = X[indices]
            y_bootstrap = y[indices]
            
            # Train tree
            tree = SimpleDecisionTree(
                max_depth=self.max_depth,
                min_samples_split=self.min_samples_split,
                max_features=max_features
            )
            tree.fit(X_bootstrap, y_bootstrap)
            self.trees.append(tree)
    
    def predict(self, X):
        # Get predictions from all trees
        tree_predictions = np.array([tree.predict(X) for tree in self.trees])
        
        # Majority vote
        predictions = []
        for i in range(X.shape[0]):
            predictions.append(np.bincount(tree_predictions[:, i].astype(int)).argmax())
        
        return np.array(predictions)
    
    def score(self, X, y):
        predictions = self.predict(X)
        return np.mean(predictions == y)

# Test our implementation
print("Testing Simple Random Forest Implementation...\n")

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train our Random Forest
print("Training custom Random Forest...")
custom_rf = SimpleRandomForest(n_estimators=50, max_depth=5)
custom_rf.fit(X_train, y_train)
custom_accuracy = custom_rf.score(X_test, y_test)

# Compare with sklearn
print("Training sklearn Random Forest...")
sklearn_rf = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42)
sklearn_rf.fit(X_train, y_train)
sklearn_accuracy = sklearn_rf.score(X_test, y_test)

print(f"\nResults:")
print(f"Custom Random Forest Accuracy: {custom_accuracy:.4f}")
print(f"Sklearn Random Forest Accuracy: {sklearn_accuracy:.4f}")
print(f"Difference: {abs(custom_accuracy - sklearn_accuracy):.4f}")

## 7. Scikit-learn Implementation

### Random Forest for Classification

Scikit-learn provides `RandomForestClassifier` for classification tasks:

```python
from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(
    n_estimators=100,      # Number of trees
    max_depth=None,        # Maximum depth of trees
    max_features='sqrt',   # Number of features to consider
    min_samples_split=2,   # Minimum samples to split
    min_samples_leaf=1,    # Minimum samples in leaf
    bootstrap=True,        # Use bootstrap samples
    oob_score=False,       # Calculate OOB score
    random_state=42
)
```

### Random Forest for Regression

For regression tasks, use `RandomForestRegressor`:

```python
from sklearn.ensemble import RandomForestRegressor

rf_reg = RandomForestRegressor(
    n_estimators=100,
    max_features=1.0,      # All features (regression default); 'auto' was removed in scikit-learn 1.3
    random_state=42
)
```

In [None]:
# Classification Example: Iris Dataset
def classification_example():
    print("=" * 70)
    print("RANDOM FOREST CLASSIFICATION EXAMPLE")
    print("=" * 70)
    
    # Load data
    data = load_iris()
    X, y = data.data, data.target
    feature_names = data.feature_names
    target_names = data.target_names
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Train Random Forest
    rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf_clf.fit(X_train, y_train)
    
    # Predictions
    y_pred = rf_clf.predict(X_test)
    
    # Evaluation
    accuracy = accuracy_score(y_test, y_pred)
    print(f"\nAccuracy: {accuracy:.4f}")
    
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=target_names))
    
    # Confusion Matrix
    cm = confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
                xticklabels=target_names, yticklabels=target_names)
    plt.title('Confusion Matrix - Iris Classification', fontsize=14, fontweight='bold')
    plt.ylabel('True Label', fontsize=12)
    plt.xlabel('Predicted Label', fontsize=12)
    plt.tight_layout()
    plt.show()
    
    # Feature Importance
    importance = rf_clf.feature_importances_
    indices = np.argsort(importance)[::-1]
    
    plt.figure(figsize=(10, 6))
    plt.bar(range(len(importance)), importance[indices], color='skyblue', edgecolor='black')
    plt.xticks(range(len(importance)), [feature_names[i] for i in indices], rotation=45, ha='right')
    plt.xlabel('Features', fontsize=12)
    plt.ylabel('Importance', fontsize=12)
    plt.title('Feature Importance - Iris Classification', fontsize=14, fontweight='bold')
    plt.grid(True, alpha=0.3, axis='y')
    plt.tight_layout()
    plt.show()

classification_example()

In [None]:
# Regression Example: Diabetes Dataset
def regression_example():
    print("=" * 70)
    print("RANDOM FOREST REGRESSION EXAMPLE")
    print("=" * 70)
    
    # Load data
    data = load_diabetes()
    X, y = data.data, data.target
    feature_names = data.feature_names
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Train Random Forest
    rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
    rf_reg.fit(X_train, y_train)
    
    # Predictions
    y_pred = rf_reg.predict(X_test)
    
    # Evaluation
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)
    
    print(f"\nMean Squared Error: {mse:.2f}")
    print(f"Root Mean Squared Error: {rmse:.2f}")
    print(f"R² Score: {r2:.4f}")
    
    # Visualize predictions
    fig, axes = plt.subplots(1, 2, figsize=(16, 6))
    
    # Actual vs Predicted
    axes[0].scatter(y_test, y_pred, alpha=0.6, edgecolors='black')
    axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
                 'r--', linewidth=2, label='Perfect Prediction')
    axes[0].set_xlabel('Actual Values', fontsize=12)
    axes[0].set_ylabel('Predicted Values', fontsize=12)
    axes[0].set_title('Actual vs Predicted Values', fontsize=13, fontweight='bold')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # Residuals
    residuals = y_test - y_pred
    axes[1].scatter(y_pred, residuals, alpha=0.6, edgecolors='black')
    axes[1].axhline(y=0, color='r', linestyle='--', linewidth=2)
    axes[1].set_xlabel('Predicted Values', fontsize=12)
    axes[1].set_ylabel('Residuals', fontsize=12)
    axes[1].set_title('Residual Plot', fontsize=13, fontweight='bold')
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Feature Importance
    importance = rf_reg.feature_importances_
    indices = np.argsort(importance)[::-1]
    
    plt.figure(figsize=(10, 6))
    plt.bar(range(len(importance)), importance[indices], color='lightcoral', edgecolor='black')
    plt.xticks(range(len(importance)), [feature_names[i] for i in indices], rotation=45, ha='right')
    plt.xlabel('Features', fontsize=12)
    plt.ylabel('Importance', fontsize=12)
    plt.title('Feature Importance - Diabetes Regression', fontsize=14, fontweight='bold')
    plt.grid(True, alpha=0.3, axis='y')
    plt.tight_layout()
    plt.show()

regression_example()

## 8. Hyperparameter Tuning

### Key Hyperparameters

#### 1. n_estimators
- **Definition**: Number of trees in the forest
- **Effect**: More trees → lower variance and more stable predictions, but slower training and prediction
- **Typical range**: 100-500
- **Note**: Accuracy plateaus beyond a certain point; adding trees does not by itself cause overfitting

#### 2. max_depth
- **Definition**: Maximum depth of each tree
- **Effect**: Deeper trees → more complex model, risk of overfitting
- **Typical range**: 5-20 or None (unlimited)

#### 3. max_features
- **Definition**: Number of features to consider at each split
- **Options**:
  - 'sqrt': $\sqrt{n\_features}$ (default for classification)
  - 'log2': $\log_2(n\_features)$
  - None: All features (equivalent to 1.0, the default for regression)
  - int: Specific number of features
  - float: Fraction of features (e.g., 0.3 uses 30%)
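
To see how each option translates into a feature count, the sketch below uses a hypothetical helper, `resolve_max_features`, and compares it with the `max_features_` attribute that a fitted scikit-learn tree reports. The 30-feature synthetic dataset is an assumption for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def resolve_max_features(max_features, n_features):
    """Hypothetical helper: number of features considered at each split."""
    if max_features is None:
        return n_features
    if max_features == "sqrt":
        return max(1, int(np.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(np.log2(n_features)))
    if isinstance(max_features, float):    # fraction of the features
        return max(1, int(max_features * n_features))
    return max_features                    # already an integer count

X, y = make_classification(n_samples=200, n_features=30, random_state=0)

for option in ("sqrt", "log2", 0.5, 7, None):
    tree = DecisionTreeClassifier(max_features=option, random_state=0).fit(X, y)
    helper = resolve_max_features(option, X.shape[1])
    print(f"{option!s:>5}: helper={helper}, sklearn={tree.max_features_}")
```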

#### 4. min_samples_split
- **Definition**: Minimum samples required to split a node
- **Effect**: Higher values → simpler trees, less overfitting
- **Typical range**: 2-20

#### 5. min_samples_leaf
- **Definition**: Minimum samples required in leaf node
- **Effect**: Higher values → smoother decision boundary
- **Typical range**: 1-10

### Tuning Strategies

1. **Grid Search**: Exhaustive search over parameter grid
2. **Random Search**: Random sampling from parameter distributions
3. **Bayesian Optimization**: Smart search using probabilistic model
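
As one example of the strategies listed above, random search can be run with scikit-learn's `RandomizedSearchCV`. The search space, `n_iter=8`, and 3-fold CV below are illustrative assumptions chosen to keep the run short, not tuned recommendations.

```python
from scipy.stats import randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Illustrative search space (ranges are assumptions, not recommendations)
param_distributions = {
    "n_estimators": randint(50, 201),      # integers in [50, 200]
    "max_depth": [None, 5, 10, 20],
    "max_features": ["sqrt", "log2", 0.5],
    "min_samples_leaf": randint(1, 11),    # integers in [1, 10]
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=8,              # number of sampled configurations
    cv=3,                  # 3-fold cross-validation per configuration
    scoring="accuracy",
    random_state=42,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.4f}")
```

Random search evaluates a fixed budget of configurations, so its cost stays the same as more hyperparameters are added, whereas a full grid grows multiplicatively.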

In [None]:
# Effect of n_estimators
def tune_n_estimators():
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    n_estimators_range = [1, 5, 10, 20, 50, 100, 200, 300, 500]
    train_scores = []
    test_scores = []
    
    for n in n_estimators_range:
        rf = RandomForestClassifier(n_estimators=n, random_state=42)
        rf.fit(X_train, y_train)
        train_scores.append(rf.score(X_train, y_train))
        test_scores.append(rf.score(X_test, y_test))
    
    plt.figure(figsize=(10, 6))
    plt.plot(n_estimators_range, train_scores, 'o-', label='Training Score', linewidth=2, markersize=8)
    plt.plot(n_estimators_range, test_scores, 's-', label='Test Score', linewidth=2, markersize=8)
    plt.xlabel('Number of Estimators', fontsize=12)
    plt.ylabel('Accuracy Score', fontsize=12)
    plt.title('Effect of n_estimators on Model Performance', fontsize=14, fontweight='bold')
    plt.legend(fontsize=11)
    plt.grid(True, alpha=0.3)
    plt.xscale('log')
    plt.tight_layout()
    plt.show()
    
    print("\nPerformance vs n_estimators:")
    for n, train, test in zip(n_estimators_range, train_scores, test_scores):
        print(f"n={n:3d}: Train={train:.4f}, Test={test:.4f}")

tune_n_estimators()

In [None]:
# Effect of max_depth
def tune_max_depth():
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    max_depth_range = [1, 2, 3, 5, 7, 10, 15, 20, None]
    train_scores = []
    test_scores = []
    
    for depth in max_depth_range:
        rf = RandomForestClassifier(n_estimators=100, max_depth=depth, random_state=42)
        rf.fit(X_train, y_train)
        train_scores.append(rf.score(X_train, y_train))
        test_scores.append(rf.score(X_test, y_test))
    
    # For plotting, replace None with a large number
    plot_depths = [d if d is not None else 25 for d in max_depth_range]
    
    plt.figure(figsize=(10, 6))
    plt.plot(plot_depths, train_scores, 'o-', label='Training Score', linewidth=2, markersize=8)
    plt.plot(plot_depths, test_scores, 's-', label='Test Score', linewidth=2, markersize=8)
    plt.xlabel('Maximum Depth', fontsize=12)
    plt.ylabel('Accuracy Score', fontsize=12)
    plt.title('Effect of max_depth on Model Performance', fontsize=14, fontweight='bold')
    plt.legend(fontsize=11)
    plt.grid(True, alpha=0.3)
    plt.xticks(plot_depths, [str(d) if d is not None else 'None' for d in max_depth_range])
    plt.tight_layout()
    plt.show()
    
    print("\nPerformance vs max_depth:")
    for depth, train, test in zip(max_depth_range, train_scores, test_scores):
        depth_str = str(depth) if depth is not None else 'None'
        print(f"depth={depth_str:>4s}: Train={train:.4f}, Test={test:.4f}")

tune_max_depth()

In [None]:
# Effect of max_features
def tune_max_features():
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    n_features = X.shape[1]
    max_features_range = ['sqrt', 'log2', 0.3, 0.5, 0.7, 1.0]
    train_scores = []
    test_scores = []
    labels = []
    
    for mf in max_features_range:
        rf = RandomForestClassifier(n_estimators=100, max_features=mf, random_state=42)
        rf.fit(X_train, y_train)
        train_scores.append(rf.score(X_train, y_train))
        test_scores.append(rf.score(X_test, y_test))
        
        if isinstance(mf, str):
            labels.append(mf)
        else:
            labels.append(f'{int(mf*n_features)}')
    
    x_pos = np.arange(len(max_features_range))
    width = 0.35
    
    plt.figure(figsize=(12, 6))
    plt.bar(x_pos - width/2, train_scores, width, label='Training Score', 
            color='skyblue', edgecolor='black')
    plt.bar(x_pos + width/2, test_scores, width, label='Test Score', 
            color='lightcoral', edgecolor='black')
    plt.xlabel('max_features', fontsize=12)
    plt.ylabel('Accuracy Score', fontsize=12)
    plt.title('Effect of max_features on Model Performance', fontsize=14, fontweight='bold')
    plt.xticks(x_pos, labels)
    plt.legend(fontsize=11)
    plt.grid(True, alpha=0.3, axis='y')
    plt.ylim(0.9, 1.0)
    plt.tight_layout()
    plt.show()
    
    print("\nPerformance vs max_features:")
    for mf, label, train, test in zip(max_features_range, labels, train_scores, test_scores):
        print(f"max_features={label:>4s}: Train={train:.4f}, Test={test:.4f}")

tune_max_features()

In [None]:
# Grid Search for Best Parameters
def grid_search_tuning():
    print("=" * 70)
    print("GRID SEARCH HYPERPARAMETER TUNING")
    print("=" * 70)
    
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Define parameter grid
    param_grid = {
        'n_estimators': [50, 100, 200],
        'max_depth': [5, 10, 20, None],
        'max_features': ['sqrt', 'log2'],
        'min_samples_split': [2, 5, 10],
        'min_samples_leaf': [1, 2, 4]
    }
    
    print("\nSearching over parameter grid:")
    print(param_grid)
    print(f"\nTotal combinations: {np.prod([len(v) for v in param_grid.values()])}")
    
    # Create and fit grid search
    rf = RandomForestClassifier(random_state=42)
    grid_search = GridSearchCV(rf, param_grid, cv=5, scoring='accuracy', n_jobs=-1, verbose=1)
    grid_search.fit(X_train, y_train)
    
    # Results
    print(f"\nBest Parameters:")
    for param, value in grid_search.best_params_.items():
        print(f"  {param}: {value}")
    
    print(f"\nBest Cross-Validation Score: {grid_search.best_score_:.4f}")
    
    # Test set performance
    best_rf = grid_search.best_estimator_
    test_score = best_rf.score(X_test, y_test)
    print(f"Test Set Score: {test_score:.4f}")
    
    # Compare with default parameters
    default_rf = RandomForestClassifier(random_state=42)
    default_rf.fit(X_train, y_train)
    default_score = default_rf.score(X_test, y_test)
    
    print(f"\nDefault Parameters Test Score: {default_score:.4f}")
    print(f"Improvement: {(test_score - default_score)*100:.2f} percentage points")
    
    # Visualize top parameter combinations
    results_df = pd.DataFrame(grid_search.cv_results_)
    results_df = results_df.sort_values('rank_test_score').head(10)
    
    plt.figure(figsize=(12, 6))
    plt.barh(range(len(results_df)), results_df['mean_test_score'], color='skyblue', edgecolor='black')
    plt.yticks(range(len(results_df)), [f"Rank {i+1}" for i in range(len(results_df))])
    plt.xlabel('Mean CV Score', fontsize=12)
    plt.ylabel('Parameter Combination Rank', fontsize=12)
    plt.title('Top 10 Parameter Combinations', fontsize=14, fontweight='bold')
    plt.grid(True, alpha=0.3, axis='x')
    plt.tight_layout()
    plt.show()

grid_search_tuning()

## 9. Comparison with Decision Tree

### Why Random Forest is Better

| Aspect | Decision Tree | Random Forest |
|--------|---------------|---------------|
| **Overfitting** | High risk (a fully grown tree can memorize the training data) | Lower risk (averaging reduces variance) |
| **Stability** | Sensitive to data changes | Robust and stable |
| **Accuracy** | Good | Usually better (ensemble effect) |
| **Interpretability** | High (single tree) | Lower (multiple trees) |
| **Training Time** | Fast | Slower (multiple trees) |
| **Prediction Time** | Fast | Slower (multiple predictions) |
| **Feature Importance** | Unstable (depends on one tree's splits) | More reliable (averaged over trees) |

### When to Use Each

**Use Decision Tree when:**
- Interpretability is crucial
- Dataset is small
- Speed is critical
- Simple relationships expected

**Use Random Forest when:**
- Accuracy is the priority
- Dataset is large
- Complex relationships expected
- Robustness to noise needed
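
The speed trade-off listed above can be checked with a quick timing sketch. The helper `time_model` is a hypothetical name introduced here, and the absolute numbers depend on the machine, so treat the output as indicative only.

```python
import time
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

def time_model(model):
    """Return (fit_seconds, predict_seconds) for one model (illustrative helper)."""
    start = time.perf_counter()
    model.fit(X, y)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    model.predict(X)
    predict_seconds = time.perf_counter() - start
    return fit_seconds, predict_seconds

timings = {
    "Decision Tree": time_model(DecisionTreeClassifier(random_state=42)),
    "Random Forest": time_model(
        RandomForestClassifier(n_estimators=100, random_state=42)
    ),
}
for name, (fit_s, pred_s) in timings.items():
    print(f"{name:<15} fit={fit_s:.4f}s  predict={pred_s:.4f}s")
```

Passing `n_jobs=-1` to `RandomForestClassifier` trains the trees in parallel and usually narrows the gap on multi-core machines.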

In [None]:
# Comprehensive Comparison
def compare_tree_vs_forest():
    print("=" * 70)
    print("DECISION TREE VS RANDOM FOREST COMPARISON")
    print("=" * 70)
    
    # Load data
    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Train models
    dt = DecisionTreeClassifier(random_state=42)
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    
    dt.fit(X_train, y_train)
    rf.fit(X_train, y_train)
    
    # Predictions
    dt_train_score = dt.score(X_train, y_train)
    dt_test_score = dt.score(X_test, y_test)
    rf_train_score = rf.score(X_train, y_train)
    rf_test_score = rf.score(X_test, y_test)
    
    # Cross-validation
    dt_cv_scores = cross_val_score(dt, X, y, cv=5)
    rf_cv_scores = cross_val_score(rf, X, y, cv=5)
    
    print("\nPerformance Metrics:")
    print("-" * 70)
    print(f"{'Metric':<25} {'Decision Tree':>20} {'Random Forest':>20}")
    print("-" * 70)
    print(f"{'Training Accuracy':<25} {dt_train_score:>20.4f} {rf_train_score:>20.4f}")
    print(f"{'Test Accuracy':<25} {dt_test_score:>20.4f} {rf_test_score:>20.4f}")
    print(f"{'CV Mean':<25} {dt_cv_scores.mean():>20.4f} {rf_cv_scores.mean():>20.4f}")
    print(f"{'CV Std':<25} {dt_cv_scores.std():>20.4f} {rf_cv_scores.std():>20.4f}")
    print(f"{'Overfit (Train-Test)':<25} {(dt_train_score-dt_test_score):>20.4f} {(rf_train_score-rf_test_score):>20.4f}")
    print("-" * 70)
    
    # Visualization
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    
    # 1. Training vs Test Accuracy
    models = ['Decision Tree', 'Random Forest']
    train_scores = [dt_train_score, rf_train_score]
    test_scores = [dt_test_score, rf_test_score]
    
    x = np.arange(len(models))
    width = 0.35
    
    axes[0, 0].bar(x - width/2, train_scores, width, label='Training', color='skyblue', edgecolor='black')
    axes[0, 0].bar(x + width/2, test_scores, width, label='Test', color='lightcoral', edgecolor='black')
    axes[0, 0].set_ylabel('Accuracy', fontsize=11)
    axes[0, 0].set_title('Training vs Test Accuracy', fontsize=12, fontweight='bold')
    axes[0, 0].set_xticks(x)
    axes[0, 0].set_xticklabels(models)
    axes[0, 0].legend()
    axes[0, 0].grid(True, alpha=0.3, axis='y')
    axes[0, 0].set_ylim(0.9, 1.0)
    
    # 2. Cross-Validation Scores
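    # Tick labels are set separately because the boxplot `labels` argument
    # is deprecated in newer Matplotlib releases
    axes[0, 1].boxplot([dt_cv_scores, rf_cv_scores],
                       patch_artist=True,
                       boxprops=dict(facecolor='lightblue', edgecolor='black'),
                       medianprops=dict(color='red', linewidth=2))
    axes[0, 1].set_xticklabels(models)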
    axes[0, 1].set_ylabel('Accuracy', fontsize=11)
    axes[0, 1].set_title('Cross-Validation Scores (5-Fold)', fontsize=12, fontweight='bold')
    axes[0, 1].grid(True, alpha=0.3, axis='y')
    
    # 3. Overfitting comparison
    overfitting = [dt_train_score - dt_test_score, rf_train_score - rf_test_score]
    colors = ['#e74c3c' if o > 0.05 else '#2ecc71' for o in overfitting]
    axes[1, 0].bar(models, overfitting, color=colors, edgecolor='black')
    axes[1, 0].set_ylabel('Overfitting Gap\n(Train - Test)', fontsize=11)
    axes[1, 0].set_title('Overfitting Comparison', fontsize=12, fontweight='bold')
    axes[1, 0].axhline(y=0.05, color='orange', linestyle='--', linewidth=2, label='Threshold')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3, axis='y')
    
    # 4. Learning curves
    train_sizes = np.linspace(0.1, 1.0, 10)
    dt_train_scores_lc = []
    dt_test_scores_lc = []
    rf_train_scores_lc = []
    rf_test_scores_lc = []
    
    for size in train_sizes:
        n_samples = int(len(X_train) * size)
        X_subset = X_train[:n_samples]
        y_subset = y_train[:n_samples]
        
        dt_temp = DecisionTreeClassifier(random_state=42)
        dt_temp.fit(X_subset, y_subset)
        dt_train_scores_lc.append(dt_temp.score(X_subset, y_subset))
        dt_test_scores_lc.append(dt_temp.score(X_test, y_test))
        
        rf_temp = RandomForestClassifier(n_estimators=100, random_state=42)
        rf_temp.fit(X_subset, y_subset)
        rf_train_scores_lc.append(rf_temp.score(X_subset, y_subset))
        rf_test_scores_lc.append(rf_temp.score(X_test, y_test))
    
    axes[1, 1].plot(train_sizes * len(X_train), dt_test_scores_lc, 'o-', 
                    label='Decision Tree', linewidth=2, markersize=6)
    axes[1, 1].plot(train_sizes * len(X_train), rf_test_scores_lc, 's-', 
                    label='Random Forest', linewidth=2, markersize=6)
    axes[1, 1].set_xlabel('Training Set Size', fontsize=11)
    axes[1, 1].set_ylabel('Test Accuracy', fontsize=11)
    axes[1, 1].set_title('Learning Curves', fontsize=12, fontweight='bold')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()

compare_tree_vs_forest()

## 10. Variable Importance Plots

Variable importance plots help us understand:
1. Which features contribute most to predictions
2. Which features can be removed (feature selection)
3. Domain insights about the problem

### Types of Importance Plots

1. **Bar Plot**: Standard horizontal/vertical bars
2. **Cumulative Importance**: Shows cumulative contribution
3. **Grouped Importance**: Groups related features
4. **Comparison Plot**: Compares different importance methods
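
Grouped importance (item 3) can be sketched by summing the importances of related features. The grouping rule below relies on the naming pattern of the breast cancer dataset (`mean ...`, `... error`, `worst ...`); the helper `feature_group` is hypothetical, and summing impurity-based importances across correlated features is only a rough summary.

```python
from collections import defaultdict
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(data.data, data.target)

def feature_group(name):
    """Map a feature name to its measurement group (assumed naming pattern)."""
    if name.startswith("mean "):
        return "mean"
    if name.endswith(" error"):
        return "error"
    if name.startswith("worst "):
        return "worst"
    return "other"

# Sum the MDI importances of all features that fall in the same group
grouped = defaultdict(float)
for name, importance in zip(data.feature_names, rf.feature_importances_):
    grouped[feature_group(str(name))] += float(importance)

for group, total in sorted(grouped.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{group:<6s}: {total:.3f}")
```

Because MDI importances sum to 1 across features, the group totals also sum to 1 and can be read as shares.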

In [None]:
# Comprehensive Variable Importance Visualization
def comprehensive_importance_plots():
    # Load data
    data = load_breast_cancer()
    X, y = data.data, data.target
    feature_names = data.feature_names
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Train model
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X_train, y_train)
    
    # Get importances
    mdi_importance = rf.feature_importances_
    perm_result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
    perm_importance = perm_result.importances_mean
    
    # Create figure
    fig = plt.figure(figsize=(18, 12))
    gs = fig.add_gridspec(2, 2, hspace=0.3, wspace=0.3)
    
    # 1. Top features bar plot
    ax1 = fig.add_subplot(gs[0, 0])
    indices = np.argsort(mdi_importance)[::-1][:15]
    ax1.barh(range(len(indices)), mdi_importance[indices], color='skyblue', edgecolor='black')
    ax1.set_yticks(range(len(indices)))
    ax1.set_yticklabels([feature_names[i] for i in indices])
    ax1.set_xlabel('Importance Score', fontsize=11)
    ax1.set_title('Top 15 Features by MDI Importance', fontsize=12, fontweight='bold')
    ax1.invert_yaxis()
    ax1.grid(True, alpha=0.3, axis='x')
    
    # 2. Cumulative importance
    ax2 = fig.add_subplot(gs[0, 1])
    sorted_importance = np.sort(mdi_importance)[::-1]
    cumulative_importance = np.cumsum(sorted_importance)
    ax2.plot(range(1, len(cumulative_importance) + 1), cumulative_importance, 
             'o-', linewidth=2, markersize=4, color='darkblue')
    ax2.axhline(y=0.95, color='r', linestyle='--', linewidth=2, label='95% threshold')
    ax2.axhline(y=0.90, color='orange', linestyle='--', linewidth=2, label='90% threshold')
    ax2.set_xlabel('Number of Features', fontsize=11)
    ax2.set_ylabel('Cumulative Importance', fontsize=11)
    ax2.set_title('Cumulative Feature Importance', fontsize=12, fontweight='bold')
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    
    # Find number of features for 95% importance
    n_features_95 = np.argmax(cumulative_importance >= 0.95) + 1
    ax2.axvline(x=n_features_95, color='r', linestyle=':', alpha=0.5)
    ax2.text(n_features_95, 0.5, f'{n_features_95} features', rotation=90, 
             verticalalignment='center', fontsize=10)
    
    # 3. MDI vs Permutation importance
    ax3 = fig.add_subplot(gs[1, 0])
    indices = np.argsort(mdi_importance)[::-1][:15]
    x = np.arange(len(indices))
    width = 0.35
    
    ax3.barh(x - width/2, mdi_importance[indices], width, label='MDI', 
             color='skyblue', edgecolor='black')
    ax3.barh(x + width/2, perm_importance[indices], width, label='Permutation', 
             color='lightcoral', edgecolor='black')
    ax3.set_yticks(x)
    ax3.set_yticklabels([feature_names[i] for i in indices])
    ax3.set_xlabel('Importance Score', fontsize=11)
    ax3.set_title('MDI vs Permutation Importance', fontsize=12, fontweight='bold')
    ax3.legend()
    ax3.invert_yaxis()
    ax3.grid(True, alpha=0.3, axis='x')
    
    # 4. Feature importance with error bars (permutation)
    ax4 = fig.add_subplot(gs[1, 1])
    indices = np.argsort(perm_importance)[::-1][:15]
    perm_std = perm_result.importances_std[indices]
    
    ax4.barh(range(len(indices)), perm_importance[indices], 
             xerr=perm_std, color='lightgreen', edgecolor='black', capsize=5)
    ax4.set_yticks(range(len(indices)))
    ax4.set_yticklabels([feature_names[i] for i in indices])
    ax4.set_xlabel('Permutation Importance', fontsize=11)
    ax4.set_title('Permutation Importance with Std Dev', fontsize=12, fontweight='bold')
    ax4.invert_yaxis()
    ax4.grid(True, alpha=0.3, axis='x')
    
    plt.show()
    
    # Print feature selection recommendations
    print(f"\nFeature Selection Recommendations:")
    print(f"For 90% importance: Use top {np.argmax(cumulative_importance >= 0.90) + 1} features")
    print(f"For 95% importance: Use top {n_features_95} features")
    print(f"For 99% importance: Use top {np.argmax(cumulative_importance >= 0.99) + 1} features")

comprehensive_importance_plots()

## 11. Partial Dependence Plots

### What are Partial Dependence Plots (PDP)?

**Partial Dependence Plots** show the marginal effect of a feature on the predicted outcome. A PDP:
- Shows how the average prediction changes as the feature varies
- Marginalizes (averages) over all other features
- Helps reveal whether a feature's effect is linear, monotonic, or more complex
- Is useful for model interpretation

### Mathematical Definition

For feature $x_s$, the partial dependence function is:

$$PD(x_s) = \mathbb{E}_{x_c}[\hat{f}(x_s, x_c)] = \int \hat{f}(x_s, x_c) p(x_c) dx_c$$

where:
- $x_s$ is the feature of interest
- $x_c$ are all other features
- $\hat{f}$ is the model prediction
- $p(x_c)$ is the marginal distribution of $x_c$
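
In practice the integral is usually approximated by averaging over the $n$ rows of the dataset (a Monte Carlo estimate). The notation $x_c^{(i)}$, meaning the values of the other features in row $i$, is introduced here for illustration:

$$\widehat{PD}(x_s) = \frac{1}{n} \sum_{i=1}^{n} \hat{f}(x_s, x_c^{(i)})$$

Each term fixes the feature of interest at $x_s$, keeps the remaining features of row $i$ as observed, and evaluates the model; the average over rows is the partial dependence at $x_s$.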

### Types of PDPs

1. **1D PDP**: Effect of single feature
2. **2D PDP**: Interaction between two features
3. **ICE plots**: Individual Conditional Expectation (shows variability)
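
To make the relationship between ICE curves and a 1D PDP concrete, here is a minimal manual sketch. It assumes an evenly spaced grid between the feature's minimum and maximum and uses the predicted probability of class 1; scikit-learn's own implementation chooses its grid differently, so the curves will not match it exactly.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
X, y = data.data, data.target
rf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

feature_idx = 0  # 'mean radius', picked only for illustration
grid = np.linspace(X[:, feature_idx].min(), X[:, feature_idx].max(), 20)

# ICE: one curve per sample, obtained by overwriting the feature
# with each grid value while leaving the other features unchanged
ice = np.empty((X.shape[0], grid.size))
for j, value in enumerate(grid):
    X_mod = X.copy()
    X_mod[:, feature_idx] = value
    ice[:, j] = rf.predict_proba(X_mod)[:, 1]

# 1D PDP: the average of all ICE curves at each grid value
pdp = ice.mean(axis=0)

print(f"ICE array shape: {ice.shape}")
print(f"PDP range: {pdp.min():.3f} to {pdp.max():.3f}")
```

Plotting the rows of `ice` as thin lines with `pdp` drawn on top gives the same kind of picture as `kind='both'` in `PartialDependenceDisplay`.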

In [None]:
# Partial Dependence Plots
def create_partial_dependence_plots():
    # Load data
    data = load_breast_cancer()
    X, y = data.data, data.target
    feature_names = data.feature_names
    
    # Train model
    rf = RandomForestClassifier(n_estimators=100, random_state=42)
    rf.fit(X, y)
    
    # Get top 4 important features
    importances = rf.feature_importances_
    indices = np.argsort(importances)[::-1][:4]
    top_features = indices.tolist()
    
    print("Creating Partial Dependence Plots for top features:")
    for idx in top_features:
        print(f"  - {feature_names[idx]}")
    
    # Create 1D PDPs
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    axes = axes.ravel()
    
    for i, feature_idx in enumerate(top_features):
        display = PartialDependenceDisplay.from_estimator(
            rf, X, [feature_idx],
            feature_names=feature_names,
            ax=axes[i],
            kind='both',  # Shows both PD and ICE
            random_state=42
        )
        axes[i].set_title(f'PDP: {feature_names[feature_idx]}', 
                         fontsize=12, fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    # Create 2D PDP (interaction between top 2 features)
    print(f"\nCreating 2D PDP for interaction between:")
    print(f"  - {feature_names[top_features[0]]}")
    print(f"  - {feature_names[top_features[1]]}")
    
    fig, ax = plt.subplots(figsize=(10, 8))
    display = PartialDependenceDisplay.from_estimator(
        rf, X, [(top_features[0], top_features[1])],
        feature_names=feature_names,
        ax=ax,
        kind='average'
    )
    plt.tight_layout()
    plt.show()

create_partial_dependence_plots()

## 12. Real-world Applications

### Common Applications

1. **Healthcare**
   - Disease prediction
   - Patient risk assessment
   - Drug discovery
   - Medical image analysis

2. **Finance**
   - Credit scoring
   - Fraud detection
   - Stock market prediction
   - Risk management

3. **E-commerce**
   - Customer churn prediction
   - Recommendation systems
   - Price optimization
   - Demand forecasting

4. **Marketing**
   - Customer segmentation
   - Campaign optimization
   - Lead scoring
   - Click-through rate prediction

5. **Manufacturing**
   - Quality control
   - Predictive maintenance
   - Process optimization
   - Defect detection

### Case Study: Credit Risk Assessment

In [None]:
# Credit Risk Assessment Case Study
def credit_risk_case_study():
    print("=" * 70)
    print("CASE STUDY: CREDIT RISK ASSESSMENT")
    print("=" * 70)
    
    # Simulate credit data
    np.random.seed(42)
    n_samples = 1000
    
    # Features
    age = np.random.randint(18, 70, n_samples)
    income = np.random.exponential(50000, n_samples)
    debt_ratio = np.random.uniform(0, 1, n_samples)
    credit_score = np.random.normal(650, 100, n_samples)
    num_accounts = np.random.randint(0, 10, n_samples)
    
    # Create target (default risk)
    # Higher risk if: low income, high debt ratio, low credit score
    risk_score = (
        -0.00001 * income +
        0.3 * debt_ratio +
        -0.002 * credit_score +
        np.random.normal(0, 0.1, n_samples)
    )
    default = (risk_score > np.median(risk_score)).astype(int)
    
    # Create DataFrame
    df = pd.DataFrame({
        'age': age,
        'income': income,
        'debt_ratio': debt_ratio,
        'credit_score': credit_score,
        'num_accounts': num_accounts,
        'default': default
    })
    
    print("\nDataset Overview:")
    print(df.describe())
    print(f"\nDefault Rate: {default.mean()*100:.2f}%")
    
    # Prepare data
    X = df.drop('default', axis=1).values
    y = df['default'].values
    feature_names = df.drop('default', axis=1).columns.tolist()
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Train Random Forest
    rf = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42)
    rf.fit(X_train, y_train)
    
    # Predictions
    y_pred = rf.predict(X_test)
    y_pred_proba = rf.predict_proba(X_test)[:, 1]
    
    # Evaluation
    accuracy = accuracy_score(y_test, y_pred)
    print(f"\nModel Performance:")
    print(f"Accuracy: {accuracy:.4f}")
    
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=['No Default', 'Default']))
    
    # Visualizations
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    
    # 1. Confusion Matrix
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0, 0],
                xticklabels=['No Default', 'Default'],
                yticklabels=['No Default', 'Default'])
    axes[0, 0].set_title('Confusion Matrix', fontsize=12, fontweight='bold')
    axes[0, 0].set_ylabel('True Label')
    axes[0, 0].set_xlabel('Predicted Label')
    
    # 2. Feature Importance
    importance = rf.feature_importances_
    indices = np.argsort(importance)[::-1]
    axes[0, 1].barh(range(len(importance)), importance[indices], color='skyblue', edgecolor='black')
    axes[0, 1].set_yticks(range(len(importance)))
    axes[0, 1].set_yticklabels([feature_names[i] for i in indices])
    axes[0, 1].set_xlabel('Importance')
    axes[0, 1].set_title('Feature Importance', fontsize=12, fontweight='bold')
    axes[0, 1].invert_yaxis()
    axes[0, 1].grid(True, alpha=0.3, axis='x')
    
    # 3. Probability Distribution
    axes[1, 0].hist(y_pred_proba[y_test == 0], bins=30, alpha=0.6, 
                    label='No Default', color='green', edgecolor='black')
    axes[1, 0].hist(y_pred_proba[y_test == 1], bins=30, alpha=0.6, 
                    label='Default', color='red', edgecolor='black')
    axes[1, 0].set_xlabel('Predicted Probability')
    axes[1, 0].set_ylabel('Frequency')
    axes[1, 0].set_title('Default Probability Distribution', fontsize=12, fontweight='bold')
    axes[1, 0].legend()
    axes[1, 0].grid(True, alpha=0.3, axis='y')
    
    # 4. Risk Segmentation
    risk_thresholds = [0.3, 0.5, 0.7]
    risk_groups = ['Low Risk', 'Medium Risk', 'High Risk', 'Very High Risk']
    risk_assignment = np.digitize(y_pred_proba, risk_thresholds)
    
    risk_counts = [np.sum(risk_assignment == i) for i in range(4)]
    colors = ['#2ecc71', '#f39c12', '#e67e22', '#e74c3c']
    
    axes[1, 1].pie(risk_counts, labels=risk_groups, autopct='%1.1f%%',
                   colors=colors, startangle=90)
    axes[1, 1].set_title('Customer Risk Segmentation', fontsize=12, fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    # Business Insights
    print("\n" + "=" * 70)
    print("BUSINESS INSIGHTS")
    print("=" * 70)
    
    for i, group in enumerate(risk_groups):
        count = risk_counts[i]
        pct = count / len(y_test) * 100
        print(f"{group:>15s}: {count:>4d} customers ({pct:>5.1f}%)")
    
    print("\nRecommendations:")
    print("  - Low Risk: Approve with standard rates")
    print("  - Medium Risk: Approve with slightly higher rates")
    print("  - High Risk: Approve with higher rates or collateral")
    print("  - Very High Risk: Reject or require significant collateral")

credit_risk_case_study()

## 13. Practice Problems

### Problem 1: Wine Quality Classification

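Build a Random Forest classifier on scikit-learn's built-in wine dataset (`load_wine`). Note that its target is the wine cultivar (three classes), not a good/bad quality label, so this is a three-class classification problem.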

**Tasks:**
1. Load and explore the data
2. Split into train/test sets
3. Train a Random Forest classifier
4. Evaluate performance
5. Find top 5 important features
6. Compare with Decision Tree

In [None]:
# Problem 1: Wine Quality Classification
from sklearn.datasets import load_wine

def wine_quality_problem():
    print("=" * 70)
    print("PROBLEM 1: WINE QUALITY CLASSIFICATION")
    print("=" * 70)
    
    # Load data
    data = load_wine()
    X, y = data.data, data.target
    feature_names = data.feature_names
    
    print("\nDataset Information:")
    print(f"Number of samples: {len(X)}")
    print(f"Number of features: {X.shape[1]}")
    print(f"Number of classes: {len(np.unique(y))}")
    
    # TODO: Your code here
    # 1. Split data into train/test (70/30)
    # 2. Train Random Forest with 100 trees
    # 3. Calculate accuracy
    # 4. Print classification report
    # 5. Plot feature importance
    # 6. Compare with Decision Tree
    
    print("\n[Your solution here]")

# Uncomment to test
# wine_quality_problem()

### Problem 2: Housing Price Prediction

Build a Random Forest regressor to predict housing prices.

**Tasks:**
1. Generate synthetic housing data
2. Train Random Forest regressor
3. Calculate RMSE and R² score
4. Create actual vs predicted plot
5. Tune hyperparameters using Grid Search
6. Compare tuned vs default model

In [None]:
# Problem 2: Housing Price Prediction
def housing_price_problem():
    print("=" * 70)
    print("PROBLEM 2: HOUSING PRICE PREDICTION")
    print("=" * 70)
    
    # Generate synthetic housing data
    X, y = make_regression(n_samples=1000, n_features=10, n_informative=8,
                          noise=10, random_state=42)
    
    feature_names = [f'Feature_{i}' for i in range(X.shape[1])]
    
    print("\nDataset Information:")
    print(f"Number of samples: {len(X)}")
    print(f"Number of features: {X.shape[1]}")
    
    # TODO: Your code here
    # 1. Split data into train/test (80/20)
    # 2. Train Random Forest regressor
    # 3. Calculate RMSE and R²
    # 4. Create actual vs predicted plot
    # 5. Perform Grid Search for best parameters
    # 6. Compare results
    
    print("\n[Your solution here]")

# Uncomment to test
# housing_price_problem()

### Problem 3: Customer Churn Prediction

Build a Random Forest model to predict customer churn.

**Tasks:**
1. Create synthetic customer data (age, tenure, monthly_charges, etc.)
2. Handle class imbalance if present
3. Train Random Forest with OOB scoring
4. Calculate feature importances
5. Create partial dependence plots for top 3 features
6. Provide business recommendations

In [None]:
# Problem 3: Customer Churn Prediction
def customer_churn_problem():
    print("=" * 70)
    print("PROBLEM 3: CUSTOMER CHURN PREDICTION")
    print("=" * 70)
    
    # Generate synthetic customer data
    np.random.seed(42)
    n_samples = 2000
    
    age = np.random.randint(18, 80, n_samples)
    tenure = np.random.randint(0, 72, n_samples)  # months
    monthly_charges = np.random.uniform(20, 200, n_samples)
    total_charges = monthly_charges * tenure + np.random.normal(0, 100, n_samples)
    num_products = np.random.randint(1, 5, n_samples)
    
    # Create churn (more likely if: high charges, low tenure, few products)
    churn_prob = (
        0.005 * monthly_charges +
        -0.01 * tenure +
        -0.05 * num_products +
        np.random.normal(0, 0.5, n_samples)
    )
    churn = (churn_prob > np.median(churn_prob)).astype(int)
    
    # Create DataFrame
    df = pd.DataFrame({
        'age': age,
        'tenure': tenure,
        'monthly_charges': monthly_charges,
        'total_charges': total_charges,
        'num_products': num_products,
        'churn': churn
    })
    
    print("\nDataset Overview:")
    print(df.describe())
    print(f"\nChurn Rate: {churn.mean()*100:.2f}%")
    
    # TODO: Your code here
    # 1. Split data and train Random Forest with OOB
    # 2. Evaluate model performance
    # 3. Plot feature importances
    # 4. Create partial dependence plots
    # 5. Identify high-risk customer segments
    # 6. Provide retention recommendations
    
    print("\n[Your solution here]")

# Uncomment to test
# customer_churn_problem()

## Summary and Key Takeaways

### Key Concepts

1. **Ensemble Methods**: Combine multiple models for better performance
2. **Bagging**: Bootstrap aggregating reduces variance
3. **Random Forest**: Bagging + random feature selection
4. **OOB Error**: Validation estimate from the samples each tree did not see, without a separate hold-out set
5. **Feature Importance**: Understanding variable contributions

### Advantages of Random Forest

- Reduces overfitting compared to single trees
- Handles large datasets efficiently
- Works well with high-dimensional data
- Provides feature importance
- Robust to outliers and noise
- No need for feature scaling
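- Can handle missing values in some implementations (recent scikit-learn versions accept NaN in random forests; older versions require imputation first)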

### Disadvantages

- Less interpretable than single tree
- Slower training and prediction
- Memory intensive
- Can overfit on noisy datasets
- Impurity-based importance is biased toward high-cardinality features (e.g., categorical features with many levels)

### Best Practices

1. **Start with defaults**: Often work well
2. **Increase n_estimators**: More trees usually help (with diminishing returns)
3. **Use OOB score**: For quick validation
4. **Check feature importance**: For insights and feature selection
5. **Tune max_depth**: Control complexity
6. **Monitor training time**: Balance accuracy and speed
7. **Cross-validate**: Ensure robust performance
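
Practices 3 and 7 can be compared directly. The sketch below, which assumes the breast cancer dataset and 200 trees purely for illustration, fits one forest with `oob_score=True` and sets its OOB accuracy next to a 5-fold cross-validation estimate.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# OOB estimate: each sample is scored only by trees that did not train on it
# (requires bootstrap sampling, which is the default)
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf.fit(X, y)
oob_accuracy = rf.oob_score_

# Cross-validation estimate for comparison (five separate fits)
cv_accuracy = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=42), X, y, cv=5
).mean()

print(f"OOB accuracy:       {oob_accuracy:.4f}")
print(f"5-fold CV accuracy: {cv_accuracy:.4f}")
```

The two estimates are typically close, and the OOB score comes from a single fit, which is why it is handy for quick checks.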

### When to Use Random Forest

**Use Random Forest when:**
- High accuracy is the priority
- Large dataset available
- Feature importance needed
- Robustness to overfitting desired
- Mixed feature types present

**Consider alternatives when:**
- Interpretability is critical (use Decision Tree)
- Very large scale data (use Linear Models)
- Real-time predictions needed (use simpler models)
- Memory is constrained (use single tree)

### Further Learning

1. **Extra Trees**: Extremely Randomized Trees
2. **Gradient Boosting**: XGBoost, LightGBM, CatBoost
3. **Stacking**: Combining different model types
4. **Feature Engineering**: Creating better features
5. **Hyperparameter Optimization**: Bayesian optimization, Optuna

---

**Resources:**
- Scikit-learn Documentation: https://scikit-learn.org/stable/modules/ensemble.html
- "The Elements of Statistical Learning" by Hastie, Tibshirani, and Friedman
- "Random Forests" by Leo Breiman (2001)
- Kaggle competitions for practice
