# Module 01: Bagging and Bootstrap Aggregation

**Difficulty**: ⭐⭐⭐ Advanced
**Estimated Time**: 75 minutes
**Prerequisites**: 
- Module 00: Introduction to Ensemble Learning
- Understanding of decision trees
- Basic statistics (sampling, variance)

## Learning Objectives

By the end of this notebook, you will be able to:
1. Understand bootstrap sampling and its properties
2. Explain how bagging reduces model variance
3. Implement bagging from scratch
4. Tune bagging hyperparameters for optimal performance
5. Analyze out-of-bag (OOB) error estimation

## Setup and Configuration

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning
from sklearn.datasets import make_classification, make_regression, load_wine
from sklearn.model_selection import train_test_split, cross_val_score, learning_curve
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.metrics import accuracy_score, mean_squared_error, classification_report

# Statistical tools
from scipy import stats

# Configuration
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

print("Setup complete! Ready to explore bagging.")

## 1. Bootstrap Sampling: The Foundation

### What is Bootstrap?

**Bootstrap** is a resampling technique that creates multiple datasets by sampling WITH replacement from the original data.

**Key Properties**:
- Each bootstrap sample has the same size as the original dataset
- Some observations appear multiple times, others not at all
- On average, each bootstrap sample contains ~63.2% of the original observations (counting each distinct observation once)
- The remaining ~36.8% are "out-of-bag" (OOB) samples

### Why 63.2%?

Mathematical derivation:
- Probability an observation is NOT selected in a single draw: (n-1)/n = 1 - 1/n
- Probability it is NOT selected in any of the n independent draws: ((n-1)/n)^n
- As n → ∞, this approaches 1/e ≈ 0.368
- So the probability of being selected at least once approaches 1 - 1/e ≈ 0.632
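
The limit can be checked numerically. The sketch below evaluates ((n-1)/n)^n for increasing n and compares it with 1/e; the helper name `prob_never_selected` is ours for illustration and is not part of any library.

```python
import math

def prob_never_selected(n):
    """Probability that a given observation is absent from a bootstrap sample of size n."""
    return ((n - 1) / n) ** n

for n in (10, 100, 1000, 100000):
    p_out = prob_never_selected(n)
    print(f"n={n:>6}: P(not selected)={p_out:.4f}, P(selected)={1 - p_out:.4f}")

print(f"limit   : 1/e={1 / math.e:.4f}, 1 - 1/e={1 - 1 / math.e:.4f}")
```

Convergence is fast: at n = 100 the probability of being left out is already about 0.366, so the 63.2% / 36.8% split is a good approximation for most practical dataset sizes.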

In [None]:
# Demonstration: Bootstrap sampling properties
def analyze_bootstrap_sample(data, n_bootstraps=1000):
    """
    Analyze properties of bootstrap samples.
    
    Parameters
    ----------
    data : array-like
        Original dataset
    n_bootstraps : int
        Number of bootstrap samples to generate
    
    Returns
    -------
    dict : Statistics about bootstrap samples
    """
    n = len(data)
    unique_percentages = []
    
    for _ in range(n_bootstraps):
        # Sample WITH replacement
        bootstrap_indices = np.random.choice(n, size=n, replace=True)
        
        # Count unique observations
        n_unique = len(np.unique(bootstrap_indices))
        unique_percentages.append(n_unique / n * 100)
    
    return {
        'mean_unique_pct': np.mean(unique_percentages),
        'std_unique_pct': np.std(unique_percentages),
        'theoretical_unique_pct': (1 - 1/np.e) * 100,
        'all_percentages': unique_percentages
    }

# Create sample data
sample_data = np.arange(100)

# Analyze bootstrap properties
results = analyze_bootstrap_sample(sample_data, n_bootstraps=10000)

print("Bootstrap Sampling Analysis")
print("=" * 50)
print(f"Empirical mean unique %: {results['mean_unique_pct']:.2f}%")
print(f"Theoretical unique %: {results['theoretical_unique_pct']:.2f}%")
print(f"Standard deviation: {results['std_unique_pct']:.2f}%")
print(f"\nOut-of-bag (OOB) percentage: {100 - results['mean_unique_pct']:.2f}%")

# Visualize distribution
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(results['all_percentages'], bins=50, alpha=0.7, edgecolor='black')
plt.axvline(results['mean_unique_pct'], color='red', linestyle='--', 
            linewidth=2, label=f"Empirical: {results['mean_unique_pct']:.1f}%")
plt.axvline(results['theoretical_unique_pct'], color='green', linestyle='--', 
            linewidth=2, label=f"Theoretical: {results['theoretical_unique_pct']:.1f}%")
plt.xlabel('Percentage of Unique Observations')
plt.ylabel('Frequency')
plt.title('Distribution of Unique Observations in Bootstrap Samples')
plt.legend()

plt.subplot(1, 2, 2)
# Show example bootstrap sample
example_bootstrap = np.random.choice(sample_data, size=len(sample_data), replace=True)
# Count draws per original observation (sample_data holds indices 0..99); 0 means out-of-bag
counts = np.bincount(example_bootstrap, minlength=len(sample_data))
plt.bar(range(20), counts[:20], alpha=0.7, edgecolor='black')
plt.xlabel('Observation Index (first 20)')
plt.ylabel('Number of Times Selected')
plt.title('Example Bootstrap Sample: Observation Frequencies')
plt.axhline(1, color='red', linestyle='--', alpha=0.5, label='Selected once')
plt.legend()

plt.tight_layout()
plt.show()

## 2. Bagging: Bootstrap AGGregatING

### The Bagging Algorithm

**BAG**ging = **B**ootstrap **AGG**regating

**Steps**:
1. Create B bootstrap samples from training data
2. Train a model on each bootstrap sample
3. For prediction:
   - **Classification**: Majority vote across all models
   - **Regression**: Average predictions across all models
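
The two aggregation rules in step 3 can be sketched on toy arrays. The prediction values below are made up for illustration; in practice each row would come from one fitted model's `predict` call.

```python
import numpy as np

# Hypothetical class predictions from B = 5 models (rows) on 4 samples (columns)
class_preds = np.array([
    [0, 1, 2, 1],
    [0, 1, 1, 1],
    [1, 1, 2, 0],
    [0, 2, 2, 1],
    [0, 1, 2, 1],
])

# Classification: count votes per class for each sample, then take the most voted class
n_classes = class_preds.max() + 1
votes = np.array([np.bincount(col, minlength=n_classes) for col in class_preds.T])
majority = votes.argmax(axis=1)  # ties resolve to the lowest class label
print("Majority vote:", majority)

# Hypothetical regression predictions from B = 3 models (rows) on 2 samples (columns)
reg_preds = np.array([
    [2.0, 10.0],
    [3.0, 12.0],
    [4.0, 11.0],
])

# Regression: average the predictions per sample
averaged = reg_preds.mean(axis=0)
print("Averaged prediction:", averaged)
```

Here the majority vote is `[0 1 2 1]` and the averaged prediction is `[3. 11.]`. Note that `argmax` breaks ties in favour of the lowest class label, which is one possible convention among several.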

### Why Bagging Reduces Variance

**Mathematical intuition**:
- Variance of average of n independent variables: Var(average) = Var(X) / n
- Even with correlation, averaging reduces variance
- Each model sees slightly different data → different errors → errors cancel out
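
To make the correlated case concrete: for n variables with common variance σ² and common pairwise correlation ρ, Var(average) = ρσ² + (1 - ρ)σ²/n. The simulation below is a sketch under that equal-correlation assumption (real bagged models are only approximately like this); the variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_draws, sigma2 = 20, 200_000, 1.0

variance_check = {}
for rho in (0.0, 0.3, 0.8):
    # Build unit-variance variables with pairwise correlation rho:
    # a shared component plus an independent component per "model"
    shared = rng.standard_normal((n_draws, 1))
    own = rng.standard_normal((n_draws, n_models))
    x = np.sqrt(rho) * shared + np.sqrt(1 - rho) * own

    empirical = x.mean(axis=1).var()
    theoretical = rho * sigma2 + (1 - rho) * sigma2 / n_models
    variance_check[rho] = (empirical, theoretical)
    print(f"rho={rho:.1f}: empirical={empirical:.4f}, theoretical={theoretical:.4f}")
```

With ρ = 0 the variance drops to σ²/n, but as ρ grows the first term dominates and adding more models helps less. This is why diversity between the bagged models matters.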

**Best use cases**:
- High-variance models (e.g., deep decision trees)
- Noisy datasets
- When you want stable, robust predictions

In [None]:
# Demonstrate variance reduction with bagging
def compare_variance_single_vs_bagging(n_trials=50):
    """
    Compare prediction variance of single model vs bagging ensemble.
    
    Each trial fits both models on a fresh bootstrap resample of the
    training set, so the variance measures sensitivity to the training data.
    (On identical training data a decision tree is nearly deterministic,
    so changing only the seed would not expose its variance.)
    """
    # Create a fixed train/test split
    X, y = make_classification(
        n_samples=1000, n_features=20, n_informative=15,
        random_state=RANDOM_STATE
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=RANDOM_STATE
    )
    
    # Store predictions across trials
    single_predictions = []
    bagging_predictions = []
    
    for trial in range(n_trials):
        # Resample the training set to simulate drawing a new dataset
        rng = np.random.RandomState(trial)
        idx = rng.choice(len(X_train), size=len(X_train), replace=True)
        X_trial, y_trial = X_train[idx], y_train[idx]
        
        # Single decision tree (high variance)
        single_model = DecisionTreeClassifier(
            max_depth=10, random_state=trial
        )
        single_model.fit(X_trial, y_trial)
        single_pred = single_model.predict_proba(X_test)[:, 1]  # Probability of class 1
        single_predictions.append(single_pred)
        
        # Bagging ensemble
        bagging_model = BaggingClassifier(
            estimator=DecisionTreeClassifier(max_depth=10),
            n_estimators=20,
            random_state=trial
        )
        bagging_model.fit(X_trial, y_trial)
        bagging_pred = bagging_model.predict_proba(X_test)[:, 1]
        bagging_predictions.append(bagging_pred)
    
    # Calculate variance for each test sample
    single_variance = np.var(single_predictions, axis=0)
    bagging_variance = np.var(bagging_predictions, axis=0)
    
    return single_variance, bagging_variance

# Run comparison
print("Running variance comparison (this may take a minute)...")
single_var, bagging_var = compare_variance_single_vs_bagging(n_trials=30)

print("\nVariance Analysis:")
print("=" * 50)
print(f"Single tree - Mean variance: {np.mean(single_var):.6f}")
print(f"Bagging - Mean variance: {np.mean(bagging_var):.6f}")
print(f"\nVariance reduction: {(1 - np.mean(bagging_var)/np.mean(single_var)) * 100:.1f}%")

# Visualization
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist([single_var, bagging_var], label=['Single Tree', 'Bagging'], 
         alpha=0.7, bins=30, edgecolor='black')
plt.xlabel('Prediction Variance')
plt.ylabel('Frequency')
plt.title('Distribution of Prediction Variance Across Test Samples')
plt.legend()
plt.yscale('log')

plt.subplot(1, 2, 2)
plt.scatter(single_var, bagging_var, alpha=0.5, s=10)
plt.plot([0, max(single_var)], [0, max(single_var)], 'r--', 
         linewidth=2, label='Equal variance')
plt.xlabel('Single Tree Variance')
plt.ylabel('Bagging Variance')
plt.title('Variance Comparison: Single vs Bagging\n(Points below line = bagging wins)')
plt.legend()

plt.tight_layout()
plt.show()

## 3. Implementing Bagging from Scratch

Let's build a bagging classifier from scratch to understand the mechanics.

In [None]:
class SimpleBaggingClassifier:
    """
    A from-scratch implementation of bagging for classification.
    
    This implementation demonstrates the core bagging algorithm:
    1. Bootstrap sampling
    2. Training multiple models
    3. Aggregating predictions via voting
    """
    
    def __init__(self, base_estimator, n_estimators=10, random_state=None):
        """
        Parameters
        ----------
        base_estimator : sklearn estimator
            The model to use for each bootstrap sample
        n_estimators : int
            Number of models in the ensemble
        random_state : int
            Random seed for reproducibility
        """
        self.base_estimator = base_estimator
        self.n_estimators = n_estimators
        self.random_state = random_state
        self.models = []
        self.oob_indices = []  # Track out-of-bag samples
        
    def fit(self, X, y):
        """
        Train the bagging ensemble.
        
        For each estimator:
        1. Create bootstrap sample
        2. Train model on bootstrap sample
        3. Store out-of-bag indices for error estimation
        """
        from sklearn.base import clone  # imported once per fit, not once per model
        
        # Use a local generator so fitting does not reseed NumPy's global state
        rng = np.random.RandomState(self.random_state)
        
        n_samples = X.shape[0]
        self.models = []
        self.oob_indices = []
        
        for i in range(self.n_estimators):
            # Create bootstrap sample (sample WITH replacement)
            bootstrap_indices = rng.choice(
                n_samples, size=n_samples, replace=True
            )
            
            # Identify out-of-bag samples (not in bootstrap)
            oob = np.setdiff1d(np.arange(n_samples), bootstrap_indices)
            self.oob_indices.append(oob)
            
            # Get bootstrap data
            X_bootstrap = X[bootstrap_indices]
            y_bootstrap = y[bootstrap_indices]
            
            # Clone and train model
            # We need to clone to avoid all models being the same object
            model = clone(self.base_estimator)
            model.fit(X_bootstrap, y_bootstrap)
            
            self.models.append(model)
        
        return self
    
    def predict(self, X):
        """
        Predict using majority voting.
        
        Each model votes for a class, and we return the majority vote.
        """
        # Get predictions from all models
        predictions = np.array([model.predict(X) for model in self.models])
        
        # Majority vote
        # For each sample, find most common prediction across models
        majority_vote = stats.mode(predictions, axis=0, keepdims=True)[0][0]
        
        return majority_vote
    
    def predict_proba(self, X):
        """
        Predict class probabilities by averaging.
        """
        # Get probability predictions from all models
        probas = np.array([model.predict_proba(X) for model in self.models])
        
        # Average probabilities across models
        avg_proba = np.mean(probas, axis=0)
        
        return avg_proba
    
    def oob_score(self, X, y):
        """
        Calculate out-of-bag accuracy.
        
        For each sample, use only models that didn't see it during training,
        and take a majority vote among them. This gives a performance
        estimate without needing a separate validation set.
        
        X and y must be the same training data that was passed to fit().
        """
        n_samples = X.shape[0]
        classes = np.unique(y)
        # votes[i, k] = number of OOB votes for class classes[k] on sample i
        # (averaging raw labels would be wrong for more than two classes)
        votes = np.zeros((n_samples, len(classes)))
        
        # For each model, make predictions on its OOB samples
        for oob_idx, model in zip(self.oob_indices, self.models):
            if len(oob_idx) > 0:
                preds = model.predict(X[oob_idx])
                # Map predicted labels to their column in `classes`
                cols = np.searchsorted(classes, preds)
                votes[oob_idx, cols] += 1
        
        # Only consider samples that were OOB for at least one model
        valid_mask = votes.sum(axis=1) > 0
        oob_predictions = classes[np.argmax(votes[valid_mask], axis=1)]
        
        # Calculate accuracy
        oob_accuracy = accuracy_score(y[valid_mask], oob_predictions)
        
        return oob_accuracy

print("SimpleBaggingClassifier implemented successfully!")

In [None]:
# Test our custom bagging implementation
# Load real dataset
wine = load_wine()
X, y = wine.data, wine.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=RANDOM_STATE
)

# Our custom implementation
custom_bagging = SimpleBaggingClassifier(
    base_estimator=DecisionTreeClassifier(max_depth=5),
    n_estimators=50,
    random_state=RANDOM_STATE
)
custom_bagging.fit(X_train, y_train)
custom_pred = custom_bagging.predict(X_test)
custom_acc = accuracy_score(y_test, custom_pred)
custom_oob = custom_bagging.oob_score(X_train, y_train)

# Sklearn's implementation (for comparison)
sklearn_bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=5),
    n_estimators=50,
    random_state=RANDOM_STATE,
    oob_score=True
)
sklearn_bagging.fit(X_train, y_train)
sklearn_pred = sklearn_bagging.predict(X_test)
sklearn_acc = accuracy_score(y_test, sklearn_pred)
sklearn_oob = sklearn_bagging.oob_score_

# Single tree baseline
single_tree = DecisionTreeClassifier(max_depth=5, random_state=RANDOM_STATE)
single_tree.fit(X_train, y_train)
single_acc = accuracy_score(y_test, single_tree.predict(X_test))

print("Performance Comparison:")
print("=" * 60)
print(f"Single Tree:               Test Acc = {single_acc:.4f}")
print(f"Custom Bagging:            Test Acc = {custom_acc:.4f}, OOB = {custom_oob:.4f}")
print(f"Sklearn Bagging:           Test Acc = {sklearn_acc:.4f}, OOB = {sklearn_oob:.4f}")
print("\nThe two implementations should score similarly; exact values can differ because their bootstrap draws differ.")

## 4. Out-of-Bag (OOB) Error Estimation

### What is OOB Error?

**Key insight**: Each model in bagging is trained on ~63% of data. The remaining ~37% can be used for validation!

**Advantages of OOB**:
1. **No need for separate validation set**: Uses training data more efficiently
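2. **Nearly unbiased estimate**: Each prediction uses only models that didn't see that sample; since only part of the ensemble votes on each sample, the estimate can be slightly pessimistic
3. **Cheap validation**: No extra model fitting; the only cost is tracking indices and predicting on the OOB samples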

**OOB vs Cross-Validation**:
- OOB is faster (no retraining on separate folds)
- OOB typically gives an accuracy estimate close to that of CV
- Particularly useful for large datasets

In [None]:
# Demonstrate OOB error tracking across ensemble sizes
def track_oob_convergence(X_train, y_train, X_test, y_test, max_estimators=100):
    """
    Track how OOB error and test error converge as we add models.
    
    This shows that OOB error is a good proxy for test error.
    """
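    # Start at 10: with very few estimators many samples are never out-of-bag,
    # which makes the OOB score unreliable
    estimator_range = range(10, max_estimators + 1, 5)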
    oob_errors = []
    test_errors = []
    
    for n_est in estimator_range:
        model = BaggingClassifier(
            estimator=DecisionTreeClassifier(max_depth=8),
            n_estimators=n_est,
            oob_score=True,
            random_state=RANDOM_STATE
        )
        model.fit(X_train, y_train)
        
        # OOB error (1 - accuracy)
        oob_errors.append(1 - model.oob_score_)
        
        # Test error
        test_acc = accuracy_score(y_test, model.predict(X_test))
        test_errors.append(1 - test_acc)
    
    return list(estimator_range), oob_errors, test_errors

# Create larger dataset for demonstration
X, y = make_classification(
    n_samples=2000, n_features=30, n_informative=20,
    random_state=RANDOM_STATE
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=RANDOM_STATE
)

print("Tracking OOB convergence...")
n_estimators, oob_errs, test_errs = track_oob_convergence(
    X_train, y_train, X_test, y_test, max_estimators=100
)

# Visualization
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(n_estimators, oob_errs, 'o-', label='OOB Error', alpha=0.7)
plt.plot(n_estimators, test_errs, 's-', label='Test Error', alpha=0.7)
plt.xlabel('Number of Estimators')
plt.ylabel('Error Rate')
plt.title('OOB Error vs Test Error Convergence')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
plt.scatter(oob_errs, test_errs, alpha=0.6, s=50)
plt.plot([min(oob_errs), max(oob_errs)], [min(oob_errs), max(oob_errs)], 
         'r--', linewidth=2, label='OOB = Test')
plt.xlabel('OOB Error')
plt.ylabel('Test Error')
plt.title('OOB Error as Test Error Proxy')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Calculate correlation
correlation = np.corrcoef(oob_errs, test_errs)[0, 1]
print(f"\nCorrelation between OOB and Test error: {correlation:.4f}")
print("A correlation near 1 indicates that OOB error tracks test error across ensemble sizes.")

## 5. Hyperparameter Tuning for Bagging

### Key Hyperparameters

1. **n_estimators**: Number of models in ensemble
   - More is generally better, but diminishing returns
   - Typical range: 10-200

2. **max_samples**: Number (int) or fraction (float) of samples drawn for each bootstrap
   - Default: 1.0 (same size as training set)
   - Lower values increase diversity but may hurt individual model performance

3. **max_features**: Number (int) or fraction (float) of features given to each model
   - Can subsample features in addition to samples
   - Increases diversity; Random Forests use a related idea, drawing a random feature subset at each split rather than once per model

4. **Base estimator parameters**: Depends on chosen model
   - For trees: max_depth, min_samples_split, etc.
   - Bagging mainly reduces variance, so low-bias, high-variance base models (reasonably deep trees) usually benefit most; very shallow trees leave bias that averaging cannot remove

In [None]:
# Hyperparameter tuning demonstration
from sklearn.model_selection import GridSearchCV

# Create dataset
X, y = make_classification(
    n_samples=1500, n_features=25, n_informative=18,
    random_state=RANDOM_STATE
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=RANDOM_STATE
)

# Define parameter grid
param_grid = {
    'n_estimators': [10, 30, 50, 100],
    'max_samples': [0.5, 0.7, 1.0],
    'max_features': [0.5, 0.7, 1.0],
    'estimator__max_depth': [5, 10, 15]
}

# Create base bagging classifier
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    random_state=RANDOM_STATE,
    oob_score=True
)

# Grid search
# Note: Using oob_score reduces need for CV, but we still use CV here for demonstration
print("Running grid search (this may take a minute)...")
grid_search = GridSearchCV(
    bagging, param_grid, cv=3, scoring='accuracy', 
    n_jobs=-1, verbose=0
)
grid_search.fit(X_train, y_train)

print("\nBest Parameters:")
print("=" * 50)
for param, value in grid_search.best_params_.items():
    print(f"{param:25s}: {value}")

print(f"\nBest CV Score: {grid_search.best_score_:.4f}")
print(f"Test Score: {grid_search.score(X_test, y_test):.4f}")

# Compare with default parameters
default_bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    random_state=RANDOM_STATE
)
default_bagging.fit(X_train, y_train)
default_score = accuracy_score(y_test, default_bagging.predict(X_test))

print(f"\nDefault parameters score: {default_score:.4f}")
improvement = (grid_search.score(X_test, y_test) - default_score) / default_score * 100
print(f"Improvement from tuning: {improvement:.2f}%")

In [None]:
# Visualize effect of n_estimators
def analyze_n_estimators_effect(X_train, y_train, X_test, y_test):
    """
    Analyze how number of estimators affects performance.
    """
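    # Note: OOB scores for the smallest ensembles are unreliable because
    # many samples are never out-of-bag
    n_estimators_range = range(1, 151, 5)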
    train_scores = []
    test_scores = []
    oob_scores = []
    
    for n_est in n_estimators_range:
        model = BaggingClassifier(
            estimator=DecisionTreeClassifier(max_depth=10),
            n_estimators=n_est,
            oob_score=True,
            random_state=RANDOM_STATE
        )
        model.fit(X_train, y_train)
        
        train_scores.append(model.score(X_train, y_train))
        test_scores.append(model.score(X_test, y_test))
        oob_scores.append(model.oob_score_)
    
    return list(n_estimators_range), train_scores, test_scores, oob_scores

n_est, train_sc, test_sc, oob_sc = analyze_n_estimators_effect(
    X_train, y_train, X_test, y_test
)

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(n_est, train_sc, label='Training Score', linewidth=2)
plt.plot(n_est, test_sc, label='Test Score', linewidth=2)
plt.plot(n_est, oob_sc, label='OOB Score', linewidth=2, linestyle='--')
plt.xlabel('Number of Estimators')
plt.ylabel('Accuracy')
plt.title('Performance vs Number of Estimators')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
# Calculate marginal improvement
test_improvements = np.diff(test_sc)
plt.plot(n_est[1:], test_improvements, 'o-', alpha=0.7)
plt.axhline(0, color='red', linestyle='--', alpha=0.5)
plt.xlabel('Number of Estimators')
plt.ylabel('Marginal Improvement in Test Score')
plt.title('Diminishing Returns: Adding More Estimators')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Find the first step where the marginal change drops below the threshold.
# Test accuracy is noisy, so treat this as a rough guide rather than an optimum.
threshold = 0.001
for i, delta in enumerate(test_improvements):
    if abs(delta) < threshold:
        plateau_n = n_est[i+1]
        print(f"\nFirst step with |marginal change| < {threshold}: ~{plateau_n} estimators")
        print("Later steps may still fluctuate; check the plot before choosing n_estimators.")
        break

## Exercises

Apply your knowledge of bagging to solve practical problems.

### Exercise 1: Bootstrap Confidence Intervals

Bootstrap sampling can estimate confidence intervals for any statistic. Implement a function that:
1. Takes a dataset and a statistic function (e.g., mean, median)
2. Generates 1000 bootstrap samples
3. Calculates the statistic for each sample
4. Returns the 95% confidence interval

Test with the wine dataset features.

In [None]:
# Your code here
# Hint: Use np.percentile() to get 2.5% and 97.5% percentiles


### Exercise 2: Optimal Base Estimator Complexity

Investigate how base estimator complexity affects bagging performance. Create bagging ensembles with decision trees of varying depths (2, 5, 10, 15, 20, None). For each:
- Train a single tree
- Train a bagging ensemble (50 estimators)
- Compare performance

**Question**: At what depth does bagging provide the most benefit?

In [None]:
# Your code here
# Create a comparison across different tree depths


### Exercise 3: Feature Subsampling Impact

Bagging can subsample features in addition to samples. Experiment with different max_features values (0.3, 0.5, 0.7, 1.0) while keeping n_estimators=50.

Analyze:
- How does feature subsampling affect accuracy?
- How does it affect model diversity (measure prediction correlation)?
- What's the optimal trade-off?

In [None]:
# Your code here
# Test different max_features values
# Calculate both accuracy and model diversity


### Exercise 4: Bagging for Regression

Implement bagging for a regression task using BaggingRegressor. Use make_regression to create a dataset with:
- 1000 samples
- 20 features (10 informative)
- Noise level = 30

Compare:
- Single decision tree regressor
- Bagging with different n_estimators (10, 50, 100)

Visualize predictions vs true values for the best model.

In [None]:
# Your code here
# Use BaggingRegressor and DecisionTreeRegressor
# Compare MSE across different configurations


## Summary

### Key Concepts

1. **Bootstrap Sampling**:
   - Sample WITH replacement
   - Each sample contains ~63.2% unique observations
   - Remaining ~36.8% are out-of-bag (OOB)

2. **Bagging Algorithm**:
   - Train models on bootstrap samples
   - Aggregate via averaging (regression) or voting (classification)
   - Reduces variance, usually with little change in bias

3. **Variance Reduction**:
   - Works best with high-variance models (deep trees)
   - Averaging reduces prediction variance
   - Model diversity is crucial for effectiveness

4. **Out-of-Bag Estimation**:
   - Free validation using leftover samples
   - Nearly unbiased performance estimate (can be slightly pessimistic)
   - Usually tracks test error closely

5. **Hyperparameter Tuning**:
   - n_estimators: More is better, but diminishing returns
   - max_samples/max_features: Control diversity
   - Base estimator complexity: Low-bias, high-variance models (deeper trees) usually benefit most

### When to Use Bagging

**Best for**:
- High-variance models (decision trees, neural networks)
- Noisy datasets
- When stability is important

**Less effective for**:
- Already stable models (linear regression, regularized models)
- Very small datasets
- When interpretability is critical

### What's Next?

- **Module 02**: Random Forests - the most popular bagging variant
  - Adds feature randomness to bagging
  - Industry standard for many applications
  - Built-in feature importance

### Additional Resources

- **Original Paper**: Breiman, L. (1996). "Bagging Predictors". Machine Learning, 24(2), 123-140
- **Book**: "The Elements of Statistical Learning" (Hastie et al.) - Chapter 8, Section 8.7 (Bagging)
- **Documentation**: [Scikit-learn Bagging Documentation](https://scikit-learn.org/stable/modules/ensemble.html#bagging)