# Module 01: Bagging and Bootstrap Aggregation

**Difficulty**: ⭐⭐⭐ Advanced

**Estimated Time**: 75 minutes

**Prerequisites**: 
- Module 00: Introduction to Ensemble Learning
- Understanding of decision trees
- Basic probability and sampling concepts

## Learning Objectives

By the end of this notebook, you will be able to:

1. Understand bootstrap sampling and why it creates model diversity
2. Implement bagging from scratch and with scikit-learn
3. Explain how bagging reduces variance without increasing bias
4. Use out-of-bag (OOB) error estimation for model evaluation
5. Determine a suitable number of estimators for bagging ensembles
6. Apply bagging to both classification and regression problems

---

## 1. Setup and Imports

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter

# Scikit-learn models and utilities
from sklearn.datasets import make_classification, make_regression, load_wine
from sklearn.model_selection import (
    train_test_split, 
    cross_val_score,
    learning_curve
)
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import BaggingClassifier, BaggingRegressor
from sklearn.metrics import (
    accuracy_score, 
    mean_squared_error,
    r2_score,
    confusion_matrix
)

# Configuration
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

# Set random seeds for reproducibility
np.random.seed(42)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

print("✅ Setup complete!")

## 2. Understanding Bootstrap Sampling

### What is Bootstrap?

**Bootstrap sampling** is a resampling technique where we:
1. Randomly sample n observations from a dataset of size n
2. Sample **with replacement** (same observation can appear multiple times)
3. Create multiple different training sets from the same original data

### Why Bootstrap?

**Problem**: We want to train multiple models for an ensemble, but we only have one dataset.

**Solution**: Create multiple "synthetic" datasets through bootstrap sampling.

### Mathematical Properties

**Key insight**: On average, each bootstrap sample contains approximately **63.2%** of the unique observations in the original data (for large n).

Probability an observation is **selected** in one draw: $\frac{1}{n}$

Probability it's **NOT selected** in one draw: $1 - \frac{1}{n}$

Probability it's **NOT selected** in n draws: $\left(1 - \frac{1}{n}\right)^n$

As $n \to \infty$: $\left(1 - \frac{1}{n}\right)^n \to \frac{1}{e} \approx 0.368$

Therefore, the probability that an observation **IS selected** at least once: $1 - 0.368 = 0.632$ ✨
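The limit above can be checked numerically before running any simulation. The sketch below evaluates the exact finite-n probability; the helper name `prob_in_bootstrap` is made up for this illustration.

```python
import math

def prob_in_bootstrap(n: int) -> float:
    """Probability that a given observation appears at least once in n draws with replacement."""
    return 1.0 - (1.0 - 1.0 / n) ** n

# Large-n limit, 1 - 1/e
limit = 1.0 - 1.0 / math.e

for n in [5, 10, 100, 1000, 100000]:
    p = prob_in_bootstrap(n)
    print(f"n={n:>6d}: P(selected) = {p:.4f}  (gap to limit: {p - limit:+.4f})")
```

For finite n the exact probability sits slightly above the limit, so small datasets should show a somewhat higher share of unique observations than 63.2%.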

### Demonstration

In [None]:
# Demonstrate bootstrap sampling
original_data = np.array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'])
n_samples = len(original_data)

# Create 5 bootstrap samples
n_bootstraps = 5
bootstrap_samples = []

print(f"Original dataset: {original_data}\n")
print("Bootstrap Samples:")
print("=" * 60)

for i in range(n_bootstraps):
    # Sample with replacement
    bootstrap_idx = np.random.choice(n_samples, size=n_samples, replace=True)
    bootstrap_sample = original_data[bootstrap_idx]
    bootstrap_samples.append(bootstrap_sample)
    
    # Calculate statistics
    unique_items = np.unique(bootstrap_sample)
    unique_pct = len(unique_items) / n_samples * 100
    
    # Show sample composition
    counter = Counter(bootstrap_sample)
    
    print(f"\nSample {i+1}: {bootstrap_sample}")
    print(f"  Unique items: {len(unique_items)}/{n_samples} ({unique_pct:.1f}%)")
    print(f"  Item counts: {dict(counter)}")
    print(f"  Missing from original: {set(original_data) - set(unique_items)}")

print("\n" + "=" * 60)
print("\n📊 Key Observations:")
print("  1. Each sample has same size as original (10)")
print("  2. Some items appear multiple times (duplication)")
print("  3. Some original items are missing (out-of-bag samples)")
print("  4. Each sample is different (diversity!)")

### Verify the 63.2% Rule

In [None]:
# Empirically verify the 63.2% rule
def calculate_unique_percentage(n_samples, n_bootstraps=1000):
    """
    Calculate average percentage of unique samples in bootstrap.
    
    Args:
        n_samples: Size of original dataset
        n_bootstraps: Number of bootstrap samples to create
    
    Returns:
        Average percentage of unique samples
    """
    unique_percentages = []
    
    for _ in range(n_bootstraps):
        # Create bootstrap sample
        bootstrap_idx = np.random.choice(n_samples, size=n_samples, replace=True)
        unique_count = len(np.unique(bootstrap_idx))
        unique_pct = unique_count / n_samples * 100
        unique_percentages.append(unique_pct)
    
    return np.mean(unique_percentages)

# Test with different dataset sizes
dataset_sizes = [10, 50, 100, 500, 1000, 5000]
results = []

for size in dataset_sizes:
    avg_pct = calculate_unique_percentage(size)
    results.append(avg_pct)
    print(f"n={size:5d}: {avg_pct:.2f}% unique (expected: 63.2%)")

# Visualize convergence to 63.2%
plt.figure(figsize=(10, 6))
plt.plot(dataset_sizes, results, marker='o', linewidth=2, markersize=8, label='Empirical')
plt.axhline(y=63.2, color='red', linestyle='--', linewidth=2, label='Theoretical (63.2%)')
plt.xlabel('Dataset Size (n)', fontsize=12)
plt.ylabel('% Unique Samples in Bootstrap', fontsize=12)
plt.title('Bootstrap Sampling: Convergence to 63.2% Rule', fontsize=14, fontweight='bold')
plt.xscale('log')
plt.grid(True, alpha=0.3)
plt.legend(fontsize=11)
plt.tight_layout()
plt.show()

print(f"\n✅ As dataset size increases, percentage converges to 63.2%!")

### 🎯 Exercise 1: Bootstrap Sampling Properties

Explore bootstrap sampling behavior:

1. Create a dataset of 100 samples
2. Generate 50 bootstrap samples
3. For each original sample, count how many times it appears across all bootstrap samples
4. Plot the distribution of counts
5. What distribution does this follow? (Hint: Think about repeated independent trials)

In [None]:
# Your code here


## 3. Bagging: Bootstrap Aggregating

### The Bagging Algorithm

**Bootstrap Aggregating (Bagging)** combines bootstrap sampling with model aggregation:

1. **Bootstrap**: Create B bootstrap samples from training data
2. **Train**: Train a model on each bootstrap sample
3. **Aggregate**: Combine predictions
   - Classification: Majority vote
   - Regression: Average predictions

### Why Bagging Reduces Variance

**Single Model Variance**: $\sigma^2$

**Averaged Model Variance** (if models are independent): $\frac{\sigma^2}{B}$

**Key**: Even if models are partially correlated, averaging still reduces variance!

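If every pair of models has correlation $\rho$:

$$\text{Ensemble Variance} = \rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$$

As $B$ grows, the second term vanishes but the first does not: the variance cannot drop below $\rho\sigma^2$, which is why less correlated models make a better ensemble.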
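The formula above can be evaluated directly and compared against a small simulation. This is a sketch under stated assumptions: the "model outputs" are synthetic Gaussian draws (not trained models), and the function name `ensemble_variance` is made up for this illustration.

```python
import numpy as np

def ensemble_variance(sigma2, rho, B):
    """Variance of the average of B equally correlated models (formula above)."""
    return rho * sigma2 + (1.0 - rho) / B * sigma2

# Closed-form values: the floor rho * sigma2 remains no matter how large B gets
for rho in (0.0, 0.3, 0.7):
    row = ", ".join(
        f"B={B}: {ensemble_variance(1.0, rho, B):.3f}" for B in (1, 10, 100, 1000)
    )
    print(f"rho={rho:.1f} -> {row}")

# Monte Carlo check with synthetic outputs (illustrative assumption):
# a shared component induces pairwise correlation rho between unit-variance outputs.
rng = np.random.default_rng(0)
rho, B, n_trials = 0.3, 10, 100_000
shared = rng.standard_normal((n_trials, 1))
private = rng.standard_normal((n_trials, B))
outputs = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * private
simulated = outputs.mean(axis=1).var()

print(f"simulated: {simulated:.4f}, formula: {ensemble_variance(1.0, rho, B):.4f}")
```

With $\rho = 0$ the variance falls as $1/B$; with $\rho = 0.7$ adding models beyond the first few changes very little.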

### Bagging from Scratch

In [None]:
class SimpleBaggingClassifier:
    """
    Simple implementation of Bagging Classifier from scratch.
    
    This helps understand the core algorithm before using sklearn.
    """
    
    def __init__(self, base_estimator, n_estimators=10, random_state=None):
        """
        Args:
            base_estimator: Base model to use (e.g., DecisionTreeClassifier)
            n_estimators: Number of models in ensemble
            random_state: Random seed for reproducibility
        """
        self.base_estimator = base_estimator
        self.n_estimators = n_estimators
        self.random_state = random_state
        self.estimators_ = []
    
    def fit(self, X, y):
        """
        Train ensemble on data.
        
        Args:
            X: Feature matrix (n_samples, n_features)
            y: Target vector (n_samples,)
        """
        # Set random seed if provided
        if self.random_state is not None:
            np.random.seed(self.random_state)
        
        n_samples = X.shape[0]
        self.estimators_ = []
        
        for i in range(self.n_estimators):
            # Create bootstrap sample
            bootstrap_idx = np.random.choice(
                n_samples, 
                size=n_samples, 
                replace=True
            )
            X_bootstrap = X[bootstrap_idx]
            y_bootstrap = y[bootstrap_idx]
            
            # Train model on bootstrap sample
            # Note: We need to clone the base estimator to avoid retraining same model
            from sklearn.base import clone
            estimator = clone(self.base_estimator)
            estimator.fit(X_bootstrap, y_bootstrap)
            
            self.estimators_.append(estimator)
        
        return self
    
    def predict(self, X):
        """
        Predict using majority voting.
        
        Args:
            X: Feature matrix (n_samples, n_features)
        
        Returns:
            predictions: Predicted classes (n_samples,)
        """
        # Get predictions from all estimators
        predictions = np.array([
            estimator.predict(X) 
            for estimator in self.estimators_
        ])
        
        # Majority vote: take mode along axis 0 (across estimators)
        from scipy import stats
        majority_vote = stats.mode(predictions, axis=0, keepdims=True)[0].flatten()
        
        return majority_vote

# Test our implementation
X, y = make_classification(
    n_samples=500, 
    n_features=10, 
    n_informative=8,
    random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Compare single tree vs our bagging implementation
single_tree = DecisionTreeClassifier(max_depth=10, random_state=42)
single_tree.fit(X_train, y_train)
single_pred = single_tree.predict(X_test)
single_acc = accuracy_score(y_test, single_pred)

bagging = SimpleBaggingClassifier(
    base_estimator=DecisionTreeClassifier(max_depth=10),
    n_estimators=10,
    random_state=42
)
bagging.fit(X_train, y_train)
bagging_pred = bagging.predict(X_test)
bagging_acc = accuracy_score(y_test, bagging_pred)

print("📊 Performance Comparison:")
print(f"Single Decision Tree: {single_acc:.4f}")
print(f"Bagging (10 trees):   {bagging_acc:.4f}")
print(f"Improvement:          {bagging_acc - single_acc:.4f}")
print(f"\n✅ Bagging typically reduces overfitting and improves generalization!")

## 4. Scikit-learn BaggingClassifier

Now let's use the professional implementation from scikit-learn, which offers:
- Feature subsampling (in addition to sample subsampling)
- Out-of-bag score estimation
- Parallel processing
- More efficient implementation

In [None]:
# Load a real dataset
wine = load_wine()
X_wine = wine.data
y_wine = wine.target

X_train, X_test, y_train, y_test = train_test_split(
    X_wine, y_wine, test_size=0.3, random_state=42
)

# wine.DESCR.split('**')[1] returns the "Data Set Characteristics:" label, not the dataset name
print("Dataset: Wine recognition dataset")
print(f"Samples: {len(X_wine)}")
print(f"Features: {len(wine.feature_names)}")
print(f"Classes: {wine.target_names}")
print(f"\nClass distribution:")
for i, name in enumerate(wine.target_names):
    count = np.sum(y_wine == i)
    print(f"  {name}: {count} samples")

In [None]:
# Create and train BaggingClassifier
bagging_clf = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=10),
    n_estimators=50,
    max_samples=1.0,  # Use 100% of samples for each bootstrap
    max_features=1.0,  # Use 100% of features
    bootstrap=True,  # Sample with replacement
    bootstrap_features=False,  # Don't bootstrap features
    oob_score=True,  # Calculate out-of-bag score
    random_state=42,
    n_jobs=-1  # Use all CPU cores
)

# Train
bagging_clf.fit(X_train, y_train)

# Evaluate
train_acc = bagging_clf.score(X_train, y_train)
test_acc = bagging_clf.score(X_test, y_test)
oob_acc = bagging_clf.oob_score_

print("\n📊 Bagging Performance:")
print(f"Training Accuracy:   {train_acc:.4f}")
print(f"Test Accuracy:       {test_acc:.4f}")
print(f"OOB Accuracy:        {oob_acc:.4f}")
print(f"\nNote: the OOB score estimates test performance without needing a separate validation set!")

### 🎯 Exercise 2: Hyperparameter Exploration

Experiment with BaggingClassifier hyperparameters:

1. **max_samples**: Try [0.5, 0.7, 1.0] - what happens with smaller bootstrap samples?
2. **max_features**: Try [0.5, 0.7, 1.0] - does feature subsampling help?
3. **base estimator depth**: Try max_depth=[3, 5, 10, 20] - which works best?

Create a visualization comparing different configurations.

In [None]:
# Your code here


## 5. Out-of-Bag (OOB) Error Estimation

### What is OOB Error?

Remember: each bootstrap sample contains ~63.2% of the original observations. For a given model, the remaining ~36.8% are its **out-of-bag (OOB) samples**.

**OOB Score Calculation**:
1. For each training sample, find all models that didn't use it during training
2. Get predictions from only those models (like a validation set)
3. Calculate accuracy using these OOB predictions
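The three steps above can be sketched by hand. This is a minimal illustration, not scikit-learn's implementation: it uses hard votes from manually bootstrapped trees, whereas `BaggingClassifier` averages class probabilities when the base estimator provides them. All variable names here are made up for the example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_wine(return_X_y=True)
n_obs = X_demo.shape[0]
n_classes = len(np.unique(y_demo))
rng = np.random.default_rng(0)
n_trees = 50

# oob_votes[i, c] counts the trees that did NOT train on sample i and predicted class c
oob_votes = np.zeros((n_obs, n_classes))

for _ in range(n_trees):
    idx = rng.integers(0, n_obs, size=n_obs)      # bootstrap indices
    oob_mask = np.ones(n_obs, dtype=bool)
    oob_mask[idx] = False                         # samples this tree never saw
    tree = DecisionTreeClassifier(random_state=0).fit(X_demo[idx], y_demo[idx])
    preds = tree.predict(X_demo[oob_mask])
    oob_votes[np.flatnonzero(oob_mask), preds] += 1

# Score only samples that were out-of-bag for at least one tree
has_vote = oob_votes.sum(axis=1) > 0
manual_oob_pred = oob_votes[has_vote].argmax(axis=1)
manual_oob_acc = np.mean(manual_oob_pred == y_demo[has_vote])

print(f"Samples with at least one OOB vote: {has_vote.sum()}/{n_obs}")
print(f"Manual OOB accuracy: {manual_oob_acc:.4f}")
```

Each sample is scored by roughly a third of the trees, which is why OOB estimates become reliable only once the ensemble is reasonably large.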

**Benefits**:
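- Free validation score without needing a separate validation set
- More training data (no need to hold out a validation set)
- Approximately unbiased estimate of generalization error (often slightly pessimistic, because each sample is scored only by the ~36.8% of models that did not train on it)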

### OOB vs Cross-Validation

In [None]:
# Compare OOB score with cross-validation
from sklearn.model_selection import cross_val_score
import time

# Create bagging model with OOB
bagging_oob = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=8),
    n_estimators=100,
    oob_score=True,
    random_state=42,
    n_jobs=-1
)

# Measure OOB time
start = time.time()
bagging_oob.fit(X_train, y_train)
oob_time = time.time() - start
oob_score = bagging_oob.oob_score_

# Create bagging model for CV
bagging_cv = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=8),
    n_estimators=100,
    oob_score=False,
    random_state=42,
    n_jobs=-1
)

# Measure CV time
start = time.time()
cv_scores = cross_val_score(bagging_cv, X_train, y_train, cv=5, n_jobs=-1)
cv_time = time.time() - start
cv_score = cv_scores.mean()

# Compare
print("\n⚡ OOB vs Cross-Validation Comparison:")
print("=" * 50)
print(f"\nOOB Score:")
print(f"  Accuracy: {oob_score:.4f}")
print(f"  Time:     {oob_time:.2f}s")
print(f"\n5-Fold CV Score:")
print(f"  Accuracy: {cv_score:.4f} (±{cv_scores.std():.4f})")
print(f"  Time:     {cv_time:.2f}s")
print(f"\n💡 OOB is {cv_time/oob_time:.1f}× faster and gives similar accuracy!")

# Visualize
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Accuracy comparison
methods = ['OOB', '5-Fold CV']
scores = [oob_score, cv_score]
ax1.bar(methods, scores, color=['skyblue', 'lightcoral'])
ax1.set_ylabel('Accuracy')
ax1.set_title('Accuracy Comparison')
ax1.set_ylim(0.8, 1.0)
for i, v in enumerate(scores):
    ax1.text(i, v + 0.01, f'{v:.4f}', ha='center', fontweight='bold')

# Time comparison
times = [oob_time, cv_time]
ax2.bar(methods, times, color=['skyblue', 'lightcoral'])
ax2.set_ylabel('Time (seconds)')
ax2.set_title('Computation Time Comparison')
for i, v in enumerate(times):
    ax2.text(i, v + 0.1, f'{v:.2f}s', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

### 🎯 Exercise 3: OOB Score Stability

Investigate how OOB score changes with number of estimators:

1. Train bagging models with n_estimators = [10, 25, 50, 100, 200, 500]
2. Record OOB score and test score for each
3. Plot both scores vs number of estimators
4. At what point does OOB score stabilize?
5. How well does OOB score predict test score?

In [None]:
# Your code here


## 6. Optimal Number of Estimators

### How Many Models Should We Bag?

Trade-offs:
- **More estimators** → Better performance (up to a point)
- **More estimators** → Longer training and prediction time
- **More estimators** → Higher memory usage

**General rule**: Performance improvement plateaus after a certain point.
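The plateau can be observed from a single training run by scoring every "prefix" of the ensemble (the first k trees). The sketch below assumes hand-rolled bootstrapped trees with hard majority votes rather than `BaggingClassifier`, and its own train/validation split; all names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_all, y_all = load_wine(return_X_y=True)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_all, y_all, test_size=0.3, random_state=0, stratify=y_all
)

rng = np.random.default_rng(0)
n_fit, n_val = len(X_fit), len(X_val)
n_classes, max_trees = len(np.unique(y_all)), 200

# One-hot votes of every tree on the validation set: (max_trees, n_val, n_classes)
votes = np.zeros((max_trees, n_val, n_classes))
for b in range(max_trees):
    idx = rng.integers(0, n_fit, size=n_fit)      # bootstrap indices
    tree = DecisionTreeClassifier(random_state=b).fit(X_fit[idx], y_fit[idx])
    votes[b, np.arange(n_val), tree.predict(X_val)] = 1

# Cumulative vote counts give the majority vote of the first k trees for every k
cumulative_votes = np.cumsum(votes, axis=0)
prefix_pred = cumulative_votes.argmax(axis=2)     # ties go to the lowest class index
prefix_acc = (prefix_pred == y_val).mean(axis=1)  # accuracy with 1..max_trees trees

for k in (1, 5, 10, 25, 50, 100, 200):
    print(f"first {k:3d} trees: validation accuracy = {prefix_acc[k - 1]:.4f}")
```

Because all prefixes share the same trees, the curve isolates the effect of ensemble size from run-to-run randomness.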

In [None]:
# Study effect of number of estimators
n_estimators_range = [1, 5, 10, 25, 50, 100, 150, 200, 300, 500]

train_scores = []
test_scores = []
oob_scores = []
training_times = []

for n_est in n_estimators_range:
    # Train model
    start = time.time()
    model = BaggingClassifier(
        estimator=DecisionTreeClassifier(max_depth=10),
        n_estimators=n_est,
        oob_score=True,
        random_state=42,
        n_jobs=-1
    )
    model.fit(X_train, y_train)
    elapsed = time.time() - start
    
    # Record scores
    train_scores.append(model.score(X_train, y_train))
    test_scores.append(model.score(X_test, y_test))
    oob_scores.append(model.oob_score_)
    training_times.append(elapsed)
    
    print(f"n={n_est:3d}: Train={train_scores[-1]:.4f}, Test={test_scores[-1]:.4f}, "
          f"OOB={oob_scores[-1]:.4f}, Time={elapsed:.2f}s")

# Visualize
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy vs number of estimators
ax1.plot(n_estimators_range, train_scores, marker='o', label='Training', linewidth=2)
ax1.plot(n_estimators_range, test_scores, marker='s', label='Test', linewidth=2)
ax1.plot(n_estimators_range, oob_scores, marker='^', label='OOB', linewidth=2, linestyle='--')
ax1.set_xlabel('Number of Estimators', fontsize=12)
ax1.set_ylabel('Accuracy', fontsize=12)
ax1.set_title('Performance vs Ensemble Size', fontsize=13, fontweight='bold')
ax1.legend(fontsize=10)
ax1.grid(True, alpha=0.3)
ax1.set_xscale('log')

# Training time vs number of estimators
ax2.plot(n_estimators_range, training_times, marker='o', color='red', linewidth=2)
ax2.set_xlabel('Number of Estimators', fontsize=12)
ax2.set_ylabel('Training Time (seconds)', fontsize=12)
ax2.set_title('Computational Cost vs Ensemble Size', fontsize=13, fontweight='bold')
ax2.grid(True, alpha=0.3)
ax2.set_xscale('log')

plt.tight_layout()
plt.show()

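# Pick the ensemble size with the best OOB score
# (selecting on test_scores would leak test data into model selection)
best_idx = np.argmax(oob_scores)
best_n = n_estimators_range[best_idx]
best_score = test_scores[best_idx]

print(f"\n✅ Selected configuration (by OOB score):")
print(f"  n_estimators: {best_n}")
print(f"  OOB accuracy:  {oob_scores[best_idx]:.4f}")
print(f"  Test accuracy: {best_score:.4f}")
print(f"\n💡 Beyond {best_n} estimators, OOB accuracy does not improve further in this run!")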

## 7. Bagging for Regression

Bagging applies to regression problems in the same way:
- Bootstrap sampling stays the same
- Aggregation uses **averaging** instead of voting
- Reduces variance in predictions
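Averaging can be sketched by hand in a few lines. This is an illustrative from-scratch version, assuming manually bootstrapped `DecisionTreeRegressor` models and a synthetic dataset; the variable names are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X_r, y_r = make_regression(
    n_samples=400, n_features=10, n_informative=8, noise=20, random_state=0
)
X_r_train, X_r_test, y_r_train, y_r_test = train_test_split(
    X_r, y_r, test_size=0.3, random_state=0
)

rng = np.random.default_rng(0)
n_train = len(X_r_train)

# Collect test-set predictions from 50 trees, each fit on its own bootstrap sample
tree_preds = []
for b in range(50):
    idx = rng.integers(0, n_train, size=n_train)
    tree = DecisionTreeRegressor(random_state=b).fit(X_r_train[idx], y_r_train[idx])
    tree_preds.append(tree.predict(X_r_test))
tree_preds = np.array(tree_preds)                 # shape (50, n_test)

# Aggregate by averaging across trees
bagged_pred = tree_preds.mean(axis=0)
bagged_mse = mean_squared_error(y_r_test, bagged_pred)
mean_single_mse = np.mean([mean_squared_error(y_r_test, p) for p in tree_preds])

print(f"Average MSE of individual trees: {mean_single_mse:.2f}")
print(f"MSE of averaged prediction:      {bagged_mse:.2f}")
```

The MSE of the averaged prediction can never exceed the average MSE of the individual trees (a consequence of the convexity of squared error); how much lower it gets depends on how much the trees disagree.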

In [None]:
# Create regression dataset
X_reg, y_reg = make_regression(
    n_samples=400,
    n_features=10,
    n_informative=8,
    noise=20,
    random_state=42
)

X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.3, random_state=42
)

# Single decision tree regressor
single_tree_reg = DecisionTreeRegressor(max_depth=10, random_state=42)
single_tree_reg.fit(X_train_reg, y_train_reg)
single_pred = single_tree_reg.predict(X_test_reg)
single_mse = mean_squared_error(y_test_reg, single_pred)
single_r2 = r2_score(y_test_reg, single_pred)

# Bagging regressor
bagging_reg = BaggingRegressor(
    estimator=DecisionTreeRegressor(max_depth=10),
    n_estimators=100,
    oob_score=True,
    random_state=42,
    n_jobs=-1
)
bagging_reg.fit(X_train_reg, y_train_reg)
bagging_pred = bagging_reg.predict(X_test_reg)
bagging_mse = mean_squared_error(y_test_reg, bagging_pred)
bagging_r2 = r2_score(y_test_reg, bagging_pred)

# Compare
print("\n📊 Regression Performance:")
print("=" * 50)
print(f"\nSingle Decision Tree:")
print(f"  MSE: {single_mse:.2f}")
print(f"  R²:  {single_r2:.4f}")
print(f"\nBagging (100 trees):")
print(f"  MSE: {bagging_mse:.2f}")
print(f"  R²:  {bagging_r2:.4f}")
print(f"  OOB R²: {bagging_reg.oob_score_:.4f}")
print(f"\n✅ MSE reduction: {(1 - bagging_mse/single_mse)*100:.1f}%")

# Visualize predictions
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Single tree
ax1.scatter(y_test_reg, single_pred, alpha=0.6, edgecolors='black')
ax1.plot([y_test_reg.min(), y_test_reg.max()], 
         [y_test_reg.min(), y_test_reg.max()], 
         'r--', linewidth=2, label='Perfect Prediction')
ax1.set_xlabel('True Values', fontsize=12)
ax1.set_ylabel('Predicted Values', fontsize=12)
ax1.set_title(f'Single Tree (R²={single_r2:.4f})', fontsize=13, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Bagging
ax2.scatter(y_test_reg, bagging_pred, alpha=0.6, edgecolors='black', color='green')
ax2.plot([y_test_reg.min(), y_test_reg.max()], 
         [y_test_reg.min(), y_test_reg.max()], 
         'r--', linewidth=2, label='Perfect Prediction')
ax2.set_xlabel('True Values', fontsize=12)
ax2.set_ylabel('Predicted Values', fontsize=12)
ax2.set_title(f'Bagging (R²={bagging_r2:.4f})', fontsize=13, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### 🎯 Exercise 4: Variance Reduction in Regression

Empirically demonstrate variance reduction:

1. Train 20 single decision trees, each on a different bootstrap sample of the training set (changing only random_state barely alters a tree fit on the same data)
2. Get predictions from all 20 trees on test set
3. Calculate variance of predictions for each test sample
4. Train bagging model with 20 estimators
5. Compare prediction variance between individual trees and bagging
6. Visualize the variance reduction

In [None]:
# Your code here


## 8. Summary and Next Steps

### 🎓 Key Takeaways

1. **Bootstrap Sampling**:
   - Samples with replacement create diverse training sets
   - Each bootstrap contains ~63.2% unique samples
   - Remaining ~36.8% are out-of-bag (OOB) samples

2. **Bagging Algorithm**:
   - Train multiple models on bootstrap samples
   - Aggregate via voting (classification) or averaging (regression)
   - Reduces variance without increasing bias

3. **Out-of-Bag Estimation**:
   - Free validation score using OOB samples
   - Faster than cross-validation
   - Provides an approximately unbiased generalization estimate

4. **Hyperparameters**:
   - `n_estimators`: More is better (with diminishing returns)
   - `max_samples`: Usually 1.0 works well
   - `max_features`: Can help with very high-dimensional data
   - Base estimator: Use high-variance models (deep trees)

5. **When to Use Bagging**:
   - ✅ Base model has high variance (overfitting)
   - ✅ Sufficient training data available
   - ✅ Want to reduce overfitting without changing base model
   - ❌ Base model has high bias (underfitting) - use boosting instead

### 📚 What's Next?

- **Module 02**: Random Forests (specialized bagging with feature randomness)
- **Module 03**: AdaBoost (sequential ensemble that reduces bias)
- **Module 04**: Gradient Boosting (more powerful sequential ensemble)

### 🎯 Practice Recommendations

1. Apply bagging to your own dataset
2. Compare single model vs bagging performance
3. Use OOB score for model selection
4. Experiment with different base estimators (SVM, KNN, etc.)
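As a starting point for the last recommendation, here is a hedged sketch of bagging a non-tree base estimator. The choice of a scaled k-nearest-neighbors pipeline, `n_estimators=25`, and `max_samples=0.7` are illustrative assumptions, not tuned values. The base estimator is passed positionally because its keyword name differs between scikit-learn versions.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_w, y_w = load_wine(return_X_y=True)

# KNN is distance-based, so scale features inside the pipeline
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# First positional argument is the base estimator
bagged_knn = BaggingClassifier(knn, n_estimators=25, max_samples=0.7, random_state=0)

knn_cv = cross_val_score(knn, X_w, y_w, cv=5).mean()
bagged_knn_cv = cross_val_score(bagged_knn, X_w, y_w, cv=5).mean()

print(f"Single KNN (5-fold CV): {knn_cv:.4f}")
print(f"Bagged KNN (5-fold CV): {bagged_knn_cv:.4f}")
```

Stable learners such as KNN usually gain less from bagging than deep trees do, so a small or zero difference here would be consistent with the variance argument in this module.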

### 📖 Additional Resources

- **Original Paper**: Breiman, L. (1996). "Bagging Predictors". *Machine Learning*, 24(2), 123-140.
- **Sklearn User Guide**: https://scikit-learn.org/stable/modules/ensemble.html#bagging
- **Elements of Statistical Learning**: Chapter 8 (Model Inference and Averaging), Section 8.7 (Bagging)

---

**🚀 Ready for Random Forests? Let's move to Module 02!**