# Lab 4: Ensemble Methods

Ensemble methods combine multiple models into a single, usually more accurate predictor. The principle is simple: many weak learners, combined well, can form a strong learner.

## Learning Objectives

By the end of this lab, you will:
- Understand ensemble learning principles
- Implement bagging (bootstrap aggregating)
- Build Random Forests from scratch
- Understand boosting algorithms (AdaBoost, Gradient Boosting)
- Use XGBoost and LightGBM
- Apply stacking and blending
- Analyze feature importance

## Why Ensembles Work

**Wisdom of Crowds**: A group of diverse models often predicts better than any single member, because their individual errors partly cancel out.

**Key Requirements:**
1. Models should be diverse (make different errors)
2. Models should perform better than random guessing
3. Errors should be relatively independent
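To make these requirements concrete, here is a small simulation sketch. It assumes an idealized setting that real ensembles only approximate: binary classifiers that are each correct with probability 0.6 and whose errors are fully independent. The helper name `majority_vote_accuracy` is hypothetical and introduced only for illustration.

```python
import numpy as np

def majority_vote_accuracy(n_models: int, p_correct: float,
                           n_trials: int = 20_000, seed: int = 0) -> float:
    """Estimate the accuracy of a majority vote over independent binary classifiers."""
    rng = np.random.default_rng(seed)
    # Each entry is True when that model is correct on that trial
    correct = rng.random((n_trials, n_models)) < p_correct
    # The vote is right when more than half of the models are right
    return float((correct.sum(axis=1) > n_models / 2).mean())

# Assumption: every model is right 60% of the time, independently of the others
for n in (1, 5, 25, 101):
    print(f"{n:>3} models: {majority_vote_accuracy(n, 0.6):.3f}")
```

Try `p_correct=0.4` as well: when the individual models are worse than random guessing, adding more of them makes the vote worse, which is why requirement 2 matters.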

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from typing import List, Tuple
from collections import Counter
from sklearn.datasets import make_classification, load_breast_cancer, load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    BaggingClassifier, RandomForestClassifier, AdaBoostClassifier,
    GradientBoostingClassifier, VotingClassifier, StackingClassifier
)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## Part 1: Bagging (Bootstrap Aggregating)

**Bagging** trains multiple models on different bootstrap samples of the training data, i.e. random samples of the same size drawn with replacement.

### Algorithm:
1. Create k bootstrap samples from training data
2. Train a model on each bootstrap sample
3. For prediction:
   - Classification: Majority vote
   - Regression: Average predictions

**Benefits:**
- Reduces variance
- Helps prevent overfitting
- Particularly effective for high-variance models (e.g., decision trees)
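Before building the full ensemble, it can help to look at a single bootstrap sample. The sketch below (the sample size and variable names are arbitrary choices for illustration) draws one sample with replacement and checks how many distinct rows it contains. For large datasets this fraction approaches 1 - 1/e, so each model sees roughly 63% of the rows and the rest are left out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# One bootstrap sample: n_samples row indices drawn with replacement
indices = rng.integers(0, n_samples, size=n_samples)

unique_fraction = np.unique(indices).size / n_samples
print(f"Unique rows in the bootstrap sample: {unique_fraction:.3f}")
print(f"Theoretical limit 1 - 1/e:           {1 - np.exp(-1):.3f}")
```

The left-out rows are what an out-of-bag (OOB) estimate uses for validation.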

In [None]:
class BaggingClassifierScratch:
    """
    Bagging Classifier implemented from scratch.
    
    Parameters:
    -----------
    base_estimator : object
        Base model to use
    n_estimators : int
        Number of base models
    """
    
    def __init__(self, base_estimator, n_estimators: int = 10):
        self.base_estimator = base_estimator
        self.n_estimators = n_estimators
        self.estimators = []
    
    def fit(self, X: np.ndarray, y: np.ndarray):
        """
        Train bagging ensemble.
        """
        # Imported once per call (not once per estimator); kept local so the cell is self-contained
        from sklearn.base import clone

        n_samples = X.shape[0]
        self.estimators = []

        for _ in range(self.n_estimators):
            # Bootstrap sample: draw n_samples row indices with replacement
            indices = np.random.choice(n_samples, n_samples, replace=True)
            X_bootstrap = X[indices]
            y_bootstrap = y[indices]

            # Clone the base estimator so every model starts unfitted,
            # then train it on the bootstrap sample
            estimator = clone(self.base_estimator)
            estimator.fit(X_bootstrap, y_bootstrap)
            self.estimators.append(estimator)

        return self
    
    def predict(self, X: np.ndarray) -> np.ndarray:
        """
        Predict using majority vote.
        """
        predictions = np.array([estimator.predict(X) for estimator in self.estimators])
        # Majority vote for each sample
        return np.array([Counter(predictions[:, i]).most_common(1)[0][0] 
                        for i in range(X.shape[0])])

In [None]:
# Generate data
X, y = make_classification(n_samples=500, n_features=20, n_informative=15,
                          n_redundant=5, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Compare single tree vs bagging
single_tree = DecisionTreeClassifier(random_state=42)
single_tree.fit(X_train, y_train)

bagging_scratch = BaggingClassifierScratch(
    base_estimator=DecisionTreeClassifier(random_state=42),
    n_estimators=10
)
bagging_scratch.fit(X_train, y_train)

bagging_sklearn = BaggingClassifier(
    # scikit-learn >= 1.2 names this parameter `estimator` (formerly `base_estimator`)
    estimator=DecisionTreeClassifier(random_state=42),
    n_estimators=10,
    random_state=42
)
bagging_sklearn.fit(X_train, y_train)

print("Single Decision Tree:")
print(f"Training Accuracy: {single_tree.score(X_train, y_train):.3f}")
print(f"Test Accuracy: {single_tree.score(X_test, y_test):.3f}")

print("\nBagging (From Scratch):")
print(f"Test Accuracy: {accuracy_score(y_test, bagging_scratch.predict(X_test)):.3f}")

print("\nBagging (Scikit-learn):")
print(f"Training Accuracy: {bagging_sklearn.score(X_train, y_train):.3f}")
print(f"Test Accuracy: {bagging_sklearn.score(X_test, y_test):.3f}")

print("\nBagging usually lowers variance; compare the test accuracies above to check.")

In [None]:
# Effect of number of estimators
n_estimators_range = range(1, 51)
train_scores = []
test_scores = []

for n in n_estimators_range:
    bagging = BaggingClassifier(
        estimator=DecisionTreeClassifier(random_state=42),  # `base_estimator` before scikit-learn 1.2
        n_estimators=n,
        random_state=42
    )
    bagging.fit(X_train, y_train)
    train_scores.append(bagging.score(X_train, y_train))
    test_scores.append(bagging.score(X_test, y_test))

plt.figure(figsize=(10, 6))
plt.plot(n_estimators_range, train_scores, label='Training', linewidth=2)
plt.plot(n_estimators_range, test_scores, label='Test', linewidth=2)
plt.xlabel('Number of Estimators')
plt.ylabel('Accuracy')
plt.title('Bagging: Effect of Number of Estimators')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

## Part 2: Random Forests

**Random Forest** is an extension of bagging that adds an additional layer of randomness:
- Uses bagging (bootstrap samples)
- At each split, considers only a random subset of features

### Key Hyperparameters:
- `n_estimators`: Number of trees
- `max_features`: Number of features to consider for each split
- `max_depth`: Maximum depth of trees
- `min_samples_split`: Minimum samples required to split

### Advantages:
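- Strong accuracy on many tabular problems with little tuning
- Handles high-dimensional data well
- Provides feature importance
- Less prone to overfitting than a single deep tree
- Some implementations handle missing values natively (support depends on the library and version)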
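As a quick illustration of the `max_features` hyperparameter, the sketch below computes how many candidate features a split would consider under the common `'sqrt'` and `'log2'` rules, assuming 30 input features (the size of the breast cancer dataset used later in this lab). The dictionary name `candidates` is just an illustrative choice.

```python
import numpy as np

n_features = 30  # assumption: matches the 30-feature breast cancer dataset

candidates = {
    'sqrt': max(1, int(np.sqrt(n_features))),
    'log2': max(1, int(np.log2(n_features))),
    'all': n_features,
}

rng = np.random.default_rng(0)
for name, k in candidates.items():
    # A random subset of feature indices that one split would be allowed to use
    subset = np.sort(rng.choice(n_features, size=k, replace=False))
    print(f"{name:>4}: k={k}, e.g. features {subset.tolist()}")
```

Smaller subsets make the trees less correlated with each other, at the cost of each individual split being somewhat weaker.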

In [None]:
class SimpleRandomForest:
    """
    Simplified Random Forest implementation.
    """
    
    def __init__(self, n_estimators: int = 10, max_features: str = 'sqrt',
                max_depth: int = None):
        self.n_estimators = n_estimators
        self.max_features = max_features
        self.max_depth = max_depth
        self.trees = []
        self.n_features_per_tree = None
    
    def fit(self, X: np.ndarray, y: np.ndarray):
        """
        Train random forest.
        """
        n_samples, n_features = X.shape
        
        # Determine number of candidate features considered at each split
        # (max(1, ...) guards against 0 when there are very few features)
        if self.max_features == 'sqrt':
            self.n_features_per_tree = max(1, int(np.sqrt(n_features)))
        elif self.max_features == 'log2':
            self.n_features_per_tree = max(1, int(np.log2(n_features)))
        else:
            self.n_features_per_tree = n_features
        
        self.trees = []
        
        for _ in range(self.n_estimators):
            # Bootstrap sample
            indices = np.random.choice(n_samples, n_samples, replace=True)
            X_bootstrap = X[indices]
            y_bootstrap = y[indices]
            
            # Train tree with feature subsampling
            tree = DecisionTreeClassifier(
                max_depth=self.max_depth,
                max_features=self.n_features_per_tree,
                random_state=None
            )
            tree.fit(X_bootstrap, y_bootstrap)
            self.trees.append(tree)
    
    def predict(self, X: np.ndarray) -> np.ndarray:
        """
        Predict using majority vote.
        """
        predictions = np.array([tree.predict(X) for tree in self.trees])
        return np.array([Counter(predictions[:, i]).most_common(1)[0][0] 
                        for i in range(X.shape[0])])

In [None]:
# Load breast cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, 
                                                     random_state=42, stratify=y)

# Compare implementations
rf_scratch = SimpleRandomForest(n_estimators=100, max_features='sqrt', max_depth=10)
rf_scratch.fit(X_train, y_train)

rf_sklearn = RandomForestClassifier(n_estimators=100, max_features='sqrt',
                                   max_depth=10, random_state=42)
rf_sklearn.fit(X_train, y_train)

print("Random Forest (From Scratch):")
print(f"Test Accuracy: {accuracy_score(y_test, rf_scratch.predict(X_test)):.3f}")

print("\nRandom Forest (Scikit-learn):")
print(f"Training Accuracy: {rf_sklearn.score(X_train, y_train):.3f}")
print(f"Test Accuracy: {rf_sklearn.score(X_test, y_test):.3f}")

print("\nClassification Report:")
print(classification_report(y_test, rf_sklearn.predict(X_test), 
                          target_names=data.target_names))

In [None]:
# Feature importance
feature_importance = pd.DataFrame({
    'feature': data.feature_names,
    'importance': rf_sklearn.feature_importances_
}).sort_values('importance', ascending=False)

plt.figure(figsize=(10, 8))
plt.barh(feature_importance['feature'][:15], feature_importance['importance'][:15])
plt.xlabel('Importance')
plt.title('Top 15 Feature Importances (Random Forest)')
plt.gca().invert_yaxis()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nTop 10 Most Important Features:")
print(feature_importance.head(10))

## Part 3: Boosting

**Boosting** trains models sequentially, with each model focusing on the errors of previous models.

### AdaBoost (Adaptive Boosting)

1. Start with equal weights for all samples
2. Train a weak learner
3. Increase weights for misclassified samples
4. Repeat steps 2-3; in the final weighted vote, learners with lower error get more say

### Key Idea:
- Forces subsequent models to focus on hard-to-classify examples
- Primarily reduces bias (whereas bagging primarily reduces variance)
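The AdaBoost steps can be sketched from scratch in a few lines. This is a minimal version of the classic discrete AdaBoost update, assuming binary labels encoded as -1/+1 and decision stumps as weak learners. The function names `adaboost_fit` and `adaboost_predict` and the synthetic dataset are illustrative choices, not part of scikit-learn's API.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=20):
    """Discrete AdaBoost for labels in {-1, +1}; returns (stumps, alphas)."""
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                     # step 1: equal sample weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1, random_state=0)
        stump.fit(X, y, sample_weight=w)        # step 2: train a weak learner
        pred = stump.predict(X)
        # Weighted error, clipped to keep the logarithm finite
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # lower error -> larger say
        w *= np.exp(-alpha * y * pred)          # step 3: up-weight the mistakes
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def adaboost_predict(X, stumps, alphas):
    """Sign of the alpha-weighted sum of stump votes."""
    votes = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
    return np.where(votes >= 0, 1, -1)

X_demo, y01 = make_classification(n_samples=300, n_features=10, random_state=0)
y_demo = 2 * y01 - 1                            # map {0, 1} -> {-1, +1}

stumps, alphas = adaboost_fit(X_demo, y_demo)
acc_one = (stumps[0].predict(X_demo) == y_demo).mean()
acc_all = (adaboost_predict(X_demo, stumps, alphas) == y_demo).mean()
print(f"First stump alone (train): {acc_one:.3f}")
print(f"20 boosted stumps (train): {acc_all:.3f}")
```

scikit-learn's `AdaBoostClassifier` handles multi-class labels and other details, so treat this only as a way to see the weight update in action.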

In [None]:
# AdaBoost
adaboost = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # Weak learners (stumps); `base_estimator` before scikit-learn 1.2
    n_estimators=100,
    random_state=42
)
adaboost.fit(X_train, y_train)

print("AdaBoost Results:")
print(f"Training Accuracy: {adaboost.score(X_train, y_train):.3f}")
print(f"Test Accuracy: {adaboost.score(X_test, y_test):.3f}")

# Effect of number of estimators
n_estimators_range = range(1, 101, 5)
train_scores = []
test_scores = []

for n in n_estimators_range:
    ada = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),  # `base_estimator` before scikit-learn 1.2
        n_estimators=n,
        random_state=42
    )
    ada.fit(X_train, y_train)
    train_scores.append(ada.score(X_train, y_train))
    test_scores.append(ada.score(X_test, y_test))

plt.figure(figsize=(10, 6))
plt.plot(n_estimators_range, train_scores, label='Training', linewidth=2)
plt.plot(n_estimators_range, test_scores, label='Test', linewidth=2)
plt.xlabel('Number of Estimators')
plt.ylabel('Accuracy')
plt.title('AdaBoost: Effect of Number of Estimators')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

### Gradient Boosting

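Gradient boosting generalizes the idea to any differentiable loss:
1. Start with a simple model (often just the mean)
2. Compute residuals (for squared error; in general, the negative gradient of the loss)
3. Train next model to predict residuals
4. Add its predictions, scaled by the learning rate, to the ensemble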
5. Repeat

**Key Hyperparameters:**
- `learning_rate`: Shrinks contribution of each tree (lower = more robust, needs more trees)
- `n_estimators`: Number of boosting stages
- `max_depth`: Tree depth (typically shallow, 3-8)
- `subsample`: Fraction of samples to use for each tree
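The residual-fitting loop is easiest to see in a regression setting with squared error, where the negative gradient is exactly the residual. The following is a minimal sketch under that assumption; the synthetic sine data, the variable names, and the choice of depth-2 trees are illustrative, and this is not how `GradientBoostingClassifier` is implemented internally.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_stages = 50

# Step 1: start from a constant prediction (the mean minimizes squared error)
prediction = np.full_like(y_demo, y_demo.mean())
trees = []
mse_history = [np.mean((y_demo - prediction) ** 2)]

for _ in range(n_stages):
    residuals = y_demo - prediction                      # step 2: current errors
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residuals)                          # step 3: fit the residuals
    prediction += learning_rate * tree.predict(X_demo)   # step 4: shrunken update
    trees.append(tree)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"Training MSE, constant model: {mse_history[0]:.4f}")
print(f"Training MSE, {n_stages} stages:    {mse_history[-1]:.4f}")
```

Lowering `learning_rate` makes each stage contribute less, so more stages are needed to reach the same training error, which is the trade-off described in the hyperparameter list.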

In [None]:
# Gradient Boosting
gb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
gb.fit(X_train, y_train)

print("Gradient Boosting Results:")
print(f"Training Accuracy: {gb.score(X_train, y_train):.3f}")
print(f"Test Accuracy: {gb.score(X_test, y_test):.3f}")

# Learning rate effect
learning_rates = [0.01, 0.05, 0.1, 0.2, 0.5]
test_scores_lr = []

for lr in learning_rates:
    gb_lr = GradientBoostingClassifier(
        n_estimators=100,
        learning_rate=lr,
        max_depth=3,
        random_state=42
    )
    gb_lr.fit(X_train, y_train)
    test_scores_lr.append(gb_lr.score(X_test, y_test))

plt.figure(figsize=(10, 6))
plt.plot(learning_rates, test_scores_lr, 'o-', linewidth=2, markersize=8)
plt.xlabel('Learning Rate')
plt.ylabel('Test Accuracy')
plt.title('Gradient Boosting: Effect of Learning Rate')
plt.grid(True, alpha=0.3)
plt.show()

## Part 4: XGBoost and LightGBM

Modern, highly optimized implementations of gradient boosting.

### XGBoost (Extreme Gradient Boosting)
- Regularization to prevent overfitting
- Parallel processing
- Handling missing values
- Tree pruning

### LightGBM
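- Faster training on large datasets (histogram-based split finding)
- Lower memory usage
- Accuracy often comparable to XGBoost; results depend on the dataset and tuning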
- Leaf-wise tree growth

In [None]:
try:
    import xgboost as xgb
    from lightgbm import LGBMClassifier
    
    # XGBoost
    xgb_model = xgb.XGBClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=3,
        random_state=42,
        eval_metric='logloss'
    )
    xgb_model.fit(X_train, y_train)
    
    # LightGBM
    lgbm_model = LGBMClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=3,
        random_state=42
    )
    lgbm_model.fit(X_train, y_train)
    
    print("XGBoost Results:")
    print(f"Test Accuracy: {xgb_model.score(X_test, y_test):.3f}")
    
    print("\nLightGBM Results:")
    print(f"Test Accuracy: {lgbm_model.score(X_test, y_test):.3f}")
    
    # Feature importance comparison
    fig, axes = plt.subplots(1, 2, figsize=(14, 6))
    
    # XGBoost importance
    xgb_importance = pd.DataFrame({
        'feature': data.feature_names,
        'importance': xgb_model.feature_importances_
    }).sort_values('importance', ascending=False).head(10)
    
    axes[0].barh(xgb_importance['feature'], xgb_importance['importance'])
    axes[0].set_xlabel('Importance')
    axes[0].set_title('XGBoost Feature Importance')
    axes[0].invert_yaxis()
    axes[0].grid(True, alpha=0.3)
    
    # LightGBM importance (split counts by default, so the scale differs from XGBoost's)
    lgbm_importance = pd.DataFrame({
        'feature': data.feature_names,
        'importance': lgbm_model.feature_importances_
    }).sort_values('importance', ascending=False).head(10)
    
    axes[1].barh(lgbm_importance['feature'], lgbm_importance['importance'])
    axes[1].set_xlabel('Importance')
    axes[1].set_title('LightGBM Feature Importance')
    axes[1].invert_yaxis()
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
except ImportError:
    print("XGBoost or LightGBM not installed. Install with:")
    print("pip install xgboost lightgbm")

## Part 5: Voting and Stacking

### Voting Classifier

Combines predictions from multiple different algorithms:
- **Hard voting**: Majority vote
- **Soft voting**: Average probabilities (usually better)
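A tiny numeric example shows how the two voting rules can disagree. The probabilities below are made up for illustration: two models lean slightly towards class 0 while a third is very confident in class 1.

```python
import numpy as np

# Hypothetical P(class 1) from three classifiers for a single sample
proba_class1 = np.array([0.45, 0.45, 0.95])

# Hard voting: each model casts one vote for its most likely class
hard_votes = (proba_class1 >= 0.5).astype(int)      # [0, 0, 1]
hard_prediction = int(np.bincount(hard_votes, minlength=2).argmax())

# Soft voting: average the probabilities, then threshold
soft_prediction = int(proba_class1.mean() >= 0.5)

print(f"Hard vote: class {hard_prediction}")
print(f"Soft vote: class {soft_prediction} (mean P(class 1) = {proba_class1.mean():.3f})")
```

Soft voting lets a confident model outweigh two uncertain ones, which helps only when the predicted probabilities are reasonably well calibrated.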

### Stacking (Stacked Generalization)

Uses a meta-learner to combine base model predictions:
1. Train base models on training data
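2. Use base model predictions as features (out-of-fold predictions, so the meta-model never sees a prediction made on a row the base model was trained on)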
3. Train meta-model on these features
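These three steps can be written out by hand with `cross_val_predict`. The sketch below is one possible manual version, assuming two base models and a logistic regression meta-model on the breast cancer data; the variable names (`meta_train`, `meta_test`, and so on) are illustrative. `StackingClassifier` automates the same idea.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

base_models = [
    RandomForestClassifier(n_estimators=50, random_state=0),
    KNeighborsClassifier(n_neighbors=5),
]

# Steps 1-2: out-of-fold probabilities become the meta-features, so the
# meta-model never sees a prediction made on a row the base model trained on
meta_train = np.column_stack([
    cross_val_predict(m, X_tr, y_tr, cv=5, method='predict_proba')[:, 1]
    for m in base_models
])

# Step 3: train the meta-model on the meta-features
meta_model = LogisticRegression()
meta_model.fit(meta_train, y_tr)

# At test time, base models refit on the full training set supply the features
meta_test = np.column_stack([
    m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1] for m in base_models
])
stacked_acc = meta_model.score(meta_test, y_te)
print(f"Manual stacking test accuracy: {stacked_acc:.3f}")
```

Using in-sample predictions in step 2 instead would let the meta-model reward base models that simply memorized the training labels.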

In [None]:
# Voting Classifier
voting_clf = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000, random_state=42)),
        ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
        ('svc', SVC(probability=True, random_state=42))
    ],
    voting='soft'
)
voting_clf.fit(X_train, y_train)

print("Voting Classifier Results:")
print(f"Test Accuracy: {voting_clf.score(X_test, y_test):.3f}")

# Compare with individual models
print("\nIndividual Model Scores:")
for name, model in voting_clf.named_estimators_.items():
    print(f"{name}: {model.score(X_test, y_test):.3f}")

In [None]:
# Stacking Classifier
stacking_clf = StackingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=1000, random_state=42)),
        ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
        ('knn', KNeighborsClassifier(n_neighbors=5))
    ],
    final_estimator=GradientBoostingClassifier(n_estimators=50, random_state=42),
    cv=5
)
stacking_clf.fit(X_train, y_train)

print("Stacking Classifier Results:")
print(f"Test Accuracy: {stacking_clf.score(X_test, y_test):.3f}")

print("\nBase Estimator Scores:")
for name, model in stacking_clf.named_estimators_.items():
    print(f"{name}: {model.score(X_test, y_test):.3f}")

## Part 6: Comprehensive Model Comparison

Let's compare several of the ensemble methods above on a small multi-class dataset.

In [None]:
# Load wine dataset (multi-class)
wine = load_wine()
X, y = wine.data, wine.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Models to compare
models = {
    'Single Tree': DecisionTreeClassifier(random_state=42),
    'Bagging': BaggingClassifier(n_estimators=100, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'AdaBoost': AdaBoostClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'Voting': VotingClassifier(
        estimators=[
            ('rf', RandomForestClassifier(n_estimators=50, random_state=42)),
            ('gb', GradientBoostingClassifier(n_estimators=50, random_state=42))
        ],
        voting='soft'
    )
}

results = {}

for name, model in models.items():
    # Cross-validation
    cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5)
    
    # Train and test
    model.fit(X_train_scaled, y_train)
    test_score = model.score(X_test_scaled, y_test)
    
    results[name] = {
        'cv_mean': cv_scores.mean(),
        'cv_std': cv_scores.std(),
        'test_score': test_score
    }
    
    print(f"\n{name}:")
    print(f"  CV Score: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
    print(f"  Test Score: {test_score:.3f}")

# Visualize comparison
results_df = pd.DataFrame(results).T

fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(len(results_df))
width = 0.35

ax.bar(x - width/2, results_df['cv_mean'], width, label='CV Score',
      yerr=results_df['cv_std'], capsize=5)
ax.bar(x + width/2, results_df['test_score'], width, label='Test Score')

ax.set_xlabel('Model')
ax.set_ylabel('Accuracy')
ax.set_title('Ensemble Methods Comparison')
ax.set_xticks(x)
ax.set_xticklabels(results_df.index, rotation=45, ha='right')
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Key Takeaways

1. **Bagging** reduces variance, works well with high-variance models (deep trees)
2. **Random Forests** combine bagging with feature randomness for excellent performance
3. **Boosting** reduces bias, builds models sequentially
4. **AdaBoost** focuses on misclassified samples
5. **Gradient Boosting** fits each new model to the gradient of the loss and is often more accurate than AdaBoost
6. **XGBoost/LightGBM** are highly optimized gradient boosting implementations
7. **Voting** combines diverse models through majority vote or probability averaging
8. **Stacking** uses a meta-learner to combine predictions
9. **Feature importance** helps understand what the model learned
10. Ensemble methods typically outperform single models

## When to Use What?

- **Random Forest**: Great default choice, robust, with built-in feature importances
- **Gradient Boosting/XGBoost**: When you need maximum accuracy
- **AdaBoost**: When you have simple weak learners
- **Bagging**: When variance is the main issue
- **Stacking**: When you want to squeeze out the last bit of performance

## Exercises

1. **Out-of-Bag (OOB) Score**: Implement OOB evaluation for bagging.

2. **Ensemble Diversity**: Measure and visualize diversity among ensemble members.

3. **Partial Dependence Plots**: Visualize how features affect predictions.

4. **Hyperparameter Tuning**: Use GridSearchCV to find optimal hyperparameters for Random Forest.

5. **Custom Voting**: Implement weighted voting with custom weights.

6. **Multi-level Stacking**: Create a 3-level stacking ensemble.

7. **Feature Selection with RF**: Use Random Forest feature importance for feature selection.

## Next Steps

You've completed Week 5! Next week:
- Week 6: Neural Networks - Deep learning fundamentals
- Week 7: Language - NLP and transformers

Congratulations! You now understand both individual ML algorithms and how to combine them for superior performance.