# Module 00: Introduction to Ensemble Learning

**Difficulty**: ⭐⭐⭐ Advanced
**Estimated Time**: 60 minutes
**Prerequisites**: 
- Machine Learning Fundamentals (decision trees, model evaluation)
- Feature Engineering basics
- Understanding of bias-variance tradeoff

## Learning Objectives

By the end of this notebook, you will be able to:
1. Explain the concept of ensemble learning and the "wisdom of crowds"
2. Understand the bias-variance tradeoff in ensemble methods
3. Distinguish between different ensemble strategies (bagging, boosting, stacking)
4. Implement a simple ensemble from scratch
5. Identify when ensemble methods provide the most value

## Setup and Configuration

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.metrics import accuracy_score, mean_squared_error

# Configuration
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

print("Setup complete! All libraries imported successfully.")

## 1. What is Ensemble Learning?

### The Wisdom of Crowds

**Core Idea**: Combining multiple models often produces better predictions than any single model.

**Real-world analogy**: 
- Asking 100 people to estimate the number of jelly beans in a jar
- The average of all guesses is often more accurate than individual expert estimates
- Why? When guesses are roughly unbiased and independent, individual errors tend to cancel out

**Key Requirements for Effective Ensembles**:
1. **Diversity**: Models should make different types of errors
2. **Independence**: Models' errors should be as weakly correlated as possible (e.g., by training on different data samples, features, or algorithms)
3. **Reasonable accuracy**: Models should perform better than random guessing
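
The third requirement can be made concrete with a small calculation. The sketch below assumes an idealized setting that real models never fully meet: binary classification, every model has the same accuracy `p`, and their errors are independent. Under those assumptions it computes the probability that a strict majority vote is correct. `majority_vote_accuracy` is a hypothetical helper name, not part of any library.

```python
from math import comb

def majority_vote_accuracy(p, n_models):
    """Probability that a strict majority of n_models voters is correct.

    Assumes binary classification, identical accuracy p, independent errors,
    and an odd n_models so that ties cannot occur.
    """
    if n_models % 2 == 0:
        raise ValueError("n_models must be odd to avoid ties")
    k_min = n_models // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_models, k) * p**k * (1 - p) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )

# Compare voters that are worse than, equal to, and better than a coin flip
for p in (0.4, 0.5, 0.6):
    accs = [majority_vote_accuracy(p, n) for n in (1, 11, 101)]
    print(f"p={p}: " + ", ".join(f"{a:.3f}" for a in accs))
```

With `p` above 0.5 the vote improves as models are added; with `p` below 0.5 it gets worse, which is why each member needs to beat random guessing.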

In [None]:
# Demonstration: Wisdom of crowds with simulated estimators
true_value = 500  # True number of jelly beans

# Simulate 100 people's guesses as the true value plus independent,
# zero-mean noise (no systematic bias is modeled here)
np.random.seed(RANDOM_STATE)
individual_estimates = np.random.normal(loc=true_value, scale=100, size=100)

# Calculate average estimate
crowd_estimate = np.mean(individual_estimates)

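# Find the best individual estimate
best_individual = individual_estimates[np.argmin(np.abs(individual_estimates - true_value))]

# Typical (median) individual error: a fairer baseline than the single luckiest guess,
# which cannot be identified in advance
median_individual_error = np.median(np.abs(individual_estimates - true_value))

print(f"True value: {true_value}")
print(f"Crowd estimate (average): {crowd_estimate:.2f}")
print(f"Best individual estimate: {best_individual:.2f}")
print(f"\nCrowd error: {abs(crowd_estimate - true_value):.2f}")
print(f"Best individual error: {abs(best_individual - true_value):.2f}")
print(f"Median individual error: {median_individual_error:.2f}")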

# Visualization
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.hist(individual_estimates, bins=30, alpha=0.7, edgecolor='black')
plt.axvline(true_value, color='green', linestyle='--', linewidth=2, label='True value')
plt.axvline(crowd_estimate, color='red', linestyle='--', linewidth=2, label='Crowd estimate')
plt.xlabel('Estimate')
plt.ylabel('Frequency')
plt.title('Distribution of Individual Estimates')
plt.legend()

plt.subplot(1, 2, 2)
errors = np.abs(individual_estimates - true_value)
plt.scatter(range(len(errors)), errors, alpha=0.5)
plt.axhline(abs(crowd_estimate - true_value), color='red', linestyle='--', 
            linewidth=2, label='Crowd error')
plt.xlabel('Individual')
plt.ylabel('Absolute Error')
plt.title('Individual Errors vs Crowd Error')
plt.legend()

plt.tight_layout()
plt.show()

## 2. Bias-Variance Tradeoff in Ensembles

### Understanding Error Decomposition

**Expected Squared Error = Bias² + Variance + Irreducible Error**

- **Bias**: Error from overly simplistic assumptions (underfitting)
- **Variance**: Error from sensitivity to training data fluctuations (overfitting)
- **Irreducible Error**: Noise in the data itself
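
The bias and variance terms can be estimated empirically when the true function is known. The following is a minimal sketch under stated assumptions: the target is a sine curve with Gaussian noise (an arbitrary choice for illustration), and `true_function` and `estimate_bias_variance` are hypothetical helper names. It refits a tree on many fresh training sets and measures how far the average prediction is from the truth (bias²) and how much predictions scatter across refits (variance).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def true_function(x):
    # Assumed ground truth, chosen only for illustration
    return np.sin(2 * np.pi * x)

def estimate_bias_variance(max_depth, n_repeats=200, n_train=100,
                           noise_sd=0.3, seed=0):
    """Estimate bias^2 and variance of a tree by refitting on fresh samples."""
    rng = np.random.default_rng(seed)
    x_eval = np.linspace(0.0, 1.0, 50)
    preds = np.empty((n_repeats, x_eval.size))
    for r in range(n_repeats):
        # Draw a new noisy training set from the same distribution
        x_train = rng.uniform(0.0, 1.0, size=n_train)
        y_train = true_function(x_train) + rng.normal(0.0, noise_sd, size=n_train)
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(x_train.reshape(-1, 1), y_train)
        preds[r] = tree.predict(x_eval.reshape(-1, 1))
    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - true_function(x_eval)) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    return bias_sq, variance

for depth in (1, None):  # a stump versus a fully grown tree
    b, v = estimate_bias_variance(depth)
    print(f"max_depth={depth}: bias^2={b:.3f}, variance={v:.3f}")
```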

### How Ensembles Help

1. **Bagging** (Bootstrap Aggregating): Reduces **variance**
   - Trains models on different bootstrap samples (drawn with replacement) of the data
   - Averages predictions to smooth out fluctuations
   - Example: Random Forest

2. **Boosting**: Reduces **bias**
   - Sequentially trains models to correct previous errors
   - Builds complex models from simple ones
   - Example: XGBoost, AdaBoost

3. **Stacking**: Can reduce both
   - Uses a meta-model to learn how to combine the base models' predictions
   - Leverages strengths of diverse models

In [None]:
# Demonstration: Bias-Variance with ensemble averaging
# Create a synthetic regression problem
X, y = make_regression(n_samples=200, n_features=1, noise=15, random_state=RANDOM_STATE)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=RANDOM_STATE
)

# Train multiple high-variance models (deep decision trees)
n_models = 20
models = []
predictions = []

for i in range(n_models):
    # Use bootstrap sampling to create diverse models
    # Sample with replacement from training data
    indices = np.random.choice(len(X_train), size=len(X_train), replace=True)
    X_bootstrap = X_train[indices]
    y_bootstrap = y_train[indices]
    
    # Train a high-variance model (deep tree)
    model = DecisionTreeRegressor(max_depth=10, random_state=i)
    model.fit(X_bootstrap, y_bootstrap)
    
    models.append(model)
    predictions.append(model.predict(X_test))

# Ensemble prediction: average of all models
ensemble_pred = np.mean(predictions, axis=0)

# Calculate errors
individual_errors = [mean_squared_error(y_test, pred) for pred in predictions]
ensemble_error = mean_squared_error(y_test, ensemble_pred)

print("Individual Model Performance:")
print(f"  Average MSE: {np.mean(individual_errors):.2f}")
print(f"  Best MSE: {np.min(individual_errors):.2f}")
print(f"  Worst MSE: {np.max(individual_errors):.2f}")
print(f"\nEnsemble Performance:")
print(f"  MSE: {ensemble_error:.2f}")
print(f"\nImprovement: {((np.mean(individual_errors) - ensemble_error) / np.mean(individual_errors) * 100):.1f}%")

In [None]:
# Visualize predictions
plt.figure(figsize=(14, 5))

# Sort for better visualization
sort_idx = np.argsort(X_test[:, 0])
X_test_sorted = X_test[sort_idx]
y_test_sorted = y_test[sort_idx]

# Plot 1: Individual model predictions
plt.subplot(1, 2, 1)
for i, pred in enumerate(predictions[:5]):  # Show first 5 models
    plt.plot(X_test_sorted, pred[sort_idx], alpha=0.3, linewidth=1)
plt.scatter(X_test_sorted, y_test_sorted, color='black', s=20, 
            alpha=0.5, label='True values')
plt.xlabel('Feature value')
plt.ylabel('Target value')
plt.title('Individual Model Predictions (High Variance)')
plt.legend()

# Plot 2: Ensemble prediction
plt.subplot(1, 2, 2)
plt.scatter(X_test_sorted, y_test_sorted, color='black', s=20, 
            alpha=0.5, label='True values')
plt.plot(X_test_sorted, ensemble_pred[sort_idx], color='red', 
         linewidth=2, label='Ensemble prediction')
plt.xlabel('Feature value')
plt.ylabel('Target value')
plt.title('Ensemble Prediction (Reduced Variance)')
plt.legend()

plt.tight_layout()
plt.show()
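
For squared error, the gain from averaging can be written exactly: the ensemble's MSE equals the average member MSE minus the average spread of members around the ensemble prediction (often called the ambiguity decomposition). The sketch below checks this identity numerically on made-up data; `ambiguity_decomposition` and the `demo_*` arrays are hypothetical and independent of the variables used in the cells above.

```python
import numpy as np

def ambiguity_decomposition(member_preds, y_true):
    """Return (ensemble MSE, mean member MSE, mean ambiguity).

    member_preds has shape (n_models, n_samples); the ensemble is the
    unweighted mean of its members.
    """
    member_preds = np.asarray(member_preds, dtype=float)
    ensemble_pred = member_preds.mean(axis=0)
    ensemble_mse = float(np.mean((ensemble_pred - y_true) ** 2))
    mean_member_mse = float(np.mean((member_preds - y_true) ** 2))
    ambiguity = float(np.mean((member_preds - ensemble_pred) ** 2))
    return ensemble_mse, mean_member_mse, ambiguity

rng = np.random.default_rng(0)
demo_truth = rng.normal(size=200)
# Five hypothetical models: the truth plus independent noise of different sizes
demo_members = np.stack(
    [demo_truth + rng.normal(scale=s, size=200) for s in (0.5, 0.8, 1.0, 1.2, 1.5)]
)

ens_mse, mean_mse, ambiguity = ambiguity_decomposition(demo_members, demo_truth)
print(f"Ensemble MSE:                 {ens_mse:.4f}")
print(f"Mean member MSE - ambiguity:  {mean_mse - ambiguity:.4f}")
```

One consequence: an averaged ensemble is never worse than its average member under squared error, so the more informative comparison is against the best single model.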

## 3. Types of Ensemble Methods

### 3.1 Parallel Ensembles (Bagging)

**Strategy**: Train models independently in parallel
- Each model sees a different bootstrap sample of the data
- Predictions are averaged (regression) or voted (classification)
- **Goal**: Reduce variance

**When to use**:
- Base model has high variance (e.g., deep decision trees)
- Large datasets where you can create diverse subsets
- You want stable, robust predictions

**Examples**: Random Forest, Bagged Decision Trees
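
Bagging usually gets its "different subsets" by sampling rows with replacement. A quick simulation, sketched here with a hypothetical helper `bootstrap_unique_fraction`, shows that such a sample of size n contains only about 63% of the distinct original rows (the limit is 1 − 1/e), which is what makes each model's training set differ.

```python
import numpy as np

def bootstrap_unique_fraction(n_samples, n_rounds=200, seed=0):
    """Average fraction of distinct rows appearing in a bootstrap sample."""
    rng = np.random.default_rng(seed)
    fractions = []
    for _ in range(n_rounds):
        # Draw n_samples row indices with replacement
        idx = rng.integers(0, n_samples, size=n_samples)
        fractions.append(np.unique(idx).size / n_samples)
    return float(np.mean(fractions))

unique_fraction = bootstrap_unique_fraction(1000)
print(f"Average fraction of unique rows: {unique_fraction:.3f}")
print(f"Theoretical limit 1 - 1/e:       {1 - np.exp(-1):.3f}")
```

The rows left out of a given sample (roughly a third) are the "out-of-bag" rows, which can serve as a built-in validation set for that model.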

### 3.2 Sequential Ensembles (Boosting)

**Strategy**: Train models sequentially, each correcting previous errors
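- Each new model focuses on what the current ensemble predicts poorly (reweighted examples in AdaBoost, residuals in gradient boosting)
- Models are combined with weights (performance-based in AdaBoost, a learning rate in gradient boosting)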
- **Goal**: Reduce bias

**When to use**:
- Base model has high bias (e.g., shallow decision trees)
- You need high accuracy and can afford longer training
- Dataset is clean (boosting can overfit to noise)

**Examples**: AdaBoost, Gradient Boosting, XGBoost, LightGBM, CatBoost
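
The sequential idea can be sketched in a few lines for regression with squared error: start from the mean, then repeatedly fit a shallow tree to the current residuals and add a damped version of its prediction. This is a simplified illustration of gradient boosting, not how AdaBoost or the libraries above are implemented; the function names, the sine data, and the default settings are all assumptions made for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_residual_boosting(X, y, n_rounds=50, learning_rate=0.1, max_depth=2):
    """Fit shallow trees sequentially, each on the residuals of the current fit."""
    baseline = float(np.mean(y))
    current = np.full(len(y), baseline)
    trees, train_mse = [], []
    for _ in range(n_rounds):
        residuals = y - current  # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        current = current + learning_rate * tree.predict(X)
        trees.append(tree)
        train_mse.append(float(np.mean((y - current) ** 2)))
    return baseline, trees, train_mse

def predict_residual_boosting(baseline, trees, X, learning_rate=0.1):
    """Sum the baseline and the damped contribution of every tree."""
    pred = np.full(X.shape[0], baseline)
    for tree in trees:
        pred = pred + learning_rate * tree.predict(X)
    return pred

rng = np.random.default_rng(0)
boost_X = rng.uniform(-3.0, 3.0, size=(300, 1))
boost_y = np.sin(boost_X[:, 0]) + rng.normal(0.0, 0.2, size=300)

boost_baseline, boost_trees, boost_train_mse = fit_residual_boosting(boost_X, boost_y)
print(f"Training MSE after round 1:  {boost_train_mse[0]:.4f}")
print(f"Training MSE after round 50: {boost_train_mse[-1]:.4f}")
```

Training error keeps falling as rounds are added, which is the bias reduction described above and also the reason boosting can eventually fit noise.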

### 3.3 Heterogeneous Ensembles (Stacking/Blending)

**Strategy**: Combine different types of models
- Train diverse base models (e.g., trees, linear models, neural nets)
- A meta-model learns how to best combine their predictions (stacking); voting applies a fixed rule instead
- **Goal**: Leverage complementary strengths

**When to use**:
- You have computational resources for multiple model types
- Different models capture different patterns
- Competition settings (Kaggle)

**Examples**: Stacking classifier, voting ensemble
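
As a hedged illustration of the meta-model idea, the sketch below uses scikit-learn's `StackingClassifier`, which trains the final estimator on cross-validated predictions of the base models. The dataset, the choice of base models, and the `stack_*` variable names are assumptions made for this example only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, kept separate from the notebook's other variables
stack_X, stack_y = make_classification(
    n_samples=600, n_features=20, n_informative=10, random_state=0
)
stack_X_train, stack_X_test, stack_y_train, stack_y_test = train_test_split(
    stack_X, stack_y, test_size=0.3, random_state=0
)

stack_model = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
        ("knn", KNeighborsClassifier(n_neighbors=7)),
    ],
    # The meta-model sees out-of-fold base predictions, not the raw features
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack_model.fit(stack_X_train, stack_y_train)

stack_accuracy = stack_model.score(stack_X_test, stack_y_test)
print(f"Stacking test accuracy: {stack_accuracy:.3f}")
```

Using out-of-fold predictions (`cv=5`) matters: fitting the meta-model on in-sample base predictions would reward whichever base model overfits most.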

In [None]:
# Comparison of ensemble strategies
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Create a classification dataset
X, y = make_classification(
    n_samples=1000, 
    n_features=20, 
    n_informative=15,
    n_redundant=5,
    random_state=RANDOM_STATE
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=RANDOM_STATE
)

# 1. Base model (single decision tree)
base_model = DecisionTreeClassifier(max_depth=5, random_state=RANDOM_STATE)
base_model.fit(X_train, y_train)
base_score = accuracy_score(y_test, base_model.predict(X_test))

# 2. Bagging ensemble (parallel)
bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=5, random_state=RANDOM_STATE),
    n_estimators=20,
    random_state=RANDOM_STATE
)
bagging_model.fit(X_train, y_train)
bagging_score = accuracy_score(y_test, bagging_model.predict(X_test))

# 3. Boosting ensemble (sequential)
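boosting_model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=5, random_state=RANDOM_STATE),
    n_estimators=20,
    random_state=RANDOM_STATE,
    # Version-dependent: 'SAMME' avoids the SAMME.R deprecation warning in
    # scikit-learn 1.4-1.5. The `algorithm` parameter is itself deprecated
    # from 1.6, so remove this argument on newer releases.
    algorithm='SAMME'
)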
boosting_model.fit(X_train, y_train)
boosting_score = accuracy_score(y_test, boosting_model.predict(X_test))

# 4. Voting ensemble (heterogeneous)
voting_model = VotingClassifier(
    estimators=[
        ('dt', DecisionTreeClassifier(max_depth=5, random_state=RANDOM_STATE)),
        ('lr', LogisticRegression(max_iter=1000, random_state=RANDOM_STATE)),
        ('svc', SVC(kernel='rbf', probability=True, random_state=RANDOM_STATE))
    ],
    voting='soft'
)
voting_model.fit(X_train, y_train)
voting_score = accuracy_score(y_test, voting_model.predict(X_test))

# Compare results
results = pd.DataFrame({
    'Model': ['Base (Single Tree)', 'Bagging', 'Boosting', 'Voting'],
    'Accuracy': [base_score, bagging_score, boosting_score, voting_score],
    'Strategy': ['None', 'Parallel', 'Sequential', 'Heterogeneous'],
    'Primary Benefit': ['Baseline', 'Reduce variance', 'Reduce bias', 'Combine strengths']
})

print(results.to_string(index=False))

# Visualization
plt.figure(figsize=(10, 6))
colors = ['gray', 'blue', 'green', 'orange']
bars = plt.bar(results['Model'], results['Accuracy'], color=colors, alpha=0.7, edgecolor='black')
plt.ylabel('Accuracy')
plt.title('Comparison of Ensemble Strategies')
plt.ylim(0.7, 1.0)
plt.xticks(rotation=15, ha='right')

# Add value labels on bars
for bar, score in zip(bars, results['Accuracy']):
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 0.01,
             f'{score:.3f}', ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()

## 4. Building a Simple Ensemble from Scratch

Let's implement a basic ensemble to understand the mechanics. We'll create a simple averaging ensemble for regression.

In [None]:
class SimpleEnsembleRegressor:
    """
    A simple ensemble that averages predictions from multiple models.
    
    This demonstrates the core concept of ensemble learning:
    combining multiple models to improve predictions.
    """
    
    def __init__(self, models):
        """
        Parameters
        ----------
        models : list
            List of sklearn-compatible models to ensemble
        """
        self.models = models
        self.n_models = len(models)
        
    def fit(self, X, y):
        """
        Train all models in the ensemble.
        
        Each model sees the same training data in this simple version.
        More advanced ensembles (like bagging) would give each model
        different subsets of data.
        """
        for i, model in enumerate(self.models):
            model.fit(X, y)
            print(f"Trained model {i+1}/{self.n_models}")
        return self
    
    def predict(self, X):
        """
        Make predictions by averaging all model predictions.
        
        This simple averaging works well when models make uncorrelated errors.
        """
        # Get predictions from all models
        predictions = np.array([model.predict(X) for model in self.models])
        
        # Average across models (axis=0 means average across models)
        ensemble_prediction = np.mean(predictions, axis=0)
        
        return ensemble_prediction
    
    def get_individual_predictions(self, X):
        """
        Get predictions from each model separately.
        Useful for analyzing model diversity.
        """
        return np.array([model.predict(X) for model in self.models])

print("SimpleEnsembleRegressor class defined successfully!")

In [None]:
# Test our custom ensemble
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Ridge, Lasso
from sklearn.svm import SVR

# Create regression dataset
X, y = make_regression(n_samples=300, n_features=10, noise=20, random_state=RANDOM_STATE)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=RANDOM_STATE
)

# Create diverse base models
# Using different model types increases diversity
# Note: SVR is sensitive to feature and target scale; it is left unscaled here,
# so it may turn out to be the weakest member of the ensemble
base_models = [
    DecisionTreeRegressor(max_depth=5, random_state=RANDOM_STATE),
    DecisionTreeRegressor(max_depth=10, random_state=RANDOM_STATE + 1),
    Ridge(alpha=1.0),
    Lasso(alpha=1.0),
    SVR(kernel='rbf', C=1.0)
]

# Create and train ensemble
ensemble = SimpleEnsembleRegressor(base_models)
ensemble.fit(X_train, y_train)

# Make predictions
ensemble_pred = ensemble.predict(X_test)
individual_preds = ensemble.get_individual_predictions(X_test)

# Evaluate performance
print("\nPerformance Comparison:")
print("=" * 50)

for i, model in enumerate(base_models):
    mse = mean_squared_error(y_test, individual_preds[i])
    print(f"Model {i+1} ({model.__class__.__name__:20s}): MSE = {mse:.2f}")

ensemble_mse = mean_squared_error(y_test, ensemble_pred)
print("=" * 50)
print(f"Ensemble (Average)                    : MSE = {ensemble_mse:.2f}")

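# Calculate improvement over the average member.
# Note: under squared error an averaged ensemble can never be worse than the
# average of its members' MSEs, so also compare against the best single model.
individual_mses = [mean_squared_error(y_test, pred) for pred in individual_preds]
avg_individual_mse = np.mean(individual_mses)
improvement = (avg_individual_mse - ensemble_mse) / avg_individual_mse * 100
print(f"\nEnsemble improvement: {improvement:.1f}% better than average individual model")
print(f"Best individual MSE: {np.min(individual_mses):.2f} vs ensemble MSE: {ensemble_mse:.2f}")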

## 5. When Do Ensembles Work Best?

### Ideal Conditions

1. **Diverse Models**: Models make different types of errors
2. **Reasonable Accuracy**: Each model performs better than random
3. **Uncorrelated Errors**: Models' mistakes are independent

### Measuring Model Diversity

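We can measure correlation between model predictions as a convenient proxy for diversity. Keep in mind that accurate models all track the target and therefore correlate with each other; correlating the models' residuals (errors) measures the "uncorrelated errors" condition more directly. The cell below uses prediction correlation.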

In [None]:
# Analyze model diversity
# Higher correlation = less diversity = less benefit from ensembling

# Calculate correlation matrix of predictions
pred_df = pd.DataFrame(
    individual_preds.T,
    columns=[f'Model {i+1}' for i in range(len(base_models))]
)

correlation_matrix = pred_df.corr()

# Visualization
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt='.3f', cmap='coolwarm', 
            center=0, square=True, linewidths=1, cbar_kws={'label': 'Correlation'})
plt.title('Model Prediction Correlation Matrix\n(Lower correlation = More diversity)', 
          fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()

# Calculate average correlation (excluding diagonal)
mask = np.ones_like(correlation_matrix, dtype=bool)
np.fill_diagonal(mask, False)
avg_correlation = correlation_matrix.where(mask).mean().mean()

print(f"\nAverage pairwise correlation: {avg_correlation:.3f}")
print("\nInterpretation:")
if avg_correlation < 0.7:
    print("✓ Good diversity! Models make different predictions.")
elif avg_correlation < 0.85:
    print("○ Moderate diversity. Ensemble will help, but could be better.")
else:
    print("✗ Low diversity. Models are too similar. Consider different architectures.")
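
Because the requirement is about errors rather than predictions, it can be useful to correlate residuals instead. The sketch below is self-contained and uses its own synthetic data and models; `residual_correlation` and the `diag_*` names are hypothetical, and the model choices are only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def residual_correlation(fitted_models, X_eval, y_eval, names):
    """Pairwise Pearson correlation of residuals (y - prediction) per model."""
    residuals = {
        name: y_eval - model.predict(X_eval)
        for name, model in zip(names, fitted_models)
    }
    return pd.DataFrame(residuals).corr()

diag_X, diag_y = make_regression(n_samples=300, n_features=10, noise=20, random_state=0)
diag_X_train, diag_X_test, diag_y_train, diag_y_test = train_test_split(
    diag_X, diag_y, test_size=0.3, random_state=0
)

diag_names = ["Tree (depth 5)", "Tree (depth 10)", "Ridge"]
diag_models = [
    DecisionTreeRegressor(max_depth=5, random_state=0),
    DecisionTreeRegressor(max_depth=10, random_state=1),
    Ridge(alpha=1.0),
]
for model in diag_models:
    model.fit(diag_X_train, diag_y_train)

resid_corr = residual_correlation(diag_models, diag_X_test, diag_y_test, diag_names)
print(resid_corr.round(3))
```

Pairs with low residual correlation are the ones most likely to benefit from being averaged together.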

## 6. Real-World Applications

### Where Ensembles Excel

1. **Kaggle Competitions**: Many winning solutions use ensembles
   - Combine 10-50+ models for maximum accuracy
   - Stack different model types (trees, neural nets, linear models)

2. **Production ML Systems**: Especially for critical applications
   - Credit risk assessment
   - Fraud detection
   - Medical diagnosis
   - Recommendation systems

3. **Tabular Data**: Gradient boosting dominates
   - XGBoost, LightGBM, and CatBoost are among the strongest widely used methods
   - Often outperform neural networks on structured data

### Trade-offs to Consider

**Advantages**:
- Higher accuracy and robustness
- Reduced overfitting (especially bagging)
- Better generalization

**Disadvantages**:
- Longer training time (N models vs 1 model)
- Higher memory usage
- Less interpretable (harder to explain predictions)
- More complex deployment

In [None]:
# Demonstration: Training time comparison
import time

# Create larger dataset
X_large, y_large = make_classification(
    n_samples=10000, n_features=50, n_informative=30, random_state=RANDOM_STATE
)

# Time single model
# (time.perf_counter() is a monotonic, high-resolution clock meant for timing)
single_model = DecisionTreeClassifier(max_depth=10, random_state=RANDOM_STATE)
start = time.perf_counter()
single_model.fit(X_large, y_large)
single_time = time.perf_counter() - start

# Time ensemble (50 models)
ensemble_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=10, random_state=RANDOM_STATE),
    n_estimators=50,
    random_state=RANDOM_STATE,
    n_jobs=-1  # Use all CPU cores for parallel training
)
start = time.perf_counter()
ensemble_model.fit(X_large, y_large)
ensemble_time = time.perf_counter() - start

print("Training Time Comparison:")
print(f"Single model: {single_time:.3f} seconds")
print(f"Ensemble (50 models): {ensemble_time:.3f} seconds")
print(f"\nTime multiplier: {ensemble_time/single_time:.1f}x")
print("\nNote: Parallel processing (n_jobs=-1) helps, but ensembles")
print("still take longer. The accuracy gain often justifies this cost.")
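
Memory and deployment cost can be gauged in a similar way. As a rough sketch, the snippet below compares the pickled size of one tree with that of a bagged ensemble; serialized size is only a proxy for in-memory footprint, and the dataset, settings, and names (`pickled_size_kb`, `size_*`) are assumptions made for this example.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

size_X, size_y = make_classification(n_samples=2000, n_features=20, random_state=0)

size_single = DecisionTreeClassifier(max_depth=10, random_state=0).fit(size_X, size_y)
size_bagged = BaggingClassifier(
    # Base estimator passed positionally so the call does not depend on the
    # keyword name, which differs between scikit-learn versions
    DecisionTreeClassifier(max_depth=10, random_state=0),
    n_estimators=25,
    random_state=0,
).fit(size_X, size_y)

def pickled_size_kb(model):
    """Size of the pickled model in kilobytes (a proxy for storage cost)."""
    return len(pickle.dumps(model)) / 1024

print(f"Single tree:   {pickled_size_kb(size_single):.1f} KB")
print(f"Bagged (25):   {pickled_size_kb(size_bagged):.1f} KB")
```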

## Exercises

Test your understanding of ensemble learning concepts.

### Exercise 1: Ensemble Size vs Performance

Investigate how ensemble size affects performance. Create bagging ensembles with different numbers of models (1, 5, 10, 20, 50, 100) and plot how accuracy changes.

**Questions to answer**:
- At what point do diminishing returns set in?
- Is there a "sweet spot" for ensemble size?
- How does variance in accuracy change with ensemble size?

In [None]:
# Your code here
# Hint: Use BaggingClassifier with different n_estimators values
# Hint: Use cross_val_score to get robust accuracy estimates


### Exercise 2: Model Diversity Experiment

Create two ensembles:
1. **Low diversity**: 5 decision trees with similar parameters
2. **High diversity**: 5 different model types (tree, linear, SVM, etc.)

Compare their performance and analyze prediction correlations.

**Hypothesis to test**: Does higher diversity lead to better ensemble performance?

In [None]:
# Your code here
# Create classification dataset
# Build two ensembles with different diversity levels
# Compare performance and correlations


### Exercise 3: Bias-Variance Analysis

Create a dataset with a known true function (e.g., polynomial). Train:
1. A single high-bias model (shallow tree, max_depth=2)
2. A single high-variance model (deep tree, max_depth=20)
3. A bagging ensemble of high-variance models
4. A boosting ensemble of high-bias models

Visualize predictions and calculate errors to see how ensembles address bias and variance.

In [None]:
# Your code here
# Create synthetic regression data with known function
# Train different model types
# Visualize and compare predictions


### Exercise 4: Custom Weighted Ensemble

Extend the `SimpleEnsembleRegressor` class to support weighted averaging. Instead of equal weights, allow users to specify custom weights for each model.

**Bonus**: Implement a method that automatically learns optimal weights based on validation performance.

In [None]:
# Your code here
# Modify SimpleEnsembleRegressor to accept weights parameter
# Implement weighted averaging in predict() method
# Test with different weight configurations


## Summary

### Key Concepts

1. **Ensemble Learning**: Combining multiple models often improves predictions because partly independent errors average out

2. **Bias-Variance Tradeoff**:
   - Bagging reduces variance (parallel ensembles)
   - Boosting reduces bias (sequential ensembles)
   - Both can outperform single models

3. **Ensemble Strategies**:
   - **Bagging**: Independent models, average predictions
   - **Boosting**: Sequential models, focus on errors
   - **Stacking**: Different model types, meta-learning

4. **Success Requirements**:
   - Model diversity (different errors)
   - Reasonable individual accuracy
   - Weakly correlated errors

5. **Trade-offs**:
   - Higher accuracy vs longer training
   - Better predictions vs interpretability
   - Robustness vs complexity

### What's Next?

In the following notebooks, we'll dive deep into specific ensemble methods:

- **Module 01**: Bagging and Bootstrap Aggregation
- **Module 02**: Random Forests (the most popular bagging ensemble)
- **Module 03**: AdaBoost (the first successful boosting algorithm)
- **Module 04-07**: Modern gradient boosting (XGBoost, LightGBM, CatBoost)
- **Module 08-10**: Advanced ensembles (stacking, voting, comparison)
- **Module 11**: Kaggle-style competition project

### Additional Resources

- **Paper**: "Ensemble Methods in Machine Learning" by Dietterich (2000)
- **Book**: "Ensemble Machine Learning" edited by Zhang & Ma
- **Documentation**: Scikit-learn ensemble methods guide
- **Practice**: Kaggle competitions for real-world ensemble applications