# Module 03: Boosting Fundamentals and AdaBoost

**Difficulty**: ⭐⭐⭐ Advanced
**Estimated Time**: 80 minutes
**Prerequisites**: 
- Module 00: Introduction to Ensemble Learning
- Module 01: Bagging and Bootstrap
- Module 02: Random Forests
- Understanding of bias-variance tradeoff

## Learning Objectives

By the end of this notebook, you will be able to:
1. Explain how boosting differs from bagging (sequential vs parallel)
2. Understand the AdaBoost algorithm and adaptive weighting mechanism
3. Implement AdaBoost from scratch to understand the mechanics
4. Compare bias-variance properties of boosting vs bagging
5. Analyze the impact of learning rate on boosting performance
6. Identify when boosting is sensitive to outliers and noisy data

## Setup and Configuration

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

# Machine learning
from sklearn.datasets import make_classification, make_moons, load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score, learning_curve
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier, BaggingClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Configuration
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

print("Setup complete! All libraries imported successfully.")

## 1. Sequential Learning: The Boosting Paradigm

### Bagging vs Boosting: A Fundamental Difference

#### Bagging (Parallel Ensemble)
```
Tree 1 ←┐
Tree 2 ←├─ Bootstrap samples (independent)
Tree 3 ←┘
     ↓
  Average
```
- All trees trained independently in parallel
- Each sees different random sample
- **Goal**: Reduce variance

#### Boosting (Sequential Ensemble)
```
Tree 1 → identify errors → Tree 2 → identify errors → Tree 3
  ↓           ↓                ↓          ↓             ↓
pred 1   +  pred 2       +   pred 3   = Final prediction
```
- Trees trained sequentially, each learning from previous mistakes
- Later trees focus on examples that earlier trees got wrong
- **Goal**: Reduce bias

### Why Sequential Learning Works

**Key insight**: Complex patterns can be learned by combining many simple patterns

- Start with weak learner (slightly better than random)
- Each new learner focuses on "hard" examples
- Gradually build up complex decision boundary
- Final model: weighted combination of all learners
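
To make the key insight concrete, here is a minimal sketch in which three decision stumps on a single feature combine into an interval classifier that no single stump can represent. The thresholds and equal weights are hand-picked for illustration rather than learned, and the helper `stump` is hypothetical, not a library function.

```python
import numpy as np

def stump(x, threshold, polarity=1):
    """Hypothetical 1-D stump: predicts `polarity` where x > threshold, else -polarity."""
    return polarity * np.where(x > threshold, 1, -1)

x = np.array([-3, -2, -1, 0, 1, 2, 3])

# Three weak rules, each a single threshold (or a constant)
h1 = stump(x, -1)               # +1 when x > -1
h2 = stump(x, 1, polarity=-1)   # +1 when x <= 1
h3 = -np.ones_like(x)           # constant -1, acts as a bias term

# Equal-weight vote: positive only where h1 and h2 both vote +1
combined = np.sign(h1 + h2 + h3)
print(combined)
```

The combined vote is +1 only on the interval (-1, 1], a region bounded on both sides, which a single threshold cannot express.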

**Analogy**: Like building expertise
- First, learn basics (easy examples)
- Then, focus on edge cases (hard examples)
- Eventually master the entire domain

### Bias-Variance Perspective

**Bagging**:
- Uses strong base learners (deep trees)
- High variance → Averaging reduces variance
- Bias stays roughly the same

**Boosting**:
- Uses weak base learners (shallow trees, stumps)
- High bias → Sequential learning reduces bias
- Can increase variance if overtrained

In [None]:
# Visual demonstration: Bagging vs Boosting on same dataset
# Create non-linear decision boundary
X, y = make_moons(n_samples=300, noise=0.3, random_state=RANDOM_STATE)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=RANDOM_STATE
)

# Train both ensembles with weak base learners (decision stumps)

bagging_model = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=1, random_state=RANDOM_STATE),
    n_estimators=50,
    random_state=RANDOM_STATE
)

boosting_model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1, random_state=RANDOM_STATE),
    n_estimators=50,
    random_state=RANDOM_STATE,
    algorithm='SAMME'
)

bagging_model.fit(X_train, y_train)
boosting_model.fit(X_train, y_train)

# Create mesh for decision boundary visualization
h = 0.02  # Step size
x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Plot decision boundaries
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Single weak learner
single_tree = DecisionTreeClassifier(max_depth=1, random_state=RANDOM_STATE)
single_tree.fit(X_train, y_train)
Z_single = single_tree.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
axes[0].contourf(xx, yy, Z_single, alpha=0.4, cmap='RdYlBu')
axes[0].scatter(X_train[:, 0], X_train[:, 1], c=y_train, 
                cmap='RdYlBu', edgecolor='black', alpha=0.7)
axes[0].set_title(f'Single Weak Learner (Stump)\nAccuracy: {single_tree.score(X_test, y_test):.3f}',
                  fontsize=12, fontweight='bold')
axes[0].set_xlabel('Feature 1')
axes[0].set_ylabel('Feature 2')

# Bagging
Z_bagging = bagging_model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
axes[1].contourf(xx, yy, Z_bagging, alpha=0.4, cmap='RdYlBu')
axes[1].scatter(X_train[:, 0], X_train[:, 1], c=y_train, 
                cmap='RdYlBu', edgecolor='black', alpha=0.7)
axes[1].set_title(f'Bagging (50 stumps)\nAccuracy: {bagging_model.score(X_test, y_test):.3f}',
                  fontsize=12, fontweight='bold')
axes[1].set_xlabel('Feature 1')

# Boosting
Z_boosting = boosting_model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
axes[2].contourf(xx, yy, Z_boosting, alpha=0.4, cmap='RdYlBu')
axes[2].scatter(X_train[:, 0], X_train[:, 1], c=y_train, 
                cmap='RdYlBu', edgecolor='black', alpha=0.7)
axes[2].set_title(f'Boosting (50 stumps)\nAccuracy: {boosting_model.score(X_test, y_test):.3f}',
                  fontsize=12, fontweight='bold')
axes[2].set_xlabel('Feature 1')

plt.tight_layout()
plt.show()

print("\nKey Observation:")
print("Boosting creates a much more complex decision boundary from weak learners!")
print("Sequential learning allows combining simple patterns into complex ones.")

## 2. AdaBoost Algorithm

### Adaptive Boosting (AdaBoost)

Developed by Freund and Schapire (1996), AdaBoost was the first practical boosting algorithm.

### The Algorithm

**Initialize**: All samples have equal weight $w_i = 1/N$

**For each iteration t = 1 to T**:

1. **Train weak learner** on weighted samples
   - Samples with higher weights are more important
   
2. **Calculate weighted error**:
   $$\epsilon_t = \sum_{i: h_t(x_i) \neq y_i} w_i$$
   - Sum of weights for misclassified samples
   
3. **Calculate learner weight** (importance of this learner):
   $$\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$$
   - Better learners get higher weight
   - If $\epsilon_t = 0.5$ (random), then $\alpha_t = 0$ (ignored)
   - If $\epsilon_t < 0.5$ (better than random), then $\alpha_t > 0$
   
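4. **Update sample weights** (with labels and predictions encoded as $\pm 1$):
   $$w_i \leftarrow w_i \times e^{-\alpha_t y_i h_t(x_i)}$$
   - Misclassified samples ($y_i h_t(x_i) = -1$) are multiplied by $e^{\alpha_t}$, so their weight increases
   - Correctly classified samples ($y_i h_t(x_i) = +1$) are multiplied by $e^{-\alpha_t}$, so their weight decreases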
   
5. **Normalize weights**: $w_i \leftarrow w_i / \sum_j w_j$

**Final prediction**:
$$H(x) = \text{sign}\left(\sum_{t=1}^T \alpha_t h_t(x)\right)$$
- Weighted vote of all learners
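
As a worked illustration of a single round, the sketch below applies the steps to a hypothetical five-sample toy problem. It assumes labels and predictions encoded as $\pm 1$ and the standard multiplicative update $w_i \leftarrow w_i e^{-\alpha_t y_i h_t(x_i)}$; the data and the weak learner's predictions are made up for the example.

```python
import numpy as np

# Hypothetical toy data: 5 samples, labels in {-1, +1}
y = np.array([1, 1, -1, -1, 1])
h = np.array([1, -1, -1, -1, 1])    # weak learner's predictions; the second sample is wrong

w = np.ones(len(y)) / len(y)        # initialize: uniform weights, w_i = 1/N

eps = w[h != y].sum()                   # weighted error
alpha = 0.5 * np.log((1 - eps) / eps)   # learner weight

w_new = w * np.exp(-alpha * y * h)      # up-weight mistakes, down-weight correct samples
w_new /= w_new.sum()                    # normalize so the weights sum to 1

print(f"weighted error: {eps:.3f}")
print(f"alpha:          {alpha:.4f}")
print(f"new weights:    {w_new}")
```

With one mistake out of five equally weighted samples, the error is 0.2 and $\alpha = \tfrac{1}{2}\ln 4 = \ln 2$. After normalization the misclassified sample carries weight 0.5 and each correct sample 0.125, so the next learner sees the previous learner's mistakes as exactly half of the total weight.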

### Key Insights

1. **Adaptive weights**: Algorithm focuses on hard examples
2. **Exponential loss**: Misclassified samples get exponentially higher weight
3. **Weak learner assumption**: Only requires $\epsilon_t < 0.5$ (better than random)
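4. **Theoretical guarantee**: Training error is at most $\prod_{t=1}^T 2\sqrt{\epsilon_t(1-\epsilon_t)}$, so it decreases exponentially fast as long as every weak learner beats random guessing by a fixed margin ($\epsilon_t \leq 1/2 - \gamma$)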
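
The training-error guarantee can be checked numerically. The sketch below is a compact AdaBoost loop (assuming $\pm 1$ labels, decision stumps, and the update $w_i \leftarrow w_i e^{-\alpha_t y_i h_t(x_i)}$) that tracks the product of the per-round normalizers $Z_t$, which upper-bounds the training error of the weighted vote. The dataset and the number of rounds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=300, n_features=10, random_state=0)
y = 2 * y01 - 1                        # encode labels as -1 / +1

w = np.full(len(y), 1 / len(y))        # uniform initial weights
F = np.zeros(len(y))                   # running weighted vote on the training set
bound = 1.0
train_errors, bounds = [], []

for t in range(20):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=w)
    h = stump.predict(X)

    eps = np.clip(w[h != y].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - eps) / eps)

    w = w * np.exp(-alpha * y * h)
    Z = w.sum()                        # normalizer for this round
    w /= Z

    bound *= Z                         # product of normalizers bounds the training error
    F += alpha * h
    train_errors.append(np.mean(np.sign(F) != y))
    bounds.append(bound)

print(f"after 20 rounds: training error {train_errors[-1]:.4f}, bound {bounds[-1]:.4f}")
```

When $\alpha_t$ is chosen as above and $\epsilon_t$ is not clipped, $Z_t = 2\sqrt{\epsilon_t(1-\epsilon_t)}$, which is strictly below 1 whenever $\epsilon_t \neq 0.5$. That is why the bound shrinks geometrically.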

In [None]:
# Inspect AdaBoost's per-estimator weighted errors and weights (alphas)
# Create simple 1D dataset for visualization
np.random.seed(RANDOM_STATE)
X_simple = np.random.randn(100, 1)
y_simple = (X_simple[:, 0] > 0).astype(int)

# Add some noise to make it interesting
noise_idx = np.random.choice(100, 10, replace=False)
y_simple[noise_idx] = 1 - y_simple[noise_idx]

# Train AdaBoost with just 5 estimators to track weight evolution
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=5,
    random_state=RANDOM_STATE,
    algorithm='SAMME'
)

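# Record the initial (uniform) sample weights, w_i = 1/N.
# scikit-learn does not expose per-iteration sample weights, so the
# per-estimator errors and alphas printed below stand in for the weight evolution.
weights_history = []
n_samples = len(X_simple)
sample_weights = np.ones(n_samples) / n_samples
weights_history.append(sample_weights.copy())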

# Train and track
ada.fit(X_simple, y_simple)

# Get estimator weights (alpha values)
estimator_weights = ada.estimator_weights_
estimator_errors = ada.estimator_errors_

# Display information
print("AdaBoost Training Progress:")
print("=" * 70)
for i, (weight, error) in enumerate(zip(estimator_weights, estimator_errors)):
    print(f"Estimator {i+1}:")
    print(f"  Weighted Error: {error:.4f}")
    print(f"  Alpha (weight): {weight:.4f}")
    if error < 0.5:
        print(f"  → Better than random, positive contribution")
    elif error > 0.5:
        print(f"  → Worse than random, negative contribution")
    else:
        print(f"  → Random guessing, zero contribution")
    print()

# Visualize how alpha relates to error
errors = np.linspace(0.01, 0.99, 100)
alphas = 0.5 * np.log((1 - errors) / errors)

plt.figure(figsize=(10, 6))
plt.plot(errors, alphas, linewidth=2, color='steelblue')
plt.axhline(0, color='red', linestyle='--', linewidth=1, label='Zero weight')
plt.axvline(0.5, color='green', linestyle='--', linewidth=1, label='Random guessing')
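# For two classes, scikit-learn's SAMME weight is ln((1 - err) / err), i.e. twice the
# textbook alpha, so halve it to place the points on the curve above
plt.scatter(estimator_errors, 0.5 * estimator_weights, color='orange', s=100, 
            zorder=5, edgecolor='black', label='Our estimators (sklearn weight / 2)')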
plt.xlabel('Weighted Error Rate', fontsize=12)
plt.ylabel('Estimator Weight (Alpha)', fontsize=12)
plt.title('AdaBoost: How Error Affects Estimator Weight', fontsize=14, fontweight='bold')
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nKey Insight:")
print("- Error < 0.5 → Positive alpha (learner helps)")
print("- Error = 0.5 → Zero alpha (learner ignored)")
print("- Error > 0.5 → Negative alpha (learner hurts, flip predictions)")

## 3. Implementing AdaBoost from Scratch

Let's build a simple AdaBoost classifier to understand the mechanics.

In [None]:
class SimpleAdaBoost:
    """
    Simple AdaBoost implementation for binary classification.
    
    This implementation shows the core AdaBoost algorithm:
    - Adaptive sample weighting
    - Estimator weight calculation
    - Weighted voting
    """
    
    def __init__(self, n_estimators=50, random_state=None):
        self.n_estimators = n_estimators
        self.random_state = random_state
        self.estimators = []
        self.estimator_weights = []
        self.estimator_errors = []
        
    def fit(self, X, y):
        """
        Train AdaBoost ensemble.
        
        Parameters
        ----------
        X : array-like, shape (n_samples, n_features)
        y : array-like, shape (n_samples,)
            Binary labels in {0, 1}; converted internally to {-1, +1}
        """
        n_samples = X.shape[0]
        
        # Initialize sample weights uniformly
        sample_weights = np.ones(n_samples) / n_samples
        
        # Convert labels to -1, +1 if needed
        y_encoded = np.where(y == 0, -1, y)
        
        for i in range(self.n_estimators):
            # Train weak learner with weighted samples
            estimator = DecisionTreeClassifier(
                max_depth=1,  # Decision stump
                # Compare against None so that random_state=0 is still honored
                random_state=None if self.random_state is None else self.random_state + i
            )
            estimator.fit(X, y_encoded, sample_weight=sample_weights)
            
            # Make predictions
            predictions = estimator.predict(X)
            
            # Calculate weighted error
            incorrect = (predictions != y_encoded)
            weighted_error = np.sum(sample_weights[incorrect]) / np.sum(sample_weights)
            
            # Avoid division by zero and numerical issues
            weighted_error = np.clip(weighted_error, 1e-10, 1 - 1e-10)
            
            # Calculate estimator weight (alpha)
            estimator_weight = 0.5 * np.log((1 - weighted_error) / weighted_error)
            
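            # Update sample weights: w_i <- w_i * exp(-alpha * y_i * h(x_i))
            # Misclassified samples (y_i * h(x_i) = -1) are up-weighted,
            # correctly classified samples (y_i * h(x_i) = +1) are down-weighted
            sample_weights *= np.exp(-estimator_weight * y_encoded * predictions)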
            
            # Normalize weights
            sample_weights /= np.sum(sample_weights)
            
            # Store estimator and its weight
            self.estimators.append(estimator)
            self.estimator_weights.append(estimator_weight)
            self.estimator_errors.append(weighted_error)
            
        return self
    
    def predict(self, X):
        """
        Make predictions using weighted voting.
        """
        # Get predictions from all estimators
        estimator_predictions = np.array([
            estimator.predict(X) for estimator in self.estimators
        ])
        
        # Weight predictions by estimator weights
        weighted_predictions = np.dot(self.estimator_weights, estimator_predictions)
        
        # Return sign of weighted sum
        # Convert back to 0/1 labels
        return np.where(weighted_predictions >= 0, 1, 0)
    
    def score(self, X, y):
        """Calculate accuracy."""
        predictions = self.predict(X)
        return np.mean(predictions == y)

print("SimpleAdaBoost class defined successfully!")

In [None]:
# Test our implementation
X, y = make_classification(
    n_samples=500,
    n_features=10,
    n_informative=8,
    n_redundant=2,
    random_state=RANDOM_STATE
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=RANDOM_STATE
)

# Train our implementation
our_ada = SimpleAdaBoost(n_estimators=50, random_state=RANDOM_STATE)
our_ada.fit(X_train, y_train)
our_score = our_ada.score(X_test, y_test)

# Train sklearn implementation for comparison
sklearn_ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=50,
    random_state=RANDOM_STATE,
    algorithm='SAMME'
)
sklearn_ada.fit(X_train, y_train)
sklearn_score = sklearn_ada.score(X_test, y_test)

print("Implementation Comparison:")
print("=" * 50)
print(f"Our AdaBoost:     {our_score:.4f}")
print(f"Sklearn AdaBoost: {sklearn_score:.4f}")
print(f"Difference:       {abs(our_score - sklearn_score):.4f}")
print("\nThe scores should be close. Any gap comes from implementation details,")
print("such as how estimator weights are scaled and how ties between splits are broken.")

In [None]:
# Visualize error evolution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Estimator errors over iterations
axes[0].plot(range(1, len(our_ada.estimator_errors) + 1), 
             our_ada.estimator_errors, marker='o', linewidth=2)
axes[0].axhline(0.5, color='red', linestyle='--', linewidth=1, label='Random guessing')
axes[0].set_xlabel('Estimator Number', fontsize=11)
axes[0].set_ylabel('Weighted Error Rate', fontsize=11)
axes[0].set_title('Weighted Error for Each Weak Learner', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: Estimator weights
axes[1].bar(range(1, len(our_ada.estimator_weights) + 1), 
            our_ada.estimator_weights, alpha=0.7, edgecolor='black')
axes[1].set_xlabel('Estimator Number', fontsize=11)
axes[1].set_ylabel('Alpha (Estimator Weight)', fontsize=11)
axes[1].set_title('Weight (Importance) of Each Weak Learner', fontsize=12, fontweight='bold')
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\nObservation:")
print("- Later estimators often have higher error (harder examples)")
print("- But they still contribute (error < 0.5)")
print("- Estimator weights vary based on performance")

## 4. Bias-Variance Analysis: Boosting vs Bagging

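Let's compare test accuracy for boosting and bagging as the ensemble grows. This is an indirect view of bias and variance rather than a formal decomposition: with high-bias weak learners, a method that reduces bias should improve accuracy more than one that only reduces variance.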

In [None]:
# Generate dataset with known complexity
X, y = make_classification(
    n_samples=1000,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    random_state=RANDOM_STATE
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=RANDOM_STATE
)

# Compare weak learner, bagging, and boosting
n_estimators_range = [1, 5, 10, 25, 50, 100, 200]

# Single weak learner baseline
weak_learner = DecisionTreeClassifier(max_depth=2, random_state=RANDOM_STATE)
weak_learner.fit(X_train, y_train)
weak_score = weak_learner.score(X_test, y_test)

# Test bagging and boosting with increasing ensemble size
bagging_scores = []
boosting_scores = []

for n_est in n_estimators_range:
    # Bagging
    bagging = BaggingClassifier(
        estimator=DecisionTreeClassifier(max_depth=2, random_state=RANDOM_STATE),
        n_estimators=n_est,
        random_state=RANDOM_STATE
    )
    bagging.fit(X_train, y_train)
    bagging_scores.append(bagging.score(X_test, y_test))
    
    # Boosting
    boosting = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=2, random_state=RANDOM_STATE),
        n_estimators=n_est,
        random_state=RANDOM_STATE,
        algorithm='SAMME'
    )
    boosting.fit(X_train, y_train)
    boosting_scores.append(boosting.score(X_test, y_test))

# Visualize comparison
plt.figure(figsize=(12, 7))
plt.axhline(weak_score, color='gray', linestyle='--', linewidth=2, 
            label=f'Single Weak Learner: {weak_score:.3f}')
plt.plot(n_estimators_range, bagging_scores, marker='o', linewidth=2, 
         markersize=8, label='Bagging')
plt.plot(n_estimators_range, boosting_scores, marker='s', linewidth=2, 
         markersize=8, label='Boosting (AdaBoost)')
plt.xlabel('Number of Estimators', fontsize=12)
plt.ylabel('Test Accuracy', fontsize=12)
plt.title('Bias-Variance: Bagging vs Boosting with Weak Learners', 
          fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nKey Insights:")
print("=" * 70)
print(f"Single weak learner:     {weak_score:.4f} (high-bias baseline)")
print(f"Bagging (200 learners):  {bagging_scores[-1]:.4f} ({bagging_scores[-1] - weak_score:+.4f} vs baseline)")
print(f"Boosting (200 learners): {boosting_scores[-1]:.4f} ({boosting_scores[-1] - weak_score:+.4f} vs baseline)")
print("\nWhy?")
print("- Bagging: Weak learners have high bias, averaging doesn't fix bias")
print("- Boosting: Sequentially reduces bias by focusing on errors")
print("- Boosting is MORE effective with weak learners!")

## 5. Learning Rate Impact

The learning rate (also called shrinkage) controls how much each estimator contributes to the final prediction.

**Formula**: Instead of using the full $\alpha_t$, use $\nu \times \alpha_t$, where typically $0 < \nu \leq 1$. scikit-learn accepts any positive value, so the experiment below also includes $\nu = 2$ for contrast.

**Trade-off**:
- **Higher learning rate** (close to 1): Faster learning, fewer estimators needed, higher risk of overfitting
- **Lower learning rate** (e.g., 0.1): Slower learning, more estimators needed, often better generalization

**Rule of thumb**: A lower learning rate combined with more estimators often generalizes better, at the cost of longer training. Tune the two together on validation data.
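
One way to see the interaction between learning rate and ensemble size is to track test accuracy after every boosting round. The sketch below uses `AdaBoostClassifier.staged_score` for this; the two learning rates, the 100-round budget, and the variable names are illustrative choices, not recommendations.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

n_estimators = 100
curves = {}
for lr in (0.1, 1.0):
    # The base estimator is passed positionally so the sketch does not depend
    # on the keyword name, which has changed between scikit-learn releases
    model = AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=1),
        n_estimators=n_estimators,
        learning_rate=lr,
        random_state=42,
    )
    model.fit(X_tr, y_tr)
    # staged_score yields the test accuracy after 1, 2, ..., k boosting rounds
    curves[lr] = list(model.staged_score(X_te, y_te))

for lr, curve in curves.items():
    best_round = max(range(len(curve)), key=curve.__getitem__) + 1
    print(f"lr={lr}: best test accuracy {max(curve):.4f} at round {best_round}")
```

Plotting `curves[lr]` against the round number shows how quickly each setting reaches its plateau. A smaller learning rate usually needs more rounds to get there.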

In [None]:
# Compare different learning rates
learning_rates = [0.01, 0.1, 0.5, 1.0, 2.0]
n_estimators = 200

# Use breast cancer dataset
cancer_data = load_breast_cancer()
X, y = cancer_data.data, cancer_data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=RANDOM_STATE
)

# Store results
results = []

for lr in learning_rates:
    ada = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=1),
        n_estimators=n_estimators,
        learning_rate=lr,
        random_state=RANDOM_STATE,
        algorithm='SAMME'
    )
    ada.fit(X_train, y_train)
    
    train_score = ada.score(X_train, y_train)
    test_score = ada.score(X_test, y_test)
    
    results.append({
        'Learning Rate': lr,
        'Train Accuracy': train_score,
        'Test Accuracy': test_score,
        'Overfit Gap': train_score - test_score
    })

results_df = pd.DataFrame(results)
print("Learning Rate Impact:")
print("=" * 70)
print(results_df.to_string(index=False))

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Train vs Test accuracy
axes[0].plot(learning_rates, results_df['Train Accuracy'], 
             marker='o', linewidth=2, label='Train Accuracy')
axes[0].plot(learning_rates, results_df['Test Accuracy'], 
             marker='s', linewidth=2, label='Test Accuracy')
axes[0].set_xlabel('Learning Rate', fontsize=11)
axes[0].set_ylabel('Accuracy', fontsize=11)
axes[0].set_title('Learning Rate vs Accuracy', fontsize=12, fontweight='bold')
axes[0].set_xscale('log')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Plot 2: Overfitting gap
axes[1].bar(range(len(learning_rates)), results_df['Overfit Gap'], 
            alpha=0.7, edgecolor='black')
axes[1].set_xticks(range(len(learning_rates)))
axes[1].set_xticklabels(learning_rates)
axes[1].set_xlabel('Learning Rate', fontsize=11)
axes[1].set_ylabel('Train - Test Gap', fontsize=11)
axes[1].set_title('Overfitting vs Learning Rate', fontsize=12, fontweight='bold')
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\nRecommendation:")
best_idx = results_df['Test Accuracy'].idxmax()
best_lr = results_df.loc[best_idx, 'Learning Rate']
print(f"Best learning rate for this dataset: {best_lr}")
print("General advice: Start with 0.1 or 1.0, tune based on validation performance")

## 6. Sensitivity to Outliers and Noise

### A Critical Weakness of Boosting

**Problem**: Boosting increases weights for misclassified samples
- Outliers and mislabeled data get very high weights
- Later estimators focus heavily on these noisy samples
- Can lead to overfitting on noise

**Comparison with Bagging**:
- Bagging treats all samples equally (with bootstrap sampling)
- Outliers don't get special attention
- More robust to noisy data
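
The mechanism can be observed directly by tracking how much of the total sample weight ends up on mislabeled points. The sketch below is a compact AdaBoost loop assuming $\pm 1$ labels and the update $w_i \leftarrow w_i e^{-\alpha_t y_i h_t(x_i)}$. It flips 10% of the labels in a synthetic dataset and records the weight mass on the flipped samples after each round. The dataset, noise rate, and number of rounds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y01 = make_classification(n_samples=400, n_features=10, flip_y=0, random_state=0)
y = 2 * y01 - 1                                   # encode labels as -1 / +1

# Flip 10% of the labels to simulate label noise
noisy_idx = rng.choice(len(y), size=len(y) // 10, replace=False)
y[noisy_idx] *= -1
is_noisy = np.zeros(len(y), dtype=bool)
is_noisy[noisy_idx] = True

w = np.full(len(y), 1 / len(y))                   # uniform initial weights
noisy_mass = [w[is_noisy].sum()]                  # starts at the noise rate (0.1)

for t in range(30):
    stump = DecisionTreeClassifier(max_depth=1, random_state=0)
    stump.fit(X, y, sample_weight=w)
    h = stump.predict(X)
    eps = np.clip(w[h != y].sum(), 1e-10, 1 - 1e-10)
    alpha = 0.5 * np.log((1 - eps) / eps)
    w = w * np.exp(-alpha * y * h)
    w /= w.sum()
    noisy_mass.append(w[is_noisy].sum())

print(f"weight on flipped samples: start {noisy_mass[0]:.3f}, "
      f"after 30 rounds {noisy_mass[-1]:.3f}")
```

If the final mass is well above the 10% share the flipped samples started with, later stumps are spending a disproportionate part of their capacity on labels that are simply wrong.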

In [None]:
# Demonstrate sensitivity to label noise
# Create clean dataset
X_clean, y_clean = make_classification(
    n_samples=500,
    n_features=20,
    n_informative=15,
    n_redundant=5,
    flip_y=0,  # No label noise
    random_state=RANDOM_STATE
)

# Add increasing amounts of label noise
noise_levels = [0.0, 0.05, 0.10, 0.15, 0.20, 0.30]
bagging_performance = []
boosting_performance = []
rf_performance = []

# Split once on the clean labels so the test set stays noise-free
X_train, X_test, y_train_clean, y_test = train_test_split(
    X_clean, y_clean, test_size=0.3, random_state=RANDOM_STATE
)
rng = np.random.default_rng(RANDOM_STATE)

for noise in noise_levels:
    # Flip a fraction of the TRAINING labels only
    y_train = y_train_clean.copy()
    n_flip = int(len(y_train) * noise)
    flip_idx = rng.choice(len(y_train), n_flip, replace=False)
    y_train[flip_idx] = 1 - y_train[flip_idx]
    
    # Train models
    bagging = BaggingClassifier(
        estimator=DecisionTreeClassifier(max_depth=5, random_state=RANDOM_STATE),
        n_estimators=50,
        random_state=RANDOM_STATE
    )
    bagging.fit(X_train, y_train)
    bagging_performance.append(bagging.score(X_test, y_test))
    
    boosting = AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=5, random_state=RANDOM_STATE),
        n_estimators=50,
        random_state=RANDOM_STATE,
        algorithm='SAMME'
    )
    boosting.fit(X_train, y_train)
    boosting_performance.append(boosting.score(X_test, y_test))
    
    rf = RandomForestClassifier(
        n_estimators=50,
        max_depth=5,
        random_state=RANDOM_STATE
    )
    rf.fit(X_train, y_train)
    rf_performance.append(rf.score(X_test, y_test))

# Visualize robustness to noise
plt.figure(figsize=(12, 7))
plt.plot(noise_levels, bagging_performance, marker='o', linewidth=2, 
         markersize=8, label='Bagging')
plt.plot(noise_levels, boosting_performance, marker='s', linewidth=2, 
         markersize=8, label='AdaBoost')
plt.plot(noise_levels, rf_performance, marker='^', linewidth=2, 
         markersize=8, label='Random Forest')
plt.xlabel('Label Noise Rate', fontsize=12)
plt.ylabel('Test Accuracy', fontsize=12)
plt.title('Robustness to Label Noise', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Calculate degradation
bagging_deg = bagging_performance[0] - bagging_performance[-1]
boosting_deg = boosting_performance[0] - boosting_performance[-1]
rf_deg = rf_performance[0] - rf_performance[-1]

print("\nPerformance Degradation with 30% Label Noise:")
print("=" * 50)
print(f"Bagging:       {bagging_deg:.4f} drop")
print(f"AdaBoost:      {boosting_deg:.4f} drop")
print(f"Random Forest: {rf_deg:.4f} drop")
print("\nConclusion:")
print("AdaBoost is typically the most sensitive to label noise; compare the drops above.")
print("Bagging methods (including Random Forest) tend to be more robust.")

## Exercises

### Exercise 1: Weak Learner Strength Analysis

Compare AdaBoost performance with different weak learner complexities:
- Decision stumps (max_depth=1)
- Shallow trees (max_depth=2, 3, 5)
- Deeper trees (max_depth=10)

For each:
1. Train AdaBoost with 100 estimators
2. Compare training and test accuracy
3. Measure training time
4. Analyze the trade-off between weak learner complexity and ensemble size

**Question**: What's the optimal weak learner complexity for AdaBoost?

In [None]:
# Your code here


### Exercise 2: Early Stopping for AdaBoost

Implement early stopping for AdaBoost:
1. Split training data into train/validation
2. Train AdaBoost with many estimators (e.g., 500)
3. Track validation accuracy after each estimator is added
4. Plot learning curves (train vs validation)
5. Identify the optimal stopping point

**Bonus**: Implement patience-based early stopping (stop if no improvement for N iterations)

In [None]:
# Your code here


### Exercise 3: Outlier Removal Strategy

Since AdaBoost is sensitive to outliers, implement and test an outlier removal strategy:

1. Create a dataset with synthetic outliers
2. Train AdaBoost on full data (with outliers)
3. Identify potential outliers using:
   - Isolation Forest
   - Or samples with highest final weights in AdaBoost
4. Remove suspected outliers and retrain
5. Compare performance

**Does outlier removal improve AdaBoost performance?**

In [None]:
# Your code here


### Exercise 4: Learning Rate vs Number of Estimators

Explore the relationship between learning rate and ensemble size:

1. Create a grid of (learning_rate, n_estimators) combinations
2. For each combination, train AdaBoost and record CV accuracy
3. Create a heatmap showing performance across the grid
4. Find Pareto-optimal combinations (good accuracy with fewer estimators)

**Goal**: Understand the trade-off between learning rate and ensemble size for deployment.

In [None]:
# Your code here


## Summary

### Key Concepts

1. **Boosting Philosophy**:
   - Sequential learning: each model corrects previous errors
   - Weak learners combine to form strong learner
   - Focus on hard examples through adaptive weighting
   - Reduces bias, can increase variance if overtrained

2. **AdaBoost Algorithm**:
   - Initialize equal sample weights
   - For each iteration:
     - Train weak learner on weighted data
     - Calculate weighted error
     - Compute learner weight (alpha)
     - Increase weights for misclassified samples
   - Final prediction: weighted vote

3. **Key Parameters**:
   - `n_estimators`: More estimators → better performance (until overfitting)
   - `learning_rate`: Lower rates need more estimators but generalize better
   - `estimator` (called `base_estimator` before scikit-learn 1.2): Weak learners (stumps or shallow trees) work best

4. **Strengths**:
   - Excellent performance with weak learners
   - Theoretically grounded (provable exponential bound on training error under the weak-learner assumption)
   - Often achieves better accuracy than bagging
   - Automatic feature selection (focuses on informative features)

5. **Weaknesses**:
   - **Sensitive to outliers and noise** (critical limitation)
   - Sequential training (can't parallelize like bagging)
   - Risk of overfitting with too many estimators
   - Less stable than bagging methods

### Bagging vs Boosting Comparison

| Aspect | Bagging | Boosting |
|--------|---------|----------|
| Training | Parallel | Sequential |
| Reduces | Variance | Bias |
| Base learner | Strong (deep trees) | Weak (stumps) |
| Sample weighting | Equal | Adaptive |
| Noise sensitivity | Robust | Sensitive |
| Overfitting risk | Low | Moderate |
| Parallelization | Easy | Hard |

### When to Use AdaBoost

**Good for**:
- Clean datasets with low noise
- Binary classification problems
- When you need high accuracy and can afford sequential training
- Baseline boosting algorithm before trying modern variants

**Avoid when**:
- Data has many outliers or label noise
- Need fast training (use Random Forest)
- Need robustness over peak accuracy
- Working with very large datasets (use LightGBM/XGBoost instead)

### Historical Impact

AdaBoost (1996) was groundbreaking:
- First practical boosting algorithm
- Inspired modern gradient boosting methods
- Freund & Schapire won Gödel Prize (2003)
- Foundation for XGBoost, LightGBM, CatBoost

### What's Next?

In **Module 04: Gradient Boosting Machines**, we'll explore:
- Gradient boosting framework (more general than AdaBoost)
- How boosting fits in function space
- Different loss functions
- Regularization techniques
- Connection to gradient descent

### Additional Resources

- **Original Paper**: "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting" (Freund & Schapire, 1997)
- **Tutorial**: "Boosting Algorithms as Gradient Descent" (Mason et al., 1999)
- **Book**: "Boosting: Foundations and Algorithms" by Schapire & Freund
- **Sklearn Docs**: [AdaBoost Classifier](https://scikit-learn.org/stable/modules/ensemble.html#adaboost)