# Module 04: Gradient Boosting Machines

**Difficulty**: ⭐⭐⭐ Advanced
**Estimated Time**: 85 minutes
**Prerequisites**: 
- Module 00: Introduction to Ensemble Learning
- Module 03: Boosting Fundamentals and AdaBoost
- Understanding of gradient descent
- Basic calculus (derivatives)

## Learning Objectives

By the end of this notebook, you will be able to:
1. Understand gradient boosting as optimization in function space
2. Explain how gradient boosting generalizes AdaBoost to arbitrary loss functions
3. Implement residual fitting for regression problems
4. Configure learning rate and tree depth for optimal performance
5. Apply regularization techniques to prevent overfitting
6. Use early stopping to find optimal number of estimators

## Setup and Configuration

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import time

# Machine learning
from sklearn.datasets import (
    make_regression, make_classification, 
    load_diabetes, load_breast_cancer, fetch_california_housing
)
from sklearn.model_selection import (
    train_test_split, cross_val_score, validation_curve,
    learning_curve
)
from sklearn.ensemble import (
    GradientBoostingClassifier, GradientBoostingRegressor,
    RandomForestRegressor, AdaBoostRegressor
)
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import (
    mean_squared_error, mean_absolute_error, r2_score,
    accuracy_score, log_loss, roc_auc_score
)

# Configuration
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

print("Setup complete! All libraries imported successfully.")

## 1. Gradient Boosting: The Big Picture

### From AdaBoost to Gradient Boosting

**AdaBoost limitation**: Derived for classification with exponential loss
- Hard to extend to other loss functions
- Regression variants (e.g., AdaBoost.R2) exist but need their own ad hoc derivations

**Gradient Boosting breakthrough**: Unified framework for any differentiable loss function
- Works for classification, regression, ranking, etc.
- More flexible and powerful

### Key Insight: Optimization in Function Space

**Traditional gradient descent**: Optimize parameters
```
θ ← θ - learning_rate × ∇L(θ)
```

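**Gradient boosting**: Optimize the function itself; each new tree approximates the negative gradient of the loss with respect to the current predictions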
```
F(x) ← F(x) + learning_rate × new_tree(x)
```

### The Algorithm (Simplified)

1. **Initialize** with constant prediction (e.g., mean for regression)
   $$F_0(x) = \text{argmin}_c \sum_{i=1}^n L(y_i, c)$$

2. **For each iteration** m = 1 to M:
   
   a. **Compute pseudo-residuals** (negative gradient of loss):
   $$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F=F_{m-1}}$$
   
   b. **Fit tree** to pseudo-residuals:
   $$h_m(x) = \text{DecisionTree}(X, r_m)$$
   
   c. **Update model**:
   $$F_m(x) = F_{m-1}(x) + \nu \cdot h_m(x)$$
   
   where $\nu$ is the learning rate

3. **Final model**: $F_M(x) = F_0(x) + \nu \sum_{m=1}^M h_m(x)$

### Why "Gradient"?

- Pseudo-residuals are the negative gradient of the loss function
- Each tree fits the negative gradient direction (the direction of steepest descent)
- We're doing gradient descent, but adding functions instead of adjusting parameters!
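
To make the function-space view concrete, here is a minimal sketch that runs gradient descent directly on the vector of predictions $F(x_i)$, with no trees involved. It assumes squared loss $L = \frac{1}{2}(y - F)^2$, whose negative gradient is simply $y - F$; the toy targets and the step size `nu` are illustrative values, not taken from the notebook's datasets.

```python
import numpy as np

y = np.array([3.0, -1.0, 2.5, 0.0])   # toy targets (illustrative)
F = np.full_like(y, y.mean())          # F_0: best constant under squared loss
nu = 0.5                               # learning rate (illustrative)

def total_loss(y, F):
    """Squared loss summed over all samples."""
    return 0.5 * np.sum((y - F) ** 2)

history = [total_loss(y, F)]
for m in range(5):
    negative_gradient = y - F          # -dL/dF_i for squared loss
    F = F + nu * negative_gradient     # gradient step on the predictions themselves
    history.append(total_loss(y, F))

print(np.round(history, 4))
```

Each step shrinks every residual by a factor of $(1 - \nu)$, so the loss falls by $(1 - \nu)^2$ per iteration. A real booster cannot move each $F(x_i)$ freely like this; it fits a tree to the negative gradient so that the update also applies to unseen $x$.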

### Common Loss Functions

**Regression**:
- Squared loss: $L(y, F) = \frac{1}{2}(y - F)^2$ → pseudo-residual = $y - F$
- Absolute loss: $L(y, F) = |y - F|$ → pseudo-residual = $\text{sign}(y - F)$
- Huber loss: Robust to outliers (squared for small errors, absolute for large ones)

**Classification**:
- Log loss (deviance): $L(y, F) = -[y\log(p) + (1-y)\log(1-p)]$ with $y \in \{0, 1\}$ and $p = 1/(1 + e^{-F})$ → pseudo-residual = $y - p$
- Exponential loss: $L(y, F) = e^{-yF}$ with $y \in \{-1, +1\}$ (the loss that AdaBoost minimizes)
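
For classification, the pseudo-residual of the log loss works out to $y - p$. The sketch below checks this numerically with a finite-difference gradient. It assumes labels $y \in \{0, 1\}$ and $p = \sigma(F)$, where $F$ is the raw score (log-odds); the helper names and sample values are illustrative.

```python
import numpy as np

def sigmoid(F):
    return 1.0 / (1.0 + np.exp(-F))

def log_loss_per_sample(y, F):
    """Binary log loss as a function of the raw score F (log-odds)."""
    p = sigmoid(F)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def numerical_negative_gradient(loss, y, F, eps=1e-6):
    """Central finite-difference estimate of -dL/dF for each sample."""
    return -(loss(y, F + eps) - loss(y, F - eps)) / (2 * eps)

y = np.array([1.0, 0.0, 1.0, 0.0])    # toy labels (illustrative)
F = np.array([0.3, -1.2, 2.0, 0.8])   # current raw scores (illustrative)

analytic = y - sigmoid(F)             # claimed pseudo-residual
numeric = numerical_negative_gradient(log_loss_per_sample, y, F)

print("y - p:            ", np.round(analytic, 4))
print("finite difference:", np.round(numeric, 4))
```

The two rows should agree to several decimal places. A confidently wrong prediction (large $|y - p|$) produces a large pseudo-residual, so the next tree focuses on it.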

In [None]:
# Visualize different loss functions and their gradients
# For regression: compare squared vs absolute loss

residuals = np.linspace(-3, 3, 100)

# Loss functions
squared_loss = 0.5 * residuals**2
absolute_loss = np.abs(residuals)
huber_loss = np.where(
    np.abs(residuals) <= 1,
    0.5 * residuals**2,
    np.abs(residuals) - 0.5
)

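# Gradients of each loss with respect to F, expressed via residual = y - F.
# Example: d/dF of 0.5(y-F)^2 = -(y-F) = -residual.
# The pseudo-residual that a tree fits is the NEGATIVE of these gradients.
squared_grad = -residuals
absolute_grad = -np.sign(residuals)
huber_grad = -np.where(
    np.abs(residuals) <= 1,
    residuals,
    np.sign(residuals)
)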

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss functions
axes[0].plot(residuals, squared_loss, linewidth=2, label='Squared Loss')
axes[0].plot(residuals, absolute_loss, linewidth=2, label='Absolute Loss')
axes[0].plot(residuals, huber_loss, linewidth=2, label='Huber Loss')
axes[0].set_xlabel('Residual (y - F)', fontsize=11)
axes[0].set_ylabel('Loss', fontsize=11)
axes[0].set_title('Loss Functions for Regression', fontsize=12, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Gradients
axes[1].plot(residuals, -squared_grad, linewidth=2, label='Squared (linear)')
axes[1].plot(residuals, -absolute_grad, linewidth=2, label='Absolute (constant)')
axes[1].plot(residuals, -huber_grad, linewidth=2, label='Huber (robust)')
axes[1].set_xlabel('Residual (y - F)', fontsize=11)
axes[1].set_ylabel('Pseudo-Residual (-gradient)', fontsize=11)
axes[1].set_title('What Trees Fit (Pseudo-Residuals)', fontsize=12, fontweight='bold')
axes[1].legend(fontsize=10)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("Loss Function Comparison:")
print("=" * 70)
print("Squared Loss:")
print("  - Smooth, differentiable everywhere")
print("  - Large errors get LARGE penalties (quadratic)")
print("  - Sensitive to outliers")
print("\nAbsolute Loss:")
print("  - Linear penalty for all errors")
print("  - Robust to outliers")
print("  - Gradient is constant (just sign)")
print("\nHuber Loss:")
print("  - Best of both worlds")
print("  - Squared for small errors (smooth)")
print("  - Linear for large errors (robust)")

## 2. Residual Fitting Demonstration

Let's build intuition by manually implementing gradient boosting for regression with squared loss.

In [None]:
# Create simple regression dataset
np.random.seed(RANDOM_STATE)
X_simple = np.linspace(0, 10, 100).reshape(-1, 1)
y_simple = np.sin(X_simple).ravel() + np.random.normal(0, 0.1, X_simple.shape[0])

# Manual gradient boosting with 5 trees
n_trees = 5
learning_rate = 0.3
max_depth = 3

# Initialize with mean
F = np.full(len(y_simple), y_simple.mean())
trees = []
predictions_history = [F.copy()]

print("Gradient Boosting Step-by-Step:")
print("=" * 70)
print(f"Initial prediction (mean): {F[0]:.4f}")
print(f"Initial MSE: {mean_squared_error(y_simple, F):.4f}\n")

# Fit trees sequentially
for i in range(n_trees):
    # Compute residuals (negative gradient for squared loss)
    residuals = y_simple - F
    
    # Fit tree to residuals
    tree = DecisionTreeRegressor(max_depth=max_depth, random_state=RANDOM_STATE + i)
    tree.fit(X_simple, residuals)
    
    # Update predictions
    tree_pred = tree.predict(X_simple)
    F = F + learning_rate * tree_pred
    
    trees.append(tree)
    predictions_history.append(F.copy())
    
    mse = mean_squared_error(y_simple, F)
    print(f"Tree {i+1}:")
    print(f"  Mean absolute residual: {np.mean(np.abs(residuals)):.4f}")
    print(f"  Updated MSE: {mse:.4f}")
    print()

In [None]:
# Visualize the sequential improvement
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.ravel()

for i in range(6):
    axes[i].scatter(X_simple, y_simple, alpha=0.5, s=20, color='black', label='True data')
    axes[i].plot(X_simple, predictions_history[i], color='red', linewidth=2, 
                label=f'Prediction (MSE={mean_squared_error(y_simple, predictions_history[i]):.3f})')
    
    if i == 0:
        axes[i].set_title('Initial: F₀(x) = mean', fontsize=11, fontweight='bold')
    else:
        axes[i].set_title(f'After Tree {i}', fontsize=11, fontweight='bold')
    
    axes[i].set_xlabel('X')
    axes[i].set_ylabel('y')
    axes[i].legend(fontsize=9)
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nKey Observation:")
print("Each tree corrects errors from previous predictions.")
print("The model gradually learns the complex sinusoidal pattern!")
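
The manual loop above only tracks predictions on the training points. To score new inputs, the final model $F_M(x) = F_0 + \nu \sum_m h_m(x)$ has to be evaluated tree by tree. The following self-contained sketch shows one way to do that; `fit_manual_gb` and `predict_manual_gb` are hypothetical helper names, and the data is regenerated here so the block runs on its own.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

def fit_manual_gb(X, y, n_trees=5, learning_rate=0.3, max_depth=3):
    """Squared-loss gradient boosting: returns the constant F_0 and the fitted trees."""
    f0 = y.mean()
    F = np.full(len(y), f0)
    trees = []
    for _ in range(n_trees):
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, y - F)                     # fit residuals (negative gradient)
        F = F + learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees

def predict_manual_gb(X_new, f0, trees, learning_rate=0.3):
    """Evaluate F_M(x) = F_0 + learning_rate * sum of tree predictions."""
    pred = np.full(X_new.shape[0], f0)
    for tree in trees:
        pred += learning_rate * tree.predict(X_new)
    return pred

rng = np.random.default_rng(42)
X_demo = np.linspace(0, 10, 100).reshape(-1, 1)
y_demo = np.sin(X_demo).ravel() + rng.normal(0, 0.1, 100)

f0, fitted_trees = fit_manual_gb(X_demo, y_demo)
X_new = np.array([[1.5], [4.7], [8.0]])        # unseen inputs (illustrative)
pred_new = predict_manual_gb(X_new, f0, fitted_trees)

baseline_mse = mean_squared_error(y_demo, np.full(len(y_demo), f0))
boosted_mse = mean_squared_error(y_demo, predict_manual_gb(X_demo, f0, fitted_trees))
print(f"Constant-model MSE: {baseline_mse:.4f}, boosted MSE: {boosted_mse:.4f}")
print("Predictions at new points:", np.round(pred_new, 3))
```

The same `learning_rate` must be used at fit and predict time, which is why library implementations store it on the model.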

## 3. Gradient Boosting with Scikit-learn

Now let's use sklearn's optimized implementation.

In [None]:
# Load real regression dataset
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Take a subset for faster training
X = X[:5000]
y = y[:5000]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=RANDOM_STATE
)

print(f"Dataset: {len(X_train)} training samples, {len(X_test)} test samples")
print(f"Features: {X.shape[1]}")
print(f"Target: {housing.target_names}")
print(f"Target range: [{y.min():.2f}, {y.max():.2f}]")

In [None]:
# Train Gradient Boosting Regressor
gbr = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=RANDOM_STATE,
    verbose=0
)

start_time = time.time()
gbr.fit(X_train, y_train)
train_time = time.time() - start_time

# Make predictions
y_train_pred = gbr.predict(X_train)
y_test_pred = gbr.predict(X_test)

# Evaluate
train_mse = mean_squared_error(y_train, y_train_pred)
test_mse = mean_squared_error(y_test, y_test_pred)
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)

print("\nGradient Boosting Regressor Performance:")
print("=" * 70)
print(f"Training time: {train_time:.2f} seconds")
print(f"\nTrain MSE: {train_mse:.4f}")
print(f"Test MSE:  {test_mse:.4f}")
print(f"\nTrain R²: {train_r2:.4f}")
print(f"Test R²:  {test_r2:.4f}")

# Compare with other models
rf = RandomForestRegressor(n_estimators=100, random_state=RANDOM_STATE, n_jobs=-1)
rf.fit(X_train, y_train)
rf_test_r2 = rf.score(X_test, y_test)

print(f"\nRandom Forest R² (baseline): {rf_test_r2:.4f}")
print(f"Gradient Boosting vs. Random Forest (relative R² difference): {(test_r2 - rf_test_r2) / rf_test_r2 * 100:+.1f}%")

In [None]:
# Visualize predictions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Predicted vs Actual
axes[0].scatter(y_test, y_test_pred, alpha=0.5, s=20)
axes[0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
             'r--', linewidth=2, label='Perfect prediction')
axes[0].set_xlabel('Actual Median House Value ($100k)', fontsize=11)
axes[0].set_ylabel('Predicted Median House Value ($100k)', fontsize=11)
axes[0].set_title(f'Predictions vs Actual (R²={test_r2:.3f})', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Residuals
residuals = y_test - y_test_pred
axes[1].scatter(y_test_pred, residuals, alpha=0.5, s=20)
axes[1].axhline(0, color='red', linestyle='--', linewidth=2)
axes[1].set_xlabel('Predicted Median House Value ($100k)', fontsize=11)
axes[1].set_ylabel('Residuals', fontsize=11)
axes[1].set_title('Residual Plot', fontsize=12, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
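
The regression workflow carries over to classification with `GradientBoostingClassifier`, which uses log loss by default. Here is a minimal sketch on the breast cancer dataset that ships with scikit-learn; the variable names and the stratified 70/30 split are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

X_clf, y_clf = load_breast_cancer(return_X_y=True)
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_clf, y_clf, test_size=0.3, random_state=42, stratify=y_clf
)

gbc = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)
gbc.fit(Xc_train, yc_train)

# Probability of the positive class; the trees themselves model log-odds
proba = gbc.predict_proba(Xc_test)[:, 1]
acc = accuracy_score(yc_test, gbc.predict(Xc_test))

print(f"Accuracy: {acc:.4f}")
print(f"Log loss: {log_loss(yc_test, proba):.4f}")
print(f"ROC AUC:  {roc_auc_score(yc_test, proba):.4f}")
```

Even for classification, each stage fits a regression tree: the targets are the pseudo-residuals $y - p$, not the class labels.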

## 4. Key Hyperparameters

### 4.1 Number of Estimators (n_estimators)
- More trees → better training performance
- But risk of overfitting!
- Use early stopping or validation to find optimal number

### 4.2 Learning Rate (learning_rate)
- Controls contribution of each tree
- Lower rate → needs more trees but generalizes better
- Typical values: 0.01 to 0.3
- **Trade-off**: as a rough heuristic, learning_rate × n_estimators ≈ constant for similar performance (halving the rate typically calls for about twice as many trees)

### 4.3 Tree Depth (max_depth)
- Controls complexity of each tree
- Shallow trees (3-5): less prone to overfitting, need more trees
- Deeper trees: can capture complex interactions, but risk overfitting
- **Default**: 3 (works well in practice)

### 4.4 Minimum Samples per Leaf (min_samples_leaf)
- Regularization: prevents tiny leaf nodes
- Higher values → simpler trees → less overfitting
- Typical values: 1-20

### 4.5 Subsample (subsample)
- Fraction of samples to use for each tree (stochastic gradient boosting)
- Values < 1.0 add randomness and reduce overfitting
- Typical values: 0.5 to 1.0
- **Benefit**: Faster training + better generalization

In [None]:
# Effect of learning rate
learning_rates = [0.01, 0.05, 0.1, 0.2, 0.5]
n_estimators = 200

train_scores = []
test_scores = []

for lr in learning_rates:
    gbr = GradientBoostingRegressor(
        n_estimators=n_estimators,
        learning_rate=lr,
        max_depth=3,
        random_state=RANDOM_STATE
    )
    gbr.fit(X_train, y_train)
    
    train_scores.append(gbr.score(X_train, y_train))
    test_scores.append(gbr.score(X_test, y_test))

# Plot
plt.figure(figsize=(10, 6))
plt.plot(learning_rates, train_scores, marker='o', linewidth=2, markersize=8, label='Train R²')
plt.plot(learning_rates, test_scores, marker='s', linewidth=2, markersize=8, label='Test R²')
plt.xlabel('Learning Rate', fontsize=12)
plt.ylabel('R² Score', fontsize=12)
plt.title(f'Learning Rate Impact (n_estimators={n_estimators})', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.xscale('log')
plt.tight_layout()
plt.show()

best_lr = learning_rates[np.argmax(test_scores)]
print(f"\nBest learning rate: {best_lr}")
print(f"Test R²: {max(test_scores):.4f}")
print("\nRule of thumb: Lower learning rate + more trees = better generalization")

In [None]:
# Effect of max_depth
max_depths = [1, 2, 3, 4, 5, 7, 10]

depth_train_scores = []
depth_test_scores = []

for depth in max_depths:
    gbr = GradientBoostingRegressor(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=depth,
        random_state=RANDOM_STATE
    )
    gbr.fit(X_train, y_train)
    
    depth_train_scores.append(gbr.score(X_train, y_train))
    depth_test_scores.append(gbr.score(X_test, y_test))

# Plot
plt.figure(figsize=(10, 6))
plt.plot(max_depths, depth_train_scores, marker='o', linewidth=2, markersize=8, label='Train R²')
plt.plot(max_depths, depth_test_scores, marker='s', linewidth=2, markersize=8, label='Test R²')
plt.axvline(3, color='red', linestyle='--', linewidth=1, alpha=0.7, label='Default (3)')
plt.xlabel('Maximum Tree Depth', fontsize=12)
plt.ylabel('R² Score', fontsize=12)
plt.title('Tree Depth Impact', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

best_depth = max_depths[np.argmax(depth_test_scores)]
print(f"\nBest max_depth: {best_depth}")
print(f"Test R²: {max(depth_test_scores):.4f}")
print("\nObservation: Deeper trees increase overfitting (train-test gap grows)")
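
Section 4.4 describes `min_samples_leaf` without a demonstration. The sketch below sweeps it in the same style as the cells above. It uses a synthetic dataset from `make_regression` so that it runs without downloads; the dataset settings and the candidate leaf sizes are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X_syn, y_syn = make_regression(
    n_samples=600, n_features=10, n_informative=5, noise=20.0, random_state=42
)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42
)

leaf_sizes = [1, 5, 10, 20]
leaf_results = {}
for leaf in leaf_sizes:
    model = GradientBoostingRegressor(
        n_estimators=100, learning_rate=0.1, max_depth=4,
        min_samples_leaf=leaf, random_state=42
    )
    model.fit(Xs_train, ys_train)
    leaf_results[leaf] = (
        model.score(Xs_train, ys_train),
        model.score(Xs_test, ys_test),
    )

for leaf, (train_r2, test_r2) in leaf_results.items():
    print(f"min_samples_leaf={leaf:>2}: train R²={train_r2:.4f}, "
          f"test R²={test_r2:.4f}, gap={train_r2 - test_r2:.4f}")
```

Watch the train/test gap rather than either score alone: a larger minimum leaf size usually lowers the training score, and it helps only if the test score holds up.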

## 5. Regularization Techniques

Gradient boosting can easily overfit. Multiple regularization strategies help:

### 5.1 Learning Rate (Shrinkage)
- Lower learning rate = stronger regularization
- Prevents any single tree from having too much influence

### 5.2 Tree Constraints
- `max_depth`: Limit tree complexity
- `min_samples_split`: Require minimum samples to split
- `min_samples_leaf`: Require minimum samples in leaves
- `max_features`: Random feature subsampling (like Random Forest)

### 5.3 Subsampling (Stochastic Gradient Boosting)
- `subsample < 1.0`: Use random subset of data for each tree
- Adds randomness, so successive trees are less correlated → reduces overfitting
- Also speeds up training!

### 5.4 Early Stopping
- Monitor validation performance
- Stop when no improvement
- Prevents wasting computation and overfitting
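
These strategies are usually combined rather than applied one at a time. As a minimal sketch, the block below compares a loosely constrained model with one that stacks shrinkage, tree constraints, and subsampling. The synthetic data, the two configurations, and their names (`loose`, `regularized`) are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X_reg, y_reg = make_regression(
    n_samples=600, n_features=10, n_informative=5, noise=25.0, random_state=42
)
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_reg, y_reg, test_size=0.3, random_state=42
)

configs = {
    # Deep trees, full data per tree, no leaf constraints
    "loose": dict(max_depth=6, learning_rate=0.1, subsample=1.0),
    # Shrinkage + tree constraints + row and feature subsampling
    "regularized": dict(
        max_depth=3, learning_rate=0.05, subsample=0.8,
        min_samples_split=20, min_samples_leaf=10, max_features=0.8,
    ),
}

reg_results = {}
for name, params in configs.items():
    model = GradientBoostingRegressor(n_estimators=200, random_state=42, **params)
    model.fit(Xr_train, yr_train)
    reg_results[name] = (
        model.score(Xr_train, yr_train),
        model.score(Xr_test, yr_test),
    )

for name, (train_r2, test_r2) in reg_results.items():
    print(f"{name:>12}: train R²={train_r2:.4f}, test R²={test_r2:.4f}, "
          f"gap={train_r2 - test_r2:.4f}")
```

The loose model should fit the training set almost perfectly. Whether the regularized model generalizes better depends on the data, which is why the comparison is worth running rather than assuming.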

In [None]:
# Demonstrate subsampling (stochastic gradient boosting)
subsample_values = [0.3, 0.5, 0.7, 0.9, 1.0]

subsample_train_scores = []
subsample_test_scores = []
subsample_times = []

for subsample in subsample_values:
    start = time.time()
    gbr = GradientBoostingRegressor(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=3,
        subsample=subsample,
        random_state=RANDOM_STATE
    )
    gbr.fit(X_train, y_train)
    train_time = time.time() - start
    
    subsample_train_scores.append(gbr.score(X_train, y_train))
    subsample_test_scores.append(gbr.score(X_test, y_test))
    subsample_times.append(train_time)

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy
axes[0].plot(subsample_values, subsample_train_scores, marker='o', linewidth=2, label='Train R²')
axes[0].plot(subsample_values, subsample_test_scores, marker='s', linewidth=2, label='Test R²')
axes[0].set_xlabel('Subsample Ratio', fontsize=11)
axes[0].set_ylabel('R² Score', fontsize=11)
axes[0].set_title('Subsampling Effect on Accuracy', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Training time
axes[1].bar(range(len(subsample_values)), subsample_times, alpha=0.7, edgecolor='black')
axes[1].set_xticks(range(len(subsample_values)))
axes[1].set_xticklabels(subsample_values)
axes[1].set_xlabel('Subsample Ratio', fontsize=11)
axes[1].set_ylabel('Training Time (seconds)', fontsize=11)
axes[1].set_title('Subsampling Effect on Speed', fontsize=12, fontweight='bold')
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

best_subsample = subsample_values[np.argmax(subsample_test_scores)]
print(f"\nBest subsample: {best_subsample}")
print(f"Test R²: {max(subsample_test_scores):.4f}")
print(f"\nSpeed improvement with subsample=0.5: {(subsample_times[-1]/subsample_times[1] - 1) * 100:.1f}%")
print("Subsampling often improves both speed AND generalization!")

## 6. Early Stopping

Early stopping monitors validation performance and stops training when it stops improving.

In [None]:
# Hold out an explicit validation split (used for the learning-curve plot in the next cell)
X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(
    X_train, y_train, test_size=0.2, random_state=RANDOM_STATE
)

# Built-in early stopping: sklearn sets aside validation_fraction of X_train internally
gbr_es = GradientBoostingRegressor(
    n_estimators=500,  # Train many
    learning_rate=0.05,
    max_depth=3,
    subsample=0.8,
    random_state=RANDOM_STATE,
    validation_fraction=0.2,  # Use 20% for validation
    n_iter_no_change=10,  # Stop if no improvement for 10 iterations
    tol=1e-4
)

gbr_es.fit(X_train, y_train)

print("Early stopping results:")
print(f"Estimators trained: {gbr_es.n_estimators_} of {gbr_es.n_estimators} allowed")
print(f"Stopped early: {gbr_es.n_estimators_ < gbr_es.n_estimators}")
print(f"\nFinal test R²: {gbr_es.score(X_test, y_test):.4f}")

In [None]:
# Visualize learning curves with early stopping
# Train without early stopping to see full curve
gbr_full = GradientBoostingRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=3,
    subsample=0.8,
    random_state=RANDOM_STATE
)
gbr_full.fit(X_train_split, y_train_split)

# Compute train and validation R² after each boosting iteration.
# Note: gbr_full.train_score_ stores the in-bag training LOSS, not R²,
# so it cannot share an R² axis with the validation curve.
train_r2_curve = [r2_score(y_train_split, pred)
                  for pred in gbr_full.staged_predict(X_train_split)]
val_scores = [r2_score(y_val_split, pred)
              for pred in gbr_full.staged_predict(X_val_split)]

# Plot
plt.figure(figsize=(12, 6))
plt.plot(range(1, len(train_r2_curve) + 1), train_r2_curve,
         linewidth=2, label='Train R²', alpha=0.8)
plt.plot(range(1, len(val_scores) + 1), val_scores,
         linewidth=2, label='Validation R²', alpha=0.8)

# Mark best validation score
best_iter = np.argmax(val_scores) + 1
plt.axvline(best_iter, color='red', linestyle='--', linewidth=2, 
            label=f'Best iteration: {best_iter}')

plt.xlabel('Number of Boosting Iterations', fontsize=12)
plt.ylabel('R² Score', fontsize=12)
plt.title('Learning Curves: When to Stop?', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\nOptimal stopping point: iteration {best_iter}")
print(f"Best validation R²: {max(val_scores):.4f}")
print("\nIf the validation curve flattens or drops after this point, extra trees add cost without improving generalization.")
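
Once the best iteration is known, one common follow-up (an assumption here, not something the cells above do) is to refit on all available training data with exactly that many trees. The sketch below does this on synthetic data so it runs on its own; `probe`, `best_n`, and `final_model` are illustrative names.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_es, y_es = make_regression(
    n_samples=800, n_features=10, n_informative=5, noise=30.0, random_state=42
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_es, y_es, test_size=0.25, random_state=42
)

# Step 1: train a generous number of trees on the fitting split
max_trees = 300
probe = GradientBoostingRegressor(
    n_estimators=max_trees, learning_rate=0.1, max_depth=3, random_state=42
)
probe.fit(X_fit, y_fit)

# Step 2: validation MSE after each stage; pick the stage that minimizes it
val_mse = [mean_squared_error(y_val, pred) for pred in probe.staged_predict(X_val)]
best_n = int(np.argmin(val_mse)) + 1

# Step 3: refit on fitting + validation data with the selected tree count
final_model = GradientBoostingRegressor(
    n_estimators=best_n, learning_rate=0.1, max_depth=3, random_state=42
)
final_model.fit(X_es, y_es)

print(f"Best number of trees on validation data: {best_n} of {max_trees}")
print(f"Validation MSE at that point: {val_mse[best_n - 1]:.2f}")
```

Refitting on more data can shift the ideal tree count slightly, so some practitioners scale `best_n` up a little; treat the selected value as an estimate, not an exact optimum.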

## Exercises

### Exercise 1: Loss Function Comparison

Create a regression dataset with 10% outliers (extreme values). Train Gradient Boosting models with different loss functions:
- `loss='squared_error'` (default, sensitive to outliers)
- `loss='absolute_error'` (robust to outliers)
- `loss='huber'` (balanced)

Compare:
1. Performance metrics (MSE, MAE)
2. Prediction errors on outliers vs normal points
3. Visual comparison of predictions

**Which loss function handles outliers best?**

In [None]:
# Your code here


### Exercise 2: Hyperparameter Grid Search

Perform a systematic grid search over:
- `learning_rate`: [0.01, 0.05, 0.1]
- `n_estimators`: [50, 100, 200]
- `max_depth`: [3, 5, 7]
- `subsample`: [0.8, 1.0]

Use cross-validation to find the best combination. Create a visualization showing:
1. Top 10 parameter combinations
2. Parameter importance (which matters most?)
3. Trade-off between n_estimators and learning_rate

**Bonus**: Include training time in your analysis.

In [None]:
# Your code here


### Exercise 3: Feature Importance Analysis

Train a Gradient Boosting model on a classification dataset and analyze feature importance:

1. Get impurity-based importance from the model
2. Calculate permutation importance
3. Visualize both and compare
4. Remove the bottom 50% of features and retrain
5. Compare performance and training time

**Can you achieve similar performance with fewer features?**

In [None]:
# Your code here


### Exercise 4: Learning Rate Schedule

Implement a custom learning rate schedule (not built into sklearn):

1. Start with high learning rate (e.g., 0.3)
2. Train in stages, decreasing learning rate each stage:
   - Stage 1: 50 estimators at lr=0.3
   - Stage 2: 50 estimators at lr=0.1  
   - Stage 3: 50 estimators at lr=0.03
3. Use `warm_start=True` to continue training
4. Compare with constant learning rate

**Does this learning rate schedule improve performance?**

In [None]:
# Your code here


## Summary

### Key Concepts

1. **Gradient Boosting Framework**:
   - Generalization of AdaBoost to arbitrary loss functions
   - Optimization in function space via gradient descent
   - Each tree fits negative gradient (pseudo-residuals)
   - Flexible: works for classification, regression, ranking, etc.

2. **Algorithm**:
   - Initialize with constant prediction
   - For each iteration:
     - Compute pseudo-residuals (negative gradient)
     - Fit tree to residuals
     - Update model with scaled tree prediction
   - Final model: sum of all trees

3. **Loss Functions**:
   - Squared loss: Smooth, sensitive to outliers
   - Absolute loss: Robust, constant gradient
   - Huber loss: Best of both worlds
   - Log loss: For classification

4. **Key Hyperparameters**:
   - `n_estimators`: Number of trees (use early stopping)
   - `learning_rate`: Shrinkage (lower is better but needs more trees)
   - `max_depth`: Tree complexity (3-5 works well)
   - `subsample`: Stochastic gradient boosting (0.5-0.8 reduces overfitting)
   - `min_samples_leaf`: Regularization

5. **Regularization**:
   - Learning rate (shrinkage)
   - Tree constraints (depth, samples)
   - Subsampling (stochastic GB)
   - Early stopping
   - **Multiple strategies recommended**: combine them!

### Strengths

- Often best performance on tabular data
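- Handles numeric features on different scales without normalization (categorical features must be encoded first for sklearn's `GradientBoosting*` estimators)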
- Built-in feature importance
- Flexible loss functions
- Robust to irrelevant features

### Weaknesses

- Sequential training (can't parallelize like Random Forest)
- Sensitive to hyperparameters (needs tuning)
- Can overfit easily without regularization
- Slower training than Random Forest
- Harder to interpret than single trees

### Best Practices

1. **Start with these values** (reasonable starting points; not all are sklearn's defaults):
   - `n_estimators=100-500` (use early stopping)
   - `learning_rate=0.1`
   - `max_depth=3-5`
   - `subsample=0.8`
   - `min_samples_leaf=5-10`

2. **Tuning strategy**:
   - First tune tree parameters (max_depth, min_samples_leaf)
   - Then tune learning_rate and n_estimators together
   - Finally tune subsample
   - Use cross-validation throughout

3. **For production**:
   - Lower learning rate (0.01-0.05) for better generalization
   - Use early stopping to save computation
   - Monitor for overfitting with validation set
   - Consider modern variants (XGBoost, LightGBM, CatBoost)
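
The staged tuning strategy in item 2 can be sketched with `GridSearchCV`, carrying the best settings from each stage into the next. The grids are deliberately tiny and the data is synthetic so the block runs quickly; the stage variable names and all candidate values are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X_tune, y_tune = make_regression(
    n_samples=400, n_features=8, n_informative=4, noise=15.0, random_state=42
)
common = dict(cv=3, scoring="neg_mean_squared_error")

# Stage 1: tree parameters, with learning rate and tree count held fixed
stage1 = GridSearchCV(
    GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42),
    param_grid={"max_depth": [2, 3, 4], "min_samples_leaf": [1, 5, 10]},
    **common,
)
stage1.fit(X_tune, y_tune)

# Stage 2: learning rate and number of trees, tuned together
stage2 = GridSearchCV(
    GradientBoostingRegressor(random_state=42, **stage1.best_params_),
    param_grid={"learning_rate": [0.05, 0.1], "n_estimators": [100, 200]},
    **common,
)
stage2.fit(X_tune, y_tune)

# Stage 3: subsample, with everything else fixed at the earlier winners
stage3 = GridSearchCV(
    GradientBoostingRegressor(
        random_state=42, **stage1.best_params_, **stage2.best_params_
    ),
    param_grid={"subsample": [0.6, 0.8, 1.0]},
    **common,
)
stage3.fit(X_tune, y_tune)

best_params = {**stage1.best_params_, **stage2.best_params_, **stage3.best_params_}
print("Selected parameters:", best_params)
print(f"Cross-validated MSE: {-stage3.best_score_:.2f}")
```

Staged search is cheaper than a full grid but can miss interactions between parameters tuned in different stages, so a final joint search over a narrow range is a reasonable follow-up.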

### Comparison Summary

| Method | Training | Bias/Variance | Regularization | Speed |
|--------|----------|---------------|----------------|-------|
| Random Forest | Parallel | Reduces variance | Natural (averaging) | Fast |
| AdaBoost | Sequential | Reduces bias | Limited options | Medium |
| Gradient Boosting | Sequential | Reduces bias | Many options | Medium |

### What's Next?

In **Module 05: XGBoost**, we'll explore:
- Modern optimized gradient boosting
- Advanced regularization (L1, L2, gamma)
- Handling missing values
- Built-in cross-validation
- SHAP values for interpretability

### Additional Resources

- **Original Paper**: "Greedy Function Approximation: A Gradient Boosting Machine" (Friedman, 2001)
- **Tutorial**: "Gradient Boosting from Scratch" on StatQuest
- **Documentation**: [Sklearn Gradient Boosting](https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting)
- **Book**: "The Elements of Statistical Learning" Chapter 10 (Hastie et al.)