# Module 04: Gradient Boosting Machines

**Difficulty**: ⭐⭐
**Estimated Time**: 45 minutes
**Prerequisites**: 
- Module 00: Introduction to Ensemble Methods
- Module 01: Bagging and Bootstrap Aggregation
- Module 02: Random Forest
- Module 03: AdaBoost

## Learning Objectives
By the end of this notebook, you will be able to:
1. Understand the gradient boosting algorithm and how it differs from AdaBoost
2. Explain the role of loss functions and gradient descent in boosting
3. Implement gradient boosting for classification and regression tasks
4. Tune key hyperparameters: learning rate, max depth, and n_estimators
5. Use early stopping to prevent overfitting
6. Analyze feature importance in gradient boosting models

## 1. Introduction to Gradient Boosting

### What is Gradient Boosting?

Gradient Boosting is a powerful ensemble technique that builds models sequentially, with each new model correcting the errors of the ensemble built so far. Unlike AdaBoost, which reweights samples, **Gradient Boosting fits each new model to the residual errors** (the actual values minus the current predictions).

### Key Differences from AdaBoost:

1. **AdaBoost**: Adjusts sample weights, focuses on misclassified samples
2. **Gradient Boosting**: Fits new models to residuals (errors), uses gradient descent optimization

### The Algorithm:

1. Start with an initial prediction (usually the mean for regression, log-odds for classification)
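2. Calculate residuals: the actual values minus the current predictions (for losses other than squared error, use the negative gradient of the loss, often called pseudo-residuals)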
3. Train a new weak learner (decision tree) to predict these residuals
4. Add the new model's predictions (scaled by learning rate) to the ensemble
5. Repeat steps 2-4 for a specified number of iterations

### Why "Gradient"?

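The method performs gradient descent on a loss function, but in prediction space rather than parameter space. Each new tree is fitted to the negative gradient of the loss with respect to the current predictions, so adding it moves the ensemble predictions in the direction that reduces the loss. For squared-error loss, that negative gradient is exactly the residual.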
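
One way to see why fitting residuals counts as gradient descent is to write out the squared-error case. The notation below ($F_m$ for the ensemble after $m$ trees, $h_m$ for the $m$-th tree, $\nu$ for the learning rate) is introduced here for illustration only.

$$
L(y, F) = \tfrac{1}{2}\,(y - F)^2
\quad\Longrightarrow\quad
-\frac{\partial L}{\partial F} = y - F
$$

$$
r_i = y_i - F_{m-1}(x_i), \qquad
F_m(x) = F_{m-1}(x) + \nu\, h_m(x), \quad h_m \text{ fitted to } \{(x_i, r_i)\}
$$

With squared error, the negative gradient at each training point is the residual $r_i$. Other losses replace $r_i$ with their own negative gradient; for log-loss on a 0/1 label with $F$ as the log-odds, it is the label minus the predicted probability.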

## 2. Setup and Imports

In [None]:
# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn imports
from sklearn.datasets import make_classification, make_regression, load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score, learning_curve
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    mean_squared_error, r2_score, mean_absolute_error
)

# Configuration
%matplotlib inline
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

# Set random seed for reproducibility
np.random.seed(42)

## 3. Understanding Gradient Boosting with a Simple Example

Let's build a simple gradient boosting model from scratch to understand how it works. We'll create a regression example where we can visualize the process.

In [None]:
# Create a simple dataset for regression
np.random.seed(42)
X_simple = np.linspace(0, 10, 100).reshape(-1, 1)
y_simple = 2 * X_simple.ravel() + np.sin(X_simple.ravel() * 2) + np.random.randn(100) * 0.5

print(f"Dataset shape: {X_simple.shape}")
print(f"Target shape: {y_simple.shape}")

# Visualize the data
plt.figure(figsize=(10, 6))
plt.scatter(X_simple, y_simple, alpha=0.6, label='Actual data')
plt.xlabel('X')
plt.ylabel('y')
plt.title('Simple Regression Dataset')
plt.legend()
plt.show()

### Manual Gradient Boosting Implementation

Let's implement a simplified version of gradient boosting to see how it works step by step.

In [None]:
# Step 1: Initialize with the mean (simplest prediction)
initial_prediction = np.mean(y_simple)
print(f"Initial prediction (mean): {initial_prediction:.2f}")

# Create array to store ensemble predictions
ensemble_predictions = np.full(len(y_simple), initial_prediction)

# Learning rate - controls how much each tree contributes
learning_rate = 0.1

# Number of boosting iterations
n_iterations = 5

# Store trees for visualization
trees = []

# Visualize the boosting process
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.ravel()

for i in range(n_iterations):
    # Step 2: Calculate residuals (negative gradient for MSE loss)
    residuals = y_simple - ensemble_predictions
    
    # Step 3: Fit a tree to predict the residuals
    tree = DecisionTreeRegressor(max_depth=3, random_state=42)
    tree.fit(X_simple, residuals)
    trees.append(tree)
    
    # Step 4: Update ensemble predictions
    tree_predictions = tree.predict(X_simple)
    ensemble_predictions += learning_rate * tree_predictions
    
    # Calculate error
    mse = mean_squared_error(y_simple, ensemble_predictions)
    
    # Visualize
    axes[i].scatter(X_simple, y_simple, alpha=0.5, label='Actual')
    axes[i].plot(X_simple, ensemble_predictions, 'r-', linewidth=2, label='Prediction')
    axes[i].set_title(f'Iteration {i+1} - MSE: {mse:.2f}')
    axes[i].set_xlabel('X')
    axes[i].set_ylabel('y')
    axes[i].legend()
    
    print(f"Iteration {i+1}: MSE = {mse:.4f}")

# Hide any unused subplots (the 2x3 grid has 6 axes, but only n_iterations are drawn)
for ax in axes[n_iterations:]:
    ax.axis('off')

plt.tight_layout()
plt.show()

print(f"\nFinal MSE: {mean_squared_error(y_simple, ensemble_predictions):.4f}")

**Key Observation**: Notice how each iteration lowers the MSE by learning from the residuals. Because the learning rate is 0.1, each tree corrects only a fraction of the remaining error, so five iterations still leave a visible gap; further iterations keep closing it.
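
The loop above only tracks predictions on the training points. As a minimal sketch of how the same ensemble could score unseen inputs, the block below wraps the steps into two helper functions. The names `fit_manual_gb` and `predict_manual_gb` are hypothetical helpers introduced here, not part of scikit-learn, and the data is regenerated so the block runs on its own.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error


def fit_manual_gb(X, y, n_iterations=5, learning_rate=0.1, max_depth=3):
    """Fit a squared-error boosting ensemble.

    Returns the initial prediction, the list of fitted trees,
    and the training MSE after each iteration.
    """
    init = float(np.mean(y))                 # Step 1: start from the mean
    predictions = np.full(len(y), init)
    trees, mse_history = [], []
    for _ in range(n_iterations):
        residuals = y - predictions          # Step 2: negative gradient of squared error
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=42)
        tree.fit(X, residuals)               # Step 3: fit a tree to the residuals
        predictions = predictions + learning_rate * tree.predict(X)  # Step 4
        trees.append(tree)
        mse_history.append(mean_squared_error(y, predictions))
    return init, trees, mse_history


def predict_manual_gb(X, init, trees, learning_rate=0.1):
    """Sum the initial prediction and the scaled output of every tree."""
    predictions = np.full(X.shape[0], init)
    for tree in trees:
        predictions = predictions + learning_rate * tree.predict(X)
    return predictions


# Same shape of data as the simple example, regenerated for self-containment
rng = np.random.default_rng(42)
X_demo = np.linspace(0, 10, 100).reshape(-1, 1)
y_demo = 2 * X_demo.ravel() + np.sin(2 * X_demo.ravel()) + rng.normal(0, 0.5, 100)

init, demo_trees, mse_history = fit_manual_gb(X_demo, y_demo)
X_new = np.array([[2.5], [7.5]])

print("Training MSE per iteration:", np.round(mse_history, 4))
print("Predictions for new inputs:", predict_manual_gb(X_new, init, demo_trees))
```

Prediction has to replay the same learning rate that was used during fitting. A real implementation would store it alongside the trees instead of passing it twice.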

## 4. Gradient Boosting for Regression

Now let's use scikit-learn's `GradientBoostingRegressor` on a more realistic dataset.

In [None]:
# Create a regression dataset
X_reg, y_reg = make_regression(
    n_samples=1000,
    n_features=10,
    n_informative=8,
    noise=10,
    random_state=42
)

# Split the data
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

print(f"Training set size: {X_train_reg.shape}")
print(f"Test set size: {X_test_reg.shape}")

In [None]:
# Train a Gradient Boosting Regressor
gb_regressor = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)

gb_regressor.fit(X_train_reg, y_train_reg)

# Make predictions
y_pred_train_reg = gb_regressor.predict(X_train_reg)
y_pred_test_reg = gb_regressor.predict(X_test_reg)

# Evaluate performance
train_mse = mean_squared_error(y_train_reg, y_pred_train_reg)
test_mse = mean_squared_error(y_test_reg, y_pred_test_reg)
train_r2 = r2_score(y_train_reg, y_pred_train_reg)
test_r2 = r2_score(y_test_reg, y_pred_test_reg)

print("Gradient Boosting Regressor Performance:")
print(f"Train MSE: {train_mse:.2f}")
print(f"Test MSE: {test_mse:.2f}")
print(f"Train R²: {train_r2:.4f}")
print(f"Test R²: {test_r2:.4f}")

In [None]:
# Visualize predictions vs actual values
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Training set
axes[0].scatter(y_train_reg, y_pred_train_reg, alpha=0.5)
axes[0].plot([y_train_reg.min(), y_train_reg.max()], 
             [y_train_reg.min(), y_train_reg.max()], 
             'r--', linewidth=2, label='Perfect prediction')
axes[0].set_xlabel('Actual values')
axes[0].set_ylabel('Predicted values')
axes[0].set_title(f'Training Set (R² = {train_r2:.4f})')
axes[0].legend()

# Test set
axes[1].scatter(y_test_reg, y_pred_test_reg, alpha=0.5)
axes[1].plot([y_test_reg.min(), y_test_reg.max()], 
             [y_test_reg.min(), y_test_reg.max()], 
             'r--', linewidth=2, label='Perfect prediction')
axes[1].set_xlabel('Actual values')
axes[1].set_ylabel('Predicted values')
axes[1].set_title(f'Test Set (R² = {test_r2:.4f})')
axes[1].legend()

plt.tight_layout()
plt.show()

## 5. Gradient Boosting for Classification

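Gradient boosting works slightly differently for classification. The individual learners are still regression trees, but by default they are fitted to the gradient of log-loss (cross-entropy). The ensemble's raw output is a log-odds score, which is converted to a probability with the sigmoid function.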

In [None]:
# Load breast cancer dataset
cancer_data = load_breast_cancer()
X_clf = cancer_data.data
y_clf = cancer_data.target

# Split the data
X_train_clf, X_test_clf, y_train_clf, y_test_clf = train_test_split(
    X_clf, y_clf, test_size=0.2, random_state=42, stratify=y_clf
)

print(f"Training set size: {X_train_clf.shape}")
print(f"Test set size: {X_test_clf.shape}")
print(f"\nClass distribution in training set:")
print(pd.Series(y_train_clf).value_counts())

In [None]:
# Train a Gradient Boosting Classifier
gb_classifier = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)

gb_classifier.fit(X_train_clf, y_train_clf)

# Make predictions
y_pred_train_clf = gb_classifier.predict(X_train_clf)
y_pred_test_clf = gb_classifier.predict(X_test_clf)

# Evaluate performance
train_accuracy = accuracy_score(y_train_clf, y_pred_train_clf)
test_accuracy = accuracy_score(y_test_clf, y_pred_test_clf)

print("Gradient Boosting Classifier Performance:")
print(f"Train Accuracy: {train_accuracy:.4f}")
print(f"Test Accuracy: {test_accuracy:.4f}")
print(f"\nClassification Report (Test Set):")
print(classification_report(y_test_clf, y_pred_test_clf, 
                          target_names=cancer_data.target_names))

In [None]:
# Confusion matrix
cm = confusion_matrix(y_test_clf, y_pred_test_clf)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=cancer_data.target_names,
            yticklabels=cancer_data.target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix - Gradient Boosting Classifier')
plt.show()
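
To illustrate the log-odds-to-probability step mentioned at the start of this section, here is a small sketch for the binary case. It assumes the default log-loss and a two-class target, refits a small model so the block is self-contained, and uses demo variable names (`clf_demo`, `raw_scores`) that are not used elsewhere in the notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier

X_demo, y_demo = load_breast_cancer(return_X_y=True)

clf_demo = GradientBoostingClassifier(n_estimators=20, random_state=42)
clf_demo.fit(X_demo, y_demo)

# Raw ensemble output for the first five samples (log-odds of class 1)
raw_scores = np.ravel(clf_demo.decision_function(X_demo[:5]))

# Sigmoid maps log-odds to a probability in (0, 1)
proba_from_raw = 1.0 / (1.0 + np.exp(-raw_scores))

# Probability of class 1 as reported by scikit-learn
proba_sklearn = clf_demo.predict_proba(X_demo[:5])[:, 1]

print("Log-odds:            ", np.round(raw_scores, 3))
print("Sigmoid of log-odds: ", np.round(proba_from_raw, 3))
print("predict_proba[:, 1]: ", np.round(proba_sklearn, 3))
```

The last two printed rows should agree, which shows that the trees accumulate a score on the log-odds scale and that probabilities only appear at the final conversion.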

## 6. Key Hyperparameters

Gradient Boosting has several important hyperparameters that control model complexity and performance:

### 1. **n_estimators**: Number of boosting stages (trees)
- More trees → Better training performance but risk of overfitting
- Too few trees → Underfitting

### 2. **learning_rate** (also called shrinkage)
- Controls the contribution of each tree
- Lower values require more trees but often generalize better
- Typical range: 0.01 to 0.3
- Trade-off with `n_estimators`: a smaller `learning_rate` generally needs more trees to reach the same training loss, so tune the two together

### 3. **max_depth**: Maximum depth of individual trees
- Controls tree complexity
- Typical range: 3 to 8
- Deeper trees → More complex patterns but higher overfitting risk

### 4. **subsample**: Fraction of samples used for fitting trees
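- Each tree is fit on a random subset drawn without replacement (stochastic gradient boosting); this is related to, but not the same as, Random Forest's bootstrap sampling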
- Typical range: 0.5 to 1.0
- Values < 1.0 add randomness and reduce overfitting

### 5. **min_samples_split** and **min_samples_leaf**
- Control when to stop splitting nodes
- Higher values → Simpler trees, less overfitting
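
As a brief sketch of the `subsample` parameter described above: when `subsample` is below 1.0, scikit-learn records the loss improvement on the samples left out of each tree's subset in the `oob_improvement_` attribute. The block below only inspects that attribute for one setting; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Each tree sees a random 80% of the training rows (drawn without replacement)
sgb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    subsample=0.8,
    random_state=42
)
sgb.fit(X_tr, y_tr)

# One entry per boosting stage: loss improvement on the left-out rows
oob_gain = sgb.oob_improvement_
test_acc = sgb.score(X_te, y_te)

print(f"Test accuracy with subsample=0.8: {test_acc:.4f}")
print(f"Mean OOB improvement, first 10 stages: {np.mean(oob_gain[:10]):.4f}")
print(f"Mean OOB improvement, last 10 stages:  {np.mean(oob_gain[-10:]):.4f}")
```

The out-of-bag improvement usually shrinks as stages accumulate, giving a rough signal of diminishing returns without a separate validation set.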

### Effect of Learning Rate

In [None]:
# Compare different learning rates
learning_rates = [0.01, 0.05, 0.1, 0.3]
train_scores = []
test_scores = []

for lr in learning_rates:
    gb = GradientBoostingClassifier(
        n_estimators=100,
        learning_rate=lr,
        max_depth=3,
        random_state=42
    )
    gb.fit(X_train_clf, y_train_clf)
    
    train_scores.append(gb.score(X_train_clf, y_train_clf))
    test_scores.append(gb.score(X_test_clf, y_test_clf))
    
    print(f"Learning Rate: {lr:.2f}")
    print(f"  Train Accuracy: {train_scores[-1]:.4f}")
    print(f"  Test Accuracy: {test_scores[-1]:.4f}")
    print()

# Visualize
x_pos = np.arange(len(learning_rates))
width = 0.35

plt.figure(figsize=(10, 6))
plt.bar(x_pos - width/2, train_scores, width, label='Train', alpha=0.8)
plt.bar(x_pos + width/2, test_scores, width, label='Test', alpha=0.8)
plt.xlabel('Learning Rate')
plt.ylabel('Accuracy')
plt.title('Effect of Learning Rate on Model Performance')
plt.xticks(x_pos, learning_rates)
plt.legend()
plt.grid(axis='y', alpha=0.3)
plt.show()

### Effect of Max Depth

In [None]:
# Compare different max depths
max_depths = [1, 2, 3, 5, 7]
train_scores_depth = []
test_scores_depth = []

for depth in max_depths:
    gb = GradientBoostingClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=depth,
        random_state=42
    )
    gb.fit(X_train_clf, y_train_clf)
    
    train_scores_depth.append(gb.score(X_train_clf, y_train_clf))
    test_scores_depth.append(gb.score(X_test_clf, y_test_clf))
    
    print(f"Max Depth: {depth}")
    print(f"  Train Accuracy: {train_scores_depth[-1]:.4f}")
    print(f"  Test Accuracy: {test_scores_depth[-1]:.4f}")
    print()

# Visualize
plt.figure(figsize=(10, 6))
plt.plot(max_depths, train_scores_depth, 'o-', label='Train', linewidth=2, markersize=8)
plt.plot(max_depths, test_scores_depth, 's-', label='Test', linewidth=2, markersize=8)
plt.xlabel('Max Depth')
plt.ylabel('Accuracy')
plt.title('Effect of Max Depth on Model Performance')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

**Key Observation**: Notice how deeper trees lead to better training performance but may hurt test performance (overfitting). A max_depth of 3-5 often provides a good balance.

## 7. Early Stopping

Early stopping monitors validation performance and stops training when performance stops improving. This prevents overfitting and saves computational time.

In [None]:
# Train with early stopping.
# No manual split is needed: when n_iter_no_change is set, the estimator
# holds out validation_fraction of the data passed to fit() internally.
gb_early_stop = GradientBoostingClassifier(
    n_estimators=1000,  # Upper bound; early stopping determines the actual number
    learning_rate=0.1,
    max_depth=3,
    validation_fraction=0.2,  # Hold out 20% of the training data for validation
    n_iter_no_change=10,  # Stop if the validation score has not improved for 10 iterations
    tol=1e-4,  # Minimum improvement that counts as progress
    random_state=42
)

gb_early_stop.fit(X_train_clf, y_train_clf)

print(f"Total estimators with early stopping: {gb_early_stop.n_estimators_}")
print(f"Specified max estimators: 1000")
print(f"\nTest Accuracy: {gb_early_stop.score(X_test_clf, y_test_clf):.4f}")

In [None]:
# Visualize training progress
# Train model with staged predictions to see performance at each stage
gb_staged = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)

gb_staged.fit(X_train_clf, y_train_clf)

# Calculate accuracy at each stage
train_scores_staged = []
test_scores_staged = []

for y_pred_train in gb_staged.staged_predict(X_train_clf):
    train_scores_staged.append(accuracy_score(y_train_clf, y_pred_train))

for y_pred_test in gb_staged.staged_predict(X_test_clf):
    test_scores_staged.append(accuracy_score(y_test_clf, y_pred_test))

# Plot
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(train_scores_staged) + 1), train_scores_staged, 
         label='Train', linewidth=2)
plt.plot(range(1, len(test_scores_staged) + 1), test_scores_staged, 
         label='Test', linewidth=2)
plt.xlabel('Number of Boosting Iterations')
plt.ylabel('Accuracy')
plt.title('Learning Curve: Accuracy vs Number of Trees')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

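# Find the number of estimators with the best test accuracy.
# Note: choosing it on the test set is for illustration only. In practice,
# select it on a separate validation set so the test score stays unbiased.
optimal_n_estimators = np.argmax(test_scores_staged) + 1
print(f"Best number of estimators on the test set: {optimal_n_estimators}")
print(f"Best test accuracy: {test_scores_staged[optimal_n_estimators-1]:.4f}")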

**Key Insight**: Training accuracy typically keeps climbing toward 1.0, while test accuracy levels off (or dips) much earlier. Trees added beyond that point mostly fit noise. Early stopping guards against this by monitoring performance on held-out validation data.
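
Picking the tree count from test accuracy, as the cell above does for illustration, lets information from the test set leak into model selection. A minimal sketch of the alternative is shown below: hold out a validation set, score every stage on it with `staged_predict_proba`, and keep the test set for the final check. Validation log-loss is used as the selection metric here as an assumption; accuracy would also work but tends to be noisier on a small validation set.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# Three-way split: fit / validation (model selection) / test (final check)
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_rest, y_rest, test_size=0.2, random_state=42, stratify=y_rest
)

gb_val = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)
gb_val.fit(X_fit, y_fit)

# Validation log-loss after each boosting stage
val_losses = [
    log_loss(y_val, proba) for proba in gb_val.staged_predict_proba(X_val)
]
best_n = int(np.argmin(val_losses)) + 1

# Refit with the selected number of trees and score once on the test set
gb_final = GradientBoostingClassifier(
    n_estimators=best_n, learning_rate=0.1, max_depth=3, random_state=42
)
gb_final.fit(X_fit, y_fit)
final_test_acc = gb_final.score(X_te, y_te)

print(f"Stages with lowest validation log-loss: {best_n}")
print(f"Test accuracy at that setting: {final_test_acc:.4f}")
```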

## 8. Feature Importance

Gradient Boosting provides impurity-based feature importance scores: each feature is credited with the impurity reduction of the splits that use it, averaged across all trees and normalized so the scores sum to 1.

In [None]:
# Get feature importances
feature_importance = gb_classifier.feature_importances_
feature_names = cancer_data.feature_names

# Create a DataFrame for better visualization
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importance
}).sort_values('Importance', ascending=False)

print("Top 10 Most Important Features:")
print(importance_df.head(10))

# Visualize
plt.figure(figsize=(12, 8))
plt.barh(range(10), importance_df['Importance'].head(10))
plt.yticks(range(10), importance_df['Feature'].head(10))
plt.xlabel('Feature Importance')
plt.title('Top 10 Feature Importances in Gradient Boosting Model')
plt.gca().invert_yaxis()  # Highest importance at the top
plt.tight_layout()
plt.show()
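
Impurity-based scores are computed from the training data and can favor features that offer many possible split points. As a hedged cross-check, the sketch below compares them with permutation importance, which measures how much the test score drops when a single feature's values are shuffled. It uses `sklearn.inspection.permutation_importance` and refits a model so the block runs on its own; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

gb_demo = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42
)
gb_demo.fit(X_tr, y_tr)

# Shuffle each feature 5 times on the test set and record the accuracy drop
perm = permutation_importance(gb_demo, X_te, y_te, n_repeats=5, random_state=42)

top_impurity = np.argsort(gb_demo.feature_importances_)[::-1][:5]
top_permutation = np.argsort(perm.importances_mean)[::-1][:5]

print("Top 5 by impurity-based importance:")
for idx in top_impurity:
    print(f"  {data.feature_names[idx]}: {gb_demo.feature_importances_[idx]:.4f}")

print("Top 5 by permutation importance (test set):")
for idx in top_permutation:
    print(f"  {data.feature_names[idx]}: {perm.importances_mean[idx]:.4f}")
```

The two rankings often overlap without matching exactly. With correlated features such as those in this dataset, permutation importance can spread or understate the contribution of any single feature.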

## 9. Loss Functions

Gradient Boosting supports different loss functions depending on the task:

### For Regression:
- **'squared_error'** (default): Mean squared error, standard choice
- **'absolute_error'**: Mean absolute error, robust to outliers
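- **'huber'**: Squared error for small residuals and absolute error for large ones, a compromise between the two
- **'quantile'**: For predicting a chosen quantile, set with `alpha` (e.g., `alpha=0.5` for the median)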

### For Classification:
- **'log_loss'** (default): Logistic regression loss, provides probability estimates
- **'exponential'**: The loss function used by AdaBoost; supported for binary classification only

In [None]:
# Compare different loss functions for regression
# Create data with outliers
X_outlier, y_outlier = make_regression(
    n_samples=200, n_features=1, noise=10, random_state=42
)

# Add some outliers
y_outlier[::20] += np.random.randn(10) * 100

# Split data
X_train_out, X_test_out, y_train_out, y_test_out = train_test_split(
    X_outlier, y_outlier, test_size=0.2, random_state=42
)

# Train models with different loss functions
loss_functions = ['squared_error', 'absolute_error', 'huber']
models = {}

for loss in loss_functions:
    model = GradientBoostingRegressor(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=3,
        loss=loss,
        random_state=42
    )
    model.fit(X_train_out, y_train_out)
    models[loss] = model
    
    # Evaluate
    train_score = r2_score(y_train_out, model.predict(X_train_out))
    test_score = r2_score(y_test_out, model.predict(X_test_out))
    test_mae = mean_absolute_error(y_test_out, model.predict(X_test_out))
    
    print(f"Loss Function: {loss}")
    print(f"  Train R²: {train_score:.4f}")
    print(f"  Test R²: {test_score:.4f}")
    print(f"  Test MAE: {test_mae:.2f}")
    print()

# Visualize predictions
X_range = np.linspace(X_outlier.min(), X_outlier.max(), 300).reshape(-1, 1)

plt.figure(figsize=(15, 5))
for idx, (loss, model) in enumerate(models.items(), 1):
    plt.subplot(1, 3, idx)
    plt.scatter(X_train_out, y_train_out, alpha=0.5, label='Training data')
    plt.plot(X_range, model.predict(X_range), 'r-', linewidth=2, label='Prediction')
    plt.xlabel('X')
    plt.ylabel('y')
    plt.title(f'Loss: {loss}')
    plt.legend()

plt.tight_layout()
plt.show()
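
The comparison above leaves out the `'quantile'` loss. As a short sketch of one common use, the block below fits separate models for the 10th, 50th, and 90th percentiles to form a rough prediction interval. The dataset and the choice of quantiles are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X_q, y_q = make_regression(n_samples=300, n_features=1, noise=15, random_state=42)

# One model per quantile; alpha selects which quantile the loss targets
quantile_models = {}
for alpha in (0.1, 0.5, 0.9):
    model = GradientBoostingRegressor(
        loss='quantile',
        alpha=alpha,
        n_estimators=100,
        learning_rate=0.1,
        max_depth=3,
        random_state=42
    )
    quantile_models[alpha] = model.fit(X_q, y_q)

lower = quantile_models[0.1].predict(X_q)
median = quantile_models[0.5].predict(X_q)
upper = quantile_models[0.9].predict(X_q)

# Fraction of training targets that fall inside the [10%, 90%] band
coverage = np.mean((y_q >= lower) & (y_q <= upper))
print(f"Share of training points inside the 10-90% band: {coverage:.2f}")
```

Because each quantile is fitted independently, the lower and upper curves are not guaranteed to stay in order at every point. Coverage measured on the training data is also optimistic compared with held-out data.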

## 10. Exercises

Now it's your turn to practice! Complete the following exercises to reinforce your understanding.

### Exercise 1: Basic Gradient Boosting

Create a synthetic classification dataset with 500 samples and 15 features. Train a Gradient Boosting Classifier with:
- 50 estimators
- Learning rate of 0.15
- Max depth of 4

Calculate and print the training and test accuracy.

In [None]:
# Your code here
# Step 1: Create dataset using make_classification
# Step 2: Split into train and test sets
# Step 3: Train GradientBoostingClassifier with specified parameters
# Step 4: Calculate and print accuracies


### Exercise 2: Hyperparameter Tuning

Using the breast cancer dataset:
1. Test different combinations of `n_estimators` (50, 100, 200) and `learning_rate` (0.01, 0.1, 0.3)
2. Store the test accuracy for each combination
3. Create a heatmap showing the accuracy for each combination
4. Which combination gives the best test accuracy?

In [None]:
# Your code here
# Step 1: Create lists for n_estimators and learning_rate values
# Step 2: Use nested loops to try all combinations
# Step 3: Store results in a 2D array or DataFrame
# Step 4: Create heatmap using seaborn
# Step 5: Print the best combination


### Exercise 3: Comparing Subsample Ratios

The `subsample` parameter controls the fraction of samples used for fitting each tree. This adds randomness similar to Random Forest:

1. Train three Gradient Boosting Classifiers on the breast cancer data with:
   - subsample=1.0 (use all samples)
   - subsample=0.8
   - subsample=0.5
2. Keep other parameters constant: n_estimators=100, learning_rate=0.1, max_depth=3
3. Compare training and test accuracies
4. Which subsample value helps reduce overfitting the most?

In [None]:
# Your code here
# Step 1: Create a list of subsample values to test
# Step 2: Train models with different subsample values
# Step 3: Calculate train and test accuracies
# Step 4: Visualize results with a bar plot
# Step 5: Analyze which value best reduces overfitting


### Exercise 4: Feature Importance Analysis

Using the breast cancer dataset:
1. Train a Gradient Boosting Classifier
2. Extract feature importances
3. Train another model using only the top 10 most important features
4. Compare the test accuracy of the full model vs the reduced model
5. Does using fewer features significantly hurt performance?

In [None]:
# Your code here
# Step 1: Train initial model and get feature importances
# Step 2: Identify top 10 most important features
# Step 3: Create new dataset with only these features
# Step 4: Train model on reduced dataset
# Step 5: Compare performances


### Exercise 5: Early Stopping Optimization

Create a regression dataset and experiment with early stopping:
1. Create a regression dataset with 1000 samples, 20 features
2. Train a model with n_estimators=500 and early stopping enabled
3. Record the actual number of estimators used
4. Plot the staged predictions to show when the model stops improving
5. What's the optimal number of trees according to your analysis?

In [None]:
# Your code here
# Step 1: Create regression dataset
# Step 2: Train with early stopping
# Step 3: Use staged_predict to get predictions at each iteration
# Step 4: Calculate MSE at each stage
# Step 5: Plot and analyze


## 11. Summary

In this notebook, you learned about Gradient Boosting Machines:

### Key Concepts:
1. **Gradient Boosting** builds models sequentially by fitting each new model to the residuals (errors) of the previous ensemble
2. **Loss Functions** drive the optimization - MSE for regression, log-loss for classification
3. **Learning Rate** controls how much each tree contributes - lower values require more trees but often generalize better
4. **Tree Depth** controls model complexity - shallow trees (3-5 levels) often work best
5. **Early Stopping** prevents overfitting by monitoring validation performance
6. **Subsample** parameter adds randomness and reduces overfitting

### Best Practices:
- Start with conservative parameters: learning_rate=0.1, max_depth=3, n_estimators=100
- Use early stopping to automatically determine the optimal number of trees
- Lower learning rates often give better results but require more trees
- Use cross-validation for hyperparameter tuning
- Monitor training vs test performance to detect overfitting
- Consider using subsample < 1.0 to add randomness and speed up training
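
As a minimal sketch of the cross-validation advice above, the block below runs a small grid search with `GridSearchCV`. The grid values, the 3-fold setting, and the reduced `n_estimators` are arbitrary choices made to keep the run short, not recommended defaults.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Small illustrative grid: 2 x 2 combinations, each scored with 3-fold CV
param_grid = {
    'learning_rate': [0.05, 0.1],
    'max_depth': [2, 3],
}

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=50, random_state=42),
    param_grid,
    cv=3,
    scoring='accuracy'
)
search.fit(X_tr, y_tr)  # Only the training data takes part in the search

print(f"Best parameters: {search.best_params_}")
print(f"Best cross-validated accuracy: {search.best_score_:.4f}")
print(f"Test accuracy of the refit model: {search.score(X_te, y_te):.4f}")
```

The test set is touched only once, after the search has finished, so the reported test accuracy is not biased by the tuning process.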

### Advantages:
- Often achieves state-of-the-art performance on structured data
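- Needs little preprocessing: trees are insensitive to feature scaling (categorical features must still be encoded as numbers for scikit-learn's `GradientBoostingClassifier` and `GradientBoostingRegressor`)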
- Provides feature importance scores
- Can be made robust to outliers by choosing a robust loss function (e.g., `'huber'` or `'absolute_error'`)
- Works well with small to medium datasets

### Disadvantages:
- Trees must be built one after another, so training cannot be parallelized across trees the way Random Forest can
- More hyperparameters to tune than Random Forest
- Can be slow to train with many trees
- Risk of overfitting if not carefully tuned
- Less effective on very high-dimensional sparse data

### What's Next?

In the next module, we'll explore **XGBoost** (Extreme Gradient Boosting), an optimized implementation of gradient boosting that includes:
- Regularization terms to prevent overfitting
- Efficient tree pruning algorithms
- Built-in cross-validation
- Parallel processing capabilities
- Better handling of missing values

XGBoost has become the go-to algorithm for many machine learning competitions and real-world applications!

## Additional Resources

- [Scikit-learn Gradient Boosting Documentation](https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting)
- [Original Gradient Boosting Paper by Friedman (2001)](https://statweb.stanford.edu/~jhf/ftp/trebst.pdf)
- [Understanding Gradient Boosting (Video)](https://www.youtube.com/watch?v=3CC4N4z3GJc)
- [Gradient Boosting Interactive Demo](https://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)