# Module 05: XGBoost

**Difficulty**: ⭐⭐⭐ Advanced
**Estimated Time**: 90 minutes
**Prerequisites**: 
- Module 00: Introduction to Ensemble Learning
- Module 03: Boosting Fundamentals
- Module 04: Gradient Boosting Machines

## Learning Objectives

By the end of this notebook, you will be able to:
1. Understand XGBoost's algorithmic innovations and advantages
2. Install and use the XGBoost library effectively
3. Tune critical XGBoost hyperparameters for optimal performance
4. Apply regularization techniques (L1, L2, gamma) to prevent overfitting
5. Use early stopping with validation sets
6. Extract and interpret feature importance (gain, cover, frequency)
7. Use SHAP values for model interpretation
8. Leverage XGBoost's built-in cross-validation

## Setup and Configuration

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import time
import warnings
warnings.filterwarnings('ignore')

# Machine learning
from sklearn.datasets import load_breast_cancer, fetch_california_housing, load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    roc_auc_score, roc_curve, log_loss,
    mean_squared_error, r2_score
)

# XGBoost
try:
    import xgboost as xgb
    print(f"XGBoost version: {xgb.__version__}")
except ImportError:
    print("Installing XGBoost...")
    !pip install xgboost -q
    import xgboost as xgb
    print(f"XGBoost version: {xgb.__version__}")

# SHAP for interpretability (optional but recommended)
try:
    import shap
    print(f"SHAP version: {shap.__version__}")
    SHAP_AVAILABLE = True
except ImportError:
    print("SHAP not available. Install with: pip install shap")
    SHAP_AVAILABLE = False

# Configuration
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

print("\nSetup complete! All libraries imported successfully.")

## 1. What Makes XGBoost Special?

### XGBoost = eXtreme Gradient Boosting

Created by Tianqi Chen (2014), XGBoost dominated Kaggle competitions and became the go-to algorithm for tabular data.

### Key Innovations

#### 1.1 Regularized Learning Objective

Standard gradient boosting minimizes:
$$L = \sum_{i} l(y_i, \hat{y}_i)$$

XGBoost adds regularization:
$$L = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k)$$

where $\Omega(f) = \gamma T + \frac{1}{2}\lambda \|w\|^2 + \alpha \|w\|_1$
- $T$: number of leaves
- $\gamma$: minimum loss reduction to split (like min_impurity_decrease)
- $\lambda$: L2 regularization on leaf weights
- $\alpha$: L1 regularization on leaf weights
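
To make the regularized objective concrete, the sketch below computes the optimal leaf weight and the split gain from gradient and Hessian sums, following the closed forms in Chen & Guestrin (2016). The function names are hypothetical, and the L1 term is handled by soft-thresholding the gradient sum, which is an assumption here: the paper itself only derives the L2 and $\gamma$ case.

```python
def soft_threshold(g_sum, alpha):
    """Shrink the gradient sum toward zero by alpha (assumed L1 handling)."""
    if g_sum > alpha:
        return g_sum - alpha
    if g_sum < -alpha:
        return g_sum + alpha
    return 0.0


def leaf_weight(g_sum, h_sum, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal leaf weight: w* = -T_alpha(G) / (H + lambda)."""
    return -soft_threshold(g_sum, reg_alpha) / (h_sum + reg_lambda)


def leaf_score(g_sum, h_sum, reg_lambda=1.0, reg_alpha=0.0):
    """Structure score of one leaf: T_alpha(G)^2 / (H + lambda)."""
    return soft_threshold(g_sum, reg_alpha) ** 2 / (h_sum + reg_lambda)


def split_gain(g_left, h_left, g_right, h_right,
               reg_lambda=1.0, reg_alpha=0.0, gamma=0.0):
    """Loss reduction of a split; gamma is the price of the extra leaf."""
    parent = leaf_score(g_left + g_right, h_left + h_right, reg_lambda, reg_alpha)
    children = (leaf_score(g_left, h_left, reg_lambda, reg_alpha)
                + leaf_score(g_right, h_right, reg_lambda, reg_alpha))
    return 0.5 * (children - parent) - gamma


# Toy gradient statistics (made-up numbers, for illustration only)
w_unreg = leaf_weight(-4.0, 2.0, reg_lambda=0.0)               # no regularization
w_l2 = leaf_weight(-4.0, 2.0, reg_lambda=2.0)                  # L2 shrinks the weight
w_l1 = leaf_weight(-4.0, 2.0, reg_lambda=0.0, reg_alpha=1.0)   # L1 shrinks it too

gain_no_gamma = split_gain(-4.0, 2.0, 3.0, 2.0, reg_lambda=1.0, gamma=0.0)
gain_with_gamma = split_gain(-4.0, 2.0, 3.0, 2.0, reg_lambda=1.0, gamma=5.0)

print(w_unreg, w_l2, w_l1)
print(gain_no_gamma, gain_with_gamma)
```

A split is only worth keeping when its gain is positive, so raising `gamma` from 0 to 5 in this toy case turns an attractive split into one that would be pruned.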

#### 1.2 Sparsity-Aware Split Finding
- Automatically handles missing values
- Learns optimal direction for missing values during training
- No need for imputation!

#### 1.3 Weighted Quantile Sketch
- Efficient algorithm for finding split points
- Handles weighted data properly
- Much faster than exact greedy search
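
The idea behind weighted split candidates can be sketched with plain NumPy. The example below computes exact weighted quantiles on sorted data; it is not the mergeable sketch data structure from the paper, only an illustration of why weights (Hessians, in XGBoost's case) shift the candidate thresholds. The function name and the toy weights are assumptions.

```python
import numpy as np


def weighted_quantile_candidates(values, weights, n_bins):
    """Return thresholds that split the total weight into roughly equal parts."""
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    cum = np.cumsum(w) / w.sum()                 # weighted rank of each value in (0, 1]
    targets = np.arange(1, n_bins) / n_bins      # e.g. 0.25, 0.5, 0.75 for 4 bins
    idx = np.searchsorted(cum, targets, side="left")
    idx = np.minimum(idx, len(v) - 1)            # guard against rounding at the top end
    return np.unique(v[idx])


feature = [1, 2, 3, 4, 5, 6, 7, 8]

# Equal weights: candidates are spread evenly over the values
uniform_candidates = weighted_quantile_candidates(feature, [1] * 8, n_bins=4)

# Heavy weights on the two smallest values: candidates move toward them
skewed_candidates = weighted_quantile_candidates(
    feature, [5, 5, 1, 1, 1, 1, 1, 1], n_bins=4
)

print(uniform_candidates)
print(skewed_candidates)
```

Only the candidate thresholds are evaluated as splits, which is where the speedup over scanning every distinct value comes from.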

#### 1.4 System Optimizations
- Cache-aware access patterns
- Out-of-core computing (handles data larger than RAM)
- Parallel split finding within each tree (the trees themselves are still built sequentially)
- GPU acceleration support
- Distributed computing support

### Why XGBoost Wins Competitions

1. **Performance**: Often among the most accurate methods on structured (tabular) data
2. **Speed**: Often 10x+ faster than sklearn GradientBoosting on larger datasets (small datasets may show little difference)
3. **Flexibility**: Many tuning options
4. **Robustness**: Handles missing data, large datasets
5. **Built-in tools**: CV, early stopping, feature importance

### XGBoost vs Sklearn GradientBoosting

| Feature | Sklearn GB | XGBoost |
|---------|-----------|----------|
| Speed | Baseline | Often 10-50x faster on larger datasets |
| Missing values | Need imputation | Automatic |
| Regularization | Limited | L1, L2, gamma |
| Parallel training | No | Yes |
| GPU support | No | Yes |
| Built-in CV | No | Yes |
| Memory efficiency | Good | Excellent |

In [None]:
# Quick comparison: XGBoost vs Sklearn GradientBoosting
# Load dataset
cancer_data = load_breast_cancer()
X, y = cancer_data.data, cancer_data.target
feature_names = cancer_data.feature_names

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=RANDOM_STATE
)

print(f"Dataset: {len(X_train)} training samples, {X.shape[1]} features")
print(f"Classes: {np.unique(y)}, Distribution: {np.bincount(y)}")

In [None]:
# Train both models with similar parameters
n_estimators = 100

# Sklearn GradientBoosting
print("Training Sklearn GradientBoosting...")
start = time.time()
gb_sklearn = GradientBoostingClassifier(
    n_estimators=n_estimators,
    learning_rate=0.1,
    max_depth=3,
    random_state=RANDOM_STATE
)
gb_sklearn.fit(X_train, y_train)
sklearn_time = time.time() - start
sklearn_acc = gb_sklearn.score(X_test, y_test)

# XGBoost
print("Training XGBoost...")
start = time.time()
xgb_model = xgb.XGBClassifier(
    n_estimators=n_estimators,
    learning_rate=0.1,
    max_depth=3,
    random_state=RANDOM_STATE,
    eval_metric='logloss'
)
xgb_model.fit(X_train, y_train)
xgb_time = time.time() - start
xgb_acc = xgb_model.score(X_test, y_test)

# Compare
print("\n" + "=" * 70)
print("Comparison Results:")
print("=" * 70)
print(f"\nSklearn GradientBoosting:")
print(f"  Training time: {sklearn_time:.3f} seconds")
print(f"  Test accuracy: {sklearn_acc:.4f}")
print(f"\nXGBoost:")
print(f"  Training time: {xgb_time:.3f} seconds")
print(f"  Test accuracy: {xgb_acc:.4f}")
print(f"\nSpeed ratio (sklearn time / XGBoost time): {sklearn_time / xgb_time:.1f}x")
print(f"Accuracy difference: {(xgb_acc - sklearn_acc) * 100:.2f} percentage points")

## 2. XGBoost APIs

XGBoost provides multiple APIs:

### 2.1 Scikit-learn API (Easiest)
```python
xgb.XGBClassifier() / xgb.XGBRegressor()
```
- Familiar sklearn interface
- Easy integration with sklearn pipelines
- Good for most use cases

### 2.2 Native API (Most Powerful)
```python
xgb.DMatrix()  # Optimized data structure
xgb.train()    # Training function
```
- More control and features
- Slightly faster
- Required for advanced features

We'll use both in this notebook.

In [None]:
# Demonstrate native API with DMatrix
# DMatrix is XGBoost's optimized data structure

# Create DMatrix objects
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=feature_names)
dtest = xgb.DMatrix(X_test, label=y_test, feature_names=feature_names)

print("DMatrix created:")
print(f"Training samples: {dtrain.num_row()}")
print(f"Features: {dtrain.num_col()}")
print(f"\nDMatrix is more memory efficient and faster for XGBoost!")

## 3. Critical Hyperparameters

XGBoost has many parameters. Here are the most important ones:

### 3.1 Tree Parameters

**`max_depth`** (default=6): Maximum tree depth
- Higher → more complex model, risk of overfitting
- Typical: 3-10

**`min_child_weight`** (default=1): Minimum sum of instance weights (Hessians) required in a child node
- Higher → more conservative
- Helps prevent overfitting
- Typical: 1-10

**`gamma`** (default=0): Minimum loss reduction required to split
- Higher → more conservative (fewer splits)
- Acts as pruning: a split is kept only if it reduces the loss by at least `gamma`
- Typical: 0-5
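
The "instance weight" behind `min_child_weight` is the Hessian of the loss, not a raw row count. The sketch below, with a hypothetical helper name, shows this for logistic loss, where each sample contributes $p(1-p)$: a child full of confidently predicted samples carries very little weight and can be blocked even though it holds several rows. For squared error the Hessian is 1 per sample, so the parameter reduces to a minimum sample count.

```python
def logistic_hessian(p):
    """Second derivative of logistic loss with respect to the raw score."""
    return p * (1.0 - p)


def child_allowed(probabilities, min_child_weight=1.0):
    """Return (allowed, cover) for a candidate child node."""
    cover = sum(logistic_hessian(p) for p in probabilities)
    return cover >= min_child_weight, cover


# Four uncertain samples: each contributes 0.25, so the child just passes
ok_uncertain, cover_uncertain = child_allowed([0.5] * 4)

# Four confident samples: each contributes about 0.0099, so the child is rejected
ok_confident, cover_confident = child_allowed([0.99] * 4)

print(ok_uncertain, cover_uncertain)
print(ok_confident, round(cover_confident, 4))
```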

### 3.2 Boosting Parameters

**`learning_rate`** (eta, default=0.3): Step size shrinkage
- Lower → need more trees but better generalization
- Typical: 0.01-0.3

**`n_estimators`** (default=100): Number of trees
- Use with early stopping
- Typical: 100-1000

### 3.3 Sampling Parameters

**`subsample`** (default=1): Fraction of samples for each tree
- < 1.0 → stochastic gradient boosting
- Prevents overfitting
- Typical: 0.6-1.0

**`colsample_bytree`** (default=1): Fraction of features per tree
- Like Random Forest feature randomness
- Typical: 0.6-1.0

**`colsample_bylevel`**: Fraction of features per level
**`colsample_bynode`**: Fraction of features per split

### 3.4 Regularization Parameters

**`reg_alpha`** (default=0): L1 regularization
- Higher → simpler model
- Encourages sparse leaf weights (small weights are pushed to exactly zero)
- Typical: 0-1

**`reg_lambda`** (default=1): L2 regularization
- Higher → simpler model
- Smooths leaf weights
- Typical: 1-100

In [None]:
# Effect of key hyperparameters
# We'll test max_depth systematically

max_depths = [2, 3, 4, 5, 6, 7, 8, 10]
train_scores = []
test_scores = []

for depth in max_depths:
    model = xgb.XGBClassifier(
        n_estimators=100,
        max_depth=depth,
        learning_rate=0.1,
        random_state=RANDOM_STATE,
        eval_metric='logloss'
    )
    model.fit(X_train, y_train)
    
    train_scores.append(model.score(X_train, y_train))
    test_scores.append(model.score(X_test, y_test))

# Plot
plt.figure(figsize=(10, 6))
plt.plot(max_depths, train_scores, marker='o', linewidth=2, markersize=8, label='Train Accuracy')
plt.plot(max_depths, test_scores, marker='s', linewidth=2, markersize=8, label='Test Accuracy')
plt.axvline(6, color='red', linestyle='--', alpha=0.5, label='Default (6)')
plt.xlabel('max_depth', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('Effect of Tree Depth', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

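# Find the depth with the highest test accuracy
# (illustration only: in practice, select hyperparameters on a validation set, not the test set)
best_depth = max_depths[np.argmax(test_scores)]
print(f"\nBest max_depth on this split: {best_depth}")
print(f"Test accuracy: {max(test_scores):.4f}")
print(f"Train-test gap at depth={max_depths[-1]}: {train_scores[-1] - test_scores[-1]:.4f}")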

In [None]:
# Effect of regularization (L1 and L2)
reg_values = [0, 0.1, 1, 10, 100]

# Test L1 regularization (reg_alpha)
l1_train = []
l1_test = []

for alpha in reg_values:
    model = xgb.XGBClassifier(
        n_estimators=100,
        max_depth=4,
        reg_alpha=alpha,
        random_state=RANDOM_STATE,
        eval_metric='logloss'
    )
    model.fit(X_train, y_train)
    l1_train.append(model.score(X_train, y_train))
    l1_test.append(model.score(X_test, y_test))

# Test L2 regularization (reg_lambda)
l2_train = []
l2_test = []

for lambda_val in reg_values:
    model = xgb.XGBClassifier(
        n_estimators=100,
        max_depth=4,
        reg_lambda=lambda_val,
        random_state=RANDOM_STATE,
        eval_metric='logloss'
    )
    model.fit(X_train, y_train)
    l2_train.append(model.score(X_train, y_train))
    l2_test.append(model.score(X_test, y_test))

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# L1 regularization
axes[0].plot(reg_values, l1_train, marker='o', label='Train', linewidth=2)
axes[0].plot(reg_values, l1_test, marker='s', label='Test', linewidth=2)
axes[0].set_xlabel('L1 Regularization (alpha)', fontsize=11)
axes[0].set_ylabel('Accuracy', fontsize=11)
axes[0].set_title('L1 Regularization Effect', fontsize=12, fontweight='bold')
axes[0].set_xscale('symlog')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# L2 regularization
axes[1].plot(reg_values, l2_train, marker='o', label='Train', linewidth=2)
axes[1].plot(reg_values, l2_test, marker='s', label='Test', linewidth=2)
axes[1].set_xlabel('L2 Regularization (lambda)', fontsize=11)
axes[1].set_ylabel('Accuracy', fontsize=11)
axes[1].set_title('L2 Regularization Effect', fontsize=12, fontweight='bold')
axes[1].set_xscale('symlog')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nRegularization helps reduce overfitting!")
print("L1 (alpha): Shrinks leaf weights and pushes small ones to exactly zero")
print("L2 (lambda): Smooths leaf weights, reduces their magnitude")

## 4. Handling Missing Values

XGBoost's killer feature: automatic handling of missing values.

### How it works:
1. During training, XGBoost tries both directions for missing values at each split
2. Chooses the direction that improves the loss most
3. Remembers this default direction
4. At prediction time, missing values go to the learned default direction

**No imputation needed!** XGBoost learns the optimal treatment.
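
The four steps above can be sketched for a single candidate split. The example assumes the gradient and Hessian sums of the rows with observed values on each side and of the rows with missing values are already known; the function names and the toy numbers are hypothetical, and L1 and `gamma` are left out to keep it short.

```python
def leaf_score(g_sum, h_sum, reg_lambda=1.0):
    """Structure score of a leaf: G^2 / (H + lambda)."""
    return g_sum * g_sum / (h_sum + reg_lambda)


def best_default_direction(g_left, h_left, g_right, h_right,
                           g_miss, h_miss, reg_lambda=1.0):
    """Try sending the missing rows left, then right, and keep the better gain."""
    parent = leaf_score(g_left + g_right + g_miss,
                        h_left + h_right + h_miss, reg_lambda)

    # Option 1: missing rows join the left child
    gain_left = 0.5 * (leaf_score(g_left + g_miss, h_left + h_miss, reg_lambda)
                       + leaf_score(g_right, h_right, reg_lambda)
                       - parent)

    # Option 2: missing rows join the right child
    gain_right = 0.5 * (leaf_score(g_left, h_left, reg_lambda)
                        + leaf_score(g_right + g_miss, h_right + h_miss, reg_lambda)
                        - parent)

    if gain_left >= gain_right:
        return "left", gain_left
    return "right", gain_right


# Toy statistics: the missing rows have a positive gradient sum,
# just like the rows on the right side of the split
direction, gain = best_default_direction(
    g_left=-4.0, h_left=2.0,
    g_right=3.0, h_right=2.0,
    g_miss=2.0, h_miss=1.0,
)
print(direction, round(gain, 4))
```

Here the missing rows resemble the right-hand rows, so routing them right gives the larger gain and "right" would be stored as the default direction for this split.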

In [None]:
# Demonstrate missing value handling
# Artificially introduce missing values
X_missing = X_train.copy()
X_test_missing = X_test.copy()

# Randomly set 10% of values to NaN
np.random.seed(RANDOM_STATE)
mask_train = np.random.random(X_missing.shape) < 0.1
mask_test = np.random.random(X_test_missing.shape) < 0.1

X_missing[mask_train] = np.nan
X_test_missing[mask_test] = np.nan

print(f"Training data: {np.isnan(X_missing).sum()} missing values ({np.isnan(X_missing).sum() / X_missing.size * 100:.1f}%)")
print(f"Test data: {np.isnan(X_test_missing).sum()} missing values ({np.isnan(X_test_missing).sum() / X_test_missing.size * 100:.1f}%)")

In [None]:
# Train XGBoost on data with missing values (no imputation!)
xgb_missing = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=4,
    learning_rate=0.1,
    random_state=RANDOM_STATE,
    eval_metric='logloss'
)

xgb_missing.fit(X_missing, y_train)
missing_score = xgb_missing.score(X_test_missing, y_test)

# Compare with model trained on complete data
xgb_complete = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=4,
    learning_rate=0.1,
    random_state=RANDOM_STATE,
    eval_metric='logloss'
)
xgb_complete.fit(X_train, y_train)
complete_score = xgb_complete.score(X_test, y_test)

print("\nPerformance Comparison:")
print("=" * 50)
print(f"Complete data:      {complete_score:.4f}")
print(f"10% missing values: {missing_score:.4f}")
print(f"Accuracy drop:      {(complete_score - missing_score) * 100:.2f} percentage points")
print("\nXGBoost handles missing values gracefully!")
print("No imputation needed - it learns optimal treatment.")

## 5. Early Stopping with Validation Set

XGBoost makes early stopping easy with the `eval_set` parameter.
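
Before using the library feature, it may help to see the stopping rule itself. The sketch below is a simplified, pure-Python version of patience-based early stopping on a list of validation losses; the function name and the loss values are made up for illustration, and the library's own callback has more options than this.

```python
def early_stopping_round(val_losses, patience):
    """Return (best_iteration, last_round_trained), both 0-based."""
    best_iter, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            # New best validation loss: remember where it happened
            best_iter, best_loss = i, loss
        elif i - best_iter >= patience:
            # No improvement for `patience` rounds: stop here
            return best_iter, i
    # Patience never ran out: all rounds were trained
    return best_iter, len(val_losses) - 1


toy_val_losses = [0.60, 0.50, 0.45, 0.44, 0.46, 0.47, 0.45, 0.48, 0.49]
best_round, last_round = early_stopping_round(toy_val_losses, patience=3)

print(f"Best round: {best_round}, training stopped after round: {last_round}")
```

Training continues for `patience` rounds past the best one before stopping, so the number of rounds trained is larger than the best iteration.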

In [None]:
# Split data: train, validation, test
X_train_split, X_val, y_train_split, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=RANDOM_STATE
)

# Train with early stopping
# Note: XGBoost >= 1.6 takes early_stopping_rounds in the constructor;
# passing it to fit() was deprecated in 1.6 and removed in 2.0.
xgb_es = xgb.XGBClassifier(
    n_estimators=500,  # Upper bound; early stopping decides how many rounds are actually trained
    max_depth=4,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=RANDOM_STATE,
    eval_metric='logloss',
    early_stopping_rounds=20  # Stop if validation logloss has not improved for 20 rounds
)

# Fit with validation monitoring; the last eval_set entry drives early stopping
xgb_es.fit(
    X_train_split, y_train_split,
    eval_set=[(X_train_split, y_train_split), (X_val, y_val)],
    verbose=False
)

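# Count the rounds actually trained (training stops once 20 rounds pass without improvement)
n_rounds_trained = len(xgb_es.evals_result()['validation_1']['logloss'])

print("Early Stopping Results:")
print("=" * 50)
print(f"Best iteration (0-based): {xgb_es.best_iteration}")
print(f"Best validation logloss: {xgb_es.best_score:.4f}")
print(f"\nTrained {n_rounds_trained} rounds instead of 500!")
print(f"Skipped {500 - n_rounds_trained} unnecessary rounds.")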

# Evaluate on test set
test_acc = xgb_es.score(X_test, y_test)
print(f"\nFinal test accuracy: {test_acc:.4f}")

In [None]:
# Visualize learning curves
results = xgb_es.evals_result()

train_logloss = results['validation_0']['logloss']
val_logloss = results['validation_1']['logloss']

plt.figure(figsize=(12, 6))
plt.plot(train_logloss, label='Train', linewidth=2)
plt.plot(val_logloss, label='Validation', linewidth=2)
plt.axvline(xgb_es.best_iteration, color='red', linestyle='--', linewidth=2,
            label=f'Best iteration ({xgb_es.best_iteration})')
plt.xlabel('Boosting Round', fontsize=12)
plt.ylabel('Log Loss', fontsize=12)
plt.title('Training Progress with Early Stopping', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nKey Observation:")
print("Validation loss stops improving, triggering early stopping.")
print("This prevents overfitting and saves computation!")

## 6. Feature Importance

XGBoost provides three types of feature importance:

### 6.1 Weight (Frequency)
- Number of times feature is used in splits
- Simple count

### 6.2 Gain (Most Important!)
- Average gain when feature is used
- Measures improvement in loss
- **Most informative** for feature importance

### 6.3 Cover
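- Average coverage of the splits that use the feature
- Coverage is the sum of second-order gradients (Hessians) of the samples reaching a split; for squared error this equals the sample count
- Indicates breadth of impact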
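
To see how the three importance types can disagree, the sketch below aggregates them from a hand-written list of split records. The records, their numbers, and the helper name are made up for illustration; a real model exposes the aggregated values through `get_score(importance_type=...)`.

```python
from collections import defaultdict

# Hypothetical split records: one entry per split node across all trees
toy_splits = [
    {"feature": "worst radius", "gain": 120.0, "cover": 100.0},
    {"feature": "worst radius", "gain": 40.0, "cover": 60.0},
    {"feature": "mean texture", "gain": 10.0, "cover": 40.0},
    {"feature": "mean texture", "gain": 6.0, "cover": 20.0},
    {"feature": "mean texture", "gain": 2.0, "cover": 12.0},
]


def aggregate_importance(splits):
    """Compute weight, average gain, and average cover per feature."""
    counts = defaultdict(int)
    gain_sums = defaultdict(float)
    cover_sums = defaultdict(float)
    for split in splits:
        name = split["feature"]
        counts[name] += 1
        gain_sums[name] += split["gain"]
        cover_sums[name] += split["cover"]
    return {
        name: {
            "weight": counts[name],                     # number of splits
            "gain": gain_sums[name] / counts[name],     # average gain per split
            "cover": cover_sums[name] / counts[name],   # average cover per split
        }
        for name in counts
    }


toy_importance = aggregate_importance(toy_splits)
for name, scores in toy_importance.items():
    print(name, scores)
```

In this toy case "mean texture" wins on weight (3 splits versus 2) while "worst radius" wins clearly on gain, which is why rankings can change with the importance type.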

In [None]:
# Train model for feature importance analysis
xgb_importance = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=4,
    learning_rate=0.1,
    random_state=RANDOM_STATE,
    eval_metric='logloss'
)
xgb_importance.fit(X_train, y_train)

# Get different importance types
importance_weight = xgb_importance.get_booster().get_score(importance_type='weight')
importance_gain = xgb_importance.get_booster().get_score(importance_type='gain')
importance_cover = xgb_importance.get_booster().get_score(importance_type='cover')

# Convert to DataFrame for easier handling
def importance_to_df(importance_dict, feature_names):
    # XGBoost uses f0, f1, etc. as feature names
    importance_df = pd.DataFrame([
        {'Feature': feature_names[int(k[1:])], 'Importance': v}
        for k, v in importance_dict.items()
    ])
    return importance_df.sort_values('Importance', ascending=False)

df_weight = importance_to_df(importance_weight, feature_names)
df_gain = importance_to_df(importance_gain, feature_names)
df_cover = importance_to_df(importance_cover, feature_names)

print("Top 10 Features by Gain (Most Important):")
print("=" * 70)
print(df_gain.head(10).to_string(index=False))

In [None]:
# Visualize all three importance types
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

# Plot top 10 for each type
for ax, df, title in zip(axes, [df_weight, df_gain, df_cover], 
                          ['Weight (Frequency)', 'Gain (Average Improvement)', 'Cover (Sample Count)']):
    top_features = df.head(10)
    ax.barh(range(len(top_features)), top_features['Importance'], color='steelblue', edgecolor='black')
    ax.set_yticks(range(len(top_features)))
    ax.set_yticklabels(top_features['Feature'], fontsize=9)
    ax.invert_yaxis()
    ax.set_xlabel('Importance Score', fontsize=10)
    ax.set_title(title, fontsize=11, fontweight='bold')
    ax.grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

print("\nInterpretation:")
print("- Weight: How often feature is used")
print("- Gain: How much improvement when used (BEST for interpretation)")
print("- Cover: How many samples affected")

## 7. SHAP Values for Interpretability

SHAP (SHapley Additive exPlanations) provides the most principled way to interpret model predictions.

### Why SHAP?
- Based on game theory (Shapley values)
- Shows contribution of each feature to each prediction
- Attributes interaction effects fairly across the features involved
- Can explain individual predictions or global patterns

### SHAP for XGBoost
- Fast implementation using TreeSHAP
- Industry standard for model interpretation
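
The Shapley idea can be illustrated without any library by brute force on a tiny model. The sketch below averages each feature's marginal contribution over all feature orderings, replacing "absent" features with a fixed baseline value. This is an assumption made for simplicity: TreeSHAP uses a much faster tree-specific algorithm and a different treatment of absent features. The toy model and names are hypothetical.

```python
from itertools import permutations


def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating all feature orderings."""
    n_features = len(x)
    phi = [0.0] * n_features
    orderings = list(permutations(range(n_features)))
    for order in orderings:
        current = list(baseline)          # start with every feature "absent"
        previous_output = model(current)
        for j in order:
            current[j] = x[j]             # switch feature j to its real value
            new_output = model(current)
            phi[j] += new_output - previous_output
            previous_output = new_output
    return [total / len(orderings) for total in phi]


def toy_model(v):
    """One additive term plus one interaction term."""
    return 2.0 * v[0] + v[1] * v[2]


x_explain = [1.0, 2.0, 3.0]
x_baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x_explain, x_baseline)

print(phi)
print(sum(phi), toy_model(x_explain) - toy_model(x_baseline))
```

The additive feature receives its full effect, the two interacting features share their joint effect equally, and the values sum to the difference between the prediction and the baseline output.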

In [None]:
if SHAP_AVAILABLE:
    # Create SHAP explainer
    explainer = shap.TreeExplainer(xgb_importance)
    
    # Calculate SHAP values for test set (use subset for speed)
    X_test_sample = X_test[:100]
    shap_values = explainer.shap_values(X_test_sample)
    
    print("SHAP values computed!")
    print(f"Shape: {shap_values.shape}")
    print(f"One SHAP value per feature per sample")
    print(f"\nPositive SHAP → pushes prediction toward class 1 (benign)")
    print(f"Negative SHAP → pushes prediction toward class 0 (malignant)")
else:
    print("SHAP not installed. Install with: pip install shap")
    print("Skipping SHAP analysis.")

In [None]:
if SHAP_AVAILABLE:
    # Global feature importance (average absolute SHAP)
    plt.figure(figsize=(10, 8))
    shap.summary_plot(shap_values, X_test_sample, feature_names=feature_names, 
                     plot_type='bar', show=False)
    plt.title('Global Feature Importance (SHAP)', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    print("\nThis shows which features are most important overall.")
    print("Based on average absolute SHAP values across all predictions.")

In [None]:
if SHAP_AVAILABLE:
    # Detailed SHAP summary plot
    plt.figure(figsize=(10, 10))
    shap.summary_plot(shap_values, X_test_sample, feature_names=feature_names, show=False)
    plt.title('SHAP Summary Plot', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    print("\nHow to read this plot:")
    print("- Each dot is one prediction")
    print("- X-axis: SHAP value (impact on prediction)")
    print("- Color: Feature value (red=high, blue=low)")
    print("- Example: High 'worst perimeter' → negative SHAP → pushes toward class 0 (malignant)")

In [None]:
if SHAP_AVAILABLE:
    # Explain individual prediction
    sample_idx = 0
    
    plt.figure(figsize=(10, 6))
    shap.waterfall_plot(
        shap.Explanation(
            values=shap_values[sample_idx],
            base_values=explainer.expected_value,
            data=X_test_sample[sample_idx],
            feature_names=feature_names
        ),
        show=False
    )
    plt.title(f'SHAP Explanation for Sample {sample_idx}', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    actual = y_test[sample_idx]
    predicted = xgb_importance.predict(X_test_sample[sample_idx:sample_idx+1])[0]
    
    print(f"\nSample {sample_idx}:")
    print(f"Actual class: {actual}")
    print(f"Predicted class: {predicted}")
    print(f"\nWaterfall shows how each feature contributes to final prediction.")

## 8. Built-in Cross-Validation

XGBoost has efficient built-in cross-validation with `xgb.cv()`.

In [None]:
# Prepare data in DMatrix format for native API
dtrain_full = xgb.DMatrix(X_train, label=y_train)

# Set parameters
params = {
    'max_depth': 4,
    'eta': 0.1,  # learning rate
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'seed': RANDOM_STATE
}

# Run cross-validation
print("Running 5-fold cross-validation...")
cv_results = xgb.cv(
    params=params,
    dtrain=dtrain_full,
    num_boost_round=200,
    nfold=5,
    stratified=True,
    early_stopping_rounds=20,
    verbose_eval=False,
    seed=RANDOM_STATE
)

print("\nCross-Validation Results:")
print("=" * 50)
print(f"Best number of boosting rounds: {len(cv_results)}")
print(f"Best train logloss: {cv_results['train-logloss-mean'].iloc[-1]:.4f}")
print(f"Best test logloss:  {cv_results['test-logloss-mean'].iloc[-1]:.4f}")
print(f"Std deviation:      {cv_results['test-logloss-std'].iloc[-1]:.4f}")

In [None]:
# Visualize CV results
plt.figure(figsize=(12, 6))
plt.plot(cv_results['train-logloss-mean'], label='Train (mean)', linewidth=2)
plt.plot(cv_results['test-logloss-mean'], label='CV (mean)', linewidth=2)
plt.fill_between(
    range(len(cv_results)),
    cv_results['test-logloss-mean'] - cv_results['test-logloss-std'],
    cv_results['test-logloss-mean'] + cv_results['test-logloss-std'],
    alpha=0.2,
    label='CV std'
)
plt.xlabel('Boosting Round', fontsize=12)
plt.ylabel('Log Loss', fontsize=12)
plt.title('Cross-Validation Learning Curves', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nBuilt-in CV is fast and convenient!")
print("Use for hyperparameter tuning without extra code.")

## Exercises

### Exercise 1: Comprehensive Hyperparameter Tuning

Use RandomizedSearchCV or GridSearchCV to tune XGBoost on a regression dataset:

Parameters to tune:
- `max_depth`: [3, 5, 7]
- `learning_rate`: [0.01, 0.05, 0.1]
- `n_estimators`: [100, 200, 500]
- `subsample`: [0.6, 0.8, 1.0]
- `colsample_bytree`: [0.6, 0.8, 1.0]
- `reg_alpha`: [0, 0.1, 1]
- `reg_lambda`: [1, 10, 100]

1. Find best parameters
2. Analyze which parameters matter most
3. Compare with default parameters
4. Measure training time vs performance trade-off

In [None]:
# Your code here


### Exercise 2: Learning Rate Schedule Experiment

Test the hypothesis: "Lower learning rate + more estimators = better performance"

Train models with:
1. lr=0.3, n_estimators=50
2. lr=0.1, n_estimators=150
3. lr=0.03, n_estimators=500
4. lr=0.01, n_estimators=1500

For each:
- Use early stopping
- Track train/validation performance
- Measure training time
- Plot learning curves

**Is there a clear winner? What's the sweet spot?**

In [None]:
# Your code here


### Exercise 3: Feature Engineering Impact

On a dataset of your choice:

1. Train baseline XGBoost on raw features
2. Engineer new features:
   - Polynomial features (degree 2)
   - Interaction terms
   - Binned/discretized features
3. Train XGBoost on engineered features
4. Use SHAP to see which engineered features are important
5. Compare performance

**Does XGBoost benefit from feature engineering, or does it capture interactions automatically?**

In [None]:
# Your code here


### Exercise 4: Imbalanced Classification Challenge

Create or load a highly imbalanced dataset (90:10 or worse). Apply multiple strategies:

1. **No adjustment**: Standard XGBoost
2. **`scale_pos_weight`**: Set to (negative samples / positive samples)
3. **Alternative `eval_metric`**: Use AUC instead of logloss
4. **Threshold tuning**: Adjust decision threshold on probabilities
5. **Focal loss**: Implement custom objective (advanced)

Compare using:
- ROC-AUC
- Precision-Recall curve
- F1 score
- Confusion matrix

**Which approach works best?**

In [None]:
# Your code here


### Exercise 5: Monotonic Constraints

XGBoost supports monotonic constraints: force feature relationships to be monotonic.

Use a dataset where you know the relationship (e.g., house prices):
- Square footage should increase price (monotonic increasing)
- Age might decrease price (monotonic decreasing)

1. Train without constraints
2. Train with monotonic constraints
3. Compare predictions and feature relationships
4. Visualize how constraints change the learned feature-to-prediction curves (e.g., with partial dependence plots)

**Hint**: Use `monotone_constraints` parameter

In [None]:
# Your code here


## Summary

### Key Concepts

1. **XGBoost Innovations**:
   - Regularized objective (L1, L2, gamma)
   - Sparsity-aware algorithm (automatic missing value handling)
   - Weighted quantile sketch (efficient split finding)
   - System optimizations (parallel, cache-aware, out-of-core)

2. **Critical Hyperparameters**:
   - `max_depth`: Tree complexity (3-10)
   - `learning_rate`: Shrinkage (0.01-0.3)
   - `n_estimators`: Number of trees (use early stopping)
   - `subsample`, `colsample_bytree`: Randomness
   - `reg_alpha`, `reg_lambda`: Regularization
   - `gamma`: Minimum split loss reduction

3. **Best Practices**:
   - Use early stopping with validation set
   - Start with defaults, tune systematically
   - Lower learning rate + more trees often generalizes better, at the cost of longer training
   - Use regularization to prevent overfitting
   - Leverage built-in CV for hyperparameter search

4. **Feature Importance**:
   - **Gain**: Most informative (average improvement)
   - Weight: Frequency of use
   - Cover: Average Hessian-weighted sample coverage (the sample count for squared error)
   - Use SHAP for principled interpretation

5. **Advantages Over Sklearn GB**:
   - Often 10-50x faster on larger datasets
   - Handles missing values automatically
   - More regularization options
   - Parallel training
   - Built-in CV and early stopping
   - Production-ready features

### Strengths

- State-of-the-art performance on tabular data
- Fast training (parallelized)
- Handles missing values elegantly
- Extensive regularization options
- Production-ready (used at scale)
- Great documentation and community
- GPU support for massive datasets

### Weaknesses

- Many hyperparameters (can be overwhelming)
- Requires careful tuning for optimal performance
- Often slower than LightGBM on very large datasets
- Not ideal for extrapolation
- Can overfit with poor hyperparameters

### When to Use XGBoost

**Ideal for**:
- Kaggle competitions (proven winner)
- Production ML systems (fast + accurate)
- Tabular data with complex patterns
- When you need interpretability (SHAP)
- Data with missing values

**Consider alternatives**:
- LightGBM: Even faster, especially on large data
- CatBoost: Better for categorical features
- Neural nets: Images, text, audio
- Linear models: Need interpretability, linear relationships

### Tuning Strategy

**Phase 1: Tree structure**
1. Fix `learning_rate=0.1`, `n_estimators=100`
2. Tune `max_depth` (3-10)
3. Tune `min_child_weight` (1-10)
4. Tune `gamma` (0-5)

**Phase 2: Randomness**
5. Tune `subsample` (0.6-1.0)
6. Tune `colsample_bytree` (0.6-1.0)

**Phase 3: Regularization**
7. Tune `reg_alpha` (0-10)
8. Tune `reg_lambda` (1-100)

**Phase 4: Learning rate**
9. Lower `learning_rate` (0.01-0.05)
10. Increase `n_estimators` (500-1000+)
11. Use early stopping

### Production Tips

1. **Model serialization**: Use `model.save_model()` and `load_model()`
2. **Version control**: Save hyperparameters with model
3. **Monitoring**: Track feature importance changes over time
4. **Retraining**: Set up pipeline for regular retraining
5. **A/B testing**: Compare model versions in production

### What's Next?

In **Module 06: LightGBM**, we'll explore:
- Microsoft's gradient boosting implementation
- GOSS and EFB algorithms for speed
- Leaf-wise vs level-wise growth
- Histogram-based learning
- Comparison with XGBoost

### Additional Resources

- **Paper**: "XGBoost: A Scalable Tree Boosting System" (Chen & Guestrin, 2016)
- **Documentation**: [XGBoost Official Docs](https://xgboost.readthedocs.io/)
- **Tutorial**: [Complete Guide to Parameter Tuning](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)
- **SHAP**: [Official SHAP documentation](https://shap.readthedocs.io/)
- **Book**: "Hands-On Gradient Boosting with XGBoost and scikit-learn" by Corey Wade