# Module 05: XGBoost - Extreme Gradient Boosting

**Difficulty**: ⭐⭐
**Estimated Time**: 50 minutes
**Prerequisites**: 
- Module 00: Introduction to Ensemble Methods
- Module 04: Gradient Boosting Machines

## Learning Objectives
By the end of this notebook, you will be able to:
1. Understand what makes XGBoost different from standard gradient boosting
2. Install and use the XGBoost library for classification and regression
3. Explain XGBoost's key innovations: regularization, tree pruning, and parallel processing
4. Tune important hyperparameters: max_depth, eta, subsample, colsample_bytree
5. Implement early stopping to optimize training
6. Analyze and interpret feature importance scores
7. Handle missing values effectively with XGBoost

## 1. Introduction to XGBoost

### What is XGBoost?

**XGBoost** (Extreme Gradient Boosting) is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable. Created by Tianqi Chen in 2014, it has become **one of the most widely used machine learning libraries** for structured (tabular) data and has been part of many winning Kaggle solutions.

### Why XGBoost is Better than Standard Gradient Boosting:

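1. **Regularization**:
   - Adds L1 (Lasso) and L2 (Ridge) penalties on leaf weights to the training objective to limit overfitting
   - Standard GB has no explicit penalty term; it relies on shrinkage, subsampling, and tree-size limits

2. **Smarter Tree Pruning**:
   - Grows each tree up to "max_depth" and then prunes backward
   - Removes splits whose gain does not exceed the threshold set by gamma
   - Can keep a weak split that leads to a strong split deeper in the tree, which a grower that stops at the first unhelpful split would miss

3. **Built-in Cross-Validation**:
   - Provides `xgb.cv`, which evaluates k-fold CV metrics at every boosting round
   - Makes it easier to choose the number of rounds and tune hyperparameters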

4. **Handling Missing Values**:
   - Automatically learns the best direction for missing values
   - No need to impute before training

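5. **Parallel Processing**:
   - While trees are built sequentially, XGBoost parallelizes the construction of each tree (the split search runs across features and threads)
   - Often much faster than a single-threaded standard GB implementation

6. **Depth-Wise Tree Growth**:
   - By default, trees grow level by level until max_depth is reached, and pruning happens afterward
   - Keeps tree size bounded and predictable, which helps control memory use and training time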

7. **Hardware Optimization**:
   - Cache-aware access patterns
   - Out-of-core computing for datasets that don't fit in memory

## 2. Setup and Installation

In [None]:
# Install XGBoost if not already installed
# Uncomment the line below if you need to install
# !pip install xgboost

# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from time import time

# XGBoost
import xgboost as xgb
from xgboost import XGBClassifier, XGBRegressor

# Scikit-learn
from sklearn.datasets import make_classification, make_regression, load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    mean_squared_error, r2_score, mean_absolute_error
)
from sklearn.ensemble import GradientBoostingClassifier

# Configuration
%matplotlib inline
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

# Set random seed for reproducibility
np.random.seed(42)

# Check XGBoost version
print(f"XGBoost version: {xgb.__version__}")

## 3. XGBoost for Classification

Let's start with a classification example using the breast cancer dataset.

In [None]:
# Load the breast cancer dataset
cancer_data = load_breast_cancer()
X = cancer_data.data
y = cancer_data.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"\nFeatures: {cancer_data.feature_names[:5]}... (30 total)")
print(f"Classes: {cancer_data.target_names}")

In [None]:
# Train a basic XGBoost classifier
xgb_clf = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42,
    eval_metric='logloss'  # Suppress warning about default metric
)

# Fit the model
xgb_clf.fit(X_train, y_train)

# Make predictions
y_pred_train = xgb_clf.predict(X_train)
y_pred_test = xgb_clf.predict(X_test)

# Evaluate
train_acc = accuracy_score(y_train, y_pred_train)
test_acc = accuracy_score(y_test, y_pred_test)

print("XGBoost Classifier Performance:")
print(f"Train Accuracy: {train_acc:.4f}")
print(f"Test Accuracy: {test_acc:.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred_test, target_names=cancer_data.target_names))

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_test)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=cancer_data.target_names,
            yticklabels=cancer_data.target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('XGBoost Confusion Matrix')
plt.show()

### Comparison with Standard Gradient Boosting

In [None]:
# Train sklearn's GradientBoostingClassifier for comparison
start_time = time()
gb_clf = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
gb_clf.fit(X_train, y_train)
gb_time = time() - start_time
gb_acc = gb_clf.score(X_test, y_test)

# Train XGBoost again and time it
start_time = time()
xgb_clf_timed = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42,
    eval_metric='logloss'
)
xgb_clf_timed.fit(X_train, y_train)
xgb_time = time() - start_time
xgb_acc = xgb_clf_timed.score(X_test, y_test)

# Compare
print("Performance Comparison:")
print(f"\nStandard Gradient Boosting:")
print(f"  Accuracy: {gb_acc:.4f}")
print(f"  Training Time: {gb_time:.3f} seconds")
print(f"\nXGBoost:")
print(f"  Accuracy: {xgb_acc:.4f}")
print(f"  Training Time: {xgb_time:.3f} seconds")
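# A ratio above 1 means XGBoost was faster; single-run timings on a small dataset are noisy
print(f"\nTime ratio (GB / XGBoost): {gb_time/xgb_time:.2f}x")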

## 4. Key XGBoost Hyperparameters

XGBoost has many hyperparameters. Here are the most important ones:

### Tree-Specific Parameters:

1. **max_depth** (default=6):
   - Maximum depth of trees
   - Higher values → more complex models, risk of overfitting
   - Typical range: 3-10

2. **min_child_weight** (default=1):
   - Minimum sum of instance weight needed in a child
   - Higher values → more conservative, prevents overfitting
   - Typical range: 1-10

3. **gamma** (default=0):
   - Minimum loss reduction required to make a split
   - Higher values → more conservative tree pruning
   - Typical range: 0-5

### Boosting Parameters:

4. **eta** (learning_rate, default=0.3):
   - Step size shrinkage to prevent overfitting
   - Lower values → need more trees but better generalization
   - Typical range: 0.01-0.3

5. **n_estimators** (default=100):
   - Number of boosting rounds (trees)
   - More trees → better performance up to a point

### Randomness Parameters:

6. **subsample** (default=1):
   - Fraction of training rows sampled for each tree
   - Values < 1.0 can reduce overfitting
   - Typical range: 0.5-1.0

7. **colsample_bytree** (default=1):
   - Fraction of features sampled for each tree
   - Similar in spirit to Random Forest's max_features, although that is applied at each split rather than once per tree
   - Typical range: 0.3-1.0

8. **colsample_bylevel** (default=1):
   - Fraction of features used at each level
   - Adds more randomness

### Regularization Parameters:

9. **reg_alpha** (default=0):
   - L1 regularization term on weights
   - Higher values → more regularization

10. **reg_lambda** (default=1):
    - L2 regularization term on weights
    - Higher values → more regularization
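
To make the roles of `gamma` and `reg_lambda` concrete, here is a small sketch of the split-gain and leaf-weight formulas from the XGBoost paper (Chen & Guestrin, 2016). It is an illustration in plain Python, not XGBoost's actual code: the function names and the example numbers are made up for this notebook, `G` and `H` stand for the sums of first- and second-order gradients in a node, and the L1 term (`reg_alpha`) is left out for simplicity.

```python
def leaf_weight(G, H, reg_lambda=1.0):
    """Optimal leaf value for a node: larger reg_lambda shrinks it toward 0."""
    return -G / (H + reg_lambda)


def leaf_score(G, H, reg_lambda=1.0):
    """Quality term of a node (higher means a better fit)."""
    return G * G / (H + reg_lambda)


def split_gain(G_left, H_left, G_right, H_right, reg_lambda=1.0, gamma=0.0):
    """Loss reduction from splitting a node into left and right children.

    gamma is subtracted as a fixed cost per split, so a split is only
    worth keeping when the returned value is positive.
    """
    parent = leaf_score(G_left + G_right, H_left + H_right, reg_lambda)
    children = (leaf_score(G_left, H_left, reg_lambda)
                + leaf_score(G_right, H_right, reg_lambda))
    return 0.5 * (children - parent) - gamma


# Hypothetical gradient statistics for one candidate split
stats = dict(G_left=-4.0, H_left=3.0, G_right=5.0, H_right=4.0)

gain_default = split_gain(**stats)               # reg_lambda=1, gamma=0
gain_high_gamma = split_gain(**stats, gamma=5.0)  # same split, higher cost

print(f"Gain with gamma=0: {gain_default:.4f}")
print(f"Gain with gamma=5: {gain_high_gamma:.4f} (negative, so the split would be pruned)")
print(f"Left leaf weight: {leaf_weight(stats['G_left'], stats['H_left']):.4f}")
```

In this picture, `min_child_weight` is a lower bound on `H` in each child, which is why it acts as a minimum "amount of data" per leaf.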

### Effect of Learning Rate (eta)

In [None]:
# Test different learning rates
learning_rates = [0.01, 0.05, 0.1, 0.3]
results = []

for eta in learning_rates:
    model = XGBClassifier(
        n_estimators=100,
        learning_rate=eta,
        max_depth=3,
        random_state=42,
        eval_metric='logloss'
    )
    model.fit(X_train, y_train)
    
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    
    results.append({
        'learning_rate': eta,
        'train_acc': train_acc,
        'test_acc': test_acc
    })
    
    print(f"Learning Rate: {eta}")
    print(f"  Train: {train_acc:.4f}, Test: {test_acc:.4f}")

# Visualize
results_df = pd.DataFrame(results)

fig, ax = plt.subplots(figsize=(10, 6))
x = np.arange(len(learning_rates))
width = 0.35

ax.bar(x - width/2, results_df['train_acc'], width, label='Train', alpha=0.8)
ax.bar(x + width/2, results_df['test_acc'], width, label='Test', alpha=0.8)

ax.set_xlabel('Learning Rate')
ax.set_ylabel('Accuracy')
ax.set_title('Effect of Learning Rate on Performance')
ax.set_xticks(x)
ax.set_xticklabels(learning_rates)
ax.legend()
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

### Effect of Max Depth

In [None]:
# Test different max depths
max_depths = [2, 3, 4, 5, 7, 10]
depth_results = []

for depth in max_depths:
    model = XGBClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=depth,
        random_state=42,
        eval_metric='logloss'
    )
    model.fit(X_train, y_train)
    
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    
    depth_results.append({
        'max_depth': depth,
        'train_acc': train_acc,
        'test_acc': test_acc,
        'gap': train_acc - test_acc
    })

depth_df = pd.DataFrame(depth_results)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Accuracy plot
axes[0].plot(depth_df['max_depth'], depth_df['train_acc'], 'o-', 
             label='Train', linewidth=2, markersize=8)
axes[0].plot(depth_df['max_depth'], depth_df['test_acc'], 's-', 
             label='Test', linewidth=2, markersize=8)
axes[0].set_xlabel('Max Depth')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Accuracy vs Max Depth')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Overfitting gap plot
axes[1].plot(depth_df['max_depth'], depth_df['gap'], 'ro-', linewidth=2, markersize=8)
axes[1].set_xlabel('Max Depth')
axes[1].set_ylabel('Train-Test Gap')
axes[1].set_title('Overfitting Gap vs Max Depth')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\nBest max_depth based on test accuracy:")
best_idx = depth_df['test_acc'].idxmax()
print(f"Max Depth: {depth_df.loc[best_idx, 'max_depth']}")
print(f"Test Accuracy: {depth_df.loc[best_idx, 'test_acc']:.4f}")

### Effect of Subsample and Colsample

In [None]:
# Test different subsample and colsample_bytree values
sample_params = [
    (1.0, 1.0),
    (0.8, 1.0),
    (1.0, 0.8),
    (0.8, 0.8),
    (0.7, 0.7),
    (0.5, 0.5)
]

sample_results = []

for subsample, colsample in sample_params:
    model = XGBClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=3,
        subsample=subsample,
        colsample_bytree=colsample,
        random_state=42,
        eval_metric='logloss'
    )
    model.fit(X_train, y_train)
    
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    
    sample_results.append({
        'subsample': subsample,
        'colsample': colsample,
        'params': f"{subsample}/{colsample}",
        'train_acc': train_acc,
        'test_acc': test_acc
    })
    
    print(f"Subsample: {subsample}, Colsample: {colsample}")
    print(f"  Train: {train_acc:.4f}, Test: {test_acc:.4f}")

# Visualize
sample_df = pd.DataFrame(sample_results)

fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(len(sample_params))
width = 0.35

ax.bar(x - width/2, sample_df['train_acc'], width, label='Train', alpha=0.8)
ax.bar(x + width/2, sample_df['test_acc'], width, label='Test', alpha=0.8)

ax.set_xlabel('Subsample / Colsample_bytree')
ax.set_ylabel('Accuracy')
ax.set_title('Effect of Subsample and Colsample on Performance')
ax.set_xticks(x)
ax.set_xticklabels(sample_df['params'])
ax.legend()
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

**Key Insight**: Using subsample and colsample_bytree < 1.0 often reduces overfitting (smaller train-test gap) while maintaining good test performance. This adds randomness similar to Random Forest.

## 5. Early Stopping

XGBoost has excellent built-in early stopping support. It monitors a validation set and stops training when performance stops improving.
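
The stopping rule itself is simple. The sketch below illustrates the "patience" logic in plain Python under the assumption of a lower-is-better metric such as log loss; the function name and the loss values are hypothetical, and this is not XGBoost's internal implementation.

```python
def early_stopping_point(val_losses, patience):
    """Return (best_round, stop_round) for a lower-is-better metric.

    Training stops at the first round where the metric has gone
    `patience` consecutive rounds without beating its best value.
    """
    best_round = 0
    best_loss = float("inf")
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_round = round_idx
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    # Never triggered: training ran through all rounds
    return best_round, len(val_losses) - 1


# Hypothetical validation log loss per boosting round
losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.44, 0.45, 0.46, 0.47]

best_round, stop_round = early_stopping_point(losses, patience=3)
print(f"Best round: {best_round}, stopped at round: {stop_round}")
```

Note that a small patience can stop too early: with `patience=2` the same losses would stop at round 4 and miss the better value at round 5.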

In [None]:
# Create a validation set
X_train_sub, X_val, y_train_sub, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)

# Train with early stopping
xgb_early = XGBClassifier(
    n_estimators=1000,  # Set high, early stopping will determine actual number
    learning_rate=0.1,
    max_depth=3,
    early_stopping_rounds=10,  # Stop if no improvement for 10 rounds (constructor argument in XGBoost >= 1.6)
    random_state=42,
    eval_metric='logloss'
)

# Fit with validation set
xgb_early.fit(
    X_train_sub, y_train_sub,
    eval_set=[(X_val, y_val)],
    verbose=False  # Set to True to see training progress
)

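print(f"Best iteration (0-indexed): {xgb_early.best_iteration}")
print(f"Best validation log loss: {xgb_early.best_score:.4f}")
# n_estimators is only the configured upper limit, not the number of rounds actually used
print(f"Maximum rounds allowed: {xgb_early.n_estimators}")
print(f"Rounds used for prediction: {xgb_early.best_iteration + 1}")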
print(f"\nTest Accuracy: {xgb_early.score(X_test, y_test):.4f}")

In [None]:
# Visualize training progress with eval_set results
# Retrain while tracking log loss on both the training and validation sets
# (verbose=False keeps the per-round output quiet; the metrics are still recorded)
xgb_verbose = XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
    random_state=42,
    eval_metric='logloss'
)

# Fit with both training and validation sets tracked
xgb_verbose.fit(
    X_train_sub, y_train_sub,
    eval_set=[(X_train_sub, y_train_sub), (X_val, y_val)],
    verbose=False
)

# Get evaluation results
results = xgb_verbose.evals_result()
train_logloss = results['validation_0']['logloss']
val_logloss = results['validation_1']['logloss']

# Plot
plt.figure(figsize=(10, 6))
plt.plot(train_logloss, label='Train', linewidth=2)
plt.plot(val_logloss, label='Validation', linewidth=2)
plt.xlabel('Boosting Round')
plt.ylabel('Log Loss')
plt.title('XGBoost Training Progress')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

# Find best iteration
best_iter = np.argmin(val_logloss)
print(f"Best iteration: {best_iter}")
print(f"Best validation log loss: {val_logloss[best_iter]:.4f}")

## 6. Feature Importance

XGBoost provides multiple types of feature importance:

1. **'weight'**: Number of times a feature appears in trees
2. **'gain'**: Average gain when the feature is used for splitting (the default for `feature_importances_`; `get_score` and `plot_importance` default to 'weight')
3. **'cover'**: Average coverage of the feature when used in trees
4. **'total_gain'**: Total gain of the feature
5. **'total_cover'**: Total coverage of the feature

In [None]:
# Train a model for feature importance analysis
xgb_fi = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42,
    eval_metric='logloss'
)

xgb_fi.fit(X_train, y_train)

# Get feature importance using different metrics
importance_types = ['weight', 'gain', 'cover']

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for idx, imp_type in enumerate(importance_types):
    # Get importance scores
    importance_dict = xgb_fi.get_booster().get_score(importance_type=imp_type)
    
    # Convert to DataFrame for easier plotting
    importance_df = pd.DataFrame({
        'feature': [f"f{i}" for i in range(len(cancer_data.feature_names))],
        'feature_name': cancer_data.feature_names,
        'importance': [importance_dict.get(f"f{i}", 0) for i in range(len(cancer_data.feature_names))]
    }).sort_values('importance', ascending=False).head(10)
    
    # Plot
    axes[idx].barh(range(len(importance_df)), importance_df['importance'])
    axes[idx].set_yticks(range(len(importance_df)))
    axes[idx].set_yticklabels(importance_df['feature_name'], fontsize=9)
    axes[idx].set_xlabel('Importance Score')
    axes[idx].set_title(f'Feature Importance ({imp_type})')
    axes[idx].invert_yaxis()

plt.tight_layout()
plt.show()

In [None]:
# Built-in XGBoost plot_importance function
from xgboost import plot_importance

fig, ax = plt.subplots(figsize=(10, 8))
plot_importance(xgb_fi, ax=ax, max_num_features=15, importance_type='gain')
plt.title('Top 15 Feature Importances (by Gain)')
plt.tight_layout()
plt.show()

## 7. Handling Missing Values

One of XGBoost's powerful features is its ability to handle missing values automatically. It learns the optimal direction to send missing values during training.

In [None]:
# Create a dataset with missing values
np.random.seed(42)
X_missing = X_train.copy()

# Randomly set 10% of values to NaN
mask = np.random.rand(*X_missing.shape) < 0.1
X_missing[mask] = np.nan

print(f"Original training data shape: {X_train.shape}")
print(f"Number of missing values: {np.isnan(X_missing).sum()}")
print(f"Percentage of missing values: {np.isnan(X_missing).sum() / X_missing.size * 100:.2f}%")

In [None]:
# XGBoost can handle missing values directly - no imputation needed!
xgb_missing = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42,
    eval_metric='logloss'
)

# Train on data with missing values
xgb_missing.fit(X_missing, y_train)

# Evaluate on test set (no missing values)
y_pred_missing = xgb_missing.predict(X_test)
acc_missing = accuracy_score(y_test, y_pred_missing)

# Compare with model trained on complete data
xgb_complete = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42,
    eval_metric='logloss'
)
xgb_complete.fit(X_train, y_train)
acc_complete = xgb_complete.score(X_test, y_test)

print("Performance Comparison:")
print(f"\nModel trained on complete data:")
print(f"  Test Accuracy: {acc_complete:.4f}")
print(f"\nModel trained on data with 10% missing values:")
print(f"  Test Accuracy: {acc_missing:.4f}")
print(f"\nAccuracy difference: {abs(acc_complete - acc_missing):.4f}")

**Key Insight**: XGBoost handles missing values gracefully without requiring imputation. It learns the optimal default direction for missing values during training, which can actually be more effective than simple imputation strategies.

## 8. XGBoost for Regression

In [None]:
# Create a regression dataset
X_reg, y_reg = make_regression(
    n_samples=1000,
    n_features=10,
    n_informative=8,
    noise=10,
    random_state=42
)

# Split the data
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Train XGBoost Regressor
xgb_reg = XGBRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)

xgb_reg.fit(X_train_reg, y_train_reg)

# Predictions
y_pred_train_reg = xgb_reg.predict(X_train_reg)
y_pred_test_reg = xgb_reg.predict(X_test_reg)

# Evaluate
train_r2 = r2_score(y_train_reg, y_pred_train_reg)
test_r2 = r2_score(y_test_reg, y_pred_test_reg)
train_mse = mean_squared_error(y_train_reg, y_pred_train_reg)
test_mse = mean_squared_error(y_test_reg, y_pred_test_reg)

print("XGBoost Regressor Performance:")
print(f"Train R²: {train_r2:.4f}")
print(f"Test R²: {test_r2:.4f}")
print(f"Train MSE: {train_mse:.2f}")
print(f"Test MSE: {test_mse:.2f}")

In [None]:
# Visualize predictions
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Training set
axes[0].scatter(y_train_reg, y_pred_train_reg, alpha=0.5)
axes[0].plot([y_train_reg.min(), y_train_reg.max()],
             [y_train_reg.min(), y_train_reg.max()],
             'r--', linewidth=2, label='Perfect prediction')
axes[0].set_xlabel('Actual values')
axes[0].set_ylabel('Predicted values')
axes[0].set_title(f'Training Set (R² = {train_r2:.4f})')
axes[0].legend()

# Test set
axes[1].scatter(y_test_reg, y_pred_test_reg, alpha=0.5)
axes[1].plot([y_test_reg.min(), y_test_reg.max()],
             [y_test_reg.min(), y_test_reg.max()],
             'r--', linewidth=2, label='Perfect prediction')
axes[1].set_xlabel('Actual values')
axes[1].set_ylabel('Predicted values')
axes[1].set_title(f'Test Set (R² = {test_r2:.4f})')
axes[1].legend()

plt.tight_layout()
plt.show()

## 9. Hyperparameter Tuning with Grid Search

In [None]:
# Define parameter grid
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.3],
    'n_estimators': [50, 100],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Create base model
xgb_base = XGBClassifier(
    random_state=42,
    eval_metric='logloss'
)

# Grid search with cross-validation
# Note: This can take a while! Using a smaller grid for demonstration
grid_search = GridSearchCV(
    xgb_base,
    param_grid,
    cv=3,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)

# Number of parameter combinations = product of the number of values per parameter
n_combinations = int(np.prod([len(values) for values in param_grid.values()]))

print("Starting grid search...")
print(f"Total combinations to test: {n_combinations}")
print("This may take a few minutes...\n")

grid_search.fit(X_train, y_train)

print("\nBest parameters:")
print(grid_search.best_params_)
print(f"\nBest cross-validation score: {grid_search.best_score_:.4f}")
print(f"Test set score: {grid_search.score(X_test, y_test):.4f}")

In [None]:
# Analyze grid search results
results_df = pd.DataFrame(grid_search.cv_results_)

# Show top 10 parameter combinations
print("Top 10 parameter combinations:")
print(results_df[['params', 'mean_test_score', 'std_test_score']]
      .sort_values('mean_test_score', ascending=False)
      .head(10))

## 10. Exercises

Now it's your turn to practice! Complete the following exercises.

### Exercise 1: Basic XGBoost Training

Create a synthetic classification dataset with 800 samples and 20 features. Train an XGBoost classifier with:
- 100 estimators
- Learning rate of 0.15
- Max depth of 5

Calculate and compare training vs test accuracy. Is there overfitting?

In [None]:
# Your code here
# Step 1: Create dataset using make_classification
# Step 2: Split into train/test sets
# Step 3: Train XGBClassifier
# Step 4: Evaluate and analyze overfitting


### Exercise 2: Regularization Effect

Using the breast cancer dataset:
1. Train three XGBoost models with different regularization:
   - No regularization: reg_alpha=0, reg_lambda=0
   - L2 only: reg_alpha=0, reg_lambda=1
   - Both L1 and L2: reg_alpha=1, reg_lambda=1
2. Compare their training and test accuracies
3. Which regularization setting works best?

In [None]:
# Your code here
# Step 1: Train models with different regularization parameters
# Step 2: Evaluate each model
# Step 3: Compare results


### Exercise 3: Early Stopping Analysis

Train an XGBoost classifier on the breast cancer data with:
- n_estimators=500
- Early stopping enabled (early_stopping_rounds=15)
- Use a validation set

Questions:
1. At what iteration does training stop?
2. How much computational time is saved compared to training all 500 trees?
3. What is the test accuracy?

In [None]:
# Your code here
# Step 1: Split data into train, validation, and test sets
# Step 2: Time the training with early stopping
# Step 3: Train without early stopping for comparison
# Step 4: Compare iterations and timing


### Exercise 4: Feature Selection with Importance

Using the breast cancer dataset:
1. Train an XGBoost model and get feature importances (use 'gain')
2. Create a new model using only the top 15 most important features
3. Compare the performance of the full model vs the reduced model
4. How much does performance degrade with fewer features?

In [None]:
# Your code here
# Step 1: Train model and extract feature importances
# Step 2: Select top 15 features
# Step 3: Train new model with reduced features
# Step 4: Compare performances


### Exercise 5: Missing Value Experiment

Create an experiment to test XGBoost's missing value handling:
1. Create a copy of the breast cancer training data
2. Randomly set 20% of values to NaN
3. Train two models:
   - XGBoost on data with missing values (no imputation)
   - XGBoost on data with mean imputation
4. Compare test set performance
5. Which approach works better?

In [None]:
# Your code here
# Step 1: Create data with missing values
# Step 2: Create imputed version using SimpleImputer
# Step 3: Train both models
# Step 4: Compare performances


## 11. Summary

In this notebook, you learned about XGBoost, one of the most popular gradient boosting implementations:

### Key Concepts:

1. **XGBoost Innovations**:
   - L1/L2 regularization prevents overfitting
   - Smart tree pruning using gamma parameter
   - Parallel processing for faster training
   - Native missing value handling
   - Built-in cross-validation

2. **Important Hyperparameters**:
   - **max_depth**: Tree complexity (3-10)
   - **eta (learning_rate)**: Step size shrinkage (0.01-0.3)
   - **subsample**: Row sampling ratio (0.5-1.0)
   - **colsample_bytree**: Column sampling ratio (0.3-1.0)
   - **gamma**: Minimum loss reduction for splits
   - **reg_alpha, reg_lambda**: Regularization terms

3. **Feature Importance Types**:
   - 'weight': Frequency of feature usage
   - 'gain': Average gain from feature splits (often the most informative)
   - 'cover': Coverage of feature splits

### Best Practices:

1. **Start with conservative defaults**:
   ```python
   XGBClassifier(
       n_estimators=100,
       learning_rate=0.1,
       max_depth=3,
       subsample=0.8,
       colsample_bytree=0.8
   )
   ```

2. **Use early stopping** to find optimal number of trees
3. **Lower learning rate + more trees** often gives better results
4. **Add randomness** via subsample and colsample to reduce overfitting
5. **Use regularization** (gamma, reg_alpha, reg_lambda) when overfitting
6. **Monitor train/validation curves** to detect overfitting

### Advantages:

- **State-of-the-art performance** on structured/tabular data
- **Fast training** due to parallelization and optimizations
- **Handles missing values** automatically
- **Built-in regularization** prevents overfitting
- **Flexible** - supports custom objectives and metrics
- **Feature importance** for interpretability
- **Cross-platform** - runs on Linux, macOS, and Windows, on CPU or GPU

### Disadvantages:

- **Many hyperparameters** to tune
- **Can overfit** if not carefully configured
- **Sequential training** (though each tree is parallelized)
- **Memory intensive** for large datasets
- **Less interpretable** than simple models

### When to Use XGBoost:

✅ **Use XGBoost when:**
- Working with structured/tabular data
- Need state-of-the-art performance
- Have missing values in data
- Dataset is small to medium sized (< 10M rows)
- Participating in ML competitions

❌ **Consider alternatives when:**
- Working with images, text, or sequences (use deep learning)
- Need real-time predictions with strict latency requirements
- Interpretability is critical (use linear models or decision trees)
- Dataset is extremely large (consider LightGBM)

### What's Next?

In the next module, we'll explore **LightGBM** (Light Gradient Boosting Machine), Microsoft's gradient boosting framework that's optimized for:
- **Faster training speed** on large datasets
- **Lower memory usage**
- **Better accuracy** in some scenarios
- **Native categorical feature** support

We'll compare LightGBM with XGBoost and learn when to choose each one!

## Additional Resources

- [XGBoost Documentation](https://xgboost.readthedocs.io/)
- [XGBoost Parameters Explained](https://xgboost.readthedocs.io/en/latest/parameter.html)
- [Original XGBoost Paper (Chen & Guestrin, 2016)](https://arxiv.org/abs/1603.02754)
- [XGBoost Python Tutorial](https://xgboost.readthedocs.io/en/latest/python/python_intro.html)
- [Complete Guide to Parameter Tuning in XGBoost](https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/)