# XGBoost - Complete Guide

## 📚 Learning Objectives
- Understand XGBoost algorithm and its advantages
- Implement XGBoost for regression and classification
- Master hyperparameter tuning
- Learn feature importance analysis
- Handle imbalanced datasets
- Optimize model performance

## 🎯 What is XGBoost?

**XGBoost (eXtreme Gradient Boosting)** is an optimized distributed gradient boosting library designed to be:
- **Highly efficient**: Parallel processing, cache optimization
- **Flexible**: Custom objectives and evaluation metrics
- **Portable**: Works on various platforms

### Key Features:
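1. **Regularization**: L1 (`reg_alpha`) and L2 (`reg_lambda`) penalties on leaf weights to reduce overfitting
2. **Handling Missing Values**: Learns a default split direction for missing entries, so `NaN` inputs are accepted as-is
3. **Tree Pruning**: Grows each tree up to `max_depth`, then prunes back splits whose gain falls below `gamma`
4. **Built-in Cross-Validation**: `xgb.cv` evaluates every boosting round across folds
5. **Parallel Processing**: Split finding is parallelized across CPU cores for fast training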
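
To make the regularization and pruning items concrete, here is a minimal pure-Python sketch of the split-gain formula from the XGBoost paper. It assumes squared-error loss (gradient = prediction - target, hessian = 1 per row); the function name `split_gain` and the toy numbers are illustrative and not part of the library.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=0.0):
    """Gain of splitting one node into a left and a right child.

    g_* and h_* are the sums of first- and second-order gradients of the
    loss over the rows that fall into each child.
    """
    def score(g, h):
        # Structure score of a leaf; a larger reg_lambda shrinks it.
        return g * g / (h + reg_lambda)

    parent = score(g_left + g_right, h_left + h_right)
    children = score(g_left, h_left) + score(g_right, h_right)
    # gamma is a fixed cost charged for adding the split.
    return 0.5 * (children - parent) - gamma


# Toy node with four rows (hessian = 1 per row under squared error).
# Two rows pull predictions one way, two rows pull the other way.
g_left, h_left = -4.0, 2.0
g_right, h_right = 4.0, 2.0

gain_no_penalty = split_gain(g_left, h_left, g_right, h_right, reg_lambda=0.0)
gain_l2 = split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0)
gain_with_gamma = split_gain(g_left, h_left, g_right, h_right, reg_lambda=1.0, gamma=6.0)

print(f"No penalty:      {gain_no_penalty:.3f}")
print(f"With L2:         {gain_l2:.3f}")
print(f"With L2 + gamma: {gain_with_gamma:.3f}")
```

The L2 penalty lowers the gain of the same split, and a large enough `gamma` makes it negative, which is the condition under which the split is pruned.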

### Why XGBoost?
✅ State-of-the-art performance on structured data  
✅ Wins many Kaggle competitions  
✅ Handles various types of data  
✅ Built-in regularization  
✅ Feature importance  

In [None]:
# Install XGBoost if not already installed
# !pip install xgboost

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
from xgboost import XGBRegressor, XGBClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.datasets import load_breast_cancer
import warnings
warnings.filterwarnings('ignore')

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

print(f"XGBoost version: {xgb.__version__}")

## Part 1: XGBoost for Regression
### 1️⃣ Load and Prepare Data

In [None]:
# Load California Housing dataset
df = pd.read_csv('../../Linear Regression/data/dataset.csv')

print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nMissing values:")
print(df.isnull().sum())

# Display first few rows
df.head()

In [None]:
# Prepare features and target
target_col = 'median_house_value'
X = df.drop(columns=[target_col])
y = df[target_col]

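# Handle categorical variables
if 'ocean_proximity' in X.columns:
    # Label-encode for simplicity. This imposes an arbitrary order on the categories;
    # one-hot encoding or XGBoost's native categorical support
    # (enable_categorical=True in recent versions) are alternatives.
    le = LabelEncoder()
    X['ocean_proximity'] = le.fit_transform(X['ocean_proximity'])
    print(f"\nEncoded ocean_proximity: {le.classes_}")

# Handle missing values (XGBoost accepts NaN natively; we fill with medians for comparison).
# Note: the medians are computed on the full dataset, before the train-test split.
X = X.fillna(X.median())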

print(f"\nFeatures shape: {X.shape}")
print(f"Target shape: {y.shape}")

In [None]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

### 2️⃣ Basic XGBoost Model

In [None]:
# Create and train basic XGBoost model
xgb_basic = XGBRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    random_state=42,
    n_jobs=-1  # Use all CPU cores
)

# Train model
print("Training XGBoost model...")
xgb_basic.fit(X_train, y_train)
print("✅ Training complete!")

# Make predictions
y_train_pred = xgb_basic.predict(X_train)
y_test_pred = xgb_basic.predict(X_test)

# Evaluate
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))
train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)

print(f"\n📊 Basic XGBoost Performance:")
print(f"\nTraining Metrics:")
print(f"  RMSE: ${train_rmse:,.2f}")
print(f"  R²:   {train_r2:.4f}")
print(f"\nTest Metrics:")
print(f"  RMSE: ${test_rmse:,.2f}")
print(f"  R²:   {test_r2:.4f}")
print(f"\nOverfitting Check: {train_r2 - test_r2:.4f}")

### 3️⃣ Feature Importance Analysis

In [None]:
# Get feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': xgb_basic.feature_importances_
}).sort_values('importance', ascending=False)

print("\n🔍 Feature Importance:")
print(feature_importance)

# Visualize feature importance
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Bar plot
feature_importance.plot(x='feature', y='importance', kind='barh', ax=ax1, color='skyblue', edgecolor='black')
ax1.set_xlabel('Importance Score', fontsize=12)
ax1.set_ylabel('Features', fontsize=12)
ax1.set_title('Feature Importance (Gain)', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3, axis='x')

# Built-in XGBoost plot
xgb.plot_importance(xgb_basic, ax=ax2, importance_type='weight', max_num_features=10)
ax2.set_title('Feature Importance (Weight)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

### 4️⃣ Hyperparameter Tuning

#### Key XGBoost Parameters:

**Tree Parameters:**
- `max_depth`: Maximum depth of trees (3-10)
- `min_child_weight`: Minimum sum of instance weight in a child (1-10)
- `gamma`: Minimum loss reduction for split (0-5)

**Boosting Parameters:**
- `learning_rate` (eta): Step size shrinkage (0.01-0.3)
- `n_estimators`: Number of boosting rounds (100-1000)
- `subsample`: Fraction of samples for training (0.5-1.0)
- `colsample_bytree`: Fraction of features for training (0.5-1.0)

**Regularization:**
- `reg_alpha`: L1 regularization on leaf weights (default 0; typically 0-1)
- `reg_lambda`: L2 regularization on leaf weights (default 1; larger values regularize more strongly)
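
As a rough illustration of how `reg_alpha`, `reg_lambda`, and `learning_rate` act on a single leaf, the sketch below computes a leaf's output from the summed gradient `G` and summed hessian `H` of the rows in that leaf. It follows the closed-form leaf weight used for tree boosters; the helper name `leaf_value` and the numbers are illustrative assumptions.

```python
def leaf_value(G, H, reg_alpha=0.0, reg_lambda=1.0, learning_rate=0.1):
    """Shrunken leaf output for summed gradient G and summed hessian H."""
    # L1 penalty: soft-threshold the gradient sum toward zero.
    if G > reg_alpha:
        G_thresholded = G - reg_alpha
    elif G < -reg_alpha:
        G_thresholded = G + reg_alpha
    else:
        G_thresholded = 0.0
    # L2 penalty: enlarge the denominator, which shrinks the weight.
    raw_weight = -G_thresholded / (H + reg_lambda)
    # learning_rate (eta) scales each tree's contribution.
    return learning_rate * raw_weight


# A leaf holding 4 rows under squared-error loss: H = 4, G = sum(pred - target).
G, H = -8.0, 4.0

unregularized = leaf_value(G, H, reg_alpha=0.0, reg_lambda=0.0, learning_rate=1.0)
with_l2 = leaf_value(G, H, reg_alpha=0.0, reg_lambda=1.0, learning_rate=1.0)
with_l1_l2 = leaf_value(G, H, reg_alpha=3.0, reg_lambda=1.0, learning_rate=1.0)
shrunk = leaf_value(G, H, reg_alpha=3.0, reg_lambda=1.0, learning_rate=0.1)

print(unregularized)  # 2.0
print(with_l2)        # 1.6
print(with_l1_l2)     # 1.0
print(shrunk)
```

Each penalty pulls the leaf value toward zero, and a smaller `learning_rate` scales it down further, which is why lower learning rates usually need more boosting rounds.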

In [None]:
# Define parameter grid for tuning
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.3],
    'n_estimators': [100, 200, 300],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'reg_alpha': [0, 0.1],
    'reg_lambda': [1, 1.5]
}

# Use RandomizedSearchCV for efficiency: the full grid has 432 combinations,
# so GridSearchCV with cv=3 would need 1,296 fits
print("🔍 Starting hyperparameter tuning...")
print("This may take a few minutes...\n")

xgb_random = RandomizedSearchCV(
    XGBRegressor(random_state=42, n_jobs=-1),
    param_distributions=param_grid,
    n_iter=20,  # Number of parameter settings sampled
    cv=3,
    scoring='neg_mean_squared_error',
    random_state=42,
    n_jobs=-1,
    verbose=1
)

xgb_random.fit(X_train, y_train)

print(f"\n🏆 Best Parameters: {xgb_random.best_params_}")
print(f"Best CV Score (RMSE): ${np.sqrt(-xgb_random.best_score_):,.2f}")
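
To see why a randomized search is used here, the arithmetic below counts the model fits each approach would need for the grid defined above. The counts exclude the final refit on the full training set; `cv_folds` and `n_iter` simply mirror the `cv=3` and `n_iter=20` arguments.

```python
from math import prod

# Same grid as in the tuning cell above.
param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.3],
    'n_estimators': [100, 200, 300],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'reg_alpha': [0, 0.1],
    'reg_lambda': [1, 1.5],
}
cv_folds = 3
n_iter = 20

n_combinations = prod(len(values) for values in param_grid.values())
grid_fits = n_combinations * cv_folds   # exhaustive GridSearchCV
random_fits = n_iter * cv_folds         # RandomizedSearchCV as configured

print(f"Grid combinations:       {n_combinations}")  # 432
print(f"GridSearchCV fits:       {grid_fits}")       # 1296
print(f"RandomizedSearchCV fits: {random_fits}")     # 60
print(f"Share of grid sampled:   {n_iter / n_combinations:.1%}")  # 4.6%
```

The randomized search covers under 5% of the grid, so its best parameters are a good starting point and not a guaranteed optimum.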

In [None]:
# Evaluate tuned model
best_xgb = xgb_random.best_estimator_

y_train_pred_tuned = best_xgb.predict(X_train)
y_test_pred_tuned = best_xgb.predict(X_test)

train_rmse_tuned = np.sqrt(mean_squared_error(y_train, y_train_pred_tuned))
test_rmse_tuned = np.sqrt(mean_squared_error(y_test, y_test_pred_tuned))
train_r2_tuned = r2_score(y_train, y_train_pred_tuned)
test_r2_tuned = r2_score(y_test, y_test_pred_tuned)

print(f"\n📊 Tuned XGBoost Performance:")
print(f"\nTraining Metrics:")
print(f"  RMSE: ${train_rmse_tuned:,.2f}")
print(f"  R²:   {train_r2_tuned:.4f}")
print(f"\nTest Metrics:")
print(f"  RMSE: ${test_rmse_tuned:,.2f}")
print(f"  R²:   {test_r2_tuned:.4f}")

# Compare with basic model
print(f"\n📈 Improvement:")
print(f"  RMSE: ${test_rmse - test_rmse_tuned:,.2f} better")
print(f"  R²:   {test_r2_tuned - test_r2:.4f} better")

### 5️⃣ Learning Curves and Early Stopping

In [None]:
# Train with early stopping
xgb_early = XGBRegressor(
    n_estimators=1000,
    learning_rate=0.1,
    max_depth=5,
    random_state=42,
    n_jobs=-1,
    early_stopping_rounds=50  # Stop if no improvement for 50 rounds
)

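# Fit with evaluation sets; early stopping monitors the last one in the list.
# Note: the test set doubles as the early-stopping set here to keep the demo short.
# That leaks test information into model selection, so prefer a separate
# validation split in practice.
xgb_early.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_test, y_test)],
    verbose=False
)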

# Get evaluation results
results = xgb_early.evals_result()

# Plot learning curves
epochs = len(results['validation_0']['rmse'])
x_axis = range(0, epochs)

fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(x_axis, results['validation_0']['rmse'], label='Train', linewidth=2)
ax.plot(x_axis, results['validation_1']['rmse'], label='Test', linewidth=2)
ax.axvline(x=xgb_early.best_iteration, color='r', linestyle='--', 
           linewidth=2, label=f'Best Iteration ({xgb_early.best_iteration})')
ax.set_xlabel('Boosting Rounds', fontsize=12)
ax.set_ylabel('RMSE', fontsize=12)
ax.set_title('XGBoost Learning Curves with Early Stopping', fontsize=14, fontweight='bold')
ax.legend(fontsize=12)
ax.grid(True, alpha=0.3)
plt.show()

print(f"\nBest iteration: {xgb_early.best_iteration}")
print(f"Best RMSE: ${xgb_early.best_score:,.2f}")

## Part 2: XGBoost for Classification
### 6️⃣ Binary Classification Example

In [None]:
# Load breast cancer dataset
cancer = load_breast_cancer()
X_cancer = pd.DataFrame(cancer.data, columns=cancer.feature_names)
y_cancer = pd.Series(cancer.target, name='diagnosis')

print(f"Classification dataset shape: {X_cancer.shape}")
print(f"\nClass distribution:")
print(y_cancer.value_counts())
print(f"\nClass names: {cancer.target_names}")

# Split data
X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_cancer, y_cancer, test_size=0.2, random_state=42, stratify=y_cancer
)

In [None]:
# Train XGBoost classifier
xgb_clf = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    random_state=42,
    n_jobs=-1,
    eval_metric='logloss'  # For binary classification
)

xgb_clf.fit(X_train_c, y_train_c)

# Predictions
y_pred_c = xgb_clf.predict(X_test_c)
y_pred_proba_c = xgb_clf.predict_proba(X_test_c)[:, 1]

# Evaluate
accuracy = accuracy_score(y_test_c, y_pred_c)
roc_auc = roc_auc_score(y_test_c, y_pred_proba_c)

print(f"\n📊 XGBoost Classification Performance:")
print(f"Accuracy: {accuracy:.4f}")
print(f"ROC-AUC: {roc_auc:.4f}")
print(f"\n📋 Classification Report:")
print(classification_report(y_test_c, y_pred_c, target_names=cancer.target_names))

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test_c, y_pred_c)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=cancer.target_names,
            yticklabels=cancer.target_names,
            cbar_kws={'label': 'Count'})
plt.xlabel('Predicted', fontsize=12)
plt.ylabel('Actual', fontsize=12)
plt.title('XGBoost Classification - Confusion Matrix', fontsize=14, fontweight='bold')
plt.show()
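
For imbalanced binary problems, one common starting point is `scale_pos_weight`, set to roughly the ratio of negative to positive training examples. The sketch below computes that ratio for a made-up label vector. The labels and the `imbalance_params` dictionary are illustrative, and whether the reweighting helps should be checked with metrics such as ROC-AUC or the classification report.

```python
from collections import Counter

# Hypothetical imbalanced training labels: 950 negatives, 50 positives.
y_imbalanced_demo = [0] * 950 + [1] * 50

class_counts = Counter(y_imbalanced_demo)
scale_pos_weight = class_counts[0] / class_counts[1]

imbalance_params = {
    "n_estimators": 100,
    "learning_rate": 0.1,
    "max_depth": 5,
    "eval_metric": "logloss",
    # Upweight positives so both classes contribute comparably to the loss.
    "scale_pos_weight": scale_pos_weight,
}

print(f"Class counts: {dict(class_counts)}")
print(f"scale_pos_weight: {scale_pos_weight}")  # 19.0
# Usage with the classifier from this guide: XGBClassifier(**imbalance_params)
```

Combining this with a stratified split, as in the `train_test_split` call above, keeps the class ratio consistent between the training and test sets.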

### 7️⃣ Model Comparison: XGBoost vs Others

In [None]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression

# Define models
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42, n_jobs=-1),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42),
    'XGBoost': XGBRegressor(n_estimators=100, random_state=42, n_jobs=-1)
}

# Train and evaluate all models
results_comparison = {}

for name, model in models.items():
    print(f"Training {name}...")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    
    results_comparison[name] = {'RMSE': rmse, 'R2': r2}

# Display results
comparison_df = pd.DataFrame(results_comparison).T
print("\n📊 Model Comparison:")
print(comparison_df.sort_values('R2', ascending=False))

# Visualize comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

comparison_df['RMSE'].plot(kind='barh', ax=ax1, color='coral', edgecolor='black')
ax1.set_xlabel('RMSE ($)', fontsize=12)
ax1.set_title('Model Comparison - RMSE', fontsize=14, fontweight='bold')
ax1.grid(True, alpha=0.3, axis='x')

comparison_df['R2'].plot(kind='barh', ax=ax2, color='skyblue', edgecolor='black')
ax2.set_xlabel('R² Score', fontsize=12)
ax2.set_title('Model Comparison - R²', fontsize=14, fontweight='bold')
ax2.grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

## 📊 Key Takeaways

### XGBoost Advantages:
✅ **Strong Performance**: Often among the best choices for structured data  
✅ **Built-in Regularization**: L1/L2 penalties and `gamma` help limit overfitting  
✅ **Handles Missing Values**: Imputation is optional  
✅ **Feature Importance**: Built-in importance scores (weight, gain)  
✅ **Parallel Processing**: Fast training  
✅ **Early Stopping**: Halts training when the validation metric stops improving  

### Best Practices:
1. **Start with default parameters**, then tune
2. **Use early stopping** to prevent overfitting
3. **Monitor learning curves** for convergence
4. **Feature scaling is optional**: tree splits depend only on the ordering of feature values, so scaling rarely changes results
5. **Use cross-validation** for robust evaluation

### Hyperparameter Tuning Strategy:
1. **Fix learning_rate** at 0.1
2. **Tune tree parameters**: max_depth, min_child_weight
3. **Tune regularization**: reg_alpha, reg_lambda
4. **Tune sampling**: subsample, colsample_bytree
5. **Lower learning_rate** and increase n_estimators

### When to Use XGBoost:
✅ Structured/tabular data  
✅ Medium to large datasets  
✅ Need for feature-level insight (importance scores, SHAP)  
✅ Kaggle competitions  
✅ Production systems (fast inference)  

### Common Pitfalls:
❌ Over-tuning on test set  
❌ Not using early stopping  
❌ Ignoring feature engineering  
❌ Too many boosting rounds  
❌ Not monitoring overfitting  

### Next Steps:
1. Try **LightGBM** for comparison
2. Implement **custom objectives**
3. Use **SHAP** for model interpretation
4. Deploy model to production
5. Monitor model performance over time