<h1 align="center"> <strong>🚀 XGBoost - Complete Guide</strong> </h1>

This notebook provides a comprehensive introduction to XGBoost (eXtreme Gradient Boosting), covering:
- Conceptual foundation and gradient boosting
- Implementation with XGBoost library
- Model evaluation and interpretation
- Hyperparameter tuning and optimization
- Comparison with Random Forest and Decision Trees
- Advanced features and techniques

---

## **📚 1. Import Libraries and Setup**

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning
from sklearn.model_selection import train_test_split, cross_val_score, validation_curve
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import (accuracy_score, confusion_matrix, classification_report, 
                           mean_squared_error, r2_score, mean_absolute_error,
                           precision_score, recall_score, f1_score, log_loss)
from sklearn.datasets import make_classification, make_regression, load_iris, load_wine, load_breast_cancer

# XGBoost
import xgboost as xgb
from xgboost import XGBClassifier, XGBRegressor

# Hyperparameter tuning
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Visualization settings
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)

# Suppress warnings
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully!")
print(f"XGBoost version: {xgb.__version__}")

## **🧠 2. Conceptual Foundation**

### **What is XGBoost?** 🤔

XGBoost (eXtreme Gradient Boosting) is an optimized gradient boosting library designed to be efficient, flexible, and portable. It builds an ensemble of decision trees one after another, with each new tree trained to reduce the loss left by the trees before it.

### **Key Concepts:**

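1. **Gradient Boosting**: Sequentially builds models, each correcting the errors of the previous ones
2. **Regularization**: L1 (`reg_alpha`) and L2 (`reg_lambda`) penalties on leaf weights to prevent overfitting
3. **Tree Pruning**: Splits whose gain does not exceed the `gamma` threshold are pruned away
4. **Parallel Processing**: Split finding within each tree is parallelized across threads; the trees themselves are still built sequentially
5. **Handling Missing Values**: Learns a default direction at each split for rows with missing values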

### **How XGBoost Works:**

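1. **Initialize**: Start with a constant prediction (e.g., the mean of the target for squared-error loss)
2. **Calculate Residuals**: Compute the errors of the current prediction; in general these are the negative gradients of the loss, and XGBoost also uses second derivatives (Hessians)
3. **Build Tree**: Train a tree to predict those residuals
4. **Update Model**: Add the new tree's output, scaled by the learning rate
5. **Repeat**: Continue until the maximum number of rounds is reached or early stopping triggers
6. **Final Prediction**: Initial prediction plus the sum of all scaled tree outputs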
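
The six steps above can be sketched in a few lines of plain Python. This is a minimal illustration for squared-error loss on a single feature, using one-split "stumps" instead of full trees. The helper names `fit_stump` and `boost` are made up for this sketch and are not part of XGBoost, which additionally uses second-order gradients and regularization.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on a 1-D feature that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    base = sum(ys) / len(ys)                  # Step 1: initialize with the mean
    preds = [base] * len(ys)
    stumps, history = [], []
    for _ in range(n_rounds):                 # Step 5: repeat
        residuals = [y - p for y, p in zip(ys, preds)]   # Step 2: residuals
        stump = fit_stump(xs, residuals)                 # Step 3: fit a tree to them
        preds = [p + learning_rate * stump(x)            # Step 4: shrunken update
                 for p, x in zip(preds, xs)]
        stumps.append(stump)
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))

    def predict(x):                           # Step 6: base value plus all scaled stumps
        return base + learning_rate * sum(s(x) for s in stumps)

    return predict, history


# Toy data: y = x^2 on a small grid
xs = [i / 10 for i in range(40)]
ys = [x ** 2 for x in xs]
predict, history = boost(xs, ys)
print(f"Training MSE after 1 round: {history[0]:.3f}, after 50 rounds: {history[-1]:.3f}")
```

Because each stump is a least-squares fit to the current residuals, the training error cannot increase from one round to the next for any learning rate between 0 and 1.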

### **Mathematical Foundation:**

#### Objective Function:
$$\text{Obj}^{(t)} = \sum_{i=1}^{n} l(y_i, \hat{y}_i^{(t)}) + \sum_{i=1}^{t} \Omega(f_i)$$

Where:
- $l$ is the loss function
- $\Omega(f_i)$ is the regularization term for tree $f_i$

#### Regularization:
$$\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$

Where:
- $T$ = number of leaves
- $w_j$ = leaf weights
- $\gamma$ = complexity penalty
- $\lambda$ = L2 regularization
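
As a sketch of why this particular penalty is convenient, the standard second-order derivation is outlined below. The symbols $g_i$, $h_i$, $I_j$, $G_j$ and $H_j$ are introduced here for the derivation and are not used elsewhere in this notebook. Expanding the loss to second order around the previous prediction $\hat{y}_i^{(t-1)}$ and treating the penalties of earlier trees as constants gives:

$$\text{Obj}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t(x_i)^2 \right] + \Omega(f_t) + \text{const}$$

where $g_i$ and $h_i$ are the first and second derivatives of $l(y_i, \hat{y})$ with respect to $\hat{y}$, evaluated at $\hat{y}_i^{(t-1)}$. Let $I_j$ be the set of samples in leaf $j$, with $G_j = \sum_{i \in I_j} g_i$ and $H_j = \sum_{i \in I_j} h_i$. Since every sample in leaf $j$ receives the same weight $w_j$:

$$\text{Obj}^{(t)} \approx \sum_{j=1}^{T} \left[ G_j w_j + \frac{1}{2}(H_j + \lambda) w_j^2 \right] + \gamma T + \text{const}$$

Each term is a quadratic in $w_j$, so the optimal leaf weight and the resulting objective are:

$$w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad \text{Obj}^* = -\frac{1}{2}\sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T + \text{const}$$

Splitting one leaf into a left child $L$ and a right child $R$ therefore changes the objective by:

$$\text{Gain} = \frac{1}{2}\left[ \frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda} \right] - \gamma$$

A split is only worthwhile when this gain is positive, which is how $\gamma$ acts as a minimum loss reduction and $\lambda$ shrinks leaf weights toward zero.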

#### Learning Rate:
$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + \eta f_t(x_i)$$

Where $\eta$ is the learning rate (shrinkage parameter).
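
To make the roles of $\lambda$ and $\gamma$ concrete, here is a small numeric sketch. It assumes squared-error loss $\frac{1}{2}(y - \hat{y})^2$, for which each sample has gradient $\hat{y} - y$ and Hessian 1, and it uses the standard closed-form leaf weight $-G/(H+\lambda)$, where $G$ and $H$ are the sums of gradients and Hessians in the leaf. The function names and the toy targets are made up for illustration.

```python
def leaf_weight(gradients, hessians, reg_lambda):
    """Closed-form optimal leaf weight: -G / (H + lambda)."""
    return -sum(gradients) / (sum(hessians) + reg_lambda)


def structure_score(gradients, hessians, reg_lambda):
    """G^2 / (H + lambda): how much a leaf can reduce the objective."""
    return sum(gradients) ** 2 / (sum(hessians) + reg_lambda)


def split_gain(left, right, reg_lambda, gamma):
    """Gain of splitting a leaf; `left` and `right` are (gradients, hessians) pairs."""
    (g_l, h_l), (g_r, h_r) = left, right
    parent = structure_score(g_l + g_r, h_l + h_r, reg_lambda)
    children = (structure_score(g_l, h_l, reg_lambda)
                + structure_score(g_r, h_r, reg_lambda))
    return 0.5 * (children - parent) - gamma


# Toy leaf: five targets, all currently predicted as 2.0
targets = [1.0, 1.2, 0.8, 3.0, 3.2]
current_prediction = 2.0
gradients = [current_prediction - y for y in targets]  # squared-error gradient
hessians = [1.0] * len(targets)                        # squared-error Hessian

w_plain = leaf_weight(gradients, hessians, reg_lambda=0.0)
w_shrunk = leaf_weight(gradients, hessians, reg_lambda=1.0)
print(f"Leaf weight, lambda=0: {w_plain:.4f}; lambda=1: {w_shrunk:.4f}")

# Candidate split: first three samples go left, last two go right
left = (gradients[:3], hessians[:3])
right = (gradients[3:], hessians[3:])
gain_kept = split_gain(left, right, reg_lambda=1.0, gamma=0.0)
gain_pruned = split_gain(left, right, reg_lambda=1.0, gamma=2.0)
print(f"Split gain, gamma=0: {gain_kept:.4f}; gamma=2: {gain_pruned:.4f}")
```

A larger $\lambda$ pulls the leaf weight toward zero, and a large enough $\gamma$ turns an otherwise useful split into a negative-gain split that would be pruned.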

### **XGBoost vs Other Methods:**

| Feature | Decision Tree | Random Forest | XGBoost |
|---------|---------------|---------------|---------|
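| **Method** | Single tree | Independent trees (bagging) | Sequential trees (boosting) |
| **Bias/Variance** | Low bias, high variance | Low bias, reduced variance | Reduces bias; variance controlled by regularization |
| **Speed** | Fast | Medium | Medium-Fast |
| **Accuracy** | Good | Better | Often best on tabular data, with tuning |
| **Overfitting** | High risk | Low risk | Controlled via regularization and early stopping |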
| **Interpretability** | High | Medium | Medium |

## **📊 3. Generate Sample Data**

In [None]:
# Set random seed for reproducibility
np.random.seed(42)

# Classification Dataset - More complex for gradient boosting
print("🎯 Creating Classification Dataset")
X_cls, y_cls = make_classification(
    n_samples=2000,
    n_features=15,
    n_informative=10,
    n_redundant=3,
    n_clusters_per_class=2,
    class_sep=0.8,
    random_state=42
)

# Add Gaussian noise to 5% of the rows to make the data more realistic (no missing values are introduced here)
noise_indices = np.random.choice(len(X_cls), size=int(0.05 * len(X_cls)), replace=False)
X_cls[noise_indices] += np.random.normal(0, 2, size=(len(noise_indices), X_cls.shape[1]))

feature_names_cls = [f'feature_{i}' for i in range(X_cls.shape[1])]
df_cls = pd.DataFrame(X_cls, columns=feature_names_cls)
df_cls['target'] = y_cls

print(f"Classification dataset shape: {df_cls.shape}")
print(f"Classes: {np.unique(y_cls)}")
print(f"Class distribution: {np.bincount(y_cls)}")

# Regression Dataset - Non-linear relationship
print("\n📈 Creating Regression Dataset")
X_reg, y_reg = make_regression(
    n_samples=1500,
    n_features=12,
    n_informative=8,
    noise=20,
    random_state=42
)

# Transform two features after the target has been generated, so the target
# is no longer a linear function of these (transformed) columns
X_reg[:, 0] = X_reg[:, 0] ** 2  # Squared feature
X_reg[:, 1] = np.sin(X_reg[:, 1])  # Sinusoidal feature

feature_names_reg = [f'feature_{i}' for i in range(X_reg.shape[1])]
df_reg = pd.DataFrame(X_reg, columns=feature_names_reg)
df_reg['target'] = y_reg

print(f"Regression dataset shape: {df_reg.shape}")

# Load Wine dataset for detailed analysis
print("\n🍷 Loading Wine Dataset")
wine = load_wine()
X_wine = wine.data
y_wine = wine.target

print(f"Wine dataset shape: {X_wine.shape}")
print(f"Classes: {wine.target_names}")

# Visualize relationships
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
plt.scatter(df_cls['feature_0'], df_cls['feature_1'], c=y_cls, cmap='viridis', alpha=0.6)
plt.xlabel('Feature 0')
plt.ylabel('Feature 1')
plt.title('Classification Dataset')
plt.colorbar()

plt.subplot(1, 3, 2)
plt.scatter(df_reg['feature_0'], df_reg['target'], alpha=0.6, color='orange')
plt.xlabel('Feature 0 (Quadratic)')
plt.ylabel('Target')
plt.title('Regression Dataset')

plt.subplot(1, 3, 3)
plt.scatter(X_wine[:, 0], X_wine[:, 1], c=y_wine, cmap='Set1', alpha=0.6)
plt.xlabel('Alcohol')
plt.ylabel('Malic Acid')
plt.title('Wine Dataset')
plt.colorbar()

plt.tight_layout()
plt.show()

## **🛠️ 4. XGBoost Implementation - Classification**

In [None]:
# Classification with Wine dataset
X_train_wine, X_test_wine, y_train_wine, y_test_wine = train_test_split(
    X_wine, y_wine, test_size=0.3, random_state=42, stratify=y_wine
)

# Create XGBoost classifier
xgb_clf = XGBClassifier(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    eval_metric='mlogloss'  # For multiclass classification
)

# Train the model
xgb_clf.fit(X_train_wine, y_train_wine)

# Make predictions
y_train_pred = xgb_clf.predict(X_train_wine)
y_test_pred = xgb_clf.predict(X_test_wine)
y_test_proba = xgb_clf.predict_proba(X_test_wine)

print("🚀 XGBoost Classification Results:")
print(f"Training Accuracy: {accuracy_score(y_train_wine, y_train_pred):.3f}")
print(f"Testing Accuracy:  {accuracy_score(y_test_wine, y_test_pred):.3f}")

print(f"Precision: {precision_score(y_test_wine, y_test_pred, average='weighted'):.3f}")
print(f"Recall:    {recall_score(y_test_wine, y_test_pred, average='weighted'):.3f}")
print(f"F1-Score:  {f1_score(y_test_wine, y_test_pred, average='weighted'):.3f}")

# Compare with Random Forest and Gradient Boosting
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
gb_clf = GradientBoostingClassifier(n_estimators=100, random_state=42)

rf_clf.fit(X_train_wine, y_train_wine)
gb_clf.fit(X_train_wine, y_train_wine)

rf_pred = rf_clf.predict(X_test_wine)
gb_pred = gb_clf.predict(X_test_wine)

print(f"\n🌲 Random Forest Accuracy:     {accuracy_score(y_test_wine, rf_pred):.3f}")
print(f"🌳 Gradient Boosting Accuracy: {accuracy_score(y_test_wine, gb_pred):.3f}")
print(f"🚀 XGBoost Accuracy:           {accuracy_score(y_test_wine, y_test_pred):.3f}")

# Feature importance comparison
print("\n📊 Top 10 Feature Importances (XGBoost):")
xgb_importance_df = pd.DataFrame({
    'Feature': wine.feature_names,
    'Importance': xgb_clf.feature_importances_
}).sort_values('Importance', ascending=False)

print(xgb_importance_df.head(10))

print("\n📋 Classification Report:")
print(classification_report(y_test_wine, y_test_pred, target_names=wine.target_names))
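
The `average='weighted'` option used above computes the metric per class and then averages the per-class values, weighting each by the number of true samples in that class. A minimal sketch of that calculation for precision is shown below. The function name and the toy labels are made up for illustration, and classes that are never predicted are scored as 0 here.

```python
from collections import Counter


def weighted_precision(y_true, y_pred):
    """Per-class precision averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for cls, n_true in support.items():
        predicted = sum(1 for p in y_pred if p == cls)
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == p == cls)
        precision = correct / predicted if predicted else 0.0
        total += precision * n_true
    return total / len(y_true)


# Toy labels for three classes, two samples each
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]
print(f"Weighted precision: {weighted_precision(y_true, y_pred):.3f}")
```

With balanced classes, as in this toy example, the weighted average equals the plain (macro) average; the two differ when class sizes are unequal.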

## **📈 5. XGBoost Implementation - Regression**

In [None]:
# Regression with synthetic dataset
X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Create XGBoost regressor
xgb_reg = XGBRegressor(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42
)

# Train the model
xgb_reg.fit(X_train_reg, y_train_reg)

# Make predictions
y_train_pred_reg = xgb_reg.predict(X_train_reg)
y_test_pred_reg = xgb_reg.predict(X_test_reg)

print("📈 XGBoost Regression Results:")
print(f"Training R² Score: {r2_score(y_train_reg, y_train_pred_reg):.3f}")
print(f"Testing R² Score:  {r2_score(y_test_reg, y_test_pred_reg):.3f}")

print(f"Training RMSE:     {np.sqrt(mean_squared_error(y_train_reg, y_train_pred_reg)):.3f}")
print(f"Testing RMSE:      {np.sqrt(mean_squared_error(y_test_reg, y_test_pred_reg)):.3f}")
print(f"Testing MAE:       {mean_absolute_error(y_test_reg, y_test_pred_reg):.3f}")

# Compare with other methods
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)
gb_reg = GradientBoostingRegressor(n_estimators=100, random_state=42)

rf_reg.fit(X_train_reg, y_train_reg)
gb_reg.fit(X_train_reg, y_train_reg)

rf_pred_reg = rf_reg.predict(X_test_reg)
gb_pred_reg = gb_reg.predict(X_test_reg)

print(f"\n🌲 Random Forest R²:     {r2_score(y_test_reg, rf_pred_reg):.3f}")
print(f"🌳 Gradient Boosting R²: {r2_score(y_test_reg, gb_pred_reg):.3f}")
print(f"🚀 XGBoost R²:           {r2_score(y_test_reg, y_test_pred_reg):.3f}")

print("\n📊 Feature Importances (Regression):")
reg_importance_df = pd.DataFrame({
    'Feature': feature_names_reg,
    'Importance': xgb_reg.feature_importances_
}).sort_values('Importance', ascending=False)

print(reg_importance_df.head(8))
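
For reference, the three regression metrics reported above can be written out directly. The sketch below uses plain Python and toy values. It is meant to show the definitions, not to replace the scikit-learn functions.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(f"RMSE: {rmse(y_true, y_pred):.3f}")
print(f"MAE:  {mae(y_true, y_pred):.3f}")
print(f"R²:   {r2(y_true, y_pred):.3f}")
```

RMSE penalizes large errors more heavily than MAE, so a large gap between the two suggests a few poorly predicted samples.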

## **📊 6. Advanced XGBoost Features**

In [None]:
# Using XGBoost's native API for more control
print("🔧 XGBoost Native API Features")

# Convert to DMatrix for native XGBoost API
dtrain = xgb.DMatrix(X_train_wine, label=y_train_wine)
dtest = xgb.DMatrix(X_test_wine, label=y_test_wine)

# Set parameters for native API
params = {
    'max_depth': 6,
    'eta': 0.1,  # learning_rate
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'objective': 'multi:softprob',
    'num_class': 3,
    'eval_metric': 'mlogloss',
    'seed': 42
}

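# Train with a watchlist; early stopping monitors the last entry ('eval').
# Note: the test set doubles as the validation set here for brevity; use a
# separate validation split when reporting final test metrics.
evallist = [(dtrain, 'train'), (dtest, 'eval')]
num_round = 100

xgb_native = xgb.train(
    params,
    dtrain,
    num_boost_round=num_round,
    evals=evallist,
    early_stopping_rounds=10,
    verbose_eval=False
)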

print(f"Best iteration: {xgb_native.best_iteration}")
print(f"Best score: {xgb_native.best_score:.4f}")

# Predict with native API
y_pred_native = xgb_native.predict(dtest)
y_pred_native_classes = np.argmax(y_pred_native, axis=1)

print(f"Native API Accuracy: {accuracy_score(y_test_wine, y_pred_native_classes):.3f}")

# Feature importance from native API
# Booster exposes importances through get_score(); its default type is 'weight'
importance_dict = xgb_native.get_score(importance_type='gain')
print("\n📊 Feature Importance (Native API - by gain):")
for feature, importance in sorted(importance_dict.items(), key=lambda x: x[1], reverse=True)[:10]:
    # Keys look like 'f0', 'f1', ... because the DMatrix was built without feature names
    print(f"{wine.feature_names[int(feature[1:])]: <20}: {importance:.3f}")

# Different importance types
importance_types = ['weight', 'gain', 'cover']
plt.figure(figsize=(15, 5))

for i, imp_type in enumerate(importance_types):
    plt.subplot(1, 3, i+1)
    importance = xgb_native.get_score(importance_type=imp_type)
    
    # Convert to DataFrame for easier plotting
    imp_df = pd.DataFrame([
        {'feature': wine.feature_names[int(k[1:])], 'importance': v} 
        for k, v in importance.items()
    ]).sort_values('importance', ascending=False).head(10)
    
    plt.barh(imp_df['feature'], imp_df['importance'])
    plt.title(f'Feature Importance ({imp_type})')
    plt.xlabel('Importance')
    plt.gca().invert_yaxis()

plt.tight_layout()
plt.show()
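
The three importance types plotted above measure different things: `weight` counts how often a feature is used to split, `gain` is the average loss reduction of those splits, and `cover` is the average amount of data passing through them. The sketch below reproduces these aggregations on a hypothetical list of split records. The records and the `importance` helper are made up for illustration, and cover is simplified to a sample count, whereas XGBoost uses the sum of Hessians.

```python
from collections import defaultdict

# Hypothetical split records: (feature, loss reduction of the split, samples reaching it)
splits = [
    ("alcohol", 12.0, 100),
    ("alcohol", 3.0, 40),
    ("proline", 20.0, 60),
    ("hue", 1.0, 10),
]


def importance(splits, importance_type):
    """Aggregate split records into per-feature importances."""
    counts = defaultdict(int)
    gains = defaultdict(float)
    covers = defaultdict(float)
    for feature, gain, cover in splits:
        counts[feature] += 1
        gains[feature] += gain
        covers[feature] += cover
    if importance_type == "weight":   # number of splits using the feature
        return dict(counts)
    if importance_type == "gain":     # average gain per split
        return {f: gains[f] / counts[f] for f in counts}
    if importance_type == "cover":    # average coverage per split
        return {f: covers[f] / counts[f] for f in counts}
    raise ValueError(f"unknown importance type: {importance_type}")


for imp_type in ("weight", "gain", "cover"):
    print(imp_type, importance(splits, imp_type))
```

In this toy example `alcohol` ranks first by weight and cover while `proline` ranks first by gain, which is why rankings can disagree between the three plots.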

## **📊 7. Comprehensive Visualizations**

In [None]:
# Create comprehensive visualization
fig, axes = plt.subplots(3, 3, figsize=(18, 15))

# 1. Feature Importance Comparison
top_features = xgb_importance_df.head(10)
axes[0, 0].barh(top_features['Feature'], top_features['Importance'], color='lightgreen')
axes[0, 0].set_xlabel('Importance')
axes[0, 0].set_title('Top 10 Feature Importances')
axes[0, 0].invert_yaxis()

# 2. Confusion Matrix
cm = confusion_matrix(y_test_wine, y_test_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Greens', ax=axes[0, 1],
            xticklabels=wine.target_names, yticklabels=wine.target_names)
axes[0, 1].set_xlabel('Predicted')
axes[0, 1].set_ylabel('Actual')
axes[0, 1].set_title('Confusion Matrix')

# 3. Learning Curves (using eval_set)
xgb_eval = XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    random_state=42,
    eval_metric='mlogloss'
)

eval_set = [(X_train_wine, y_train_wine), (X_test_wine, y_test_wine)]
xgb_eval.fit(X_train_wine, y_train_wine, 
             eval_set=eval_set, 
             verbose=False)

# Get evaluation results
results = xgb_eval.evals_result()
epochs = len(results['validation_0']['mlogloss'])

axes[0, 2].plot(range(epochs), results['validation_0']['mlogloss'], label='Train')
axes[0, 2].plot(range(epochs), results['validation_1']['mlogloss'], label='Test')
axes[0, 2].set_xlabel('Epochs')
axes[0, 2].set_ylabel('Log Loss')
axes[0, 2].set_title('Learning Curves')
axes[0, 2].legend()
axes[0, 2].grid(True, alpha=0.3)

# 4. Model Comparison
models = ['Random Forest', 'Gradient Boosting', 'XGBoost']
accuracies = [
    accuracy_score(y_test_wine, rf_pred),
    accuracy_score(y_test_wine, gb_pred),
    accuracy_score(y_test_wine, y_test_pred)
]

bars = axes[1, 0].bar(models, accuracies, color=['skyblue', 'lightcoral', 'lightgreen'])
axes[1, 0].set_ylabel('Accuracy')
axes[1, 0].set_title('Model Comparison')
axes[1, 0].set_ylim([0.8, 1.0])

# Add value labels on bars
for bar, acc in zip(bars, accuracies):
    height = bar.get_height()
    axes[1, 0].text(bar.get_x() + bar.get_width()/2., height + 0.005,
                   f'{acc:.3f}', ha='center', va='bottom')

# 5. Regression: True vs Predicted
axes[1, 1].scatter(y_test_reg, y_test_pred_reg, alpha=0.6, color='green')
axes[1, 1].plot([y_test_reg.min(), y_test_reg.max()], 
                [y_test_reg.min(), y_test_reg.max()], 'r--', lw=2)
axes[1, 1].set_xlabel('True Values')
axes[1, 1].set_ylabel('Predicted Values')
axes[1, 1].set_title('Regression: True vs Predicted')

# 6. Residuals Plot
residuals = y_test_reg - y_test_pred_reg
axes[1, 2].scatter(y_test_pred_reg, residuals, alpha=0.6, color='purple')
axes[1, 2].axhline(y=0, color='red', linestyle='--')
axes[1, 2].set_xlabel('Predicted Values')
axes[1, 2].set_ylabel('Residuals')
axes[1, 2].set_title('Residual Plot')

# 7. Cross-validation scores
cv_scores = cross_val_score(xgb_clf, X_wine, y_wine, cv=5, scoring='accuracy')
axes[2, 0].bar(range(1, 6), cv_scores, color='orange', alpha=0.7)
axes[2, 0].axhline(cv_scores.mean(), color='red', linestyle='--', 
                  label=f'Mean: {cv_scores.mean():.3f}')
axes[2, 0].set_xlabel('Fold')
axes[2, 0].set_ylabel('Accuracy')
axes[2, 0].set_title('Cross-Validation Scores')
axes[2, 0].legend()

# 8. Tree visualization (first tree)
# Note: This shows the structure of the first tree in the ensemble
try:
    xgb.plot_tree(xgb_native, num_trees=0, ax=axes[2, 1])
    axes[2, 1].set_title('First Tree Structure')
except Exception:  # plot_tree needs the optional graphviz dependency
    axes[2, 1].text(0.5, 0.5, 'Tree visualization\nnot available', 
                   ha='center', va='center', transform=axes[2, 1].transAxes)
    axes[2, 1].set_title('Tree Structure (Not Available)')

# 9. Feature Importance Distribution
axes[2, 2].hist(xgb_clf.feature_importances_, bins=10, alpha=0.7, color='green')
axes[2, 2].set_xlabel('Feature Importance')
axes[2, 2].set_ylabel('Number of Features')
axes[2, 2].set_title('Distribution of Feature Importances')
axes[2, 2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"📊 Cross-Validation Results:")
print(f"CV Scores: {cv_scores}")
print(f"Mean CV Score: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})")

## **🎛️ 8. Hyperparameter Tuning**

In [None]:
print("🎛️ XGBoost Hyperparameter Tuning")

# Define parameter space for tuning
print("Key XGBoost Hyperparameters:")
print("""
1. n_estimators: Number of boosting rounds
2. max_depth: Maximum tree depth
3. learning_rate (eta): Step size shrinkage
4. subsample: Fraction of samples for each tree
5. colsample_bytree: Fraction of features for each tree
6. gamma: Minimum loss reduction for split
7. reg_alpha: L1 regularization
8. reg_lambda: L2 regularization
""")

# Grid Search with key parameters
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 4, 6],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

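# Tune on a stratified subsample of the training split for faster computation,
# so the held-out test set stays unseen during tuning
X_sample, _, y_sample, _ = train_test_split(
    X_train_wine, y_train_wine, train_size=0.4, random_state=42, stratify=y_train_wine
)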

print("🔍 Grid Search (limited parameters for demo)")
grid_search = GridSearchCV(
    XGBClassifier(random_state=42, eval_metric='mlogloss'),
    param_grid,
    cv=3,
    scoring='accuracy',
    n_jobs=-1,
    verbose=0
)

grid_search.fit(X_sample, y_sample)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")

# Random Search for more comprehensive tuning
print("\n🎲 Random Search")
from scipy.stats import randint, uniform

random_param_dist = {
    'n_estimators': randint(50, 301),
    'max_depth': randint(3, 10),
    'learning_rate': uniform(0.01, 0.3),
    'subsample': uniform(0.6, 0.4),
    'colsample_bytree': uniform(0.6, 0.4),
    'gamma': uniform(0, 0.5),
    'reg_alpha': uniform(0, 1),
    'reg_lambda': uniform(1, 2)
}

random_search = RandomizedSearchCV(
    XGBClassifier(random_state=42, eval_metric='mlogloss'),
    random_param_dist,
    n_iter=30,
    cv=3,
    scoring='accuracy',
    random_state=42,
    n_jobs=-1
)

random_search.fit(X_sample, y_sample)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.3f}")

# Train final model with best parameters
best_xgb = XGBClassifier(**random_search.best_params_, random_state=42)
best_xgb.fit(X_train_wine, y_train_wine)
best_predictions = best_xgb.predict(X_test_wine)

print(f"\n🏆 Tuned Model Performance:")
print(f"Test Accuracy: {accuracy_score(y_test_wine, best_predictions):.3f}")
print(f"Original XGBoost: {accuracy_score(y_test_wine, y_test_pred):.3f}")
print(f"Improvement: {accuracy_score(y_test_wine, best_predictions) - accuracy_score(y_test_wine, y_test_pred):+.3f}")

# Analyze parameter importance
print("\n📊 Parameter Analysis:")
results_df = pd.DataFrame(random_search.cv_results_)
param_cols = [col for col in results_df.columns if col.startswith('param_')]

plt.figure(figsize=(15, 10))

# Plot parameter distributions vs score
for i, param in enumerate(['param_n_estimators', 'param_max_depth', 'param_learning_rate']):
    if param in results_df.columns:
        plt.subplot(2, 3, i+1)
        plt.scatter(results_df[param], results_df['mean_test_score'], alpha=0.6)
        plt.xlabel(param.replace('param_', ''))
        plt.ylabel('CV Score')
        plt.title(f'CV Score vs {param.replace("param_", "")}')

for i, param in enumerate(['param_subsample', 'param_colsample_bytree', 'param_gamma']):
    if param in results_df.columns:
        plt.subplot(2, 3, i+4)
        plt.scatter(results_df[param], results_df['mean_test_score'], alpha=0.6)
        plt.xlabel(param.replace('param_', ''))
        plt.ylabel('CV Score')
        plt.title(f'CV Score vs {param.replace("param_", "")}')

plt.tight_layout()
plt.show()
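
A note on the search distributions used above: `scipy.stats.uniform(loc, scale)` samples from the interval `[loc, loc + scale]`, so `uniform(0.6, 0.4)` covers 0.6 to 1.0, and `scipy.stats.randint(low, high)` excludes `high`. The sketch below mirrors the same ranges with the standard library only, to show what one random-search draw looks like. The `sample_params` helper is made up for illustration and is not how `RandomizedSearchCV` is implemented internally.

```python
import random


def sample_params(rng):
    """Draw one candidate from the same ranges as random_param_dist."""
    return {
        "n_estimators": rng.randrange(50, 301),     # randint(50, 301): 50..300
        "max_depth": rng.randrange(3, 10),          # randint(3, 10): 3..9
        "learning_rate": rng.uniform(0.01, 0.31),   # uniform(0.01, 0.3)
        "subsample": rng.uniform(0.6, 1.0),         # uniform(0.6, 0.4)
        "colsample_bytree": rng.uniform(0.6, 1.0),  # uniform(0.6, 0.4)
        "gamma": rng.uniform(0.0, 0.5),             # uniform(0, 0.5)
        "reg_alpha": rng.uniform(0.0, 1.0),         # uniform(0, 1)
        "reg_lambda": rng.uniform(1.0, 3.0),        # uniform(1, 2)
    }


rng = random.Random(42)
candidates = [sample_params(rng) for _ in range(30)]  # n_iter=30 draws
print(candidates[0])
```

Each draw is independent, so 30 iterations explore 30 distinct values per parameter, whereas the grid above only ever tries two or three values for each.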

## **⚖️ 9. Bias-Variance and Regularization Analysis**

In [None]:
print("⚖️ Bias-Variance and Regularization Analysis")

# Analyze effect of learning rate
learning_rates = [0.01, 0.05, 0.1, 0.2, 0.3]
lr_scores = []

for lr in learning_rates:
    xgb_lr = XGBClassifier(
        n_estimators=100,
        learning_rate=lr,
        random_state=42,
        eval_metric='mlogloss'
    )
    score = cross_val_score(xgb_lr, X_wine, y_wine, cv=3).mean()
    lr_scores.append(score)

# Analyze effect of regularization
reg_lambdas = [0, 0.1, 0.5, 1, 2, 5, 10]
reg_scores = []

for reg_lambda in reg_lambdas:
    # Distinct name so the regressor `xgb_reg` from Section 5 is not overwritten
    xgb_l2 = XGBClassifier(
        n_estimators=100,
        reg_lambda=reg_lambda,
        random_state=42,
        eval_metric='mlogloss'
    )
    score = cross_val_score(xgb_l2, X_wine, y_wine, cv=3).mean()
    reg_scores.append(score)

# Analyze effect of tree depth
depths = range(1, 11)
depth_train_scores = []
depth_test_scores = []

for depth in depths:
    xgb_depth = XGBClassifier(
        n_estimators=100,
        max_depth=depth,
        random_state=42,
        eval_metric='mlogloss'
    )
    xgb_depth.fit(X_train_wine, y_train_wine)
    
    train_score = xgb_depth.score(X_train_wine, y_train_wine)
    test_score = xgb_depth.score(X_test_wine, y_test_wine)
    
    depth_train_scores.append(train_score)
    depth_test_scores.append(test_score)

# Plot analysis results
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
plt.plot(learning_rates, lr_scores, 'o-', color='blue', linewidth=2)
plt.xlabel('Learning Rate')
plt.ylabel('CV Score')
plt.title('Effect of Learning Rate')
plt.grid(True, alpha=0.3)

plt.subplot(1, 3, 2)
plt.plot(reg_lambdas, reg_scores, 'o-', color='red', linewidth=2)
plt.xlabel('L2 Regularization (lambda)')
plt.ylabel('CV Score')
plt.title('Effect of L2 Regularization')
plt.grid(True, alpha=0.3)

plt.subplot(1, 3, 3)
plt.plot(depths, depth_train_scores, 'o-', label='Training', color='blue')
plt.plot(depths, depth_test_scores, 'o-', label='Testing', color='red')
plt.xlabel('Max Depth')
plt.ylabel('Accuracy')
plt.title('Overfitting Analysis: Depth')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Early stopping analysis
print("\n🛑 Early Stopping Analysis")

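xgb_early = XGBClassifier(
    n_estimators=500,  # Upper bound; early stopping picks the actual number
    learning_rate=0.1,
    random_state=42,
    eval_metric='mlogloss',
    early_stopping_rounds=20  # Constructor argument in XGBoost >= 1.6; removed from fit() in 2.0
)

# Early stopping monitors the last entry in eval_set. The test split is reused
# as the validation set here for brevity.
eval_set = [(X_train_wine, y_train_wine), (X_test_wine, y_test_wine)]
xgb_early.fit(
    X_train_wine, y_train_wine,
    eval_set=eval_set,
    verbose=False
)

# best_iteration is zero-based, so the number of trees is best_iteration + 1
print(f"Optimal number of estimators: {xgb_early.best_iteration + 1}")
print(f"Best validation score: {xgb_early.best_score:.4f}")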

# Plot training history
results = xgb_early.evals_result()
epochs = len(results['validation_0']['mlogloss'])

plt.figure(figsize=(10, 6))
plt.plot(range(epochs), results['validation_0']['mlogloss'], label='Train', linewidth=2)
plt.plot(range(epochs), results['validation_1']['mlogloss'], label='Validation', linewidth=2)
plt.axvline(x=xgb_early.best_iteration, color='red', linestyle='--', 
           label=f'Best Iteration ({xgb_early.best_iteration})')
plt.xlabel('Iterations')
plt.ylabel('Log Loss')
plt.title('Training History with Early Stopping')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("\nBest settings from the sweeps above:")
print(f"• Learning Rate: {learning_rates[np.argmax(lr_scores)]}")
print(f"• Optimal Lambda: {reg_lambdas[np.argmax(reg_scores)]}")
print(f"• Optimal Depth: {depths[np.argmax(depth_test_scores)]}")
print(f"• Early Stopping: {xgb_early.best_iteration + 1} trees")
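
The early stopping rule itself is simple: training stops once the validation metric has failed to improve for a fixed number of consecutive rounds, and the best round seen so far is remembered. The sketch below applies that rule to a list of validation losses. The `early_stop` helper and the loss values are made up for illustration, and XGBoost's own implementation has more options (for example, metrics that should be maximized).

```python
def early_stop(val_losses, patience):
    """Return (best_iteration, last_iteration_run), both zero-based."""
    best_iter, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_iter, best_loss = i, loss
        elif i - best_iter >= patience:
            return best_iter, i          # no improvement for `patience` rounds
    return best_iter, len(val_losses) - 1


# Toy validation losses: a plateau after round 3, then a late improvement
losses = [0.90, 0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.50]

print(early_stop(losses, patience=3))  # stops during the plateau
print(early_stop(losses, patience=5))  # survives the plateau, sees the late improvement
```

This shows the trade-off in choosing the patience value: a small value saves training time but can stop on a temporary plateau, while a large value trains longer before giving up.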

## **🆚 10. Comprehensive Model Comparison**

In [None]:
print("🆚 Comprehensive Model Comparison")

# Define models to compare
models = {
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'XGBoost': XGBClassifier(n_estimators=100, random_state=42, eval_metric='mlogloss'),
    'XGBoost (Tuned)': best_xgb
}

# Compare across multiple metrics
import time

results = []

for name, model in models.items():
    print(f"Training {name}...")
    
    # Measure training time
    start_time = time.time()
    model.fit(X_train_wine, y_train_wine)
    train_time = time.time() - start_time
    
    # Measure prediction time
    start_time = time.time()
    y_pred = model.predict(X_test_wine)
    pred_time = time.time() - start_time
    
    # Calculate metrics
    accuracy = accuracy_score(y_test_wine, y_pred)
    precision = precision_score(y_test_wine, y_pred, average='weighted')
    recall = recall_score(y_test_wine, y_pred, average='weighted')
    f1 = f1_score(y_test_wine, y_pred, average='weighted')
    
    # Cross-validation scores (computed once, then summarized)
    fold_scores = cross_val_score(model, X_wine, y_wine, cv=5)
    cv_score = fold_scores.mean()
    cv_std = fold_scores.std()
    
    results.append({
        'Model': name,
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1,
        'CV Score': cv_score,
        'CV Std': cv_std,
        'Train Time': train_time,
        'Pred Time': pred_time
    })

# Create comparison DataFrame
comparison_df = pd.DataFrame(results)
print("\n📊 Model Comparison Results:")
print(comparison_df.round(4))

# Visualize comparison
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

# Performance metrics
metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
colors = ['skyblue', 'lightcoral', 'lightgreen', 'orange', 'purple']

for i, metric in enumerate(metrics):
    ax = axes[i//2, i%2]
    bars = ax.bar(range(len(comparison_df)), comparison_df[metric], color=colors, alpha=0.7)
    ax.set_xlabel('Models')
    ax.set_ylabel(metric)
    ax.set_title(f'{metric} Comparison')
    ax.set_xticks(range(len(comparison_df)))
    ax.set_xticklabels(comparison_df['Model'], rotation=45, ha='right')
    ax.grid(True, alpha=0.3)
    
    # Add value labels
    for j, bar in enumerate(bars):
        height = bar.get_height()
        ax.annotate(f'{height:.3f}',
                   xy=(bar.get_x() + bar.get_width() / 2, height),
                   xytext=(0, 3),
                   textcoords="offset points",
                   ha='center', va='bottom', fontsize=9)

# Cross-validation scores with error bars
# (placed in the third column; the first two columns hold the four metric plots)
ax = axes[0, 2]
bars = ax.bar(range(len(comparison_df)), comparison_df['CV Score'], 
              yerr=comparison_df['CV Std'], color=colors, alpha=0.7, capsize=5)
ax.set_xlabel('Models')
ax.set_ylabel('CV Score')
ax.set_title('Cross-Validation Scores')
ax.set_xticks(range(len(comparison_df)))
ax.set_xticklabels(comparison_df['Model'], rotation=45, ha='right')
ax.grid(True, alpha=0.3)

# Training time comparison
ax = axes[1, 2]
bars = ax.bar(range(len(comparison_df)), comparison_df['Train Time'], color=colors, alpha=0.7)
ax.set_xlabel('Models')
ax.set_ylabel('Training Time (seconds)')
ax.set_title('Training Time Comparison')
ax.set_xticks(range(len(comparison_df)))
ax.set_xticklabels(comparison_df['Model'], rotation=45, ha='right')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Feature importance comparison between tree-based methods
plt.figure(figsize=(15, 8))

tree_models = ['Random Forest', 'Gradient Boosting', 'XGBoost', 'XGBoost (Tuned)']
for i, model_name in enumerate(tree_models):
    model = models[model_name]
    
    # Get top 8 features
    importance_df = pd.DataFrame({
        'Feature': wine.feature_names,
        'Importance': model.feature_importances_
    }).sort_values('Importance', ascending=False).head(8)
    
    plt.subplot(2, 2, i+1)
    plt.barh(importance_df['Feature'], importance_df['Importance'], 
             color=colors[i+1], alpha=0.7)
    plt.xlabel('Importance')
    plt.title(f'{model_name} - Top Features')
    plt.gca().invert_yaxis()

plt.tight_layout()
plt.show()

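# Performance summary: rank models by mean CV score
# (no significance test is run here; overlapping ± ranges mean the ranking is not conclusive)
from scipy import stats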

print("\n📈 Performance Summary:")
best_model = comparison_df.loc[comparison_df['CV Score'].idxmax(), 'Model']
best_score = comparison_df['CV Score'].max()

print(f"🏆 Best Model: {best_model}")
print(f"🎯 Best CV Score: {best_score:.3f}")
print(f"📊 Performance Ranking:")

ranking = comparison_df.sort_values('CV Score', ascending=False)
for i, (_, row) in enumerate(ranking.iterrows()):
    print(f"{i+1}. {row['Model']}: {row['CV Score']:.3f} ± {row['CV Std']:.3f}")
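
A single timed `fit` call on a dataset this small is noisy, so the training-time comparison above should be read as indicative only. One way to get steadier numbers is to repeat the call and report the median, as in the sketch below. The `time_call` helper is made up for illustration, and the timed workload is a stand-in for something like `lambda: model.fit(X_train_wine, y_train_wine)`.

```python
import statistics
import time


def time_call(fn, repeats=5):
    """Return the median wall-clock duration of fn() over several runs, in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()   # monotonic, high-resolution timer
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


# Stand-in workload for the example
median_seconds = time_call(lambda: sum(i * i for i in range(100_000)))
print(f"Median duration: {median_seconds:.6f} s")
```

`time.perf_counter()` is preferred over `time.time()` for short intervals because it is monotonic and has higher resolution.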

## **🔍 11. Real-World Example: Financial Credit Scoring**
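
Two small calculations drive this example: a sigmoid turns a linear risk score (logit) into a default probability, and `scale_pos_weight` is set to the ratio of negative to positive cases to compensate for class imbalance. The sketch below shows both in plain Python. The helper names and the toy labels are made up for illustration, and the ratio is best computed on the training labels only.

```python
import math


def sigmoid(z):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def scale_pos_weight(train_labels):
    """Ratio of negative to positive examples among binary (0/1) labels."""
    positives = sum(train_labels)
    negatives = len(train_labels) - positives
    if positives == 0:
        raise ValueError("no positive examples in the training labels")
    return negatives / positives


# A logit of -3.0 corresponds to a default probability of roughly 4.7%
print(f"P(default | logit=-3.0) = {sigmoid(-3.0):.3f}")

# Toy training labels: 95 non-defaults and 5 defaults
labels = [0] * 95 + [1] * 5
print(f"scale_pos_weight = {scale_pos_weight(labels):.1f}")
```

With this weighting, each default counts as much as 19 non-defaults in the loss for the toy labels, which pushes the model toward higher recall on the rare class at the cost of precision.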

In [None]:
# Create a realistic credit scoring dataset
print("💳 Financial Credit Scoring Example")

np.random.seed(42)
n_customers = 5000

# Generate customer features
credit_data = {
    'age': np.random.normal(40, 15, n_customers).clip(18, 80),
    'income': np.random.lognormal(10, 0.5, n_customers),
    'credit_history_length': np.random.exponential(8, n_customers).clip(0, 30),
    'existing_loans': np.random.poisson(2, n_customers),
    'employment_duration': np.random.exponential(5, n_customers).clip(0, 20),
    'debt_to_income': np.random.beta(2, 5, n_customers),
    'previous_defaults': np.random.poisson(0.3, n_customers),
    'credit_utilization': np.random.beta(2, 3, n_customers),
    'savings_balance': np.random.lognormal(8, 1, n_customers),
    'property_value': np.random.lognormal(11, 0.8, n_customers)
}

# Create realistic default probability
default_logit = (
    -3.0 +  # base (low default rate)
    -0.02 * credit_data['age'] +
    -0.00001 * credit_data['income'] +
    -0.1 * credit_data['credit_history_length'] +
    0.3 * credit_data['existing_loans'] +
    -0.05 * credit_data['employment_duration'] +
    2.0 * credit_data['debt_to_income'] +
    0.5 * credit_data['previous_defaults'] +
    1.5 * credit_data['credit_utilization'] +
    -0.00001 * credit_data['savings_balance'] +
    -0.00001 * credit_data['property_value']
)

# Convert to probability and generate binary outcome
default_prob = 1 / (1 + np.exp(-default_logit))
default_outcome = (np.random.random(n_customers) < default_prob).astype(int)

# Create DataFrame
credit_df = pd.DataFrame(credit_data)
credit_df['default'] = default_outcome

print(f"Credit dataset shape: {credit_df.shape}")
print(f"Default rate: {default_outcome.mean():.1%}")
print(f"Features: {list(credit_data.keys())}")

# Prepare data
feature_columns = list(credit_data.keys())
X_credit = credit_df[feature_columns]
y_credit = credit_df['default']

# Split data
X_train_credit, X_test_credit, y_train_credit, y_test_credit = train_test_split(
    X_credit, y_credit, test_size=0.2, random_state=42, stratify=y_credit
)

# Train XGBoost for credit scoring
credit_xgb = XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1,
    # Handle class imbalance: negatives / positives, computed on the training split only
    scale_pos_weight=(y_train_credit == 0).sum() / (y_train_credit == 1).sum(),
    random_state=42,
    eval_metric='auc',
    early_stopping_rounds=20  # Constructor argument in XGBoost >= 1.6; removed from fit() in 2.0
)

# Train with early stopping (monitors the last entry in eval_set; the test
# split is reused as the validation set here for brevity)
eval_set = [(X_train_credit, y_train_credit), (X_test_credit, y_test_credit)]
credit_xgb.fit(
    X_train_credit, y_train_credit,
    eval_set=eval_set,
    verbose=False
)

# Make predictions
y_pred_credit = credit_xgb.predict(X_test_credit)
y_proba_credit = credit_xgb.predict_proba(X_test_credit)[:, 1]

print("\n🎯 Credit Scoring Results:")
print(f"Accuracy: {accuracy_score(y_test_credit, y_pred_credit):.3f}")
print(f"Precision: {precision_score(y_test_credit, y_pred_credit):.3f}")
print(f"Recall: {recall_score(y_test_credit, y_pred_credit):.3f}")

# AUC Score (important for credit scoring)
from sklearn.metrics import roc_auc_score, roc_curve
auc_score = roc_auc_score(y_test_credit, y_proba_credit)
print(f"AUC Score: {auc_score:.3f}")

# Feature importance for credit scoring
credit_importance = pd.DataFrame({
    'Feature': feature_columns,
    'Importance': credit_xgb.feature_importances_
}).sort_values('Importance', ascending=False)

print("\n📊 Most Important Risk Factors:")
for i, (_, row) in enumerate(credit_importance.iterrows()):
    print(f"{i+1}. {row['Feature']}: {row['Importance']:.3f}")

# Visualize credit scoring results
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# 1. Feature importance
axes[0, 0].barh(credit_importance['Feature'], credit_importance['Importance'])
axes[0, 0].set_xlabel('Importance')
axes[0, 0].set_title('Credit Risk Factors')
axes[0, 0].invert_yaxis()

# 2. ROC Curve
fpr, tpr, _ = roc_curve(y_test_credit, y_proba_credit)
axes[0, 1].plot(fpr, tpr, linewidth=2, label=f'AUC = {auc_score:.3f}')
axes[0, 1].plot([0, 1], [0, 1], 'k--', linewidth=1)
axes[0, 1].set_xlabel('False Positive Rate')
axes[0, 1].set_ylabel('True Positive Rate')
axes[0, 1].set_title('ROC Curve - Credit Scoring')
axes[0, 1].legend()

# 3. Confusion Matrix
cm_credit = confusion_matrix(y_test_credit, y_pred_credit)
sns.heatmap(cm_credit, annot=True, fmt='d', cmap='Blues', ax=axes[1, 0],
            xticklabels=['No Default', 'Default'], yticklabels=['No Default', 'Default'])
axes[1, 0].set_xlabel('Predicted')
axes[1, 0].set_ylabel('Actual')
axes[1, 0].set_title('Confusion Matrix')

# 4. Score distribution
axes[1, 1].hist(y_proba_credit[y_test_credit == 0], bins=50, alpha=0.5, 
                label='No Default', color='green', density=True)
axes[1, 1].hist(y_proba_credit[y_test_credit == 1], bins=50, alpha=0.5, 
                label='Default', color='red', density=True)
axes[1, 1].set_xlabel('Default Probability')
axes[1, 1].set_ylabel('Density')
axes[1, 1].set_title('Score Distribution')
axes[1, 1].legend()

plt.tight_layout()
plt.show()

# Business interpretation
print("\n💼 Business Interpretation:")
print("Key risk indicators for credit defaults:")
top_3_features = credit_importance.head(3)
for i, (_, row) in enumerate(top_3_features.iterrows()):
    print(f"{i+1}. {row['Feature']}: Primary risk factor")

print("\nModel Performance for Business:")
# Grade the AUC in tiers so a near-random score is not reported as "good"
auc_label = 'excellent' if auc_score > 0.8 else 'good' if auc_score > 0.7 else 'weak'
print(f"• AUC of {auc_score:.3f} indicates {auc_label} discriminatory power")
print(f"• Can identify {recall_score(y_test_credit, y_pred_credit):.1%} of actual defaults")
print(f"• {precision_score(y_test_credit, y_pred_credit):.1%} of flagged cases are true defaults")
print("• Suitable for automated credit decisioning with human oversight")
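
The precision and recall printed above come from `predict`, which labels a customer as a default when the predicted probability crosses 0.5. A lender may prefer a different cut-off, so it can help to see how the two metrics trade off. The sketch below uses NumPy only and made-up scores: `y_true` and `y_score` are hypothetical stand-ins for `y_test_credit` and `y_proba_credit`, and `precision_recall_at` is an illustrative helper, not a library function.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-ins for y_test_credit and y_proba_credit
y_true = (rng.random(2000) < 0.2).astype(int)
y_score = np.clip(0.3 + 0.3 * y_true + rng.normal(scale=0.15, size=2000), 0, 1)

def precision_recall_at(threshold, y_true, y_score):
    """Precision and recall when every score >= threshold is flagged as a default."""
    flagged = y_score >= threshold
    true_pos = np.sum(flagged & (y_true == 1))
    precision = true_pos / flagged.sum() if flagged.sum() else 0.0
    recall = true_pos / (y_true == 1).sum()
    return float(precision), float(recall)

# Sweep a few candidate cut-offs
rows = [(t, *precision_recall_at(t, y_true, y_score)) for t in (0.3, 0.4, 0.5, 0.6, 0.7)]
for threshold, precision, recall in rows:
    print(f"threshold={threshold:.1f}  precision={precision:.3f}  recall={recall:.3f}")
```

Raising the threshold can only shrink the set of flagged customers, so recall never increases as the cut-off goes up; precision usually rises but is not guaranteed to.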

## **✅ 12. Advantages and Disadvantages**

### **XGBoost Advantages:** ✅

1. **High Performance**: Often achieves state-of-the-art results on structured/tabular data
2. **Regularization**: Built-in L1 and L2 regularization helps control overfitting
3. **Handles Missing Values**: Native support for missing data
4. **Feature Importance**: Multiple importance metrics (gain, cover, weight)
5. **Flexibility**: Works for classification, regression, and ranking
6. **Parallel Processing**: Optimized for speed and scalability
7. **Early Stopping**: Built-in early stopping halts training once the validation metric stops improving
8. **Cross-Validation**: Native cross-validation support
9. **Memory Efficient**: Optimized memory usage
10. **Robustness**: Tree splits depend only on the ordering of feature values, so outlying feature values have limited influence

### **XGBoost Disadvantages:** ❌

1. **Hyperparameter Complexity**: Many parameters to tune
2. **Computational Cost**: Can be slower than simpler models
3. **Black Box**: Less interpretable than linear models or single trees
4. **Overfitting Risk**: Can overfit with small datasets or poor parameters
5. **Memory Usage**: Requires more memory than simple models
6. **Learning Curve**: Steeper learning curve than Random Forest

### **When to Use XGBoost:** 🎯

- **Competitive ML**: When you need maximum predictive performance
- **Tabular Data**: Excellent for structured/tabular datasets
- **Feature Rich**: When you have many features and complex relationships
- **Class Imbalance**: Good handling of imbalanced datasets
- **Large Datasets**: Scales well with large amounts of data
- **Ensemble Models**: As part of ensemble learning strategies

### **When NOT to Use XGBoost:** ⚠️

- **Simple Relationships**: When linear models would suffice
- **Real-time Inference**: When latency budgets are very tight and scoring a large ensemble is too slow
- **Small Datasets**: Risk of overfitting with limited data
- **High Interpretability Required**: When you need to explain every decision
- **Resource Constraints**: Limited computational resources
- **Image/Text Data**: Deep learning often better for unstructured data

### **XGBoost vs Other Methods:**

| Aspect | Decision Tree | Random Forest | XGBoost |
|--------|---------------|---------------|---------|
| **Accuracy** | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| **Speed** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ |
| **Interpretability** | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ |
| **Overfitting Resistance** | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| **Feature Handling** | ⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| **Robustness to Hyperparameters** | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐ |

## **📝 13. Summary and Key Takeaways**

### **XGBoost in a Nutshell:** 🌰

XGBoost is a powerful, optimized gradient boosting framework that sequentially builds trees to correct errors of previous trees, incorporating advanced regularization and optimization techniques for superior performance.

### **Key Concepts Mastered:** 💡

1. **Gradient Boosting**: Sequential learning that corrects previous errors
2. **Regularization**: L1/L2 penalties on leaf weights reduce overfitting
3. **Tree Pruning**: Splits whose gain does not exceed `gamma` are pruned, which can improve generalization
4. **Learning Rate**: Controls step size for stable convergence
5. **Subsampling**: Reduces overfitting and improves speed
6. **Early Stopping**: Prevents overtraining automatically
7. **Feature Importance**: Multiple metrics for feature analysis
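
The first and fourth concepts (sequential error correction and the learning rate) can be illustrated with a from-scratch sketch. It implements plain gradient boosting on squared error with one-split "stumps" in NumPy; it is not XGBoost's algorithm, which also uses second-order gradients and regularization. The helpers `fit_stump` and `predict_stump` and the toy sine data are assumptions made for illustration.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold on x that minimizes the squared error of the residuals."""
    best = None
    for threshold in np.unique(x)[:-1]:  # drop the max so the right side is never empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

learning_rate = 0.3
prediction = np.full_like(y, y.mean())  # round 0: a constant model
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    residual = y - prediction           # negative gradient of 0.5 * squared error
    stump = fit_stump(x, residual)      # the new tree targets the current errors
    prediction += learning_rate * predict_stump(stump, x)  # shrunken update
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"Training MSE after 0 rounds:  {mse_history[0]:.4f}")
print(f"Training MSE after 50 rounds: {mse_history[-1]:.4f}")
```

A smaller `learning_rate` makes each round contribute less, so more rounds are needed, which is why `learning_rate` and `n_estimators` are usually tuned together.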

### **Essential Hyperparameters:** 🎛️

**Core Parameters:**
- `n_estimators`: Number of boosting rounds (50-500)
- `max_depth`: Tree depth (3-10)
- `learning_rate`: Step size (0.01-0.3)

**Regularization:**
- `reg_lambda`: L2 regularization (1-10)
- `reg_alpha`: L1 regularization (0-1)
- `gamma`: Minimum loss reduction required to make a split (0-0.5)
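
To see what these three parameters do, it helps to compute a leaf weight and a split gain by hand. The sketch below follows the formulas from the XGBoost documentation's boosted-trees introduction, where `G` and `H` are the sums of first and second derivatives of the loss over the samples in a node; the L1 term is applied as soft-thresholding of `G`. The function names and the numbers are illustrative assumptions, and the library's internal implementation may differ in constant factors.

```python
def soft_threshold(g, alpha):
    """Shrink g toward zero by alpha (the effect of the L1 penalty)."""
    if g > alpha:
        return g - alpha
    if g < -alpha:
        return g + alpha
    return 0.0

def leaf_weight(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal leaf value: larger reg_lambda pulls it toward zero."""
    return -soft_threshold(G, reg_alpha) / (H + reg_lambda)

def leaf_score(G, H, reg_lambda=1.0, reg_alpha=0.0):
    return soft_threshold(G, reg_alpha) ** 2 / (H + reg_lambda)

def split_gain(G_left, H_left, G_right, H_right,
               reg_lambda=1.0, reg_alpha=0.0, gamma=0.0):
    """Loss reduction of a split; the split is only worth keeping if this is positive."""
    children = (leaf_score(G_left, H_left, reg_lambda, reg_alpha)
                + leaf_score(G_right, H_right, reg_lambda, reg_alpha))
    parent = leaf_score(G_left + G_right, H_left + H_right, reg_lambda, reg_alpha)
    return 0.5 * (children - parent) - gamma

# Illustrative gradient statistics for one candidate split
G_left, H_left, G_right, H_right = -4.0, 3.0, 5.0, 4.0

print("leaf weight, lambda=1 :", leaf_weight(G_left, H_left, reg_lambda=1.0))
print("leaf weight, lambda=10:", leaf_weight(G_left, H_left, reg_lambda=10.0))
print("gain, gamma=0:", split_gain(G_left, H_left, G_right, H_right))
print("gain, gamma=5:", split_gain(G_left, H_left, G_right, H_right, gamma=5.0))
```

With these numbers the split is kept when `gamma=0` and rejected when `gamma=5`, because the gain drops below zero.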

**Sampling:**
- `subsample`: Row sampling (0.6-1.0)
- `colsample_bytree`: Column sampling (0.6-1.0)

### **Best Practices:** 🎯

1. **Start Simple**: Begin with default parameters
2. **Use Cross-Validation**: Always validate with CV
3. **Early Stopping**: Prevent overfitting with early stopping
4. **Handle Imbalance**: Use `scale_pos_weight` for imbalanced data
5. **Monitor Learning**: Watch training curves for overfitting
6. **Feature Engineering**: Good features still matter
7. **Hyperparameter Tuning**: Use systematic tuning approaches

### **Performance Optimization:** ⚡

- **Parallel Processing**: Set `n_jobs=-1`
- **Memory Management**: Use `tree_method='hist'` for large datasets
- **Early Stopping**: Save time with `early_stopping_rounds`
- **Approximate Methods**: Prefer `tree_method='hist'` or `'approx'` over `'exact'` when training time matters

### **Comparison Summary:**

| Model | Accuracy | Speed | Interpretability | Use Case |
|-------|----------|-------|------------------|----------|
| **Decision Tree** | Good | Fast | High | Simple, interpretable models |
| **Random Forest** | Better | Medium | Medium | Robust, general-purpose |
| **XGBoost** | Best | Medium | Low to Medium | Maximum performance needed |

### **Real-World Applications:** 🌍

- **Finance**: Credit scoring, fraud detection
- **Healthcare**: Medical diagnosis, drug discovery
- **Marketing**: Customer segmentation, churn prediction
- **E-commerce**: Recommendation systems, pricing
- **Technology**: Ranking algorithms, CTR prediction

### **Next Steps:** 🚀

1. **Advanced XGBoost**: Explore advanced features such as custom objectives and the native `DMatrix` API
2. **LightGBM/CatBoost**: Try other gradient boosting libraries
3. **Ensemble Methods**: Combine XGBoost with other models
4. **Feature Engineering**: Master feature creation techniques
5. **AutoML**: Explore automated machine learning tools
6. **Deep Learning**: Learn neural networks for complex patterns

### **Final Thoughts:** 💭

XGBoost is one of the strongest tree-based machine learning methods, offering:
- **State-of-the-art performance** on tabular data
- **Robust regularization** that helps control overfitting
- **Efficient implementation** for production use
- **Rich ecosystem** with extensive documentation

While it requires more expertise than simpler models, XGBoost often provides the best performance for structured data problems, making it an essential tool in any data scientist's toolkit! 🚀✨

**Remember**: The best model is the one that solves your business problem effectively - sometimes that's XGBoost, sometimes it's a simple linear model. Always start with understanding your problem and data first! 🎯