# Module 10: Model Comparison and Selection

**Difficulty**: ⭐⭐⭐ Advanced
**Estimated Time**: 100 minutes
**Prerequisites**: 
- All modules 00-09 (complete ensemble methods series)
- Understanding of cross-validation and model evaluation
- Familiarity with all ensemble algorithms

## Learning Objectives

By the end of this notebook, you will be able to:
1. Design comprehensive benchmarking experiments for ensemble methods
2. Compare 10+ ensemble algorithms across multiple datasets and metrics
3. Evaluate trade-offs: accuracy, speed, memory, interpretability
4. Analyze hyperparameter sensitivity for different ensemble methods
5. Make informed decisions about which ensemble method to use in production
6. Create a decision framework for selecting ensemble methods

## Setup and Configuration

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import time
import pickle
import joblib
from pathlib import Path

# Machine learning
from sklearn.datasets import (
    load_breast_cancer, load_wine, load_diabetes, 
    fetch_california_housing, make_classification, make_regression
)
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score, f1_score, roc_auc_score, mean_squared_error, 
    r2_score, mean_absolute_error
)

# Ensemble methods
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import (
    BaggingClassifier, BaggingRegressor,
    RandomForestClassifier, RandomForestRegressor,
    AdaBoostClassifier, AdaBoostRegressor,
    GradientBoostingClassifier, GradientBoostingRegressor,
    StackingClassifier, StackingRegressor,
    VotingClassifier, VotingRegressor
)

# Advanced ensemble libraries
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostClassifier, CatBoostRegressor

# For interpretability
import shap

# Configuration
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
warnings.filterwarnings('ignore')

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)
pd.set_option('display.max_rows', 100)

print("✓ Setup complete! All libraries imported successfully.")
print(f"✓ XGBoost version: {xgb.__version__}")
print(f"✓ LightGBM version: {lgb.__version__}")
print(f"✓ SHAP version: {shap.__version__}")

## 1. Framework for Choosing Ensemble Methods

### Decision Factors

Choosing the right ensemble method depends on:

1. **Dataset Characteristics**:
   - Size (rows and columns)
   - Feature types (numerical, categorical, mixed)
   - Data quality (missing values, outliers)
   - Class balance (for classification)

2. **Performance Requirements**:
   - Accuracy/R² target
   - Training time budget
   - Inference latency constraints
   - Memory limitations

3. **Business Constraints**:
   - Interpretability needs
   - Production deployment complexity
   - Model maintenance overhead
   - Cost of errors (false positives vs false negatives)

4. **Technical Environment**:
   - Available hardware (CPU, GPU, memory)
   - Software stack compatibility
   - Team expertise
   - Existing infrastructure

### Methodology

We'll systematically evaluate ensemble methods using:
- **Multiple datasets** (diverse characteristics)
- **Standardized protocol** (same splits, same metrics)
- **Comprehensive metrics** (accuracy, speed, memory, interpretability)
- **Statistical rigor** (cross-validation, error bars)
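
As a minimal sketch of what "same splits" and "error bars" can mean in practice, the snippet below summarizes per-fold scores and the per-fold difference between two models scored on identical folds. The helper names `summarize_scores` and `paired_difference` and the fold accuracies are hypothetical, chosen only for illustration. Note that `statistics.stdev` is the sample standard deviation, whereas NumPy's `.std()` defaults to the population form.

```python
import math
import statistics

def summarize_scores(scores):
    """Return (mean, sample std, standard error) for a list of fold scores."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    sem = std / math.sqrt(len(scores))
    return mean, std, sem

def paired_difference(scores_a, scores_b):
    """Summarize per-fold differences (a - b); both lists must come from the same folds."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must be scored on the same folds")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return summarize_scores(diffs)

# Hypothetical fold accuracies, for illustration only (not results from this notebook).
model_a = [0.95, 0.97, 0.96, 0.94, 0.98]
model_b = [0.93, 0.96, 0.95, 0.94, 0.95]

mean_a, std_a, sem_a = summarize_scores(model_a)
mean_diff, std_diff, sem_diff = paired_difference(model_a, model_b)
print(f"Model A: {mean_a:.3f} +/- {std_a:.3f}")
print(f"Mean paired difference: {mean_diff:.3f} (SE {sem_diff:.3f})")
```

Comparing differences fold by fold removes the variation that comes from some folds simply being harder than others, which is why the protocol insists on identical splits for every model.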

## 2. Benchmark Setup

### Datasets

We'll use 4 diverse datasets:

**Classification:**
1. **Breast Cancer** (569 samples, 30 features) - Binary, medical domain
2. **Wine** (178 samples, 13 features) - Multiclass, small dataset

**Regression:**
3. **Diabetes** (442 samples, 10 features) - Medical progression
4. **California Housing** (20,640 samples, 8 features) - Large dataset; the benchmark uses only the first 5,000 rows to keep runtimes short

In [None]:
# Load and prepare classification datasets
def load_classification_datasets():
    """
    Load and prepare classification datasets for benchmarking.
    Returns dictionary of datasets with train/test splits.
    """
    datasets = {}
    
    # 1. Breast Cancer (binary classification)
    cancer = load_breast_cancer()
    X_train, X_test, y_train, y_test = train_test_split(
        cancer.data, cancer.target, test_size=0.2, random_state=RANDOM_STATE, stratify=cancer.target
    )
    datasets['breast_cancer'] = {
        'X_train': X_train, 'X_test': X_test,
        'y_train': y_train, 'y_test': y_test,
        'task': 'binary',
        'description': 'Breast Cancer (569 samples, 30 features)'
    }
    
    # 2. Wine (multiclass classification)
    wine = load_wine()
    X_train, X_test, y_train, y_test = train_test_split(
        wine.data, wine.target, test_size=0.2, random_state=RANDOM_STATE, stratify=wine.target
    )
    datasets['wine'] = {
        'X_train': X_train, 'X_test': X_test,
        'y_train': y_train, 'y_test': y_test,
        'task': 'multiclass',
        'description': 'Wine (178 samples, 13 features)'
    }
    
    return datasets

# Load and prepare regression datasets
def load_regression_datasets():
    """
    Load and prepare regression datasets for benchmarking.
    Returns dictionary of datasets with train/test splits.
    """
    datasets = {}
    
    # 1. Diabetes (regression)
    diabetes = load_diabetes()
    X_train, X_test, y_train, y_test = train_test_split(
        diabetes.data, diabetes.target, test_size=0.2, random_state=RANDOM_STATE
    )
    datasets['diabetes'] = {
        'X_train': X_train, 'X_test': X_test,
        'y_train': y_train, 'y_test': y_test,
        'description': 'Diabetes (442 samples, 10 features)'
    }
    
    # 2. California Housing (large regression)
    housing = fetch_california_housing()
    # Use the first 5,000 rows for faster benchmarking (a prefix, not a random sample)
    X_subset = housing.data[:5000]
    y_subset = housing.target[:5000]
    X_train, X_test, y_train, y_test = train_test_split(
        X_subset, y_subset, test_size=0.2, random_state=RANDOM_STATE
    )
    datasets['housing'] = {
        'X_train': X_train, 'X_test': X_test,
        'y_train': y_train, 'y_test': y_test,
        'description': 'California Housing (5000 samples, 8 features)'
    }
    
    return datasets

# Load all datasets
classification_data = load_classification_datasets()
regression_data = load_regression_datasets()

print("Classification Datasets:")
for name, data in classification_data.items():
    print(f"  • {data['description']}")
    print(f"    Train: {len(data['y_train'])} samples, Test: {len(data['y_test'])} samples")

print("\nRegression Datasets:")
for name, data in regression_data.items():
    print(f"  • {data['description']}")
    print(f"    Train: {len(data['y_train'])} samples, Test: {len(data['y_test'])} samples")

### Benchmark Utilities

Create standardized utilities for timing, memory profiling, and evaluation.

In [None]:
import sys
from typing import Dict, Any

class ModelBenchmark:
    """
    Comprehensive benchmarking utility for comparing ensemble methods.
    
    Tracks:
    - Training time
    - Prediction time
    - Model size (memory)
    - Performance metrics
    - Cross-validation scores
    """
    
    def __init__(self, model, model_name: str):
        self.model = model
        self.model_name = model_name
        self.results = {}
        
    def benchmark_classification(self, X_train, X_test, y_train, y_test, 
                                 cv_folds: int = 5) -> Dict[str, Any]:
        """
        Run comprehensive benchmark for classification task.
        """
        results = {'model_name': self.model_name}
        
        # 1. Training time
        start_time = time.time()
        self.model.fit(X_train, y_train)
        train_time = time.time() - start_time
        results['train_time'] = train_time
        
        # 2. Prediction time
        start_time = time.time()
        y_pred = self.model.predict(X_test)
        pred_time = time.time() - start_time
        results['pred_time'] = pred_time
        results['pred_time_per_sample'] = pred_time / len(X_test)
        
        # 3. Performance metrics
        results['accuracy'] = accuracy_score(y_test, y_pred)
        results['f1'] = f1_score(y_test, y_pred, average='weighted')
        
        # ROC AUC (if binary and has predict_proba)
        if len(np.unique(y_test)) == 2 and hasattr(self.model, 'predict_proba'):
            y_proba = self.model.predict_proba(X_test)[:, 1]
            results['roc_auc'] = roc_auc_score(y_test, y_proba)
        else:
            results['roc_auc'] = np.nan
        
        # 4. Cross-validation score
        cv_scores = cross_val_score(self.model, X_train, y_train, cv=cv_folds, n_jobs=-1)
        results['cv_mean'] = cv_scores.mean()
        results['cv_std'] = cv_scores.std()
        
        # 5. Model size (approximate)
        results['model_size_mb'] = self._estimate_model_size()
        
        return results
    
    def benchmark_regression(self, X_train, X_test, y_train, y_test, 
                            cv_folds: int = 5) -> Dict[str, Any]:
        """
        Run comprehensive benchmark for regression task.
        """
        results = {'model_name': self.model_name}
        
        # 1. Training time
        start_time = time.time()
        self.model.fit(X_train, y_train)
        train_time = time.time() - start_time
        results['train_time'] = train_time
        
        # 2. Prediction time
        start_time = time.time()
        y_pred = self.model.predict(X_test)
        pred_time = time.time() - start_time
        results['pred_time'] = pred_time
        results['pred_time_per_sample'] = pred_time / len(X_test)
        
        # 3. Performance metrics
        results['r2'] = r2_score(y_test, y_pred)
        results['mse'] = mean_squared_error(y_test, y_pred)
        results['rmse'] = np.sqrt(results['mse'])
        results['mae'] = mean_absolute_error(y_test, y_pred)
        
        # 4. Cross-validation score
        cv_scores = cross_val_score(self.model, X_train, y_train, cv=cv_folds, 
                                   scoring='r2', n_jobs=-1)
        results['cv_mean'] = cv_scores.mean()
        results['cv_std'] = cv_scores.std()
        
        # 5. Model size (approximate)
        results['model_size_mb'] = self._estimate_model_size()
        
        return results
    
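    def _estimate_model_size(self) -> float:
        """
        Estimate model size in MB from the length of its pickled bytes.
        """
        try:
            pickled = pickle.dumps(self.model)
            # len() counts the serialized payload only; sys.getsizeof would also
            # include the overhead of the Python bytes object itself.
            return len(pickled) / (1024 * 1024)
        except Exception:
            # Some models may not be picklable; report the size as missing.
            return np.nan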

print("✓ ModelBenchmark class defined successfully!")
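
Single-shot timings of very fast calls (such as predicting on a small test set) can be noisy. The sketch below shows one possible way to reduce that noise: repeat the call, time each run with `time.perf_counter`, and report the median. The helpers `time_call` and `pickled_size_mb` are hypothetical names, not part of `ModelBenchmark`, and the workload is a stand-in.

```python
import pickle
import statistics
import time

def time_call(fn, *args, repeats=5, **kwargs):
    """Call fn repeatedly; return (last result, median seconds, all timings)."""
    timings = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return result, statistics.median(timings), timings

def pickled_size_mb(obj):
    """Size of the pickled payload in MB (len counts only the serialized bytes)."""
    return len(pickle.dumps(obj)) / (1024 * 1024)

# Stand-in workload; in the notebook this could be `model.predict` with `X_test`.
result, median_s, timings = time_call(sum, range(100_000), repeats=5)
print(f"median time: {median_s:.6f}s over {len(timings)} runs")
print(f"pickled size: {pickled_size_mb(list(range(1000))):.4f} MB")
```

Repeating `predict` is cheap, but repeating `fit` multiplies training cost, so for large models a single training run may remain the pragmatic choice.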

## 3. Model Configuration

Define all ensemble methods to compare. We'll use reasonable default parameters for fair comparison.

In [None]:
def get_classification_models():
    """
    Get dictionary of classification models for benchmarking.
    All models use reasonable default parameters.
    """
    models = {
        'Decision Tree': DecisionTreeClassifier(
            max_depth=10, random_state=RANDOM_STATE
        ),
        
        'Bagging': BaggingClassifier(
            estimator=DecisionTreeClassifier(max_depth=10, random_state=RANDOM_STATE),
            n_estimators=50, random_state=RANDOM_STATE, n_jobs=-1
        ),
        
        'Random Forest': RandomForestClassifier(
            n_estimators=100, max_depth=10, random_state=RANDOM_STATE, n_jobs=-1
        ),
        
        'AdaBoost': AdaBoostClassifier(
            estimator=DecisionTreeClassifier(max_depth=3, random_state=RANDOM_STATE),
            n_estimators=50, learning_rate=1.0, random_state=RANDOM_STATE,
            algorithm='SAMME'
        ),
        
        'Gradient Boosting': GradientBoostingClassifier(
            n_estimators=100, learning_rate=0.1, max_depth=3, 
            random_state=RANDOM_STATE
        ),
        
        'XGBoost': xgb.XGBClassifier(
            n_estimators=100, learning_rate=0.1, max_depth=3,
            random_state=RANDOM_STATE, n_jobs=-1, verbosity=0
        ),
        
        'LightGBM': lgb.LGBMClassifier(
            n_estimators=100, learning_rate=0.1, max_depth=3,
            random_state=RANDOM_STATE, n_jobs=-1, verbose=-1
        ),
        
        'CatBoost': CatBoostClassifier(
            iterations=100, learning_rate=0.1, depth=3,
            random_state=RANDOM_STATE, verbose=0
        ),
    }
    
    return models

def get_regression_models():
    """
    Get dictionary of regression models for benchmarking.
    All models use reasonable default parameters.
    """
    models = {
        'Decision Tree': DecisionTreeRegressor(
            max_depth=10, random_state=RANDOM_STATE
        ),
        
        'Bagging': BaggingRegressor(
            estimator=DecisionTreeRegressor(max_depth=10, random_state=RANDOM_STATE),
            n_estimators=50, random_state=RANDOM_STATE, n_jobs=-1
        ),
        
        'Random Forest': RandomForestRegressor(
            n_estimators=100, max_depth=10, random_state=RANDOM_STATE, n_jobs=-1
        ),
        
        'AdaBoost': AdaBoostRegressor(
            estimator=DecisionTreeRegressor(max_depth=3, random_state=RANDOM_STATE),
            n_estimators=50, learning_rate=1.0, random_state=RANDOM_STATE
        ),
        
        'Gradient Boosting': GradientBoostingRegressor(
            n_estimators=100, learning_rate=0.1, max_depth=3, 
            random_state=RANDOM_STATE
        ),
        
        'XGBoost': xgb.XGBRegressor(
            n_estimators=100, learning_rate=0.1, max_depth=3,
            random_state=RANDOM_STATE, n_jobs=-1, verbosity=0
        ),
        
        'LightGBM': lgb.LGBMRegressor(
            n_estimators=100, learning_rate=0.1, max_depth=3,
            random_state=RANDOM_STATE, n_jobs=-1, verbose=-1
        ),
        
        'CatBoost': CatBoostRegressor(
            iterations=100, learning_rate=0.1, depth=3,
            random_state=RANDOM_STATE, verbose=0
        ),
    }
    
    return models

print("✓ Model configurations defined!")
print(f"  Classification models: {len(get_classification_models())}")
print(f"  Regression models: {len(get_regression_models())}")

## 4. Performance Comparison

Run comprehensive benchmarks across all models and datasets.

In [None]:
# Benchmark classification models
print("Benchmarking Classification Models...")
print("=" * 80)

classification_results = []

for dataset_name, dataset in classification_data.items():
    print(f"\nDataset: {dataset['description']}")
    
    models = get_classification_models()
    
    for model_name, model in models.items():
        print(f"  Testing {model_name}...", end=' ')
        
        benchmark = ModelBenchmark(model, model_name)
        results = benchmark.benchmark_classification(
            dataset['X_train'], dataset['X_test'],
            dataset['y_train'], dataset['y_test']
        )
        results['dataset'] = dataset_name
        classification_results.append(results)
        
        print(f"Accuracy: {results['accuracy']:.4f}, Time: {results['train_time']:.3f}s")

# Convert to DataFrame for easy analysis
clf_results_df = pd.DataFrame(classification_results)
print("\n✓ Classification benchmarking complete!")

In [None]:
# Benchmark regression models
print("Benchmarking Regression Models...")
print("=" * 80)

regression_results = []

for dataset_name, dataset in regression_data.items():
    print(f"\nDataset: {dataset['description']}")
    
    models = get_regression_models()
    
    for model_name, model in models.items():
        print(f"  Testing {model_name}...", end=' ')
        
        benchmark = ModelBenchmark(model, model_name)
        results = benchmark.benchmark_regression(
            dataset['X_train'], dataset['X_test'],
            dataset['y_train'], dataset['y_test']
        )
        results['dataset'] = dataset_name
        regression_results.append(results)
        
        print(f"R²: {results['r2']:.4f}, Time: {results['train_time']:.3f}s")

# Convert to DataFrame for easy analysis
reg_results_df = pd.DataFrame(regression_results)
print("\n✓ Regression benchmarking complete!")

### Classification Results Analysis

In [None]:
# Display comprehensive classification results
print("Classification Results Summary")
print("=" * 100)

for dataset_name in clf_results_df['dataset'].unique():
    print(f"\n{dataset_name.upper()} Dataset:")
    subset = clf_results_df[clf_results_df['dataset'] == dataset_name].copy()
    subset = subset.sort_values('accuracy', ascending=False)
    
    display_cols = ['model_name', 'accuracy', 'f1', 'train_time', 'pred_time', 'model_size_mb']
    print(subset[display_cols].to_string(index=False))
    
    # Highlight best performers
    best_model = subset.iloc[0]
    print(f"\n  🏆 Best Model: {best_model['model_name']} (Accuracy: {best_model['accuracy']:.4f})")
    print(f"  ⚡ Fastest Training: {subset.loc[subset['train_time'].idxmin(), 'model_name']} ({subset['train_time'].min():.3f}s)")
    print(f"  💾 Smallest Size: {subset.loc[subset['model_size_mb'].idxmin(), 'model_name']} ({subset['model_size_mb'].min():.2f} MB)")

In [None]:
# Visualize classification results
# One panel per classification dataset (two datasets, so a 1x2 grid)
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

datasets = clf_results_df['dataset'].unique()

for idx, dataset_name in enumerate(datasets):
    subset = clf_results_df[clf_results_df['dataset'] == dataset_name].copy()
    subset = subset.sort_values('accuracy', ascending=True)
    
    ax = axes[idx]
    
    # Accuracy comparison
    bars = ax.barh(subset['model_name'], subset['accuracy'], 
                   color=plt.cm.viridis(subset['accuracy']), edgecolor='black')
    ax.set_xlabel('Accuracy', fontsize=11, fontweight='bold')
    ax.set_title(f'{dataset_name.replace("_", " ").title()}', fontsize=12, fontweight='bold')
    ax.set_xlim(0.7, 1.0)
    
    # Add value labels
    for i, (bar, val) in enumerate(zip(bars, subset['accuracy'])):
        ax.text(val + 0.01, bar.get_y() + bar.get_height()/2, 
               f'{val:.3f}', va='center', fontsize=9, fontweight='bold')

plt.tight_layout()
plt.savefig('classification_accuracy_comparison.png', dpi=100, bbox_inches='tight')
plt.show()

print("✓ Visualization saved as 'classification_accuracy_comparison.png'")

### Regression Results Analysis

In [None]:
# Display comprehensive regression results
print("Regression Results Summary")
print("=" * 100)

for dataset_name in reg_results_df['dataset'].unique():
    print(f"\n{dataset_name.upper()} Dataset:")
    subset = reg_results_df[reg_results_df['dataset'] == dataset_name].copy()
    subset = subset.sort_values('r2', ascending=False)
    
    display_cols = ['model_name', 'r2', 'rmse', 'mae', 'train_time', 'pred_time', 'model_size_mb']
    print(subset[display_cols].to_string(index=False))
    
    # Highlight best performers
    best_model = subset.iloc[0]
    print(f"\n  🏆 Best Model: {best_model['model_name']} (R²: {best_model['r2']:.4f})")
    print(f"  ⚡ Fastest Training: {subset.loc[subset['train_time'].idxmin(), 'model_name']} ({subset['train_time'].min():.3f}s)")
    print(f"  💾 Smallest Size: {subset.loc[subset['model_size_mb'].idxmin(), 'model_name']} ({subset['model_size_mb'].min():.2f} MB)")

In [None]:
# Visualize regression results
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

datasets = reg_results_df['dataset'].unique()

for idx, dataset_name in enumerate(datasets):
    subset = reg_results_df[reg_results_df['dataset'] == dataset_name].copy()
    subset = subset.sort_values('r2', ascending=True)
    
    ax = axes[idx]
    
    # R² comparison
    bars = ax.barh(subset['model_name'], subset['r2'], 
                   color=plt.cm.viridis(subset['r2']), edgecolor='black')
    ax.set_xlabel('R² Score', fontsize=11, fontweight='bold')
    ax.set_title(f'{dataset_name.replace("_", " ").title()}', fontsize=12, fontweight='bold')
    
    # Add value labels
    for i, (bar, val) in enumerate(zip(bars, subset['r2'])):
        ax.text(val + 0.01, bar.get_y() + bar.get_height()/2, 
               f'{val:.3f}', va='center', fontsize=9, fontweight='bold')

plt.tight_layout()
plt.savefig('regression_r2_comparison.png', dpi=100, bbox_inches='tight')
plt.show()

print("✓ Visualization saved as 'regression_r2_comparison.png'")
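
The per-dataset tables can be condensed into one number per model by averaging its rank across datasets. The sketch below is one possible way to do that on a list of result dictionaries with `dataset` and `model_name` keys, the same shape as the rows collected during benchmarking. The function name `average_ranks` and the toy scores are hypothetical and for illustration only.

```python
from collections import defaultdict

def average_ranks(results, metric, higher_is_better=True):
    """Average each model's per-dataset rank (1 = best); ties share the mean rank."""
    by_dataset = defaultdict(list)
    for row in results:
        by_dataset[row["dataset"]].append(row)

    rank_lists = defaultdict(list)
    for rows in by_dataset.values():
        ordered = sorted(rows, key=lambda r: r[metric], reverse=higher_is_better)
        i = 0
        while i < len(ordered):
            # Find the run of tied scores starting at position i.
            j = i
            while j + 1 < len(ordered) and ordered[j + 1][metric] == ordered[i][metric]:
                j += 1
            shared_rank = (i + 1 + j + 1) / 2  # mean of the 1-based positions i+1..j+1
            for row in ordered[i:j + 1]:
                rank_lists[row["model_name"]].append(shared_rank)
            i = j + 1

    return {name: sum(r) / len(r) for name, r in rank_lists.items()}

# Hypothetical scores for illustration only (not results from this notebook).
toy_results = [
    {"dataset": "d1", "model_name": "A", "r2": 0.80},
    {"dataset": "d1", "model_name": "B", "r2": 0.85},
    {"dataset": "d1", "model_name": "C", "r2": 0.70},
    {"dataset": "d2", "model_name": "A", "r2": 0.60},
    {"dataset": "d2", "model_name": "B", "r2": 0.60},
    {"dataset": "d2", "model_name": "C", "r2": 0.50},
]
print(average_ranks(toy_results, "r2"))
```

Ranks are less sensitive than raw scores to one dataset with an unusually wide score range. With pandas, a similar per-dataset ranking is available through `groupby('dataset')[metric].rank(ascending=False)`.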

### Speed vs Accuracy Trade-off

In [None]:
# Visualize speed-accuracy tradeoff for classification
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

for idx, dataset_name in enumerate(clf_results_df['dataset'].unique()):
    subset = clf_results_df[clf_results_df['dataset'] == dataset_name]
    
    ax = axes[idx]
    
    # Scatter plot: training time vs accuracy
    scatter = ax.scatter(subset['train_time'], subset['accuracy'], 
                        s=200, alpha=0.6, c=range(len(subset)), cmap='viridis',
                        edgecolors='black', linewidth=1.5)
    
    # Annotate points
    for _, row in subset.iterrows():
        ax.annotate(row['model_name'], 
                   (row['train_time'], row['accuracy']),
                   xytext=(5, 5), textcoords='offset points',
                   fontsize=8, fontweight='bold')
    
    ax.set_xlabel('Training Time (seconds)', fontsize=11, fontweight='bold')
    ax.set_ylabel('Accuracy', fontsize=11, fontweight='bold')
    ax.set_title(f'{dataset_name.replace("_", " ").title()}\nSpeed vs Accuracy Trade-off', 
                fontsize=12, fontweight='bold')
    ax.grid(True, alpha=0.3)
    
    # Highlight Pareto frontier (best speed-accuracy combinations)
    # Models that are not dominated by any other model
    pareto_optimal = []
    for i, row1 in subset.iterrows():
        dominated = False
        for j, row2 in subset.iterrows():
            if i != j:
                # row2 dominates row1 if it's both faster and more accurate
                if row2['train_time'] < row1['train_time'] and row2['accuracy'] > row1['accuracy']:
                    dominated = True
                    break
        if not dominated:
            pareto_optimal.append(row1)
    
    if pareto_optimal:
        pareto_df = pd.DataFrame(pareto_optimal).sort_values('train_time')
        ax.plot(pareto_df['train_time'], pareto_df['accuracy'], 
               'r--', linewidth=2, alpha=0.5, label='Pareto Frontier')
        ax.legend()

plt.tight_layout()
plt.savefig('speed_accuracy_tradeoff.png', dpi=100, bbox_inches='tight')
plt.show()

print("✓ Speed-accuracy tradeoff visualization saved!")
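
The Pareto check can also be written as a small standalone function. The sketch below uses the common weak-dominance definition: a model is dominated when another is at least as fast and at least as accurate, and strictly better on one of the two. That is slightly stricter than requiring a strict improvement on both axes, so exact ties are handled differently. The name `pareto_front` and the toy values are hypothetical.

```python
def pareto_front(points):
    """Return the non-dominated (name, time, accuracy) tuples, sorted by time.

    A point is dominated if another point is at least as fast and at least as
    accurate, and strictly better on one of the two.
    """
    front = []
    for name, t, acc in points:
        dominated = any(
            (t2 <= t and acc2 >= acc) and (t2 < t or acc2 > acc)
            for _, t2, acc2 in points
        )
        if not dominated:
            front.append((name, t, acc))
    return sorted(front, key=lambda p: p[1])

# Hypothetical (model, train_time_s, accuracy) values, for illustration only.
toy = [
    ("tree", 0.01, 0.92),
    ("forest", 0.30, 0.96),
    ("boost", 0.50, 0.97),
    ("slow", 0.60, 0.95),
    ("tie", 0.30, 0.94),
]
print([name for name, _, _ in pareto_front(toy)])
```

In this toy data, `slow` and `tie` are both dominated by `forest`, so only three models remain on the frontier.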

## 5. Interpretability Comparison

Compare ensemble methods on interpretability dimensions:
- Feature importance availability
- SHAP support
- Model complexity
- Debugging ease

In [None]:
# Create interpretability comparison matrix
interpretability_matrix = {
    'Model': [
        'Decision Tree', 'Bagging', 'Random Forest', 'AdaBoost',
        'Gradient Boosting', 'XGBoost', 'LightGBM', 'CatBoost'
    ],
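    # Note: scikit-learn's Bagging estimators do not expose `feature_importances_`;
    # importances have to be averaged over the fitted base estimators.
    'Feature Importance': [
        'Native', 'Via base estimators', 'Native', 'Native',
        'Native', 'Native', 'Native', 'Native'
    ],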
    'SHAP Support': [
        'TreeExplainer', 'Limited', 'TreeExplainer', 'Limited',
        'TreeExplainer', 'TreeExplainer', 'TreeExplainer', 'TreeExplainer'
    ],
    'Complexity': [
        'Low', 'Medium', 'Medium', 'Medium',
        'High', 'High', 'High', 'High'
    ],
    'Debugging Ease': [
        'Easy', 'Hard', 'Hard', 'Medium',
        'Hard', 'Medium', 'Medium', 'Medium'
    ],
    'Visualization': [
        'Tree plots', 'Limited', 'Limited', 'Limited',
        'Limited', 'Tree plots', 'Tree plots', 'Tree plots'
    ],
    'Interpretability Score': [
        9, 4, 5, 6,
        5, 7, 7, 7
    ]
}

interp_df = pd.DataFrame(interpretability_matrix)
print("Interpretability Comparison Matrix")
print("=" * 100)
print(interp_df.to_string(index=False))

# Visualize interpretability scores
plt.figure(figsize=(12, 6))
colors = plt.cm.RdYlGn(interp_df['Interpretability Score'] / 10)
bars = plt.barh(interp_df['Model'], interp_df['Interpretability Score'], 
               color=colors, edgecolor='black', linewidth=1.5)
plt.xlabel('Interpretability Score (1-10)', fontsize=11, fontweight='bold')
plt.title('Model Interpretability Comparison\n(Higher = More Interpretable)', 
         fontsize=12, fontweight='bold')
plt.xlim(0, 10)

# Add value labels
for bar, val in zip(bars, interp_df['Interpretability Score']):
    plt.text(val + 0.2, bar.get_y() + bar.get_height()/2, 
            f'{val}/10', va='center', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.savefig('interpretability_comparison.png', dpi=100, bbox_inches='tight')
plt.show()

### SHAP Value Demonstration

Demonstrate SHAP interpretation for tree-based ensembles.

In [None]:
# Train models for SHAP demonstration
X_train = classification_data['breast_cancer']['X_train']
X_test = classification_data['breast_cancer']['X_test']
y_train = classification_data['breast_cancer']['y_train']

# Train XGBoost and Random Forest for comparison
xgb_model = xgb.XGBClassifier(n_estimators=50, max_depth=3, random_state=RANDOM_STATE, verbosity=0)
xgb_model.fit(X_train, y_train)

rf_model = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=RANDOM_STATE)
rf_model.fit(X_train, y_train)

# Create SHAP explainers
xgb_explainer = shap.TreeExplainer(xgb_model)
rf_explainer = shap.TreeExplainer(rf_model)

# Calculate SHAP values for test set (use subset for speed)
sample_size = min(100, len(X_test))
X_sample = X_test[:sample_size]

xgb_shap_values = xgb_explainer.shap_values(X_sample)
rf_shap_values = rf_explainer.shap_values(X_sample)

print("✓ SHAP values calculated successfully!")
print(f"  Sample size: {sample_size}")
print(f"  XGBoost SHAP values shape: {xgb_shap_values.shape if isinstance(xgb_shap_values, np.ndarray) else 'list of arrays'}")

In [None]:
# Visualize SHAP summary plots
fig, axes = plt.subplots(1, 2, figsize=(18, 6))

# XGBoost SHAP
plt.sca(axes[0])
shap.summary_plot(xgb_shap_values, X_sample, show=False, max_display=10)
axes[0].set_title('XGBoost - SHAP Feature Importance', fontsize=12, fontweight='bold')

# Random Forest SHAP  
plt.sca(axes[1])
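# For binary classification, older SHAP versions return a list with one array per
# class, while newer versions return one 3-D array (samples, features, classes).
# Select the positive class (index 1) in either case.
if isinstance(rf_shap_values, list):
    shap_vals = rf_shap_values[1]
elif rf_shap_values.ndim == 3:
    shap_vals = rf_shap_values[:, :, 1]
else:
    shap_vals = rf_shap_values
shap.summary_plot(shap_vals, X_sample, show=False, max_display=10)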
axes[1].set_title('Random Forest - SHAP Feature Importance', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.savefig('shap_comparison.png', dpi=100, bbox_inches='tight')
plt.show()

print("\n📊 Key Insights:")
print("  • SHAP values show both feature importance AND direction of impact")
print("  • Red = high feature value, Blue = low feature value")
print("  • Position on x-axis shows positive or negative impact on prediction")
print("  • XGBoost and Random Forest may identify different important features")
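
To make the idea behind SHAP values concrete, the sketch below computes exact Shapley values by brute force for a tiny toy function. This is not how `TreeExplainer` works internally (it uses an efficient tree-specific algorithm and a background expectation rather than a fixed baseline); the function `shapley_values`, the toy model, and the all-zero baseline are assumptions made purely for illustration.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x, with absent features set to baseline."""
    n = len(x)
    values = [0.0] * n

    def evaluate(subset):
        # Features in `subset` take their value from x, the rest from baseline.
        return f([x[i] if i in subset else baseline[i] for i in range(n)])

    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                values[i] += weight * (evaluate(s | {i}) - evaluate(s))
    return values

# Toy model with an interaction term (hypothetical, for illustration only).
def toy_model(v):
    return 2.0 * v[0] + v[1] * v[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
print(sum(phi), toy_model(x) - toy_model(baseline))
```

The two numbers on the last line agree up to floating-point error: the attributions add up to the difference between the prediction and the baseline prediction. The interaction term `v[1] * v[2]` is split equally between the two features involved. The brute-force loop grows exponentially with the number of features, which is why specialized explainers are needed for real models.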

## 6. Hyperparameter Sensitivity Analysis

Analyze how sensitive each ensemble method is to key hyperparameters.

In [None]:
# Test n_estimators sensitivity (number of trees/iterations)
n_estimators_range = [10, 25, 50, 100, 200]

X_train = classification_data['breast_cancer']['X_train']
X_test = classification_data['breast_cancer']['X_test']
y_train = classification_data['breast_cancer']['y_train']
y_test = classification_data['breast_cancer']['y_test']

sensitivity_results = []

print("Testing n_estimators sensitivity...")

for n_est in n_estimators_range:
    # Random Forest
    rf = RandomForestClassifier(n_estimators=n_est, max_depth=10, random_state=RANDOM_STATE, n_jobs=-1)
    rf.fit(X_train, y_train)
    rf_score = accuracy_score(y_test, rf.predict(X_test))
    sensitivity_results.append({'Model': 'Random Forest', 'n_estimators': n_est, 'Accuracy': rf_score})
    
    # XGBoost
    xgb_model = xgb.XGBClassifier(n_estimators=n_est, max_depth=3, random_state=RANDOM_STATE, verbosity=0)
    xgb_model.fit(X_train, y_train)
    xgb_score = accuracy_score(y_test, xgb_model.predict(X_test))
    sensitivity_results.append({'Model': 'XGBoost', 'n_estimators': n_est, 'Accuracy': xgb_score})
    
    # LightGBM
    lgb_model = lgb.LGBMClassifier(n_estimators=n_est, max_depth=3, random_state=RANDOM_STATE, verbose=-1)
    lgb_model.fit(X_train, y_train)
    lgb_score = accuracy_score(y_test, lgb_model.predict(X_test))
    sensitivity_results.append({'Model': 'LightGBM', 'n_estimators': n_est, 'Accuracy': lgb_score})
    
    print(f"  n_estimators={n_est}: RF={rf_score:.4f}, XGB={xgb_score:.4f}, LGB={lgb_score:.4f}")

sensitivity_df = pd.DataFrame(sensitivity_results)

# Visualize
plt.figure(figsize=(12, 6))
for model_name in sensitivity_df['Model'].unique():
    subset = sensitivity_df[sensitivity_df['Model'] == model_name]
    plt.plot(subset['n_estimators'], subset['Accuracy'], marker='o', 
            linewidth=2, markersize=8, label=model_name)

plt.xlabel('Number of Estimators', fontsize=11, fontweight='bold')
plt.ylabel('Accuracy', fontsize=11, fontweight='bold')
plt.title('Hyperparameter Sensitivity: n_estimators\n(Breast Cancer Dataset)', 
         fontsize=12, fontweight='bold')
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('n_estimators_sensitivity.png', dpi=100, bbox_inches='tight')
plt.show()

print("\n📊 Insights:")
print("  • Performance typically plateaus after 50-100 estimators")
print("  • Boosting methods (XGBoost, LightGBM) may converge faster")
print("  • More estimators = longer training but diminishing returns")

In [None]:
# Test learning rate sensitivity (for boosting methods)
learning_rates = [0.01, 0.05, 0.1, 0.3, 0.5, 1.0]

lr_results = []

print("Testing learning rate sensitivity...")

for lr in learning_rates:
    # XGBoost
    xgb_model = xgb.XGBClassifier(n_estimators=100, learning_rate=lr, max_depth=3, 
                                 random_state=RANDOM_STATE, verbosity=0)
    xgb_model.fit(X_train, y_train)
    xgb_score = accuracy_score(y_test, xgb_model.predict(X_test))
    lr_results.append({'Model': 'XGBoost', 'learning_rate': lr, 'Accuracy': xgb_score})
    
    # LightGBM
    lgb_model = lgb.LGBMClassifier(n_estimators=100, learning_rate=lr, max_depth=3, 
                                  random_state=RANDOM_STATE, verbose=-1)
    lgb_model.fit(X_train, y_train)
    lgb_score = accuracy_score(y_test, lgb_model.predict(X_test))
    lr_results.append({'Model': 'LightGBM', 'learning_rate': lr, 'Accuracy': lgb_score})
    
    # Gradient Boosting
    gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=lr, max_depth=3, 
                                         random_state=RANDOM_STATE)
    gb_model.fit(X_train, y_train)
    gb_score = accuracy_score(y_test, gb_model.predict(X_test))
    lr_results.append({'Model': 'Gradient Boosting', 'learning_rate': lr, 'Accuracy': gb_score})
    
    print(f"  learning_rate={lr:.2f}: XGB={xgb_score:.4f}, LGB={lgb_score:.4f}, GB={gb_score:.4f}")

lr_df = pd.DataFrame(lr_results)

# Visualize
plt.figure(figsize=(12, 6))
for model_name in lr_df['Model'].unique():
    subset = lr_df[lr_df['Model'] == model_name]
    plt.plot(subset['learning_rate'], subset['Accuracy'], marker='o', 
            linewidth=2, markersize=8, label=model_name)

plt.xlabel('Learning Rate', fontsize=11, fontweight='bold')
plt.ylabel('Accuracy', fontsize=11, fontweight='bold')
plt.title('Hyperparameter Sensitivity: Learning Rate\n(Boosting Methods Only)', 
         fontsize=12, fontweight='bold')
plt.xscale('log')
plt.legend(fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('learning_rate_sensitivity.png', dpi=100, bbox_inches='tight')
plt.show()

print("\n📊 Insights:")
print("  • Learning rate 0.1 is often a good default")
print("  • Too high (>0.5): risk of overfitting and instability")
print("  • Too low (<0.01): very slow convergence, need more estimators")
print("  • Optimal rate depends on n_estimators (inverse relationship)")
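
The inverse relationship between learning rate and the number of estimators can be checked directly. The following is a minimal sketch, assuming scikit-learn's `GradientBoostingClassifier` on the breast cancer dataset; the dataset, the split, and the three rates are illustrative choices, not the configuration used in the sweep above. It uses `staged_predict` to score the model after every boosting round and reports the round at which test accuracy first peaks.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Illustrative data and split (assumption: not the notebook's X_train / X_test)
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

best_rounds = {}
for lr in (0.01, 0.1, 1.0):
    model = GradientBoostingClassifier(
        n_estimators=300, learning_rate=lr, max_depth=3, random_state=42
    )
    model.fit(X_tr, y_tr)

    # staged_predict yields the ensemble's predictions after each boosting round
    staged_acc = [accuracy_score(y_te, pred) for pred in model.staged_predict(X_te)]
    best_acc = max(staged_acc)
    best_round = staged_acc.index(best_acc) + 1  # first round reaching the peak
    best_rounds[lr] = (best_round, best_acc)

    print(f"learning_rate={lr:<5} best accuracy={best_acc:.4f} at {best_round} trees")
```

Picking the best round on the test set is optimistic and is done here only to keep the sketch short; in practice the same curve would be read off a separate validation split.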

## 7. Decision Framework

Based on our analysis, here's a practical decision framework for selecting ensemble methods.

In [None]:
# Create decision framework table
decision_framework = {
    'Scenario': [
        'Small dataset (<1000 samples)',
        'Large dataset (>100k samples)',
        'Need high interpretability',
        'Categorical features',
        'Speed critical (fast inference)',
        'Maximum accuracy (any cost)',
        'Limited memory',
        'Quick prototyping',
        'Production deployment',
        'Imbalanced classes'
    ],
    'Recommended Method': [
        'Random Forest',
        'LightGBM',
        'Decision Tree → Random Forest',
        'CatBoost',
        'LightGBM',
        'Stacking (XGB + LGB + RF)',
        'LightGBM or Decision Tree',
        'Random Forest or XGBoost',
        'XGBoost or LightGBM',
        'XGBoost or LightGBM (with scale_pos_weight)'
    ],
    'Alternative': [
        'XGBoost',
        'XGBoost',
        'Gradient Boosting + SHAP',
        'LightGBM (categorical_feature)',
        'XGBoost',
        'Voting Ensemble',
        'Bagging',
        'Gradient Boosting',
        'CatBoost',
        'Random Forest (balanced class weights)'
    ],
    'Rationale': [
        'RF robust to overfitting on small data',
        'LGB optimized for large-scale data',
        'Tree models + feature importance',
        'Native categorical handling',
        'Fast inference, small model size',
        'Combine strengths of multiple models',
        'Memory-efficient implementations',
        'Good default performance, easy to use',
        'Proven reliability, good tooling',
        'Built-in class weighting support'
    ]
}

framework_df = pd.DataFrame(decision_framework)
print("Decision Framework for Ensemble Method Selection")
print("=" * 120)
print(framework_df.to_string(index=False))

print("\n\n💡 General Guidelines:")
print("\n1. START SIMPLE:")
print("   • Always baseline with single decision tree or logistic regression")
print("   • Then try Random Forest (easy, robust, good default)")
print("   • If RF works well, try gradient boosting for extra performance")

print("\n2. CHOOSE BOOSTING METHOD:")
print("   • XGBoost: Most mature, best documentation, widest support")
print("   • LightGBM: Fastest, most memory-efficient, great for large data")
print("   • CatBoost: Best for categorical features, good default parameters")

print("\n3. WHEN TO STACK/VOTE:")
print("   • Competition: Stack everything for maximum performance")
print("   • Production: Usually stick to single best model (simpler)")
print("   • Voting can help if models disagree in useful ways")

print("\n4. HYPERPARAMETER TUNING:")
print("   • Start with defaults")
print("   • Tune n_estimators first (use early stopping for boosting)")
print("   • Then max_depth or num_leaves")
print("   • Finally learning_rate (lower often generalizes better, but needs more trees and trains slower)")

print("\n5. VALIDATION STRATEGY:")
print("   • Always use cross-validation for reliable estimates")
print("   • Watch for overfitting: train vs validation gap")
print("   • Use stratified CV for imbalanced classification")
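
The validation guidelines above can be sketched in a few lines. This example assumes scikit-learn's `StratifiedKFold` and `cross_validate` on the breast cancer dataset, with two illustrative candidate models; the names `candidates` and `cv_summary` are introduced here for illustration only. It reports the mean and standard deviation of the validation scores together with the train-validation gap.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

X, y = load_breast_cancer(return_X_y=True)

# Stratified folds keep the class ratio roughly constant in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Illustrative candidates (assumption: not the full benchmark suite)
candidates = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}

cv_summary = {}
for name, model in candidates.items():
    scores = cross_validate(
        model, X, y, cv=cv, scoring="accuracy", return_train_score=True
    )
    train_mean = scores["train_score"].mean()
    val_mean = scores["test_score"].mean()
    cv_summary[name] = {
        "train": train_mean,
        "val": val_mean,
        "val_std": scores["test_score"].std(),
        "gap": train_mean - val_mean,  # a large gap suggests overfitting
    }
    print(
        f"{name}: val={val_mean:.4f} ± {cv_summary[name]['val_std']:.4f}, "
        f"train-val gap={cv_summary[name]['gap']:.4f}"
    )
```

Reporting the spread alongside the mean makes it easier to tell whether a difference between two models is larger than the fold-to-fold noise.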

## 8. Production Considerations

Practical considerations for deploying ensemble models in production.

In [None]:
# Model serialization comparison
print("Model Serialization Comparison")
print("=" * 80)

# Train models
models_to_serialize = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE),
    'XGBoost': xgb.XGBClassifier(n_estimators=100, random_state=RANDOM_STATE, verbosity=0),
    'LightGBM': lgb.LGBMClassifier(n_estimators=100, random_state=RANDOM_STATE, verbose=-1)
}

serialization_results = []

for name, model in models_to_serialize.items():
    model.fit(X_train, y_train)
    
    # Test pickle (perf_counter is a monotonic, high-resolution timer)
    pickle_path = f'/tmp/model_{name.replace(" ", "_")}.pkl'
    start = time.perf_counter()
    with open(pickle_path, 'wb') as f:
        pickle.dump(model, f)
    pickle_time = time.perf_counter() - start
    pickle_size = Path(pickle_path).stat().st_size / (1024 * 1024)  # MB
    
    # Test joblib
    joblib_path = f'/tmp/model_{name.replace(" ", "_")}.joblib'
    start = time.perf_counter()
    joblib.dump(model, joblib_path)
    joblib_time = time.perf_counter() - start
    joblib_size = Path(joblib_path).stat().st_size / (1024 * 1024)  # MB
    
    # Load and inference time
    start = time.perf_counter()
    loaded_model = joblib.load(joblib_path)
    _ = loaded_model.predict(X_test)
    inference_time = time.perf_counter() - start
    
    serialization_results.append({
        'Model': name,
        'Pickle Size (MB)': pickle_size,
        'Joblib Size (MB)': joblib_size,
        'Pickle Save Time (s)': pickle_time,
        'Save Time (s)': joblib_time,  # joblib save time
        'Load + Inference (s)': inference_time
    })
    
    print(f"{name}:")
    print(f"  Pickle: {pickle_size:.2f} MB")
    print(f"  Joblib: {joblib_size:.2f} MB (recommended)")
    print(f"  Load + Inference: {inference_time:.4f}s")
    print()

serial_df = pd.DataFrame(serialization_results)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Model size comparison
ax = axes[0]
x = np.arange(len(serial_df))
width = 0.35
ax.bar(x - width/2, serial_df['Pickle Size (MB)'], width, label='Pickle', alpha=0.8)
ax.bar(x + width/2, serial_df['Joblib Size (MB)'], width, label='Joblib', alpha=0.8)
ax.set_ylabel('Size (MB)', fontweight='bold')
ax.set_title('Serialized Model Size', fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(serial_df['Model'])
ax.legend()
ax.grid(True, alpha=0.3, axis='y')

# Inference latency
ax = axes[1]
bars = ax.bar(serial_df['Model'], serial_df['Load + Inference (s)'], 
             color=plt.cm.viridis(np.linspace(0, 1, len(serial_df))), 
             edgecolor='black', alpha=0.8)
ax.set_ylabel('Time (seconds)', fontweight='bold')
ax.set_title('Load + Inference Latency', fontweight='bold')
ax.grid(True, alpha=0.3, axis='y')

# Add value labels
for bar, val in zip(bars, serial_df['Load + Inference (s)']):
    ax.text(bar.get_x() + bar.get_width()/2, val + 0.001, 
           f'{val:.4f}s', ha='center', va='bottom', fontsize=9, fontweight='bold')

plt.tight_layout()
plt.savefig('production_metrics.png', dpi=100, bbox_inches='tight')
plt.show()

print("\n💡 Production Recommendations:")
print("  • Use joblib for serialization (typically more efficient than pickle for models holding large NumPy arrays)")
print("  • LightGBM often has smallest deployment size")
print("  • Consider model compression for edge deployment")
print("  • Monitor inference latency in production (p50, p95, p99)")
print("  • Set up model versioning and A/B testing infrastructure")
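
The latency percentiles mentioned above (p50, p95, p99) can be estimated offline before deployment. The sketch below assumes single-row requests against an in-memory scikit-learn model; `measure_latency_ms` is a hypothetical helper written for this example, not part of any library, and real serving latency will also include network and deserialization overhead.

```python
import time

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)


def measure_latency_ms(model, X, n_requests=200):
    """Time single-row predictions and return p50/p95/p99 in milliseconds.

    Hypothetical helper for illustration: it simulates one request per row.
    """
    timings = []
    for i in range(n_requests):
        row = X[i % len(X)].reshape(1, -1)  # one sample shaped as a 2D batch
        start = time.perf_counter()
        model.predict(row)
        timings.append((time.perf_counter() - start) * 1000.0)
    p50, p95, p99 = np.percentile(timings, [50, 95, 99])
    return {"p50": p50, "p95": p95, "p99": p99}


stats = measure_latency_ms(model, X)
print(f"p50={stats['p50']:.2f} ms, p95={stats['p95']:.2f} ms, p99={stats['p99']:.2f} ms")
```

Tail percentiles matter more than the mean for latency budgets, because a small share of slow requests can dominate the user-visible experience.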

### Production Deployment Checklist

In [None]:
print("Production Deployment Checklist for Ensemble Models")
print("=" * 80)

checklist = """
PRE-DEPLOYMENT:
□ Model validation on held-out test set
□ Cross-validation scores documented
□ Feature importance analysis completed
□ Model interpretability requirements met
□ Hyperparameters logged and reproducible
□ Training data versioned
□ Model artifacts saved (joblib recommended)
□ Preprocessing pipeline included with model
□ Input validation logic implemented
□ Output calibration checked (for probabilities)

DEPLOYMENT:
□ Inference latency measured (p50, p95, p99)
□ Memory footprint acceptable
□ Batch vs real-time inference decided
□ Model versioning system in place
□ A/B testing framework ready
□ Rollback procedure defined
□ Error handling for edge cases
□ API endpoint designed and documented
□ Load testing completed
□ Logging and monitoring configured

POST-DEPLOYMENT:
□ Prediction distribution monitoring
□ Feature drift detection
□ Model performance tracking
□ Retraining schedule defined
□ Alert thresholds set
□ Business metrics tracked
□ Feedback loop for model improvement
□ Documentation updated
□ Team trained on model maintenance
□ Incident response plan ready
"""

print(checklist)

print("\n⚠️  Common Production Pitfalls:")
print("  1. Training-serving skew (different preprocessing)")
print("  2. Data drift (features change over time)")
print("  3. Missing values handled differently")
print("  4. Categorical encoding inconsistencies")
print("  5. Model staleness (not retrained regularly)")
print("  6. Insufficient monitoring and alerting")
print("  7. No rollback plan when model fails")
print("  8. Overfitting to cross-validation folds")
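
Several of these pitfalls (training-serving skew, inconsistent missing-value handling) come from preprocessing that lives outside the saved model. One way to reduce that risk is to bundle preprocessing and estimator in a single scikit-learn `Pipeline` and serialize the whole object. The sketch below is an illustration under assumed choices: the median imputer, the scaler, the Random Forest, and the file name `rf_pipeline.joblib` are examples, not the notebook's actual deployment setup.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Preprocessing and model are fitted together and travel as one artifact
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # same missing-value rule at serving time
    ("scale", StandardScaler()),                   # not required for trees; shown for illustration
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipeline.fit(X_tr, y_tr)

# Round-trip through joblib, as a serving process would
with tempfile.TemporaryDirectory() as tmp_dir:
    artifact = Path(tmp_dir) / "rf_pipeline.joblib"  # hypothetical artifact name
    joblib.dump(pipeline, artifact)
    served = joblib.load(artifact)

same_predictions = np.array_equal(pipeline.predict(X_te), served.predict(X_te))
print(f"Reloaded pipeline reproduces predictions: {same_predictions}")
```

Because the serving side calls `predict` on raw features, there is no second copy of the preprocessing logic that could drift away from the training code. Only load joblib or pickle files from trusted sources, since loading can execute arbitrary code.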

## 9. Best Practices Summary

In [None]:
print("Best Practices for Ensemble Methods")
print("=" * 80)

best_practices = """
1. START SIMPLE, INCREASE COMPLEXITY
   • Baseline: Single decision tree or logistic regression
   • Next: Random Forest (robust, good defaults)
   • Then: Gradient boosting if you need more performance
   • Finally: Stacking/voting for competitions

2. ALWAYS BASELINE FIRST
   • Establish simple baseline performance
   • Measure improvement from ensemble methods
   • Justify complexity with clear gains
   • Document baseline and ensemble comparison

3. CROSS-VALIDATE PROPERLY
   • Use stratified K-fold for classification
   • Use time-based splits for time series
   • Report mean and std of CV scores
   • Watch for train-val gap (overfitting indicator)

4. MONITOR FOR OVERFITTING
   • Track train vs validation performance
   • Use learning curves to diagnose
   • Apply early stopping for boosting methods
   • Regularize (max_depth, min_samples_split, etc.)

5. CONSIDER BUSINESS CONSTRAINTS
   • Interpretability requirements
   • Inference latency budget
   • Model size limitations
   • Training time acceptable
   • Cost of different error types

6. HYPERPARAMETER TUNING STRATEGY
   • Start with default parameters
   • Use random search or Bayesian optimization
   • Tune most important parameters first:
     - n_estimators (with early stopping)
     - max_depth or num_leaves
     - learning_rate (inversely related to n_estimators)
   • Validate on separate test set

7. FEATURE ENGINEERING MATTERS
   • Good features > complex models
   • Tree ensembles handle non-linearity and feature interactions well
   • But they only approximate ratios, differences, and similar arithmetic combinations through many splits
   • Feature engineering often more impactful than model choice

8. PRODUCTION READINESS
   • Save preprocessing pipeline with model
   • Version everything (data, code, models)
   • Monitor drift and performance
   • Plan for retraining schedule
   • Have rollback strategy

9. INTERPRETABILITY
   • Use feature importance for global understanding
   • Use SHAP for instance-level explanations
   • Validate that important features make business sense
   • Consider simpler model if interpretation is critical

10. CONTINUOUS IMPROVEMENT
    • Collect feedback on model predictions
    • Retrain with new data regularly
    • A/B test model updates
    • Keep iterating on features and hyperparameters
"""

print(best_practices)
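
The tuning strategy in practice 6 can be sketched with scikit-learn's `RandomizedSearchCV`. This is a minimal illustration under assumed settings: the search ranges, the small `n_iter`, and the breast cancer dataset are chosen to keep the run short and are not tuned recommendations. The search runs on the training split only, and the held-out split is scored once at the end.

```python
from scipy.stats import loguniform, randint
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Illustrative search space (assumption: ranges picked for a quick demo)
param_distributions = {
    "n_estimators": randint(50, 201),          # integers in [50, 200]
    "max_depth": randint(2, 6),                # integers in [2, 5]
    "learning_rate": loguniform(0.01, 0.5),    # sampled on a log scale
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=6,            # number of sampled configurations
    cv=3,
    scoring="accuracy",
    random_state=42,
)
search.fit(X_tr, y_tr)

# The test split is touched only once, after the search has finished
test_accuracy = search.score(X_te, y_te)
print(f"Best params: {search.best_params_}")
print(f"CV accuracy: {search.best_score_:.4f}, test accuracy: {test_accuracy:.4f}")
```

Sampling `learning_rate` on a log scale reflects that its useful values span orders of magnitude, which a uniform grid would cover poorly.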

## Exercises

Apply what you've learned about comparing and selecting ensemble methods.

### Exercise 1: Custom Benchmark on New Dataset

Choose a dataset from scikit-learn (or load your own) and run a complete benchmark:
1. Load and split the data
2. Test at least 5 ensemble methods
3. Compare accuracy, speed, and memory
4. Create visualizations
5. Make a recommendation with justification

In [None]:
# Your code here
# Suggested datasets: load_digits(), fetch_covtype(), make_classification()
# Use the ModelBenchmark class defined earlier


### Exercise 2: Method Selection for Specific Scenarios

For each scenario below, choose the best ensemble method and explain why:

**Scenario A**: Medical diagnosis system
- 5,000 patient records
- High interpretability required (doctors need to understand)
- False negatives very costly

**Scenario B**: Ad click prediction
- 10 million examples
- Inference must be <10ms
- Many categorical features (user_id, location, etc.)

**Scenario C**: Kaggle competition
- Tabular data, 100k rows
- Any method allowed
- Only metric: AUC score

In [None]:
# Your analysis here
# For each scenario:
# 1. List key constraints
# 2. Choose method and explain
# 3. Suggest hyperparameters
# 4. Note potential issues


### Exercise 3: Production Optimization Challenge

You've deployed an XGBoost model that's too slow for production:
- Current: 500ms inference latency
- Target: <50ms
- Must maintain >95% of current accuracy

Explore optimization strategies:
1. Reduce number of estimators
2. Reduce max depth
3. Switch to LightGBM
4. Feature selection (fewer features)
5. Model compression

Implement and compare at least 3 strategies.

In [None]:
# Your code here
# 1. Create baseline slow model
# 2. Test optimization strategies
# 3. Track latency and accuracy trade-off
# 4. Visualize results


### Exercise 4: Hyperparameter Sensitivity Experiment

Design an experiment to test how sensitive different ensemble methods are to suboptimal hyperparameters:

1. Choose 3 ensemble methods
2. For each, deliberately use bad hyperparameters (too shallow, too few trees, etc.)
3. Measure performance degradation vs optimal
4. Determine which method is most "robust" to poor tuning

**Hypothesis**: Some methods (like Random Forest) are less sensitive to hyperparameters than others (like XGBoost).

In [None]:
# Your code here
# Create systematic hyperparameter sweep
# Test: {very bad, bad, default, good, optimal}
# Quantify sensitivity


## Summary

### Key Takeaways

1. **No Universal Best Method**: Choice depends on dataset, requirements, and constraints

2. **Modern Gradient Boosting Dominates Tabular Data**:
   - XGBoost: Most mature, best documentation
   - LightGBM: Fastest, most scalable
   - CatBoost: Best for categorical features

3. **Random Forest: Excellent Default Choice**:
   - Robust, less sensitive to hyperparameters
   - Good baseline before trying boosting
   - Works well on small to medium datasets

4. **Trade-offs Matter**:
   - Accuracy vs Speed
   - Performance vs Interpretability
   - Complexity vs Maintainability

5. **Production Considerations**:
   - Model size and inference latency
   - Serialization and versioning
   - Monitoring and maintenance
   - Business constraints

6. **Hyperparameter Sensitivity**:
   - Some methods more robust than others
   - n_estimators and learning_rate most impactful
   - Use automated tuning (Optuna, Hyperopt)

7. **Decision Framework**:
   - Start simple → Random Forest → Boosting → Stacking
   - Match method to dataset characteristics
   - Consider business and technical constraints
   - Validate thoroughly before deployment

### What's Next?

**Module 11**: Final Kaggle Competition Project
- Apply everything learned in a complete workflow
- Titanic dataset (classic benchmark)
- Full pipeline: EDA → Feature Engineering → Model Selection → Optimization
- Production-ready code and best practices

### Additional Resources

- **XGBoost Documentation**: https://xgboost.readthedocs.io/
- **LightGBM Documentation**: https://lightgbm.readthedocs.io/
- **CatBoost Documentation**: https://catboost.ai/docs/
- **SHAP Documentation**: https://shap.readthedocs.io/
- **Paper**: "Gradient Boosting Machines: A Tutorial" (Natekin & Knoll, 2013)
- **Kaggle**: Practice on real competitions to master ensemble methods