# Module 10: Model Comparison and Selection

**Difficulty**: ⭐⭐⭐
**Estimated Time**: 60 minutes
**Prerequisites**: 
- All previous modules (00-09)

## Learning Objectives
By the end of this notebook, you will be able to:
1. Compare all ensemble methods on multiple datasets and metrics
2. Understand speed vs accuracy trade-offs for each method
3. Choose the right ensemble method for different scenarios
4. Create a decision framework for model selection
5. Benchmark ensemble methods systematically
6. Judge when an ensemble is worth its added complexity over a simpler model

## 1. Setup and Imports

In [None]:
# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from time import time
import warnings
warnings.filterwarnings('ignore')  # Silence library warnings to keep cell output readable

# Scikit-learn
from sklearn.datasets import (
    load_breast_cancer, load_wine, load_digits,
    make_classification, make_regression
)
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import (
    RandomForestClassifier, RandomForestRegressor,
    GradientBoostingClassifier, GradientBoostingRegressor,
    AdaBoostClassifier, BaggingClassifier,
    VotingClassifier, StackingClassifier
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error, r2_score

# Boosting libraries
from xgboost import XGBClassifier, XGBRegressor
from lightgbm import LGBMClassifier, LGBMRegressor
try:
    from catboost import CatBoostClassifier, CatBoostRegressor
    CATBOOST_AVAILABLE = True
except ImportError:
    CATBOOST_AVAILABLE = False

# Configuration
%matplotlib inline
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
np.random.seed(42)

## 2. Comprehensive Benchmark Framework

In [None]:
def benchmark_models(models_dict, X_train, X_test, y_train, y_test, task='classification'):
    """
    Benchmark multiple models and return performance metrics.
    
    Parameters:
    -----------
    models_dict : dict
        Dictionary of {model_name: model_instance}
    X_train, X_test, y_train, y_test : arrays
        Training and test data
    task : str
        'classification' or 'regression'
    
    Returns:
    --------
    DataFrame with benchmark results
    """
    results = []
    
    for name, model in models_dict.items():
        print(f"Training {name}...", end=' ')
        
        # Training time
        start = time()
        model.fit(X_train, y_train)
        train_time = time() - start
        
        # Prediction time
        start = time()
        y_pred = model.predict(X_test)
        pred_time = time() - start
        
        # Performance metrics
        if task == 'classification':
            train_score = accuracy_score(y_train, model.predict(X_train))
            test_score = accuracy_score(y_test, y_pred)
            metric_name = 'Accuracy'
        else:
            train_score = r2_score(y_train, model.predict(X_train))
            test_score = r2_score(y_test, y_pred)
            metric_name = 'R²'
        
        results.append({
            'Model': name,
            f'Train {metric_name}': train_score,
            f'Test {metric_name}': test_score,
            'Overfit Gap': train_score - test_score,
            'Train Time (s)': train_time,
            'Pred Time (s)': pred_time
        })
        
        print(f"Done ({train_time:.2f}s)")
    
    return pd.DataFrame(results)
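
A single wall-clock measurement, as taken in `benchmark_models`, can be noisy when a fit only takes milliseconds. A minimal sketch of a steadier alternative, assuming a hypothetical helper `median_runtime` (not part of the notebook) that repeats a call and reports the median duration using `time.perf_counter`:

```python
from statistics import median
from time import perf_counter


def median_runtime(func, repeats=5):
    """Call func() `repeats` times and return (last_result, median_seconds).

    Hypothetical helper for illustration; not part of benchmark_models.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    durations = []
    result = None
    for _ in range(repeats):
        start = perf_counter()  # monotonic, high-resolution clock
        result = func()
        durations.append(perf_counter() - start)
    return result, median(durations)


# Stand-in workload; in the benchmark this could wrap
# `lambda: model.fit(X_train, y_train)` or `lambda: model.predict(X_test)`.
total, seconds = median_runtime(lambda: sum(range(100_000)), repeats=5)
print(f"result={total}, median time={seconds:.6f}s")
```

Repeating a fit multiplies the benchmark's runtime, so this is probably most useful for the small-dataset benchmark, where individual timings are shortest.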

## 3. Benchmark 1: Small Dataset (Breast Cancer)

In [None]:
# Load small dataset
cancer = load_breast_cancer()
X, y = cancer.data, cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Small Dataset: {X.shape}")
print(f"Train: {X_train.shape}, Test: {X_test.shape}\n")

In [None]:
# Define models for comparison
models_small = {
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Bagging': BaggingClassifier(n_estimators=100, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'AdaBoost': AdaBoostClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'XGBoost': XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss'),
    'LightGBM': LGBMClassifier(n_estimators=100, random_state=42, verbose=-1),
}

if CATBOOST_AVAILABLE:
    models_small['CatBoost'] = CatBoostClassifier(iterations=100, random_state=42, verbose=False)

# Run benchmark
results_small = benchmark_models(models_small, X_train, X_test, y_train, y_test)
results_small = results_small.sort_values('Test Accuracy', ascending=False)

print("\n" + "="*80)
print("BENCHMARK RESULTS - SMALL DATASET")
print("="*80)
print(results_small.to_string(index=False))

In [None]:
# Visualize results
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Accuracy comparison
ax1 = axes[0]
results_plot = results_small.sort_values('Test Accuracy')
y_pos = np.arange(len(results_plot))
ax1.barh(y_pos, results_plot['Test Accuracy'], alpha=0.7)
ax1.set_yticks(y_pos)
ax1.set_yticklabels(results_plot['Model'])
ax1.set_xlabel('Test Accuracy')
ax1.set_title('Model Accuracy Comparison')
ax1.grid(axis='x', alpha=0.3)

# Training time
ax2 = axes[1]
results_plot = results_small.sort_values('Train Time (s)')
y_pos = np.arange(len(results_plot))
ax2.barh(y_pos, results_plot['Train Time (s)'], alpha=0.7, color='orange')
ax2.set_yticks(y_pos)
ax2.set_yticklabels(results_plot['Model'])
ax2.set_xlabel('Training Time (seconds)')
ax2.set_title('Training Speed Comparison')
ax2.grid(axis='x', alpha=0.3)

# Overfitting gap
ax3 = axes[2]
results_plot = results_small.sort_values('Overfit Gap')
y_pos = np.arange(len(results_plot))
colors = ['green' if x < 0.05 else 'orange' if x < 0.1 else 'red' 
          for x in results_plot['Overfit Gap']]
ax3.barh(y_pos, results_plot['Overfit Gap'], alpha=0.7, color=colors)
ax3.set_yticks(y_pos)
ax3.set_yticklabels(results_plot['Model'])
ax3.set_xlabel('Train-Test Gap')
ax3.set_title('Overfitting Analysis')
ax3.axvline(x=0.05, color='green', linestyle='--', alpha=0.5, label='Good (<0.05)')
ax3.axvline(x=0.1, color='orange', linestyle='--', alpha=0.5, label='Moderate (<0.1)')
ax3.legend()
ax3.grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

## 4. Benchmark 2: Large Dataset

In [None]:
# Create large synthetic dataset
X_large, y_large = make_classification(
    n_samples=50000,
    n_features=50,
    n_informative=30,
    n_redundant=10,
    random_state=42
)

X_train_large, X_test_large, y_train_large, y_test_large = train_test_split(
    X_large, y_large, test_size=0.2, random_state=42
)

print(f"Large Dataset: {X_large.shape}")
print(f"Train: {X_train_large.shape}, Test: {X_test_large.shape}\n")

In [None]:
# Models suited to larger datasets (parallel training enabled via n_jobs where supported)
models_large = {
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'XGBoost': XGBClassifier(n_estimators=100, random_state=42, eval_metric='logloss', n_jobs=-1),
    'LightGBM': LGBMClassifier(n_estimators=100, random_state=42, verbose=-1, n_jobs=-1),
}

if CATBOOST_AVAILABLE:
    models_large['CatBoost'] = CatBoostClassifier(iterations=100, random_state=42, verbose=False)

# Run benchmark
results_large = benchmark_models(models_large, X_train_large, X_test_large, 
                                 y_train_large, y_test_large)
results_large = results_large.sort_values('Test Accuracy', ascending=False)

print("\n" + "="*80)
print("BENCHMARK RESULTS - LARGE DATASET")
print("="*80)
print(results_large.to_string(index=False))

In [None]:
# Speed vs Accuracy scatter plot
fig, ax = plt.subplots(figsize=(10, 6))

ax.scatter(results_large['Train Time (s)'], results_large['Test Accuracy'], 
          s=200, alpha=0.6, c=range(len(results_large)), cmap='viridis')

for idx, row in results_large.iterrows():
    ax.annotate(row['Model'], 
               (row['Train Time (s)'], row['Test Accuracy']),
               xytext=(5, 5), textcoords='offset points', fontsize=10)

ax.set_xlabel('Training Time (seconds)', fontsize=12)
ax.set_ylabel('Test Accuracy', fontsize=12)
ax.set_title('Speed vs Accuracy Trade-off (Large Dataset)', fontsize=14, fontweight='bold')
ax.grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\nKey Insights:")
print("- Upper-left corner: Fast AND accurate (ideal)")
print("- Lower-right corner: Slow AND less accurate (avoid)")
print("- Trade-off: Choose based on priorities (speed vs accuracy)")
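
The "upper-left is ideal" reading of the scatter plot can also be stated as a rule: keep every model that no other model beats on both speed and accuracy. A minimal sketch, assuming a hypothetical helper `pareto_front` and made-up numbers rather than results from this notebook:

```python
def pareto_front(models):
    """Return names of models not dominated on (train time, accuracy).

    `models` maps name -> (train_time_seconds, test_accuracy).
    A model is dominated if another one is at least as fast AND at least
    as accurate, and strictly better on at least one of the two.
    """
    front = []
    for name, (t, acc) in models.items():
        dominated = any(
            (t2 <= t and acc2 >= acc) and (t2 < t or acc2 > acc)
            for other, (t2, acc2) in models.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Made-up numbers for illustration only.
example = {
    "Model A": (1.0, 0.90),
    "Model B": (5.0, 0.93),
    "Model C": (6.0, 0.91),  # slower and less accurate than Model B
    "Model D": (0.5, 0.88),
}
print(pareto_front(example))  # ['Model A', 'Model B', 'Model D']
```

With real results, the input could be built from the `Model`, `Train Time (s)`, and `Test Accuracy` columns of `results_large`. Choosing among the surviving models is then a question of priorities.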

## 5. Ensemble Method Selection Guide

In [None]:
# Create decision matrix (rule-of-thumb ratings, not measured in this notebook)
decision_matrix = pd.DataFrame({
    'Method': ['Random Forest', 'Gradient Boosting', 'XGBoost', 'LightGBM', 'CatBoost', 
               'AdaBoost', 'Bagging', 'Stacking', 'Voting'],
    'Accuracy': ['★★★★', '★★★★★', '★★★★★', '★★★★★', '★★★★★', 
                '★★★', '★★★', '★★★★★', '★★★★'],
    'Speed': ['★★★★', '★★★', '★★★★', '★★★★★', '★★★', 
             '★★★', '★★★★★', '★★', '★★★'],
    'Small Data': ['★★★★', '★★★★', '★★★★', '★★★', '★★★★★', 
                  '★★★', '★★★★', '★★★★', '★★★★'],
    'Large Data': ['★★★★', '★★★', '★★★★', '★★★★★', '★★★★', 
                  '★★', '★★★', '★★★', '★★★'],
    'Categorical Features': ['★★', '★★', '★★', '★★★★', '★★★★★', 
                            '★★', '★★', '★★★', '★★'],
    'Interpretability': ['★★★', '★★', '★★', '★★', '★★', 
                        '★★★', '★★★', '★', '★★'],
    'Overfitting Resistance': ['★★★★', '★★★', '★★★★', '★★★', '★★★★★', 
                              '★★', '★★★★', '★★★', '★★★★']
})

print("\n" + "="*100)
print("ENSEMBLE METHOD SELECTION GUIDE")
print("="*100)
print(decision_matrix.to_string(index=False))
print("\n★ = Poor, ★★★ = Good, ★★★★★ = Excellent")
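
The star ratings can be turned into a single ranking by weighting the criteria that matter for a given project. A minimal sketch, using three rows copied from the matrix above and assuming hypothetical weights and a hypothetical helper `weighted_score`:

```python
# Star ratings copied from three rows of the decision matrix.
ratings = {
    "XGBoost":  {"Accuracy": "★★★★★", "Speed": "★★★★", "Large Data": "★★★★"},
    "LightGBM": {"Accuracy": "★★★★★", "Speed": "★★★★★", "Large Data": "★★★★★"},
    "Stacking": {"Accuracy": "★★★★★", "Speed": "★★", "Large Data": "★★★"},
}

# Hypothetical weights: how much each criterion matters for this project.
weights = {"Accuracy": 0.5, "Speed": 0.3, "Large Data": 0.2}


def weighted_score(stars_by_criterion, weights):
    """Weighted average of star counts (1-5) over the weighted criteria."""
    total_weight = sum(weights.values())
    weighted_sum = sum(len(stars_by_criterion[c]) * w for c, w in weights.items())
    return weighted_sum / total_weight


scores = {method: weighted_score(r, weights) for method, r in ratings.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
for method in ranking:
    print(f"{method:10s} {scores[method]:.2f}")
```

The ranking only reflects the ratings and the chosen weights, so it is a way to structure a shortlist, not a substitute for benchmarking.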

## 6. Decision Framework

In [None]:
def recommend_ensemble(dataset_size, has_categorical, accuracy_priority, 
                      interpretability_needed, training_time_limit):
    """
    Recommend ensemble method based on requirements.
    
    Parameters:
    -----------
    dataset_size : str
        'small' (<10K), 'medium' (10K-1M), 'large' (>1M)
    has_categorical : bool
        Whether dataset has many categorical features
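    accuracy_priority : str
        'highest', 'high', or 'moderate'; only 'highest' changes the
        result (it adds Stacking)
    interpretability_needed : bool
        Whether model interpretability is important
    training_time_limit : str
        'strict', 'moderate', or 'flexible'; only 'strict' changes the result
    
    Returns:
    --------
    list of dict
        Up to three recommendations, each with 'method', 'reason', and
        'priority' keys, sorted by priority (1 = strongest)
    """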
    recommendations = []
    
    # Categorical features dominate
    if has_categorical:
        recommendations.append({
            'method': 'CatBoost',
            'reason': 'Best categorical feature handling',
            'priority': 1
        })
        recommendations.append({
            'method': 'LightGBM',
            'reason': 'Good categorical support + fast',
            'priority': 2
        })
    
    # Large dataset
    if dataset_size == 'large':
        recommendations.append({
            'method': 'LightGBM',
            'reason': 'Fastest on large data',
            'priority': 1
        })
        recommendations.append({
            'method': 'XGBoost',
            'reason': 'Good speed/accuracy balance',
            'priority': 2
        })
    
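    # Small dataset
    elif dataset_size == 'small':
        recommendations.append({
            'method': 'CatBoost',
            'reason': 'Best overfitting resistance',
            'priority': 1
        })
        recommendations.append({
            'method': 'Random Forest',
            'reason': 'Robust and easy to tune',
            'priority': 2
        })
    
    # Medium dataset (previously produced no size-based recommendation)
    elif dataset_size == 'medium':
        recommendations.append({
            'method': 'XGBoost',
            'reason': 'Best all-around performance',
            'priority': 1
        })
        recommendations.append({
            'method': 'LightGBM',
            'reason': 'Fast with good accuracy',
            'priority': 2
        })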
    
    # Highest accuracy needed
    if accuracy_priority == 'highest':
        recommendations.append({
            'method': 'Stacking',
            'reason': 'Combines strengths of multiple models',
            'priority': 1
        })
    
    # Interpretability needed
    if interpretability_needed:
        recommendations.append({
            'method': 'Random Forest',
            'reason': 'Feature importance + relatively interpretable',
            'priority': 1
        })
        recommendations.append({
            'method': 'Voting Ensemble',
            'reason': 'Simple averaging, easy to explain',
            'priority': 2
        })
    
    # Strict time limit
    if training_time_limit == 'strict':
        recommendations.append({
            'method': 'Random Forest',
            'reason': 'Parallelizable, fast training',
            'priority': 1
        })
        recommendations.append({
            'method': 'LightGBM',
            'reason': 'Fastest boosting algorithm',
            'priority': 2
        })
    
    # Remove duplicates and sort by priority
    unique_recs = {}
    for rec in recommendations:
        method = rec['method']
        if method not in unique_recs or rec['priority'] < unique_recs[method]['priority']:
            unique_recs[method] = rec
    
    final_recs = sorted(unique_recs.values(), key=lambda x: x['priority'])
    
    return final_recs[:3]  # Return top 3

# Example usage
print("\nExample Recommendation Scenarios:\n")
print("="*80)

# Scenario 1
print("\nScenario 1: Large dataset, many categorical features, need high accuracy")
recs = recommend_ensemble('large', True, 'high', False, 'moderate')
for i, rec in enumerate(recs, 1):
    print(f"{i}. {rec['method']:15s} - {rec['reason']}")

# Scenario 2
print("\nScenario 2: Small dataset, need interpretability, strict time limit")
recs = recommend_ensemble('small', False, 'moderate', True, 'strict')
for i, rec in enumerate(recs, 1):
    print(f"{i}. {rec['method']:15s} - {rec['reason']}")

# Scenario 3
print("\nScenario 3: Medium dataset, need highest accuracy, flexible time")
recs = recommend_ensemble('medium', False, 'highest', False, 'flexible')
for i, rec in enumerate(recs, 1):
    print(f"{i}. {rec['method']:15s} - {rec['reason']}")

## 7. Summary: Complete Decision Guide

### Quick Selection Guide:

#### By Dataset Size:

**Small Data (<10K rows)**:
1. **CatBoost** - Best overfitting resistance
2. **Random Forest** - Robust, easy to tune
3. **XGBoost** - Regularization helps prevent overfitting

**Medium Data (10K-1M rows)**:
1. **XGBoost** - Best all-around performance
2. **LightGBM** - Fast with good accuracy
3. **CatBoost** - Excellent defaults

**Large Data (>1M rows)**:
1. **LightGBM** - Fastest training
2. **XGBoost** - Good speed/accuracy balance
3. **Random Forest** - Parallelizable

#### By Feature Type:

**Many Categorical Features**:
1. **CatBoost** - Native categorical handling
2. **LightGBM** - Good categorical support
3. **Target Encoding + XGBoost** - Manual encoding needed
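
Target encoding, the manual step in the third option, replaces each category with a smoothed mean of the target. A minimal sketch, assuming hypothetical helpers `fit_target_encoding` and `apply_target_encoding` and a made-up `smoothing` value. In practice the encoding is usually computed out-of-fold to limit target leakage, which this sketch leaves out:

```python
import pandas as pd


def fit_target_encoding(categories, target, smoothing=10.0):
    """Learn a smoothed mean-target value per category from training data."""
    df = pd.DataFrame({"cat": categories, "y": target})
    global_mean = df["y"].mean()
    stats = df.groupby("cat")["y"].agg(["mean", "count"])
    # Shrink rare categories toward the global mean.
    encoding = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing
    )
    return encoding.to_dict(), global_mean


def apply_target_encoding(categories, encoding, global_mean):
    """Map categories to learned values; unseen ones get the global mean."""
    return pd.Series(categories).map(encoding).fillna(global_mean).to_numpy()


# Toy data for illustration only.
train_cats = ["A", "A", "A", "B", "B", "C"]
train_y = [1, 1, 0, 0, 0, 1]

encoding, global_mean = fit_target_encoding(train_cats, train_y, smoothing=2.0)
encoded = apply_target_encoding(["A", "B", "D"], encoding, global_mean)
print(encoded)  # "D" was never seen, so it falls back to the global mean
```

The encoded column is numeric, so it can be passed to any of the models in this notebook, not only XGBoost.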

**Numerical Features Only**:
1. **XGBoost** - State-of-the-art
2. **LightGBM** - Faster alternative
3. **Gradient Boosting** - scikit-learn baseline

#### By Objective:

**Maximum Accuracy (competitions)**:
1. **Stacking** (XGBoost + LightGBM + CatBoost)
2. **Voting Ensemble** - Simpler alternative
3. **Multi-level Stacking** - Ultimate performance
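
The difference between the first two options can be sketched with scikit-learn's `StackingClassifier` and `VotingClassifier`. This sketch assumes scikit-learn models as stand-ins for the base learners so that it runs without the boosting libraries; `XGBClassifier`, `LGBMClassifier`, or `CatBoostClassifier` instances could be listed instead:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (
    GradientBoostingClassifier, RandomForestClassifier,
    StackingClassifier, VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Stand-in base models (an assumption of this sketch).
base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
    ("gb", GradientBoostingClassifier(n_estimators=50, random_state=42)),
]

ensembles = {
    # Stacking: a meta-model learns from cross-validated base predictions.
    "Stacking": StackingClassifier(
        estimators=base_models,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,
    ),
    # Soft voting: average the predicted class probabilities.
    "Voting": VotingClassifier(estimators=base_models, voting="soft"),
}

scores = {}
for name, model in ensembles.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

Stacking refits each base model once per fold to produce the meta-model's training data, which is where most of its extra training cost comes from.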

**Speed Critical**:
1. **LightGBM** - Fastest boosting
2. **Random Forest** - Parallelizable
3. **Voting** - Simple averaging; no meta-model to train (unlike stacking)

**Interpretability Needed**:
1. **Random Forest** - Feature importance
2. **Gradient Boosting** - Explainable with SHAP values (via the separate `shap` package)
3. **Single Decision Tree** - Most interpretable
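
The feature importance mentioned for Random Forest can be computed in more than one way. A minimal sketch, assuming the breast cancer dataset as the example, compares scikit-learn's impurity-based `feature_importances_` with `permutation_importance` measured on held-out data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42, stratify=data.target
)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Impurity-based importances: derived from the training data, sum to 1.
impurity = forest.feature_importances_

# Permutation importances: drop in test accuracy when a feature is shuffled.
perm = permutation_importance(
    forest, X_test, y_test, n_repeats=5, random_state=42
)

# Show the five features ranked highest by impurity-based importance.
top = impurity.argsort()[::-1][:5]
for i in top:
    print(f"{data.feature_names[i]:25s} impurity={impurity[i]:.3f} "
          f"permutation={perm.importances_mean[i]:.3f}")
```

The two measures can disagree, especially when features are correlated, so it can be worth checking both before drawing conclusions from either.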

**Production Deployment**:
1. **LightGBM** - Fast prediction
2. **XGBoost** - Stable, well-tested
3. **Random Forest** - Robust, simple

### Method Comparison Summary:

| Method | When to Use | When to Avoid |
|--------|------------|---------------|
| **Random Forest** | Small-medium data, baseline, interpretability | Highest accuracy needed, tight model-size budget |
| **Gradient Boosting** | Scikit-learn ecosystem, baseline | Speed critical, large data |
| **XGBoost** | General purpose, competitions | Extreme speed needed |
| **LightGBM** | Large data, speed critical | Small data (<10K rows) |
| **CatBoost** | Categorical features, small data | Extreme speed needed |
| **AdaBoost** | Simple boosting, teaching | Noisy labels or outliers; usually outperformed by gradient boosting |
| **Bagging** | Reduce variance, simple | Need maximum accuracy |
| **Stacking** | Maximum accuracy, competitions | Speed needed, interpretability |
| **Voting** | Quick ensemble, robustness | Complex patterns, extreme accuracy |

### Production Considerations:

1. **Prediction Latency**:
   - Fastest: LightGBM, Random Forest
   - Moderate: XGBoost, CatBoost
   - Slower: Stacking (multiple models)

2. **Model Size**:
   - Smallest: Boosting models (many shallow trees)
   - Larger: Random Forest (stores many deep, fully grown trees)
   - Largest: Stacking (multiple models)

3. **Maintenance**:
   - Easiest: Random Forest, Voting
   - Moderate: XGBoost, LightGBM, CatBoost
   - Complex: Stacking (multiple models to update)
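
Latency and model size depend heavily on tree depth, number of trees, and hardware, so the orderings above are worth measuring for a specific setup. A minimal sketch, assuming a small synthetic dataset, two scikit-learn models as examples, and pickled size as a rough proxy for deployed model size:

```python
import pickle
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=100, random_state=42),
}

report = {}
for name, model in models.items():
    model.fit(X, y)

    # Serialized size approximates what must be stored and loaded for serving.
    size_kb = len(pickle.dumps(model)) / 1024

    # Batch prediction latency; single-row latency can behave differently.
    start = perf_counter()
    model.predict(X)
    latency_ms = (perf_counter() - start) * 1000

    report[name] = {"size_kb": size_kb, "latency_ms": latency_ms}
    print(f"{name:18s} size={size_kb:8.1f} KB  predict={latency_ms:6.1f} ms")
```

The same loop could be pointed at any fitted model from the benchmarks, including the XGBoost, LightGBM, and CatBoost classifiers.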

### Final Recommendations:

**Default Choice**: Start with **XGBoost**
- Strong all-around performance on tabular data
- Well-documented
- Works for most problems

**Speed Critical**: Use **LightGBM**
- Fastest training and prediction
- Good accuracy
- Excellent for large data

**Categorical Heavy**: Use **CatBoost**
- Best categorical handling
- Great defaults
- Resistant to overfitting

**Maximum Accuracy**: Use **Stacking**
- Combine XGBoost + LightGBM + CatBoost
- Worth the complexity for critical applications
- Verify that the accuracy gain justifies the extra training and maintenance cost

**Simplicity/Interpretability**: Use **Random Forest**
- Easy to understand
- Good feature importance
- Robust defaults

### Remember:

1. **Always benchmark** on your specific problem
2. **Cross-validate** for reliable estimates
3. **Check if ensemble helps** vs simple models
4. **Consider deployment constraints** early
5. **Monitor performance** in production
6. **Update models** as data distribution changes
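
Points 2 and 3 can be combined in one check: cross-validate a simple model and an ensemble on the same folds and compare the spread as well as the mean. A minimal sketch, assuming the breast cancer dataset and a single decision tree as the simple baseline:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Fixed, shuffled folds so both models are scored on identical splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

cv_results = {}
for name, model in candidates.items():
    fold_scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    cv_results[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name:15s} accuracy = {fold_scores.mean():.3f} ± {fold_scores.std():.3f}")
```

If the gap between the two means is smaller than the fold-to-fold standard deviation, the ensemble's advantage on this dataset may not be reliable.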

## 8. Exercises

### Exercise 1: Custom Benchmark

Create your own benchmark comparing at least 5 ensemble methods on a dataset of your choice. Include:
- Accuracy metrics
- Training and prediction time
- Overfitting analysis
- Visualizations

In [None]:
# Your code here


### Exercise 2: Hyperparameter Impact

For XGBoost, LightGBM, and CatBoost, compare:
1. Default parameters
2. Conservative parameters (prevent overfitting)
3. Aggressive parameters (maximize accuracy)

Which tuning strategy works best for different dataset sizes?

In [None]:
# Your code here


### Exercise 3: Ensemble Combination

Create a comprehensive ensemble that combines:
- XGBoost, LightGBM, CatBoost as base models
- Stacking OR voting (choose based on data size)
- Compare with best individual model

Is the ensemble worth the added complexity?

In [None]:
# Your code here


## Additional Resources

- [Scikit-learn Ensemble Guide](https://scikit-learn.org/stable/modules/ensemble.html)
- [XGBoost Documentation](https://xgboost.readthedocs.io/)
- [LightGBM Documentation](https://lightgbm.readthedocs.io/)
- [CatBoost Documentation](https://catboost.ai/docs/)
- [Kaggle Ensemble Guide](https://mlwave.com/kaggle-ensembling-guide/)
- [Model Selection Best Practices](https://machinelearningmastery.com/a-gentle-introduction-to-model-selection-for-machine-learning/)