# Module 07: CatBoost

**Difficulty**: ⭐⭐⭐ Advanced
**Estimated Time**: 90 minutes
**Prerequisites**: 
- Module 05: XGBoost
- Module 06: LightGBM

## Learning Objectives

By the end of this notebook, you will be able to:
1. Understand CatBoost's unique algorithms: ordered boosting and symmetric trees
2. Leverage CatBoost's superior categorical encoding without preprocessing
3. Explain how ordered boosting prevents target leakage
4. Optimize CatBoost hyperparameters for best performance
5. Use built-in overfitting detection and visualization tools
6. Compare XGBoost, LightGBM, and CatBoost systematically
7. Apply CatBoost's model analysis tools for interpretation
8. Make informed decisions about which gradient boosting library to use

## Setup and Configuration

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import time
import warnings
warnings.filterwarnings('ignore')

# Machine learning
from sklearn.datasets import load_breast_cancer, make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    roc_auc_score, roc_curve, log_loss
)

# Gradient boosting libraries for comparison
try:
    import xgboost as xgb
    XGB_AVAILABLE = True
except ImportError:
    XGB_AVAILABLE = False

try:
    import lightgbm as lgb
    LGB_AVAILABLE = True
except ImportError:
    LGB_AVAILABLE = False

# CatBoost
try:
    import catboost as cb
    from catboost import CatBoostClassifier, CatBoostRegressor, Pool
    print(f"CatBoost version: {cb.__version__}")
except ImportError:
    print("Installing CatBoost...")
    !pip install catboost -q
    import catboost as cb
    from catboost import CatBoostClassifier, CatBoostRegressor, Pool
    print(f"CatBoost version: {cb.__version__}")

# SHAP for interpretability
try:
    import shap
    SHAP_AVAILABLE = True
    print(f"SHAP version: {shap.__version__}")
except ImportError:
    print("SHAP not available. Install with: pip install shap")
    SHAP_AVAILABLE = False

# Configuration
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

# Reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

print("\nSetup complete! All libraries imported successfully.")

## 1. What Makes CatBoost Special?

### CatBoost = Categorical Boosting

Developed by Yandex (2017), CatBoost is the newest of the three major gradient boosting libraries.

### Key Innovations

#### 1.1 Ordered Boosting

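**Problem with standard gradient boosting**:
- The gradient for each training sample is computed from the current model's prediction on that sample
- The current model was itself fitted using that same sample's target
- **Target leakage**: Each prediction already carries information about its own label, so gradients on training data are biased relative to unseen data (the CatBoost paper calls this prediction shift)
- Can cause overfitting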

**CatBoost's solution: Ordered Boosting**:
```
Standard boosting:
1. Predict on all samples
2. Compute gradients
3. Build tree using same samples  ← Leakage!

Ordered boosting:
1. Shuffle the samples into a random order
2. For each sample i:
   - Use only samples before i (j < i) in that order to compute its prediction
   - No target leakage!
3. Compute gradients
4. Build tree
```

**Benefits**:
- Prevents target leakage
- Better generalization
- More robust predictions
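
The ordering idea can be illustrated with a deliberately tiny stand-in model. The sketch below is not CatBoost's implementation: it assumes the "model" is just the mean of the targets it has seen, and the function names and the `prior` argument are hypothetical. It contrasts residuals from a model that has seen every target with residuals where each sample is scored only by targets that precede it in a given order.

```python
import numpy as np

def plain_residuals(y):
    """Residuals when the 'model' (a global mean) was fit on every target, including y[i]."""
    y = np.asarray(y, dtype=float)
    return y - y.mean()

def ordered_residuals(y, order, prior=0.0):
    """Residuals when each sample is predicted only from targets that precede it in `order`."""
    y = np.asarray(y, dtype=float)
    residuals = np.empty_like(y)
    running_sum, seen = 0.0, 0
    for idx in order:
        # Prediction uses earlier samples only; fall back to the prior when none exist
        prediction = running_sum / seen if seen else prior
        residuals[idx] = y[idx] - prediction
        running_sum += y[idx]
        seen += 1
    return residuals

y_toy = [1.0, 0.0, 1.0, 1.0]
print(plain_residuals(y_toy))                        # each residual depends on its own target
print(ordered_residuals(y_toy, order=[0, 1, 2, 3]))  # each residual ignores its own target
```

In practice CatBoost uses random permutations and real tree models rather than a running mean, so treat this only as a picture of which samples are allowed to influence which predictions.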

#### 1.2 Symmetric Trees (Oblivious Trees)

**Traditional trees**:
- Each node can split on different features
- Asymmetric structure

**CatBoost symmetric trees**:
- Same splitting criterion at each level
- All nodes at same level split on same feature and threshold
- Balanced binary tree structure

```
Symmetric tree (depth=3):
              [Feature A > 5]
             /              \
      [Feature B > 3]   [Feature B > 3]
       /        \         /        \
   [C>1]     [C>1]    [C>1]     [C>1]
   /  \      /  \     /  \      /  \
  L0  L1    L2  L3   L4  L5    L6  L7
```

**Benefits**:
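- **Fast prediction**: All samples evaluate the same `depth` comparisons, so the leaf index can be computed without data-dependent branching (a conventional tree also needs only O(depth) comparisons, but follows a different path per sample)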
- **Less overfitting**: Simpler structure
- **Better regularization**: Constrained tree structure
- **CPU cache friendly**: Sequential memory access
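
Because every level of a symmetric tree applies one shared test, a prediction reduces to building a `depth`-bit leaf index. The following is a minimal sketch of that idea, assuming a hypothetical helper `oblivious_tree_predict` and an arbitrary bit order (first level is the most significant bit, "condition true" is 1). CatBoost's internal layout may differ.

```python
import numpy as np

def oblivious_tree_predict(X, feature_idx, thresholds, leaf_values):
    """Evaluate one symmetric tree: level d tests X[:, feature_idx[d]] > thresholds[d]."""
    X = np.asarray(X, dtype=float)
    leaf_index = np.zeros(X.shape[0], dtype=np.int64)
    for feature, threshold in zip(feature_idx, thresholds):
        bit = (X[:, feature] > threshold).astype(np.int64)
        leaf_index = (leaf_index << 1) | bit   # append this level's decision as one bit
    return np.asarray(leaf_values)[leaf_index]

# Depth-3 tree mirroring the diagram: A > 5, then B > 3, then C > 1
leaf_values = np.arange(8, dtype=float)        # leaf k simply returns k
X_toy = [[6.0, 4.0, 2.0],
         [1.0, 1.0, 0.0],
         [6.0, 1.0, 2.0]]
toy_predictions = oblivious_tree_predict(X_toy, [0, 1, 2], [5.0, 3.0, 1.0], leaf_values)
print(toy_predictions)
```

The loop runs once per level, not once per node, and every row is processed with the same vectorized comparison.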

#### 1.3 Superior Categorical Encoding

Unlike one-hot encoding or label encoding, CatBoost uses:

**Ordered Target Statistics**:
- For each category, compute average target value
- Use **ordered** scheme to prevent target leakage
- For sample i, use only samples j < i (in a random permutation of the training data) with the same category
- Add prior to prevent overfitting on rare categories

**Formula**:
$$\text{TargetStat}(cat) = \frac{\sum_{j<i, cat_j=cat} y_j + \alpha \cdot P}{\sum_{j<i, cat_j=cat} 1 + \alpha}$$

Where:
- $P$ = prior (overall average)
- $\alpha$ = prior weight (smoothing parameter)

**Benefits**:
- No preprocessing needed
- No one-hot encoding explosion
- Handles high-cardinality categories
- No target leakage (ordered scheme)
- Often better accuracy than manual encoding
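
The formula above can be traced by hand with a few rows. This is a minimal sketch, assuming a hypothetical function `ordered_target_stats` and an explicit `order` argument so the result is reproducible. CatBoost itself draws several random permutations and handles the details internally.

```python
import numpy as np

def ordered_target_stats(categories, targets, prior, alpha=1.0, order=None, seed=0):
    """Encode each sample using only same-category samples that precede it in `order`."""
    targets = np.asarray(targets, dtype=float)
    if order is None:
        order = np.random.default_rng(seed).permutation(len(targets))
    sums, counts = {}, {}
    encoded = np.empty(len(targets))
    for idx in order:
        cat = categories[idx]
        prev_sum = sums.get(cat, 0.0)
        prev_count = counts.get(cat, 0)
        encoded[idx] = (prev_sum + alpha * prior) / (prev_count + alpha)
        # Only now does this sample's target become visible to later samples
        sums[cat] = prev_sum + targets[idx]
        counts[cat] = prev_count + 1
    return encoded

cats = ['a', 'a', 'b', 'a']
ys = [1, 0, 1, 1]
toy_encoding = ordered_target_stats(cats, ys, prior=0.5, alpha=1.0, order=[0, 1, 2, 3])
print(toy_encoding)
```

With `prior=0.5` and `alpha=1`, the first occurrence of each category falls back to the prior, and the second `'a'` gets (1 + 0.5) / (1 + 1) = 0.75 because only one earlier `'a'` (with target 1) is visible to it.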

#### 1.4 Built-in Overfitting Detection

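When a validation set is supplied, CatBoost can:
- Track the evaluation metric on the validation set at every iteration
- Detect when that metric stops improving
- Plot the learning curves during training (`plot=True` in notebooks)
- Stop early once the detector triggers (`od_type` / `od_wait`, or `early_stopping_rounds`)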
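
The stopping rule behind `od_type='Iter'` is essentially a patience counter on the validation metric. Here is a simplified sketch in plain Python. The function name is hypothetical, lower loss is assumed to be better, and CatBoost's exact off-by-one behaviour may differ.

```python
def iter_overfitting_detector(val_losses, wait):
    """Return (best_iteration, stop_iteration) for a 'no improvement for `wait` rounds' rule."""
    best_iter, best_loss = 0, float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_iter, best_loss = i, loss
        elif i - best_iter >= wait:
            return best_iter, i            # patience exhausted: stop here
    return best_iter, len(val_losses) - 1  # never triggered: ran all iterations

losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]
detector_result = iter_overfitting_detector(losses, wait=3)
print(detector_result)
```

Note that the late improvement at index 6 is never reached: a short `wait` saves time but can stop before a better iteration appears.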

### CatBoost vs XGBoost vs LightGBM

| Feature | XGBoost | LightGBM | CatBoost |
|---------|---------|----------|----------|
| Tree growth | Level-wise | Leaf-wise | Level-wise (symmetric) |
| Split finding | Exact/Approx | Histogram | Symmetric splits |
| Categorical | Manual encoding | Native (basic) | Native (advanced) |
| Target leakage | Possible | Possible | Prevented (ordered) |
| Speed | Fast | Fastest | Moderate |
| Overfitting | Moderate risk | Higher risk | Lower risk |
| Default params | Good | Need tuning | Excellent |
| Prediction speed | Fast | Fast | Fastest |
| Training speed | Fast | Fastest | Slower |
| Interpretability | Good | Good | Excellent |
| GPU support | Yes | Yes | Yes |

In [None]:
# Create dataset with categorical features
print("Creating dataset with categorical features...")

n_samples = 10000

# Numeric features
X_numeric = np.random.randn(n_samples, 10)

# Categorical features with varying cardinality
cat_low = np.random.choice(['A', 'B', 'C'], size=n_samples)
cat_medium = np.random.choice([f'Cat{i}' for i in range(10)], size=n_samples)
cat_high = np.random.choice([f'ID{i}' for i in range(100)], size=n_samples)

# Create DataFrame
df = pd.DataFrame(X_numeric, columns=[f'num_{i}' for i in range(10)])
df['cat_low_card'] = cat_low
df['cat_med_card'] = cat_medium
df['cat_high_card'] = cat_high

# Create target influenced by both numeric and categorical features
target_numeric = X_numeric[:, :3].sum(axis=1)
target_cat_low = (cat_low == 'A').astype(float) * 3
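# Use the numeric suffix rather than hash(): string hashes change between Python sessions
target_cat_med = np.array([int(c[3:]) % 5 for c in cat_medium]) / 5
y_continuous = target_numeric + target_cat_low + target_cat_med + np.random.randn(n_samples)
# NumPy arrays have no .median() method, so call np.median instead
y = (y_continuous > np.median(y_continuous)).astype(int)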

print(f"\nDataset created:")
print(f"Samples: {len(df):,}")
print(f"Numeric features: 10")
print(f"Categorical features: 3")
print(f"  - Low cardinality: {df['cat_low_card'].nunique()} categories")
print(f"  - Medium cardinality: {df['cat_med_card'].nunique()} categories")
print(f"  - High cardinality: {df['cat_high_card'].nunique()} categories")
print(f"\nTarget distribution: {np.bincount(y)}")
print(f"\nSample data:")
print(df.head())

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=RANDOM_STATE
)

# Identify categorical columns
cat_features = ['cat_low_card', 'cat_med_card', 'cat_high_card']
cat_feature_indices = [df.columns.get_loc(col) for col in cat_features]

print(f"Training samples: {len(X_train):,}")
print(f"Test samples: {len(X_test):,}")
print(f"Categorical feature indices: {cat_feature_indices}")

## 2. CatBoost Basics

### Using CatBoost with Categorical Features

CatBoost makes it incredibly easy - just specify which columns are categorical!

In [None]:
# Train CatBoost with categorical features
print("Training CatBoost with native categorical support...\n")

cat_model = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=6,
    random_state=RANDOM_STATE,
    verbose=20  # Print every 20 iterations
)

# Fit - just specify cat_features!
cat_model.fit(
    X_train, y_train,
    cat_features=cat_features,
    eval_set=(X_test, y_test),
    plot=False
)

# Evaluate
train_acc = cat_model.score(X_train, y_train)
test_acc = cat_model.score(X_test, y_test)

print(f"\nResults:")
print(f"Train accuracy: {train_acc:.4f}")
print(f"Test accuracy: {test_acc:.4f}")
print(f"Overfitting: {train_acc - test_acc:.4f}")

## 3. Comparison with Target Encoding

Let's compare CatBoost's ordered target statistics with manual target encoding.

In [None]:
# Manual target encoding (naive - causes leakage!)
def target_encode(X_train, y_train, X_test, cat_cols):
    """Simple target encoding (has leakage - for comparison only)"""
    X_train_encoded = X_train.copy()
    X_test_encoded = X_test.copy()
    
    for col in cat_cols:
        # Compute mean target for each category
        # X_train does not contain the target, so group the target Series by the category column
        target_means = y_train.groupby(X_train[col]).mean()
        
        # Replace categories with means
        X_train_encoded[col] = X_train[col].map(target_means).fillna(y_train.mean())
        X_test_encoded[col] = X_test[col].map(target_means).fillna(y_train.mean())
    
    return X_train_encoded, X_test_encoded

# Apply target encoding
X_train_copy = X_train.copy()
X_test_copy = X_test.copy()
y_train_series = pd.Series(y_train, index=X_train.index, name='target')

X_train_te, X_test_te = target_encode(
    X_train_copy, y_train_series, X_test_copy, cat_features
)

print("Target encoding applied.")
print(f"\nExample - cat_low_card mapping:")
print(y_train_series.groupby(X_train['cat_low_card']).mean())
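
To see why naive target encoding leaks, it helps to remove any real signal. The sketch below uses purely synthetic data (all names are illustrative): a high-cardinality category and a target that is independent of it. Any correlation between the encoded feature and the target on the training rows is therefore leakage, not information.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_levels = 1000, 500

def make_split():
    """Random categories and a binary target that is independent of them."""
    cats = pd.Series(rng.integers(0, n_levels, size=n_rows))
    target = pd.Series(rng.integers(0, 2, size=n_rows).astype(float))
    return cats, target

cat_tr, y_tr = make_split()
cat_te, y_te = make_split()

# Naive encoding: per-category mean computed on the same rows it is applied to
category_means = y_tr.groupby(cat_tr).mean()
enc_tr = cat_tr.map(category_means)
enc_te = cat_te.map(category_means).fillna(y_tr.mean())  # unseen categories get the global mean

train_corr = np.corrcoef(enc_tr, y_tr)[0, 1]
test_corr = np.corrcoef(enc_te, y_te)[0, 1]
print(f"train correlation: {train_corr:.2f}, test correlation: {test_corr:.2f}")
```

The encoded feature looks predictive on the training rows and useless on fresh rows, which is the gap that ordered target statistics are designed to close.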

In [None]:
# Train CatBoost on target-encoded data (no cat_features specified)
print("Training CatBoost on target-encoded features...")

cat_model_te = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=6,
    random_state=RANDOM_STATE,
    verbose=False
)

cat_model_te.fit(
    X_train_te, y_train,
    eval_set=(X_test_te, y_test)
)

train_acc_te = cat_model_te.score(X_train_te, y_train)
test_acc_te = cat_model_te.score(X_test_te, y_test)

# Compare
comparison = pd.DataFrame([
    {
        'Method': 'CatBoost Native',
        'Train Acc': train_acc,
        'Test Acc': test_acc,
        'Overfitting': train_acc - test_acc
    },
    {
        'Method': 'Target Encoding',
        'Train Acc': train_acc_te,
        'Test Acc': test_acc_te,
        'Overfitting': train_acc_te - test_acc_te
    }
])

print("\n" + "=" * 70)
print("CatBoost Native vs Manual Target Encoding")
print("=" * 70)
print(comparison.to_string(index=False))

print("\n✅ CatBoost's ordered target statistics:")
print("   - Prevents target leakage")
print("   - Better generalization")
print("   - No manual preprocessing needed")

## 4. CatBoost Hyperparameters

### Core Parameters

**`iterations`**: Number of trees (like n_estimators)
- Default: 1000
- Use early stopping

**`learning_rate`**: Shrinkage rate
- Default: auto (usually ~0.03)
- Lower = need more iterations
- Typical: 0.01-0.3

**`depth`**: Tree depth
- Default: 6
- CatBoost trees are symmetric, so depth is very important
- Each increase doubles number of leaves: 2^depth leaves
- Typical: 4-10

**`l2_leaf_reg`**: L2 regularization
- Default: 3.0
- Higher = more regularization
- Typical: 1-10
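
The effect of an L2 penalty on leaf values can be seen with simple arithmetic. The sketch below uses the common Newton-style formula from second-order boosting, leaf value = -sum(g) / (sum(h) + lambda). It is an illustration of the shrinkage effect only, and CatBoost's own leaf estimation differs in its details. The function name is hypothetical.

```python
def regularized_leaf_value(gradients, hessians, l2):
    """Newton-style leaf value with an L2 penalty: -sum(g) / (sum(h) + l2)."""
    return -sum(gradients) / (sum(hessians) + l2)

# Squared-error loss: gradient = prediction - target, hessian = 1
grads = [-1.0, -2.0, -3.0]   # three samples whose targets exceed the current prediction
hess = [1.0, 1.0, 1.0]
for l2 in (0.0, 3.0, 20.0):
    print(l2, round(regularized_leaf_value(grads, hess, l2), 4))
```

With three samples in the leaf, a penalty of 3 halves the unregularized value, and a penalty of 20 shrinks it to roughly an eighth. Leaves with few samples are affected most.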

### Advanced Parameters

**`border_count`**: Number of splits for numerical features
- Default: 254 on CPU, 128 on GPU in most modes (like LightGBM's max_bin)
- Lower = faster, less accurate
- Typical: 32, 64, 128, 254

**`bagging_temperature`**: Bayesian bootstrap parameter
- Default: 1.0
- 0 = all sample weights equal to 1 (no randomness), > 0 = more randomness
- Applies when `bootstrap_type='Bayesian'`; other bootstrap types use `subsample` instead

**`random_strength`**: Amount of randomness for splits
- Default: 1.0
- Higher = more randomness = less overfitting

**`rsm`** (random subspace method): Feature fraction
- Default: 1.0
- Fraction of features considered at each split selection
- Alias: `colsample_bylevel`; serves the same purpose as XGBoost's column-subsampling parameters

In [None]:
# Effect of depth parameter
depths = [3, 4, 5, 6, 7, 8, 10]
train_accs = []
test_accs = []
train_times = []

for depth in depths:
    model = CatBoostClassifier(
        iterations=100,
        depth=depth,
        learning_rate=0.1,
        random_state=RANDOM_STATE,
        verbose=False
    )
    
    start = time.time()
    model.fit(X_train, y_train, cat_features=cat_features)
    train_times.append(time.time() - start)
    
    train_accs.append(model.score(X_train, y_train))
    test_accs.append(model.score(X_test, y_test))

# Plot results
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy
axes[0].plot(depths, train_accs, marker='o', linewidth=2, markersize=8, label='Train')
axes[0].plot(depths, test_accs, marker='s', linewidth=2, markersize=8, label='Test')
axes[0].axvline(6, color='red', linestyle='--', alpha=0.5, label='Default (6)')
axes[0].set_xlabel('Tree Depth', fontsize=12)
axes[0].set_ylabel('Accuracy', fontsize=12)
axes[0].set_title('Effect of Tree Depth', fontsize=13, fontweight='bold')
axes[0].legend(fontsize=10)
axes[0].grid(True, alpha=0.3)

# Training time
axes[1].bar(range(len(depths)), train_times, color='steelblue', edgecolor='black')
axes[1].set_xticks(range(len(depths)))
axes[1].set_xticklabels(depths)
axes[1].set_xlabel('Tree Depth', fontsize=12)
axes[1].set_ylabel('Training Time (seconds)', fontsize=12)
axes[1].set_title('Training Time vs Depth', fontsize=13, fontweight='bold')
axes[1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

best_depth = depths[np.argmax(test_accs)]
print(f"\nOptimal depth: {best_depth}")
print(f"Test accuracy: {max(test_accs):.4f}")
print(f"\nNote: Depth {10} has 2^{10} = {2**10} leaves!")
print("Higher depth = more complex model, longer training")

In [None]:
# Effect of L2 regularization
l2_values = [0.1, 1, 3, 5, 10, 20]
train_accs = []
test_accs = []

for l2 in l2_values:
    model = CatBoostClassifier(
        iterations=100,
        depth=6,
        l2_leaf_reg=l2,
        learning_rate=0.1,
        random_state=RANDOM_STATE,
        verbose=False
    )
    
    model.fit(X_train, y_train, cat_features=cat_features)
    train_accs.append(model.score(X_train, y_train))
    test_accs.append(model.score(X_test, y_test))

# Plot
plt.figure(figsize=(10, 6))
plt.plot(l2_values, train_accs, marker='o', linewidth=2, markersize=8, label='Train')
plt.plot(l2_values, test_accs, marker='s', linewidth=2, markersize=8, label='Test')
plt.axvline(3, color='red', linestyle='--', alpha=0.5, label='Default (3)')
plt.xlabel('L2 Regularization (l2_leaf_reg)', fontsize=12)
plt.ylabel('Accuracy', fontsize=12)
plt.title('Effect of L2 Regularization', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nObservation:")
print("Higher L2 regularization reduces overfitting but may underfit.")
print("Default value of 3 is often a good starting point.")

## 5. Built-in Cross-Validation

CatBoost ships a `cv` function that runs k-fold cross-validation and returns the mean and standard deviation of each metric at every iteration.

In [None]:
# Prepare data as Pool object
train_pool = Pool(
    data=X_train,
    label=y_train,
    cat_features=cat_features
)

# Set parameters
params = {
    'iterations': 200,
    'depth': 6,
    'learning_rate': 0.05,
    'loss_function': 'Logloss',
    'random_state': RANDOM_STATE,
    'verbose': False
}

# Run cross-validation
print("Running 5-fold cross-validation...\n")
cv_results = cb.cv(
    pool=train_pool,
    params=params,
    fold_count=5,
    stratified=True,
    partition_random_seed=RANDOM_STATE,
    plot=False,
    verbose=False
)

print("Cross-Validation Results:")
print("=" * 70)
print(f"Best iteration: {cv_results['test-Logloss-mean'].idxmin()}")
print(f"Best CV Logloss: {cv_results['test-Logloss-mean'].min():.4f}")
print(f"Std deviation: {cv_results.loc[cv_results['test-Logloss-mean'].idxmin(), 'test-Logloss-std']:.4f}")

In [None]:
# Visualize CV results
plt.figure(figsize=(12, 6))
plt.plot(cv_results['test-Logloss-mean'], label='CV Mean', linewidth=2)
plt.fill_between(
    range(len(cv_results)),
    cv_results['test-Logloss-mean'] - cv_results['test-Logloss-std'],
    cv_results['test-Logloss-mean'] + cv_results['test-Logloss-std'],
    alpha=0.2,
    label='CV Std'
)
best_iter = cv_results['test-Logloss-mean'].idxmin()
plt.axvline(best_iter, color='red', linestyle='--', linewidth=2,
            label=f'Best iteration ({best_iter})')
plt.xlabel('Iteration', fontsize=12)
plt.ylabel('Logloss', fontsize=12)
plt.title('CatBoost Cross-Validation', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nBuilt-in CV is fast and easy!")
print("Use to find optimal number of iterations.")

## 6. Overfitting Detection

CatBoost's built-in overfitting detector watches the evaluation metric on the validation set and, with `od_type='Iter'`, stops training once that metric has not improved for `od_wait` iterations.

In [None]:
# Split data for validation
X_train_split, X_val, y_train_split, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=RANDOM_STATE
)

# Train with overfitting detector
model_overfit = CatBoostClassifier(
    iterations=500,
    depth=8,  # Deep trees to demonstrate overfitting
    learning_rate=0.1,
    random_state=RANDOM_STATE,
    od_type='Iter',  # Overfitting detection type
    od_wait=20,      # Stop after 20 iterations without improvement on the validation set
    verbose=50
)

print("Training with overfitting detection...\n")
model_overfit.fit(
    X_train_split, y_train_split,
    cat_features=cat_features,
    eval_set=(X_val, y_val),
    plot=False
)

print(f"\nBest iteration: {model_overfit.get_best_iteration()}")
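# With an eval_set, use_best_model defaults to True, so the final model keeps only the trees up to the best iteration
print(f"Trees kept in final model: {model_overfit.tree_count_}")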

In [None]:
# Visualize training progress
train_logloss = model_overfit.evals_result_['learn']['Logloss']
val_logloss = model_overfit.evals_result_['validation']['Logloss']

plt.figure(figsize=(12, 6))
plt.plot(train_logloss, label='Train', linewidth=2)
plt.plot(val_logloss, label='Validation', linewidth=2)
best_iter = model_overfit.get_best_iteration()
plt.axvline(best_iter, color='red', linestyle='--', linewidth=2,
            label=f'Best iteration ({best_iter})')
plt.xlabel('Iteration', fontsize=12)
plt.ylabel('Logloss', fontsize=12)
plt.title('Training with Overfitting Detection', fontsize=14, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n✅ Overfitting detector automatically found optimal stopping point!")
print("No need to manually monitor validation metrics.")

## 7. Three-Way Comparison: XGBoost vs LightGBM vs CatBoost

Let's systematically compare all three libraries on the same dataset.

In [None]:
# Load breast cancer dataset for fair comparison
cancer_data = load_breast_cancer()
X_cancer, y_cancer = cancer_data.data, cancer_data.target

X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_cancer, y_cancer, test_size=0.2, random_state=RANDOM_STATE
)

print(f"Dataset: {len(X_train_c)} train, {len(X_test_c)} test, {X_cancer.shape[1]} features")

# Common parameters (as similar as possible)
n_trees = 100
max_depth = 6
lr = 0.1

results = []

In [None]:
# 1. CatBoost
print("Training CatBoost...")
start = time.time()
cat_comp = CatBoostClassifier(
    iterations=n_trees,
    depth=max_depth,
    learning_rate=lr,
    random_state=RANDOM_STATE,
    verbose=False
)
cat_comp.fit(X_train_c, y_train_c)
cat_time = time.time() - start

start_pred = time.time()
cat_pred = cat_comp.predict(X_test_c)
cat_pred_time = time.time() - start_pred

cat_acc = accuracy_score(y_test_c, cat_pred)
cat_auc = roc_auc_score(y_test_c, cat_comp.predict_proba(X_test_c)[:, 1])

results.append({
    'Model': 'CatBoost',
    'Train Time': cat_time,
    'Pred Time': cat_pred_time,
    'Accuracy': cat_acc,
    'AUC': cat_auc
})

print(f"  Time: {cat_time:.3f}s, Accuracy: {cat_acc:.4f}")

In [None]:
# 2. XGBoost
if XGB_AVAILABLE:
    print("Training XGBoost...")
    start = time.time()
    xgb_comp = xgb.XGBClassifier(
        n_estimators=n_trees,
        max_depth=max_depth,
        learning_rate=lr,
        random_state=RANDOM_STATE,
        eval_metric='logloss',
        verbosity=0
    )
    xgb_comp.fit(X_train_c, y_train_c)
    xgb_time = time.time() - start
    
    start_pred = time.time()
    xgb_pred = xgb_comp.predict(X_test_c)
    xgb_pred_time = time.time() - start_pred
    
    xgb_acc = accuracy_score(y_test_c, xgb_pred)
    xgb_auc = roc_auc_score(y_test_c, xgb_comp.predict_proba(X_test_c)[:, 1])
    
    results.append({
        'Model': 'XGBoost',
        'Train Time': xgb_time,
        'Pred Time': xgb_pred_time,
        'Accuracy': xgb_acc,
        'AUC': xgb_auc
    })
    
    print(f"  Time: {xgb_time:.3f}s, Accuracy: {xgb_acc:.4f}")

In [None]:
# 3. LightGBM
if LGB_AVAILABLE:
    print("Training LightGBM...")
    start = time.time()
    lgb_comp = lgb.LGBMClassifier(
        n_estimators=n_trees,
        max_depth=max_depth,
        learning_rate=lr,
        num_leaves=2**max_depth - 1,  # Approximate equivalent
        random_state=RANDOM_STATE,
        verbose=-1
    )
    lgb_comp.fit(X_train_c, y_train_c)
    lgb_time = time.time() - start
    
    start_pred = time.time()
    lgb_pred = lgb_comp.predict(X_test_c)
    lgb_pred_time = time.time() - start_pred
    
    lgb_acc = accuracy_score(y_test_c, lgb_pred)
    lgb_auc = roc_auc_score(y_test_c, lgb_comp.predict_proba(X_test_c)[:, 1])
    
    results.append({
        'Model': 'LightGBM',
        'Train Time': lgb_time,
        'Pred Time': lgb_pred_time,
        'Accuracy': lgb_acc,
        'AUC': lgb_auc
    })
    
    print(f"  Time: {lgb_time:.3f}s, Accuracy: {lgb_acc:.4f}")

In [None]:
# Display comparison
df_comparison = pd.DataFrame(results)

print("\n" + "=" * 80)
print("XGBoost vs LightGBM vs CatBoost - Comprehensive Comparison")
print("=" * 80)
print(df_comparison.to_string(index=False))

# Find best in each category
print("\n" + "=" * 80)
print("Winners:")
print("=" * 80)
print(f"Fastest training: {df_comparison.loc[df_comparison['Train Time'].idxmin(), 'Model']}")
print(f"Fastest prediction: {df_comparison.loc[df_comparison['Pred Time'].idxmin(), 'Model']}")
print(f"Best accuracy: {df_comparison.loc[df_comparison['Accuracy'].idxmax(), 'Model']}")
print(f"Best AUC: {df_comparison.loc[df_comparison['AUC'].idxmax(), 'Model']}")

In [None]:
# Visualize comparison
if len(results) > 1:
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    # Training time
    axes[0, 0].bar(df_comparison['Model'], df_comparison['Train Time'], 
                   color=['#f39c12', '#e74c3c', '#2ecc71'], edgecolor='black')
    axes[0, 0].set_ylabel('Time (seconds)', fontsize=11)
    axes[0, 0].set_title('Training Time', fontsize=12, fontweight='bold')
    axes[0, 0].grid(axis='y', alpha=0.3)
    
    # Prediction time
    axes[0, 1].bar(df_comparison['Model'], df_comparison['Pred Time'] * 1000,  # Convert to ms
                   color=['#f39c12', '#e74c3c', '#2ecc71'], edgecolor='black')
    axes[0, 1].set_ylabel('Time (milliseconds)', fontsize=11)
    axes[0, 1].set_title('Prediction Time', fontsize=12, fontweight='bold')
    axes[0, 1].grid(axis='y', alpha=0.3)
    
    # Accuracy
    axes[1, 0].bar(df_comparison['Model'], df_comparison['Accuracy'], 
                   color=['#f39c12', '#e74c3c', '#2ecc71'], edgecolor='black')
    axes[1, 0].set_ylabel('Accuracy', fontsize=11)
    axes[1, 0].set_title('Test Accuracy', fontsize=12, fontweight='bold')
    axes[1, 0].set_ylim([0.9, 1.0])
    axes[1, 0].grid(axis='y', alpha=0.3)
    
    # AUC
    axes[1, 1].bar(df_comparison['Model'], df_comparison['AUC'], 
                   color=['#f39c12', '#e74c3c', '#2ecc71'], edgecolor='black')
    axes[1, 1].set_ylabel('AUC-ROC', fontsize=11)
    axes[1, 1].set_title('Test AUC-ROC', fontsize=12, fontweight='bold')
    axes[1, 1].set_ylim([0.9, 1.0])
    axes[1, 1].grid(axis='y', alpha=0.3)
    
    plt.tight_layout()
    plt.show()

print("\n📊 All three libraries are excellent!")
print("Choice depends on your specific use case and priorities.")
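
A single timed call on a few hundred rows is dominated by noise, so the "Pred Time" ranking above can change from run to run. One hedged way to get steadier numbers is to repeat the call and take the median. The helper below is a generic sketch (`median_runtime` is a hypothetical name) using only the standard library.

```python
import statistics
import time

def median_runtime(fn, repeats=20):
    """Median wall-clock time of `fn()` over `repeats` calls, in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Stand-in workload; in this notebook it could be, e.g., lambda: cat_comp.predict(X_test_c)
elapsed = median_runtime(lambda: sum(range(10_000)), repeats=10)
print(f"median runtime: {elapsed * 1000:.3f} ms")
```

`time.perf_counter()` is preferred over `time.time()` for short intervals because it uses the highest-resolution clock available.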

## 8. CatBoost Model Analysis Tools

CatBoost provides excellent tools for model interpretation.

In [None]:
# Get feature importance
feature_importance = cat_comp.get_feature_importance()
feature_names_cancer = cancer_data.feature_names

fi_df = pd.DataFrame({
    'Feature': feature_names_cancer,
    'Importance': feature_importance
}).sort_values('Importance', ascending=False)

print("Top 10 Features:")
print("=" * 70)
print(fi_df.head(10).to_string(index=False))

In [None]:
# Visualize feature importance
top_features = fi_df.head(15)

plt.figure(figsize=(10, 8))
plt.barh(range(len(top_features)), top_features['Importance'], 
         color='#f39c12', edgecolor='black')
plt.yticks(range(len(top_features)), top_features['Feature'])
plt.gca().invert_yaxis()
plt.xlabel('Importance', fontsize=12)
plt.title('CatBoost Feature Importance (Top 15)', fontsize=14, fontweight='bold')
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# SHAP values with CatBoost
if SHAP_AVAILABLE:
    print("Computing SHAP values for CatBoost...")
    explainer = shap.TreeExplainer(cat_comp)
    X_test_sample = X_test_c[:100]
    shap_values = explainer.shap_values(X_test_sample)
    
    print(f"SHAP values computed! Shape: {shap_values.shape}")
else:
    print("SHAP not available. Install with: pip install shap")

In [None]:
if SHAP_AVAILABLE:
    # Global feature importance
    plt.figure(figsize=(10, 8))
    shap.summary_plot(shap_values, X_test_sample, 
                     feature_names=feature_names_cancer,
                     plot_type='bar', show=False)
    plt.title('CatBoost - Global Feature Importance (SHAP)', 
             fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

In [None]:
if SHAP_AVAILABLE:
    # Detailed SHAP summary
    plt.figure(figsize=(10, 10))
    shap.summary_plot(shap_values, X_test_sample,
                     feature_names=feature_names_cancer, show=False)
    plt.title('CatBoost - SHAP Summary Plot', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    print("\nCatBoost integrates seamlessly with SHAP!")
    print("Use for production model interpretation.")

## Exercises

### Exercise 1: Hyperparameter Tuning Comparison

Compare hyperparameter tuning for all three libraries:

1. Create a classification dataset
2. For each library (XGBoost, LightGBM, CatBoost):
   - Define parameter grid with equivalent parameters
   - Use RandomizedSearchCV or GridSearchCV
   - Find optimal parameters
   - Measure tuning time
3. Compare:
   - Best cross-validation scores
   - Time to find best parameters
   - Sensitivity to hyperparameters
4. Determine which library is most sensitive to tuning

In [None]:
# Your code here


### Exercise 2: Categorical Feature Handling Showdown

Create a comprehensive test of categorical handling:

1. Generate dataset with:
   - 3 low-cardinality categoricals (3-5 values)
   - 3 medium-cardinality (20-50 values)
   - 2 high-cardinality (100+ values)
   - 10 numeric features
2. Test encoding strategies:
   - CatBoost native
   - LightGBM native
   - XGBoost + one-hot encoding
   - XGBoost + target encoding
   - XGBoost + label encoding
3. Measure for each:
   - Accuracy
   - Training time
   - Memory usage
   - Code complexity (lines of preprocessing)
4. Determine clear winner for categorical handling

In [None]:
# Your code here


### Exercise 3: Ordered Boosting Impact Analysis

Test CatBoost's ordered boosting claim:

1. Create small dataset prone to overfitting (2000 samples, many features)
2. Train CatBoost with different boosting types:
   - `boosting_type='Ordered'` (set it explicitly; the default depends on the device and dataset size)
   - `boosting_type='Plain'` (like standard gradient boosting)
3. For each:
   - Train multiple times with different train/val splits
   - Measure train vs validation gap
   - Test generalization on holdout set
4. Compare:
   - Overfitting degree
   - Stability across runs
   - Final test performance
5. Determine if ordered boosting provides measurable benefit

In [None]:
# Your code here


### Exercise 4: Symmetric Tree Analysis

Investigate CatBoost's symmetric tree structure:

1. Train CatBoost model and save tree structure
2. Use `calc_feature_statistics()` to analyze how predictions and targets vary across feature values
3. Compare with XGBoost/LightGBM trees:
   - Average tree depth
   - Number of splits
   - Prediction time
4. Test prediction speed scaling:
   - Measure prediction time vs tree depth
   - Compare CatBoost vs others
5. Verify that symmetric trees are faster for prediction

**Hint**: Use CatBoost's `calc_feature_statistics()` method

In [None]:
# Your code here


### Exercise 5: Production Deployment Comparison

Simulate production deployment scenario:

1. Train all three models (XGBoost, LightGBM, CatBoost) on realistic dataset
2. Save models to disk (each library's format)
3. Measure:
   - Model file size
   - Load time from disk
   - Prediction latency (single sample)
   - Prediction throughput (batch of 10,000)
   - Memory footprint during prediction
4. Test deployment features:
   - Model versioning
   - Feature importance extraction
   - Prediction explanation (SHAP)
5. Create deployment recommendation matrix:
   - When to use each library in production
   - Trade-offs (speed vs accuracy vs ease of use)

In [None]:
# Your code here


## Summary

### Key Concepts

1. **CatBoost Innovations**:
   - **Ordered boosting**: Prevents target leakage, better generalization
   - **Symmetric trees**: Faster prediction, less overfitting, cache-friendly
   - **Superior categorical encoding**: Ordered target statistics, no preprocessing
   - **Built-in overfitting detection**: Automatic monitoring and early stopping

2. **Critical Hyperparameters**:
   - `iterations`: Number of trees (use early stopping)
   - `depth`: Tree depth, critical due to symmetric structure (4-10)
   - `learning_rate`: Shrinkage rate (0.01-0.3)
   - `l2_leaf_reg`: L2 regularization (1-10)
   - `border_count`: Number of splits (32-254)
   - `bagging_temperature`: Bayesian bootstrap randomness

3. **Best Practices**:
   - Use categorical features natively - don't encode!
   - Start with default parameters (often excellent)
   - Use built-in cross-validation for hyperparameter search
   - Enable overfitting detection (`od_type='Iter'`)
   - A lower learning rate with more iterations often improves results, at the cost of longer training
   - Monitor train/validation gap

4. **Advantages**:
   - **Best categorical handling**: No preprocessing, excellent accuracy
   - **Excellent defaults**: Works well out-of-the-box
   - **Less overfitting**: Ordered boosting and symmetric trees
   - **Fast prediction**: Symmetric tree structure
   - **Great interpretability**: Built-in tools and SHAP support
   - **Robust**: Less sensitive to hyperparameters

5. **When to Use CatBoost**:
   - ✅ Many categorical features
   - ✅ Need robust default parameters
   - ✅ Want less overfitting risk
   - ✅ Need fast prediction in production
   - ✅ Limited time for hyperparameter tuning
   - ✅ Interpretability is important
   - ⚠️ Training speed critical: Consider LightGBM
   - ⚠️ Very large datasets: LightGBM may be faster

### Strengths

- Superior categorical feature handling
- Excellent default parameters
- Robust to overfitting
- Fast prediction speed
- Great interpretability tools
- Built-in overfitting detection
- Easy to use
- Good documentation

### Weaknesses

- Slower training than LightGBM
- Symmetric trees may limit expressiveness
- Newer library (smaller community than XGBoost)
- GPU support less mature than others
- Less control over tree structure

### XGBoost vs LightGBM vs CatBoost: Decision Guide

**Choose XGBoost when**:
- Need maximum stability and maturity
- Large community and resources important
- Numeric features only
- Want fine-grained control
- Proven track record matters

**Choose LightGBM when**:
- Training speed is critical
- Very large datasets (100K+ samples)
- High-dimensional sparse data
- Limited memory
- Some categorical features

**Choose CatBoost when**:
- Many categorical features
- Limited tuning time (good defaults)
- Prediction speed matters
- Need robust out-of-box performance
- Want less overfitting
- Interpretability important

**Truth**: All three are excellent! Try all and pick what works best.

### Comparison Summary

| Aspect | XGBoost | LightGBM | CatBoost |
|--------|---------|----------|----------|
| **Training Speed** | Fast | Fastest | Moderate |
| **Prediction Speed** | Fast | Fast | Fastest |
| **Accuracy** | Excellent | Excellent | Excellent |
| **Categorical Handling** | Manual | Good | Best |
| **Default Parameters** | Good | Need tuning | Excellent |
| **Overfitting Risk** | Moderate | Higher | Lower |
| **Large Datasets** | Good | Best | Good |
| **Interpretability** | Good | Good | Excellent |
| **Ease of Use** | Moderate | Moderate | Easiest |
| **Maturity** | Most mature | Mature | Newer |
| **Community** | Largest | Large | Growing |

### Tuning Strategy

**Phase 1: Start simple**
1. Use default parameters first
2. Enable overfitting detection
3. Use built-in CV

**Phase 2: Adjust complexity**
4. Tune `depth` (start 6, try 4-10)
5. Adjust `l2_leaf_reg` if overfitting

**Phase 3: Optimize learning**
6. Lower `learning_rate` (0.03-0.05)
7. Increase `iterations` accordingly
8. Use early stopping

**Phase 4: Fine-tune**
9. Adjust `border_count` for speed/accuracy
10. Try `bagging_temperature` for regularization

### Production Tips

1. **Model persistence**: Use `save_model()` / `load_model()`
2. **Categorical features**: Save category mappings
3. **Versioning**: Include hyperparameters in model metadata
4. **Monitoring**: Track feature importance over time
5. **Prediction speed**: Leverage symmetric trees
6. **A/B testing**: Compare model versions
7. **Retraining**: Automated pipeline with overfitting detection

### What's Next?

In **Module 08: Stacking and Blending**, we'll explore:
- Meta-learning and ensemble of ensembles
- Combining XGBoost, LightGBM, and CatBoost
- Multi-level stacking
- Blending vs stacking differences
- Choosing optimal meta-model

### Additional Resources

- **Paper**: "CatBoost: unbiased boosting with categorical features" (Prokhorenkova et al., 2018)
- **Documentation**: [CatBoost Official Docs](https://catboost.ai/docs/)
- **Tutorial**: [CatBoost Tutorial](https://github.com/catboost/tutorials)
- **Comparison**: [Benchmarking Gradient Boosting Libraries](https://catboost.ai/docs/concepts/benchmarks.html)
- **GitHub**: [Yandex CatBoost](https://github.com/catboost/catboost)