# Module 07: CatBoost - Categorical Boosting

**Difficulty**: ⭐⭐
**Estimated Time**: 50 minutes
**Prerequisites**: 
- Module 04: Gradient Boosting Machines
- Module 05: XGBoost
- Module 06: LightGBM

## Learning Objectives
By the end of this notebook, you will be able to:
1. Understand CatBoost's unique features: ordered boosting and symmetric trees
2. Install and use CatBoost for classification and regression
3. Handle categorical features natively, without manual encoding
4. Compare CatBoost with XGBoost and LightGBM in terms of training speed and accuracy
5. Leverage GPU acceleration for faster training
6. Tune CatBoost-specific hyperparameters effectively
7. Understand when to choose CatBoost over other gradient boosting libraries

## 1. Introduction to CatBoost

### What is CatBoost?

**CatBoost** (Categorical Boosting) is a gradient boosting library developed by Yandex in 2017. It's designed to:
- Handle **categorical features natively**, without manual encoding
- Reduce **overfitting** through ordered boosting
- Provide **great performance out-of-the-box** with minimal tuning
- Support **GPU acceleration** efficiently

### Key Innovations:

#### 1. **Ordered Boosting**

**Problem**: Standard gradient boosting suffers from *prediction shift*:
- The gradient for each training sample is computed from a model that was itself fit on that sample
- The gradient estimates are therefore biased relative to unseen data, which contributes to overfitting

**CatBoost Solution**: Ordered Boosting
- Uses different random permutations of the training data
- For each sample, uses only "previous" samples (in permutation order) to compute gradients
- Eliminates prediction shift and reduces overfitting
- Makes the model more robust
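
The ordering idea can be illustrated with a deliberately tiny sketch. It is not CatBoost's implementation: the "model" here is just a running mean of the targets, and the function name `ordered_residuals` is made up for this example. What it shows is that each sample's residual comes from a prediction that never saw that sample's own target.

```python
import numpy as np

def ordered_residuals(y, perm):
    """Residual of each sample against the mean of the targets that
    precede it in the permutation (toy stand-in for a boosted model)."""
    y = np.asarray(y, dtype=float)
    residuals = np.empty_like(y)
    running_sum = 0.0
    for position, idx in enumerate(perm):
        # The prediction uses only samples seen earlier in the permutation;
        # the very first sample falls back to a prediction of 0.0.
        prediction = running_sum / position if position > 0 else 0.0
        residuals[idx] = y[idx] - prediction
        running_sum += y[idx]
    return residuals

y = np.array([1.0, 3.0, 5.0])
perm = np.array([0, 1, 2])  # identity order, to keep the arithmetic easy to follow

plain = y - y.mean()                  # each prediction includes the sample's own target
ordered = ordered_residuals(y, perm)  # no prediction includes the sample's own target

print("plain  :", plain)
print("ordered:", ordered)
```

In a real run the permutation would be random, and a library would typically average over several permutations rather than rely on one.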

#### 2. **Ordered Target Statistics for Categorical Features**

**Problem with naive encoding**:
- Target encoding (mean target per category) causes target leakage
- Label encoding imposes an arbitrary numeric order on the categories
- One-hot encoding creates too many features for high-cardinality categories

**CatBoost Solution**: Ordered Target Statistics
- For each sample, calculates target statistics using only "prior" samples
- Adds random noise for regularization
- No target leakage, better generalization

Example:
```
Data (in random order):
Category  Target
   A        1
   B        0
   A        1     <- Encoding for this A uses only previous A (one sample, target=1)
   B        1     <- Encoding for this B uses only previous B (one sample, target=0)
   A        0     <- Encoding uses previous A's (targets: 1, 1)
```
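
The table above can be reproduced with a short sketch. The smoothing used here, `(sum of prior targets + a * prior) / (prior count + a)`, is a simplified form of an ordered target statistic. The values `prior=0.5` and `a=1.0` and the function name `ordered_target_stats` are assumptions chosen for illustration, and the random noise mentioned above is left out.

```python
from collections import defaultdict

def ordered_target_stats(categories, targets, prior=0.5, a=1.0):
    """Encode each row using only the targets of earlier rows
    that share its category (rows are assumed to be in permutation order)."""
    sums = defaultdict(float)   # running target sum per category
    counts = defaultdict(int)   # running row count per category
    encoded = []
    for cat, target in zip(categories, targets):
        # Compute the statistic BEFORE adding the current row's target,
        # so a row never sees its own label.
        encoded.append((sums[cat] + a * prior) / (counts[cat] + a))
        sums[cat] += target
        counts[cat] += 1
    return encoded

categories = ["A", "B", "A", "B", "A"]
targets = [1, 0, 1, 1, 0]

encoded = ordered_target_stats(categories, targets)
for cat, target, value in zip(categories, targets, encoded):
    print(f"{cat}  target={target}  encoding={value:.4f}")
```

The first occurrence of each category has no history, so it receives the prior itself; later occurrences move toward the mean of the earlier targets for that category.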

#### 3. **Symmetric (Oblivious) Decision Trees**

**Structure**:
- All nodes at the same level use the **same splitting criterion**
- Results in a **balanced tree** structure
- Each leaf can be indexed by a binary code

**Advantages**:
- **Faster prediction**: The leaf index is computed directly from one comparison per level, so no node-by-node traversal is needed
- **Less overfitting**: More regularized than asymmetric trees
- **Better for CPU cache**: Predictable access patterns

```
Regular Tree:              Symmetric Tree:
      X1<5                      X1<5
     /    \                    /    \
  X2<3   X3<7              X2<3      X2<3
  /  \    /  \             /  \      /  \
 L1  L2  L3  L4          L1  L2    L3  L4

Same split (X2<3) at both nodes in level 2
```
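
The binary leaf code can be sketched in a few lines. This is an illustration of the idea only: the function name `leaf_index`, the leaf values, and the bit ordering (first level as the most significant bit, a bit of 1 meaning "go right") are assumptions for this example, and CatBoost's internal layout may differ.

```python
import numpy as np

def leaf_index(x, splits):
    """Leaf index of sample x in an oblivious tree.

    splits holds one (feature, threshold) pair per level; every node
    on a level shares that pair, so each level contributes one bit."""
    index = 0
    for feature, threshold in splits:
        # Bit is 0 when x[feature] < threshold (left), 1 otherwise (right).
        index = (index << 1) | int(x[feature] >= threshold)
    return index

# The symmetric tree from the diagram: X1 < 5 on level 1, X2 < 3 on level 2.
splits = [(0, 5.0), (1, 3.0)]
leaf_values = np.array([0.1, 0.2, 0.3, 0.4])  # hypothetical values for L1..L4

samples = np.array([
    [4.0, 2.0],  # X1 < 5,  X2 < 3
    [4.0, 4.0],  # X1 < 5,  X2 >= 3
    [6.0, 2.0],  # X1 >= 5, X2 < 3
    [6.0, 4.0],  # X1 >= 5, X2 >= 3
])

for x in samples:
    i = leaf_index(x, splits)
    print(x, "-> leaf", f"L{i + 1}", "value", leaf_values[i])
```

Because the comparisons do not depend on one another, all of them can be evaluated up front and the result used as an array index, which is what makes prediction with this structure cheap.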

### CatBoost vs XGBoost vs LightGBM:

| Feature | XGBoost | LightGBM | CatBoost |
|---------|---------|----------|----------|
| Speed | Fast | **Faster** | Moderate |
| Categorical Support | Manual | Good | **Best** |
| Overfitting Resistance | Good | Moderate | **Best** |
| Default Performance | Good | Good | **Best** |
| Hyperparameter Tuning | Needs tuning | Needs tuning | **Works well with defaults** |
| Tree Structure | Level-wise | Leaf-wise | **Symmetric** |
| Prediction Speed | Fast | Fast | **Fastest** |
| GPU Support | Yes | Yes | **Yes (better)** |

## 2. Setup and Installation

In [None]:
# Install CatBoost if not already installed
# Uncomment the line below if needed
# !pip install catboost

# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from time import time

# CatBoost
from catboost import CatBoostClassifier, CatBoostRegressor, Pool

# Other boosting libraries for comparison
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Scikit-learn
from sklearn.datasets import make_classification, make_regression, load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    mean_squared_error, r2_score, log_loss
)

# Configuration
%matplotlib inline
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

# Set random seed
np.random.seed(42)

# Check version
import catboost
print(f"CatBoost version: {catboost.__version__}")

## 3. CatBoost for Classification

In [None]:
# Load breast cancer dataset
cancer_data = load_breast_cancer()
X = cancer_data.data
y = cancer_data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"Features: {len(cancer_data.feature_names)}")

In [None]:
# Train a basic CatBoost classifier
cat_clf = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=3,
    random_state=42,
    verbose=False  # Set to True to see training progress
)

# Time the training
start_time = time()
cat_clf.fit(X_train, y_train)
cat_time = time() - start_time

# Predictions
y_pred_train = cat_clf.predict(X_train)
y_pred_test = cat_clf.predict(X_test)

# Evaluate
train_acc = accuracy_score(y_train, y_pred_train)
test_acc = accuracy_score(y_test, y_pred_test)

print("CatBoost Classifier Performance:")
print(f"Training time: {cat_time:.4f} seconds")
print(f"Train Accuracy: {train_acc:.4f}")
print(f"Test Accuracy: {test_acc:.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred_test, target_names=cancer_data.target_names))

In [None]:
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred_test)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Purples',
            xticklabels=cancer_data.target_names,
            yticklabels=cancer_data.target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('CatBoost Confusion Matrix')
plt.show()

## 4. Comparison: CatBoost vs XGBoost vs LightGBM

Let's compare all three libraries on the same dataset.

In [None]:
# Create a larger dataset for better comparison
X_large, y_large = make_classification(
    n_samples=30000,
    n_features=30,
    n_informative=20,
    n_redundant=5,
    random_state=42
)

X_train_large, X_test_large, y_train_large, y_test_large = train_test_split(
    X_large, y_large, test_size=0.2, random_state=42
)

print(f"Dataset size - Train: {X_train_large.shape}, Test: {X_test_large.shape}")

In [None]:
# Common parameters for fair comparison
n_estimators = 100
learning_rate = 0.1
max_depth = 5

results = {}

# 1. CatBoost
print("Training CatBoost...")
start = time()
cat_model = CatBoostClassifier(
    iterations=n_estimators,
    learning_rate=learning_rate,
    depth=max_depth,
    random_state=42,
    verbose=False
)
cat_model.fit(X_train_large, y_train_large)
cat_train_time = time() - start
cat_acc = cat_model.score(X_test_large, y_test_large)
results['CatBoost'] = {'time': cat_train_time, 'accuracy': cat_acc}
print(f"  Time: {cat_train_time:.2f}s, Accuracy: {cat_acc:.4f}")

# 2. XGBoost
print("\nTraining XGBoost...")
start = time()
xgb_model = XGBClassifier(
    n_estimators=n_estimators,
    learning_rate=learning_rate,
    max_depth=max_depth,
    random_state=42,
    eval_metric='logloss',
    verbosity=0
)
xgb_model.fit(X_train_large, y_train_large)
xgb_train_time = time() - start
xgb_acc = xgb_model.score(X_test_large, y_test_large)
results['XGBoost'] = {'time': xgb_train_time, 'accuracy': xgb_acc}
print(f"  Time: {xgb_train_time:.2f}s, Accuracy: {xgb_acc:.4f}")

# 3. LightGBM
print("\nTraining LightGBM...")
start = time()
lgb_model = LGBMClassifier(
    n_estimators=n_estimators,
    learning_rate=learning_rate,
    max_depth=max_depth,
    random_state=42,
    verbose=-1
)
lgb_model.fit(X_train_large, y_train_large)
lgb_train_time = time() - start
lgb_acc = lgb_model.score(X_test_large, y_test_large)
results['LightGBM'] = {'time': lgb_train_time, 'accuracy': lgb_acc}
print(f"  Time: {lgb_train_time:.2f}s, Accuracy: {lgb_acc:.4f}")

In [None]:
# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

models = list(results.keys())
times = [results[m]['time'] for m in models]
accuracies = [results[m]['accuracy'] for m in models]
colors = ['purple', 'orange', 'green']

# Training time comparison
axes[0].bar(models, times, color=colors, alpha=0.7)
axes[0].set_ylabel('Training Time (seconds)')
axes[0].set_title('Training Speed Comparison')
axes[0].grid(axis='y', alpha=0.3)
for i, (model, t) in enumerate(zip(models, times)):
    axes[0].text(i, t + 0.1, f'{t:.2f}s', ha='center', va='bottom')

# Accuracy comparison
axes[1].bar(models, accuracies, color=colors, alpha=0.7)
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Accuracy Comparison')
axes[1].set_ylim([min(accuracies) - 0.01, 1.0])
axes[1].grid(axis='y', alpha=0.3)
for i, (model, acc) in enumerate(zip(models, accuracies)):
    axes[1].text(i, acc + 0.001, f'{acc:.4f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()

# Summary
print("\nSummary:")
fastest = min(models, key=lambda m: results[m]['time'])
most_accurate = max(models, key=lambda m: results[m]['accuracy'])
print(f"Fastest: {fastest} ({results[fastest]['time']:.2f}s)")
print(f"Most Accurate: {most_accurate} ({results[most_accurate]['accuracy']:.4f})")

## 5. Categorical Feature Handling

CatBoost's strongest feature is its superior handling of categorical variables. Let's see it in action!

In [None]:
# Create a dataset with categorical features
np.random.seed(42)
n_samples = 2000

# Create DataFrame with mixed features
df_cat = pd.DataFrame({
    'num_feat1': np.random.randn(n_samples),
    'num_feat2': np.random.randn(n_samples),
    'num_feat3': np.random.randn(n_samples),
    'category1': np.random.choice(['A', 'B', 'C', 'D', 'E'], n_samples),
    'category2': np.random.choice(['Low', 'Medium', 'High', 'Very High'], n_samples),
    'category3': np.random.choice(range(20), n_samples),  # High cardinality
    'category4': np.random.choice(['Type1', 'Type2', 'Type3'], n_samples)
})

# Create target influenced by categorical features
y_cat = (
    df_cat['num_feat1'] +
    (df_cat['category1'] == 'A').astype(int) * 2 +
    (df_cat['category2'] == 'High').astype(int) * 1.5 +
    (df_cat['category3'] > 10).astype(int) * 1 +
    np.random.randn(n_samples) * 0.5
) > 1
y_cat = y_cat.astype(int)

print("Dataset with categorical features:")
print(df_cat.head())
print(f"\nShape: {df_cat.shape}")
print(f"\nData types:")
print(df_cat.dtypes)
print(f"\nTarget distribution: {pd.Series(y_cat).value_counts().to_dict()}")

In [None]:
# Split data
X_cat = df_cat
X_train_cat, X_test_cat, y_train_cat, y_test_cat = train_test_split(
    X_cat, y_cat, test_size=0.2, random_state=42
)

# Identify categorical features (columns 3-6)
cat_features = ['category1', 'category2', 'category3', 'category4']
cat_indices = [3, 4, 5, 6]  # Column indices

print(f"Categorical features: {cat_features}")

In [None]:
# Train CatBoost with categorical features
# No preprocessing needed!
cat_with_cat = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=5,
    random_state=42,
    verbose=False
)

# Specify categorical features by column index or by column name
cat_with_cat.fit(
    X_train_cat, y_train_cat,
    cat_features=cat_indices  # Passing the name list (cat_features=cat_features) also works
)

# Evaluate
train_acc_cat = cat_with_cat.score(X_train_cat, y_train_cat)
test_acc_cat = cat_with_cat.score(X_test_cat, y_test_cat)

print("CatBoost with Native Categorical Support:")
print(f"Train Accuracy: {train_acc_cat:.4f}")
print(f"Test Accuracy: {test_acc_cat:.4f}")
print(f"\nNo preprocessing needed - CatBoost handles categorical features automatically!")

In [None]:
# Compare with one-hot encoding approach
# Encode all four categorical columns: without cat_features, CatBoost
# cannot parse raw string columns such as category4
X_encoded = pd.get_dummies(X_cat, columns=cat_features, drop_first=True, dtype=int)

X_train_enc, X_test_enc, y_train_enc, y_test_enc = train_test_split(
    X_encoded, y_cat, test_size=0.2, random_state=42
)

# Train with one-hot encoded features
start_time = time()
cat_encoded = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=5,
    random_state=42,
    verbose=False
)
cat_encoded.fit(X_train_enc, y_train_enc)
encoded_time = time() - start_time
encoded_acc = cat_encoded.score(X_test_enc, y_test_enc)

# Train with native categorical
start_time = time()
cat_native = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=5,
    random_state=42,
    verbose=False
)
cat_native.fit(X_train_cat, y_train_cat, cat_features=cat_indices)
native_time = time() - start_time
native_acc = cat_native.score(X_test_cat, y_test_cat)

# Compare
print("Comparison: One-Hot Encoding vs Native Categorical:")
print(f"\nOne-Hot Encoded:")
print(f"  Features: {X_encoded.shape[1]}")
print(f"  Training time: {encoded_time:.4f}s")
print(f"  Test Accuracy: {encoded_acc:.4f}")
print(f"\nNative Categorical:")
print(f"  Features: {X_cat.shape[1]}")
print(f"  Training time: {native_time:.4f}s")
print(f"  Test Accuracy: {native_acc:.4f}")
print(f"\nAccuracy difference (native - one-hot): {(native_acc - encoded_acc):+.4f}")
print(f"Training time ratio (one-hot / native): {encoded_time/native_time:.2f}x")

**Key Insight**: CatBoost's native categorical handling simplifies the workflow (no encoding needed) and often improves accuracy, especially with high-cardinality categorical features. Training time can go either way, because computing ordered target statistics adds work, so measure both approaches on your own data.

## 6. CatBoost Hyperparameters

CatBoost uses different parameter names than XGBoost/LightGBM:

### Core Parameters:

1. **iterations** (n_estimators):
   - Number of boosting rounds
   - Default: 1000

2. **learning_rate**:
   - Step size shrinkage
   - Default: 0.03 (more conservative than XGBoost's 0.3 or LightGBM's 0.1); recent versions may choose a value automatically when it is not set
   - Typical range: 0.01-0.3

3. **depth** (max_depth):
   - Maximum tree depth
   - Default: 6
   - Typical range: 4-10

4. **l2_leaf_reg** (reg_lambda):
   - L2 regularization
   - Default: 3.0

### CatBoost-Specific Parameters:

5. **border_count**:
   - Number of splits for numerical features
   - Default: 254 (CPU), 128 (GPU)
   - Higher → more accurate but slower

6. **bagging_temperature**:
   - Controls intensity of Bayesian bootstrap
   - Default: 1.0
   - Higher → more aggressive bootstrap; 0 gives every sample a weight of 1
   - Applies only when `bootstrap_type='Bayesian'`

7. **random_strength**:
   - Amount of randomness for scoring splits
   - Default: 1.0
   - Higher → more randomness

8. **grow_policy**:
   - Tree growing strategy
   - 'SymmetricTree' (default): Symmetric trees
   - 'Depthwise': Like XGBoost
   - 'Lossguide': Like LightGBM

In [None]:
# Effect of tree depth
depths = [3, 4, 6, 8, 10]
depth_results = []

for depth in depths:
    model = CatBoostClassifier(
        iterations=100,
        learning_rate=0.1,
        depth=depth,
        random_state=42,
        verbose=False
    )
    model.fit(X_train, y_train)
    
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    
    depth_results.append({
        'depth': depth,
        'train_acc': train_acc,
        'test_acc': test_acc,
        'gap': train_acc - test_acc
    })
    
    print(f"Depth={depth}: Train={train_acc:.4f}, Test={test_acc:.4f}")

depth_df = pd.DataFrame(depth_results)

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

axes[0].plot(depth_df['depth'], depth_df['train_acc'], 'o-', 
             label='Train', linewidth=2, markersize=8, color='purple')
axes[0].plot(depth_df['depth'], depth_df['test_acc'], 's-', 
             label='Test', linewidth=2, markersize=8, color='orange')
axes[0].set_xlabel('Tree Depth')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Accuracy vs Tree Depth')
axes[0].legend()
axes[0].grid(alpha=0.3)

axes[1].plot(depth_df['depth'], depth_df['gap'], 'ro-', linewidth=2, markersize=8)
axes[1].set_xlabel('Tree Depth')
axes[1].set_ylabel('Train-Test Gap')
axes[1].set_title('Overfitting Gap vs Tree Depth')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Effect of learning rate
learning_rates = [0.01, 0.03, 0.1, 0.3]
lr_results = []

for lr in learning_rates:
    model = CatBoostClassifier(
        iterations=100,
        learning_rate=lr,
        depth=6,
        random_state=42,
        verbose=False
    )
    model.fit(X_train, y_train)
    
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)
    
    lr_results.append({
        'learning_rate': lr,
        'train_acc': train_acc,
        'test_acc': test_acc
    })
    
    print(f"LR={lr}: Train={train_acc:.4f}, Test={test_acc:.4f}")

lr_df = pd.DataFrame(lr_results)

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))
x = np.arange(len(learning_rates))
width = 0.35

ax.bar(x - width/2, lr_df['train_acc'], width, label='Train', 
       color='purple', alpha=0.7)
ax.bar(x + width/2, lr_df['test_acc'], width, label='Test', 
       color='orange', alpha=0.7)

ax.set_xlabel('Learning Rate')
ax.set_ylabel('Accuracy')
ax.set_title('Effect of Learning Rate on Performance')
ax.set_xticks(x)
ax.set_xticklabels(learning_rates)
ax.legend()
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

## 7. Early Stopping and Monitoring

In [None]:
# Create validation set
X_train_sub, X_val, y_train_sub, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42, stratify=y_train
)

# Train with early stopping
cat_early = CatBoostClassifier(
    iterations=500,
    learning_rate=0.05,
    depth=6,
    early_stopping_rounds=20,
    random_state=42,
    verbose=False
)

# Fit with validation set
cat_early.fit(
    X_train_sub, y_train_sub,
    eval_set=(X_val, y_val),
    use_best_model=True  # Use the best model according to validation
)

print(f"Best iteration: {cat_early.get_best_iteration()}")
print(f"Best score: {cat_early.get_best_score()['validation']['Logloss']:.4f}")
print(f"Test Accuracy: {cat_early.score(X_test, y_test):.4f}")

In [None]:
# Plot training history
# CatBoost stores training history automatically
train_metrics = cat_early.get_evals_result()['learn']['Logloss']
val_metrics = cat_early.get_evals_result()['validation']['Logloss']

plt.figure(figsize=(10, 6))
plt.plot(train_metrics, label='Training', linewidth=2, color='purple')
plt.plot(val_metrics, label='Validation', linewidth=2, color='orange')
plt.axvline(x=cat_early.get_best_iteration(), color='red', linestyle='--', 
           label=f'Best iteration ({cat_early.get_best_iteration()})')
plt.xlabel('Iteration')
plt.ylabel('Log Loss')
plt.title('CatBoost Training Progress')
plt.legend()
plt.grid(alpha=0.3)
plt.show()

## 8. Feature Importance

In [None]:
# Get feature importances
feature_importance = cat_clf.get_feature_importance()
feature_names = cancer_data.feature_names

# Create DataFrame
importance_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importance
}).sort_values('Importance', ascending=False)

print("Top 10 Most Important Features:")
print(importance_df.head(10))

# Visualize
plt.figure(figsize=(10, 8))
top_features = importance_df.head(15)
plt.barh(range(len(top_features)), top_features['Importance'], 
         color='purple', alpha=0.7)
plt.yticks(range(len(top_features)), top_features['Feature'])
plt.xlabel('Feature Importance')
plt.title('Top 15 Feature Importances - CatBoost')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

## 9. CatBoost for Regression

In [None]:
# Create regression dataset
X_reg, y_reg = make_regression(
    n_samples=1000,
    n_features=10,
    n_informative=8,
    noise=10,
    random_state=42
)

X_train_reg, X_test_reg, y_train_reg, y_test_reg = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

# Train CatBoost Regressor
cat_reg = CatBoostRegressor(
    iterations=100,
    learning_rate=0.1,
    depth=6,
    random_state=42,
    verbose=False
)

cat_reg.fit(X_train_reg, y_train_reg)

# Predictions
y_pred_train_reg = cat_reg.predict(X_train_reg)
y_pred_test_reg = cat_reg.predict(X_test_reg)

# Evaluate
train_r2 = r2_score(y_train_reg, y_pred_train_reg)
test_r2 = r2_score(y_test_reg, y_pred_test_reg)
train_mse = mean_squared_error(y_train_reg, y_pred_train_reg)
test_mse = mean_squared_error(y_test_reg, y_pred_test_reg)

print("CatBoost Regressor Performance:")
print(f"Train R²: {train_r2:.4f}")
print(f"Test R²: {test_r2:.4f}")
print(f"Train MSE: {train_mse:.2f}")
print(f"Test MSE: {test_mse:.2f}")

In [None]:
# Visualize predictions
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

axes[0].scatter(y_train_reg, y_pred_train_reg, alpha=0.5, color='purple')
axes[0].plot([y_train_reg.min(), y_train_reg.max()],
             [y_train_reg.min(), y_train_reg.max()],
             'r--', linewidth=2, label='Perfect prediction')
axes[0].set_xlabel('Actual values')
axes[0].set_ylabel('Predicted values')
axes[0].set_title(f'Training Set (R² = {train_r2:.4f})')
axes[0].legend()

axes[1].scatter(y_test_reg, y_pred_test_reg, alpha=0.5, color='purple')
axes[1].plot([y_test_reg.min(), y_test_reg.max()],
             [y_test_reg.min(), y_test_reg.max()],
             'r--', linewidth=2, label='Perfect prediction')
axes[1].set_xlabel('Actual values')
axes[1].set_ylabel('Predicted values')
axes[1].set_title(f'Test Set (R² = {test_r2:.4f})')
axes[1].legend()

plt.tight_layout()
plt.show()

## 10. CatBoost Pool (Advanced)

CatBoost has a special `Pool` object for efficient data handling, especially with categorical features.

In [None]:
# Create Pool objects
train_pool = Pool(
    data=X_train_cat,
    label=y_train_cat,
    cat_features=cat_indices
)

test_pool = Pool(
    data=X_test_cat,
    label=y_test_cat,
    cat_features=cat_indices
)

# Train using Pool
cat_pool_model = CatBoostClassifier(
    iterations=100,
    learning_rate=0.1,
    depth=5,
    random_state=42,
    verbose=False
)

cat_pool_model.fit(train_pool)

# Evaluate
pool_acc = cat_pool_model.score(test_pool)
print(f"Test Accuracy using Pool: {pool_acc:.4f}")
print("\nPool objects provide:")
print("- More efficient memory usage")
print("- Faster data loading")
print("- Better handling of categorical features")
print("- Support for weights and baselines")

## 11. Exercises

### Exercise 1: Basic CatBoost Training

Create a synthetic classification dataset with 1500 samples and 20 features. Train a CatBoost classifier with:
- 150 iterations
- Learning rate of 0.05
- Depth of 7

Report training and test accuracy.

In [None]:
# Your code here


### Exercise 2: Three-Way Comparison

Create a dataset with 20,000 samples and 40 features. Train CatBoost, XGBoost, and LightGBM with identical parameters. Compare:
1. Training time
2. Test accuracy
3. Create visualization showing both metrics

In [None]:
# Your code here


### Exercise 3: Categorical Feature Experiment

Create a dataset with:
- 3 numerical features
- 4 categorical features (varying cardinalities: 5, 10, 20, 50)
- 3000 samples

Compare CatBoost's native categorical handling vs one-hot encoding. Which performs better?

In [None]:
# Your code here


### Exercise 4: Hyperparameter Tuning

Using the breast cancer dataset, test different combinations of:
- depth: [4, 6, 8]
- learning_rate: [0.01, 0.05, 0.1]
- l2_leaf_reg: [1, 3, 5]

Find the best combination and report the improvement over defaults.

In [None]:
# Your code here


### Exercise 5: Early Stopping Analysis

Train a CatBoost model on breast cancer data with:
- iterations=1000
- Early stopping with 50 rounds patience
- Learning rate of 0.02

Plot the training/validation curves and identify:
1. When did training stop?
2. Was there overfitting before stopping?
3. What's the final test accuracy?

In [None]:
# Your code here


## 12. Summary

In this notebook, you learned about CatBoost, Yandex's state-of-the-art gradient boosting library:

### Key Innovations:

1. **Ordered Boosting**:
   - Eliminates prediction shift
   - Uses random permutations to compute unbiased gradients
   - Result: Better generalization, less overfitting

2. **Ordered Target Statistics**:
   - Optimal encoding for categorical features
   - No target leakage
   - Superior to one-hot and label encoding

3. **Symmetric (Oblivious) Trees**:
   - Same split at each level
   - Faster prediction
   - Better CPU cache utilization
   - More regularized

### Important Parameters:

- **iterations**: Number of trees
- **learning_rate**: Step size (default 0.03, more conservative)
- **depth**: Tree depth (default 6)
- **l2_leaf_reg**: L2 regularization (default 3.0)
- **border_count**: Numerical feature splits
- **bagging_temperature**: Bootstrap intensity
- **random_strength**: Split randomness

### Advantages:

✅ **Best categorical feature handling** - no preprocessing needed
✅ **Excellent default parameters** - works well out-of-the-box
✅ **Resistant to overfitting** - ordered boosting
✅ **High accuracy** - often best performance with minimal tuning
✅ **Fastest prediction** - symmetric trees
✅ **GPU support** - efficient implementation
✅ **Built-in cross-validation**
✅ **Handles missing values** automatically

### Disadvantages:

❌ **Slower training** than LightGBM on large datasets
❌ **More memory usage** than other libraries
❌ **Fewer hyperparameters to tune** (can be good or bad)
❌ **Less documentation** than XGBoost

### When to Choose CatBoost:

**Use CatBoost when:**
- Working with **categorical features** (especially high-cardinality)
- Want **best accuracy with minimal tuning**
- Need **robust default parameters**
- **Overfitting** is a concern
- Want **fastest prediction time**
- Dataset is **medium-sized** (10K - 1M rows)

**Consider alternatives when:**
- Dataset is **very large** (> 10M rows) → Use LightGBM
- **Training speed** is critical → Use LightGBM
- Need **extensive community support** → Use XGBoost
- Working with **image/text data** → Use deep learning

### Comparison Summary:

| Metric | XGBoost | LightGBM | CatBoost |
|--------|---------|----------|----------|
| **Training Speed** | ★★★ | ★★★★★ | ★★★ |
| **Prediction Speed** | ★★★★ | ★★★★ | ★★★★★ |
| **Accuracy** | ★★★★ | ★★★★ | ★★★★★ |
| **Categorical Support** | ★ | ★★★ | ★★★★★ |
| **Default Performance** | ★★★ | ★★★ | ★★★★★ |
| **Overfitting Resistance** | ★★★ | ★★ | ★★★★★ |
| **Documentation** | ★★★★★ | ★★★★ | ★★★ |
| **Community** | ★★★★★ | ★★★★ | ★★★ |

### Best Practices:

1. **Start with defaults** - CatBoost works well without tuning:
   ```python
   CatBoostClassifier(
       iterations=1000,
       learning_rate=0.03,  # Conservative default
       early_stopping_rounds=50  # Only takes effect when fit() is given an eval_set
   )
   ```

2. **Always specify categorical features** for best results
3. **Use early stopping** to find optimal tree count
4. **Use Pool objects** for large datasets
5. **Monitor train/validation curves**, even though overfitting is less common
6. **Consider GPU** for faster training on large data

### What's Next?

In the next module, we'll explore **Stacking and Blending**, advanced ensemble techniques that combine multiple models:
- How to stack different types of models
- Training meta-learners
- Multi-level stacking
- When stacking helps vs when it doesn't

Stacking often combines the strengths of XGBoost, LightGBM, and CatBoost for even better performance!

## Additional Resources

- [CatBoost Documentation](https://catboost.ai/docs/)
- [CatBoost Parameters](https://catboost.ai/docs/concepts/python-reference_parameters-list.html)
- [Original CatBoost Paper (Prokhorenkova et al., 2018)](https://arxiv.org/abs/1706.09516)
- [CatBoost Tutorials](https://github.com/catboost/tutorials)
- [Handling Categorical Features](https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html)