# Day 5: CatBoost for Trading

## 🎯 Learning Objectives
- Understand CatBoost innovations
- Handle categorical features natively
- Apply ordered boosting
- Compare with XGBoost and LightGBM

---

## 📚 Theory: CatBoost

### What Makes CatBoost Different

| Feature | CatBoost Innovation |
|---------|--------------------|
| Categorical | Target encoding with ordered statistics (no manual encoding needed) |
| Boosting | Ordered boosting (reduces prediction shift and overfitting) |
| Trees | Symmetric (oblivious) trees |
| GPU | Built-in GPU training (`task_type='GPU'`) |

### Ordered Target Statistics
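Traditional target encoding replaces a category with the mean target of every row in that category, so each row's own label leaks into its feature. CatBoost instead encodes row $i$ using only the rows that precede it in a random permutation of the training data:
$$\hat{x}_i = \frac{\sum_{j<i} [x_j = x_i] \cdot y_j + a \cdot p}{\sum_{j<i} [x_j = x_i] + a}$$

Where:
- $[x_j = x_i]$: 1 if row $j$ has the same category as row $i$, else 0
- The sums run only over samples **before** the current sample ($j < i$), so $y_i$ never enters its own encoding
- $a$: smoothing parameter
- $p$: prior (target mean)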
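
The formula can be traced by hand on a few rows. The sketch below is a minimal illustration, not CatBoost's implementation: the function name `ordered_target_stats`, the toy data, and the values `a = 1` and `p = 0.5` are assumptions, and the list order stands in for the random permutation.

```python
def ordered_target_stats(categories, targets, a, prior):
    """Encode each row's category using only the targets of earlier rows."""
    sums, counts, encoded = {}, {}, []
    for x, y in zip(categories, targets):
        s = sums.get(x, 0.0)    # sum of y_j over earlier rows with the same category
        n = counts.get(x, 0)    # number of earlier rows with the same category
        encoded.append((s + a * prior) / (n + a))
        # Only now does the current row become visible to later rows.
        sums[x] = s + y
        counts[x] = n + 1
    return encoded


cats = ["high", "low", "high", "high", "low"]
ys = [1, 0, 0, 1, 1]
enc = ordered_target_stats(cats, ys, a=1.0, prior=0.5)
print(enc)  # [0.5, 0.5, 0.75, 0.5, 0.25]
```

The first occurrence of each category falls back to the prior, and the third row ("high") sees only the single earlier "high" row with target 1, giving (1 + 0.5) / (1 + 1) = 0.75.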

### Ordered Boosting
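- Standard boosting: the residual for each sample comes from a model that was fitted on that same sample
- Ordered boosting: draw a random permutation and compute each sample's residual with trees built only on the samples preceding it
- This reduces prediction shift, a form of target leakage in which training-set predictions look more accurate than predictions on unseen data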
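
The difference between the two residual schemes can be seen with a deliberately trivial model. The sketch below is a toy illustration under stated assumptions: the "model" is just a running mean rather than a tree ensemble, and the function names, the data, and the permutation are hypothetical.

```python
def standard_residuals(y):
    """Residuals against a mean fitted on all samples, including each sample itself."""
    mean_all = sum(y) / len(y)
    return [yi - mean_all for yi in y]


def ordered_residuals(y, permutation, prior=0.0):
    """Residuals against a mean fitted only on samples earlier in the permutation."""
    residuals = [0.0] * len(y)
    running_sum, seen = 0.0, 0
    for idx in permutation:
        # The prediction for this sample never uses the sample's own target.
        prediction = running_sum / seen if seen else prior
        residuals[idx] = y[idx] - prediction
        running_sum += y[idx]
        seen += 1
    return residuals


y = [1.0, 3.0, 2.0, 6.0]
perm = [2, 0, 3, 1]  # hypothetical random permutation of the row indices
std_res = standard_residuals(y)
ord_res = ordered_residuals(y, perm)
print(std_res)  # [-2.0, 0.0, -1.0, 3.0]
print(ord_res)  # [-1.0, 0.0, 2.0, 4.5]
```

In the ordered version the largest target (6.0) produces a residual of 4.5 rather than 3.0, because the model predicting it had not already been pulled toward it.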

### Symmetric Trees
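- Every node at a given depth uses the same feature and threshold
- Faster inference: a depth-$d$ tree reduces to $d$ comparisons plus a lookup into $2^d$ leaf values
- The constrained structure acts as a natural regularizer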
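
Because each level shares one split, evaluating a symmetric tree amounts to building a binary index from the comparison results. The following is a minimal sketch of that idea; the function name, the split list, and the leaf values are made up for illustration and do not reflect CatBoost's internal data layout.

```python
def predict_oblivious(x, splits, leaf_values):
    """Evaluate one symmetric tree.

    splits: one (feature_index, threshold) pair per level, top to bottom.
    leaf_values: list of 2 ** len(splits) leaf outputs.
    """
    index = 0
    for feature, threshold in splits:
        # Each level contributes one bit: 1 if the sample goes right, else 0.
        index = (index << 1) | int(x[feature] > threshold)
    return leaf_values[index]


splits = [(0, 0.5), (1, 10.0)]         # depth-2 tree: feature 0, then feature 1
leaf_values = [0.1, 0.2, 0.3, 0.4]     # indexed by the two comparison bits

pred_a = predict_oblivious([0.7, 5.0], splits, leaf_values)   # bits 1,0 -> leaf 2
pred_b = predict_oblivious([0.2, 12.0], splits, leaf_values)  # bits 0,1 -> leaf 1
print(pred_a, pred_b)  # 0.3 0.2
```

There is no branching on tree structure, only a fixed sequence of comparisons, which is why this layout is fast at inference time.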

---

In [None]:
import numpy as np
import pandas as pd
import yfinance as yf
from datetime import datetime, timedelta
import matplotlib.pyplot as plt
import warnings
from catboost import CatBoostClassifier, Pool
import lightgbm as lgb
from xgboost import XGBClassifier
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
import time

warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-whitegrid')

TRADING_DAYS = 252
RISK_FREE_RATE = 0.05

# Download data
ticker = 'AAPL'
end_date = datetime.now()
start_date = end_date - timedelta(days=5*365)

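print("📥 Downloading data...")
data = yf.download(ticker, start=start_date, end=end_date, progress=False, auto_adjust=True)
# Recent yfinance versions can return MultiIndex columns even for one ticker,
# so squeeze single-column frames down to a Series (a no-op if already a Series).
prices = data['Close'].squeeze()
volume = data['Volume'].squeeze()
returns = prices.pct_change().dropna()

print(f"✅ Data: {len(prices)} days")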

In [None]:
# Create features with categorical variables
df = pd.DataFrame(index=prices.index)
df['price'] = prices
df['return'] = returns

# Numerical features
for lag in [1, 5, 10, 20]:
    df[f'momentum_{lag}'] = prices.pct_change(lag)

for window in [5, 10, 20]:
    df[f'volatility_{window}'] = returns.rolling(window).std()

df['volume_ratio'] = volume / volume.rolling(20).mean()
df['ma_5_20'] = prices.rolling(5).mean() / prices.rolling(20).mean() - 1

# RSI (14-day, simple-moving-average variant rather than Wilder's smoothing)
delta = prices.diff()
gain = delta.where(delta > 0, 0).rolling(14).mean()
loss = (-delta.where(delta < 0, 0)).rolling(14).mean()
df['rsi'] = 100 - (100 / (1 + gain / loss))

# Categorical features
df['day_of_week'] = df.index.dayofweek.astype(str)  # Keep as string for CatBoost
df['month'] = df.index.month.astype(str)

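# Volatility regime (categorical)
# Note: qcut on the full sample sets the tercile boundaries using test-period
# data as well, which is a mild look-ahead; acceptable for a demo only.
vol_20 = returns.rolling(20).std()
vol_quantiles = pd.qcut(vol_20.dropna(), q=3, labels=['low', 'medium', 'high'])
df['vol_regime'] = vol_quantiles.astype(str)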

# Target
df['next_return'] = returns.shift(-1)
df['target'] = (df['next_return'] > 0).astype(int)

df = df.dropna()
print(f"📊 Samples: {len(df)}")
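
The `pd.qcut` call in the cell above derives the volatility terciles from the whole sample, so early rows are labeled with knowledge of later volatility. One way to avoid that, sketched here under stated assumptions, is to compare each day's volatility with expanding-window quantiles that use only data up to that day. The function name `causal_vol_regime`, the `min_periods=60` warm-up, and the synthetic series are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def causal_vol_regime(vol: pd.Series, min_periods: int = 60) -> pd.Series:
    """Label each day low/medium/high using only quantiles known up to that day."""
    lo = vol.expanding(min_periods=min_periods).quantile(1 / 3)
    hi = vol.expanding(min_periods=min_periods).quantile(2 / 3)
    regime = pd.Series("medium", index=vol.index, dtype=object)
    regime[vol <= lo] = "low"
    regime[vol > hi] = "high"
    # No label until the warm-up window is full or when volatility is missing.
    regime[lo.isna() | vol.isna()] = np.nan
    return regime


# Synthetic stand-in for returns.rolling(20).std()
rng = np.random.default_rng(0)
vol_demo = pd.Series(np.abs(rng.normal(0.02, 0.005, 300)))
regime_demo = causal_vol_regime(vol_demo)
print(regime_demo.value_counts())
```

A label computed this way for a given day does not change when more data arrives later, which is the property a backtest needs.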

In [None]:
# Prepare data
feature_cols = [c for c in df.columns if c not in ['price', 'return', 'next_return', 'target']]
categorical_cols = ['day_of_week', 'month', 'vol_regime']
numerical_cols = [c for c in feature_cols if c not in categorical_cols]

X = df[feature_cols]
y = df['target']

# Chronological 80/20 split (no shuffling, so the test set is strictly later)
split_idx = int(len(df) * 0.8)
X_train, X_test = X.iloc[:split_idx], X.iloc[split_idx:]
y_train, y_test = y.iloc[:split_idx], y.iloc[split_idx:]
returns_test = df['next_return'].iloc[split_idx:]

# Get categorical indices for CatBoost
cat_indices = [X.columns.get_loc(c) for c in categorical_cols]

print(f"Train: {len(X_train)}, Test: {len(X_test)}")
print(f"Categorical columns: {categorical_cols}")
print(f"Categorical indices: {cat_indices}")
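
A single 80/20 split gives one estimate of out-of-sample performance. `TimeSeriesSplit`, which is imported at the top of the notebook, produces several expanding-window folds instead. The sketch below only shows the index pattern on a ten-row stand-in for `X`; `n_splits=3` and `gap=1` are assumptions chosen for illustration, with the one-row gap acting as a conservative buffer because the target is the next day's return.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_samples = 10  # stand-in for len(X)
indices = np.arange(n_samples)

# Each fold trains on an expanding prefix and tests on the block that follows it.
splitter = TimeSeriesSplit(n_splits=3, gap=1)
folds = list(splitter.split(indices))

for fold, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"Fold {fold}: train {train_idx.min()}-{train_idx.max()}, "
          f"test {test_idx.min()}-{test_idx.max()}")
```

In the notebook, the same loop could wrap model fitting by indexing `X.iloc[train_idx]` and `X.iloc[test_idx]`, then averaging the fold metrics.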

In [None]:
# CatBoost with native categorical handling
train_pool = Pool(X_train, y_train, cat_features=cat_indices)
test_pool = Pool(X_test, y_test, cat_features=cat_indices)

cat = CatBoostClassifier(
    iterations=200,
    depth=5,
    learning_rate=0.1,
    cat_features=cat_indices,
    random_seed=42,
    verbose=False
)

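start = time.time()
# The test pool is passed only to monitor the metric. use_best_model=False keeps
# all 200 trees, so the test set does not choose the final model size.
cat.fit(train_pool, eval_set=test_pool, use_best_model=False)
cat_time = time.time() - start

print("\n" + "="*60)
print("CATBOOST TRAINING")
print("="*60)
print(f"Training time: {cat_time:.2f} seconds")
print(f"Best iteration on the eval pool (informational): {cat.best_iteration_}")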

In [None]:
# Compare all three boosting methods
# XGBoost and LightGBM receive integer-coded categoricals, which they treat as
# ordinal numbers; CatBoost keeps its native categorical handling.
X_encoded = X.copy()
for col in categorical_cols:
    X_encoded[col] = X_encoded[col].astype('category').cat.codes

X_train_enc = X_encoded.iloc[:split_idx]
X_test_enc = X_encoded.iloc[split_idx:]

# XGBoost
xgb = XGBClassifier(n_estimators=200, max_depth=5, learning_rate=0.1,
                   random_state=42, eval_metric='logloss')
start = time.time()
xgb.fit(X_train_enc, y_train)
xgb_time = time.time() - start

# LightGBM
lgbm = lgb.LGBMClassifier(n_estimators=200, max_depth=5, learning_rate=0.1,
                          random_state=42, verbosity=-1)
start = time.time()
lgbm.fit(X_train_enc, y_train)
lgbm_time = time.time() - start

print("\n" + "="*70)
print("BOOSTING METHOD COMPARISON")
print("="*70)

results = []
for name, model, test_data, train_time in [
    ('XGBoost', xgb, X_test_enc, xgb_time),
    ('LightGBM', lgbm, X_test_enc, lgbm_time),
    ('CatBoost', cat, X_test, cat_time)
]:
    y_pred = model.predict(test_data)
    y_proba = model.predict_proba(test_data)[:, 1]
    
    results.append({
        'Model': name,
        'Train Time': f'{train_time:.2f}s',
        'Accuracy': accuracy_score(y_test, y_pred),
        'F1': f1_score(y_test, y_pred),
        'AUC': roc_auc_score(y_test, y_proba)
    })

results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))
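
When reading the comparison table, it helps to know how much accuracy can move by chance on a test set of this size. The sketch below uses the binomial standard error of a proportion as a rough guide; the test-set size of 250 is a hypothetical round number, not a value taken from the notebook's output, and the approximation treats daily predictions as independent.

```python
import math


def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy estimate from n_samples predictions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


n_test = 250                      # hypothetical test-set size
se = accuracy_standard_error(0.5, n_test)
band = 1.96 * se                  # approximate 95% half-width around 50%
print(f"Standard error: {se:.4f}, 95% band: +/-{band:.4f}")
```

With a few hundred test days, accuracy differences of a few percentage points between models are within noise and should not be read as a ranking.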

In [None]:
# Feature Importance from CatBoost
importance = pd.DataFrame({
    'Feature': feature_cols,
    'Importance': cat.feature_importances_
}).sort_values('Importance', ascending=False)

plt.figure(figsize=(10, 8))
colors = ['coral' if f in categorical_cols else 'steelblue' for f in importance['Feature']]
plt.barh(importance['Feature'], importance['Importance'], color=colors)
plt.xlabel('Importance')
plt.title('CatBoost Feature Importance (Coral = Categorical)')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

In [None]:
# Trading Strategy Backtest
y_pred = cat.predict(X_test)
y_proba = cat.predict_proba(X_test)[:, 1]

backtest = pd.DataFrame(index=y_test.index)
backtest['actual_return'] = returns_test.values

# Binary strategy: long when the model predicts an up day, flat otherwise
backtest['signal'] = y_pred
backtest['strategy_return'] = backtest['signal'] * backtest['actual_return']
backtest['strategy_cum'] = (1 + backtest['strategy_return']).cumprod()

# Probability-weighted strategy: position size equals the predicted P(up)
backtest['prob_signal'] = y_proba
backtest['prob_return'] = backtest['prob_signal'] * backtest['actual_return']
backtest['prob_cum'] = (1 + backtest['prob_return']).cumprod()

# Benchmark: always fully invested (all three ignore transaction costs)
backtest['buy_hold_cum'] = (1 + backtest['actual_return']).cumprod()

# Plot
plt.figure(figsize=(14, 6))
plt.plot(backtest.index, backtest['buy_hold_cum'], label='Buy & Hold', linewidth=2)
plt.plot(backtest.index, backtest['strategy_cum'], label='CatBoost Binary', linewidth=2)
plt.plot(backtest.index, backtest['prob_cum'], label='CatBoost Probability', linewidth=2)
plt.title(f'CatBoost Trading Strategies ({ticker})', fontsize=14)
plt.xlabel('Date')
plt.ylabel('Cumulative Return')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
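
The backtest above charges nothing for trading, which flatters any strategy that switches position often. A minimal sketch of a cost adjustment follows, assuming a proportional cost per unit of turnover; the function name `net_returns`, the 0.1% cost, and the sample signals are hypothetical and written in plain Python rather than pandas to keep the arithmetic visible.

```python
def net_returns(signals, returns, cost_per_unit=0.001):
    """Strategy returns after charging a proportional cost on position changes."""
    net, previous = [], 0.0      # start flat, so the first entry also pays a cost
    for signal, ret in zip(signals, returns):
        turnover = abs(signal - previous)
        net.append(signal * ret - turnover * cost_per_unit)
        previous = signal
    return net


signals = [1, 1, 0, 1]                   # position held over each return
rets = [0.01, -0.02, 0.03, 0.01]
net = net_returns(signals, rets, cost_per_unit=0.001)
print([round(v, 4) for v in net])        # [0.009, -0.02, -0.001, 0.009]
```

The probability-weighted strategy changes its position size every day, so under this kind of adjustment it pays a small cost on every bar.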

In [None]:
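# Performance
def calc_metrics(returns, cumulative):
    """Return total return, annualized Sharpe ratio, and maximum drawdown."""
    total = cumulative.iloc[-1] - 1
    vol = returns.std()
    # Annualized excess return over annualized volatility; 0 if returns are constant
    sharpe = (returns.mean() * TRADING_DAYS - RISK_FREE_RATE) / (vol * np.sqrt(TRADING_DAYS)) if vol > 0 else 0
    peak = cumulative.cummax()
    mdd = ((cumulative - peak) / peak).min()  # most negative peak-to-trough decline
    return total, sharpe, mdd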

print("\n" + "="*60)
print("STRATEGY PERFORMANCE")
print("="*60)

strategies = [
    ('Buy & Hold', 'actual_return', 'buy_hold_cum'),
    ('CatBoost Binary', 'strategy_return', 'strategy_cum'),
    ('CatBoost Probability', 'prob_return', 'prob_cum')
]

print(f"\n{'Strategy':<25} {'Total Ret':>12} {'Sharpe':>10} {'Max DD':>10}")
print("-" * 60)

for name, ret_col, cum_col in strategies:
    total, sharpe, mdd = calc_metrics(backtest[ret_col], backtest[cum_col])
    print(f"{name:<25} {total:>12.2%} {sharpe:>10.2f} {mdd:>10.2%}")
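
The three metrics can be checked by hand on a short series. The following worked example mirrors the formulas in `calc_metrics` using only the standard library; the three daily returns are made-up numbers chosen so the arithmetic is easy to follow.

```python
import math
import statistics

TRADING_DAYS = 252
RISK_FREE_RATE = 0.05

daily_returns = [0.10, -0.10, 0.05]   # hypothetical daily strategy returns

# Cumulative growth of 1 unit: 1.10, 0.99, 1.0395
cumulative, level = [], 1.0
for r in daily_returns:
    level *= 1.0 + r
    cumulative.append(level)

total_return = cumulative[-1] - 1.0

# Maximum drawdown: worst decline from the running peak (1.10 -> 0.99 is -10%)
peak, max_drawdown = cumulative[0], 0.0
for value in cumulative:
    peak = max(peak, value)
    max_drawdown = min(max_drawdown, (value - peak) / peak)

# Annualized Sharpe ratio with the sample standard deviation (as pandas uses)
mean_r = statistics.mean(daily_returns)
std_r = statistics.stdev(daily_returns)
sharpe = (mean_r * TRADING_DAYS - RISK_FREE_RATE) / (std_r * math.sqrt(TRADING_DAYS))

print(f"Total: {total_return:.2%}, Max DD: {max_drawdown:.2%}, Sharpe: {sharpe:.2f}")
```

Annualizing three observations is not meaningful in itself; the point is only to show which quantities enter each formula.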

In [None]:
# Next day prediction
# Note: df.dropna() removed the most recent bar (its next_return is unknown),
# so the last row of X predicts the session after df.index[-1], which is
# already in the downloaded data rather than a true future day.
latest = X.iloc[-1:]
pred = cat.predict(latest)[0]
proba = cat.predict_proba(latest)[0]

print("\n" + "="*60)
print(f"📊 NEXT DAY PREDICTION FOR {ticker}")
print("="*60)
print(f"\nFeatures as of: {df.index[-1].strftime('%Y-%m-%d')}")
print("\nCategorical Context:")
for col in categorical_cols:
    print(f"  {col}: {latest[col].values[0]}")
print(f"\nPrediction: {'📈 UP' if pred == 1 else '📉 DOWN'}")
print(f"Probability (Down/Up): [{proba[0]:.2%}, {proba[1]:.2%}]")

---

## 🏢 Real-World Applications

| Company / Sector | CatBoost Use Case |
|------------------|-------------------|
| Yandex | Web search ranking |
| Cloudflare | Security classification |
| Finance | Models with many categorical features |
| Retail | Customer segmentation |

### Key Interview Points
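1. **Why CatBoost for categoricals?** - Ordered target statistics encode each row from earlier rows only, so a row's own label never leaks into its feature
2. **Ordered boosting?** - Each sample's residual comes from a model that has not seen that sample, which reduces prediction shift and overfitting
3. **Symmetric trees?** - One split per level shared by all nodes, giving fast inference and built-in regularization
4. **When to use?** - Many categorical features, or when robust results with little tuning are needed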

---
## 📅 Tomorrow: Ensemble Stacking