# CatBoost (CatBoostClassifier)

These notebooks implement Chapter 12 (Gradient Boosting Machines) models on this repo's **feature-extracted dataset**.

Shared pipeline (kept consistent across models):
- Load `dataset/features/all_features.parquet` (fallback: `all_features.csv`)
- Forward label per asset: `y_ret_1d_fwd = shift(-1) of ret_1d`
- Classification target: `y_up_fwd = 1[y_ret_1d_fwd > 0]`
- **Time-wise split**: train = first 7 years, val = middle, test = last 18 months
- Train a boosting model on numeric features (median imputation)
- Convert predicted probabilities into weekly-rebalanced Top-K long-only portfolio weights
- Run `src/backtester.engine.run_backtest` + render `src/backtester.bokeh_plots.build_interactive_portfolio_layout`

Note: `xgboost`, `lightgbm`, and `catboost` are not installed in the current environment; the corresponding notebooks raise an error with an explicit install hint when the library is missing.


## CatBoost - Ordered Boosting (Key Ideas)

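CatBoost is designed to handle categorical features natively and to reduce target leakage during training.
A key idea is **ordered boosting**: the statistics used for encoding and learning are computed from a random permutation of the training data, so that each sample only sees samples that precede it in that permutation. "Future information" here refers to the permutation order, not to calendar time.

Our model uses numeric features only, so no categorical encoding is involved; we use CatBoost simply as a gradient-boosted tree model.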
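
The permutation idea can be illustrated with a simplified ordered target statistic for a single categorical feature. This is a sketch of the concept only, not CatBoost's internal implementation; the function name `ordered_target_stats` and the `prior` / `prior_weight` parameters are assumptions made for this example.

```python
import numpy as np

def ordered_target_stats(categories, targets, prior=0.5, prior_weight=1.0, seed=0):
    """Encode each sample from the targets of samples that precede it in a random permutation."""
    categories = np.asarray(categories)
    targets = np.asarray(targets, dtype=float)
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(categories))

    sums, counts = {}, {}
    encoded = np.empty(len(categories), dtype=float)
    for i in order:
        c = categories[i]
        s, n = sums.get(c, 0.0), counts.get(c, 0)
        # The sample's own target is not part of (s, n), so it cannot leak into its encoding.
        encoded[i] = (s + prior_weight * prior) / (n + prior_weight)
        # Only after encoding is the sample's target added to the running statistics.
        sums[c] = s + targets[i]
        counts[c] = n + 1
    return encoded, order

# Hypothetical toy data: one categorical feature and a binary target.
cats = ['a', 'b', 'a', 'a', 'b', 'c']
y = [1, 0, 1, 0, 1, 1]
enc, order = ordered_target_stats(cats, y)
```

The first sample of each category in the permutation receives the prior, and later samples receive a smoothed mean of the targets seen so far for that category.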


## Data, Label, and Portfolio Construction

### Forward label
For each asset, let $r_t$ denote the daily return `ret_1d` at time $t$. The forward return label is:

$$y_t = r_{t+1}.$$

In code: `groupby('Asset_ID')['ret_1d'].shift(-1)`. The last observation of each asset has no forward return and is dropped.

### Binary direction target

$$y^{\uparrow}_t = \mathbb{1}[y_t > 0].$$
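
Both label definitions can be checked on a toy panel. The two assets, the dates, and the return values below are made up for illustration; only the column names follow the notebook.

```python
import pandas as pd

# Toy panel: two hypothetical assets, three days each.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03'] * 2),
    'Asset_ID': ['A'] * 3 + ['B'] * 3,
    'ret_1d': [0.01, -0.02, 0.03, -0.01, 0.02, -0.03],
}).set_index('Date')

# Forward return: next day's return of the SAME asset (NaN on each asset's last day).
df['y_ret_1d_fwd'] = df.groupby('Asset_ID', sort=False)['ret_1d'].shift(-1)

# Drop unlabeled rows before binarizing, so NaN is never mapped to class 0.
labeled = df.dropna(subset=['y_ret_1d_fwd']).copy()
labeled['y_up_fwd'] = (labeled['y_ret_1d_fwd'] > 0).astype(int)
```

Grouping by `Asset_ID` before shifting matters: a plain `shift(-1)` on the stacked frame would assign asset B's first return to asset A's last row.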

### From predicted probability to a score
Let the model output be $p_{i,t} = P(y^{\uparrow}_{i,t}=1 \mid x_{i,t})$. We map it to a centered score, which is positive exactly when the model favors an up move:

$$s_{i,t} = p_{i,t} - 0.5.$$

### Weekly Top-K long-only portfolio
On each weekly rebalance date $t$, take the Top-K assets by score and keep only those with $s_{i,t} > 0$. Call the resulting set $\mathcal{L}_t$; its members receive equal weights:

$$w_{i,t} =
\begin{cases}
\frac{1}{|\mathcal{L}_t|}, & i\in\mathcal{L}_t \\
0, & \text{otherwise}
\end{cases}$$

Weights are held constant between rebalances. If no asset qualifies on a rebalance date, all weights are set to zero until the next rebalance.
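
The score and weighting rules for a single rebalance date can be sketched as follows. The helper name `top_k_long_only_weights` and the probabilities are hypothetical and chosen only to make the selection visible; the notebook's own cell uses `equal_weight` from the backtester instead.

```python
import pandas as pd

def top_k_long_only_weights(scores: pd.Series, k: int) -> pd.Series:
    """Equal-weight the Top-K assets by score, keeping only strictly positive scores."""
    ranked = scores.dropna().sort_values(ascending=False)
    selected = ranked.head(k)
    selected = selected[selected > 0.0]
    weights = pd.Series(0.0, index=scores.index)
    if len(selected) > 0:
        weights.loc[selected.index] = 1.0 / len(selected)
    return weights

# Hypothetical predicted probabilities on one rebalance date.
proba = pd.Series({'A': 0.62, 'B': 0.48, 'C': 0.55, 'D': 0.51, 'E': 0.70})
score = proba - 0.5  # centered score s = p - 0.5
w = top_k_long_only_weights(score, k=3)
```

With `k=3`, assets E, A, and C are selected and each receives one third of the portfolio. With `k=5`, asset B still receives zero weight because its score is negative.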


In [None]:
from __future__ import annotations

from pathlib import Path
import sys

import numpy as np
import pandas as pd

from bokeh.io import output_notebook, show

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score


In [None]:
# Resolve project root robustly by walking parents.
# We require BOTH `dataset/` and `src/` to exist at the same level.
CWD = Path.cwd().resolve()
PROJECT_ROOT = None
for p in [CWD, *CWD.parents]:
    if (p / 'dataset').exists() and (p / 'src').exists():
        PROJECT_ROOT = p
        break

if PROJECT_ROOT is None:
    raise RuntimeError(f'Could not locate project root from CWD={CWD}')

if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

print('CWD:', CWD)
print('PROJECT_ROOT:', PROJECT_ROOT)


In [None]:
SEED = 42
rng = np.random.default_rng(SEED)

FEATURES_PARQUET_PATH = PROJECT_ROOT / 'dataset' / 'features' / 'all_features.parquet'
FEATURES_CSV_PATH = PROJECT_ROOT / 'dataset' / 'features' / 'all_features.csv'

TARGET_COL = 'ret_1d'
TARGET_FWD_COL = 'y_ret_1d_fwd'
TARGET_DIR_COL = 'y_up_fwd'

# Backtest knobs
REBALANCE_FREQ = 'W'
TOP_K = 20
INITIAL_EQUITY = 1_000_000.0
TXN_COST_BPS = 5.0

# Time-wise split
TRAIN_YEARS = 7
TEST_MONTHS = 18

def load_feature_dataset() -> pd.DataFrame:
    if FEATURES_PARQUET_PATH.exists():
        df = pd.read_parquet(FEATURES_PARQUET_PATH)
        if 'Date' in df.columns:
            df['Date'] = pd.to_datetime(df['Date'])
            df = df.set_index('Date')
    elif FEATURES_CSV_PATH.exists():
        df = pd.read_csv(FEATURES_CSV_PATH, parse_dates=['Date']).set_index('Date')
    else:
        raise FileNotFoundError('Feature dataset not found under dataset/features/.')

    required = {'Asset_ID', TARGET_COL}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f'Missing required columns: {sorted(missing)}')

    return df.sort_index()

# Load
_df0 = load_feature_dataset().copy()

# Forward label per asset
_df0[TARGET_FWD_COL] = _df0.groupby('Asset_ID', sort=False)[TARGET_COL].shift(-1)
_df0 = _df0.dropna(subset=[TARGET_FWD_COL]).sort_index()

# Binary direction label
_df0[TARGET_DIR_COL] = (_df0[TARGET_FWD_COL].astype(float) > 0.0).astype(int)

# Time split
start = pd.Timestamp(_df0.index.min())
end = pd.Timestamp(_df0.index.max())
train_end = start + pd.DateOffset(years=TRAIN_YEARS)
test_start = end - pd.DateOffset(months=TEST_MONTHS)

if train_end >= test_start:
    raise ValueError(
        f'Not enough history for requested split: start={start.date()} train_end={train_end.date()} test_start={test_start.date()} end={end.date()}'
    )

train_mask = _df0.index < train_end
val_mask = (_df0.index >= train_end) & (_df0.index < test_start)
test_mask = _df0.index >= test_start

_df_train = _df0.loc[train_mask].copy()
_df_val = _df0.loc[val_mask].copy()
_df_test = _df0.loc[test_mask].copy()

print('date range:', start.date(), '->', end.date())
print('train:', _df_train.index.min().date(), '->', _df_train.index.max().date(), 'rows:', _df_train.shape[0])
print('val  :', _df_val.index.min().date(), '->', _df_val.index.max().date(), 'rows:', _df_val.shape[0])
print('test :', _df_test.index.min().date(), '->', _df_test.index.max().date(), 'rows:', _df_test.shape[0])
print('assets:', _df0['Asset_ID'].nunique())

# Features: all numeric except identifiers/labels
exclude = {'Asset_ID', TARGET_FWD_COL, TARGET_DIR_COL}
feature_cols = [c for c in _df0.columns if c not in exclude]
numeric_feature_cols = [c for c in feature_cols if pd.api.types.is_numeric_dtype(_df0[c])]

print('n_features_numeric:', len(numeric_feature_cols))
print('example features:', numeric_feature_cols[:12])

def to_xy(d: pd.DataFrame):
    X = d.loc[:, numeric_feature_cols].replace([np.inf, -np.inf], np.nan)
    y = d.loc[:, TARGET_DIR_COL].astype(int)
    return X, y

X_train, y_train = to_xy(_df_train)
X_val, y_val = to_xy(_df_val)
X_test, y_test = to_xy(_df_test)

print('X_train:', X_train.shape, 'y_train:', y_train.shape)
print('X_val  :', X_val.shape, 'y_val  :', y_val.shape)
print('X_test :', X_test.shape, 'y_test :', y_test.shape)


In [None]:
MODEL_TITLE = 'CatBoost (CatBoostClassifier) (Time Split) - Weekly Top-K Long-Only'

try:
    from catboost import CatBoostClassifier
except ModuleNotFoundError as e:
    raise ModuleNotFoundError(
        'catboost is not installed. Install it (example):\n\n'
        '  pip install catboost\n'
    ) from e

cat = CatBoostClassifier(
    iterations=2000,
    learning_rate=0.03,
    depth=6,
    loss_function='Logloss',
    random_seed=SEED,
    verbose=False,
)

pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('model', cat),
])

pipe.fit(X_train, y_train)

if y_val.nunique() > 1:
    auc = float(roc_auc_score(y_val, pipe.predict_proba(X_val)[:, 1]))
    print('val_auc:', auc)
else:
    print('val_auc: n/a (single class in validation)')


In [None]:
# Convert test predictions into a prediction matrix and run the regular backtest.

from src.backtester.data import load_cleaned_assets, align_close_prices
from src.backtester.engine import BacktestConfig, run_backtest
from src.backtester.report import compute_backtest_report
from src.backtester.bokeh_plots import build_interactive_portfolio_layout
from src.backtester.portfolio import equal_weight

# Score = P(up) - 0.5
proba_test = pipe.predict_proba(X_test)[:, 1]
score = proba_test - 0.5

long = pd.DataFrame({'Date': _df_test.index, 'Asset_ID': _df_test['Asset_ID'].to_numpy(), 'score': score})
pred_matrix = long.pivot_table(index='Date', columns='Asset_ID', values='score', aggfunc='mean').sort_index()

# Backtest assets = full universe present in the feature dataset
bt_assets = sorted(_df0['Asset_ID'].unique().tolist())

# Load OHLCV and slice to the test window so the reported Start is the first test date, not the start of the full price history (2016)
assets_ohlcv = load_cleaned_assets(symbols=bt_assets, cleaned_dir=str(PROJECT_ROOT / 'dataset' / 'cleaned'))
close_prices = align_close_prices(assets_ohlcv).sort_index()

bt_start = pd.Timestamp(_df_test.index.min())
bt_end = pd.Timestamp(_df_test.index.max())
close_prices = close_prices.loc[bt_start:bt_end]

# Market proxy OHLCV for the Bokeh dashboard (same slice)
assets_ohlcv_slice = {k: v.loc[bt_start:bt_end] for k, v in assets_ohlcv.items()}
market_df = pd.DataFrame({
    'Open': pd.concat([d['Open'] for d in assets_ohlcv_slice.values()], axis=1).mean(axis=1),
    'High': pd.concat([d['High'] for d in assets_ohlcv_slice.values()], axis=1).mean(axis=1),
    'Low': pd.concat([d['Low'] for d in assets_ohlcv_slice.values()], axis=1).mean(axis=1),
    'Close': pd.concat([d['Close'] for d in assets_ohlcv_slice.values()], axis=1).mean(axis=1),
    'Volume': pd.concat([d['Volume'] for d in assets_ohlcv_slice.values()], axis=1).sum(axis=1),
}).sort_index()

# Align prediction matrix to backtest calendar
pred_matrix = pred_matrix.reindex(close_prices.index)

output_notebook()

# Weekly Top-K equal-weight weights
rebal_dates = set(pd.Series(pred_matrix.index, index=pred_matrix.index).resample(REBALANCE_FREQ).last().dropna().tolist())

w_last = pd.Series(0.0, index=bt_assets)
w_rows = []
for dt in pred_matrix.index:
    if dt in rebal_dates:
        row = pred_matrix.loc[dt].dropna().sort_values(ascending=False)
        top = row.head(min(TOP_K, len(row)))
        candidates = [a for a, v in top.items() if np.isfinite(v) and float(v) > 0.0]
        if len(candidates) == 0:
            w_last = pd.Series(0.0, index=bt_assets)
        else:
            w_dict = equal_weight(candidates)
            w_last = pd.Series(0.0, index=bt_assets)
            for a, w in w_dict.items():
                if a in w_last.index:
                    w_last[a] = float(w)
    w_rows.append(w_last)

weights = pd.DataFrame(w_rows, index=pred_matrix.index, columns=bt_assets).fillna(0.0)

cfg = BacktestConfig(initial_equity=INITIAL_EQUITY, transaction_cost_bps=TXN_COST_BPS, mode='vectorized')
res = run_backtest(close_prices, weights, config=cfg)
report = compute_backtest_report(result=res, close_prices=close_prices)
display(report.to_frame(MODEL_TITLE))

layout = build_interactive_portfolio_layout(
    market_ohlcv=market_df,
    equity=res.equity,
    returns=res.returns,
    weights=res.weights,
    turnover=res.turnover,
    costs=res.costs,
    close_prices=close_prices,
    title=MODEL_TITLE,
)
show(layout)
