# Decision Tree Classifier

This notebook trains a decision tree classifier (CART), following the Chapter 11 material on decision trees and random forests, and backtests it on this repo's feature-extracted dataset.

It follows the same conventions as the linear-model notebooks:
- Load `dataset/features/all_features.parquet` (fallback: `all_features.csv`)
- Build a forward label per asset: the next-day return, y_t = r_{t+1}
- Split by asset (asset-first split) into train/val/test universes
- Train an sklearn `DecisionTreeClassifier` on the numeric features (with median imputation)
- Convert model outputs into a weekly-rebalanced Top-K long-only weights matrix
- Run the regular backtester (`src/backtester/engine.run_backtest`) and display the existing Bokeh dashboard (`src/backtester/bokeh_plots.build_interactive_portfolio_layout`)


## Decision Trees (CART) - Math

A decision tree recursively partitions the feature space. At a node holding the sample set $S = \{(x_j, y_j)\}_{j=1}^n$, we consider candidate splits of the form

$$x_k \le \tau$$

for feature $k$ and threshold $\tau$, producing the left subset $L$ (samples that satisfy the condition) and the right subset $R$ (the rest).

### Regression tree
For regression, a common impurity is the mean squared error (MSE):

$$\text{MSE}(S) = \frac{1}{|S|}\sum_{j\in S}(y_j - \bar y_S)^2.$$

We choose $(k, \tau)$ to minimize the weighted impurity of the two children:

$$\frac{|L|}{|S|}\text{MSE}(L) + \frac{|R|}{|S|}\text{MSE}(R).$$
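
As a rough illustration of the criterion above, the sketch below searches a single feature for the threshold that minimizes the weighted MSE of the two children. The helper names `mse` and `best_split_mse` are hypothetical, and the midpoint rule for candidate thresholds is an assumption made for the example; sklearn's actual splitter is considerably more optimized.

```python
import numpy as np

def mse(y: np.ndarray) -> float:
    """Impurity of a node: mean squared deviation of y from its own mean."""
    return float(np.mean((y - y.mean()) ** 2)) if y.size else 0.0

def best_split_mse(x: np.ndarray, y: np.ndarray):
    """Return (tau, weighted_mse) for the best split x <= tau on one feature."""
    best_tau, best_cost = None, np.inf
    values = np.unique(x)  # sorted distinct feature values
    # Assumption: candidate thresholds are midpoints between consecutive values.
    for tau in (values[:-1] + values[1:]) / 2.0:
        left, right = y[x <= tau], y[x > tau]
        cost = (left.size * mse(left) + right.size * mse(right)) / y.size
        if cost < best_cost:
            best_tau, best_cost = float(tau), float(cost)
    return best_tau, best_cost

# Toy data: the target jumps from 0 to 1 between x = 3 and x = 10.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
tau, cost = best_split_mse(x, y)
print(tau, cost)
```

On this toy data the search settles on the midpoint between 3 and 10, where both children are pure and the weighted MSE is zero.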

### Classification tree
For binary classification, let $p$ be the fraction of class-1 samples at a node. The Gini impurity is:

$$G(p)=2p(1-p)=1 - (p^2 + (1-p)^2).$$

(Entropy is another option: $H(p) = -p\log p - (1-p)\log(1-p)$.)

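We choose the split minimizing the weighted impurity of the children, $\frac{|L|}{|S|}G(p_L) + \frac{|R|}{|S|}G(p_R)$, where $p_L$ and $p_R$ are the class-1 fractions in $L$ and $R$.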
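
To make the Gini criterion concrete, here is a small sketch that scores two candidate thresholds on a toy binary problem. The helpers `gini` and `weighted_gini` are hypothetical names introduced for this example, not part of sklearn.

```python
import numpy as np

def gini(y: np.ndarray) -> float:
    """Binary Gini impurity 2p(1-p), where p is the fraction of class 1."""
    if y.size == 0:
        return 0.0
    p = float(np.mean(y))
    return 2.0 * p * (1.0 - p)

def weighted_gini(x: np.ndarray, y: np.ndarray, tau: float) -> float:
    """Size-weighted Gini impurity of the children of the split x <= tau."""
    left, right = y[x <= tau], y[x > tau]
    return (left.size * gini(left) + right.size * gini(right)) / y.size

x = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
y = np.array([0, 0, 1, 0, 1, 1])

parent = gini(y)                      # balanced classes, so 2 * 0.5 * 0.5
split_a = weighted_gini(x, y, 0.25)   # left child is pure
split_b = weighted_gini(x, y, 0.35)   # both children remain mixed
print(parent, split_a, split_b)
```

The threshold 0.25 isolates a pure left child and therefore scores lower (better) than 0.35, so a greedy tree would prefer it among these two candidates.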

### Regularization
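To control variance and overfitting, we constrain how far the tree can grow:
- `max_depth`: the maximum number of split levels between the root and any leaf
- `min_samples_leaf`: the minimum number of training samples each leaf must contain
- `min_samples_split`: the minimum number of samples a node needs before it may be split

This notebook sets `max_depth=6` and `min_samples_leaf=50`, and leaves `min_samples_split` at the sklearn default.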
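
The effect of these constraints can be seen on pure-noise data, where any structure a tree finds is overfitting by construction. The sketch below is illustrative only: the synthetic data and the specific values `max_depth=3` and `min_samples_leaf=50` are assumptions for the example, not the settings used later in this notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = rng.integers(0, 2, size=500)  # labels are independent of X (pure noise)

# Unconstrained tree: keeps splitting until every leaf is pure.
free = DecisionTreeClassifier(random_state=0).fit(X, y)

# Constrained tree: shallow, with large leaves.
constrained = DecisionTreeClassifier(
    random_state=0, max_depth=3, min_samples_leaf=50
).fit(X, y)

for name, tree in [("unconstrained", free), ("constrained", constrained)]:
    print(name, "depth:", tree.get_depth(), "leaves:", tree.get_n_leaves(),
          "train accuracy:", round(tree.score(X, y), 3))
```

The unconstrained tree memorizes the noise with many small leaves, while the constrained tree is limited to at most eight leaves of at least 50 samples each.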


## Data, Label, and Trading Signal

We work with a panel dataset indexed by date and containing an `Asset_ID` column.

### Forward label
Let r_t be the known 1-day return for day t (feature `ret_1d`). We define the forward target per asset:

$$y_t = r_{t+1}$$

implemented via a group-wise shift within each `Asset_ID` (the last row of each asset has no next-day return and is dropped):

$$y_t = \text{shift}_{-1}(r_t)$$
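
A minimal sketch of the group-wise shift on a toy two-asset panel follows. The asset names, dates, and return values are made up for illustration; the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Toy panel: two assets, three dates, indexed by date (dates repeat per asset).
panel = pd.DataFrame(
    {
        "Asset_ID": ["A", "B", "A", "B", "A", "B"],
        "ret_1d": [0.01, -0.02, 0.03, 0.04, -0.05, 0.06],
    },
    index=pd.to_datetime(
        ["2024-01-01", "2024-01-01", "2024-01-02",
         "2024-01-02", "2024-01-03", "2024-01-03"]
    ),
)

# Shift within each asset so row t carries that asset's return at t+1.
panel["y_ret_1d_fwd"] = panel.groupby("Asset_ID", sort=False)["ret_1d"].shift(-1)

# The last date of each asset has no next-day return, so it is dropped.
labeled = panel.dropna(subset=["y_ret_1d_fwd"])
print(labeled)
```

Grouping before shifting matters: a plain `shift(-1)` on the interleaved panel would pair one asset's features with another asset's return.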

### From predictions to portfolio weights (Top-K)
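On each rebalance date t (weekly), the model produces a score s_{i,t} for each asset i. In this notebook the score is the predicted probability of an up day minus 0.5, so a positive score means the model considers a rise more likely than a fall.
We select the set of long candidates: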

$$\mathcal{L}_t = \{ i : s_{i,t} > 0 \}\ \cap\ \text{TopK}(s_{\cdot,t}).$$

We then form equal-weight long-only weights:

$$w_{i,t} =
\begin{cases}
\frac{1}{|\mathcal{L}_t|}, & i\in\mathcal{L}_t \\
0, & \text{otherwise}
\end{cases}$$

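Weights are held constant between rebalance dates. If no asset has a positive score on a rebalance date, the portfolio holds no positions until the next rebalance.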
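
The selection rule for a single rebalance date can be sketched as below. The function name `top_k_long_only` and the example scores are hypothetical; the notebook's own implementation further down uses the repo's `equal_weight` helper instead.

```python
import pandas as pd

def top_k_long_only(scores: pd.Series, k: int) -> pd.Series:
    """Equal-weight the assets that are both in the top k and have a positive score."""
    ranked = scores.dropna().sort_values(ascending=False).head(k)
    longs = ranked[ranked > 0].index
    weights = pd.Series(0.0, index=scores.index)
    if len(longs) > 0:
        weights.loc[longs] = 1.0 / len(longs)
    return weights  # all zeros when no asset qualifies

# Scores for one rebalance date; E has no prediction that day.
scores = pd.Series({"A": 0.30, "B": -0.10, "C": 0.05, "D": 0.20, "E": float("nan")})
w = top_k_long_only(scores, k=4)
print(w)
```

With `k=4`, asset B makes the top four by rank but is excluded by the positivity filter, so the three remaining assets each receive one third of the portfolio.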


In [1]:
from __future__ import annotations

from pathlib import Path
import sys

import numpy as np
import pandas as pd

from bokeh.io import output_notebook, show

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, roc_auc_score


In [2]:
# Resolve project root robustly by walking parents.
# We require BOTH `dataset/` and `src/` to exist at the same level.
CWD = Path.cwd().resolve()
PROJECT_ROOT = None
for p in [CWD, *CWD.parents]:
    if (p / 'dataset').exists() and (p / 'src').exists():
        PROJECT_ROOT = p
        break

if PROJECT_ROOT is None:
    raise RuntimeError(f'Could not locate project root from CWD={CWD}')

if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

print('CWD:', CWD)
print('PROJECT_ROOT:', PROJECT_ROOT)


CWD: /home/anivarth/college/quant-task/notebooks/random forest models
PROJECT_ROOT: /home/anivarth/college/quant-task


In [3]:
SEED = 42
rng = np.random.default_rng(SEED)

FEATURES_PARQUET_PATH = PROJECT_ROOT / 'dataset' / 'features' / 'all_features.parquet'
FEATURES_CSV_PATH = PROJECT_ROOT / 'dataset' / 'features' / 'all_features.csv'

TARGET_COL = 'ret_1d'
TARGET_FWD_COL = 'y_ret_1d_fwd'

# Asset split counts (asset-first split policy)
N_TRAIN_ASSETS = 75
N_VAL_ASSETS = 15
N_TEST_ASSETS = None  # None: use every asset left over after train and val

# Backtest knobs
REBALANCE_FREQ = 'W'
TOP_K = 20
INITIAL_EQUITY = 1_000_000.0
TXN_COST_BPS = 5.0

def load_feature_dataset() -> pd.DataFrame:
    if FEATURES_PARQUET_PATH.exists():
        df = pd.read_parquet(FEATURES_PARQUET_PATH)
        if 'Date' in df.columns:
            df['Date'] = pd.to_datetime(df['Date'])
            df = df.set_index('Date')
    elif FEATURES_CSV_PATH.exists():
        df = pd.read_csv(FEATURES_CSV_PATH, parse_dates=['Date']).set_index('Date')
    else:
        raise FileNotFoundError('Feature dataset not found under dataset/features/.')

    required_cols = {'Asset_ID', TARGET_COL}
    missing = required_cols - set(df.columns)
    if missing:
        raise ValueError(f'Missing required columns: {sorted(missing)}')

    df = df.sort_index()
    return df

def add_ch11_compat_features(df: pd.DataFrame) -> pd.DataFrame:
    # Add a few Chapter-11-style derived features if they are missing.
    # Chapter 11 examples include RSI, MACD, Bollinger-derived distances, dollar volume.
    # Our feature dataset already exports RSI/MACD/BB bands; we add light aliases/distances.
    out = df.copy()

    # Dollar volume (daily)
    if 'dollar_volume' not in out.columns and {'close', 'volume'} <= set(out.columns):
        out['dollar_volume'] = out['close'].astype(float) * out['volume'].astype(float)

    # Monthly return alias (21 trading days ~ 1 month)
    if 'return_1m' not in out.columns and 'ret_21d' in out.columns:
        out['return_1m'] = out['ret_21d']

    # Bollinger distances (Chapter-11-style bb_up/bb_down)
    if 'bb_up' not in out.columns and {'bb_bb_upper', 'close'} <= set(out.columns):
        out['bb_up'] = out['bb_bb_upper'].astype(float) - out['close'].astype(float)
    if 'bb_down' not in out.columns and {'bb_bb_lower', 'close'} <= set(out.columns):
        out['bb_down'] = out['close'].astype(float) - out['bb_bb_lower'].astype(float)

    # MACD z-score per asset. Note: mean/std are computed over each asset's
    # full history, so this feature uses information from later dates.
    if 'macd_z' not in out.columns and 'macd_macd' in out.columns:
        def _z(s: pd.Series) -> pd.Series:
            mu = float(s.mean())
            sig = float(s.std(ddof=0))
            if not np.isfinite(sig) or sig == 0:
                return pd.Series(0.0, index=s.index)
            return (s - mu) / sig
        out['macd_z'] = out.groupby('Asset_ID', sort=False)['macd_macd'].transform(_z)

    return out

# Load and enrich
raw = load_feature_dataset()
df = add_ch11_compat_features(raw)

# Forward label per asset
_df = df.copy()
_df[TARGET_FWD_COL] = _df.groupby('Asset_ID', sort=False)[TARGET_COL].shift(-1)
_df = _df.dropna(subset=[TARGET_FWD_COL])

print('shape:', _df.shape)
print('assets:', _df['Asset_ID'].nunique())
display(_df.head(3))


shape: (251000, 128)
assets: 100


Date,ret_1d,logret_1d,excess_ret_1d,logret_lag_1,logret_lag_5,ret_lag_1,ret_lag_5,ret_5d,ret_21d,logret_5d,...,filt_logret_lattice_demo,close,volume,Asset_ID,dollar_volume,return_1m,bb_up,bb_down,macd_z,y_ret_1d_fwd
2016-01-25,,,,,,,,,,,...,,28.580592,249449990.0,Asset_001,7129428000.0,,,,-0.283463,0.005531
2016-01-25,,,,,,,,,,,...,,8.489112,148653644.0,Asset_028,1261937000.0,,,,-0.207834,0.027007
2016-01-25,,,,,,,,,,,...,,23.826383,65904403.0,Asset_029,1570264000.0,,,,-0.252877,0.023737


In [4]:
# Asset-based split (deterministic)
assets = sorted(_df['Asset_ID'].unique())
if N_TRAIN_ASSETS + N_VAL_ASSETS > len(assets):
    raise ValueError(f'Not enough assets: have {len(assets)} but requested train+val={N_TRAIN_ASSETS+N_VAL_ASSETS}.')

n_test = (len(assets) - N_TRAIN_ASSETS - N_VAL_ASSETS) if (N_TEST_ASSETS is None) else N_TEST_ASSETS
if N_TRAIN_ASSETS + N_VAL_ASSETS + n_test > len(assets):
    raise ValueError('Requested split exceeds available assets.')

train_assets = assets[:N_TRAIN_ASSETS]
val_assets = assets[N_TRAIN_ASSETS:N_TRAIN_ASSETS + N_VAL_ASSETS]
test_assets = assets[N_TRAIN_ASSETS + N_VAL_ASSETS:N_TRAIN_ASSETS + N_VAL_ASSETS + n_test]

print('n_assets:', len(assets))
print('train:', len(train_assets), train_assets[:3], '...', train_assets[-3:])
print('val:', len(val_assets), val_assets[:3], '...', val_assets[-3:])
print('test:', len(test_assets), test_assets[:3], '...', test_assets[-3:])

df_train = _df[_df['Asset_ID'].isin(train_assets)].copy()
df_val = _df[_df['Asset_ID'].isin(val_assets)].copy()
df_test = _df[_df['Asset_ID'].isin(test_assets)].copy()

print('rows train/val/test:', df_train.shape[0], df_val.shape[0], df_test.shape[0])


n_assets: 100
train: 75 ['Asset_001', 'Asset_002', 'Asset_003'] ... ['Asset_073', 'Asset_074', 'Asset_075']
val: 15 ['Asset_076', 'Asset_077', 'Asset_078'] ... ['Asset_088', 'Asset_089', 'Asset_090']
test: 10 ['Asset_091', 'Asset_092', 'Asset_093'] ... ['Asset_098', 'Asset_099', 'Asset_100']
rows train/val/test: 188250 37650 25100


In [5]:
# Feature matrix definition: all numeric columns except identifiers/labels.
exclude_cols = {'Asset_ID', TARGET_FWD_COL}
feature_cols = [c for c in _df.columns if c not in exclude_cols]

numeric_feature_cols = [c for c in feature_cols if pd.api.types.is_numeric_dtype(_df[c])]

print('n_features_numeric:', len(numeric_feature_cols))
print('example features:', numeric_feature_cols[:12])

def to_xy(d: pd.DataFrame):
    X = d.loc[:, numeric_feature_cols].replace([np.inf, -np.inf], np.nan)
    y = d.loc[:, TARGET_FWD_COL].astype(float)
    return X, y

X_train, y_train = to_xy(df_train)
X_val, y_val = to_xy(df_val)
X_test, y_test = to_xy(df_test)

print('X_train:', X_train.shape, 'y_train:', y_train.shape)
print('X_val:', X_val.shape, 'y_val:', y_val.shape)
print('X_test:', X_test.shape, 'y_test:', y_test.shape)


n_features_numeric: 126
example features: ['ret_1d', 'logret_1d', 'excess_ret_1d', 'logret_lag_1', 'logret_lag_5', 'ret_lag_1', 'ret_lag_5', 'ret_5d', 'ret_21d', 'logret_5d', 'logret_21d', 'cumret_5d']
X_train: (188250, 126) y_train: (188250,)
X_val: (37650, 126) y_val: (37650,)
X_test: (25100, 126) y_test: (25100,)


In [6]:
from sklearn.tree import DecisionTreeClassifier

MODEL_TITLE = 'Decision Tree Classifier (CART) - Next-Day Direction Prediction'

# Binary target: up/down next day

def to_xy_clf(d: pd.DataFrame):
    X = d.loc[:, numeric_feature_cols].replace([np.inf, -np.inf], np.nan)
    y = (d.loc[:, TARGET_FWD_COL].astype(float) > 0.0).astype(int)
    return X, y

X_train_c, y_train_c = to_xy_clf(df_train)
X_val_c, y_val_c = to_xy_clf(df_val)
X_test_c, y_test_c = to_xy_clf(df_test)

clf = DecisionTreeClassifier(
    random_state=SEED,
    max_depth=6,
    min_samples_leaf=50,
)

pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('model', clf),
])

pipe.fit(X_train_c, y_train_c)

proba_val = pipe.predict_proba(X_val_c)[:, 1]
auc_val = roc_auc_score(y_val_c, proba_val)
print('val_auc:', float(auc_val))


val_auc: 0.5052499019009457


In [7]:
# Predict P(up) on the held-out test assets and center it at 0.5.
# The score lies in [-0.5, +0.5]; a positive score means "up" is the more likely class.
proba = pipe.predict_proba(X_test_c)[:, 1]
score = proba - 0.5
long = pd.DataFrame({'Date': df_test.index, 'Asset_ID': df_test['Asset_ID'].to_numpy(), 'score': score})
pred_matrix = long.pivot_table(index='Date', columns='Asset_ID', values='score', aggfunc='mean').sort_index()
print('pred_matrix:', pred_matrix.shape)
display(pred_matrix.iloc[:3, :5])


pred_matrix: (2510, 10)


Date,Asset_091,Asset_092,Asset_093,Asset_094,Asset_095
2016-01-25,0.295918,0.295918,0.295918,0.295918,0.295918
2016-01-26,0.003561,-0.280488,0.003561,0.003561,0.003561
2016-01-27,0.003561,0.032519,0.295918,0.003561,0.295918


In [8]:
from src.backtester.data import load_cleaned_assets, align_close_prices
from src.backtester.engine import BacktestConfig, run_backtest
from src.backtester.report import compute_backtest_report
from src.backtester.bokeh_plots import build_interactive_portfolio_layout
from src.backtester.portfolio import equal_weight

# Load OHLCV for held-out test asset universe
bt_assets = sorted(test_assets)
assets_ohlcv = load_cleaned_assets(symbols=bt_assets, cleaned_dir=str(PROJECT_ROOT / 'dataset' / 'cleaned'))
close_prices = align_close_prices(assets_ohlcv)

# Market proxy OHLCV for the Bokeh dashboard
market_df = pd.DataFrame({
    'Open': pd.concat([d['Open'] for d in assets_ohlcv.values()], axis=1).mean(axis=1),
    'High': pd.concat([d['High'] for d in assets_ohlcv.values()], axis=1).mean(axis=1),
    'Low': pd.concat([d['Low'] for d in assets_ohlcv.values()], axis=1).mean(axis=1),
    'Close': pd.concat([d['Close'] for d in assets_ohlcv.values()], axis=1).mean(axis=1),
    'Volume': pd.concat([d['Volume'] for d in assets_ohlcv.values()], axis=1).sum(axis=1),
}).sort_index()

# Align prediction matrix to price calendar
pred_matrix = pred_matrix.reindex(close_prices.index)

output_notebook()

# Build weekly-rebalanced Top-K equal-weight weights from scores
rebal_dates = set(pd.Series(pred_matrix.index, index=pred_matrix.index).resample(REBALANCE_FREQ).last().dropna().tolist())

w_last = pd.Series(0.0, index=bt_assets)
w_rows = []
for dt in pred_matrix.index:
    if dt in rebal_dates:
        row = pred_matrix.loc[dt].dropna().sort_values(ascending=False)
        top = row.head(min(TOP_K, len(row)))
        candidates = [a for a, v in top.items() if np.isfinite(v) and float(v) > 0.0]
        if len(candidates) == 0:
            w_last = pd.Series(0.0, index=bt_assets)
        else:
            w_dict = equal_weight(candidates)
            w_last = pd.Series(0.0, index=bt_assets)
            for a, w in w_dict.items():
                if a in w_last.index:
                    w_last[a] = float(w)
    w_rows.append(w_last)

weights = pd.DataFrame(w_rows, index=pred_matrix.index, columns=bt_assets).fillna(0.0)

cfg = BacktestConfig(initial_equity=INITIAL_EQUITY, transaction_cost_bps=TXN_COST_BPS, mode='vectorized')
res = run_backtest(close_prices, weights, config=cfg)
report = compute_backtest_report(result=res, close_prices=close_prices)
display(report.to_frame('Backtest Report'))

layout = build_interactive_portfolio_layout(
    market_ohlcv=market_df,
    equity=res.equity,
    returns=res.returns,
    weights=res.weights,
    turnover=res.turnover,
    costs=res.costs,
    close_prices=close_prices,
    title=MODEL_TITLE,
)
show(layout)


Metric,Backtest Report
Start,2016-01-25 00:00:00
End,2026-01-16 00:00:00
Duration,3644 days 00:00:00
Initial Equity,1000000.0
Final Equity,5145366.45647
Equity Peak,5145366.45647
Total Return [%],414.536646
CAGR [%],17.875911
Volatility (ann) [%],17.692325
Sharpe,1.018193
