# ML Model 05: Linear Models on BOS/CHOCH/Trend (Time Split)

This notebook trains three linear regression models (OLS, Ridge, Lasso) using a **restricted feature set** built from:

- **BOS** (Break of Structure)
- **CHOCH** (Change of Character)
- **Uptrend / Downtrend** derived from consecutive swing highs/lows
- **OHLC + Volume features** (basic price/volume indicators)

## Definitions (implemented here)
We compute swing highs/lows using a centered rolling window of `2w + 1` bars on High/Low.

- Swing high at date *t* if `High[t]` equals the max of `High` over bars `t-w` through `t+w` (inclusive)
- Swing low at date *t* if `Low[t]` equals the min of `Low` over bars `t-w` through `t+w` (inclusive)
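
As a minimal illustration of the swing-high rule, the sketch below marks swing highs on a toy series with `w = 2`. The toy values and the name `swing_high_known_at` are assumptions made for this example and are not part of the notebook. Because the window is centered, the flag at date *t* depends on bars up to *t + w*, so it can only be observed *w* bars later; shifting the flag by `w` gives the bar on which the swing is actually known.

```python
import pandas as pd

# Toy High series (hypothetical values), one bar per business day.
high = pd.Series(
    [1.0, 2.0, 5.0, 3.0, 2.0, 4.0, 6.0, 4.0, 3.0, 2.0],
    index=pd.bdate_range("2024-01-01", periods=10),
)

w = 2
win = 2 * w + 1

# A bar is a swing high if it equals the max of the centered (2w+1)-bar window.
swing_high = high.eq(high.rolling(win, center=True, min_periods=win).max())

# The flag at t uses bars up to t+w, so it is only observable w bars later.
swing_high_known_at = swing_high.shift(w, fill_value=False)

print(swing_high.to_numpy().nonzero()[0])           # [2 6]
print(swing_high_known_at.to_numpy().nonzero()[0])  # [4 8]
```
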

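**Uptrend**: the last *N* swing highs are rising (**higher highs, HH**) and the last *N* swing lows are rising (**higher lows, HL**).

**Downtrend**: the last *N* swing highs are falling (**lower highs, LH**) and the last *N* swing lows are falling (**lower lows, LL**).

**Bullish BOS**: the close crosses above the last swing high; the flag is set only on the crossing bar (previous close at or below the level, current close above it).

**Bearish BOS**: the close crosses below the last swing low; the flag is set only on the crossing bar (previous close at or above the level, current close below it).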

**CHOCH** (simple rule):
- In uptrend, a break below last swing low => bearish CHOCH
- In downtrend, a break above last swing high => bullish CHOCH
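
The BOS and CHOCH rules can be sketched on a toy series as follows. The constant swing levels, the `uptrend` flags, and all values are hypothetical and chosen only to make the crossing logic visible. A BOS flag fires on the bar where the close crosses the level, so it fires again if price falls back and re-crosses; a bearish CHOCH is the bearish cross restricted to bars where the uptrend flag is set.

```python
import pandas as pd

close = pd.Series([9.0, 10.5, 11.0, 9.5, 10.2, 10.8])
prev_high = pd.Series(10.0, index=close.index)  # hypothetical last swing high
prev_low = pd.Series(9.8, index=close.index)    # hypothetical last swing low
uptrend = pd.Series([1, 1, 1, 1, 0, 0])         # hypothetical trend state

# Bullish BOS: close moves from at-or-below the swing high to above it.
bos_bull = (close > prev_high) & (close.shift(1) <= prev_high)

# Bearish cross of the swing low, and the CHOCH subset that occurs in an uptrend.
bear_cross = (close < prev_low) & (close.shift(1) >= prev_low)
choch_bear = uptrend.astype(bool) & bear_cross

print(bos_bull.astype(int).tolist())    # [0, 1, 0, 0, 1, 0]
print(choch_bear.astype(int).tolist())  # [0, 0, 0, 1, 0, 0]
```
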

## Split policy (time-based)
- 7 years training
- 18 months validation
- 18 months testing

## Target
Predict **next-day return** per asset: `y_ret_1d_fwd = ret_1d.shift(-1)`
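
The "per asset" part matters once all assets are stacked in one frame. The sketch below uses a hypothetical two-asset toy frame to show that a grouped `shift(-1)` leaves the last row of each asset without a label, whereas a plain `shift(-1)` would assign it the first return of the next asset.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Asset_ID": ["A", "A", "A", "B", "B", "B"],
        "ret_1d": [0.01, 0.02, 0.03, -0.01, -0.02, -0.03],
    }
)

# Shift within each asset so the last row of A never receives B's first return.
toy["y_ret_1d_fwd"] = toy.groupby("Asset_ID", sort=False)["ret_1d"].shift(-1)

# For comparison only: a plain shift leaks across the asset boundary.
toy["naive_fwd"] = toy["ret_1d"].shift(-1)

print(toy)
```
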


In [1]:
from __future__ import annotations

import os
from dataclasses import dataclass
from pathlib import Path

import numpy as np
import pandas as pd
from scipy.stats import spearmanr

from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


In [2]:
SEED = 42
rng = np.random.default_rng(SEED)

CWD = Path.cwd()
PROJECT_ROOT = CWD.parent if CWD.name == 'notebooks' else CWD

CLEANED_DIR = PROJECT_ROOT / 'dataset' / 'cleaned'

# Target
TARGET_COL = 'ret_1d'
TARGET_FWD_COL = 'y_ret_1d_fwd'

# Time split horizon
TRAIN_YEARS = 7
VAL_MONTHS = 18
TEST_MONTHS = 18

# SMC-ish feature parameters
SWING_WINDOW = 5  # w (uses 2*w+1 centered window)
N_CONSEC_SWINGS = 3  # N consecutive swing highs/lows

# Regularization grids
ALPHA_GRID = [1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0]

OUTPUT_DIR = PROJECT_ROOT / 'dataset' / 'model_outputs' / 'linear_models_05_smc_timesplit'
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)


## Build SMC + OHLCV feature set
We compute the SMC-style (Smart Money Concepts: BOS, CHOCH, trend) and OHLCV features **per asset** from `dataset/cleaned/Asset_XXX.csv`.


In [3]:
def compute_smc_features(ohlcv: pd.DataFrame, *, swing_window: int, n_consec_swings: int) -> pd.DataFrame:
    df = ohlcv.copy().sort_index()
    for col in ['Open', 'High', 'Low', 'Close', 'Volume']:
        if col not in df.columns:
            raise ValueError(f'Missing column: {col}')

    o = df['Open'].astype(float)
    h = df['High'].astype(float)
    l = df['Low'].astype(float)
    c = df['Close'].astype(float)
    v = df['Volume'].astype(float)

    # Basic returns (from OHLC)
    ret_1d = c.pct_change()

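    # Swing highs/lows (centered window of 2*w + 1 bars).
    # NOTE: center=True looks w bars ahead, so a swing at t is only confirmed at t+w.
    # The levels below are used from t+1 onward, i.e. before confirmation (look-ahead).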
    w = int(swing_window)
    win = 2 * w + 1
    swing_high = h.eq(h.rolling(win, center=True, min_periods=win).max())
    swing_low = l.eq(l.rolling(win, center=True, min_periods=win).min())

    last_swing_high = h.where(swing_high).ffill()
    last_swing_low = l.where(swing_low).ffill()

    # BOS events: first close break across prior swing level
    prev_high = last_swing_high.shift(1)
    prev_low = last_swing_low.shift(1)
    bos_bull = (c > prev_high) & (c.shift(1) <= prev_high) & prev_high.notna()
    bos_bear = (c < prev_low) & (c.shift(1) >= prev_low) & prev_low.notna()

    # Trend state from market structure (HH+HL for uptrend, LH+LL for downtrend)
    # We compute streaks on swing events and forward-fill to daily state.
    sh_vals = h.where(swing_high).dropna()
    sl_vals = l.where(swing_low).dropna()

    # For N swing points, we need (N-1) consecutive comparisons.
    req = max(1, int(n_consec_swings) - 1)

    # Higher-high / lower-high streaks (on swing highs)
    hh_flag = (sh_vals.diff() > 0).fillna(False)
    lh_flag = (sh_vals.diff() < 0).fillna(False)
    hh_streak = hh_flag.astype(int).groupby((~hh_flag).cumsum()).cumsum()
    lh_streak = lh_flag.astype(int).groupby((~lh_flag).cumsum()).cumsum()

    # Higher-low / lower-low streaks (on swing lows)
    hl_flag = (sl_vals.diff() > 0).fillna(False)
    ll_flag = (sl_vals.diff() < 0).fillna(False)
    hl_streak = hl_flag.astype(int).groupby((~hl_flag).cumsum()).cumsum()
    ll_streak = ll_flag.astype(int).groupby((~ll_flag).cumsum()).cumsum()

    # Daily state (forward fill from last swing event)
    hh_daily = hh_streak.reindex(df.index, method='ffill').fillna(0).astype(int)
    lh_daily = lh_streak.reindex(df.index, method='ffill').fillna(0).astype(int)
    hl_daily = hl_streak.reindex(df.index, method='ffill').fillna(0).astype(int)
    ll_daily = ll_streak.reindex(df.index, method='ffill').fillna(0).astype(int)

    uptrend = ((hh_daily >= req) & (hl_daily >= req)).astype(int)
    downtrend = ((lh_daily >= req) & (ll_daily >= req)).astype(int)

    # CHOCH events: trend state breaks against the prior swing level
    choch_bear = (uptrend.astype(bool)) & (c < prev_low) & (c.shift(1) >= prev_low) & prev_low.notna()
    choch_bull = (downtrend.astype(bool)) & (c > prev_high) & (c.shift(1) <= prev_high) & prev_high.notna()

    # Additional OHLCV features (kept simple and causal)
    hl_range = (h - l)
    hl_range_pct = hl_range / c.shift(1)
    body = (c - o)
    body_pct = body / o.replace(0.0, np.nan)
    hlc3 = (h + l + c) / 3.0
    
    vol_mean_20 = v.rolling(20).mean()
    vol_std_20 = v.rolling(20).std(ddof=1)
    vol_z_20 = (v - vol_mean_20) / (vol_std_20 + 1e-12)

    feat = pd.DataFrame(
        {
            # OHLCV raw
            'open': o,
            'high': h,
            'low': l,
            'close': c,
            'volume': v,

            # OHLCV derived
            'ret_1d': ret_1d,
            'hl_range': hl_range,
            'hl_range_pct': hl_range_pct,
            'body': body,
            'body_pct': body_pct,
            'hlc3': hlc3,
            'vol_z_20': vol_z_20,

            # SMC-ish indicators
            'bos_bull': bos_bull.astype(int),
            'bos_bear': bos_bear.astype(int),
            'choch_bull': choch_bull.astype(int),
            'choch_bear': choch_bear.astype(int),
            'uptrend': uptrend.astype(int),
            'downtrend': downtrend.astype(int),
        },
        index=df.index,
    )

    return feat
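
The streak counters in `compute_smc_features` rely on a compact pandas idiom: every `False` in the flag series starts a new group, and a cumulative sum inside each group counts consecutive `True` values. The sketch below isolates that idiom on hypothetical swing-high values; with `N_CONSEC_SWINGS = 3` the uptrend condition needs a streak of at least 2.

```python
import pandas as pd

# Hypothetical swing-high values in chronological order.
sh_vals = pd.Series([10.0, 11.0, 12.0, 11.5, 12.5, 13.0, 14.0])

# True where a swing high exceeds the previous one (the first diff is NaN -> False).
hh_flag = sh_vals.diff() > 0

# Each False starts a new group; the cumsum within a group counts consecutive Trues.
hh_streak = hh_flag.astype(int).groupby((~hh_flag).cumsum()).cumsum()

print(hh_streak.tolist())         # [0, 1, 2, 0, 1, 2, 3]
print((hh_streak >= 2).tolist())  # streak long enough for N = 3 swing highs
```
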


In [4]:
# Load OHLCV for 100 assets
files = sorted([p for p in CLEANED_DIR.glob('Asset_*.csv')])
if not files:
    raise FileNotFoundError(f'No Asset_*.csv found in {CLEANED_DIR}')
files = files[:100]

frames = []
for p in files:
    sym = p.stem
    ohlcv = pd.read_csv(p, parse_dates=['Date']).set_index('Date').sort_index()
    feat = compute_smc_features(ohlcv, swing_window=SWING_WINDOW, n_consec_swings=N_CONSEC_SWINGS)
    feat['Asset_ID'] = sym
    frames.append(feat)

data = pd.concat(frames, axis=0).sort_index()

# Forward label per asset
data[TARGET_FWD_COL] = data.groupby('Asset_ID', sort=False)[TARGET_COL].shift(-1)
data = data.dropna(subset=[TARGET_FWD_COL])

print('shape:', data.shape)
print('date range:', data.index.min(), '->', data.index.max())
print('assets:', data['Asset_ID'].nunique())
display(data.head(3))


shape: (251000, 20)
date range: 2016-01-25 00:00:00 -> 2026-01-15 00:00:00
assets: 100


Date,open,high,low,close,volume,ret_1d,hl_range,hl_range_pct,body,body_pct,hlc3,vol_z_20,bos_bull,bos_bear,choch_bull,choch_bear,uptrend,downtrend,Asset_ID,y_ret_1d_fwd
2016-01-25,29.178415,29.18129,28.514486,28.580592,249449990.0,,0.666804,,-0.597822,-0.020489,28.758789,,0,0,0,0,0,0,Asset_001,0.005531
2016-01-25,8.869026,8.875576,8.476011,8.489112,148653644.0,,0.399565,,-0.379914,-0.042836,8.613566,,0,0,0,0,0,0,Asset_028,0.027007
2016-01-25,24.042331,24.186298,23.795532,23.826383,65904403.0,,0.390766,,-0.215948,-0.008982,23.936071,,0,0,0,0,0,0,Asset_029,0.023737


## Time-based split
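We apply the time-based split policy defined above to this feature set: boundaries are counted back from the last available date and snapped forward to the nearest trading day.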


In [5]:
def align_to_trading_date(index: pd.DatetimeIndex, ts: pd.Timestamp) -> pd.Timestamp:
    pos = int(index.searchsorted(ts, side='left'))
    if pos >= len(index):
        return pd.Timestamp(index[-1])
    return pd.Timestamp(index[pos])

idx = pd.DatetimeIndex(data.index.unique()).sort_values()
end = pd.Timestamp(idx[-1])

raw_test_start = end - pd.DateOffset(months=TEST_MONTHS)
raw_val_start = raw_test_start - pd.DateOffset(months=VAL_MONTHS)
raw_train_start = raw_val_start - pd.DateOffset(years=TRAIN_YEARS)

test_start = align_to_trading_date(idx, pd.Timestamp(raw_test_start))
val_start = align_to_trading_date(idx, pd.Timestamp(raw_val_start))
train_start = align_to_trading_date(idx, pd.Timestamp(raw_train_start))

print('aligned boundaries:')
print('  train_start:', train_start)
print('  val_start  :', val_start)
print('  test_start :', test_start)
print('  end        :', end)

df_train = data.loc[(data.index >= train_start) & (data.index < val_start)].copy()
df_val = data.loc[(data.index >= val_start) & (data.index < test_start)].copy()
df_test = data.loc[(data.index >= test_start) & (data.index <= end)].copy()

print('rows train/val/test:', df_train.shape[0], df_val.shape[0], df_test.shape[0])
print('days train/val/test:', df_train.index.nunique(), df_val.index.nunique(), df_test.index.nunique())
print('assets train/val/test:', df_train['Asset_ID'].nunique(), df_val['Asset_ID'].nunique(), df_test['Asset_ID'].nunique())


aligned boundaries:
  train_start: 2016-01-25 00:00:00
  val_start  : 2023-01-17 00:00:00
  test_start : 2024-07-15 00:00:00
  end        : 2026-01-15 00:00:00
rows train/val/test: 175700 37400 37900
days train/val/test: 1757 374 379
assets train/val/test: 100 100 100


In [6]:
exclude_cols = {'Asset_ID', TARGET_FWD_COL}
feature_cols = [c for c in df_train.columns if c not in exclude_cols and pd.api.types.is_numeric_dtype(df_train[c])]

print('feature_cols:', feature_cols)

def to_xy(d: pd.DataFrame):
    X = d.loc[:, feature_cols].replace([np.inf, -np.inf], np.nan)
    y = d.loc[:, TARGET_FWD_COL]
    return X, y

X_train, y_train = to_xy(df_train)
X_val, y_val = to_xy(df_val)
X_test, y_test = to_xy(df_test)

print('X_train:', X_train.shape, 'X_val:', X_val.shape, 'X_test:', X_test.shape)


feature_cols: ['open', 'high', 'low', 'close', 'volume', 'ret_1d', 'hl_range', 'hl_range_pct', 'body', 'body_pct', 'hlc3', 'vol_z_20', 'bos_bull', 'bos_bear', 'choch_bull', 'choch_bear', 'uptrend', 'downtrend']
X_train: (175700, 18) X_val: (37400, 18) X_test: (37900, 18)


In [7]:
@dataclass(frozen=True)
class RegressionMetrics:
    rmse: float
    mae: float
    r2: float
    spearman_ic: float


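def compute_metrics(y_true: pd.Series, y_pred: np.ndarray) -> RegressionMetrics:
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    mae = float(mean_absolute_error(y_true, y_pred))
    r2 = float(r2_score(y_true, y_pred))
    # Spearman is undefined when either input is constant (e.g. Lasso with all
    # coefficients shrunk to zero); return NaN explicitly instead of raising a warning.
    if np.ptp(np.asarray(y_pred)) == 0 or y_true.nunique() <= 1:
        ic = float('nan')
    else:
        ic = float(spearmanr(y_true.to_numpy(), y_pred, nan_policy='omit').correlation)
    return RegressionMetrics(rmse=rmse, mae=mae, r2=r2, spearman_ic=ic)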


## Model 1/3: OLS (LinearRegression)


In [8]:
ols_pipe = Pipeline(
    steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler()),
        ('model', LinearRegression()),
    ]
)

ols_pipe.fit(X_train, y_train)

print('OLS train:', compute_metrics(y_train, ols_pipe.predict(X_train)))
print('OLS val  :', compute_metrics(y_val, ols_pipe.predict(X_val)))
print('OLS test :', compute_metrics(y_test, ols_pipe.predict(X_test)))


OLS train: RegressionMetrics(rmse=0.01930281037357024, mae=0.012579510036769272, r2=0.009670557814156489, spearman_ic=0.04168134241529842)
OLS val  : RegressionMetrics(rmse=0.016015925796699446, mae=0.010965941598690047, r2=-0.016293649311212022, spearman_ic=-0.014333744911543005)
OLS test : RegressionMetrics(rmse=0.019447031131666064, mae=0.012799352392210967, r2=-0.018635283457884455, spearman_ic=0.008653073728048867)
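
The negative validation and test R² values mean the fitted model has a larger squared error than a constant prediction equal to the mean of that evaluation set. One way to put such numbers in context is to score a constant baseline with the same metric. The sketch below uses synthetic noise rather than the notebook's data, and scikit-learn's `DummyRegressor` as the baseline; all variable names are illustrative.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Synthetic stand-in for returns: pure noise with a slightly different mean per split.
y_train = rng.normal(0.0005, 0.02, size=5_000)
y_val = rng.normal(0.0002, 0.02, size=1_000)
X_train = np.zeros((y_train.size, 1))  # features are ignored by the dummy model
X_val = np.zeros((y_val.size, 1))

baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
pred = baseline.predict(X_val)

# The baseline predicts the training mean everywhere, so its R^2 on new data is <= 0.
print(r2_score(y_val, pred))
```

A model whose out-of-sample R² sits below this baseline's value has not improved on a constant forecast under this metric.
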


## Model 2/3: Ridge (alpha tuned on validation)


In [9]:
def tune_ridge(alpha_grid: list[float]) -> tuple[float, pd.DataFrame]:
    rows = []
    for a in alpha_grid:
        pipe = Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler()),
            ('model', Ridge(alpha=a, random_state=SEED)),
        ])
        pipe.fit(X_train, y_train)
        m = compute_metrics(y_val, pipe.predict(X_val))
        rows.append({'alpha': a, 'val_rmse': m.rmse, 'val_mae': m.mae, 'val_r2': m.r2, 'val_ic': m.spearman_ic})
    res = pd.DataFrame(rows).sort_values('val_rmse', ascending=True).reset_index(drop=True)
    return float(res.iloc[0]['alpha']), res

best_alpha_ridge, ridge_grid = tune_ridge(ALPHA_GRID)
display(ridge_grid)
print('best_alpha_ridge:', best_alpha_ridge)
ridge_grid.to_csv(OUTPUT_DIR / 'ridge_alpha_valgrid.csv', index=False)

ridge_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', Ridge(alpha=best_alpha_ridge, random_state=SEED)),
])
ridge_pipe.fit(X_train, y_train)

print('Ridge train:', compute_metrics(y_train, ridge_pipe.predict(X_train)))
print('Ridge val  :', compute_metrics(y_val, ridge_pipe.predict(X_val)))
print('Ridge test :', compute_metrics(y_test, ridge_pipe.predict(X_test)))


,alpha,val_rmse,val_mae,val_r2,val_ic
0,100.0,0.015949,0.010916,-0.007825,-0.014992
1,10.0,0.015959,0.010923,-0.009043,-0.015291
2,1.0,0.015997,0.010951,-0.013861,-0.014489
3,0.1,0.016013,0.010964,-0.015985,-0.01435
4,0.01,0.016016,0.010966,-0.016262,-0.014335
5,0.001,0.016016,0.010966,-0.01629,-0.014333
6,0.0001,0.016016,0.010966,-0.016293,-0.014333


best_alpha_ridge: 100.0
Ridge train: RegressionMetrics(rmse=0.019327577869022883, mae=0.012576219726664402, r2=0.0071275378658401944, spearman_ic=0.03753750365124054)
Ridge val  : RegressionMetrics(rmse=0.015949057957717824, mae=0.01091621583890943, r2=-0.007825141513374767, spearman_ic=-0.014991723370601218)
Ridge test : RegressionMetrics(rmse=0.019303354703252, mae=0.012687398142148264, r2=-0.0036393449483664853, spearman_ic=0.013756362688875269)


## Model 3/3: Lasso (alpha tuned on validation)


In [10]:
def tune_lasso(alpha_grid: list[float]) -> tuple[float, pd.DataFrame]:
    rows = []
    for a in alpha_grid:
        pipe = Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler()),
            ('model', Lasso(alpha=a, random_state=SEED, max_iter=20_000)),
        ])
        pipe.fit(X_train, y_train)
        m = compute_metrics(y_val, pipe.predict(X_val))
        rows.append({'alpha': a, 'val_rmse': m.rmse, 'val_mae': m.mae, 'val_r2': m.r2, 'val_ic': m.spearman_ic})
    res = pd.DataFrame(rows).sort_values('val_rmse', ascending=True).reset_index(drop=True)
    return float(res.iloc[0]['alpha']), res

best_alpha_lasso, lasso_grid = tune_lasso(ALPHA_GRID)
display(lasso_grid)
print('best_alpha_lasso:', best_alpha_lasso)
lasso_grid.to_csv(OUTPUT_DIR / 'lasso_alpha_valgrid.csv', index=False)

lasso_pipe = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('model', Lasso(alpha=best_alpha_lasso, random_state=SEED, max_iter=20_000)),
])
lasso_pipe.fit(X_train, y_train)

print('Lasso train:', compute_metrics(y_train, lasso_pipe.predict(X_train)))
print('Lasso val  :', compute_metrics(y_val, lasso_pipe.predict(X_val)))
print('Lasso test :', compute_metrics(y_test, lasso_pipe.predict(X_test)))


,alpha,val_rmse,val_mae,val_r2,val_ic
0,0.1,0.015887,0.01086,-7e-06,
1,0.01,0.015887,0.01086,-7e-06,
2,10.0,0.015887,0.01086,-7e-06,
3,1.0,0.015887,0.01086,-7e-06,
4,100.0,0.015887,0.01086,-7e-06,
5,0.001,0.015896,0.010867,-0.001077,-0.017548
6,0.0001,0.015939,0.010904,-0.006595,-0.016524


best_alpha_lasso: 0.1
Lasso train: RegressionMetrics(rmse=0.019396827294670512, mae=0.012586788696372204, r2=0.0, spearman_ic=nan)
Lasso val  : RegressionMetrics(rmse=0.015887076756732486, mae=0.010859805912974939, r2=-7.145644700612408e-06, spearman_ic=nan)
Lasso test : RegressionMetrics(rmse=0.01926878581542823, mae=0.012649241290022425, r2=-4.788321152626729e-05, spearman_ic=nan)
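
The identical validation metrics for every alpha of 0.01 or larger, the train R² of exactly 0.0, and the NaN IC are what one would expect if Lasso shrinks every coefficient to zero: the pipeline then predicts the training mean for every row, and a rank correlation against a constant is undefined. The sketch below reproduces this behaviour on synthetic data; the data and names are assumptions, and it does not use the notebook's features.

```python
import warnings

import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))        # synthetic features on a unit scale
y = rng.normal(0.0, 0.02, size=500)  # return-like target, unrelated to X

model = Lasso(alpha=0.1).fit(X, y)
pred = model.predict(X)

print(np.count_nonzero(model.coef_))  # number of surviving coefficients
print(np.ptp(pred))                   # spread of the predictions

with warnings.catch_warnings():
    warnings.simplefilter("ignore")   # SciPy warns when an input is constant
    ic = spearmanr(y, pred).correlation
print(ic)
```
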


## Backtesting (Our Engine + Bokeh)

This section runs the project's own backtesting engine (the `backtester` package under `src/`) on the **test window** for all three models (OLS, Ridge, Lasso), using its built-in detailed report and Bokeh dashboard.

- Strategy: weekly rebalance, long-only Top-K by predicted return (keep only positive predictions)
- Portfolio: 1/N equal weights
- Optional: run MPT for the best 1/N model
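
The selection rule for a single rebalance date can be sketched in plain pandas as below. `top_k_equal_weights` is a hypothetical helper written for this illustration; it is not part of the `backtester` package, and the prediction values are made up.

```python
import numpy as np
import pandas as pd


def top_k_equal_weights(pred_row: pd.Series, k: int) -> pd.Series:
    """Hypothetical helper: 1/N weights over the top-k strictly positive predictions."""
    ranked = pred_row.dropna().sort_values(ascending=False, kind="stable").head(k)
    picks = ranked[ranked > 0].index
    weights = pd.Series(0.0, index=pred_row.index)
    if len(picks) > 0:
        weights.loc[picks] = 1.0 / len(picks)
    return weights


# One rebalance date: predicted next-day returns per asset (toy values).
pred_row = pd.Series({"A": 0.004, "B": -0.001, "C": 0.002, "D": np.nan, "E": 0.001})
print(top_k_equal_weights(pred_row, k=3).to_dict())
```

If every prediction in a row is identical, the Top-K cut is decided purely by tie order rather than by the model, so the resulting portfolio is effectively an arbitrary equal-weight basket.
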


In [11]:
from IPython.display import display
from bokeh.io import output_notebook, show
import sys

# Ensure `src/` is on sys.path so `backtester` is importable
src_dir = PROJECT_ROOT / 'src'
if str(src_dir) not in sys.path:
    sys.path.append(str(src_dir))

from backtester.data import load_cleaned_assets, align_close_prices
from backtester.engine import BacktestConfig, run_backtest
from backtester.report import compute_backtest_report
from backtester.bokeh_plots import build_interactive_portfolio_layout
from backtester.portfolio import equal_weight, optimize_mpt

output_notebook()

assets = sorted(data['Asset_ID'].unique().tolist())
assets_ohlcv = load_cleaned_assets(symbols=assets, cleaned_dir=str(CLEANED_DIR))
close_prices = align_close_prices(assets_ohlcv).loc[test_start:end]
returns_matrix = close_prices.pct_change().fillna(0.0)

market_df = pd.DataFrame({
    'Open': pd.concat([df['Open'] for df in assets_ohlcv.values()], axis=1).mean(axis=1),
    'High': pd.concat([df['High'] for df in assets_ohlcv.values()], axis=1).mean(axis=1),
    'Low': pd.concat([df['Low'] for df in assets_ohlcv.values()], axis=1).mean(axis=1),
    'Close': pd.concat([df['Close'] for df in assets_ohlcv.values()], axis=1).mean(axis=1),
    'Volume': pd.concat([df['Volume'] for df in assets_ohlcv.values()], axis=1).sum(axis=1),
}).sort_index().loc[test_start:end]

pipes = {
    'ols': ols_pipe,
    'ridge': ridge_pipe,
    'lasso': lasso_pipe,
}

def pred_matrix_for_pipe(pipe: Pipeline) -> pd.DataFrame:
    pred = pipe.predict(X_test)
    long = pd.DataFrame({'Date': df_test.index, 'Asset_ID': df_test['Asset_ID'].to_numpy(), 'y_pred': pred})
    pm = long.pivot_table(index='Date', columns='Asset_ID', values='y_pred', aggfunc='mean').sort_index()
    return pm.reindex(close_prices.index)

pred_mats = {name: pred_matrix_for_pipe(pipe) for name, pipe in pipes.items()}

REBALANCE_FREQ = 'W'
TOP_K = min(20, len(assets))
LOOKBACK_DAYS = 126

def build_weights_from_predictions(pred_matrix: pd.DataFrame, *, pm_style: str) -> pd.DataFrame:
    # Use last trading day of each rebalance period
    rebal_dates = set(pd.Series(pred_matrix.index, index=pred_matrix.index).resample(REBALANCE_FREQ).last().dropna().tolist())
    w_last = pd.Series(0.0, index=assets)
    rows = []
    for dt in pred_matrix.index:
        if dt in rebal_dates:
            row = pred_matrix.loc[dt].dropna().sort_values(ascending=False)
            top = row.head(TOP_K)
            candidates = [a for a, v in top.items() if np.isfinite(v) and v > 0]
            if not candidates:
                w_last = pd.Series(0.0, index=assets)
            else:
                if pm_style == '1N':
                    w_dict = equal_weight(candidates)
                elif pm_style == 'MPT':
                    w_dict = optimize_mpt(returns_matrix, candidates, dt, lookback_days=LOOKBACK_DAYS)
                else:
                    raise ValueError(f'Unknown pm_style: {pm_style!r}')
                w_last = pd.Series(0.0, index=assets)
                for a, w in w_dict.items():
                    w_last[a] = float(w)
        rows.append(w_last)
    return pd.DataFrame(rows, index=pred_matrix.index, columns=assets).fillna(0.0)

cfg = BacktestConfig(initial_equity=1_000_000.0, transaction_cost_bps=5.0, mode='vectorized')

# Run 1/N for all three models
compare_rows = []
results_1n = {}
for model_name, pm in pred_mats.items():
    w = build_weights_from_predictions(pm, pm_style='1N')
    res = run_backtest(close_prices, w, config=cfg)
    rpt = compute_backtest_report(result=res, close_prices=close_prices)
    results_1n[model_name] = (w, res, rpt)
    compare_rows.append({
        'model': model_name,
        'Total Return [%]': float(rpt['Total Return [%]']),
        'CAGR [%]': float(rpt['CAGR [%]']),
        'Sharpe': float(rpt['Sharpe']),
        'Max Drawdown [%]': float(rpt['Max Drawdown [%]']),
    })

compare = pd.DataFrame(compare_rows).sort_values('Total Return [%]', ascending=False).reset_index(drop=True)
display(compare)
compare.to_csv(OUTPUT_DIR / 'backtest_1n_model_compare.csv', index=False)

# Show Bokeh dashboards + reports for each 1/N model
for model_name in ['ols', 'ridge', 'lasso']:
    w, res, rpt = results_1n[model_name]
    title = f'SMC Linear ({model_name}) - 1/N (Test Window)'
    display(rpt.to_frame(title))
    layout = build_interactive_portfolio_layout(
        market_ohlcv=market_df,
        equity=res.equity,
        returns=res.returns,
        weights=res.weights,
        turnover=res.turnover,
        costs=res.costs,
        close_prices=close_prices,
        title=title,
    )
    show(layout)

# Optional: MPT backtest for the best 1/N model
best_model = str(compare.loc[0, 'model'])
best_pm = pred_mats[best_model]
w_mpt = build_weights_from_predictions(best_pm, pm_style='MPT')
res_mpt = run_backtest(close_prices, w_mpt, config=cfg)
rpt_mpt = compute_backtest_report(result=res_mpt, close_prices=close_prices)
display(rpt_mpt.to_frame(f'SMC Linear ({best_model}) - MPT (Test Window)'))
layout = build_interactive_portfolio_layout(
    market_ohlcv=market_df,
    equity=res_mpt.equity,
    returns=res_mpt.returns,
    weights=res_mpt.weights,
    turnover=res_mpt.turnover,
    costs=res_mpt.costs,
    close_prices=close_prices,
    title=f'SMC Linear ({best_model}) - MPT (Test Window)',
)
show(layout)


,model,Total Return [%],CAGR [%],Sharpe,Max Drawdown [%]
0,lasso,41.831591,26.235639,1.396587,-18.490968
1,ridge,37.247918,23.50101,1.064191,-21.478081
2,ols,12.529357,8.187535,0.486278,-22.598123


,SMC Linear (ols) - 1/N (Test Window)
Start,2024-07-15 00:00:00
End,2026-01-15 00:00:00
Duration,549 days 00:00:00
Initial Equity,1000000.0
Final Equity,1125293.573016
Equity Peak,1137801.337237
Total Return [%],12.529357
CAGR [%],8.187535
Volatility (ann) [%],20.363539
Sharpe,0.486278


,SMC Linear (ridge) - 1/N (Test Window)
Start,2024-07-15 00:00:00
End,2026-01-15 00:00:00
Duration,549 days 00:00:00
Initial Equity,1000000.0
Final Equity,1372479.175301
Equity Peak,1384509.421592
Total Return [%],37.247918
CAGR [%],23.50101
Volatility (ann) [%],22.045462
Sharpe,1.064191


,SMC Linear (lasso) - 1/N (Test Window)
Start,2024-07-15 00:00:00
End,2026-01-15 00:00:00
Duration,549 days 00:00:00
Initial Equity,1000000.0
Final Equity,1418315.908397
Equity Peak,1432693.438242
Total Return [%],41.831591
CAGR [%],26.235639
Volatility (ann) [%],17.764512
Sharpe,1.396587


,SMC Linear (lasso) - MPT (Test Window)
Start,2024-07-15 00:00:00
End,2026-01-15 00:00:00
Duration,549 days 00:00:00
Initial Equity,1000000.0
Final Equity,1181270.48566
Equity Peak,1188827.778042
Total Return [%],18.127049
CAGR [%],11.746236
Volatility (ann) [%],16.294835
Sharpe,0.76154
