# LeNet5-like CNN on Feature Grid

These notebooks implement the Chapter 18 CNN models (time series as images) on this repo's dataset.

Common requirements satisfied in each notebook:
- **Time-wise split**: train = first 7 years, val = the period in between, test = last 18 months
- Uses the existing backtester: `src.backtester.engine.run_backtest`
- Uses the existing Bokeh dashboard: `src.backtester.bokeh_plots.build_interactive_portfolio_layout`
- The backtest window is sliced to the test period, so the reported Start/End dates match the test window (Start is the test start, not 2016).

GPU note (AMD): PyTorch exposes both CUDA and ROCm builds through `torch.cuda`. If your PyTorch build supports ROCm and a supported GPU is present, `torch.cuda.is_available()` should return `True`.


## Math: Convolution and Pooling

For a 1D signal $x$ and a kernel $w$ of length $K$, the operation that most DL libraries call a convolution (strictly a cross-correlation, because the kernel is not flipped) is:

$$y[t] = \sum_{k=0}^{K-1} w[k] \cdot x[t+k].$$

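For multichannel inputs, the kernel for one output channel has shape $(C_{\text{in}}, K)$, and the output sums over input channels as well as kernel positions: $y[t] = \sum_{c=0}^{C_{\text{in}}-1}\sum_{k=0}^{K-1} w[c,k]\, x[c,t+k]$.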
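
The formula can be checked numerically with a few lines of NumPy. This is a minimal sketch: the function name `cross_correlate_1d` and the toy arrays are illustrative and not part of the repo.

```python
import numpy as np


def cross_correlate_1d(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Valid-mode cross-correlation for one output channel.

    x has shape (C_in, T), w has shape (C_in, K).
    Returns y with y[t] = sum_c sum_k w[c, k] * x[c, t + k], length T - K + 1.
    """
    _, t_len = x.shape
    _, k_len = w.shape
    out = np.empty(t_len - k_len + 1, dtype=np.float64)
    for t in range(out.shape[0]):
        # Elementwise product of the kernel with the window, summed over channels and taps.
        out[t] = np.sum(w * x[:, t:t + k_len])
    return out


# Toy input: 2 channels, 5 time steps; kernel of length 3.
x = np.array([[1.0, 2.0, 3.0, 4.0, 5.0],
              [0.0, 1.0, 0.0, 1.0, 0.0]])
w = np.array([[1.0, 0.0, -1.0],
              [0.5, 0.5, 0.5]])

y = cross_correlate_1d(x, w)
print(y)  # [-1.5 -1.  -1.5]
```

Summing `np.correlate(x[c], w[c], mode='valid')` over the channels gives the same result, which is a quick way to confirm that no kernel flip is involved.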

Pooling reduces resolution while keeping salient local patterns. For example, non-overlapping max-pooling with window size $P$ (stride $P$) is:

$$y[t] = \max_{0\le k < P} x[tP+k].$$
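
A small sketch of 1D max-pooling with an explicit stride shows where the resolution reduction comes from. The helper `max_pool_1d` and its arguments are hypothetical names chosen for illustration. With a stride of 1 the window slides one step at a time and the output is almost as long as the input; with the stride equal to the window size, the output is shorter by a factor of the window size.

```python
import numpy as np


def max_pool_1d(x: np.ndarray, window: int, stride: int) -> np.ndarray:
    """Max over each window of length `window`, moving `stride` steps between windows."""
    n_out = (len(x) - window) // stride + 1
    return np.array([x[i * stride:i * stride + window].max() for i in range(n_out)])


x = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 0.0])

pooled_stride1 = max_pool_1d(x, window=2, stride=1)  # overlapping windows
pooled_stride2 = max_pool_1d(x, window=2, stride=2)  # non-overlapping windows

print(pooled_stride1)  # [3. 3. 5. 5. 4.]
print(pooled_stride2)  # [3. 5. 4.]
```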

CNNs exploit locality and weight sharing to reduce parameters vs fully-connected layers.
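
To make the parameter saving concrete, the sketch below counts the weights of six 3x3 filters on a 15x15 single-channel input (the first layer used later in this notebook) and compares them with a dense layer that would produce the same number of outputs. The helper names are illustrative.

```python
def conv2d_params(c_in: int, c_out: int, k: int) -> int:
    """Weights plus one bias per output channel for a k x k convolution."""
    return c_out * (c_in * k * k + 1)


def dense_params(n_in: int, n_out: int) -> int:
    """Weights plus one bias per output unit for a fully connected layer."""
    return n_out * (n_in + 1)


grid = 15                       # input is grid x grid, single channel
conv_out_side = grid - 3 + 1    # 3x3 kernel, no padding -> 13
n_outputs = 6 * conv_out_side * conv_out_side  # 6 feature maps of 13 x 13

n_conv = conv2d_params(c_in=1, c_out=6, k=3)
n_dense = dense_params(n_in=grid * grid, n_out=n_outputs)

print('conv params :', n_conv)   # 60
print('dense params:', n_dense)  # 229164
```

The convolution reuses the same 60 parameters at every position, whereas the dense layer needs a separate weight for every input-output pair.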


## Data, Label, and Portfolio Construction

We load the repo's feature-extracted panel dataset and create a forward label per asset:

$$y_t = r_{t+1}$$

where $r_t$ is the asset's one-day return (column `ret_1d`) known at time $t$; the label is therefore the next day's return.

Binary direction label:

$$y^{\uparrow}_t = \mathbb{1}[y_t > 0].$$
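
The label construction can be sketched on a toy panel. The two assets and their returns below are made up for illustration; the column names follow the ones used in this notebook's code. The shift is done per asset so that one asset's next-day return never leaks into another's label, and the last date of each asset is dropped because it has no forward return.

```python
import pandas as pd

# Toy panel: two assets, three dates, indexed by Date like the real dataset.
toy = pd.DataFrame(
    {
        'Asset_ID': ['A', 'B', 'A', 'B', 'A', 'B'],
        'ret_1d': [0.01, -0.02, -0.03, 0.04, 0.02, 0.00],
    },
    index=pd.to_datetime(
        ['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02', '2020-01-03', '2020-01-03']
    ),
)
toy.index.name = 'Date'

# Forward return: next day's ret_1d for the same asset.
toy['y_ret_1d_fwd'] = toy.groupby('Asset_ID', sort=False)['ret_1d'].shift(-1)
toy = toy.dropna(subset=['y_ret_1d_fwd'])

# Binary direction label: 1 if the forward return is strictly positive.
toy['y_up_fwd'] = (toy['y_ret_1d_fwd'] > 0.0).astype(int)

print(toy)
```

Note that a forward return of exactly zero is labelled 0, because the indicator uses a strict inequality.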

From a model probability $p_{i,t}$, define a centered score $s_{i,t} = p_{i,t} - 0.5$.

Weekly Top-K long-only weights at rebalance date t:

$$w_{i,t}=\frac{1}{|\mathcal{L}_t|}\mathbb{1}[i\in\mathcal{L}_t],\quad \mathcal{L}_t=\{\,i\in\text{TopK}(s_{\cdot,t}) : s_{i,t}>0\,\}.$$

If $\mathcal{L}_t$ is empty, all weights are zero for that rebalance period.
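
A minimal sketch of this rule for a single rebalance date, assuming the scores arrive as a pandas Series indexed by asset. The function `top_k_long_only` and the example probabilities are hypothetical; the notebook itself builds the weights inline and uses the repo's `equal_weight` helper.

```python
import pandas as pd


def top_k_long_only(scores: pd.Series, k: int) -> pd.Series:
    """Equal-weight the top-k assets by score, keeping only strictly positive scores."""
    ranked = scores.dropna().sort_values(ascending=False).head(k)
    selected = ranked[ranked > 0.0].index
    weights = pd.Series(0.0, index=scores.index)
    if len(selected) > 0:
        weights.loc[selected] = 1.0 / len(selected)
    return weights


proba = pd.Series({'A': 0.70, 'B': 0.55, 'C': 0.48, 'D': 0.30})
scores = proba - 0.5          # centered score s = p - 0.5
w = top_k_long_only(scores, k=3)

print(w)
```

Here the top three assets by score are A, B and C, but C has a negative score and is filtered out, so A and B each receive a weight of 0.5. If no asset has a positive score, every weight is zero.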


## LeNet5 (Adapted) - Math/Architecture

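LeNet5 stacks Conv2D and pooling layers, followed by fully connected layers.
We adapt it to small single-channel inputs (nominally 15x15): 3x3 kernels replace the original 5x5 kernels, the grid side is the largest integer whose square does not exceed the number of numeric features, and an adaptive average pool makes the fully connected head independent of the grid size.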
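
The spatial sizes can be traced by hand. The sketch below assumes the layer stack defined in the model cell of this notebook (3x3 convolution, 2x2 average pool, 3x3 convolution, all without padding); the function name is illustrative.

```python
def lenet_like_sides(grid: int) -> dict:
    """Side length of the feature map after each spatial layer for a grid x grid input."""
    conv1 = grid - 3 + 1   # 3x3 conv, no padding
    pool1 = conv1 // 2     # 2x2 average pool, stride 2
    conv2 = pool1 - 3 + 1  # 3x3 conv, no padding
    return {'conv1': conv1, 'pool1': pool1, 'conv2': conv2}


for grid in (15, 8, 7):
    print(grid, lenet_like_sides(grid))
# 15 {'conv1': 13, 'pool1': 6, 'conv2': 4}
# 8 {'conv1': 6, 'pool1': 3, 'conv2': 1}
# 7 {'conv1': 5, 'pool1': 2, 'conv2': 0}
```

A 15x15 input leaves a 4x4 map before the adaptive pool. The second convolution needs at least a 3x3 input, so under these assumptions the smallest workable grid is 8x8 (64 features).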


In [None]:
from __future__ import annotations

from pathlib import Path
import sys

import numpy as np
import pandas as pd

import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

from bokeh.io import output_notebook, show


In [None]:
# Resolve project root robustly by walking parents.
CWD = Path.cwd().resolve()
PROJECT_ROOT = None
for p in [CWD, *CWD.parents]:
    if (p / 'dataset').exists() and (p / 'src').exists():
        PROJECT_ROOT = p
        break
if PROJECT_ROOT is None:
    raise RuntimeError(f'Could not locate project root from CWD={CWD}')
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))
print('PROJECT_ROOT:', PROJECT_ROOT)


In [None]:
# Device selection: CUDA on NVIDIA, ROCm on AMD (if your PyTorch build supports it).
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('torch version:', torch.__version__)
print('device:', device)


In [None]:
SEED = 42
rng = np.random.default_rng(SEED)
torch.manual_seed(SEED)

FEATURES_PARQUET_PATH = PROJECT_ROOT / 'dataset' / 'features' / 'all_features.parquet'
FEATURES_CSV_PATH = PROJECT_ROOT / 'dataset' / 'features' / 'all_features.csv'

TARGET_COL = 'ret_1d'
TARGET_FWD_COL = 'y_ret_1d_fwd'
TARGET_DIR_COL = 'y_up_fwd'

TRAIN_YEARS = 7
TEST_MONTHS = 18

REBALANCE_FREQ = 'W'
TOP_K = 20
INITIAL_EQUITY = 1_000_000.0
TXN_COST_BPS = 5.0


def load_feature_dataset() -> pd.DataFrame:
    if FEATURES_PARQUET_PATH.exists():
        df = pd.read_parquet(FEATURES_PARQUET_PATH)
        if 'Date' in df.columns:
            df['Date'] = pd.to_datetime(df['Date'])
            df = df.set_index('Date')
    elif FEATURES_CSV_PATH.exists():
        df = pd.read_csv(FEATURES_CSV_PATH, parse_dates=['Date']).set_index('Date')
    else:
        raise FileNotFoundError('Feature dataset not found under dataset/features/.')

    required = {'Asset_ID', TARGET_COL}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f'Missing required columns: {sorted(missing)}')

    return df.sort_index()

panel = load_feature_dataset().copy()

panel[TARGET_FWD_COL] = panel.groupby('Asset_ID', sort=False)[TARGET_COL].shift(-1)
panel = panel.dropna(subset=[TARGET_FWD_COL]).sort_index()
panel[TARGET_DIR_COL] = (panel[TARGET_FWD_COL].astype(float) > 0.0).astype(int)

start = pd.Timestamp(panel.index.min())
end = pd.Timestamp(panel.index.max())
train_end = start + pd.DateOffset(years=TRAIN_YEARS)
test_start = end - pd.DateOffset(months=TEST_MONTHS)

if train_end >= test_start:
    raise ValueError(
        f'Not enough history for requested split: start={start.date()} train_end={train_end.date()} test_start={test_start.date()} end={end.date()}'
    )

train_mask = panel.index < train_end
val_mask = (panel.index >= train_end) & (panel.index < test_start)
test_mask = panel.index >= test_start

panel_train = panel.loc[train_mask].copy()
panel_val = panel.loc[val_mask].copy()
panel_test = panel.loc[test_mask].copy()

print('date range:', start.date(), '->', end.date())
print('train:', panel_train.index.min().date(), '->', panel_train.index.max().date(), 'rows:', panel_train.shape[0])
print('val  :', panel_val.index.min().date(), '->', panel_val.index.max().date(), 'rows:', panel_val.shape[0])
print('test :', panel_test.index.min().date(), '->', panel_test.index.max().date(), 'rows:', panel_test.shape[0])
print('assets:', panel['Asset_ID'].nunique())

exclude = {'Asset_ID', TARGET_FWD_COL, TARGET_DIR_COL}
feature_cols = [c for c in panel.columns if c not in exclude]
numeric_feature_cols = [c for c in feature_cols if pd.api.types.is_numeric_dtype(panel[c])]
print('n_numeric_features:', len(numeric_feature_cols))


In [None]:
from sklearn.feature_selection import mutual_info_classif
from scipy.cluster.hierarchy import linkage, leaves_list
from scipy.spatial.distance import squareform

# Grid size adapts to available numeric features.
# Our feature dataset may have fewer than 15x15=225 numeric features.
N_AVAIL = len(numeric_feature_cols)
GRID = int(np.floor(np.sqrt(N_AVAIL)))
N_PIXELS = GRID * GRID
# The conv stack (3x3 conv -> 2x2 pool -> 3x3 conv, no padding) needs GRID >= 8.
if GRID < 8:
    raise ValueError(f'Not enough numeric features for CNN grid: n={N_AVAIL} (need at least 64)')
print('grid:', GRID, 'x', GRID, 'using', N_PIXELS, 'of', N_AVAIL, 'features')
BATCH_SIZE = 512
EPOCHS = 3
LR = 1e-3

# Select top features on training window
X_fs = panel_train[numeric_feature_cols].replace([np.inf, -np.inf], np.nan)
y_fs = panel_train[TARGET_DIR_COL].astype(int)
med = X_fs.median(axis=0)
X_imp = X_fs.fillna(med)

mi = mutual_info_classif(X_imp.to_numpy(), y_fs.to_numpy(), random_state=SEED)
mi_s = pd.Series(mi, index=numeric_feature_cols).sort_values(ascending=False)
sel = mi_s.head(N_PIXELS).index.tolist()
print('selected', len(sel), 'features for', GRID, 'x', GRID)

# Cluster/order features so correlated features are near each other in the grid
corr = X_imp.loc[:, sel].corr().fillna(0.0)
dist = (1.0 - corr.abs()).clip(lower=0.0)
Z = linkage(squareform(dist.values, checks=False), method='average')
order = leaves_list(Z)
sel_ordered = [sel[i] for i in order]

# Train normalization stats
train_vals = panel_train.loc[:, sel_ordered].replace([np.inf, -np.inf], np.nan)
train_med = train_vals.median(axis=0)
train_mean = train_vals.fillna(train_med).mean(axis=0)
train_std = train_vals.fillna(train_med).std(axis=0, ddof=0).replace(0.0, 1.0)


def make_images(df: pd.DataFrame):
    X = df.loc[:, sel_ordered].replace([np.inf, -np.inf], np.nan).fillna(train_med)
    X = ((X - train_mean) / train_std).to_numpy(dtype=np.float32)
    X = X.reshape((len(df), 1, GRID, GRID))
    y = df[TARGET_DIR_COL].astype(int).to_numpy(dtype=np.int64)
    dates = df.index.to_numpy(dtype='datetime64[ns]')
    assets = df['Asset_ID'].to_numpy()
    return X, y, dates, assets

X_tr, y_tr, d_tr, a_tr = make_images(panel_train.loc[:, ['Asset_ID', TARGET_DIR_COL] + sel_ordered])
X_va, y_va, d_va, a_va = make_images(panel_val.loc[:, ['Asset_ID', TARGET_DIR_COL] + sel_ordered])
X_te, y_te, d_te, a_te = make_images(panel_test.loc[:, ['Asset_ID', TARGET_DIR_COL] + sel_ordered])

print('X_tr:', X_tr.shape, 'X_te:', X_te.shape)

class ImgDS(Dataset):
    def __init__(self, X, y):
        self.X = torch.from_numpy(X)
        self.y = torch.from_numpy(y)
    def __len__(self):
        return self.X.shape[0]
    def __getitem__(self, i):
        return self.X[i], self.y[i]

train_loader = DataLoader(ImgDS(X_tr, y_tr), batch_size=BATCH_SIZE, shuffle=True)
val_loader = DataLoader(ImgDS(X_va, y_va), batch_size=BATCH_SIZE, shuffle=False)

class LeNetLike(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=3, padding=0),
            nn.Tanh(),
            nn.AvgPool2d(2),
            nn.Conv2d(6, 16, kernel_size=3, padding=0),
            nn.Tanh(),
            # Make output shape independent of GRID
            nn.AdaptiveAvgPool2d((1, 1)),
            nn.Flatten(),
        )
        self.fc = nn.Sequential(
            nn.Linear(16, 64),
            nn.Tanh(),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        x = self.conv(x)
        return self.fc(x).squeeze(-1)

model = LeNetLike().to(device)
opt = torch.optim.Adam(model.parameters(), lr=LR)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(1, EPOCHS + 1):
    model.train()
    for xb, yb in train_loader:
        xb = xb.to(device)
        yb = yb.float().to(device)
        opt.zero_grad(set_to_none=True)
        logits = model(xb)
        loss = loss_fn(logits, yb)
        loss.backward()
        opt.step()

    model.eval()
    all_p, all_y = [], []
    with torch.no_grad():
        for xb, yb in val_loader:
            xb = xb.to(device)
            p = torch.sigmoid(model(xb)).cpu().numpy()
            all_p.append(p)
            all_y.append(yb.numpy())
    if all_p:
        import sklearn.metrics as skm
        p = np.concatenate(all_p)
        y = np.concatenate(all_y)
        if len(np.unique(y)) > 1:
            print('epoch', epoch, 'val_auc', float(skm.roc_auc_score(y, p)))

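# Score the test set in batches so the whole array never has to fit on the device at once.
model.eval()
test_loader = DataLoader(ImgDS(X_te, y_te), batch_size=BATCH_SIZE, shuffle=False)
test_p = []
with torch.no_grad():
    for xb, _ in test_loader:
        test_p.append(torch.sigmoid(model(xb.to(device))).cpu().numpy())
proba = np.concatenate(test_p)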

MODEL_TITLE = 'LeNet5-like 2D CNN on Feature Grid - Time Split'

dates_test = pd.to_datetime(d_te)
assets_test = a_te
scores_test = proba - 0.5


In [None]:
from src.backtester.data import load_cleaned_assets, align_close_prices
from src.backtester.engine import BacktestConfig, run_backtest
from src.backtester.report import compute_backtest_report
from src.backtester.bokeh_plots import build_interactive_portfolio_layout
from src.backtester.portfolio import equal_weight

# Convert per-row predictions to pred_matrix[date, asset]
long = pd.DataFrame({'Date': dates_test, 'Asset_ID': assets_test, 'score': scores_test})
pred_matrix = long.pivot_table(index='Date', columns='Asset_ID', values='score', aggfunc='mean').sort_index()

bt_assets = sorted(panel['Asset_ID'].unique().tolist())
assets_ohlcv = load_cleaned_assets(symbols=bt_assets, cleaned_dir=str(PROJECT_ROOT / 'dataset' / 'cleaned'))
close_prices = align_close_prices(assets_ohlcv).sort_index()

# Slice the backtest to the TEST window so the reported Start is the test start, not 2016
bt_start = pd.Timestamp(panel_test.index.min())
bt_end = pd.Timestamp(panel_test.index.max())
close_prices = close_prices.loc[bt_start:bt_end]

assets_ohlcv_slice = {k: v.loc[bt_start:bt_end] for k, v in assets_ohlcv.items()}
market_df = pd.DataFrame({
    'Open': pd.concat([d['Open'] for d in assets_ohlcv_slice.values()], axis=1).mean(axis=1),
    'High': pd.concat([d['High'] for d in assets_ohlcv_slice.values()], axis=1).mean(axis=1),
    'Low': pd.concat([d['Low'] for d in assets_ohlcv_slice.values()], axis=1).mean(axis=1),
    'Close': pd.concat([d['Close'] for d in assets_ohlcv_slice.values()], axis=1).mean(axis=1),
    'Volume': pd.concat([d['Volume'] for d in assets_ohlcv_slice.values()], axis=1).sum(axis=1),
}).sort_index()

pred_matrix = pred_matrix.reindex(close_prices.index)
output_notebook()

rebal_dates = set(pd.Series(pred_matrix.index, index=pred_matrix.index).resample(REBALANCE_FREQ).last().dropna().tolist())

w_last = pd.Series(0.0, index=bt_assets)
w_rows = []
for dt in pred_matrix.index:
    if dt in rebal_dates:
        row = pred_matrix.loc[dt].dropna().sort_values(ascending=False)
        top = row.head(min(TOP_K, len(row)))
        candidates = [a for a, v in top.items() if np.isfinite(v) and float(v) > 0.0]
        if len(candidates) == 0:
            w_last = pd.Series(0.0, index=bt_assets)
        else:
            w_dict = equal_weight(candidates)
            w_last = pd.Series(0.0, index=bt_assets)
            for a, w in w_dict.items():
                if a in w_last.index:
                    w_last[a] = float(w)
    w_rows.append(w_last)

weights = pd.DataFrame(w_rows, index=pred_matrix.index, columns=bt_assets).fillna(0.0)

cfg = BacktestConfig(initial_equity=INITIAL_EQUITY, transaction_cost_bps=TXN_COST_BPS, mode='vectorized')
res = run_backtest(close_prices, weights, config=cfg)
report = compute_backtest_report(result=res, close_prices=close_prices)
display(report.to_frame(MODEL_TITLE))

layout = build_interactive_portfolio_layout(
    market_ohlcv=market_df,
    equity=res.equity,
    returns=res.returns,
    weights=res.weights,
    turnover=res.turnover,
    costs=res.costs,
    close_prices=close_prices,
    title=MODEL_TITLE,
)
show(layout)
