<a href="https://colab.research.google.com/github/Shadfurman/ArbitrAvatar/blob/master/trading_bot_scaffold.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Trading Bot: Conv-Transformer → Backtest (Scaffold)

End-to-end notebook scaffold. Fill in the TODOs only—no scope creep.

**Sections:**
- Setup & seed
- Data & features
- Split
- Model
- Train
- Inference & backtest
- Metrics & plots
- Params dump & outputs


## Setup & seed

In [1]:
# --- Setup & seed ---

import os, random, json, glob, math
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import IterableDataset, DataLoader

# Paths (adjust as needed; do not restructure unless necessary)
PARQUET_DIR = '/content/drive/MyDrive/stock_market_data/data_parquet'
OUTPUT_DIR  = '/content/drive/MyDrive/Colab Notebooks/outputs'
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Hyperparameters and configuration
# Define default parameters. You should update values (especially 'features' and date ranges) for your dataset.
PARAMS = {
    'seed': 42,
    'features': ['open', 'high', 'low', 'close', 'volume'],
    'lookback': 64,
    'batch_size': 512,
    'model_option': 'convtransformer',
    'split': {
        'train_start': '2005-01-01',
        'train_end':   '2015-12-31',
        'val_start':   '2016-01-01',
        'val_end':     '2017-12-31',
        'test_start':  '2018-01-01',
        'test_end':    '2018-12-31',
    },
    'calibration_T': 1.0,
    'margin': 0.15,
    'portfolio_mode': 'long_short',
}

# Set random seeds
SEED = PARAMS.get('seed', 42)
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)


In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# --- BOOTSTRAP A: mount + load params/norms + rebuild path lists (no model deps) ---

# Artifact paths
ART = {
    'lock': os.path.join(OUTPUT_DIR, 'params_locked.json'),
    'norms': os.path.join(OUTPUT_DIR, 'feature_norm.csv'),
    'ckpt': os.path.join(OUTPUT_DIR, 'best_model.pt'),
}
PRELOAD = {k: os.path.exists(v) for k, v in ART.items()}

# Merge locked params if present
if PRELOAD['lock']:
    with open(ART['lock'], 'r') as f:
        locked_params = json.load(f)
    PARAMS.update(locked_params)

# Load normalization stats if present and define normalize_inplace
FEAT_MEAN, FEAT_STD = {}, {}
if PRELOAD['norms']:
    norm_df = pd.read_csv(ART['norms'], index_col=0)
    FEAT_MEAN = norm_df['mean'].astype(float).to_dict()
    FEAT_STD  = norm_df['std'].replace(0, 1.0).astype(float).to_dict()
    def normalize_inplace(df: pd.DataFrame) -> pd.DataFrame:
        for col in PARAMS['features']:
            if col in df.columns:
                df[col] = (df[col] - FEAT_MEAN.get(col, 0.0)) / FEAT_STD.get(col, 1.0)
        return df
else:
    def normalize_inplace(df: pd.DataFrame) -> pd.DataFrame:
        return df

# Helper to count rows in a date range for a parquet file (lightweight)
def count_rows_in_range(p: str, start: str, end: str, any_col: str | None = None) -> int:
    any_col = any_col or (PARAMS['features'] + ['y'])[0]
    df = pd.read_parquet(p, columns=[any_col]).sort_index()
    df = df.loc[(df.index >= pd.to_datetime(start)) & (df.index <= pd.to_datetime(end))]
    return len(df)

# Discover parquet files and apply lookback+2 filter on train window
all_paths = sorted(glob.glob(os.path.join(PARQUET_DIR, '*.parquet')))
S = PARAMS['split']
lookback = int(PARAMS['lookback'])

# Keep tickers with enough rows in the train window
keep_paths = [p for p in all_paths if count_rows_in_range(p, S['train_start'], S['train_end']) >= lookback + 2]

def filt(ps, a, b):
    return [p for p in ps if count_rows_in_range(p, a, b) >= lookback + 2]

train_paths = filt(keep_paths, S['train_start'], S['train_end'])
val_paths   = filt(keep_paths, S['val_start'],   S['val_end'])
test_paths  = filt(keep_paths, S['test_start'],  S['test_end'])

print("Bootstrap A →", f"lock={PRELOAD['lock']} norms={PRELOAD['norms']} ckpt={PRELOAD['ckpt']} |",
      f"tickers train/val/test = {len(train_paths)}/{len(val_paths)}/{len(test_paths)}")


Bootstrap A → lock=True norms=True ckpt=True | tickers train/val/test = 2558/2415/2406
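
`normalize_inplace` above applies a per-feature z-score using the statistics in `feature_norm.csv`, with a zero standard deviation replaced by 1.0. How that file was produced is not shown in this notebook section. The sketch below assumes the statistics are fitted on training-window rows only, so that validation and test data do not influence the scaling. The helper names are hypothetical, and it uses the population standard deviation, which may differ from whatever convention produced the real file.

```python
import statistics

def fit_norm(train_values):
    """Return (mean, std) from training values; a zero std falls back to 1.0."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)  # population std (assumption)
    return mean, (std if std > 0 else 1.0)

def apply_norm(values, mean, std):
    """Z-score values with previously fitted statistics."""
    return [(v - mean) / std for v in values]

train = [1.0, 2.0, 3.0, 4.0, 5.0]   # toy training-window feature values
test = [6.0, 3.0]                   # toy out-of-sample values

mean, std = fit_norm(train)
print(mean, round(std, 4))                                  # 3.0 1.4142
print([round(z, 4) for z in apply_norm(test, mean, std)])   # [2.1213, 0.0]

# A constant feature has std 0; the fallback keeps the division well-defined.
print(fit_norm([7.0, 7.0, 7.0]))                            # (7.0, 1.0)
```

Fitting once and reusing the same `(mean, std)` for every split mirrors how `FEAT_MEAN` and `FEAT_STD` are loaded a single time in Bootstrap A.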


In [7]:
# --- LOCAL OVERRIDES (optional, session-only; not persisted) ---
# These only affect this runtime unless you later write/lock params.
LOCAL_OVERRIDES = {
    # portfolio behavior
    "portfolio_mode": "long_only",   # or "long_short"
    "margin": 0.02,

    # calibration
    "calibration_T": PARAMS.get("calibration_T", 1.0),
}

print("LOCAL_OVERRIDES set:", LOCAL_OVERRIDES)


LOCAL_OVERRIDES set: {'portfolio_mode': 'long_only', 'margin': 0.02, 'calibration_T': 0.899}
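
The precedence across these cells is defaults, then `params_locked.json`, then `LOCAL_OVERRIDES`, each applied with `dict.update`. A minimal sketch with made-up values (the dictionaries below are illustrative, not the notebook's real parameters) shows one consequence: `update` is shallow, so an override that supplies a nested dictionary such as `split` replaces it wholesale instead of merging keys.

```python
# Illustrative values only; these are not the notebook's real parameters.
defaults = {
    "margin": 0.15,
    "portfolio_mode": "long_short",
    "split": {"train_start": "2005-01-01", "train_end": "2015-12-31"},
}
locked = {"margin": 0.10}                      # stand-in for params_locked.json
overrides = {                                  # stand-in for LOCAL_OVERRIDES
    "margin": 0.02,
    "split": {"train_end": "2014-12-31"},
}

params = dict(defaults)      # shallow copy, so `defaults` itself is not rebound
params.update(locked)        # locked values win over defaults
params.update(overrides)     # session overrides win over locked values

print(params["margin"])          # 0.02
print(params["portfolio_mode"])  # long_short (untouched by either layer)
print(params["split"])           # {'train_end': '2014-12-31'}; train_start is gone
```

If a partial override of `split` is ever needed, updating the nested dictionary directly (for example `params["split"].update(...)`) avoids losing the other date keys.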


## Data & features

In [4]:
import yfinance as yf

def load_ohlcv_yf(ticker: str, start: str, end: str) -> pd.DataFrame:
    """Load daily OHLCV from yfinance. TODO: cache to CSV in ./data."""
    df = yf.download(ticker, start=start, end=end, progress=False)
    if df.empty:
        raise ValueError(f'No data for {ticker}')
    # Some yfinance versions return MultiIndex columns (field, ticker); flatten to field names.
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = df.columns.get_level_values(0)
    df = df[['Open','High','Low','Close','Volume']].copy()
    df.index.name = 'Date'
    df['Ticker'] = ticker
    return df

def compute_rsi(series: pd.Series, period: int = 14) -> pd.Series:
    series = series.astype(float)
    delta = series.diff()
    up = delta.clip(lower=0.0)
    down = (-delta).clip(lower=0.0)
    roll_up = up.ewm(alpha=1/period, adjust=False).mean()
    roll_down = down.ewm(alpha=1/period, adjust=False).mean()
    rs = roll_up / (roll_down + 1e-8)
    return 100.0 - (100.0 / (1.0 + rs))

def make_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Compute pct-change, rolling mean/vol(10), RSI(14) on adjusted close if present.
    Add next-day direction target y (1 if up, else 0). Drops NaNs and sorts index.
    """
    out = df.copy()
    out = out.sort_index()

    # Prefer adjusted. If converter already replaced Close with adjusted, this is a no-op.
    price = out['AdjClose'] if 'AdjClose' in out.columns else out['Close']

    out['pct_change']   = price.pct_change()
    out['roll_mean_10'] = out['pct_change'].rolling(window=10, min_periods=10).mean()
    out['roll_vol_10']  = out['pct_change'].rolling(window=10, min_periods=10).std()
    out['rsi_14']       = compute_rsi(price, 14)

    # Target uses forward return on the same price series
    out['ret_fwd_1'] = price.pct_change().shift(-1)
    out['y'] = (out['ret_fwd_1'] > 0).astype(int)

    # Final cleanup: drop rows with any NaNs in features/target
    out = out.dropna(subset=['pct_change', 'roll_mean_10', 'roll_vol_10', 'rsi_14', 'ret_fwd_1', 'y'])
    return out

def load_all_tickers(tickers, start, end):
    frames = []
    for t in tickers:
        raw = load_ohlcv_yf(t, start, end)
        feats = make_features(raw)
        frames.append(feats)
    return pd.concat(frames).sort_index()

# TODO: Optionally visualize a sample
print('Data/feature functions defined. Fill TODOs as needed.')


Data/feature functions defined. Fill TODOs as needed.
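
The target in `make_features` is built by shifting the daily percentage change back by one row, so row t carries the return realised from t to t+1 and the last row has no target. A small pure-Python sketch of that alignment on toy prices (the helper name is hypothetical):

```python
def forward_returns_and_labels(prices):
    """For each day t, return (ret_fwd_1, y) where ret_fwd_1 = p[t+1] / p[t] - 1.

    The last day has no next-day price, so it is omitted, mirroring the
    dropna on 'ret_fwd_1' in make_features.
    """
    rows = []
    for t in range(len(prices) - 1):
        ret = prices[t + 1] / prices[t] - 1.0
        rows.append((ret, 1 if ret > 0 else 0))  # flat days count as "not up"
    return rows

prices = [100.0, 110.0, 99.0, 99.0]   # toy closing prices
for t, (ret, y) in enumerate(forward_returns_and_labels(prices)):
    print(t, round(ret, 4), y)
# 0 0.1 1
# 1 -0.1 0
# 2 0.0 0
```

Because the label at row t already looks one day ahead, any downstream code that pairs predictions with `ret_fwd_1` should use the same row index for both.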


## CSV→Parquet converter

In [None]:
'''
# --- CSV → Parquet (one-time) ---
import glob

# Your paths
OUTPUT_DIR = './outputs'
DATA_IN_DIRS = [
    '/content/drive/MyDrive/stock_market_data/nasdaq/csv',
    '/content/drive/MyDrive/stock_market_data/nyse/csv',
]
#PARQUET_DIR = '/content/data_parquet'
os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(PARQUET_DIR, exist_ok=True)

# Map your CSV headers to canonical names we use elsewhere
CSV_COLS = {
    'Date': 'Date',
    'Open': 'Open',
    'High': 'High',
    'Low': 'Low',
    'Close': 'Close',                 # raw close (we'll replace with Adjusted Close for features)
    'Volume': 'Volume',
    'Adjusted Close': 'AdjClose',
}

def guess_ticker_from_path(p):
    import os
    return os.path.splitext(os.path.basename(p))[0].upper()

def load_csv_one(p):
    # Read permissively (handles odd rows/column orders)
    df = pd.read_csv(
        p,
        engine='python',           # robust parser
        on_bad_lines='skip'        # skip corrupted rows
    )

    # Normalize column names (lower -> canonical)
    orig_cols = df.columns.tolist()
    lc_map = {c.lower().strip(): c for c in df.columns}

    # Find variants
    date_col = lc_map.get('date')
    open_col = lc_map.get('open')
    high_col = lc_map.get('high')
    low_col  = lc_map.get('low')
    close_col= lc_map.get('close')
    vol_col  = lc_map.get('volume')
    adj_col  = lc_map.get('adjusted close') or lc_map.get('adj close') or lc_map.get('adjclose')

    # Hard fail if no Date
    if not date_col:
        raise ValueError(f'No Date column in {p} (had: {orig_cols})')

    # Build a minimal frame with whatever we found
    cols = {}
    cols['Date']   = date_col
    if open_col:  cols['Open']  = open_col
    if high_col:  cols['High']  = high_col
    if low_col:   cols['Low']   = low_col
    if close_col: cols['Close'] = close_col
    if vol_col:   cols['Volume']= vol_col
    if adj_col:   cols['AdjClose'] = adj_col

    df = df.rename(columns={v:k for k,v in cols.items()})[list(cols.keys())]

    # Parse day-first dates like 15-12-2010
    df['Date'] = pd.to_datetime(df['Date'], dayfirst=True, errors='coerce')
    df = df.dropna(subset=['Date']).sort_values('Date').set_index('Date')

    # Prefer adjusted close for features; fall back to Close
    if 'AdjClose' in df.columns:
        df['Close_raw'] = df.get('Close', df['AdjClose'])
        df['Close'] = df['AdjClose']
    else:
        # No adjusted close available; use raw Close
        if 'Close' not in df.columns:
            raise ValueError(f'No Close/AdjClose in {p} (had: {orig_cols})')
        df['Close_raw'] = df['Close']

    # Coerce numerics (strip stray chars like $ or commas)
    for c in ['Open','High','Low','Close','Volume','AdjClose','Close_raw']:
        if c in df.columns:
            df[c] = (df[c].astype(str)
                           .str.replace(r'[^0-9\.\-eE]', '', regex=True)
                           .replace({'': None}))
            df[c] = pd.to_numeric(df[c], errors='coerce')

    # Basic sanity: drop rows missing Close or Volume
    need = ['Close']
    if 'Volume' in df.columns: need.append('Volume')
    df = df.dropna(subset=need)

    df['Ticker'] = guess_ticker_from_path(p)
    return df

def iter_csv_files():
    for d in DATA_IN_DIRS:
        for p in glob.glob(os.path.join(d, '*.csv')):
            yield p

def build_parquet_once():
    converted, skipped = 0, []
    for p in iter_csv_files():
        tkr = guess_ticker_from_path(p)
        outp = os.path.join(PARQUET_DIR, f'{tkr}.parquet')
        if os.path.exists(outp):
            continue
        try:
            df = load_csv_one(p)
            feats = make_features(df)  # uses adjusted Close internally now
            feats = feats.drop(columns=[c for c in ['AdjClose','Close_raw'] if c in feats.columns], errors='ignore')
            feats.to_parquet(outp)
            converted += 1
            if converted % 50 == 0:
                print(f'Converted {converted} tickers...')
        except Exception as e:
            skipped.append((tkr, str(e)))
            if len(skipped) <= 5:
                print(f'[WARN] Skipping {tkr}: {e}')
    print(f'Parquet conversion complete. Total tickers converted: {converted}. Skipped: {len(skipped)}')
    if skipped:
        import json
        with open(os.path.join(OUTPUT_DIR, 'skipped_files.json'), 'w') as f:
            json.dump(skipped, f, indent=2)
        print('Wrote outputs/skipped_files.json')

build_parquet_once()
'''



## Parquet Loader

In [5]:
# --- Parquet path index (no big concat) ---
import os, glob
import pandas as pd

# Gather all ticker parquet files
paths = sorted(glob.glob(os.path.join(PARQUET_DIR, "*.parquet")))
S = PARAMS['split']
L = int(PARAMS['lookback'])

# Count rows for a single file within a date window, using a light read when possible
def count_rows_in_range(p, start, end):
    any_col = (PARAMS.get('features', []) + ['y', 'Close', 'Open'])[0]
    try:
        df = pd.read_parquet(p, columns=[any_col])
    except Exception:
        df = pd.read_parquet(p)

    # Ensure DatetimeIndex
    if not isinstance(df.index, pd.DatetimeIndex):
        if 'Date' in df.columns:
            df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
            df = df.dropna(subset=['Date']).set_index('Date')
        else:
            return 0

    df = df.sort_index()
    df = df.loc[(df.index >= pd.to_datetime(start)) & (df.index <= pd.to_datetime(end))]
    return len(df)

# Keep only tickers that have enough rows in ALL splits
keep_paths = []
for p in paths:
    n_tr = count_rows_in_range(p, S['train_start'], S['train_end'])
    n_va = count_rows_in_range(p, S['val_start'],   S['val_end'])
    n_te = count_rows_in_range(p, S['test_start'],  S['test_end'])
    if (n_tr >= L + 2) and (n_va >= L + 2) and (n_te >= L + 2):
        keep_paths.append(p)

# Build per-split lists from the intersection (defensive re-check)
def filter_paths_by_range(ps, start, end):
    out = []
    for p in ps:
        if count_rows_in_range(p, start, end) >= L + 2:
            out.append(p)
    return out

train_paths = filter_paths_by_range(keep_paths, S['train_start'], S['train_end'])
val_paths   = filter_paths_by_range(keep_paths, S['val_start'],   S['val_end'])
test_paths  = filter_paths_by_range(keep_paths, S['test_start'],  S['test_end'])

print("Parquet files scanned:", len(paths))
print("tickers kept (all splits ≥ lookback+2):", len(keep_paths))
print(f"train_paths: {len(train_paths)}, val_paths: {len(val_paths)}, test_paths: {len(test_paths)}")


tickers kept: 2406
train_paths: 2406, val_paths: 2406, test_paths: 2406


## Single-split indexing + streaming loader

We use `PARAMS['split']` (train/val/test date ranges) and Bootstrap A to build `train_paths`, `val_paths`, and `test_paths`; each file in a list has at least `lookback + 2` rows in that split's window.

- Data are streamed per ticker via `TickerIterableDataset`.
- Feature normalization uses `feature_norm.csv`, loaded in Bootstrap A, and is applied on the fly.
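
To make the windowing concrete, here is a minimal pure-Python sketch of the indexing that `TickerIterableDataset` (defined further down) uses: for each row i from `lookback` onward, the input is rows `[i - lookback, i)` and the label is row i's `y`. The function name and toy data are illustrative only.

```python
def sliding_windows(features, labels, lookback):
    """Yield (window, label) with window = rows [i - lookback, i) and label = labels[i]."""
    for i in range(lookback, len(features)):
        yield features[i - lookback:i], labels[i]

features = [[0.0], [0.1], [0.2], [0.3], [0.4]]   # 5 rows, 1 feature each
labels = [0, 1, 0, 1, 1]

pairs = list(sliding_windows(features, labels, lookback=3))
print(len(pairs))   # 2, i.e. len(features) - lookback
print(pairs[0])     # ([[0.0], [0.1], [0.2]], 1)
print(pairs[1])     # ([[0.1], [0.2], [0.3]], 1)
```

Note that the window stops at row i-1, so the features of row i itself are not part of the input for label i. A generator like this produces one window at a time, which is what keeps memory use small compared with materialising every window up front.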

In [9]:
# --- Build per-split ticker file lists (tiny memory) ---
import os, glob, pandas as pd

paths = sorted(glob.glob(os.path.join(PARQUET_DIR, "*.parquet")))
S = PARAMS["split"]
L = int(PARAMS["lookback"])

def count_rows_in_range(p, start, end):
    """Lightweight row count for a single file within [start, end], avoiding big reads."""
    any_col = (PARAMS.get("features", []) + ["y", "Close", "Open"])[0]
    try:
        df = pd.read_parquet(p, columns=[any_col])
    except Exception:
        df = pd.read_parquet(p)

    # Ensure DatetimeIndex
    if not isinstance(df.index, pd.DatetimeIndex):
        if "Date" in df.columns:
            df["Date"] = pd.to_datetime(df["Date"], errors="coerce")
            df = df.dropna(subset=["Date"]).set_index("Date")
        else:
            return 0

    df = df.sort_index()
    df = df.loc[(df.index >= pd.to_datetime(start)) & (df.index <= pd.to_datetime(end))]
    return len(df)

# Keep only tickers with enough rows in ALL splits
keep_paths = []
for p in paths:
    n_tr = count_rows_in_range(p, S["train_start"], S["train_end"])
    n_va = count_rows_in_range(p, S["val_start"],   S["val_end"])
    n_te = count_rows_in_range(p, S["test_start"],  S["test_end"])
    if (n_tr >= L + 2) and (n_va >= L + 2) and (n_te >= L + 2):
        keep_paths.append(p)

def filter_paths(ps, start, end):
    out = []
    for p in ps:
        if count_rows_in_range(p, start, end) >= L + 2:
            out.append(p)
    return out

train_paths = filter_paths(keep_paths, S["train_start"], S["train_end"])
val_paths   = filter_paths(keep_paths, S["val_start"],   S["val_end"])
test_paths  = filter_paths(keep_paths, S["test_start"],  S["test_end"])

print(f"Kept: {len(keep_paths)} tickers | "
      f"train={len(train_paths)} val={len(val_paths)} test={len(test_paths)}")


Kept: 2406 tickers | train=2406 val=2406 test=2406


In [10]:
# --- Streaming dataset: reads per-ticker on the fly, applies normalization, yields windows ---
import os, pandas as pd, numpy as np, torch
from torch.utils.data import IterableDataset, DataLoader

feat_cols = PARAMS["features"]
lb = int(PARAMS["lookback"])

class TickerIterableDataset(IterableDataset):
    def __init__(self, parquet_paths, split_start, split_end):
        self.paths = parquet_paths
        self.start = pd.to_datetime(split_start)
        self.end   = pd.to_datetime(split_end)

    def __iter__(self):
        for p in self.paths:
            # read only what we need; keep memory tiny
            df = pd.read_parquet(p, columns=feat_cols + ["y"]).sort_index()
            # copy so normalize_inplace writes to its own frame, not a slice
            df = df.loc[(df.index >= self.start) & (df.index <= self.end)].copy()
            if len(df) < lb + 2:
                continue
            df = normalize_inplace(df)
            X = df[feat_cols].to_numpy(dtype=np.float32, copy=False)
            y = df["y"].to_numpy(dtype=np.int64,   copy=False)

            # yield sliding windows without prebuilding a giant array
            for i in range(lb, len(df)):
                yield X[i - lb:i], y[i]

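# DataLoaders: keep num_workers=0 in Colab to limit RAM. With an IterableDataset, each extra
# worker would also replay the full stream unless __iter__ shards the paths per worker.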
S = PARAMS['split']
BATCH = min(1024, int(PARAMS.get("batch_size", 128)))

train_loader = DataLoader(
    TickerIterableDataset(train_paths, S["train_start"], S["train_end"]),
    batch_size=BATCH, shuffle=False, num_workers=0, pin_memory=True,
)

val_loader = DataLoader(
    TickerIterableDataset(val_paths, S["val_start"], S["val_end"]),
    batch_size=BATCH, shuffle=False, num_workers=0, pin_memory=True,
)

test_loader = DataLoader(
    TickerIterableDataset(test_paths, S["test_start"], S["test_end"]),
    batch_size=BATCH, shuffle=False, num_workers=0, pin_memory=True,
)

print(f"Streaming loaders ready. batch={BATCH} | tickers train/val/test = "
      f"{len(train_paths)}/{len(val_paths)}/{len(test_paths)}")


Streaming loaders ready. batch=1024 | tickers train/val/test = 2406/2406/2406


## Model (Option A/B/C stubs)

In [11]:
#       *** DEFINE MODEL ***
import torch
from torch import nn

class ConvTransformer(nn.Module):
    def __init__(self, in_feats, d_model=64, nhead=8, dim_ff=128, num_layers=2, dropout=0.1, num_classes=2):
        super().__init__()
        self.proj = nn.Conv1d(in_channels=in_feats, out_channels=d_model, kernel_size=3, padding=1)
        enc_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=nhead, dim_feedforward=dim_ff, dropout=dropout, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, x):
        # x: (B, T, F)
        x = x.transpose(1, 2)          # (B, F, T)
        x = torch.relu(self.proj(x))   # (B, d_model, T)
        x = x.transpose(1, 2)          # (B, T, d_model)
        x = self.encoder(x)            # (B, T, d_model)
        x = x.mean(dim=1)              # GAP over time
        return self.fc(x)

class CNN_GRU(nn.Module):
    """Option B: 1D-CNN + small GRU head → logits(2)."""
    def __init__(self, in_feats, hidden=32, num_classes=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels=in_feats, out_channels=hidden, kernel_size=3, padding=1),
            nn.ReLU()
        )
        self.gru = nn.GRU(input_size=hidden, hidden_size=hidden, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)
    def forward(self, x):
        x = x.transpose(1, 2)       # (B, F, T)
        x = self.conv(x)            # (B, H, T)
        x = x.transpose(1, 2)       # (B, T, H)
        _, h = self.gru(x)          # h: (1, B, H)
        h = h.squeeze(0)
        return self.fc(h)

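def build_model(option: str, in_feats: int):
    # Accept both the short code and the default PARAMS['model_option'] name.
    if option in ('A', 'convtransformer'):
        return ConvTransformer(in_feats)
    raise ValueError(f'Only Option A (ConvTransformer) is enabled right now; got {option!r}')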

print('Model stubs ready (A/B in PyTorch, C via sklearn).')


Model stubs ready (A/B in PyTorch, C via sklearn).


In [12]:
# --- BOOTSTRAP B: attach checkpoint, set session settings, and quick smoke test ---
import os, torch

# 0) Device
device = (
    "cuda" if torch.cuda.is_available()
    else ("mps" if getattr(torch.backends, "mps", None) and torch.backends.mps.is_available() else "cpu")
)

# 0b) Optional local overrides (define LOCAL_OVERRIDES in any earlier cell if you want to override locked params)
if "LOCAL_OVERRIDES" in globals() and isinstance(LOCAL_OVERRIDES, dict) and LOCAL_OVERRIDES:
    PARAMS.update(LOCAL_OVERRIDES)

# 1) Build model (requires build_model + PARAMS['features'])
assert "build_model" in globals(), "Define build_model(...) before running Bootstrap B."
num_features = len(PARAMS["features"])
try:
    model = build_model(PARAMS.get("model_option"), in_feats=num_features)
except TypeError:
    # Fallback if your build_model signature is simpler
    model = build_model(num_features)
model = model.to(device)

# 2) Load checkpoint (robust to different save formats)
ckpt_path = os.path.join(OUTPUT_DIR, "best_model.pt")
optimizer = None
CKPT_META = {}
if os.path.exists(ckpt_path):
    state = torch.load(ckpt_path, map_location=device)
    if isinstance(state, dict):
        if "model" in state:           # common: {"model": state_dict, "optimizer": ..., "epoch": ...}
            state_dict = state["model"]
        elif "state_dict" in state:     # some trainers
            state_dict = state["state_dict"]
        else:                           # assume it's already a state_dict-like mapping
            state_dict = state
    else:
        state_dict = state

    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    model.eval()

    # Optional: resume optimizer if you have make_optimizer(...)
    if isinstance(state, dict) and "optimizer" in state and "make_optimizer" in globals():
        try:
            optimizer = make_optimizer(model, PARAMS)
            optimizer.load_state_dict(state["optimizer"])
        except Exception:
            optimizer = None  # don't fail Bootstrap if optimizer schema changed

    CKPT_META = {
        "epoch": (state.get("epoch") if isinstance(state, dict) else None),
        "best_metric": (state.get("best_metric") if isinstance(state, dict) else None),
        "missing_keys": missing, "unexpected_keys": unexpected,
    }
    print(f"Loaded checkpoint: {ckpt_path}")
else:
    model.eval()
    print("No checkpoint found — train once; this cell will auto-load on next run.")

# 3) Session settings (temperature, margin, mode)
T_best = float(PARAMS.get("calibration_T", 1.0))
margin = float(PARAMS.get("margin", 0.15))
mode   = str(PARAMS.get("portfolio_mode", "long_short"))
print(f"Session settings → T={T_best:.3f}, margin={margin}, mode={mode}, device={device}")

# 4) Lightweight inference helper (applies temperature; works for binary or multi-class)
@torch.no_grad()
def model_probs(batch_windows, T=None):
    """batch_windows: numpy/torch with shape [B, L, F]. Returns numpy probs."""
    import numpy as np
    x = torch.as_tensor(batch_windows, dtype=torch.float32, device=device)
    logits = model(x)
    t = float(T if T is not None else T_best)
    if logits.ndim == 1 or logits.shape[-1] == 1:
        z = logits.squeeze(-1) / t
        return torch.sigmoid(z).detach().cpu().numpy()
    return torch.softmax(logits / t, dim=-1).detach().cpu().numpy()

# 5) Tiny smoke test (toggle on/off)
RUN_SMOKE = True
if RUN_SMOKE:
    import numpy as np, pandas as pd
    # gather candidate files from Bootstrap A
    cands = []
    for _nm in ("train_paths", "val_paths", "test_paths"):
        if _nm in globals():
            cands += globals()[_nm]
    if cands:
        try:
            sample = cands[0]
            need_cols = list(set(PARAMS["features"] + ["y"]))
            df = pd.read_parquet(sample, columns=need_cols).sort_index().tail(200)
            if "normalize_inplace" in globals():
                df = normalize_inplace(df)
            L = int(PARAMS["lookback"])
            if len(df) > L + 1:
                X = df[PARAMS["features"]].to_numpy(np.float32)
                X = np.stack([X[i-L:i] for i in range(L, len(X))], axis=0)
                p = model_probs(X)[:3]
                print("Smoke OK:", X.shape, "first probs:", np.round(p, 3))
            else:
                print("Smoke skipped: not enough rows in sample for lookback =", L)
        except Exception as e:
            print("Smoke failed (non-fatal):", type(e).__name__, str(e)[:200])
    else:
        print("Smoke skipped: no *_paths available (run Bootstrap A first).")


Loaded checkpoint: /content/drive/MyDrive/Colab Notebooks/outputs/best_model.pt
Session settings → T=0.899, margin=0.02, mode=long_only, device=cuda
Smoke OK: (136, 64, 4) first probs: [[0.513 0.487]
 [0.513 0.487]
 [0.513 0.487]]
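
`model_probs` above divides the logits by the calibration temperature before applying softmax. A minimal standard-library sketch of that operation on toy logits (the function name and values are illustrative) shows the direction of the effect: T below 1 sharpens the probabilities, T above 1 flattens them toward 0.5.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T, with max-subtraction for numerical stability."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.2, -0.1]   # toy two-class logits
for T in (2.0, 1.0, 0.5):
    p = softmax_with_temperature(logits, T)
    print(f"T={T}: {[round(x, 3) for x in p]}")
```

Temperature changes confidence but not the ranking of classes, so accuracy is unaffected while any probability threshold (such as `margin`) is crossed more or less often.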


In [None]:
# --- Train: stabilized (no AMP), lower LR, grad clipping ---
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Using device:", device)

model = build_model(PARAMS['model_option'], in_feats=len(PARAMS['features'])).to(device)
crit  = nn.CrossEntropyLoss()
opt   = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=1e-4)  # lower LR + small WD

def run_val(loader):
    model.eval(); correct=total=0
    with torch.no_grad():
        for xb, yb in loader:
            xb = xb.to(device); yb = yb.to(device)
            logits = model(xb)
            pred = logits.argmax(1)
            correct += (pred==yb).sum().item(); total += yb.numel()
    return correct / max(total,1)

best_acc = -1.0
best_path = os.path.join(OUTPUT_DIR, 'best_model.pt')

for ep in range(1, 3):  # start with 2 epochs to verify stability/speed
    model.train()
    running = 0.0; steps = 0
    for xb, yb in train_loader:
        xb = xb.to(device); yb = yb.to(device)
        opt.zero_grad(set_to_none=True)
        logits = model(xb)
        loss = crit(logits, yb)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # grad clip
        opt.step()
        running += float(loss.item()); steps += 1
        # (optional) speed probe:
        # if steps % 5000 == 0: break

    val_acc = run_val(val_loader)
    print(f"Epoch {ep}/2 | train_loss ~ {running/max(steps,1):.4f} | val_acc={val_acc:.3f}")

    if val_acc > best_acc:
        best_acc = val_acc
        torch.save(model.state_dict(), best_path)
        print(f"  ↳ saved {best_path} (best so far)")

print(f"Best val_acc: {best_acc:.3f}")


## Train (per split)

## Inference & backtest

In [14]:
# --- Eval dataset + inference + backtest (uses next-day returns) ---
import os, json, numpy as np, pandas as pd, torch
from torch.utils.data import IterableDataset, DataLoader

# 1) Dataset that yields (window, label, next-day return)
class TickerEvalIterableDataset(IterableDataset):
    def __init__(self, parquet_paths, start, end, features, lookback):
        self.paths = parquet_paths
        self.start = pd.to_datetime(start)
        self.end   = pd.to_datetime(end)
        self.features = features
        self.lookback = int(lookback)

    def __iter__(self):
        for p in self.paths:
            # We require 'ret_fwd_1' for PnL. If it's missing, raise a clear error.
            need_cols = self.features + ['y', 'ret_fwd_1']
            try:
                df = pd.read_parquet(p, columns=need_cols).sort_index()
            except Exception as e:
                raise RuntimeError(f"Could not read columns {need_cols} from {p}. "
                                   f"If 'ret_fwd_1' is missing, make sure your feature build saved that column.") from e

            df = df.loc[(df.index >= self.start) & (df.index <= self.end)].copy()
            if len(df) < self.lookback + 2:
                continue

            # use the same on-the-fly normalization as training
            df = normalize_inplace(df)

            X = df[self.features].to_numpy(dtype=np.float32, copy=False)
            y = df['y'].to_numpy(dtype=np.int64,   copy=False)
            r = df['ret_fwd_1'].to_numpy(dtype=np.float32, copy=False)

            for i in range(self.lookback, len(df)):
                yield X[i-self.lookback:i], y[i], r[i]

# 2) Small helpers
def loader_from_eval(paths, start, end, batch_size=1024):
    ds = TickerEvalIterableDataset(paths, start, end, PARAMS['features'], PARAMS['lookback'])
    return DataLoader(ds, batch_size=batch_size, shuffle=False, num_workers=0, pin_memory=True)

@torch.no_grad()
def infer_probs_and_returns(model, loader, device):
    model.eval()
    probs, rets, labels = [], [], []
    for xb, yb, rb in loader:
        xb = xb.to(device, non_blocking=True)
        logits = model(xb)
        p1 = torch.softmax(logits, dim=1)[:, 1].cpu().numpy()
        probs.append(p1); rets.append(rb.numpy()); labels.append(yb.numpy())
    if not probs:
        return np.array([]), np.array([]), np.array([])
    return np.concatenate(probs), np.concatenate(rets), np.concatenate(labels)

def positions_from_probs(probs, th_long, th_short):
    pos = np.zeros_like(probs, dtype=np.int8)
    pos[probs >= th_long] = 1
    pos[probs <= th_short] = -1
    return pos

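def backtest_from_positions(returns, positions, cost_bps=5):
    # Positions are lagged one step before being applied to returns.
    # Caveats: `returns` here is ret_fwd_1 (already the next-day return), so this lag adds an
    # extra one-bar delay; and the arrays are concatenated across tickers, so both the lag and
    # the equity curve run across ticker boundaries.
    pos_shift = np.roll(positions.astype(int), 1); pos_shift[0] = 0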
    # transaction costs when position changes
    trades = (pos_shift[1:] != pos_shift[:-1]).astype(int)
    costs = np.zeros_like(returns, dtype=np.float32)
    costs[1:] = trades * (cost_bps / 1e4)

    daily_pnl = pos_shift * returns - costs
    equity = (1.0 + daily_pnl).cumprod()
    return equity, trades, daily_pnl

def compute_metrics(equity, daily_pnl):
    if len(equity) == 0:
        return {"CAGR": 0.0, "MaxDD": 0.0, "Sharpe": 0.0}
    T = len(equity)
    years = T / 252.0
    cagr  = (equity[-1] ** (1/years) - 1) if years > 0 and equity[-1] > 0 else 0.0
    peaks = np.maximum.accumulate(equity)
    maxdd = float((equity/peaks - 1).min())
    sharpe = float((daily_pnl.mean() / (daily_pnl.std() + 1e-8)) * np.sqrt(252))
    return {"CAGR": float(cagr), "MaxDD": maxdd, "Sharpe": sharpe}

# 3) Build loaders for val/test
val_eval_loader  = loader_from_eval(val_paths,  PARAMS['split']['val_start'],  PARAMS['split']['val_end'],  batch_size=2048)
test_eval_loader = loader_from_eval(test_paths, PARAMS['split']['test_start'], PARAMS['split']['test_end'], batch_size=2048)

# 4) Load best checkpoint
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = build_model(PARAMS['model_option'], in_feats=len(PARAMS['features'])).to(device)
state = torch.load(os.path.join(OUTPUT_DIR, 'best_model.pt'), map_location=device)
# Accept either a bare state_dict or a {"model": state_dict, ...} checkpoint
sd = state['model'] if isinstance(state, dict) and 'model' in state else state
model.load_state_dict(sd)

# 5) Infer → positions → backtest → metrics
probs_val,  rets_val,  y_val  = infer_probs_and_returns(model, val_eval_loader,  device)
probs_test, rets_test, y_test = infer_probs_and_returns(model, test_eval_loader, device)
print("Shapes | val:", probs_val.shape, rets_val.shape, "| test:", probs_test.shape, rets_test.shape)

# thresholds & costs (with safe defaults)
th_long  = PARAMS.get('threshold_long', 0.55)
th_short = PARAMS.get('threshold_short', 0.45)
cost_bps = PARAMS.get('cost_bps', 5)

pos_val  = positions_from_probs(probs_val,  th_long, th_short)
pos_test = positions_from_probs(probs_test, th_long, th_short)

eq_val,  trades_val,  pnl_val  = backtest_from_positions(rets_val,  pos_val,  cost_bps=cost_bps)
eq_test, trades_test, pnl_test = backtest_from_positions(rets_test, pos_test, cost_bps=cost_bps)

m_val  = compute_metrics(eq_val,  pnl_val)
m_test = compute_metrics(eq_test, pnl_test)

print("VAL  metrics:", m_val,  "| trades:", int(trades_val.sum()))
print("TEST metrics:", m_test, "| trades:", int(trades_test.sum()))


Shapes | val: (452128,) (452128,) | test: (1624247,) (1624247,)
VAL  metrics: {'CAGR': 0.0, 'MaxDD': -1.0, 'Sharpe': -0.06684386441090877} | trades: 1400
TEST metrics: {'CAGR': 0.0, 'MaxDD': -1.0, 'Sharpe': -0.024597495151045513} | trades: 5236


In [20]:
# --- Evaluation (subset) — uses PARAMS['calibration_T'] and PARAMS['margin'] ---
import os, json, math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# knobs driven by PARAMS (no hardcoding)
SPLITS    = ["val", "test"]        # run both
N_TICKERS = len(val_paths)         # tickers per split; set to e.g. 250 for a quick subset run
T_BEST    = float(PARAMS.get("calibration_T", 0.899))
MARGIN    = float(PARAMS.get("margin", 0.02))
MODE      = str(PARAMS.get("portfolio_mode", "long_only"))
COST_BPS  = float(PARAMS.get("cost_bps", 5))
FEATS     = PARAMS["features"]
L         = int(PARAMS["lookback"])
S         = PARAMS["split"]

# helpers we expect from your canonical utilities (don’t redefine here):
#   - sanitize_returns(...)
#   - positions_from_probs(...)
#   - compute_metrics(...)
# and from Bootstrap B:
#   - model_probs(...)
# and from Bootstrap A:
#   - normalize_inplace(...)

paths_map = {"train": train_paths, "val": val_paths, "test": test_paths}

def load_returns_and_probs(fp, start, end):
    """Load one ticker, normalize, make windows, get up-prob per step with T scaling."""
    # minimal columns with fallbacks
    try:
        df = pd.read_parquet(fp, columns=list(set(FEATS + ["ret_fwd_1", "y", "Close"]))).sort_index()
    except Exception:
        df = pd.read_parquet(fp).sort_index()

    df = df.loc[(df.index >= pd.to_datetime(start)) & (df.index <= pd.to_datetime(end))]
    if len(df) < L + 2:
        return None

    # derive forward return if missing
    if "ret_fwd_1" not in df:
        if "y" in df:
            df["ret_fwd_1"] = (df["y"].astype(float) * 2 - 1).shift(-1).fillna(0.0)
        elif "Close" in df:
            cl = df["Close"].astype(float)
            df["ret_fwd_1"] = np.log(cl.shift(-1) / cl).fillna(0.0)
        else:
            return None

    # normalize features
    df = normalize_inplace(df)
    X = df[FEATS].to_numpy(np.float32, copy=False)
    if len(X) <= L + 1:
        return None

    # build sliding windows (T-L steps), get probs with calibrated temperature
    Xw = np.stack([X[i - L:i] for i in range(L, len(X))], axis=0)
    p  = model_probs(Xw, T=T_BEST)
    p  = p[:, 1] if (p.ndim == 2 and p.shape[1] == 2) else (p if p.ndim == 1 else p[:, -1])

    # align returns to label steps
    r = df["ret_fwd_1"].to_numpy(np.float32, copy=False)[L:]
    r = sanitize_returns(r, cap=0.30)
    dates = df.index[L:]
    return dates, r, p

def portfolio_from_subset(split_name, n_limit):
    files = paths_map[split_name][:n_limit]
    if not files:
        return None

    start = S[f"{split_name}_start"]; end = S[f"{split_name}_end"]
    eq_sums = {}   # date -> sum of per-ticker pnl
    eq_cnts = {}   # date -> # active tickers

    th_long  = 0.5 + MARGIN
    th_short = 0.5 - MARGIN

    for fp in files:
        out = load_returns_and_probs(fp, start, end)
        if out is None:
            continue
        dates, r, p = out

        if MODE == "long_only":
            pos = (p > th_long).astype(np.int8)          # 1 or 0
        else:
            pos = np.zeros_like(p, dtype=np.int8)
            pos[p >= th_long] = 1
            pos[p <= th_short] = -1

        # entry/flip costs per ticker
        pos_shift = np.roll(pos, 1); pos_shift[0] = 0
        trades = (pos_shift[1:] != pos_shift[:-1]).astype(np.int8)
        costs  = np.zeros_like(r, dtype=np.float32)
        if len(costs) > 1:
            costs[1:] = trades * (COST_BPS / 1e4)

        pnl = (pos_shift * r) - costs

        for d, x in zip(pd.to_datetime(dates).normalize(), pnl):
            eq_sums[d] = eq_sums.get(d, 0.0) + float(x)
            eq_cnts[d] = eq_cnts.get(d, 0) + 1

    if not eq_sums:
        return None

    days = sorted(eq_sums.keys())
    daily = pd.DataFrame({
        "date": days,
        "ret_day": [eq_sums[d] / max(eq_cnts[d], 1) for d in days],
        "n": [eq_cnts[d] for d in days],
    }).set_index("date").sort_index()

    # equity + metrics
    rets = daily["ret_day"].to_numpy(np.float64)
    equity = np.cumprod(1.0 + np.clip(rets, -0.99, 10.0))
    mets = compute_metrics(equity, rets)
    return daily, equity, mets

os.makedirs(OUTPUT_DIR, exist_ok=True)
run_summ = {"T": T_BEST, "margin": MARGIN, "mode": MODE, "lookback": L, "features": FEATS, "subset": N_TICKERS}

for split in SPLITS:
    res = portfolio_from_subset(split, N_TICKERS)
    if res is None:
        print(f"{split.upper()} produced no data.")
        continue
    daily, equity, mets = res

    # save equity PNG + CSV
    png = os.path.join(OUTPUT_DIR, f"model_equity_{split}_subset.png")
    csv = os.path.join(OUTPUT_DIR, f"model_daily_{split}_subset.csv")
    daily.to_csv(csv, index=True)

    plt.figure()
    plt.plot(equity)
    plt.title(f"Model Equity — {split.upper()} (subset={N_TICKERS}, T={T_BEST:.3f}, margin={MARGIN}, mode={MODE})")
    plt.xlabel("Days"); plt.ylabel("Equity"); plt.grid(True)
    plt.savefig(png, dpi=150, bbox_inches="tight"); plt.close()

    run_summ[split] = {"days": int(daily.shape[0]), "metrics": mets, "png": png, "csv": csv}
    print(f"{split.upper()} metrics:", mets)

# Write run summary JSON
run_json = os.path.join(OUTPUT_DIR, "model_run_full.json")  # use "model_run_subset.json" for subset runs
with open(run_json, "w") as f:
    json.dump(run_summ, f, indent=2)
print("Saved:", run_json)


VAL metrics: {'CAGR': 0.0032084100905311885, 'MaxDD': -0.002329084011930105, 'Sharpe': 0.894978358671953}
TEST metrics: {'CAGR': 0.00576117112133212, 'MaxDD': -0.003667971007708104, 'Sharpe': 1.0546295781739405}
Saved: /content/drive/MyDrive/Colab Notebooks/outputs/model_run_full.json


In [18]:
# Verify eval artifacts exist and look sane
import os, glob, pandas as pd
outs = sorted(glob.glob(os.path.join(OUTPUT_DIR, "model_*_subset.*")))
print("Found:", [os.path.basename(p) for p in outs])
for p in outs:
    try:
        print(os.path.basename(p), "bytes:", os.path.getsize(p))
    except OSError:
        pass  # file vanished or is unreadable; skip it

for split in ("val","test"):
    csv = os.path.join(OUTPUT_DIR, f"model_daily_{split}_subset.csv")
    if os.path.exists(csv):
        print("\n", split.upper(), "CSV tail:")
        display(pd.read_csv(csv).tail(3))


Found: ['model_daily_test_subset.csv', 'model_daily_val_subset.csv', 'model_equity_test_subset.png', 'model_equity_val_subset.png', 'model_run_subset.json']
model_daily_test_subset.csv bytes: 21159
model_daily_val_subset.csv bytes: 6819
model_equity_test_subset.png bytes: 52618
model_equity_val_subset.png bytes: 66676
model_run_subset.json bytes: 870

 VAL CSV tail:


Unnamed: 0,date,ret_day,n
185,2019-12-27,-0.000167,249
186,2019-12-30,0.000133,250
187,2019-12-31,0.000206,250



 TEST CSV tail:


Unnamed: 0,date,ret_day,n
675,2022-12-07,0.0,246
676,2022-12-08,0.0,246
677,2022-12-09,0.0,237


In [19]:
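# NOTE: `split` and `daily` are leftovers from an earlier eval loop whose figure is already closed, so
# savefig writes an empty canvas (the "0 Axes" output below) over the PNG; re-plot `equity` first.
daily.to_csv(os.path.join(OUTPUT_DIR, f"model_daily_{split}_subset.csv"), index=True)
plt.savefig(os.path.join(OUTPUT_DIR, f"model_equity_{split}_subset.png"), dpi=150, bbox_inches="tight")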


<Figure size 640x480 with 0 Axes>

In [21]:
# --- Sanitize returns for backtest (handles percent-points & split spikes) ---
import numpy as np

def sanitize_returns(r, cap=0.30):
    r = r.astype(np.float32).copy()

    # Step A: treat |r| > 1 as percent-points and convert those entries to fractions.
    # Example: 99 -> 0.99, 3098 -> 30.98 (capped below). Caveat: a genuine move beyond
    # ±100% is rescaled too (e.g. -1.97 -> -0.0197).
    pp_mask = np.abs(r) > 1.0
    if pp_mask.any():
        r[pp_mask] = r[pp_mask] / 100.0

    # Step B: floor at -0.99 (a long position cannot lose more than 100% in a day).
    # Redundant while cap < 0.99; it only matters if a larger cap is passed.
    r[r < -0.99] = -0.99

    # Step C: clip both tails to ±cap (splits/bad ticks). The 0.30 default is a heuristic; tune if needed.
    r = np.clip(r, -cap, cap)

    return r

rets_val_s  = sanitize_returns(rets_val,  cap=0.30)
rets_test_s = sanitize_returns(rets_test, cap=0.30)

# Re-run backtest + metrics with sanitized returns
eq_val,  trades_val,  pnl_val  = backtest_from_positions(rets_val_s,  pos_val,  cost_bps=cost_bps)
eq_test, trades_test, pnl_test = backtest_from_positions(rets_test_s, pos_test, cost_bps=cost_bps)

m_val  = compute_metrics(eq_val,  pnl_val)
m_test = compute_metrics(eq_test, pnl_test)
print("VAL  metrics (sanitized):", m_val,  "| trades:", int(trades_val.sum()))
print("TEST metrics (sanitized):", m_test, "| trades:", int(trades_test.sum()))


VAL  metrics (sanitized): {'CAGR': -0.018236330647384835, 'MaxDD': -0.9999999999999987, 'Sharpe': 0.014450937035149873} | trades: 1400
TEST metrics (sanitized): {'CAGR': -0.058430599608076506, 'MaxDD': -1.0, 'Sharpe': -0.16890677308891483} | trades: 5236


In [17]:
# --- Boundary-aware backtest (no cross-ticker carry; log-equity to avoid underflow) ---
import pandas as pd
import numpy as np

def compute_segment_starts(paths, start, end, lookback):
    starts = []
    idx = 0
    start = pd.to_datetime(start); end = pd.to_datetime(end)
    any_col = (PARAMS['features'] + ['y'])[0]
    for p in paths:
        df = pd.read_parquet(p, columns=[any_col]).sort_index()
        df = df.loc[(df.index >= start) & (df.index <= end)]
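        # NOTE: this counts every ticker with len(df) > lookback. The fallback eval dataset in the
        # temperature-scaling cell skips tickers with len(df) < lookback + 2, so a ticker with exactly
        # lookback + 1 rows would shift every later start by one. Keep the two rules in sync.
        n = max(0, len(df) - int(lookback))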
        if n > 0:
            starts.append(idx)
            idx += n
    return np.array(starts, dtype=np.int64)

starts_val  = compute_segment_starts(val_paths,  PARAMS['split']['val_start'],  PARAMS['split']['val_end'],  PARAMS['lookback'])
starts_test = compute_segment_starts(test_paths, PARAMS['split']['test_start'], PARAMS['split']['test_end'], PARAMS['lookback'])

def backtest_from_positions_segments(returns, positions, starts, cost_bps=5):
    returns = returns.astype(np.float64)
    positions = positions.astype(int)

    # reset carry at each ticker start
    pos_shift = np.roll(positions, 1)
    pos_shift[0] = 0
    if len(starts):
        pos_shift[starts] = 0

    trades = (pos_shift[1:] != pos_shift[:-1]).astype(int)
    costs = np.zeros_like(returns, dtype=np.float64)
    costs[1:] = trades * (cost_bps / 1e4)

    daily_pnl = pos_shift * returns - costs

    # log-equity to avoid underflow
    le = np.cumsum(np.log1p(np.clip(daily_pnl, -0.999999, 10.0)))
    equity = np.exp(le)

    # drawdown from log-equity
    peaks_log = np.maximum.accumulate(le)
    maxdd = float(np.exp(le - peaks_log).min() - 1.0)

    # metrics
    sharpe = float((daily_pnl.mean() / (daily_pnl.std() + 1e-8)) * np.sqrt(252)) if daily_pnl.size else 0.0
    years = daily_pnl.size / 252.0
    cagr = (equity[-1] ** (1 / years) - 1.0) if (years > 0 and equity[-1] > 0) else 0.0

    return equity, trades, daily_pnl, {"CAGR": float(cagr), "MaxDD": maxdd, "Sharpe": sharpe}

# Use sanitized returns from earlier; if not present, fall back to raw
rets_val_s  = globals().get("rets_val_s",  rets_val)
rets_test_s = globals().get("rets_test_s", rets_test)

eq_val,  trades_val,  pnl_val,  m_val  = backtest_from_positions_segments(rets_val_s,  pos_val,  starts_val,  cost_bps=cost_bps)
eq_test, trades_test, pnl_test, m_test = backtest_from_positions_segments(rets_test_s, pos_test, starts_test, cost_bps=cost_bps)

print("VAL  metrics (seg):",  m_val,  "| trades:", int(trades_val.sum()))
print("TEST metrics (seg):", m_test, "| trades:", int(trades_test.sum()))


VAL  metrics (seg): {'CAGR': -0.017960838711980398, 'MaxDD': -0.999999999999998, 'Sharpe': 0.015690154644168623} | trades: 1414
TEST metrics (seg): {'CAGR': -0.05620217465915145, 'MaxDD': -1.0, 'Sharpe': -0.15914943293546552} | trades: 5394
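
To see what the `starts` reset changes, here is a toy sketch with two hypothetical tickers concatenated into one array. The positions and start offsets are made up for illustration; the code only mirrors the shift-then-reset step of `backtest_from_positions_segments` and does not touch the real data.

```python
import numpy as np

# Hypothetical concatenated positions: ticker A (3 steps, long), ticker B (2 steps, short)
positions = np.array([1, 1, 1, -1, -1])
starts = np.array([0, 3])  # first index of each ticker's segment (assumed values)

# Naive shift: the previous step's position earns the current step's return
naive = np.roll(positions, 1)
naive[0] = 0

# Boundary-aware shift: additionally zero the carried position at each segment start
seg = naive.copy()
seg[starts] = 0

print("naive:", naive.tolist())
print("seg:  ", seg.tolist())
```

At index 3 the naive version applies ticker A's long position to ticker B's first return, while the segment-aware version is flat there. That extra flat step also changes the trade count, which is consistent with the slightly different `trades` totals reported above.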


In [10]:
import numpy as np

def describe(arr, name):
    q = np.quantile(arr, [0, 0.001, 0.01, 0.05, 0.5, 0.95, 0.99, 0.999, 1])
    print(f"{name}: min={q[0]:.6f}  q0.1%={q[1]:.6f}  q1%={q[2]:.6f}  q5%={q[3]:.6f}  "
          f"median={q[4]:.6f}  q95%={q[5]:.6f}  q99%={q[6]:.6f}  q99.9%={q[7]:.6f}  max={q[8]:.6f}")

print("VAL returns:")
describe(rets_val, "rets_val")
print("TEST returns:")
describe(rets_test, "rets_test")

# Did PnL ever go below -100% in one day?
print("\nPnL sanity:")
print(" min pnl_val:", float(pnl_val.min()), " <= -1 count:", int((pnl_val <= -1).sum()))
print(" min pnl_test:", float(pnl_test.min()), " <= -1 count:", int((pnl_test <= -1).sum()))

# How many returns have absolute value > 1 (i.e., > 100% in *fractional* terms)?
print("\nabs(rets) > 1 counts (val/test):",
      int((np.abs(rets_val) > 1).sum()),
      int((np.abs(rets_test) > 1).sum()))


VAL returns:
rets_val: min=-0.995000  q0.1%=-0.268532  q1%=-0.080238  q5%=-0.036316  median=0.000000  q95%=0.035576  q99%=0.087408  q99.9%=0.334319  max=99.000000
TEST returns:
rets_test: min=-1.972603  q0.1%=-0.284789  q1%=-0.093154  q5%=-0.046595  median=0.000000  q95%=0.050514  q99%=0.115000  q99.9%=0.438158  max=3098.000000

PnL sanity:
 min pnl_val: -0.30050001192092896  <= -1 count: 0
 min pnl_test: -0.30050001192092896  <= -1 count: 0

abs(rets) > 1 counts (val/test): 83 397
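
The quantiles above include extremes such as 99, 3098 and -1.972603. As a rough check of how the three-step sanitizer treats them, the sketch below re-implements the same logic under a hypothetical name (`sanitize_sketch`) so it runs standalone. The input array is hand-picked from the printed extremes plus one ordinary return.

```python
import numpy as np

def sanitize_sketch(r, cap=0.30):
    """Standalone copy of the sanitizer logic (hypothetical name, illustration only)."""
    r = np.asarray(r, dtype=np.float32).copy()
    big = np.abs(r) > 1.0          # Step A: treated as percent-points
    r[big] = r[big] / 100.0
    r[r < -0.99] = -0.99           # Step B: hard floor
    return np.clip(r, -cap, cap)   # Step C: symmetric cap

sample = np.array([99.0, 3098.0, -1.972603, 0.01])
out = sanitize_sketch(sample)
print(out)
```

Both large positives end up at the 0.30 cap. The -1.972603 value is rescaled to about -0.0197 rather than floored, because step A fires before step B; whether that is the right treatment depends on what the raw value actually represents.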


In [11]:
# --- Buy & Hold baseline: equal-weight daily portfolio (streaming, tiny RAM) ---
import os, json, numpy as np, pandas as pd
import matplotlib.pyplot as plt

def compute_metrics_from_equity_and_rets(equity, daily_rets):
    if len(equity) == 0:
        return {"CAGR": 0.0, "MaxDD": 0.0, "Sharpe": 0.0}
    # log-space DD for numeric stability
    le = np.log(np.clip(equity, 1e-12, None))
    peaks_log = np.maximum.accumulate(le)
    maxdd = float(np.exp(le - peaks_log).min() - 1.0)

    years = len(daily_rets) / 252.0
    cagr = (equity[-1] ** (1/years) - 1.0) if (years > 0 and equity[-1] > 0) else 0.0
    sharpe = float((daily_rets.mean() / (daily_rets.std() + 1e-8)) * np.sqrt(252)) if daily_rets.size else 0.0
    return {"CAGR": float(cagr), "MaxDD": maxdd, "Sharpe": sharpe}

def equal_weight_bh(paths, start, end, cap=0.30):
    start = pd.to_datetime(start); end = pd.to_datetime(end)
    sums = {}   # date -> sum of returns
    counts = {} # date -> ticker count that day

    for p in paths:
        # read only the forward return; index holds the date
        df = pd.read_parquet(p, columns=['ret_fwd_1']).sort_index()
        df = df.loc[(df.index >= start) & (df.index <= end)]
        if df.empty:
            continue
        r = sanitize_returns(df['ret_fwd_1'].to_numpy(dtype=np.float32), cap=cap)

        # accumulate by date (normalize index to date only)
        for d, ri in zip(df.index, r):
            day = pd.Timestamp(d).normalize()  # midnight of that day
            sums[day] = sums.get(day, 0.0) + float(ri)
            counts[day] = counts.get(day, 0) + 1

    if not sums:
        return [], np.array([]), np.array([]), {"CAGR": 0.0, "MaxDD": 0.0, "Sharpe": 0.0}

    dates = sorted(sums.keys())
    daily_rets = np.array([sums[d] / max(counts[d], 1) for d in dates], dtype=np.float64)
    equity = np.cumprod(1.0 + daily_rets)
    metrics = compute_metrics_from_equity_and_rets(equity, daily_rets)
    return dates, daily_rets, equity, metrics

# Run baseline on val & test
val_dates,  val_rets_bh,  val_eq_bh,  m_val_bh  = equal_weight_bh(
    val_paths,  PARAMS['split']['val_start'],  PARAMS['split']['val_end'],  cap=0.30
)
test_dates, test_rets_bh, test_eq_bh, m_test_bh = equal_weight_bh(
    test_paths, PARAMS['split']['test_start'], PARAMS['split']['test_end'], cap=0.30
)

print("B&H VAL : days=", len(val_dates),  "| metrics:",  m_val_bh)
print("B&H TEST: days=", len(test_dates), "| metrics:", m_test_bh)

# Save artifacts
def plot_equity(eq, title, path):
    plt.figure()
    plt.plot(eq)
    plt.title(title); plt.xlabel("Days"); plt.ylabel("Equity")
    plt.grid(True); plt.savefig(path, dpi=150, bbox_inches='tight'); plt.close()

os.makedirs(OUTPUT_DIR, exist_ok=True)
val_png  = os.path.join(OUTPUT_DIR, "baseline_equity_val.png")
test_png = os.path.join(OUTPUT_DIR, "baseline_equity_test.png")
plot_equity(val_eq_bh,  "Buy & Hold (Equal-Weight) — Validation", val_png)
plot_equity(test_eq_bh, "Buy & Hold (Equal-Weight) — Test",       test_png)

baseline_json = {
    "seed": SEED,
    "params": PARAMS,
    "baseline": "equal_weight_buy_and_hold",
    "val": {"days": len(val_dates),  "metrics": m_val_bh},
    "test":{"days": len(test_dates), "metrics": m_test_bh},
}
with open(os.path.join(OUTPUT_DIR, "baseline_run.json"), "w") as f:
    json.dump(baseline_json, f, indent=2)

print("Saved baseline:",
      val_png, ",",
      test_png, ",",
      os.path.join(OUTPUT_DIR, "baseline_run.json"))


B&H VAL : days= 252 | metrics: {'CAGR': 0.2231077131275303, 'MaxDD': -0.08114500414961767, 'Sharpe': 1.8023377145278825}
B&H TEST: days= 743 | metrics: {'CAGR': 0.08300290961302337, 'MaxDD': -0.4115523167155214, 'Sharpe': 0.4384846621035782}
Saved baseline: /content/drive/MyDrive/Colab Notebooks/outputs/baseline_equity_val.png , /content/drive/MyDrive/Colab Notebooks/outputs/baseline_equity_test.png , /content/drive/MyDrive/Colab Notebooks/outputs/baseline_run.json


In [12]:
# --- Temperature scaling (calibrate on VAL) + re-evaluate model portfolio (VAL & TEST) ---
import os, numpy as np, pandas as pd, torch, json

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# ensure normalization stats are loaded in this cell's scope
if 'FEAT_MEAN' not in globals():
    norm_df = pd.read_csv(os.path.join(OUTPUT_DIR, "feature_norm.csv"), index_col=0)
    FEAT_MEAN = norm_df["mean"].astype(float).to_dict()
    FEAT_STD  = norm_df["std"].replace(0, 1.0).astype(float).to_dict()

# Use your existing eval dataset + infer helper if present; else define minimal versions
try:
    TickerEvalIterableDataset
    infer_probs_and_returns
except NameError:
    from torch.utils.data import IterableDataset, DataLoader
    class TickerEvalIterableDataset(IterableDataset):
        def __init__(self, parquet_paths, start, end, features, lookback):
            self.paths = parquet_paths
            self.start = pd.to_datetime(start)
            self.end   = pd.to_datetime(end)
            self.features = features
            self.lookback = int(lookback)
        def __iter__(self):
            for p in self.paths:
                df = pd.read_parquet(p, columns=self.features + ['y','ret_fwd_1']).sort_index()
                df = df.loc[(df.index >= self.start) & (df.index <= self.end)]
                if len(df) < self.lookback + 2:
                    continue
                # same normalization as training
                for c in self.features:
                    df[c] = (df[c] - FEAT_MEAN[c]) / FEAT_STD[c]
                X = df[self.features].to_numpy(dtype=np.float32, copy=False)
                y = df['y'].to_numpy(dtype=np.int64, copy=False)
                r = df['ret_fwd_1'].to_numpy(dtype=np.float32, copy=False)
                for i in range(self.lookback, len(df)):
                    yield X[i-self.lookback:i], y[i], r[i]
    def infer_probs_and_returns(model, loader, device):
        model.eval(); probs=[]; rets=[]; labels=[]
        with torch.no_grad():
            for xb, yb, rb in loader:
                xb = xb.to(device, non_blocking=True)
                p1 = torch.softmax(model(xb), dim=1)[:,1].cpu().numpy()
                probs.append(p1); rets.append(rb.numpy()); labels.append(yb.numpy())
        if not probs:
            return np.array([]), np.array([]), np.array([])
        return np.concatenate(probs), np.concatenate(rets), np.concatenate(labels)

# Build or reuse loaders
from torch.utils.data import DataLoader
feat_cols = PARAMS['features']; lb = int(PARAMS['lookback'])
def make_eval_loader(paths, start, end, bs=4096):
    ds = TickerEvalIterableDataset(paths, start, end, feat_cols, lb)
    return DataLoader(ds, batch_size=bs, shuffle=False, num_workers=0, pin_memory=True)

val_eval_loader = globals().get("val_eval_loader", None)
if val_eval_loader is None:
    val_eval_loader = make_eval_loader(val_paths, PARAMS['split']['val_start'], PARAMS['split']['val_end'])

test_eval_loader = globals().get("test_eval_loader", None)
if test_eval_loader is None:
    test_eval_loader = make_eval_loader(test_paths, PARAMS['split']['test_start'], PARAMS['split']['test_end'])

# Load model
model = build_model(PARAMS['model_option'], in_feats=len(feat_cols)).to(device).eval()
state = torch.load(os.path.join(OUTPUT_DIR, "best_model.pt"), map_location=device)
sd = state['model'] if isinstance(state, dict) and 'model' in state else state  # unwrap {'model': state_dict} checkpoints
model.load_state_dict(sd)

# 1) Collect VAL probs + labels for calibration
p_val, r_dummy, y_val = infer_probs_and_returns(model, val_eval_loader, device)
eps = 1e-6
p_val = np.clip(p_val, eps, 1.0 - eps)
logit_val = np.log(p_val / (1.0 - p_val))

# 2) Fit temperature by minimizing log loss on VAL (1D grid, robust & fast)
def nll_at_T(T):
    z = logit_val / T
    p = 1.0 / (1.0 + np.exp(-z))
    # binary log loss
    return -np.mean(y_val * np.log(np.clip(p, eps, 1-eps)) + (1-y_val) * np.log(np.clip(1-p, eps, 1-eps)))

grid = np.geomspace(0.25, 4.0, 40)
losses = np.array([nll_at_T(T) for T in grid])
T_best = float(grid[np.argmin(losses)])
print(f"Best temperature on VAL: T={T_best:.3f} (vs uncalibrated T=1.0)")

# Small helper: apply temperature to any prob vector
def apply_temp(p, T):
    p = np.clip(p, eps, 1.0 - eps)
    z = np.log(p / (1.0 - p)) / T
    return 1.0 / (1.0 + np.exp(-z))

# 3) Rebuild model portfolio with calibrated probs (equal-weight, date-aligned)

def portfolio_with_temp(paths, start, end, margin, cost_bps=5, cap=0.30, batch_windows=4096):
    # norms
    norm_df = pd.read_csv(os.path.join(OUTPUT_DIR, "feature_norm.csv"), index_col=0)
    FEAT_MEAN = norm_df["mean"].astype(float).to_dict()
    FEAT_STD  = norm_df["std"].replace(0, 1.0).astype(float).to_dict()

    start = pd.to_datetime(start); end = pd.to_datetime(end)
    sum_pnl, count = {}, {}
    th_long, th_short = 0.5 + margin, 0.5 - margin

    for pth in paths:
        try:
            df = pd.read_parquet(pth, columns=feat_cols + ['ret_fwd_1']).sort_index()
        except Exception:
            continue
        df = df.loc[(df.index >= start) & (df.index <= end)]
        if len(df) < lb + 2:
            continue
        for c in feat_cols:
            if c in df.columns:
                df[c] = (df[c] - FEAT_MEAN[c]) / FEAT_STD[c]

        X = df[feat_cols].to_numpy(dtype=np.float32, copy=False)
        r = sanitize_returns(df['ret_fwd_1'].to_numpy(dtype=np.float32, copy=False), cap=cap)

        # batched window inference
        Tlen = len(df)
        probs = []
        with torch.no_grad():
            for i in range(lb, Tlen, batch_windows):
                j = min(Tlen, i + batch_windows)
                windows = np.stack([X[k-lb:k] for k in range(i, j)], axis=0).astype(np.float32)
                xb = torch.from_numpy(windows).to(device)
                p1 = torch.softmax(model(xb), dim=1)[:,1].detach().cpu().numpy()
                probs.append(p1)
        if not probs:
            continue
        probs = np.concatenate(probs)
        probs = apply_temp(probs, T_best)  # <-- calibration applied here

        r_lbl = r[lb:]
        dates = df.index[lb:]

        pos = np.zeros_like(probs, dtype=np.int8)
        pos[probs >= th_long] = 1
        pos[probs <= th_short] = -1

        pos_shift = np.roll(pos.astype(int), 1); pos_shift[0] = 0
        trades = (pos_shift[1:] != pos_shift[:-1]).astype(int)
        costs  = np.zeros_like(r_lbl, dtype=np.float32)
        costs[1:] = trades * (cost_bps / 1e4)

        pnl = pos_shift * r_lbl - costs
        for d, x in zip(dates, pnl):
            day = pd.Timestamp(d).normalize()
            sum_pnl[day] = sum_pnl.get(day, 0.0) + float(x)
            count[day]   = count.get(day, 0) + 1

    if not sum_pnl:
        return [], np.array([]), np.array([]), {"CAGR": 0.0, "MaxDD": 0.0, "Sharpe": 0.0}

    days = sorted(sum_pnl.keys())
    daily_rets = np.array([sum_pnl[d] / max(count[d],1) for d in days], dtype=np.float64)
    equity = np.cumprod(1.0 + daily_rets)

    # metrics
    le = np.log(np.clip(equity, 1e-12, None))
    peaks_log = np.maximum.accumulate(le)
    maxdd = float(np.exp(le - peaks_log).min() - 1.0)
    years = len(daily_rets) / 252.0
    cagr  = (equity[-1] ** (1/years) - 1.0) if (years > 0 and equity[-1] > 0) else 0.0
    sharpe = float((daily_rets.mean() / (daily_rets.std() + 1e-8)) * np.sqrt(252)) if daily_rets.size else 0.0
    return days, daily_rets, equity, {"CAGR": float(cagr), "MaxDD": maxdd, "Sharpe": sharpe}

# Margin for this run: 0.02 (the PARAMS default is 0.15)
margin = 0.02
val_days,  val_rets_c,  val_eq_c,  m_val_c  = portfolio_with_temp(val_paths,  PARAMS['split']['val_start'],  PARAMS['split']['val_end'],  margin, cost_bps=PARAMS.get('cost_bps',5))
test_days, test_rets_c, test_eq_c, m_test_c = portfolio_with_temp(test_paths, PARAMS['split']['test_start'], PARAMS['split']['test_end'], margin, cost_bps=PARAMS.get('cost_bps',5))

print(f"CALIBRATED (T={T_best:.3f}) VAL  metrics @ margin={margin}: ", m_val_c)
print(f"CALIBRATED (T={T_best:.3f}) TEST metrics @ margin={margin}: ", m_test_c)


Best temperature on VAL: T=0.899 (vs uncalibrated T=1.0)
CALIBRATED (T=0.899) VAL  metrics @ margin=0.02:  {'CAGR': -0.026403907520209935, 'MaxDD': -0.06667088128018483, 'Sharpe': -0.49574412114335936}
CALIBRATED (T=0.899) TEST metrics @ margin=0.02:  {'CAGR': -0.11699866071416731, 'MaxDD': -0.32370792794089076, 'Sharpe': -0.9384402698914333}
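
For intuition about the grid fit above, here is a small synthetic sketch of temperature scaling. Everything in it is an assumption made for illustration: the "true" logits are random normals, labels are drawn from their sigmoid, and the pretend model reports logits twice as large (overconfident). Only the grid and the binary log loss follow the cell above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup (assumption): true logits z, labels ~ Bernoulli(sigmoid(z)),
# and an overconfident "model" that reports 2*z as its logit.
z_true = rng.normal(size=20_000)
y = (rng.random(20_000) < 1.0 / (1.0 + np.exp(-z_true))).astype(np.float64)
logit_model = 2.0 * z_true

def nll(T, eps=1e-6):
    """Binary log loss after dividing the model logits by temperature T."""
    p = 1.0 / (1.0 + np.exp(-logit_model / T))
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# Same style of 1D geometric grid as in the calibration cell
grid = np.geomspace(0.25, 4.0, 40)
T_fit = float(grid[np.argmin([nll(T) for T in grid])])
print(f"fitted T = {T_fit:.3f}")
```

Because the synthetic model doubles the true logits, the fitted temperature should land near 2: T > 1 softens overconfident probabilities. The run above fitted T=0.899, which sharpens the probabilities slightly instead.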


In [56]:
# Set in-memory settings for this session
PARAMS['portfolio_mode'] = 'long_only'
PARAMS['margin'] = 0.02          # <— your chosen value
PARAMS['calibration_T'] = 0.899  # from temp scaling
# optional:
# PARAMS['cost_bps'] = 5
print("Session settings set →", {k: PARAMS[k] for k in ['portfolio_mode','margin','calibration_T']})


Session settings set → {'portfolio_mode': 'long_only', 'margin': 0.02, 'calibration_T': 0.899}


In [22]:
# Ensure we’re using YOUR chosen settings before running the saver
# (Pull from PARAMS, or hard-set them here)
T_best = float(PARAMS.get('calibration_T', 0.899))
margin = float(PARAMS.get('margin', 0.02))       # <-- you found 0.02 works
mode   = PARAMS.get('portfolio_mode', 'long_only')

print(f"Using settings → mode={mode}, margin={margin}, T={T_best}, cost_bps={PARAMS.get('cost_bps',5)}")


Using settings → mode=long_only, margin=0.02, T=0.899, cost_bps=5


In [23]:
# --- Lock portfolio settings ---
import json, os
PARAMS['calibration_T'] = float(globals().get('T_best', 1.0))  # 0.899 for you
PARAMS['portfolio_mode'] = 'long_only'
PARAMS['margin'] = 0.02  # or whatever you pick
lock_path = os.path.join(OUTPUT_DIR, 'params_locked.json')
with open(lock_path, 'w') as f:
    json.dump(PARAMS, f, indent=2)
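# Report the values that were just written, not stale session variables
print(f"Using settings → mode={PARAMS['portfolio_mode']}, margin={PARAMS['margin']}, T={PARAMS['calibration_T']}, cost_bps={PARAMS.get('cost_bps',5)}")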
print("Saved:", lock_path)

Using settings → mode=long_only, margin=0.02, T=0.899, cost_bps=5
Saved: /content/drive/MyDrive/Colab Notebooks/outputs/params_locked.json


In [24]:
# Ensure saver uses locked settings
T_best = float(PARAMS.get('calibration_T', 1.0))
margin = float(PARAMS.get('margin', 0.02))
mode   = PARAMS.get('portfolio_mode', 'long_only')
print(f"Using settings → mode={mode}, margin={margin}, T={T_best}, cost_bps={PARAMS.get('cost_bps',5)}")


Using settings → mode=long_only, margin=0.02, T=0.899, cost_bps=5


## Metrics & plots

In [None]:
def plot_equity(equity_curve, title='Equity Curve', path=os.path.join(OUTPUT_DIR, 'equity_curve.png')):
    plt.figure()
    plt.plot(equity_curve)
    plt.title(title)
    plt.xlabel('Days')
    plt.ylabel('Equity')
    plt.grid(True)
    plt.savefig(path, dpi=150, bbox_inches='tight')
    plt.close()
    return path

print('Metrics/plots stubs ready.')
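
As a sanity check for the metric definitions used throughout (CAGR, max drawdown, annualized Sharpe with 252 trading days), here is a small worked example on a hand-made return series. The function name `metrics_sketch` is hypothetical and the formulas follow the ones in the baseline cell; treat it as a sketch, not as the canonical `compute_metrics`.

```python
import numpy as np

def metrics_sketch(daily_rets):
    """Equity curve plus CAGR / MaxDD / Sharpe for a non-empty daily return series."""
    daily_rets = np.asarray(daily_rets, dtype=np.float64)
    equity = np.cumprod(1.0 + daily_rets)
    peaks = np.maximum.accumulate(equity)          # running high-water mark
    maxdd = float((equity / peaks - 1.0).min())    # worst drop from a peak
    years = daily_rets.size / 252.0
    cagr = float(equity[-1] ** (1.0 / years) - 1.0) if years > 0 and equity[-1] > 0 else 0.0
    sharpe = float(daily_rets.mean() / (daily_rets.std() + 1e-8) * np.sqrt(252))
    return equity, {"CAGR": cagr, "MaxDD": maxdd, "Sharpe": sharpe}

equity, m = metrics_sketch([0.10, -0.50, 0.20])
print(equity, m)
```

Equity moves 1.10, 0.55, 0.66, so the drawdown from the 1.10 peak is -50%. With only three days the annualized CAGR is close to -1, which shows how extreme short samples look once annualized.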


## Params dump & outputs

In [None]:
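import datetime as dt

def save_run(seed, params, metrics: dict):
    """Write seed, params and metrics to OUTPUT_DIR/run.json with a UTC timestamp."""
    run = {
        # timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12
        'timestamp': dt.datetime.now(dt.timezone.utc).isoformat().replace('+00:00', 'Z'),
        'seed': seed,
        'params': params,
        'metrics': metrics,
    }
    run_path = os.path.join(OUTPUT_DIR, 'run.json')
    with open(run_path, 'w') as f:
        json.dump(run, f, indent=2)
    print('Saved', run_path)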

print('Output saving stub ready.')


---
### Wiring plan (streaming + Bootstrap; in this order)
1) Setup & seed
   - Imports, SEED, paths (PARQUET_DIR, OUTPUT_DIR), PARAMS defined.

2) Bootstrap A (resume-safe; no model deps)
   - Mount Drive.
   - Load params_locked.json (if present) and merge into PARAMS.
   - Load feature_norm.csv → define normalize_inplace.
   - Build train_paths / val_paths / test_paths using lookback+2 filter.

3) (Optional) Local overrides for this session
   - e.g., portfolio_mode, margin, calibration_T. Only persisted if you choose to lock later.

4) Model definition
   - Define build_model(...). No training here.

5) Bootstrap B (after model exists)
   - Load best_model.pt if present; set device; expose model_probs(...).
   - Quick smoke test on 1 ticker to verify shapes.

6) (Optional) Training (single, stabilized loop)
   - Use the streaming TickerIterableDataset + DataLoader.
   - Save checkpoint to outputs/best_model.pt when improved.

7) Inference & backtest utilities (single canonical set)
   - sanitize_returns, positions_from_probs, compute_metrics, etc.

8) Evaluation
   - Run VAL (then TEST) backtests with costs; choose long_only or long_short per PARAMS.
   - (Optional) Temperature scaling on VAL; update PARAMS['calibration_T'] when happy.

9) Exports
   - Save metrics, equity/trades CSVs, and a concise run.json.
   - Lock params to params_locked.json only when you intend to “freeze” a run.

10) Utilities (run once; guarded)
   - CSV→Parquet builder, yfinance fetches, or norms regeneration from train_paths.
   - Keep behind RUN_ONCE to avoid accidental side effects.
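
Steps 2 and 10 of the plan can be sketched roughly as follows. This is a minimal illustration under stated assumptions: `load_locked_params` is a hypothetical helper, the merge is a shallow `dict.update`, the `RUN_ONCE` handling is one possible guard, and a temporary directory stands in for `OUTPUT_DIR`.

```python
import json
import os
import tempfile

def load_locked_params(params, lock_path):
    """Return params with params_locked.json merged in, if the file exists (hypothetical helper)."""
    merged = dict(params)
    if os.path.exists(lock_path):
        with open(lock_path) as f:
            merged.update(json.load(f))  # shallow merge: nested dicts are replaced whole
    return merged

# Stand-in for OUTPUT_DIR so the sketch runs anywhere
out_dir = tempfile.mkdtemp()
lock_path = os.path.join(out_dir, "params_locked.json")

defaults = {"margin": 0.15, "portfolio_mode": "long_short", "calibration_T": 1.0}
with open(lock_path, "w") as f:
    json.dump({"margin": 0.02, "portfolio_mode": "long_only"}, f)

params = load_locked_params(defaults, lock_path)

RUN_ONCE = False  # flip to True deliberately; guards side-effecting utilities
if RUN_ONCE:
    print("would rebuild parquet files or normalization stats here")

print(params)
```

Locked values win over the defaults and keys absent from the lock file keep their default. Since the merge is shallow, a locked `split` dict would replace the default one entirely, which is worth keeping in mind when only one date boundary changes.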
