These notebooks implement the models described in `hierachial_nueral_network.pdf` ("Stock Type Prediction Model Based on Hierarchical Graph Neural Network").

Important: this repo's dataset does NOT include sector/industry labels, and these notebooks do NOT add or require that data.

We adapt the PDF setup while preserving the architecture:
- Task: predict next-day direction `y_up_fwd = 1[ret_1d(t+1) > 0]` for each asset.
- Graph: build a stock relationship graph from training-window return correlations (k-NN graph) instead of industry labels.
- Split: time-wise split (train = first 7 years, test = last 18 months, val = everything in between).
- Trading: convert model probabilities into weekly Top-K long-only weights and run the existing backtester + existing Bokeh dashboard.

GPU note (AMD): on a ROCm build of PyTorch with a supported GPU, `torch.cuda.is_available()` returns True and these notebooks use the GPU automatically; otherwise they fall back to the CPU.


## HGNN (Hierarchical Graph Neural Network) - Math (from PDF)

At each date t, each stock node has an "own" representation $e^t_s$ (node view), a relation/industry representation $a^t_s$ (relation view), and a macro market representation $g^t$ (macro view).

### Node view (temporal encoder)
We encode the length-T feature sequence of stock s ending at time t with an LSTM and take the last hidden state as the node representation $e^t_s$.

### Relation view (graph convolution unit)
Given a stock relationship graph with neighborhoods $N_s$ and node degrees $\deg(\cdot)$, the PDF uses a symmetrically normalized neighbor aggregation (self loop included):

$$a^t_s = \sum_{j\in N_s \cup \{s\}} \frac{\Pi e^t_j}{\sqrt{\deg(j)\deg(s)}}.$$

This is a GCN-style update with learnable weight matrix $\Pi$.
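
The sum above can be checked against the matrix form $D^{-1/2}(A+I)D^{-1/2}\,E\,\Pi^\top$ that the code cells use. The following is a minimal NumPy sketch on a made-up 4-node graph; the graph, the random embeddings and the random `Pi` are illustrative only, and the degree is assumed to count the self loop (as in the graph-building cell of this notebook).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy undirected graph on 4 stocks (1 = edge), without self loops yet.
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 1, 0],
], dtype=np.float64)
A_self = A + np.eye(4)          # neighborhoods N_s plus the node itself
deg = A_self.sum(axis=1)        # deg(s), counted with the self loop (assumption)

H = 3
E = rng.normal(size=(4, H))     # node embeddings e_s, one row per stock
Pi = rng.normal(size=(H, H))    # stand-in for the learnable matrix Pi

# Matrix form: D^{-1/2} (A + I) D^{-1/2} (E Pi^T)
d_inv_sqrt = 1.0 / np.sqrt(deg)
A_hat = A_self * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
a_matrix = A_hat @ (E @ Pi.T)

# Explicit per-node sum, term by term as in the formula
a_loop = np.zeros_like(a_matrix)
for s in range(4):
    for j in np.flatnonzero(A_self[s]):
        a_loop[s] += (Pi @ E[j]) / np.sqrt(deg[j] * deg[s])

print(np.allclose(a_matrix, a_loop))
```

Both routes give the same $a_s$, which is why the model can apply one dense matrix product per batch instead of looping over neighbors.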

### Macro view (temporal attention aggregator)
Compute attention weights over nodes at time t:

$$\eta^t_s = p_a^\top\, \phi(Q_a a^t_s + b_a),\quad
w^t_s=\frac{\exp(\eta^t_s)}{\sum_j \exp(\eta^t_j)}.$$

Macro state:

$$g^t = \sum_s w^t_s\, a^t_s.$$
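
As a rough illustration of the attention pooling, here is a NumPy sketch for a single date. It assumes $\phi = \tanh$ (the activation used in the `MacroAttention` class of this notebook); the function name `macro_state` and the random inputs are hypothetical.

```python
import numpy as np

def macro_state(a, Q, b, p):
    """a: [N, H] relation-view embeddings for one date.
    Returns the attention weights [N] and the macro state g [H]."""
    eta = np.tanh(a @ Q.T + b) @ p      # eta_s = p^T tanh(Q a_s + b)
    eta = eta - eta.max()               # shift for numerical stability
    w = np.exp(eta) / np.exp(eta).sum() # softmax over the N stocks
    g = w @ a                           # attention-weighted sum of the a_s
    return w, g

rng = np.random.default_rng(0)
N, H = 5, 4
a = rng.normal(size=(N, H))
Q = rng.normal(size=(H, H))
b = rng.normal(size=H)
p = rng.normal(size=H)

w, g = macro_state(a, Q, b, p)
print(w.sum(), g.shape)
```

Because the weights are positive and sum to one, $g^t$ is a convex combination of the per-stock vectors; with `p = 0` it reduces to their plain average.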

### Hierarchical fusion

$$H^t_s = [e^t_s \oplus a^t_s \oplus g^t].$$

Classification:

$$\hat y^t_s = \sigma(q^\top H^t_s + b).$$

We also implement the PDF ablations:
- `HGNN_I`: fusion of node+relation only: $[e \oplus a]$
- `HGNN_M`: fusion of node+macro only: $[e \oplus g]$
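
The fusion and its two ablations differ only in which views are concatenated, so the classifier input width is $3H$ for the full model and $2H$ for `HGNN_I` and `HGNN_M`. A small NumPy sketch of the shapes follows; the helper `fuse` is hypothetical and not part of the notebook.

```python
import numpy as np

def fuse(e, a=None, g=None):
    """Concatenate the available views along the feature axis.
    e, a: [N, H]; g: [H], shared by all stocks and broadcast to [N, H]."""
    parts = [e]
    if a is not None:
        parts.append(a)
    if g is not None:
        parts.append(np.broadcast_to(g, e.shape))
    return np.concatenate(parts, axis=-1)

N, H = 6, 4
e = np.ones((N, H))
a = 2.0 * np.ones((N, H))
g = 3.0 * np.ones(H)

full = fuse(e, a, g)      # [e, a, g] -> [N, 3H]
hgnn_i = fuse(e, a=a)     # [e, a]    -> [N, 2H]
hgnn_m = fuse(e, g=g)     # [e, g]    -> [N, 2H]
print(full.shape, hgnn_i.shape, hgnn_m.shape)
```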


## Labels and Backtest Signal

Forward return label per asset:

$$y_t = r_{t+1}.$$

Binary direction:

$$y^{\uparrow}_t = \mathbb{1}[y_t > 0].$$
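
A tiny worked example of the two label definitions, using made-up returns for two assets. The notebook builds the same labels per asset with a `groupby(...).shift(-1)`; this sketch uses a dense `[dates, assets]` array instead.

```python
import numpy as np

# Toy daily returns: rows are dates t = 0..3, columns are two assets.
R = np.array([
    [ 0.01, -0.02],
    [-0.03,  0.00],
    [ 0.02,  0.01],
    [ 0.00, -0.01],
])

y_fwd = R[1:]                    # y_t = r_{t+1}, defined for t = 0..T-2 only
y_up = (y_fwd > 0).astype(int)   # strict inequality: a zero return is "not up"
print(y_up.tolist())             # [[0, 0], [1, 1], [0, 0]]
```

The last date has no next-day return, so it gets no label and drops out of the sample.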

Score for trading from predicted probability $p_{i,t}$:

$$s_{i,t} = p_{i,t} - 0.5.$$

Weekly Top-K long-only portfolio on rebalance date t:

$$\mathcal{L}_t = \{ i : s_{i,t} > 0 \}\ \cap\ \text{TopK}(s_{\cdot,t}),\quad
w_{i,t} = \frac{1}{|\mathcal{L}_t|}\mathbb{1}[i\in\mathcal{L}_t].$$
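
The selection rule can be sketched for a single rebalance date as below. The function name `topk_long_only_weights` is hypothetical; the notebook's own cell does the same selection with pandas and the repo's `equal_weight` helper.

```python
import numpy as np

def topk_long_only_weights(scores, k):
    """scores: 1-D array of s_i = p_i - 0.5 for one rebalance date.
    Returns equal weights over the Top-K names that also have s_i > 0."""
    scores = np.asarray(scores, dtype=float)
    # Indices of the K largest scores (NaNs sort to the end).
    order = np.argsort(-scores, kind="stable")[:k]
    chosen = [i for i in order if np.isfinite(scores[i]) and scores[i] > 0.0]
    w = np.zeros_like(scores)
    if chosen:
        w[chosen] = 1.0 / len(chosen)
    return w

w = topk_long_only_weights([0.10, -0.05, 0.02, 0.07, -0.01], k=3)
print(w.tolist())
```

When fewer than K names have a positive score, the portfolio simply holds fewer names; when none do, all weights are zero and the portfolio stays in cash for that week.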


In [1]:
from __future__ import annotations

from pathlib import Path
import sys

import numpy as np
import pandas as pd

import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

from bokeh.io import output_notebook, show


In [2]:
CWD = Path.cwd().resolve()
PROJECT_ROOT = None
for p in [CWD, *CWD.parents]:
    if (p / 'dataset').exists() and (p / 'src').exists():
        PROJECT_ROOT = p
        break
if PROJECT_ROOT is None:
    raise RuntimeError(f'Could not locate project root from CWD={CWD}')
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))
print('PROJECT_ROOT:', PROJECT_ROOT)


PROJECT_ROOT: /home/anivarth/college/quant-task


In [3]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('torch:', torch.__version__)
print('device:', device)


torch: 2.10.0+cpu
device: cpu


In [4]:
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score


In [5]:
SEED = 42
rng = np.random.default_rng(SEED)

FEATURES_PARQUET_PATH = PROJECT_ROOT / 'dataset' / 'features' / 'all_features.parquet'
FEATURES_CSV_PATH = PROJECT_ROOT / 'dataset' / 'features' / 'all_features.csv'

TARGET_COL = 'ret_1d'
TARGET_FWD_COL = 'y_ret_1d_fwd'
TARGET_DIR_COL = 'y_up_fwd'

TRAIN_YEARS = 7
TEST_MONTHS = 18

REBALANCE_FREQ = 'W'
TOP_K = 20
INITIAL_EQUITY = 1_000_000.0
TXN_COST_BPS = 5.0


def load_feature_dataset() -> pd.DataFrame:
    if FEATURES_PARQUET_PATH.exists():
        df = pd.read_parquet(FEATURES_PARQUET_PATH)
        if 'Date' in df.columns:
            df['Date'] = pd.to_datetime(df['Date'])
            df = df.set_index('Date')
    elif FEATURES_CSV_PATH.exists():
        df = pd.read_csv(FEATURES_CSV_PATH, parse_dates=['Date']).set_index('Date')
    else:
        raise FileNotFoundError('Feature dataset not found under dataset/features/.')

    required = {'Asset_ID', TARGET_COL}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f'Missing required columns: {sorted(missing)}')

    return df.sort_index()

panel = load_feature_dataset().copy()

panel[TARGET_FWD_COL] = panel.groupby('Asset_ID', sort=False)[TARGET_COL].shift(-1)
panel = panel.dropna(subset=[TARGET_FWD_COL]).sort_index()
panel[TARGET_DIR_COL] = (panel[TARGET_FWD_COL].astype(float) > 0.0).astype(int)

start = pd.Timestamp(panel.index.min())
end = pd.Timestamp(panel.index.max())
train_end = start + pd.DateOffset(years=TRAIN_YEARS)
test_start = end - pd.DateOffset(months=TEST_MONTHS)
if train_end >= test_start:
    raise ValueError('Not enough history for requested split')

train_mask = panel.index < train_end
val_mask = (panel.index >= train_end) & (panel.index < test_start)
test_mask = panel.index >= test_start

panel_train = panel.loc[train_mask].copy()
panel_val = panel.loc[val_mask].copy()
panel_test = panel.loc[test_mask].copy()

print('date range:', start.date(), '->', end.date())
print('train:', panel_train.index.min().date(), '->', panel_train.index.max().date(), 'rows:', panel_train.shape[0])
print('val  :', panel_val.index.min().date(), '->', panel_val.index.max().date(), 'rows:', panel_val.shape[0])
print('test :', panel_test.index.min().date(), '->', panel_test.index.max().date(), 'rows:', panel_test.shape[0])
print('assets:', panel['Asset_ID'].nunique())

exclude = {'Asset_ID', TARGET_FWD_COL, TARGET_DIR_COL}
feature_cols = [c for c in panel.columns if c not in exclude]
numeric_feature_cols = [c for c in feature_cols if pd.api.types.is_numeric_dtype(panel[c])]
print('n_numeric_features:', len(numeric_feature_cols))


date range: 2016-01-25 -> 2026-01-15
train: 2016-01-25 -> 2023-01-24 rows: 176300
val  : 2023-01-25 -> 2024-07-12 rows: 36800
test : 2024-07-15 -> 2026-01-15 rows: 37900
assets: 100
n_numeric_features: 127


In [6]:
from src.backtester.data import load_cleaned_assets, align_close_prices

# Build stock relationship graph from TRAIN-window return correlations (k-NN graph)
K_NEIGHBORS = 8

assets_ohlcv = load_cleaned_assets(cleaned_dir=str(PROJECT_ROOT / 'dataset' / 'cleaned'))
close_prices_full = align_close_prices(assets_ohlcv).sort_index()

bt_assets = sorted(close_prices_full.columns.tolist())
N = len(bt_assets)

# Align to training window
bt_start_all = pd.Timestamp(panel.index.min())
bt_end_all = pd.Timestamp(panel.index.max())
train_start = pd.Timestamp(panel_train.index.min())
train_end = pd.Timestamp(panel_train.index.max())

cp_train = close_prices_full.loc[train_start:train_end]
ret = cp_train.pct_change().dropna(how='all')
ret = ret.dropna(axis=1, how='any')

# Ensure we keep consistent asset ordering
bt_assets = ret.columns.tolist()
N = len(bt_assets)

corr = ret.corr().fillna(0.0)

A = np.zeros((N, N), dtype=np.float32)
for i in range(N):
    c = corr.iloc[i].abs().to_numpy().copy()
    c[i] = -1.0
    nn_idx = np.argsort(-c)[:K_NEIGHBORS]
    A[i, nn_idx] = 1.0

# Symmetrize + self loops
A = np.maximum(A, A.T)
np.fill_diagonal(A, 1.0)

# Normalize: D^{-1/2} A D^{-1/2}
deg = A.sum(axis=1)
deg_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
A_hat = (A * deg_inv_sqrt[:, None]) * deg_inv_sqrt[None, :]
A_hat_t = torch.tensor(A_hat, dtype=torch.float32, device=device)

print('graph assets:', N)
print('avg degree:', float((A > 0).sum(axis=1).mean()))


graph assets: 100
avg degree: 13.52
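
The reported average degree (13.52) is larger than `K_NEIGHBORS + 1 = 9` because the graph is symmetrized: node i keeps an edge to j whenever either of them picked the other, and the self loop adds one more. The sketch below reproduces that construction on synthetic returns; `knn_adjacency` is a hypothetical helper and the random data is illustrative only.

```python
import numpy as np

def knn_adjacency(corr, k):
    """Symmetrized k-NN adjacency with self loops from a correlation matrix."""
    n = corr.shape[0]
    A = np.zeros((n, n))
    for i in range(n):
        c = np.abs(corr[i]).copy()
        c[i] = -1.0                       # never pick the node itself
        A[i, np.argsort(-c)[:k]] = 1.0    # directed edges from i to its k nearest
    A = np.maximum(A, A.T)                # keep an edge if either endpoint chose it
    np.fill_diagonal(A, 1.0)              # self loops
    return A

rng = np.random.default_rng(0)
returns = rng.normal(size=(250, 12))      # synthetic daily returns, 12 assets
A = knn_adjacency(np.corrcoef(returns, rowvar=False), k=3)
deg = A.sum(axis=1)
print(deg.min(), deg.mean())
```

Every node ends up with degree at least `k + 1`; nodes that many others select as a neighbor can have a much higher degree.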


In [7]:
# Build dense tensors X[dates, assets, features] and y[dates, assets]

from sklearn.feature_selection import mutual_info_classif

N_FEATURES = 16
LOOKBACK = 20

# Select features using MI on training rows
X_fs = panel_train[numeric_feature_cols].replace([np.inf, -np.inf], np.nan)
y_fs = panel_train[TARGET_DIR_COL].astype(int)
med = X_fs.median(axis=0)
X_imp = X_fs.fillna(med)

mi = mutual_info_classif(X_imp.to_numpy(), y_fs.to_numpy(), random_state=SEED)
mi_s = pd.Series(mi, index=numeric_feature_cols).sort_values(ascending=False)
sel_features = mi_s.head(N_FEATURES).index.tolist()
print('selected features:', sel_features)

# Pivot to [dates, assets]
dates = pd.Index(sorted(panel.index.unique()))
assets = pd.Index(sorted(panel['Asset_ID'].unique()))

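# Restrict to the graph assets and adopt the graph's ordering (bt_assets), so that
# column j of X_all / y_all refers to the same asset as row j of A_hat
panel_assets = set(assets)
assets = pd.Index([a for a in bt_assets if a in panel_assets])
if len(assets) != N:
    raise ValueError(f'Graph has {N} assets but only {len(assets)} appear in the feature panel')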

# Build feature tensor
feat_list = []
for f in sel_features:
    m = panel.pivot_table(index=panel.index, columns='Asset_ID', values=f, aggfunc='mean').reindex(index=dates, columns=assets)
    feat_list.append(m.to_numpy(dtype=np.float32))

X_all = np.stack(feat_list, axis=-1)  # [T, N, F]

# Labels
y_mat = panel.pivot_table(index=panel.index, columns='Asset_ID', values=TARGET_DIR_COL, aggfunc='mean').reindex(index=dates, columns=assets)
y_all = y_mat.to_numpy(dtype=np.float32)  # [T, N]

# Train normalization on train window
train_dates = dates[dates < pd.Timestamp(panel_train.index.max()) + pd.Timedelta(days=1)]
train_mask_dates = (dates >= pd.Timestamp(panel_train.index.min())) & (dates <= pd.Timestamp(panel_train.index.max()))

x_tr = X_all[train_mask_dates]
med_f = np.nanmedian(x_tr, axis=(0, 1))
mean_f = np.nanmean(np.where(np.isnan(x_tr), med_f, x_tr), axis=(0, 1))
std_f = np.nanstd(np.where(np.isnan(x_tr), med_f, x_tr), axis=(0, 1))
std_f = np.where(std_f == 0, 1.0, std_f)

X_all = np.where(np.isnan(X_all), med_f, X_all)
X_all = (X_all - mean_f) / std_f

# Date masks
# Inclusive of the last training date, consistent with train_mask_dates and train_idx
train_dates_mask = dates <= pd.Timestamp(panel_train.index.max())
val_dates_mask = (dates >= pd.Timestamp(panel_val.index.min())) & (dates <= pd.Timestamp(panel_val.index.max()))
test_dates_mask = (dates >= pd.Timestamp(panel_test.index.min())) & (dates <= pd.Timestamp(panel_test.index.max()))

# Use explicit boundaries (consistent with panel split)
train_end = pd.Timestamp(panel_train.index.max())
val_start = pd.Timestamp(panel_val.index.min())
val_end = pd.Timestamp(panel_val.index.max())
test_start = pd.Timestamp(panel_test.index.min())

def date_to_idx(dt: pd.Timestamp) -> int:
    return int(dates.get_loc(dt))

train_end_i = date_to_idx(train_end)
val_start_i = date_to_idx(val_start)
val_end_i = date_to_idx(val_end)
test_start_i = date_to_idx(test_start)

# Sample indices are based on the *prediction date* (label is already forward)
train_idx = list(range(LOOKBACK - 1, train_end_i + 1))
val_idx = list(range(max(LOOKBACK - 1, val_start_i), val_end_i + 1))
test_idx = list(range(max(LOOKBACK - 1, test_start_i), len(dates)))

print('n_train_dates:', len(train_idx), 'n_val_dates:', len(val_idx), 'n_test_dates:', len(test_idx))


selected features: ['diff_log_close_2', 'excess_ret_1d', 'macd_macd', 'logret_roll_std_20', 'diff_log_close_1', 'logret_1d', 'ret_1d', 'realized_vol_20', 'close_minmax_20', 'logret_roll_var_20', 'filt_close_wiener_5', 'close', 'macd_macd_hist', 'ret_lag_1', 'logret_lag_1', 'bos']
n_train_dates: 1744 n_val_dates: 368 n_test_dates: 379
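
Each sample is a look-back window that ends on the prediction date, so the first `LOOKBACK - 1` dates cannot be used: 1763 training dates (176300 rows / 100 assets) minus 19 gives the 1744 printed above. The following sketch shows the same windowing on a small made-up tensor using NumPy's `sliding_window_view`; the sizes are arbitrary.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

T, N, F, L = 10, 3, 2, 4   # dates, assets, features, look-back (toy sizes)
X = np.arange(T * N * F, dtype=np.float32).reshape(T, N, F)

# sliding_window_view appends the window axis last: [T-L+1, N, F, L].
# Move it to axis 1 to get [T-L+1, L, N, F], one window per prediction date.
windows = np.moveaxis(sliding_window_view(X, L, axis=0), -1, 1)

# Window k ends at date index i = k + L - 1, i.e. it equals X[i-L+1 : i+1].
i = 6
same = np.array_equal(windows[i - (L - 1)], X[i - L + 1 : i + 1])
print(windows.shape, same)   # (7, 4, 3, 2) True
```

The notebook slices the window on demand inside the dataset class rather than materializing all windows, which avoids holding `LOOKBACK` copies of the data in memory.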


In [8]:
HIDDEN = 32
BATCH_DATES = 16
EPOCHS = 3
LR = 1e-3

class DateGraphDataset(Dataset):
    def __init__(self, idxs):
        self.idxs = idxs
    def __len__(self):
        return len(self.idxs)
    def __getitem__(self, k):
        i = self.idxs[k]  # index of the prediction date
        X_seq = X_all[i - LOOKBACK + 1 : i + 1]  # [L, N, F], window ending at date i
        y = y_all[i]  # [N], next-day direction labels for date i
        return (
            torch.tensor(X_seq, dtype=torch.float32),
            torch.tensor(y, dtype=torch.float32),
        )

def make_loader(idxs, shuffle: bool):
    return DataLoader(DateGraphDataset(idxs), batch_size=BATCH_DATES, shuffle=shuffle)

train_loader = make_loader(train_idx, True)
val_loader = make_loader(val_idx, False)


In [9]:
# HGNN definitions (NodeLSTMEncoder + MacroAttention + HGNN) - no training in this cell

# Shared node temporal encoder used by graph models
class NodeLSTMEncoder(nn.Module):
    def __init__(self, f_in: int, hidden: int):
        super().__init__()
        self.lstm = nn.LSTM(input_size=f_in, hidden_size=hidden, num_layers=1, batch_first=True)

    def forward(self, x_seq):
        # x_seq: [B, L, N, F]
        B, L, N, F = x_seq.shape
        x = x_seq.permute(0, 2, 1, 3).contiguous().view(B * N, L, F)
        out, _ = self.lstm(x)
        h = out[:, -1, :].view(B, N, -1)  # [B, N, H]
        return h


class MacroAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.Q = nn.Linear(dim, dim)
        self.p = nn.Linear(dim, 1, bias=False)
        self.act = nn.Tanh()

    def forward(self, a):
        # a: [B,N,H]
        eta = self.p(self.act(self.Q(a))).squeeze(-1)  # [B,N]
        w = torch.softmax(eta, dim=-1)  # [B,N]
        g = (a * w.unsqueeze(-1)).sum(dim=1)  # [B,H]
        return g


class HGNN(nn.Module):
    def __init__(self, f_in: int, hidden: int, *, use_relation: bool, use_macro: bool):
        super().__init__()
        self.use_relation = use_relation
        self.use_macro = use_macro
        self.enc = NodeLSTMEncoder(f_in, hidden)
        self.pi = nn.Linear(hidden, hidden, bias=False)
        self.macro = MacroAttention(hidden)
        out_in = hidden
        if use_relation:
            out_in += hidden
        if use_macro:
            out_in += hidden
        self.out = nn.Linear(out_in, 1)

    def forward(self, x_seq):
        e = self.enc(x_seq)  # [B,N,H]
        feats = [e]

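        if self.use_relation:
            # A_hat_t is [N,N]; matmul broadcasts it over the batch dimension
            a = torch.matmul(A_hat_t, self.pi(e))  # [B,N,H]
            feats.append(a)
        else:
            # No relation view (e.g. HGNN_M): the macro attention pools the node embeddings directly
            a = e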

        if self.use_macro:
            g = self.macro(a)  # [B,H]
            g_rep = g.unsqueeze(1).expand(e.shape[0], e.shape[1], g.shape[-1])
            feats.append(g_rep)

        h = torch.cat(feats, dim=-1)
        logits = self.out(h).squeeze(-1)
        return logits


In [10]:
# HGNN_I: node + relation only
model = HGNN(f_in=len(sel_features), hidden=HIDDEN, use_relation=True, use_macro=False).to(device)
opt = torch.optim.Adam(model.parameters(), lr=LR)
loss_fn = nn.BCEWithLogitsLoss()

for epoch in range(1, EPOCHS + 1):
    model.train()
    for x_seq, y in train_loader:
        x_seq = x_seq.to(device)
        y = y.to(device)
        opt.zero_grad(set_to_none=True)
        logits = model(x_seq)
        loss = loss_fn(logits, y)
        loss.backward()
        opt.step()

MODEL_TITLE = 'HGNN_I (node+relation) - Time Split'


In [11]:
# Predict on test dates and flatten to long for backtesting

model.eval()
all_dates = []
all_assets = []
all_scores = []

with torch.no_grad():
    for i in test_idx:
        X_seq = torch.tensor(X_all[i - LOOKBACK + 1 : i + 1], dtype=torch.float32).unsqueeze(0).to(device)  # [1,L,N,F]
        logits = model(X_seq).squeeze(0).cpu().numpy()  # [N]
        proba = 1.0 / (1.0 + np.exp(-logits))
        score = proba - 0.5

        dt = pd.Timestamp(dates[i])
        all_dates.append(np.repeat(np.datetime64(dt.to_datetime64()), len(assets)))
        all_assets.append(np.array(assets, dtype=object))
        all_scores.append(score)

dates_test = np.concatenate(all_dates)
assets_test = np.concatenate(all_assets)
scores_test = np.concatenate(all_scores)


In [12]:
from src.backtester.data import load_cleaned_assets, align_close_prices
from src.backtester.engine import BacktestConfig, run_backtest
from src.backtester.report import compute_backtest_report
from src.backtester.bokeh_plots import build_interactive_portfolio_layout
from src.backtester.portfolio import equal_weight

long = pd.DataFrame({'Date': dates_test, 'Asset_ID': assets_test, 'score': scores_test})
pred_matrix = long.pivot_table(index='Date', columns='Asset_ID', values='score', aggfunc='mean').sort_index()

bt_assets = sorted(panel['Asset_ID'].unique().tolist())
assets_ohlcv = load_cleaned_assets(symbols=bt_assets, cleaned_dir=str(PROJECT_ROOT / 'dataset' / 'cleaned'))
close_prices = align_close_prices(assets_ohlcv).sort_index()

bt_start = pd.Timestamp(panel_test.index.min())
bt_end = pd.Timestamp(panel_test.index.max())
close_prices = close_prices.loc[bt_start:bt_end]

assets_ohlcv_slice = {k: v.loc[bt_start:bt_end] for k, v in assets_ohlcv.items()}
market_df = pd.DataFrame({
    'Open': pd.concat([d['Open'] for d in assets_ohlcv_slice.values()], axis=1).mean(axis=1),
    'High': pd.concat([d['High'] for d in assets_ohlcv_slice.values()], axis=1).mean(axis=1),
    'Low': pd.concat([d['Low'] for d in assets_ohlcv_slice.values()], axis=1).mean(axis=1),
    'Close': pd.concat([d['Close'] for d in assets_ohlcv_slice.values()], axis=1).mean(axis=1),
    'Volume': pd.concat([d['Volume'] for d in assets_ohlcv_slice.values()], axis=1).sum(axis=1),
}).sort_index()

pred_matrix = pred_matrix.reindex(close_prices.index)
output_notebook()

rebal_dates = set(pd.Series(pred_matrix.index, index=pred_matrix.index).resample(REBALANCE_FREQ).last().dropna().tolist())

w_last = pd.Series(0.0, index=bt_assets)
w_rows = []
for dt in pred_matrix.index:
    if dt in rebal_dates:
        row = pred_matrix.loc[dt].dropna().sort_values(ascending=False)
        top = row.head(min(TOP_K, len(row)))
        candidates = [a for a, v in top.items() if np.isfinite(v) and float(v) > 0.0]
        if len(candidates) == 0:
            w_last = pd.Series(0.0, index=bt_assets)
        else:
            w_dict = equal_weight(candidates)
            w_last = pd.Series(0.0, index=bt_assets)
            for a, w in w_dict.items():
                if a in w_last.index:
                    w_last[a] = float(w)
    w_rows.append(w_last)

weights = pd.DataFrame(w_rows, index=pred_matrix.index, columns=bt_assets).fillna(0.0)

cfg = BacktestConfig(initial_equity=INITIAL_EQUITY, transaction_cost_bps=TXN_COST_BPS, mode='vectorized')
res = run_backtest(close_prices, weights, config=cfg)
report = compute_backtest_report(result=res, close_prices=close_prices)
display(report.to_frame(MODEL_TITLE))

layout = build_interactive_portfolio_layout(
    market_ohlcv=market_df,
    equity=res.equity,
    returns=res.returns,
    weights=res.weights,
    turnover=res.turnover,
    costs=res.costs,
    close_prices=close_prices,
    title=MODEL_TITLE,
)
show(layout)


| Metric | HGNN_I (node+relation) - Time Split |
|---|---|
| Start | 2024-07-19 00:00:00 |
| End | 2026-01-15 00:00:00 |
| Duration | 545 days 00:00:00 |
| Initial Equity | 999500.0 |
| Final Equity | 1280139.57151 |
| Equity Peak | 1285024.497098 |
| Total Return [%] | 28.077996 |
| CAGR [%] | 18.145175 |
| Volatility (ann) [%] | 21.344071 |
| Sharpe | 0.882607 |
