### Resources:
- https://github.com/Samvid95/AlgoTradingRepoList

- https://www.google.com/amp/s/blog.mlq.ai/deep-reinforcement-learning-for-trading-with-tensorflow-2-0/amp/


Quant Radio Podcast Ideas:
- Transformer model
  - really good at long sequences of time series data
  - Recency important - continuously retrain your model
- The bitter lesson 
  - https://www.quantitativo.com/p/the-bitter-lesson?utm_source=substack&utm_medium=web&utm_content=embedded-post&triedRedirect=true
  - Don't just throw everything at the model, LESS IS MORE, pick a method and train using that style (e.g. momentum)
- Learning to Rank Stocks
  - https://www.quantitativo.com/p/learning-to-rank


# Setup Raw Data
  - Import CSV of High, Low, Open, Close, Volume for 5-10 stocks
  - Remove stocks with share price <$5
  - Remove stocks with volatility indicator >X

# Model 1:
  - Simple bulk model
    - Leverage the Bitter Lesson
    - Maximum data throughput
    - minimal human intervention 

  #### Inputs/Pre-processing:
    - 20 log daily returns
    - 12 log monthly returns (t-13 -> t-2)
    - 1 January flag
    - Convert returns to cumulative returns and z-scores for cross-sectional normalization
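
A minimal sketch of the return inputs listed above, assuming a plain series of closing prices. The helper name `log_return_features` and the toy prices are hypothetical. It relies on the fact that a cumulative log return over `d` days is the sum of the daily log returns, i.e. the difference of log prices.

```python
import numpy as np
import pandas as pd

def log_return_features(close: pd.Series, n_days: int = 20) -> pd.DataFrame:
    """Cumulative log returns over the last 1..n_days trading days (hypothetical helper)."""
    log_close = np.log(close)
    # log(P_t / P_{t-d}) equals the sum of the d most recent daily log returns
    feats = {f"logret-d{d}": log_close - log_close.shift(d) for d in range(1, n_days + 1)}
    return pd.DataFrame(feats).dropna()

# Toy prices, purely illustrative
close = pd.Series([100.0, 101.0, 99.0, 102.0, 104.0])
feats = log_return_features(close, n_days=3)
print(feats)
```

Note that the notebook code further down uses simple returns (`P_t / P_{t-d} - 1`) rather than log returns; for small moves the two are close.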
  
  #### The Model:
    - Stacked RBMs to form an autoencoder, pretrained layer-wise
    - Compress input to a low-dim feature space
    - Pass to feedforward Neural Network for classification
    - grid search with hold-out validation
    - Final Architecture:
      - 33-40-4-50-2
      - 33 inputs, compressed to 4, classified into 2 classes
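
As a quick sanity check on the size of the 33-40-4-50-2 network, the sketch below counts weights and biases of a fully connected stack. The helper `dense_param_count` is hypothetical and assumes every layer has a bias term.

```python
def dense_param_count(layer_sizes):
    """Number of weights plus biases in a fully connected stack (bias on every layer assumed)."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

total = dense_param_count([33, 40, 4, 50, 2])
print(total)  # 1876
```

With fewer than two thousand parameters the model is small, which suits a dataset of a few thousand stock-month rows.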

  #### Target/Classification
    - Class 1: next-month return above the cross-sectional median
    - Class 0: next-month return at or below the cross-sectional median




# Model 2: Momentum Indicator Enriched Model 
  - apply Trading indicators
    - Volatility indicator, SMA, EMA, MACD, etc
    - Return single Enriched dataframe for regression
- Same model as above
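
A rough sketch of the enrichment step, assuming a DataFrame with a `Close` column. The function name `add_indicators`, the output column names and the window lengths are assumptions; 12/26/9 are the conventional MACD settings, not values taken from these notes.

```python
import pandas as pd

def add_indicators(df: pd.DataFrame, fast: int = 12, slow: int = 26,
                   signal: int = 9, window: int = 20) -> pd.DataFrame:
    """Return a copy of df with SMA, EMA, MACD and rolling-volatility columns (hypothetical helper)."""
    out = df.copy()
    close = out["Close"]
    out["sma"] = close.rolling(window).mean()                 # simple moving average
    out["ema"] = close.ewm(span=window, adjust=False).mean()  # exponential moving average
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    out["macd"] = ema_fast - ema_slow                         # MACD line
    out["macd_signal"] = out["macd"].ewm(span=signal, adjust=False).mean()
    out["volatility"] = close.pct_change().rolling(window).std()  # rolling std of daily returns
    return out

# Toy, steadily rising prices, purely illustrative
prices = pd.DataFrame({"Close": [float(p) for p in range(100, 160)]})
enriched = add_indicators(prices)
print(enriched.tail(3))
```

The early rows contain NaN until each rolling window fills, so they would need dropping before the features reach the model.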

# Model 3: Ranking Portfolio with LambdaMART
https://www.quantitativo.com/p/learning-to-rank

https://xgboost.readthedocs.io/en/stable/tutorials/learning_to_rank.html

#### Features
  - Compute asset scores using past returns, volatility-normalized indicators and momentum signals
  - Modular framework - can incorporate additional features beyond momentum
  - 6 features are past returns (raw and normalised for different windows), 15 momentum based features (different windows)

#### Score Ranking
  - Apply LambdaMART to rank stocks based on future expected performance

#### Security Selection
  - Long the top decile and short the bottom decile based on the rankings
  - Increasing to 40 quantiles rather than 10 improves results but increases volatility and drawdowns
  - Sharpe ratio peaks at around 30-40 quantiles
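
The selection rule above can be sketched as follows. The column names `date`, `score` and `next_ret` are hypothetical, `score` stands in for whatever the ranker outputs, and equal weighting within each leg is an assumption.

```python
import pandas as pd

def long_short_returns(df: pd.DataFrame, num_quantiles: int = 10) -> pd.Series:
    """Equal-weighted top-minus-bottom quantile return for each date (hypothetical helper)."""
    out = {}
    for date, g in df.groupby("date"):
        # Rank first so tied scores cannot collapse the quantile edges
        q = pd.qcut(g["score"].rank(method="first"), num_quantiles, labels=False)
        long_leg = g.loc[q == num_quantiles - 1, "next_ret"].mean()
        short_leg = g.loc[q == 0, "next_ret"].mean()
        out[date] = long_leg - short_leg
    return pd.Series(out)

# Toy cross-section of 10 stocks on one date, purely illustrative
toy = pd.DataFrame({
    "date": ["2024-01-31"] * 10,
    "score": [float(i) for i in range(1, 11)],
    "next_ret": [0.01 * i for i in range(1, 11)],
})
ls = long_short_returns(toy)
print(ls)
```

Raising `num_quantiles` narrows each leg to fewer names, which is consistent with the note above that finer quantiles raise both returns and volatility.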

#### Volatility scaling
  - Normalize position sizes based on *ex-ante monthly volatility*, targeting 15% annualized volatility
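
A minimal sketch of the scaling arithmetic, assuming monthly volatility is annualized with the square-root-of-time rule (which presumes uncorrelated monthly returns). The function name `vol_scaled_weight` and the 5% example are hypothetical.

```python
import math

def vol_scaled_weight(monthly_vol: float, target_annual_vol: float = 0.15) -> float:
    """Scale factor that brings an ex-ante monthly volatility to the annualized target."""
    annual_vol = monthly_vol * math.sqrt(12)  # assumes uncorrelated monthly returns
    return target_annual_vol / annual_vol

w = vol_scaled_weight(0.05)  # 5% monthly volatility, purely illustrative
print(round(w, 3))
```

A factor below 1 shrinks the position, and a factor above 1 implies leverage, so in practice a cap on the weight is often added.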

#### Re-Training and Rebalancing
  - Rebalance monthly

In [2]:
import csv 
import numpy as np
import pandas as pd
# import tensorflow as tf


In [3]:
def get_data(symbol):
    # Get raw price data (adjusted and unadjusted closes)
    raw_df = pd.read_csv(fr"C:\Users\dougl\Desktop\Personal Projects\Deep Learning Algorithm\RAW DATA\{symbol}")

    raw_df = raw_df.replace(0, np.nan)
    raw_df = raw_df.ffill()         # Forward-fill values to patch any gaps in the time series
    raw_df['Date'] = pd.to_datetime(raw_df['Date'], dayfirst=True)
    raw_df.set_index('Date', inplace=True)  # Set datetime index here

    return raw_df
    
raw_df = get_data("VAS.csv")  # Example symbol, replace with actual file name

def process_data(raw_df, symbol):
    """
    Constructs a feature-rich DataFrame indexed by (date, symbol) for use in predictive models.
    Calculates monthly cumulative returns (ret-m1 to ret-m12) based on end-of-month closes and daily cumulative returns (ret-d1 to ret-d20) over the last 20 trading days.
    Includes the unadjusted close price, a binary flag is_next_jan indicating if the next month is January, and the forward return for the next month (next_month_ret).
    Ensures no missing or infinite values and returns a clean, multi-indexed dataset ready for modeling.
    """
    # Get end-of-month data: last trading day of each month
    eom = raw_df.groupby(pd.Grouper(freq='M')).last()
    eom['is_next_jan'] = (eom.index.month == 12).astype(int)
    eom['next_month_ret'] = eom['Close'].shift(-1) / eom['Close'] - 1

    # Compute monthly cumulative returns over the past 12 months (ret-m1 to ret-m12)
    monthly = pd.DataFrame(index=eom.index)
    for m in range(1, 13):
        monthly[f'ret-m{m}'] = (eom['Close'] / eom['Close'].shift(m)) - 1
    monthly = monthly.dropna()
    monthly = monthly.astype(float)

    # Compute daily cumulative returns over the past 20 trading days (ret-d1 to ret-d20)
    daily = pd.DataFrame(index=raw_df.index)
    for d in range(1, 21):
        daily[f'ret-d{d}'] = (raw_df['Close'] / raw_df['Close'].shift(d)) - 1
    daily = daily.dropna()
    daily = daily.astype(float)

    # Align daily to end-of-month dates (last 20 trading days up to each EOM)
    # For each EOM date, get the last available daily row <= EOM date
    daily_eom = daily.loc[daily.index.isin(eom.index)]
    if len(daily_eom) < len(eom):
        # If not all EOM dates are present in daily, reindex with forward fill
        daily_eom = daily.reindex(eom.index, method='ffill')

    # Merge monthly and daily features with EOM targets and metadata
    df = monthly.join(daily_eom, how='inner') \
                .join(eom[['Close', 'is_next_jan', 'next_month_ret']], how='inner')
    df = df.rename(columns={'Close': 'unadj_close'})

    # Clean: drop rows with any missing or infinite values
    df = df.replace([np.inf, -np.inf], np.nan).dropna()

    # Return None if no usable rows
    if len(df) == 0:
        return None

    # Create MultiIndex (date, symbol)
    df.index = pd.MultiIndex.from_tuples([(d, symbol) for d in df.index], names=["date", "symbol"])

    return df

print(process_data(raw_df, "VAS.csv").columns)


Index(['ret-m1', 'ret-m2', 'ret-m3', 'ret-m4', 'ret-m5', 'ret-m6', 'ret-m7',
       'ret-m8', 'ret-m9', 'ret-m10', 'ret-m11', 'ret-m12', 'ret-d1', 'ret-d2',
       'ret-d3', 'ret-d4', 'ret-d5', 'ret-d6', 'ret-d7', 'ret-d8', 'ret-d9',
       'ret-d10', 'ret-d11', 'ret-d12', 'ret-d13', 'ret-d14', 'ret-d15',
       'ret-d16', 'ret-d17', 'ret-d18', 'ret-d19', 'ret-d20', 'unadj_close',
       'is_next_jan', 'next_month_ret'],
      dtype='object')


Next, loop over every stock file and apply the two functions above.

In [5]:
import os

folder_path = "C:\\Users\\dougl\\Desktop\\Personal Projects\\Deep Learning Algorithm\\RAW DATA\\"



csv_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]

print(csv_files)

full_data = []

for stock in csv_files:
    # print(f"Loading {stock}")
    stock_df = get_data(stock)
    stock_df = process_data(stock_df, stock)
    # If process_data returns None, skip this stock
    if stock_df is not None:
        full_data.append(stock_df)
    # print(f"{stock} Loaded")
    # print("------")

# Combine all stock DataFrames into one
full_data = pd.concat(full_data, axis=0)

print(full_data.shape)

try:
    full_data.to_csv('Full_Dataframe.csv', index=True)
    print("File saved successfully.")
except Exception as e:
    raise RuntimeError(f"Failed to save 'Full_Dataframe.csv': {e}")

print(full_data.columns)

['ALL.csv', 'ALQ.csv', 'ATOM.csv', 'BGL.csv', 'CAR.csv', 'CSL.csv', 'DMP.csv', 'DVP.csv', 'FMG.csv', 'FPH.csv', 'GOLD.csv', 'JHX.csv', 'MIN.csv', 'MQG.csv', 'NWS.csv', 'NXG.csv', 'PDN.csv', 'PME.csv', 'QAN.csv', 'REA.csv', 'RMD.csv', 'SGH.csv', 'STO.csv', 'TWE.csv', 'VAS.csv', 'WCN.csv', 'WDS.csv', 'WHC.csv']
(2648, 35)
File saved successfully.
Index(['ret-m1', 'ret-m2', 'ret-m3', 'ret-m4', 'ret-m5', 'ret-m6', 'ret-m7',
       'ret-m8', 'ret-m9', 'ret-m10', 'ret-m11', 'ret-m12', 'ret-d1', 'ret-d2',
       'ret-d3', 'ret-d4', 'ret-d5', 'ret-d6', 'ret-d7', 'ret-d8', 'ret-d9',
       'ret-d10', 'ret-d11', 'ret-d12', 'ret-d13', 'ret-d14', 'ret-d15',
       'ret-d16', 'ret-d17', 'ret-d18', 'ret-d19', 'ret-d20', 'unadj_close',
       'is_next_jan', 'next_month_ret'],
      dtype='object')


## Pre-Processing

Before training the model, we apply several pre-processing steps to clean the data, standardize the features, and define the target variable.

Filter out low-priced stocks (unadjusted close of $5 or less) as a proxy for removing illiquid and noisy names. A future enhancement could use the VIX and a dollar-volume traded metric instead.

Next, apply **cross-sectional z-score standardization** to every return feature (all columns except the last two, `is_next_jan` and `next_month_ret`), so that each feature has zero mean and unit standard deviation within each date (I prefer normalisation).

The target variable is 1 if the next month's return is above the cross-sectional median return for that date, otherwise 0.

Preserve the last feature (`is_next_jan`) and the original forward return for later analysis. All components are then combined into the single DataFrame for model input. 

In [6]:
# Filter out penny stocks: keep only rows where unadjusted close is 
# greater than $5
raw = full_data[full_data['unadj_close'] > 5]

# Drop the unadjusted close column—it’s no longer needed
raw = raw.drop(columns=['unadj_close'])

# Standardize features (z-score) within each date (cross-sectionally)
features_std = raw.iloc[:, :-2].groupby(level=0)\
    .transform(lambda x: (x - x.mean()) / x.std())

# Extract the raw (non-standardized) version of the last feature column (is_next_jan)
feature_raw = raw.iloc[:, -2]

# Binary target: 1 if next_month_ret is above the cross-sectional median 
# for that date, else 0
target = raw.iloc[:, -1]
target = (target > target.groupby(level=0).transform('median')).astype(int)
target.name = 'target'

# Preserve the original next_month_ret values for reference or evaluation
next_month_ret = raw.iloc[:, -1]

# Concatenate standardized features, raw feature, target, and 
# forward return into final dataset
data = pd.concat([features_std, feature_raw, target, next_month_ret], axis=1)

print(data.columns)

print(data["ret-d1"].head)
print(data["ret-d1"].tail)

Index(['ret-m1', 'ret-m2', 'ret-m3', 'ret-m4', 'ret-m5', 'ret-m6', 'ret-m7',
       'ret-m8', 'ret-m9', 'ret-m10', 'ret-m11', 'ret-m12', 'ret-d1', 'ret-d2',
       'ret-d3', 'ret-d4', 'ret-d5', 'ret-d6', 'ret-d7', 'ret-d8', 'ret-d9',
       'ret-d10', 'ret-d11', 'ret-d12', 'ret-d13', 'ret-d14', 'ret-d15',
       'ret-d16', 'ret-d17', 'ret-d18', 'ret-d19', 'ret-d20', 'is_next_jan',
       'target', 'next_month_ret'],
      dtype='object')
<bound method NDFrame.head of date        symbol 
2016-06-30  ALL.csv    0.814109
2016-07-31  ALL.csv    0.818213
2016-08-31  ALL.csv   -0.079368
2016-09-30  ALL.csv   -0.021750
2016-10-31  ALL.csv    1.040548
                         ...   
2024-12-31  WHC.csv    1.763361
2025-01-31  WHC.csv   -0.173003
2025-02-28  WHC.csv    0.474797
2025-03-31  WHC.csv    0.069649
2025-05-31  WHC.csv   -0.577266
Name: ret-d1, Length: 2108, dtype: float64>
<bound method NDFrame.tail of date        symbol 
2016-06-30  ALL.csv    0.814109
2016-07-31  ALL.csv    0.81821

## Cross-Validation splits
To evaluate model performance over time, we implement a **rolling-window cross-validation** framework tailored for time series data:

In [7]:
def train_val_test_split(data, look_back_years, val_years, validation_first=False):
    """
    Generator that yields rolling train/validation/test splits from a time-indexed DataFrame.

    Parameters:
    - data: A pandas DataFrame indexed by date (multi-index allowed, first level must be date)
    - look_back_years: Total number of years in the rolling window (train + val)
    - val_years: Number of years allocated to validation set
    - validation_first: If True, use the order Val → Train → Test; else Train → Val → Test

    Yields:
    - train_data: training data for current rolling window
    - val_data: validation data for current rolling window
    - test_data: test data for current rolling window
    """

    # Create a DataFrame indexed by the unique dates (level 0 of the (date, symbol) MultiIndex)
    dt = pd.DataFrame(index=data.index.get_level_values(0).unique())

    # Assign each date a corresponding year; shift December forward to avoid lookahead bias
    dt['year'] = dt.index
    dt['year'] = dt['year'].apply(lambda x: x.year if x.month != 12 else x.year + 1)

    # Establish boundaries for rolling window
    current_start_year = dt['year'].iloc[0]  # first available year
    max_year = dt['year'].max()              # last year available in data

    print(f"Start year: {current_start_year}, Max year: {max_year}")

    while True:
        # Determine where to start train and val periods based on validation_first flag
        if validation_first:
            val_start_year = current_start_year
            train_start_year = current_start_year + val_years
        else:
            train_start_year = current_start_year
            val_start_year = current_start_year + (look_back_years - val_years)

        test_year = current_start_year + look_back_years
        print(test_year, max_year)

        # Stop if we've run out of years to create a full test period
        if test_year > max_year:
            break

        # Map year boundaries to actual dates in the original data
        val_start = dt[dt['year'] == val_start_year].index.min()
        train_start = dt[dt['year'] == train_start_year].index.min()
        test_start = dt[dt['year'] == test_year].index.min()
        test_end = dt[dt['year'] == test_year].index.max()

        # Create masks for filtering the data based on date ranges
        if validation_first:
            # Validation → Training → Test
            val_mask = (data.index.get_level_values(0) > val_start) & \
                       (data.index.get_level_values(0) <= train_start)
            train_mask = (data.index.get_level_values(0) > train_start) & \
                         (data.index.get_level_values(0) <= test_start)
        else:
            # Training → Validation → Test (default)
            train_mask = (data.index.get_level_values(0) > train_start) & \
                         (data.index.get_level_values(0) <= val_start)
            val_mask = (data.index.get_level_values(0) > val_start) & \
                       (data.index.get_level_values(0) <= test_start)

        # Define mask for the test set: follows val/train period
        test_mask = (data.index.get_level_values(0) > test_start) & \
                    (data.index.get_level_values(0) <= test_end)

        # Apply masks to extract subsets from data
        train_data = data.loc[train_mask]
        val_data = data.loc[val_mask]
        test_data = data.loc[test_mask]

        # Yield split data
        yield train_data, val_data, test_data

        # Advance rolling window by one year for next iteration
        current_start_year += 1


This function generates chronological train/validation/test splits by sliding a multi-year window forward one year at a time. For each iteration it defines the training period, the validation period and the test year, so that every test year lies strictly after the data used for training and validation. This is an implementation of the cross-validation split referenced earlier.
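
The window arithmetic can be sketched on year labels alone. This is a simplified illustration, not the function above: it ignores the December shift and the date masks, and the helper name `rolling_year_windows` is hypothetical.

```python
def rolling_year_windows(first_year, max_year, look_back_years, val_years):
    """List (train_years, val_years, test_year) for a Train -> Val -> Test rolling window."""
    windows = []
    start = first_year
    # Stop once the test year would fall past the last available year
    while start + look_back_years <= max_year:
        split = start + look_back_years - val_years
        train = list(range(start, split))
        val = list(range(split, start + look_back_years))
        windows.append((train, val, start + look_back_years))
        start += 1  # slide the window forward one year
    return windows

folds = rolling_year_windows(2016, 2025, look_back_years=4, val_years=1)
for train, val, test in folds:
    print(train, val, test)
```

With data spanning 2016 to 2025, a 4-year look-back gives six folds (test years 2020 to 2025), while a 10-year look-back gives none, because the first test year would be 2026.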

In [8]:
# Column order matches the values appended below: train, then validation, then test
df = pd.DataFrame(columns=['train_start', 'train_end', 'val_start', 'val_end', 
                           'test_start', 'test_end', 'training_samples'])
for train_data, val_data, test_data in train_val_test_split(data, 4, 1):
    df.loc[len(df)] = [
        train_data.index.get_level_values(0).min(),
        train_data.index.get_level_values(0).max(),
        val_data.index.get_level_values(0).min(),
        val_data.index.get_level_values(0).max(),
        test_data.index.get_level_values(0).min(),
        test_data.index.get_level_values(0).max(),
        len(train_data)
    ]

df

Start year: 2016, Max year: 2025
2020 2025
2021 2025
2022 2025
2023 2025
2024 2025
2025 2025
2026 2025


Unnamed: 0,train_start,train_end,val_start,val_end,test_start,test_end,training_samples
0,2016-07-31,2018-12-31,2019-01-31,2019-12-31,2020-01-31,2020-11-30,527
1,2017-01-31,2019-12-31,2020-01-31,2020-12-31,2021-01-31,2021-11-30,658
2,2018-01-31,2020-12-31,2021-01-31,2021-12-31,2022-01-31,2022-11-30,665
3,2019-01-31,2021-12-31,2022-01-31,2022-12-31,2023-01-31,2023-11-30,665
4,2020-01-31,2022-12-31,2023-01-31,2023-12-31,2024-01-31,2024-11-30,675
5,2021-01-31,2023-12-31,2024-01-31,2024-12-31,2025-01-31,2025-05-31,720


## The Model: FFNN
We'll use a simple feedforward neural network (FFNN) to classify the pre-processed features into two classes, as specified in the paper.

In [9]:
# %pip install torch  # torch.nn ships with torch; it is not a separate package


import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader


class FFNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(33, 40)
        self.fc2 = nn.Linear(40, 4)
        self.fc3 = nn.Linear(4, 50)
        self.out = nn.Linear(50, 2)  # logits for 2 classes

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        return self.out(x)  # pass logits to CrossEntropyLoss

The FFNN has three hidden layers with ReLU activations and a final linear layer that outputs two logits, one per class. The 33 input features are expanded to 40 units, compressed to a 4-unit bottleneck, expanded again to 50 units, and finally mapped to the 2 output logits.

We are using `CrossEntropyLoss`, the standard loss function for multi-class classification; with two output logits it covers the binary case as well. It applies the softmax internally, which is why the model returns raw logits.
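
To give a feel for the scale of this loss, here is a small pure-Python re-implementation of cross-entropy for a single sample. It is an illustrative sketch, not PyTorch's code, and the function name `cross_entropy` is hypothetical. It shows that a two-class model assigning equal probability to both classes has a loss of ln 2 ≈ 0.693, which is the chance-level baseline to compare training logs against.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of one sample: -log(softmax(logits)[label])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

chance_loss = cross_entropy([0.0, 0.0], 1)      # equal logits: no information
confident_loss = cross_entropy([-2.0, 2.0], 1)  # confident and correct
print(round(chance_loss, 4), round(confident_loss, 4))
```

A validation loss that stays close to 0.693 therefore suggests the network has learned little beyond a coin flip.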



In [10]:
def create_dataloader(df, batch_size=512, shuffle=False):
    # Convert input features to float32 tensor (all columns except the last two)
    X = torch.tensor(df.iloc[:, :-2].astype('float32').values)
    # Convert labels (second-to-last column) to int64 tensor
    y = torch.tensor(df.iloc[:, -2].astype('int64').values)
    # Wrap tensors in a TensorDataset and return a DataLoader
    dataset = TensorDataset(X, y)
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)
    
print(train_data["ret-d1"].tail)
# Create a DataLoader for training with shuffling
dataloader = create_dataloader(train_data, shuffle=True)

# Fetch one mini-batch from the DataLoader
batch_X, batch_y = next(iter(dataloader))

# Inspect the shapes of the input features and labels
print(batch_X.shape)  # e.g., torch.Size([512, 33])
print(batch_y.shape)  # e.g., torch.Size([512])

# Initialize the model and run a forward pass on one batch
model = FFNN()
model(batch_X)

<bound method NDFrame.tail of date        symbol 
2021-01-31  ALL.csv    0.951194
2021-02-28  ALL.csv   -1.155492
2021-03-31  ALL.csv    0.967002
2021-04-30  ALL.csv   -0.100165
2021-05-31  ALL.csv   -1.519696
                         ...   
2023-08-31  WHC.csv   -3.660740
2023-09-30  WHC.csv   -2.305724
2023-10-31  WHC.csv    1.345617
2023-11-30  WHC.csv   -1.111632
2023-12-31  WHC.csv    0.035989
Name: ret-d1, Length: 720, dtype: float64>
torch.Size([512, 33])
torch.Size([512])


tensor([[0.0343, 0.0735],
        [0.0294, 0.0557],
        [0.0161, 0.0096],
        ...,
        [0.0309, 0.0745],
        [0.0266, 0.0752],
        [0.0324, 0.0521]], grad_fn=<AddmmBackward0>)

## Model Training: Process and Hyperparameters

This section covers the training loop for the Feedforward Neural Network (FFNN) model. The process includes:

- **Data Preparation:** Training and validation data are loaded into PyTorch DataLoaders for efficient mini-batch processing.
- **Model Initialization:** The FFNN model, optimizer, and loss function are set up.
- **Training Loop:** For each epoch, the model is trained on the training set and evaluated on the validation set. The best model (lowest validation loss) is tracked and saved.
- **Evaluation:** After training, the model is restored to the best-performing weights.

### Key Hyperparameters
- **num_epochs:** Number of passes through the training data.  
  - *Typical values:* 50, 100, 200
- **learning rate (lr):** Step size for optimizer updates.  
  - *Typical values:* 1e-3, 1e-4, 5e-4
- **batch_size:** Number of samples per mini-batch.  
  - *Typical values:* 32, 64, 128, 256, 512
- **optimizer:** Algorithm for updating model weights.  
  - *Common choices:* Adam, SGD, RMSprop
- **loss function:** Measures prediction error.  
  - *Common choices for classification:* CrossEntropyLoss, NLLLoss

These hyperparameters are usually tuned against the validation set. The defaults used in `train` below (100 epochs, learning rate 1e-4, batch size 512, Adam, `CrossEntropyLoss`) are a starting point rather than tuned values.

In [17]:
import copy

def train(
    train_data, 
    val_data, 
    num_epochs=100, 
    lr=1e-4, 
    batch_size=512, 
    silent=True
):
    """
    Train a Feedforward Neural Network (FFNN) on the provided training data and evaluate on validation data.
    Tracks the best model based on validation loss and restores the best weights at the end.

    Args:
        train_data (pd.DataFrame): Training data.
        val_data (pd.DataFrame): Validation data.
        num_epochs (int): Number of training epochs.
        lr (float): Learning rate for the optimizer.
        batch_size (int): Mini-batch size for DataLoader.
        silent (bool): If False, prints progress each epoch.

    Returns:
        dict: Contains the trained model, best validation loss, epoch, and accuracy.
    """
    # Create DataLoaders for training and validation sets
    train_loader = create_dataloader(train_data, batch_size=batch_size, shuffle=True)
    val_loader = create_dataloader(val_data, batch_size=batch_size, shuffle=False)

    # Initialize model, optimizer, and loss function
    model = FFNN()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    # Helper function to evaluate model on a given DataLoader
    def evaluate(loader):
        model.eval()
        total_loss = 0
        correct = 0
        total = 0
        with torch.no_grad():
            for X, y in loader:
                logits = model(X)
                loss = criterion(logits, y)
                total_loss += loss.item() * X.size(0)
                preds = torch.argmax(logits, dim=1)
                correct += (preds == y).sum().item()
                total += y.size(0)
        avg_loss = total_loss / total
        accuracy = correct / total
        return avg_loss, accuracy

    # Track the best model based on validation loss
    best_val_loss = float('inf')
    best_accuracy = 0
    best_model_weights = None
    best_epoch = -1

    # Training loop
    for epoch in range(1, 1 + num_epochs):
        model.train()
        total_loss = 0
        correct = 0
        total = 0

        for batch_x, batch_y in train_loader:
            optimizer.zero_grad()
            logits = model(batch_x)
            loss = criterion(logits, batch_y)
            loss.backward()
            optimizer.step()

            total_loss += loss.item() * batch_x.size(0)
            preds = torch.argmax(logits, dim=1)
            correct += (preds == batch_y).sum().item()
            total += batch_x.size(0)

        avg_train_loss = total_loss / total
        train_acc = correct / total

        # Evaluate on validation set
        val_loss, val_acc = evaluate(val_loader)

        # Print training progress if not in silent mode
        if not silent:
            print(f"[Epoch {epoch:02d}] "
                  f"Train Loss: {avg_train_loss:.4f}, Train Acc: {train_acc:.4f} | "
                  f"Val Loss: {val_loss:.4f}, Val Acc: {val_acc:.4f}")

        # Update best model if validation loss improves
        if val_loss < best_val_loss:
            best_val_loss = val_loss
            best_accuracy = val_acc
            best_model_weights = copy.deepcopy(model.state_dict())
            best_epoch = epoch

    # Load weights from the best-performing model
    model.load_state_dict(best_model_weights)
    if not silent:
        print(f"Best validation loss was {best_val_loss:.4f} at epoch {best_epoch}")

    return {
        'model': model,
        'best_val_loss': best_val_loss,
        'best_epoch': best_epoch,
        'best_accuracy': best_accuracy
    }

# Example usage:
results = train(train_data, val_data, num_epochs=100, lr=1e-4, batch_size=512, silent=False)
trained_model = results['model']


[Epoch 01] Train Loss: 0.6928, Train Acc: 0.5194 | Val Loss: 0.6938, Val Acc: 0.5053
[Epoch 02] Train Loss: 0.6928, Train Acc: 0.5181 | Val Loss: 0.6938, Val Acc: 0.5053
[Epoch 03] Train Loss: 0.6927, Train Acc: 0.5194 | Val Loss: 0.6937, Val Acc: 0.5053
[Epoch 04] Train Loss: 0.6927, Train Acc: 0.5181 | Val Loss: 0.6937, Val Acc: 0.5053
[Epoch 05] Train Loss: 0.6926, Train Acc: 0.5181 | Val Loss: 0.6937, Val Acc: 0.5018
[Epoch 06] Train Loss: 0.6926, Train Acc: 0.5153 | Val Loss: 0.6937, Val Acc: 0.4947
[Epoch 07] Train Loss: 0.6926, Train Acc: 0.5167 | Val Loss: 0.6936, Val Acc: 0.4912
[Epoch 08] Train Loss: 0.6925, Train Acc: 0.5167 | Val Loss: 0.6936, Val Acc: 0.4912
[Epoch 09] Train Loss: 0.6925, Train Acc: 0.5181 | Val Loss: 0.6936, Val Acc: 0.4912
[Epoch 10] Train Loss: 0.6924, Train Acc: 0.5181 | Val Loss: 0.6936, Val Acc: 0.4912
[Epoch 11] Train Loss: 0.6924, Train Acc: 0.5194 | Val Loss: 0.6936, Val Acc: 0.4912
[Epoch 12] Train Loss: 0.6924, Train Acc: 0.5167 | Val Loss: 0.69

## Generating Predictions: Process and Parameters

This section describes how to use the trained model to generate predictions on the test set. The process includes:

- **Data Preparation:** The test data is loaded into a DataLoader for efficient batch processing.
- **Model Evaluation:** The model is set to evaluation mode and predictions are generated for each batch without computing gradients.
- **Probability and Class Assignment:** For each sample, the predicted class and the probability of the positive class are computed.
- **Quantile Assignment:** Each sample is assigned a quantile label based on its predicted probability, which is useful for ranking and portfolio construction.

### Key Parameters
- **batch_size:** Number of samples per mini-batch during prediction (fixed at 512 inside `generate_predictions`).  
  - *Typical values:* 128, 256, 512 (it does not need to match the training batch size; it affects memory use and speed, not the predictions)
- **num_quantiles:** Number of quantile bins to assign for ranking.  
  - *Typical values:* 10 (deciles), 20, 40 (finer granularity for ranking)

Assigning quantiles is especially useful in financial applications for constructing long/short portfolios or evaluating model performance across different risk buckets.
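
One practical caveat with quantile assignment is tied probabilities. The sketch below, on made-up probabilities, compares `pd.qcut` with `duplicates='drop'` against ranking the values first; the rank-first variant is a common workaround and an assumption here, not what the function below does.

```python
import pandas as pd

# Toy predicted probabilities with heavy ties, purely illustrative
probs = pd.Series([0.50, 0.50, 0.50, 0.50, 0.60, 0.70])

# Tied values collapse the bin edges, so fewer than 3 bins come back
tied = pd.qcut(probs, 3, labels=False, duplicates="drop") + 1

# Ranking first breaks ties by order of appearance, giving equal-sized bins
ranked = pd.qcut(probs.rank(method="first"), 3, labels=False) + 1

print(tied.tolist())
print(ranked.tolist())
```

When bins collapse, the highest label is no longer `num_quantiles`, so any code that selects the "top quantile" by its number needs to account for that.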

In [None]:
def generate_predictions(test_data, model, num_quantiles=10):
    """
    Generate class predictions, probabilities, and quantile labels for the test set using a trained model.

    Args:
        test_data (pd.DataFrame): Test data for prediction.
        model (nn.Module): Trained PyTorch model.
        num_quantiles (int): Number of quantile bins for ranking predictions.

    Returns:
        pd.DataFrame: Test data with added columns for predicted class, probability, and quantile label.
    """
    # Create DataLoader for test data (no shuffling)
    test_loader = create_dataloader(test_data, batch_size=512, shuffle=False)

    model.eval()
    all_preds = []
    all_probs = []

    # Perform forward pass on test data without computing gradients
    with torch.no_grad():
        for batch_x, _ in test_loader:
            logits = model(batch_x)  # Output logits: shape [batch_size, 2]
            probs = torch.softmax(logits, dim=1)  # Convert to probabilities
            preds = torch.argmax(probs, dim=1)    # Predicted class labels

            all_preds.append(preds.cpu())
            all_probs.append(probs.cpu())

    # Concatenate all mini-batch results into full arrays
    all_preds = torch.cat(all_preds).numpy()
    all_probs = torch.cat(all_probs).numpy()

    # Create a copy of the test data and add predictions
    td = test_data.copy()
    td["predicted_class"] = all_preds
    td["predicted_prob"] = all_probs[:, 1]  # Probability of class 1 (positive class)

    # Assign quantile labels (1 to num_quantiles) based on predicted probabilities
    # NOTE: duplicates='drop' avoids qcut errors when there are tied probabilities
    td["quantile"] = td.groupby(level=0)["predicted_prob"] \
        .transform(lambda x: pd.qcut(x, num_quantiles, labels=False, duplicates='drop') + 1)

    return td

# Example usage:
# test_results = generate_predictions(test_data, trained_model, num_quantiles=10)
# Display the first few rows of the test results with predictions
# print(test_results.head())


                      ret-m1    ret-m2    ret-m3    ret-m4    ret-m5  \
date       symbol                                                      
2025-01-31 ALL.csv  0.977591  1.383571  1.429739  1.186168  1.263271   
2025-02-28 ALL.csv  0.406909  0.901354  1.144721  1.202088  1.125166   
2025-03-31 ALL.csv -0.341682  0.038926  0.435025  0.625850  0.919563   
2025-04-30 ALL.csv  0.179030 -0.120631  0.179380  0.497325  0.699685   
2025-05-31 ALL.csv -1.215512 -0.715020 -0.959728 -0.580422 -0.267160   

                      ret-m6    ret-m7    ret-m8    ret-m9   ret-m10  ...  \
date       symbol                                                     ...   
2025-01-31 ALL.csv  1.159889  1.356672  1.484772  1.755659  0.315620  ...   
2025-02-28 ALL.csv  1.286345  1.166334  1.317940  1.470540  1.738471  ...   
2025-03-31 ALL.csv  0.902887  1.130388  1.047060  1.180390  1.412855  ...   
2025-04-30 ALL.csv  0.890066  0.829274  1.015785  0.944749  1.091691  ...   
2025-05-31 ALL.csv  0.001233  0.2

## Rolling-Window Training, Prediction, and Results Aggregation

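This section implements a rolling-window cross-validation loop for time series data. For each fold (test year):

- **Training:** A fresh FFNN is trained on the fold's training set, and the weights with the lowest validation loss are kept.
- **Prediction:** The trained model generates class predictions, probabilities, and quantile labels for the fold's test set.
- **Results Logging:** Training and prediction results, including metadata and performance metrics, are stored for each fold.
- **Aggregation:** After the last fold, all test set predictions are combined for final evaluation.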

### Key Parameters
- **look_back_years:** Number of years in the rolling window (train + val).  
  - *Typical values:* 4, 5, 10
- **val_years:** Number of years for validation in each window.  
  - *Typical values:* 1, 2
- **num_quantiles:** Number of quantile bins for ranking predictions.  
  - *Typical values:* 10, 20, 40

The loop prints the best validation loss and accuracy for each test year; the best epoch is stored in the summary DataFrame.

In [23]:
# Initialize summary DataFrame to store training details for each test year
summary_cols = [
    'train_start', 'train_end', 'val_start', 'val_end', 'test_start', 'test_end',
    'training_samples', 'best_val_loss', 'best_epoch', 'best_accuracy'
]
df = pd.DataFrame(columns=summary_cols)

preds = []

# Perform rolling-window training and evaluation
for train_data, val_data, test_data in train_val_test_split(data, look_back_years=10, val_years=1):
    year = test_data.index.get_level_values(0)[0].year  # Extract test year for logging

    # Train model on training and validation sets
    result = train(train_data, val_data, silent=True)

    # Generate predictions on test set
    td = generate_predictions(test_data, result['model'])
    preds.append(td)
    
    # Record metadata and training results for this fold
    df.loc[len(df)] = [
        train_data.index.get_level_values(0).min(),
        train_data.index.get_level_values(0).max(),
        val_data.index.get_level_values(0).min(),
        val_data.index.get_level_values(0).max(),
        test_data.index.get_level_values(0).min(),
        test_data.index.get_level_values(0).max(),
        len(train_data),
        result['best_val_loss'],
        result['best_epoch'],
        result['best_accuracy'],
    ]

    # Print training result summary for current year
    print(f"{year} Best val loss {result['best_val_loss']:.4f}, acc {result['best_accuracy']:.4f}")

# Combine all test set predictions across folds
preds = pd.concat(preds)

# Example: display summary DataFrame and first few rows of predictions
# print(df)
# print(preds.head())


Start year: 2016, Max year: 2025
2026 2025


ValueError: No objects to concatenate

The run fails because `look_back_years=10` needs ten years of history before the first test year: with data starting in 2016, the first test year would be 2026, which is past the last available year (2025). The generator therefore yields no folds, `preds` stays empty, and `pd.concat` raises `ValueError`. A window that fits the data, such as the `look_back_years=4`, `val_years=1` setting used in the split check earlier, produces six folds (test years 2020 to 2025).