### Resources:
- https://github.com/Samvid95/AlgoTradingRepoList

- https://www.google.com/amp/s/blog.mlq.ai/deep-reinforcement-learning-for-trading-with-tensorflow-2-0/amp/


Quant Radio Podcast Ideas:
- Transformer model
  - really good at long sequences of time series data
  - Recency important - continuously retrain your model
- The bitter lesson 
  - https://www.quantitativo.com/p/the-bitter-lesson?utm_source=substack&utm_medium=web&utm_content=embedded-post&triedRedirect=true
  - Don't just throw everything at the model, LESS IS MORE, pick a method and train using that style (e.g. momentum)
- Learning to Rank Stocks
  - https://www.quantitativo.com/p/learning-to-rank


# Setup Raw Data
  - Import CSVs of High, Low, Open, Close, Volume for 5-10 stocks
  - Remove stocks with share price <$5
  - Remove stocks with volatility indicator >X
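
The two screens above could be sketched as follows. This is only a sketch under assumptions that are not in these notes: `filter_universe` is a hypothetical helper, the volatility indicator is taken to be the annualized trailing volatility of daily log returns, and the `max_ann_vol=0.60` threshold and 60-day window are placeholders standing in for the unspecified X.

```python
import numpy as np
import pandas as pd

def filter_universe(closes: pd.DataFrame, min_price: float = 5.0,
                    max_ann_vol: float = 0.60, window: int = 60) -> list:
    """Return the tickers (columns of `closes`) that pass both screens.

    Every default except the $5 price floor is an illustrative assumption.
    """
    latest = closes.iloc[-1]                             # most recent close per ticker
    log_ret = np.log(closes / closes.shift(1))           # daily log returns
    ann_vol = log_ret.tail(window).std() * np.sqrt(252)  # annualized trailing volatility
    keep = (latest >= min_price) & (ann_vol <= max_ann_vol)
    return keep[keep].index.tolist()

# Toy data: AAA is calm and above $5, BBB is priced below $5, CCC is very volatile
t = np.arange(80)
closes = pd.DataFrame({
    "AAA": 50 * np.where(t % 2 == 0, 1.0, 1.005),
    "BBB": 2 * np.where(t % 2 == 0, 1.0, 1.005),
    "CCC": 50 * np.where(t % 2 == 0, 1.0, 1.10),
})
universe = filter_universe(closes)
print(universe)
```

Only AAA should survive here: BBB fails the price floor and CCC fails the volatility cap.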

# Model 1:
  - Simple bulk model
    - Leverage the Bitter Lesson
    - Maximum data throughput
    - minimal human intervention 

#### Inputs/Pre-processing:
  - 20 log daily returns
  - 12 log monthly returns (t-13 -> t-2)
  - 1 January flag
  - Convert returns to cumulative returns and z-scores for cross-sectional normalization

#### The Model:
  - Stacked RBMs to form an autoencoder, pretrained layer-wise
  - Compress input to a low-dim feature space
  - Pass to feedforward neural network for classification
  - Grid search with hold-out validation
  - Final Architecture:
    - 33-40-4-50-2
    - 33 inputs, compressed to 4, classified into 2 classes

#### Target/Classification
  - Next month's return above the cross-sectional median
  - Next month's return below the cross-sectional median




# Model 2: Momentum Indicator Enriched Model
  - Apply trading indicators
    - Volatility indicator, SMA, EMA, MACD, etc.
    - Return a single enriched DataFrame for regression
  - Same model as above
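
The enrichment step for Model 2 might look like the sketch below. `add_indicators` is a hypothetical helper, the window lengths (20-day SMA/EMA/volatility, 12/26/9 MACD) are conventional defaults rather than values chosen in these notes, and the input is assumed to have a `Close` column like the raw CSVs.

```python
import numpy as np
import pandas as pd

def add_indicators(df: pd.DataFrame, window: int = 20, fast: int = 12,
                   slow: int = 26, signal: int = 9) -> pd.DataFrame:
    """Return a copy of `df` with SMA, EMA, MACD and rolling-volatility columns added."""
    out = df.copy()
    close = out["Close"]
    out["sma"] = close.rolling(window).mean()                  # simple moving average
    out["ema"] = close.ewm(span=window, adjust=False).mean()   # exponential moving average
    ema_fast = close.ewm(span=fast, adjust=False).mean()
    ema_slow = close.ewm(span=slow, adjust=False).mean()
    out["macd"] = ema_fast - ema_slow                          # MACD line
    out["macd_signal"] = out["macd"].ewm(span=signal, adjust=False).mean()
    out["volatility"] = close.pct_change().rolling(window).std()  # rolling std of daily returns
    return out

# Toy data: a steadily rising close price
prices = pd.DataFrame({"Close": 100.0 + np.arange(60)})
enriched = add_indicators(prices)
print(enriched.tail(3))
```

The first `window - 1` rows of the rolling columns are NaN by construction, so they would need to be dropped before training.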

# Model 3: Ranking Portfolio with LambdaMART
https://www.quantitativo.com/p/learning-to-rank

https://xgboost.readthedocs.io/en/stable/tutorials/learning_to_rank.html

#### Features
  - Compute asset scores using past returns, volatility-normalized indicators and momentum signals
  - Modular framework - can incorporate additional features beyond momentum
  - 6 past-return features (raw and normalised, over different windows) and 15 momentum-based features (over different windows)

#### Score Ranking
  - Apply LambdaMART to rank stocks based on future expected performance

#### Security Selection
  - Long the top decile and short the bottom decile based on the rankings
  - Using 40 quantiles rather than 10 improves results but increases volatility and drawdowns
  - Sharpe ratio is highest between 30 and 40 quantiles
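
As a rough illustration of the selection rule, the sketch below turns a set of model scores into long/short weights. `long_short_weights` is a hypothetical helper, and equal weighting within each leg is an assumption; the notes only say which quantiles to long and short.

```python
import numpy as np
import pandas as pd

def long_short_weights(scores: pd.Series, n_quantiles: int = 10) -> pd.Series:
    """Equal-weight long the top quantile and short the bottom quantile of `scores`."""
    ranks = scores.rank(method="first")                  # break ties so buckets are even
    buckets = pd.qcut(ranks, n_quantiles, labels=False)  # 0 = lowest scores
    weights = pd.Series(0.0, index=scores.index)
    longs = buckets == (n_quantiles - 1)
    shorts = buckets == 0
    weights[longs] = 1.0 / longs.sum()
    weights[shorts] = -1.0 / shorts.sum()
    return weights

# Toy data: 20 stocks with increasing scores, so S18/S19 are longs and S00/S01 are shorts
scores = pd.Series(np.arange(20, dtype=float),
                   index=[f"S{i:02d}" for i in range(20)])
weights = long_short_weights(scores)
print(weights[weights != 0])
```

Passing `n_quantiles=40` gives the finer split mentioned above, though it needs at least 40 stocks to produce non-empty buckets.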

#### Volatility scaling
  - Normalize position sizes based on *ex-ante monthly volatility*, targeting 15% annualized volatility
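
One way to read the scaling rule is position = raw weight x (target volatility / estimated volatility). The sketch below assumes the ex-ante estimate is the standard deviation of the last 21 daily returns, annualized with sqrt(252); `vol_scale`, the lookback and the annualization convention are all assumptions, not something these notes specify.

```python
import numpy as np
import pandas as pd

def vol_scale(weights: pd.Series, daily_returns: pd.DataFrame,
              target_ann_vol: float = 0.15, lookback: int = 21) -> pd.Series:
    """Scale each position by target volatility / estimated annualized volatility."""
    ex_ante_ann = daily_returns.tail(lookback).std() * np.sqrt(252)
    return weights * (target_ann_vol / ex_ante_ann)

# Toy data: BBB's daily returns are exactly twice AAA's
base = np.where(np.arange(42) % 2 == 0, 0.01, -0.01)
daily_returns = pd.DataFrame({"AAA": base, "BBB": 2 * base})
raw_weights = pd.Series({"AAA": 0.5, "BBB": 0.5})
scaled = vol_scale(raw_weights, daily_returns)
print(scaled)
```

Because BBB is twice as volatile as AAA in the toy data, its scaled position comes out at half of AAA's.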

#### Re-Training and Rebalancing
  - Rebalance monthly

In [1]:
import csv 
import numpy as np
import pandas as pd
# import tensorflow as tf


In [12]:
def get_data(symbol):
    # Get raw price data (adjusted and unadjusted closes)
    raw_df = pd.read_csv(fr"C:\Users\dougl\Desktop\Personal Projects\Deep Learning Algorithm\RAW DATA\{symbol}")

    raw_df = raw_df.replace(0, np.nan)
    raw_df = raw_df.ffill()         # Forward-fill values to patch any gaps in the time series data
    raw_df['Date'] = pd.to_datetime(raw_df['Date'], dayfirst=True)
    raw_df.set_index('Date', inplace=True)  # Set datetime index here

    return raw_df
    
raw_df = get_data("VAS.csv")  # Example symbol, replace with actual file name

def process_data(raw_df, symbol):
    """
    Constructs a feature-rich DataFrame indexed by (date, symbol) for use in predictive models.
    Calculates monthly cumulative returns (ret-m1 to ret-m12) based on end-of-month closes and daily cumulative returns (ret-d1 to ret-d20) over the last 20 trading days.
    Includes the unadjusted close price, a binary flag is_next_jan indicating if the next month is January, and the forward return for the next month (next_month_ret).
    Ensures no missing or infinite values and returns a clean, multi-indexed dataset ready for modeling.
    """
    # Get end-of-month data: the last trading day's row in each month (indexed by the calendar month-end date)
    eom = raw_df.groupby(pd.Grouper(freq='M')).last()
    eom['is_next_jan'] = (eom.index.month == 12).astype(int)
    eom['next_month_ret'] = eom['Close'].shift(-1) / eom['Close'] - 1

    # Compute monthly cumulative returns over the past 12 months (ret-m1 to ret-m12)
    monthly = pd.DataFrame(index=eom.index)
    for m in range(1, 13):
        monthly[f'ret-m{m}'] = (eom['Close'] / eom['Close'].shift(m)) - 1
    monthly = monthly.dropna()
    monthly = monthly.astype(float)

    # Compute daily cumulative returns over the past 20 trading days (ret-d1 to ret-d20)
    daily = pd.DataFrame(index=raw_df.index)
    for d in range(1, 21):
        daily[f'ret-d{d}'] = (raw_df['Close'] / raw_df['Close'].shift(d)) - 1
    daily = daily.dropna()
    daily = daily.astype(float)

    # Align daily to end-of-month dates (last 20 trading days up to each EOM)
    # For each EOM date, get the last available daily row <= EOM date
    daily_eom = daily.loc[daily.index.isin(eom.index)]
    if len(daily_eom) < len(eom):
        # If not all EOM dates are present in daily, reindex with forward fill
        daily_eom = daily.reindex(eom.index, method='ffill')

    # Merge monthly and daily features with EOM targets and metadata
    df = monthly.join(daily_eom, how='inner') \
                .join(eom[['Close', 'is_next_jan', 'next_month_ret']], how='inner')
    df = df.rename(columns={'Close': 'unadj_close'})

    # Clean: drop rows with any missing or infinite values
    df = df.replace([np.inf, -np.inf], np.nan).dropna()

    # Return None if no usable rows
    if len(df) == 0:
        return None

    # Create MultiIndex (date, symbol)
    df.index = pd.MultiIndex.from_tuples([(d, symbol) for d in df.index], names=["date", "symbol"])

    return df

print(process_data(raw_df, "VAS.csv").columns)


Index(['ret-m1', 'ret-m2', 'ret-m3', 'ret-m4', 'ret-m5', 'ret-m6', 'ret-m7',
       'ret-m8', 'ret-m9', 'ret-m10', 'ret-m11', 'ret-m12', 'ret-d1', 'ret-d2',
       'ret-d3', 'ret-d4', 'ret-d5', 'ret-d6', 'ret-d7', 'ret-d8', 'ret-d9',
       'ret-d10', 'ret-d11', 'ret-d12', 'ret-d13', 'ret-d14', 'ret-d15',
       'ret-d16', 'ret-d17', 'ret-d18', 'ret-d19', 'ret-d20', 'unadj_close',
       'is_next_jan', 'next_month_ret'],
      dtype='object')


Next, loop through each stock file, apply the two functions above, and combine the results into a single DataFrame.

In [23]:
import os

folder_path = "C:\\Users\\dougl\\Desktop\\Personal Projects\\Deep Learning Algorithm\\RAW DATA\\"



csv_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]

print(csv_files)

full_data = []

for stock in csv_files:
    # print(f"Loading {stock}")
    stock_df = get_data(stock)
    stock_df = process_data(stock_df, stock)
    # If process_data returns None, skip this stock
    if stock_df is not None:
        full_data.append(stock_df)
    # print(f"{stock} Loaded")
    # print("------")

# Combine all stock DataFrames into one
full_data = pd.concat(full_data, axis=0)

print(full_data.shape)

try:
    full_data.to_csv('Full_Dataframe.csv', index=True)
    print("File saved successfully.")
except Exception as e:
    raise RuntimeError(f"Failed to save 'Full_Dataframe.csv': {e}") from e

print(full_data.columns)

['ALL.csv', 'ALQ.csv', 'ATOM.csv', 'BGL.csv', 'CAR.csv', 'CSL.csv', 'DMP.csv', 'DVP.csv', 'FMG.csv', 'FPH.csv', 'GOLD.csv', 'JHX.csv', 'MIN.csv', 'MQG.csv', 'NWS.csv', 'NXG.csv', 'PDN.csv', 'PME.csv', 'QAN.csv', 'REA.csv', 'RMD.csv', 'SGH.csv', 'STO.csv', 'TWE.csv', 'VAS.csv', 'WCN.csv', 'WDS.csv', 'WHC.csv']
(2648, 35)
File saved successfully.
Index(['ret-m1', 'ret-m2', 'ret-m3', 'ret-m4', 'ret-m5', 'ret-m6', 'ret-m7',
       'ret-m8', 'ret-m9', 'ret-m10', 'ret-m11', 'ret-m12', 'ret-d1', 'ret-d2',
       'ret-d3', 'ret-d4', 'ret-d5', 'ret-d6', 'ret-d7', 'ret-d8', 'ret-d9',
       'ret-d10', 'ret-d11', 'ret-d12', 'ret-d13', 'ret-d14', 'ret-d15',
       'ret-d16', 'ret-d17', 'ret-d18', 'ret-d19', 'ret-d20', 'unadj_close',
       'is_next_jan', 'next_month_ret'],
      dtype='object')


## Pre-Processing

Before training the model, we apply several pre-processing steps to clean the data, standardize the features, and define the target variable.

Filter out low-priced stocks as a proxy for removing illiquid and noisy names. A future enhancement could use the VIX and a dollar-volume traded metric instead.

Next, apply **cross-sectional z-score standardization** to every return feature, i.e. all columns except the last two (`is_next_jan` and `next_month_ret`): within each date, subtract the cross-sectional mean and divide by the cross-sectional standard deviation (I prefer normalisation).

The target variable is 1 if the next month's return is above the median return for that date, and 0 otherwise.

Preserve the last feature (`is_next_jan`) unstandardized, along with the original forward return, for later analysis. All components are then combined into a single DataFrame for model input.

In [24]:
# Filter out penny stocks: keep only rows where unadjusted close is 
# greater than $5
raw = full_data[full_data['unadj_close'] > 5]

# Drop the unadjusted close column—it’s no longer needed
raw = raw.drop(columns=['unadj_close'])

# Standardize features (z-score) within each date (cross-sectionally)
features_std = raw.iloc[:, :-2].groupby(level=0)\
    .transform(lambda x: (x - x.mean()) / x.std())

# Extract the raw (non-standardized) version of the last feature column (is_next_jan)
feature_raw = raw.iloc[:, -2]

# Binary target: 1 if next_month_ret is above the cross-sectional median 
# for that date, else 0
target = raw.iloc[:, -1]
target = (target > target.groupby(level=0).transform('median')).astype(int)
target.name = 'target'

# Preserve the original next_month_ret values for reference or evaluation
next_month_ret = raw.iloc[:, -1]

# Concatenate standardized features, raw feature, target, and 
# forward return into final dataset
data = pd.concat([features_std, feature_raw, target, next_month_ret], axis=1)

print(data.columns)

print(data["ret-d1"].head())
print(data["ret-d1"].tail())

Index(['ret-m1', 'ret-m2', 'ret-m3', 'ret-m4', 'ret-m5', 'ret-m6', 'ret-m7',
       'ret-m8', 'ret-m9', 'ret-m10', 'ret-m11', 'ret-m12', 'ret-d1', 'ret-d2',
       'ret-d3', 'ret-d4', 'ret-d5', 'ret-d6', 'ret-d7', 'ret-d8', 'ret-d9',
       'ret-d10', 'ret-d11', 'ret-d12', 'ret-d13', 'ret-d14', 'ret-d15',
       'ret-d16', 'ret-d17', 'ret-d18', 'ret-d19', 'ret-d20', 'is_next_jan',
       'target', 'next_month_ret'],
      dtype='object')
date        symbol 
2016-06-30  ALL.csv    0.814109
2016-07-31  ALL.csv    0.818213
2016-08-31  ALL.csv   -0.079368
2016-09-30  ALL.csv   -0.021750
2016-10-31  ALL.csv    1.040548
Name: ret-d1, dtype: float64
date        symbol 
2024-12-31  WHC.csv    1.763361
2025-01-31  WHC.csv   -0.173003
2025-02-28  WHC.csv    0.474797
2025-03-31  WHC.csv    0.069649
2025-05-31  WHC.csv   -0.577266
Name: ret-d1, dtype: float64

## Cross-Validation splits
To evaluate model performance over time, we implement a **rolling-window cross-validation** framework tailored for time series data:

In [26]:
def train_val_test_split(data, look_back_years, val_years, validation_first=False):
    """
    Generator that yields rolling train/validation/test splits from a time-indexed DataFrame.

    Parameters:
    - data: A pandas DataFrame indexed by date (multi-index allowed, first level must be date)
    - look_back_years: Total number of years in the rolling window (train + val)
    - val_years: Number of years allocated to validation set
    - validation_first: If True, use the order Val → Train → Test; else Train → Val → Test

    Yields:
    - train_data: training data for current rolling window
    - val_data: validation data for current rolling window
    - test_data: test data for current rolling window
    """

    # Create a DataFrame indexed by the unique dates (first index level only; the symbol level is ignored)
    dt = pd.DataFrame(index=data.index.get_level_values(0).unique())

    # Assign each date a corresponding year; shift December forward to avoid lookahead bias
    dt['year'] = dt.index
    dt['year'] = dt['year'].apply(lambda x: x.year if x.month != 12 else x.year + 1)

    # Establish boundaries for rolling window
    current_start_year = dt['year'].iloc[0]  # first available year
    max_year = dt['year'].max()              # last year available in data

    print(f"Start year: {current_start_year}, Max year: {max_year}")

    while True:
        # Determine where to start train and val periods based on validation_first flag
        if validation_first:
            val_start_year = current_start_year
            train_start_year = current_start_year + val_years
        else:
            train_start_year = current_start_year
            val_start_year = current_start_year + (look_back_years - val_years)

        test_year = current_start_year + look_back_years
        print(test_year, max_year)

        # Stop if we've run out of years to create a full test period
        if test_year > max_year:
            break

        # Map year boundaries to actual dates in the original data
        val_start = dt[dt['year'] == val_start_year].index.min()
        train_start = dt[dt['year'] == train_start_year].index.min()
        test_start = dt[dt['year'] == test_year].index.min()
        test_end = dt[dt['year'] == test_year].index.max()

        # Create masks for filtering the data based on date ranges
        if validation_first:
            # Validation → Training → Test
            val_mask = (data.index.get_level_values(0) > val_start) & \
                       (data.index.get_level_values(0) <= train_start)
            train_mask = (data.index.get_level_values(0) > train_start) & \
                         (data.index.get_level_values(0) <= test_start)
        else:
            # Training → Validation → Test (default)
            train_mask = (data.index.get_level_values(0) > train_start) & \
                         (data.index.get_level_values(0) <= val_start)
            val_mask = (data.index.get_level_values(0) > val_start) & \
                       (data.index.get_level_values(0) <= test_start)

        # Define mask for the test set: follows val/train period
        test_mask = (data.index.get_level_values(0) > test_start) & \
                    (data.index.get_level_values(0) <= test_end)

        # Apply masks to extract subsets from data
        train_data = data.loc[train_mask]
        val_data = data.loc[val_mask]
        test_data = data.loc[test_mask]

        # Yield split data
        yield train_data, val_data, test_data

        # Advance rolling window by one year for next iteration
        current_start_year += 1


This function generates chronological train/validation/test splits by sliding a multi-year window forward one year at a time. For each iteration it defines the training period, the validation period and the test year, keeping the three periods in chronological order so that no future data leaks into training. This is an implementation of the cross-validation split referenced earlier.
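
The no-leakage property can be checked mechanically. The sketch below is a hypothetical helper (`assert_chronological` is not part of the notebook) that assumes the default Train, Validation, Test ordering and a (date, symbol) MultiIndex like the one built earlier; `toy_split` exists only to create example frames.

```python
import pandas as pd

def assert_chronological(train: pd.DataFrame, val: pd.DataFrame, test: pd.DataFrame) -> None:
    """Check that the date ranges of the splits are strictly ordered: train < val < test."""
    train_dates = train.index.get_level_values(0)
    val_dates = val.index.get_level_values(0)
    test_dates = test.index.get_level_values(0)
    assert train_dates.max() < val_dates.min(), "train overlaps validation"
    assert val_dates.max() < test_dates.min(), "validation overlaps test"

def toy_split(dates):
    """Build a one-column frame indexed by (date, symbol), for illustration only."""
    idx = pd.MultiIndex.from_product([pd.to_datetime(dates), ["AAA"]],
                                     names=["date", "symbol"])
    return pd.DataFrame({"x": range(len(idx))}, index=idx)

train = toy_split(["2020-01-31", "2020-02-29"])
val = toy_split(["2020-03-31"])
test = toy_split(["2020-04-30"])
assert_chronological(train, val, test)  # passes silently when the order is correct
```

Calling it on each tuple yielded by the split generator would flag any window where the periods overlap.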

In [27]:
# Column order matches the values appended below (train, then val, then test)
df = pd.DataFrame(columns=['train_start', 'train_end', 'val_start', 'val_end', 
                           'test_start', 'test_end', 'training_samples'])
for train_data, val_data, test_data in train_val_test_split(data, 4, 1):
    df.loc[len(df)] = [
        train_data.index.get_level_values(0).min(),
        train_data.index.get_level_values(0).max(),
        val_data.index.get_level_values(0).min(),
        val_data.index.get_level_values(0).max(),
        test_data.index.get_level_values(0).min(),
        test_data.index.get_level_values(0).max(),
        len(train_data)
    ]

df

Start year: 2016, Max year: 2025
2020 2025
2021 2025
2022 2025
2023 2025
2024 2025
2025 2025
2026 2025


| | train_start | train_end | val_start | val_end | test_start | test_end | training_samples |
|---|---|---|---|---|---|---|---|
| 0 | 2016-07-31 | 2018-12-31 | 2019-01-31 | 2019-12-31 | 2020-01-31 | 2020-11-30 | 527 |
| 1 | 2017-01-31 | 2019-12-31 | 2020-01-31 | 2020-12-31 | 2021-01-31 | 2021-11-30 | 658 |
| 2 | 2018-01-31 | 2020-12-31 | 2021-01-31 | 2021-12-31 | 2022-01-31 | 2022-11-30 | 665 |
| 3 | 2019-01-31 | 2021-12-31 | 2022-01-31 | 2022-12-31 | 2023-01-31 | 2023-11-30 | 665 |
| 4 | 2020-01-31 | 2022-12-31 | 2023-01-31 | 2023-12-31 | 2024-01-31 | 2024-11-30 | 675 |
| 5 | 2021-01-31 | 2023-12-31 | 2024-01-31 | 2024-12-31 | 2025-01-31 | 2025-05-31 | 720 |


## The Model: FFNN
We'll use a simple feedforward neural network (FFNN) to classify the pre-processed features into two classes, using the 33-40-4-50-2 layer sizes listed under Model 1, as specified in the paper.

In [29]:
# torch.nn ships with torch; it is not a separately installable package
%pip install torch


import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader


class FFNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(33, 40)
        self.fc2 = nn.Linear(40, 4)
        self.fc3 = nn.Linear(4, 50)
        self.out = nn.Linear(50, 2)  # logits for 2 classes

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        return self.out(x)  # pass logits to CrossEntropyLoss



In [None]:
def create_dataloader(df, batch_size=512, shuffle=False):
    # Convert input features to float32 tensor (all columns except the last two)
    X = torch.tensor(df.iloc[:, :-2].astype('float32').values)
    # Convert labels (second-to-last column) to int64 tensor
    y = torch.tensor(df.iloc[:, -2].astype('int64').values)
    # Wrap tensors in a TensorDataset and return a DataLoader
    dataset = TensorDataset(X, y)
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle)
    
# Inspect the last rows of the most recent training split
print(train_data["ret-d1"].tail())
# Create a DataLoader for training with shuffling
dataloader = create_dataloader(train_data, shuffle=True)

# Fetch one mini-batch from the DataLoader
batch_X, batch_y = next(iter(dataloader))

# Inspect the shapes of the input features and labels
print(batch_X.shape)  # e.g., torch.Size([512, 33])
print(batch_y.shape)  # e.g., torch.Size([512])

# Initialize the model and run a forward pass on one batch
model = FFNN()
model(batch_X)

2023-07-31  WHC.csv   NaN
2023-08-31  WHC.csv   NaN
2023-10-31  WHC.csv   NaN
2023-11-30  WHC.csv   NaN
2024-01-31  WHC.csv   NaN
Name: ret-d1, dtype: float64
torch.Size([512, 33])
torch.Size([512])


tensor([[nan, nan],
        [nan, nan],
        [nan, nan],
        ...,
        [nan, nan],
        [nan, nan],
        [nan, nan]], grad_fn=<AddmmBackward0>)
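
The all-NaN logits above mean the model is being fed NaN inputs, since any NaN feature propagates through every linear layer. The output shown does not reveal where the NaNs originate, so the following is only a diagnostic sketch with one candidate cause, not a confirmed fix: a cross-sectional z-score is undefined on a date with a single stock, because the sample standard deviation of one observation is NaN. `nan_report` is a hypothetical helper and the toy panel is made up for illustration.

```python
import numpy as np
import pandas as pd

def nan_report(df: pd.DataFrame) -> pd.Series:
    """Count NaN values per column, keeping only columns that contain at least one."""
    counts = df.isna().sum()
    return counts[counts > 0]

# Toy panel: two stocks on the first date, only one on the second
idx = pd.MultiIndex.from_tuples(
    [("2024-01-31", "AAA"), ("2024-01-31", "BBB"), ("2024-02-29", "AAA")],
    names=["date", "symbol"])
toy = pd.DataFrame({"ret-d1": [0.01, 0.03, 0.02]}, index=idx)

# Same cross-sectional z-score as in the pre-processing cell
z = toy.groupby(level=0).transform(lambda x: (x - x.mean()) / x.std())
print(nan_report(z))   # the single-stock date produces a NaN

clean = z.dropna()     # one option: drop rows with undefined features before training
print(len(clean))
```

Running `nan_report(train_data)` on the real split would show which columns carry the NaNs and whether this explanation applies.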