# Advanced ML Project
---

## Summary

1. **Transformers with Attention Mechanisms**
    - **Project**: Develop a specialized Transformer model to capture long-term dependencies and temporal relationships in financial data, based on "Attention Is All You Need" by Vaswani et al. (2017). We may also use the Temporal Fusion Transformer (TFT), which is designed for multi-horizon forecasting, as presented by Lim et al. in "Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting" (2021).
    - **Objective**: Improve prediction accuracy by using attention mechanisms to capture the influence of distant time steps.

2. **Variational Autoencoders (VAE) for Feature Engineering and Anomaly Detection**
    - **Project**: Use VAEs to compress market data and extract latent representations that are less sensitive to noise, inspired by "Auto-Encoding Variational Bayes" by Kingma and Welling (2013). We will also explore VAE-based approaches for anomaly detection in time series, as detailed in "Robust Anomaly Detection for Multivariate Time Series through Stochastic Recurrent Neural Network" by Su et al. (2019).
    - **Objective**: Extract robust latent features and detect anomalies to improve predictions and identify unexpected market variations.

3. **Performance Comparison**
    - **Objective**: Compare the performance of the two methods (Transformers vs VAE) for time series forecasting.

4. **Hyperparameter Fine-Tuning**
    - **Objective**: Fine-tune the model hyperparameters to optimize prediction performance.
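
As a minimal sketch of the attention mechanism referred to in item 1, the NumPy snippet below computes single-head scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V, as described by Vaswani et al. It leaves out the learned projections, masking, and multi-head structure of a full Transformer, and the toy input `x` is random data used only for illustration.

```python
import numpy as np


def scaled_dot_product_attention(q, k, v):
    """Single-head attention: softmax(q k^T / sqrt(d_k)) v."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    # Subtract the row-wise max before exponentiating for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


# Toy sequence: 5 time steps with 4 features each (illustrative values only).
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))

# Self-attention: queries, keys, and values all come from the same sequence.
out, weights = scaled_dot_product_attention(x, x, x)
```

Each row of `weights` sums to 1 and says how much every other time step contributes to that position, which is how distant observations can influence a prediction directly.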

https://github.com/xuxu-wei/HybridVAE
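
For item 2, the two ingredients that separate a VAE from a plain autoencoder are the reparameterization trick and the KL term of the loss (Kingma and Welling). The sketch below assumes a diagonal Gaussian encoder and a standard normal prior. The function names are illustrative and are not taken from the HybridVAE repository linked above.

```python
import numpy as np


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)


# Toy encoder outputs for 2 samples and 3 latent dimensions (illustrative only).
rng = np.random.default_rng(0)
mu = np.zeros((2, 3))
log_var = np.zeros((2, 3))

z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
```

The full training loss would add a reconstruction term to `kl`. The latent sample `z` (or `mu` itself) is what would be passed on as a compressed feature vector.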

In [7]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

In [72]:
from transformers import TimeSeriesTransformerForPrediction
from lightning.pytorch import Trainer
from pytorch_forecasting import TimeSeriesDataSet, TemporalFusionTransformer
from pytorch_forecasting.metrics import QuantileLoss
from pytorch_forecasting.data import GroupNormalizer
import torch
from tqdm import tqdm

## Loading data

In [54]:
df = pd.read_parquet('../preprocessed_data/training.parquet')
val_df = pd.read_parquet('../preprocessed_data/validation.parquet')

In [39]:
max_encoder_length = 100  # lookback window
max_prediction_length = 10  # forecast window
training_cutoff = int(len(df) * 0.8)  # row count covering the first 80% of rows; applied below as a cutoff on the DataFrame index
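
The cutoff above is a row count, so it only produces a chronological split if the rows are already sorted by time. A hedged alternative is to cut on the time column itself. The sketch below assumes `date_id` orders the observations in time and uses a toy frame with made-up values.

```python
import pandas as pd

# Toy frame standing in for the real data (illustrative values only).
toy = pd.DataFrame({
    "date_id": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "label": range(10),
})

# Take the first 80% of distinct dates for training and the rest for validation.
dates = sorted(toy["date_id"].unique())
cutoff_date = dates[int(len(dates) * 0.8)]

train = toy[toy["date_id"] < cutoff_date]
valid = toy[toy["date_id"] >= cutoff_date]
```

With this split, no validation row is earlier than any training row, which avoids leaking future information into training.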

In [82]:
df["symbol_id"] = df["symbol_id"].astype("str")

In [83]:
df.dtypes

id                    uint32
date_id              float64
time_id                int64
symbol_id             object
weight               float32
                      ...   
responder_4_lag_1    float32
responder_5_lag_1    float32
responder_6_lag_1    float32
responder_7_lag_1    float32
responder_8_lag_1    float32
Length: 103, dtype: object

## Benchmark linear regression

In [98]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [100]:
df_small = df.dropna().sample(1000000)

In [101]:
df_small

| | id | date_id | time_id | symbol_id | weight | feature_00 | feature_01 | feature_02 | feature_03 | feature_04 | ... | label | responder_0_lag_1 | responder_1_lag_1 | responder_2_lag_1 | responder_3_lag_1 | responder_4_lag_1 | responder_5_lag_1 | responder_6_lag_1 | responder_7_lag_1 | responder_8_lag_1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2128308 | 43756438 | 1608.0 | 364 | 0 | 3.937173 | 2.306764 | 0.519196 | 2.637494 | 2.254404 | -0.841473 | ... | 0.0 | -0.162277 | -0.116389 | 1.560023 | 0.979339 | 0.343606 | 0.846384 | 0.210039 | 0.088136 | 0.334838 |
| 1714380 | 43342510 | 1597.0 | 109 | 0 | 4.475202 | -0.058088 | -2.612298 | -0.046534 | 0.057987 | -1.318175 | ... | 0.0 | 1.592873 | 1.349861 | 0.034853 | 1.579569 | 0.825219 | 1.455915 | 0.738635 | 0.376521 | 1.107516 |
| 2649701 | 44277831 | 1622.0 | 364 | 5 | 3.011554 | 1.774683 | -1.365354 | 1.586449 | 1.855868 | 0.517707 | ... | 0.0 | 1.264722 | 0.778407 | -0.526852 | 0.241921 | 0.177216 | 0.372287 | 0.144821 | 0.096059 | 0.245909 |
| 3989011 | 45617141 | 1658.0 | 518 | 9 | 1.194727 | 1.728448 | 0.056828 | 2.213796 | 2.091799 | -1.545495 | ... | 0.0 | -0.106015 | 0.136787 | 0.622785 | 0.089944 | 0.067552 | -0.290040 | -1.256863 | -0.516847 | -3.350451 |
| 3119651 | 44747781 | 1634.0 | 838 | 17 | 3.891044 | 0.739156 | 0.923295 | 0.571764 | 0.298341 | -0.360414 | ... | 0.0 | -0.615973 | -0.177908 | -0.086522 | -0.439819 | -0.191102 | -0.016968 | -0.371733 | -0.204741 | -0.872424 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3080413 | 44708543 | 1633.0 | 800 | 13 | 2.670314 | 0.455579 | 0.923407 | 0.319387 | 0.632010 | 0.082968 | ... | 3.0 | 0.327784 | 0.268779 | 1.165349 | 0.699005 | 0.443276 | 1.196272 | 0.697565 | 0.234190 | 1.262335 |
| 963786 | 42591916 | 1576.0 | 884 | 30 | 2.354905 | 2.320300 | -1.297566 | 2.499391 | 2.726670 | 1.380740 | ... | 0.0 | 1.309887 | 0.979838 | 1.944776 | 0.420354 | 0.166707 | -0.033193 | -0.340363 | -0.069098 | -0.533852 |
| 4448471 | 46076601 | 1670.0 | 828 | 24 | 1.294774 | -0.114172 | -1.025281 | 0.161494 | 0.267733 | -1.050663 | ... | 0.0 | 0.376283 | -0.302141 | -0.045680 | 0.456144 | 0.184020 | -0.004782 | -0.003736 | 0.015292 | -0.024052 |
| 2291974 | 43920104 | 1612.0 | 706 | 26 | 1.639267 | 2.598802 | -1.028428 | 1.620213 | 1.803322 | -0.750055 | ... | 0.0 | 0.428165 | 0.269177 | 0.118191 | -0.301266 | -0.128761 | -0.843805 | -0.831479 | -0.384089 | -1.556916 |


In [102]:
model = LinearRegression()
model.fit(df_small.drop(columns=["label"]), df_small["label"])

In [113]:
# Rows that were not drawn into df_small form the held-out pool
val_index = df.index.difference(df_small.index)
df_val = df.loc[val_index].dropna().sample(100000)

In [105]:
# R^2 on the training sample itself (in-sample); it does not measure out-of-sample fit
r_sq = model.score(df_small.drop(columns=["label"]), df_small["label"])
print('coefficient of determination:', r_sq)

coefficient of determination: 0.9072819665775009


In [116]:
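y_pred = model.predict(df_val.drop(columns=["label"]))
mse = mean_squared_error(df_val["label"], y_pred)  # mean squared error on the held-out sample
rms = np.sqrt(mse)  # root mean squared error, in the units of the label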

In [123]:
predi = pd.DataFrame(y_pred, columns=["predicted"])
predi['true'] = df_val["label"].values
predi

| | predicted | true |
|---|---|---|
| 0 | -0.082390 | 0.0 |
| 1 | 0.444968 | 0.0 |
| 2 | 5.660757 | 7.0 |
| 3 | 0.106000 | 0.0 |
| 4 | -0.031217 | 0.0 |
| ... | ... | ... |
| 99995 | -0.125269 | 0.0 |
| 99996 | 0.175710 | 0.0 |
| 99997 | -1.262765 | -1.0 |
| 99998 | 0.387893 | 0.0 |
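
Because most labels are zero, it helps to compare the regression against a trivial baseline that always predicts 0. The sketch below does this on the first five rows of the prediction table above. It only illustrates the calculation and is not an evaluation of the model, and the helper names `rmse` and `r2` are illustrative.

```python
import numpy as np

# First five rows of the prediction table above.
predicted = np.array([-0.082390, 0.444968, 5.660757, 0.106000, -0.031217])
true = np.array([0.0, 0.0, 7.0, 0.0, 0.0])


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def r2(y_true, y_pred):
    """Coefficient of determination relative to predicting the mean of y_true."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


model_rmse = rmse(true, predicted)
zero_rmse = rmse(true, np.zeros_like(true))  # baseline: always predict 0
model_r2 = r2(true, predicted)
```

On the full validation sample, the same comparison would use `df_val["label"].values` and `y_pred` in place of the five-row arrays.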


## Training Temporal Fusion Transformers

In [84]:
training = TimeSeriesDataSet(
    df[lambda x: x.index < training_cutoff],
    time_idx="time_id",  # expected to be an integer index that increases within each symbol_id group
    target="label",
    group_ids=["symbol_id"],
    max_encoder_length=max_encoder_length,
    max_prediction_length=max_prediction_length,
    static_categoricals=["symbol_id"],
    static_reals=[],
    time_varying_known_categoricals=[],
    #time_varying_known_reals=["time_id","date_id"],
    time_varying_unknown_reals=["label"],
    #target_normalizer=GroupNormalizer(transformation="softplus"),
    add_relative_time_idx=True,
    add_target_scales=True,
    allow_missing_timesteps=True,
)
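
`TimeSeriesDataSet` expects `time_idx` to be an integer column that increases over time within each group. If `time_id` restarts at every `date_id` (an assumption about this data, not something the notebook confirms), it does not satisfy that on its own. One possible fix is to build a combined index first. The sketch below uses a toy frame with made-up values, and the `time_idx` column and `add_time_idx` helper are hypothetical names.

```python
import pandas as pd

# Toy frame standing in for the real data (illustrative values only).
toy = pd.DataFrame({
    "symbol_id": ["0", "0", "0", "0", "1", "1"],
    "date_id": [1608.0, 1608.0, 1609.0, 1609.0, 1608.0, 1609.0],
    "time_id": [0, 1, 0, 1, 1, 0],
})


def add_time_idx(frame: pd.DataFrame) -> pd.DataFrame:
    """Combine date_id and time_id into one integer index.

    Assumes date_id has no missing values and time_id restarts each day.
    """
    out = frame.copy()
    steps_per_day = int(out["time_id"].max()) + 1
    day = out["date_id"].astype("int64")
    out["time_idx"] = (day - day.min()) * steps_per_day + out["time_id"]
    return out


toy = add_time_idx(toy)
```

The resulting column could then be passed as `time_idx="time_idx"`. Gaps in the index are tolerated here because the dataset is built with `allow_missing_timesteps=True`.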

In [96]:
tft = TemporalFusionTransformer.from_dataset(
    training,
    learning_rate=0.03,
    hidden_size=16,
    attention_head_size=4,
    dropout=0.1,
    hidden_continuous_size=8,
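    output_size=3,  # must equal the number of quantiles used by the loss
    loss=QuantileLoss(),  # defaults to 7 quantiles, so output_size=3 is inconsistent; use output_size=7 or QuantileLoss(quantiles=[0.1, 0.5, 0.9])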
)

/Users/paul-antoine/Desktop/ENSAE/Advanced-ML-Project/.venv/lib/python3.8/site-packages/lightning/pytorch/utilities/parsing.py:208: Attribute 'loss' is an instance of `nn.Module` and is already saved during checkpointing. It is recommended to ignore them using `self.save_hyperparameters(ignore=['loss'])`.
/Users/paul-antoine/Desktop/ENSAE/Advanced-ML-Project/.venv/lib/python3.8/site-packages/lightning/pytorch/utilities/parsing.py:208: Attribute 'logging_metrics' is an instance of `nn.Module` and is already saved during checkpointing. It is recommended to ignore them using `self.save_hyperparameters(ignore=['logging_metrics'])`.
  super().__init__(loss=loss, logging_metrics=logging_metrics, **kwargs)


In [86]:
train_dataloader = training.to_dataloader(train=True, batch_size=64, num_workers=0)

In [97]:
trainer = Trainer(max_epochs=10, gradient_clip_val=0.1)
trainer.fit(tft, train_dataloader)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
/Users/paul-antoine/Desktop/ENSAE/Advanced-ML-Project/.venv/lib/python3.8/site-packages/lightning/pytorch/trainer/configuration_validator.py:70: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.

   | Name                               | Type                            | Params | Mode 
------------------------------------------------------------------------------------------------
0  | loss                               | QuantileLoss                    | 0      | train
1  | logging_metrics                    | ModuleList                      | 0      | train
2  | input_embeddings                   | MultiEmbedding                  | 468    | train
3  | prescalers                         | ModuleDict                      | 64     | train
4  | static_variable_selection          | VariableSelectionNetwork        | 1.2 K  | train
5  | encoder_variab

Epoch 0:   0%|          | 0/55162 [00:00<?, ?it/s] 



IndexError: index 3 is out of bounds for dimension 1 with size 3
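
A likely cause of this error is a mismatch between `output_size=3` and the loss: to my knowledge, `QuantileLoss()` in pytorch_forecasting defaults to seven quantiles (0.02, 0.1, 0.25, 0.5, 0.75, 0.9, 0.98), which should be confirmed against the installed version. A quantile loss indexes the last dimension of the prediction once per quantile, so with only three outputs index 3 does not exist. The NumPy sketch below is a generic pinball loss written to show that constraint. It is not the library's implementation.

```python
import numpy as np

# Assumed defaults of pytorch_forecasting's QuantileLoss (verify locally).
DEFAULT_QUANTILES = [0.02, 0.1, 0.25, 0.5, 0.75, 0.9, 0.98]


def pinball_loss(y_true, y_pred, quantiles):
    """Mean pinball (quantile) loss; y_pred has one column per quantile."""
    if y_pred.shape[-1] != len(quantiles):
        raise ValueError(
            f"model outputs {y_pred.shape[-1]} values per step, "
            f"but the loss expects {len(quantiles)} quantiles"
        )
    losses = []
    for i, q in enumerate(quantiles):
        err = y_true - y_pred[..., i]
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        losses.append(np.maximum(q * err, (q - 1.0) * err))
    return float(np.mean(np.stack(losses, axis=-1)))


# Three outputs paired with three quantiles: consistent.
y_true = np.array([0.0, 0.0, 7.0, 0.0])
y_pred = np.zeros((4, 3))
loss_value = pinball_loss(y_true, y_pred, quantiles=[0.1, 0.5, 0.9])
```

Calling `pinball_loss(y_true, y_pred, DEFAULT_QUANTILES)` with the same three-column prediction raises the `ValueError`, which mirrors the configuration used above.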

In [88]:
print(df["label"].describe())

count    4.971648e+06
mean     2.458742e-02
std      1.392350e+00
min     -1.000000e+01
25%      0.000000e+00
50%      0.000000e+00
75%      0.000000e+00
max      1.000000e+01
Name: label, dtype: float64


In [93]:
l = df.groupby("symbol_id").size().tolist()  # number of rows per symbol

In [95]:
np.min(l)


117128