# Advanced ML Project
---

## Summary

1. **Transformers with Attention Mechanisms**
    - **Project**: Develop a specialized Transformer model to capture long-term dependencies and temporal relationships in financial data, based on "Attention Is All You Need" by Vaswani et al. (2017). We also consider the Temporal Fusion Transformer (TFT), which is designed for multi-horizon forecasting, as presented by Lim et al. in "Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting" (2021).
    - **Objective**: Improve prediction accuracy by using attention mechanisms to capture the influences of distant data points.

2. **Variational Autoencoders (VAE) for Feature Engineering and Anomaly Detection**
    - **Project**: Use VAEs to compress market data and extract latent representations that are less sensitive to noise, following "Auto-Encoding Variational Bayes" by Kingma and Welling (2013). We will also explore VAE-based anomaly detection for time series, as detailed in "Robust Anomaly Detection for Multivariate Time Series through Stochastic Recurrent Neural Network" by Su et al. (2019).
    - **Objective**: Extract robust latent features and detect anomalies to improve predictions and identify unexpected market variations.

3. **Performance Comparison**
    - **Objective**: Compare the performance of the two methods (Transformers vs VAE) for time series forecasting.

4. **Hyperparameter Fine-Tuning**
    - **Objective**: Fine-tune the model hyperparameters to optimize prediction performance.
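
To make the attention idea in item 1 concrete, here is a minimal sketch of scaled dot-product attention as described by Vaswani et al. (2017), written in plain Python for a single query. The function name and the toy vectors are illustrative assumptions and are not part of the project code.

```python
import math

def scaled_dot_product_attention(query, keys, values):
    """Attend from one query vector to a sequence of key/value vectors."""
    d_k = len(query)
    # Similarity of the query with every time step, scaled by sqrt(d_k).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    # Numerically stable softmax turns the scores into weights that sum to 1.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The output is the weighted average of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights

# Toy example: the query matches the first (most distant) time step best.
keys = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
values = [[10.0], [1.0], [1.0]]
output, weights = scaled_dot_product_attention([2.0, 0.0], keys, values)
print(weights, output)
```

The first time step receives the largest weight regardless of its distance from the query, which is the property the objective in item 1 relies on.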

https://github.com/xuxu-wei/HybridVAE

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

In [None]:
from transformers import TimeSeriesTransformerForPrediction
#from lightning.pytorch import Trainer
from pytorch_forecasting import TimeSeriesDataSet, TemporalFusionTransformer, Baseline
from pytorch_forecasting.metrics import QuantileLoss, MAE, RMSE
from pytorch_forecasting.data import GroupNormalizer
from pytorch_forecasting.data.encoders import NaNLabelEncoder
import lightning.pytorch as pl
from lightning.pytorch.tuner import Tuner
from lightning.pytorch.callbacks import EarlyStopping, ModelCheckpoint
from pytorch_forecasting.models.temporal_fusion_transformer.tuning import optimize_hyperparameters
from lightning.pytorch.loggers import TensorBoardLogger




## Loading data

In [117]:
df = pd.read_parquet('../preprocessed_data/training.parquet').dropna()
test_df = pd.read_parquet('../preprocessed_data/test.parquet')

In [152]:
# Subsamples for quick experiments: last 1,000,000 training rows and 40,000 random test rows
df_small = df.iloc[-1000000:].reset_index(drop=True)
test_df_small = test_df.dropna().sample(40000).reset_index(drop=True)

In [153]:
max_encoder_length = 50  # lookback window
max_prediction_length = 10  # forecast window
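
As a rough illustration of what these two lengths mean, the sketch below cuts one ordered series into (history, future) pairs of 50 and 10 steps. The helper `sliding_windows` and the toy series are hypothetical. `TimeSeriesDataSet` does its own windowing and also accepts shorter histories through `min_encoder_length`, so this is a simplification.

```python
def sliding_windows(series, encoder_length, prediction_length):
    """Yield (history, future) pairs cut from one ordered series."""
    window = encoder_length + prediction_length
    for start in range(len(series) - window + 1):
        history = series[start:start + encoder_length]
        future = series[start + encoder_length:start + window]
        yield history, future

max_encoder_length = 50     # lookback window
max_prediction_length = 10  # forecast window
toy_series = list(range(100))  # stand-in for one symbol's ordered labels
samples = list(sliding_windows(toy_series, max_encoder_length, max_prediction_length))
print(len(samples))
```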

In [154]:
df_small["symbol_id"] = df_small["symbol_id"].astype("str")
test_df_small["symbol_id"] = test_df_small["symbol_id"].astype("str")

## Benchmark linear regression

In [27]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [32]:
benchmark = LinearRegression()
benchmark.fit(df_small.drop(columns=["label"]), df_small["label"])

In [33]:
r_sq = benchmark.score(test_df_small.drop(columns=["label"]), test_df_small["label"])
print('coefficient of determination:', r_sq)

coefficient of determination: 0.6959391160061761


In [34]:
y_pred = benchmark.predict(test_df_small.drop(columns=["label"]))
# mean_squared_error returns the MSE; take np.sqrt(mse) for the RMSE
mse = mean_squared_error(test_df_small["label"], y_pred)
mse

0.38671048956829474
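
The value printed above is a mean squared error, while the TFT sections of this notebook report `RMSE`. A minimal sketch of the conversion, using only the printed value, puts the benchmark on the same scale. The comparison is only indicative, since the benchmark is scored on individual test rows and the TFT on forecast windows.

```python
import math

mse = 0.38671048956829474  # value printed by the benchmark cell
rmse = math.sqrt(mse)      # root mean squared error on the same data
print(f"RMSE: {rmse:.4f}")
```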

## Training Temporal Fusion Transformers

### Creating a unique time index column for TimeSeriesDataSet

In [155]:
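# Assign a consecutive integer (time_idx) to every distinct (date_id, time_id) pair
all_steps = pd.concat([df, test_df]).groupby(['date_id', 'time_id']).count().index
mapping = (
    pd.DataFrame(all_steps, columns=["date_time"])
    .reset_index()
    .rename(columns={"index": "time_idx"})
)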

In [156]:
df_small["date_time"] = list(zip(df_small["date_id"], df_small["time_id"]))
df_small["time_idx"] = df_small["date_time"].map(dict(zip(mapping["date_time"], mapping["time_idx"])))

In [157]:
test_df_small["date_time"] = list(zip(test_df_small["date_id"], test_df_small["time_id"]))
test_df_small["time_idx"] = test_df_small["date_time"].map(dict(zip(mapping["date_time"], mapping["time_idx"])))

In [158]:
first_val = df_small["time_idx"].min()

In [159]:
df_small['time_idx'] = df_small['time_idx']-first_val
test_df_small['time_idx'] = test_df_small['time_idx']-first_val

In [160]:
columns_to_drop = ["id","date_id",'time_id',"date_time"]
df_small.drop(columns=columns_to_drop, inplace=True)
test_df_small.drop(columns=columns_to_drop, inplace=True)

In [161]:
time_varying_unknown_reals = df_small.drop(columns=["label","symbol_id","time_idx"]).columns.to_list()

### Dataloader creation

In [162]:
training = TimeSeriesDataSet(
    df_small,
    time_idx="time_idx",
    target="label",
    group_ids=["symbol_id"],
    min_encoder_length=max_encoder_length // 2,  # require at least half the lookback window as history
    max_encoder_length=max_encoder_length,
    min_prediction_idx=0,
    min_prediction_length=1,
    max_prediction_length=max_prediction_length,
    static_categoricals=["symbol_id"],
    static_reals=[],
    time_varying_known_categoricals=[],
    time_varying_known_reals=[],
    time_varying_unknown_reals=time_varying_unknown_reals,
    #target_normalizer=GroupNormalizer(transformation="softplus"),
    add_relative_time_idx=True,
    add_target_scales=True,
    allow_missing_timesteps=True,
)



In [163]:
train_dataloader = training.to_dataloader(train=True, batch_size=64, num_workers=0)
validation = TimeSeriesDataSet.from_dataset(training, df_small, predict=True, stop_randomization=True)
val_dataloader = validation.to_dataloader(train=False, batch_size=64 * 10, num_workers=0)


In [131]:
baseline_predictions = Baseline().predict(val_dataloader, return_y=True)
RMSE()(baseline_predictions.output, baseline_predictions.y)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


tensor(0.9974)
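
The `Baseline` model in pytorch-forecasting forecasts by repeating the last known target value. The sketch below reproduces that idea on a toy window to show what the RMSE above measures. The helper names and numbers are illustrative assumptions.

```python
import math

def last_value_forecast(history, horizon):
    """Repeat the last observed target value for every forecast step."""
    return [history[-1]] * horizon

def rmse(predictions, targets):
    """Root mean squared error between two equally long sequences."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(squared))

history = [0.2, -0.1, 0.4]  # toy encoder window
future = [0.4, 0.0, 0.8]    # toy values to forecast
forecast = last_value_forecast(history, horizon=len(future))
print(rmse(forecast, future))
```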

In [246]:
pl.seed_everything(42)

Seed set to 42


42

### Model creation and fit

In [285]:
early_stop_callback = EarlyStopping(monitor="val_loss", min_delta=1e-4, patience=5, verbose=False, mode="min")

checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints/",
    filename="best-checkpoint",
    save_top_k=1,  
    monitor="val_loss", 
    mode="min", 
)

trainer = pl.Trainer(
    max_epochs=20,
    accelerator="cpu",
    enable_model_summary=True,
    gradient_clip_val=0.1,
    limit_train_batches=0.15,  
    val_check_interval=0.1,
    callbacks=[early_stop_callback, checkpoint_callback],
)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
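
To clarify how `min_delta` and `patience` interact, here is a small sketch of the stopping rule in `min` mode: a validation check counts as an improvement only if the loss drops by more than `min_delta`, and training stops after `patience` consecutive checks without one. The function and the loss values are hypothetical. Lightning counts validation checks rather than epochs, so with `val_check_interval=0.1` several checks happen within a single epoch.

```python
def early_stopping_step(val_losses, min_delta=1e-4, patience=5):
    """Return the index of the check at which training would stop, or None."""
    best = float("inf")
    checks_without_improvement = 0
    for step, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # Improvement larger than min_delta: reset the counter.
            best = loss
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                return step
    return None

losses = [1.00, 0.90, 0.85, 0.85, 0.86, 0.85, 0.87, 0.85, 0.84]
print(early_stopping_step(losses))
```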


In [None]:
tft = TemporalFusionTransformer.from_dataset(
    training,
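    learning_rate=0.7,  # NOTE: far above the 0.001-0.1 range searched in the HP optimization section; may be too high for stable training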
    hidden_size=16,
    attention_head_size=4,
    dropout=0.1,
    hidden_continuous_size=8,
    loss=QuantileLoss(quantiles=[0.5]), # corresponds to MAE
)
print(f"Number of parameters in network: {tft.size() / 1e3:.1f}k")

Number of parameters in network: 72.2k
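
The comment "corresponds to MAE" can be checked with a short sketch of the quantile (pinball) loss: at quantile 0.5 it equals half the absolute error, so minimising it is equivalent to minimising the MAE. The function below is a textbook definition written for illustration, not the library's implementation, which may apply a different constant factor.

```python
def pinball_loss(y_true, y_pred, quantile):
    """Quantile (pinball) loss for a single observation."""
    error = y_true - y_pred
    return max(quantile * error, (quantile - 1) * error)

pairs = [(1.0, 0.5), (0.0, 0.3), (-0.4, -0.4)]  # toy (target, prediction) pairs
median_loss = [pinball_loss(y, p, 0.5) for y, p in pairs]
half_abs_error = [0.5 * abs(y - p) for y, p in pairs]
print(median_loss, half_abs_error)
```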


In [None]:
trainer.fit(
    tft,
    train_dataloaders=train_dataloader,
    val_dataloaders=val_dataloader,
)


   | Name                               | Type                            | Params | Mode 
------------------------------------------------------------------------------------------------
0  | loss                               | QuantileLoss                    | 0      | train
1  | logging_metrics                    | ModuleList                      | 0      | train
2  | input_embeddings                   | MultiEmbedding                  | 456    | train
3  | prescalers                         | ModuleDict                      | 1.3 K  | train
4  | static_variable_selection          | VariableSelectionNetwork        | 646    | train
5  | encoder_variable_selection         | VariableSelectionNetwork        | 56.8 K | train
6  | decoder_variable_selection         | VariableSelectionNetwork        | 528    | train
7  | static_context_variable_selection  | GatedResidualNetwork            | 1.1 K  | train
8  | static_context_initial_hidden_lstm | GatedResidualNetwork            | 1.1 K  

                                                          

/Users/paul-antoine/Desktop/ENSAE/Advanced-ML-Project/.venv/lib/python3.8/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.


Epoch 0:   4%|▎         | 84/2348 [01:06<30:05,  1.25it/s, v_num=73, train_loss_step=1.040]

In [142]:
best_model_path = trainer.checkpoint_callback.best_model_path
best_tft = TemporalFusionTransformer.load_from_checkpoint(best_model_path)

### Test set prediction

In [None]:
test_DS = TimeSeriesDataSet.from_dataset(training, test_df_small, predict=True, stop_randomization=True)
test_DL = test_DS.to_dataloader(train=False, batch_size=100, num_workers=0)



In [229]:
predictions = best_tft.predict(test_DL, return_y=True)
RMSE()(predictions.output, predictions.y)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs


/Users/paul-antoine/Desktop/ENSAE/Advanced-ML-Project/.venv/lib/python3.8/site-packages/lightning/pytorch/trainer/connectors/data_connector.py:424: The 'predict_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=11` in the `DataLoader` to improve performance.


tensor(1.3832)

In [242]:
predictions.output.shape

torch.Size([33, 10])

### Hyperparameter optimization

In [80]:
import pickle

# create study
study = optimize_hyperparameters(
    train_dataloader,
    val_dataloader,
    model_path="optuna_test",
    n_trials=200,
    max_epochs=50,
    gradient_clip_val_range=(0.01, 1.0),
    hidden_size_range=(8, 128),
    hidden_continuous_size_range=(8, 128),
    attention_head_size_range=(1, 4),
    learning_rate_range=(0.001, 0.1),
    dropout_range=(0.1, 0.3),
    trainer_kwargs=dict(limit_train_batches=30),
    reduce_on_plateau_patience=4,
    use_learning_rate_finder=False,  # use Optuna to find ideal learning rate or use in-built learning rate finder
)

# save study results - also we can resume tuning at a later point in time
with open("test_study.pkl", "wb") as fout:
    pickle.dump(study, fout)

# show best hyperparameters
print(study.best_trial.params)

[I 2025-01-15 21:29:06,485] A new study created in memory with name: no-name-8517a7e3-144d-4ed3-8ccc-ae0c7ef7608b
  gradient_clip_val = trial.suggest_loguniform("gradient_clip_val", *gradient_clip_val_range)
GPU available: False, used: False
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
  dropout=trial.suggest_uniform("dropout", *dropout_range),
/Users/paul-antoine/Desktop/ENSAE/Advanced-ML-Project/.venv/lib/python3.8/site-packages/lightning/pytorch/utilities/parsing.py:208: Attribute 'loss' is an instance of `nn.Module` and is already saved during checkpointing. It is recommended to ignore them using `self.save_hyperparameters(ignore=['loss'])`.
/Users/paul-antoine/Desktop/ENSAE/Advanced-ML-Project/.venv/lib/python3.8/site-packages/lightning/pytorch/utilities/parsing.py:208: Attribute 'logging_metrics' is an instance of `nn.Module` and is already saved during checkpointing. It is recommended to ignore them using `self.save_hyperparameters(ignore=['loggin

KeyError: 'val_loss'
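
The log above shows that `gradient_clip_val` is drawn with `suggest_loguniform` while `dropout` uses `suggest_uniform`. As a rough illustration of the difference, the sketch below samples log-uniformly from the `gradient_clip_val` range with the standard library only. The helper `sample_log_uniform` is hypothetical and does not reproduce Optuna's sampler.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(42)
gradient_clip_val_range = (0.01, 1.0)  # range passed to the search
draws = [sample_log_uniform(*gradient_clip_val_range, rng) for _ in range(1000)]

# About half of the draws fall below 0.1, the geometric midpoint of the range.
share_below_0_1 = sum(d < 0.1 for d in draws) / len(draws)
print(round(share_below_0_1, 2))
```

Sampling on a log scale spends as many trials between 0.01 and 0.1 as between 0.1 and 1.0, which suits parameters whose useful values span orders of magnitude.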