# Time Series Forecasting with FinTorch and NeuralForecast

In this tutorial, we demonstrate how to use the [FinTorch](https://github.com/AI4FinTech/FinTorch) Python package to perform time series forecasting on the `AuctionDataset`, which is built from the data of the Kaggle competition [Trading at the Close](https://www.kaggle.com/competitions/optiver-trading-at-the-close/overview).

We use four neural forecasting models from the [NeuralForecast](https://github.com/Nixtla/neuralforecast) package, `NHITS`, `BiTCN`, `NBEATS`, and `NBEATSx`, to forecast auction data.

<a target="_blank" href="https://colab.research.google.com/github/AI4FinTech/FinTorch/blob/main/docs/tutorials/tradingattheclose/tradingattheclose.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Prerequisites

In [None]:
!pip install fintorch

> **Note:** For GPU acceleration, install the GPU-compatible version of PyTorch.


## Optional: Colab Kaggle Setup

To download the dataset from Kaggle in Colab, you need to provide your Kaggle username and API key.

First, add `KAGGLE_USERNAME` and `KAGGLE_KEY` as secrets in Colab:
![Colab secrets](colabsecrets.png)

Next, make the secrets available as environment variables as follows:

In [None]:
from google.colab import userdata
import os

os.environ["KAGGLE_KEY"] = userdata.get('KAGGLE_KEY')
os.environ["KAGGLE_USERNAME"] = userdata.get('KAGGLE_USERNAME')
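
Before triggering the download, it can help to confirm that both variables are actually set. The check below is a minimal sketch; the helper name `missing_kaggle_credentials` is hypothetical and not part of FinTorch or the Kaggle client.

```python
import os

# Environment variables read by the Kaggle API client
REQUIRED_VARS = ("KAGGLE_USERNAME", "KAGGLE_KEY")


def missing_kaggle_credentials(env=None):
    """Return the names of required Kaggle variables that are unset or empty.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


missing = missing_kaggle_credentials()
if missing:
    print(f"Missing Kaggle credentials: {', '.join(missing)}")
else:
    print("Kaggle credentials found.")
```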

## Dataset Background

### Trading at the Close

Stock exchanges are dynamic environments where every second counts, and the final moments of the trading day are particularly critical. On the Nasdaq Stock Exchange, the trading day concludes with the **Nasdaq Closing Cross** auction. This process determines the official closing prices for securities listed on the exchange, serving as key indicators for investors and analysts in evaluating market performance.

Approximately 10% of Nasdaq's average daily volume occurs during this closing auction. The auction provides true price and size discovery, determining benchmark prices for index funds and various investment strategies. Market makers play a crucial role in this process by consolidating information from both the traditional order book and the auction book during the last ten minutes of trading.

### The Challenge

In this tutorial, we aim to develop a model capable of predicting the closing price movements for hundreds of Nasdaq-listed stocks using data from both the order book and the closing auction. Accurate predictions can enhance market efficiency and accessibility, especially during the intense final moments of trading.

### Understanding the Order Book and Auction Mechanics

#### Order Book

The **order book** is an electronic ledger of buy (bid) and sell (ask) orders for a specific security, organized by price levels. It displays the interest of buyers and sellers, helping market participants gauge supply and demand.

In continuous trading:

- **Best Bid**: The highest price a buyer is willing to pay.
- **Best Ask**: The lowest price a seller is willing to accept.
- Orders are matched when the bid price meets or exceeds the ask price.
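
The definitions above can be illustrated with a toy book. This is only a sketch: the prices and sizes are made up, and the `crosses` helper is a hypothetical name, not part of any library used in this tutorial.

```python
# Toy order book: price -> resting size (all numbers are made up)
bids = {99.8: 300, 99.9: 500, 100.0: 200}
asks = {100.1: 400, 100.2: 250, 100.3: 600}

best_bid = max(bids)  # highest price a buyer is willing to pay
best_ask = min(asks)  # lowest price a seller is willing to accept
spread = best_ask - best_bid


def crosses(bid_price, ask_price):
    """A trade can occur when the bid price meets or exceeds the ask price."""
    return bid_price >= ask_price


print(best_bid, best_ask)           # 100.0 100.1
print(crosses(best_bid, best_ask))  # False: the resting book is not crossed
print(crosses(100.1, best_ask))     # True: a new bid at 100.1 matches the best ask
```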

#### Auction Order Book

The **auction order book** differs from the continuous trading order book:

- Orders are collected over a predefined timeframe but are not immediately matched.
- The auction culminates at a specific time, matching orders at a single price known as the **uncross price**.
- The goal is to maximize the number of matched shares.

Key terms:

- **Uncross Price**: The price at which the maximum number of shares can be matched.
- **Matched Size**: The total number of shares matched at the uncross price.
- **Imbalance**: The difference between the number of buy and sell orders that remain unmatched at the uncross price.
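
To make these terms concrete, the sketch below searches for the price that maximizes the matched size in a small, made-up auction book. It is a simplification: the function name `uncross` and the tie-breaking rule (smallest absolute imbalance) are assumptions of this example, not the exchange's full rulebook.

```python
def uncross(buys, sells):
    """Return (price, matched_size, imbalance) maximizing matched shares.

    buys and sells are lists of (limit_price, size). Ties on matched size
    are broken by the smallest absolute imbalance (an assumption of this
    sketch).
    """
    candidates = sorted({p for p, _ in buys} | {p for p, _ in sells})
    best = None
    for price in candidates:
        # Buyers accept any price at or below their limit,
        # sellers any price at or above theirs.
        buy_vol = sum(size for limit, size in buys if limit >= price)
        sell_vol = sum(size for limit, size in sells if limit <= price)
        matched = min(buy_vol, sell_vol)
        imbalance = buy_vol - sell_vol  # > 0 means unmatched buy interest
        key = (matched, -abs(imbalance))
        if best is None or key > best[0]:
            best = (key, price, matched, imbalance)
    _, price, matched, imbalance = best
    return price, matched, imbalance


buys = [(10.2, 100), (10.1, 200), (10.0, 300)]
sells = [(9.9, 150), (10.0, 100), (10.1, 250)]
print(uncross(buys, sells))  # (10.1, 300, -200)
```

In this toy book, 300 shares match at 10.1 and 200 shares of sell interest remain unmatched, hence the negative imbalance.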

#### Combining Order Books

Merging the traditional order book with the auction book provides a comprehensive view of market interest across price levels. This combined book aids in better price discovery, allowing for a more accurate equilibrium price when the auction uncrosses.

Additional terms:

- **Near Price**: The hypothetical uncross price of the combined book, provided by Nasdaq five minutes before the closing auction.
- **Far Price**: The hypothetical uncross price based solely on the auction book.
- **Reference Price**: An indicator of the fair price, calculated based on the near price and bounded by the best bid and ask prices.
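
One simplified reading of the reference price definition is a clamp of the near price into the interval between the best bid and the best ask. The sketch below follows that reading only; the function name is hypothetical and this is not Nasdaq's exact computation.

```python
def reference_price(near_price, best_bid, best_ask):
    """Clamp the near price into the [best_bid, best_ask] interval."""
    return min(max(near_price, best_bid), best_ask)


print(reference_price(100.30, 99.95, 100.05))  # 100.05 (capped at the best ask)
print(reference_price(99.80, 99.95, 100.05))   # 99.95 (floored at the best bid)
print(reference_price(100.00, 99.95, 100.05))  # 100.0 (already inside the spread)
```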

### The Data

The dataset we use is sourced from the Kaggle competition [Trading at the Close](https://www.kaggle.com/competitions/optiver-trading-at-the-close/overview). It provides comprehensive data on the Nasdaq Closing Cross auction, including both order book and auction book data.

For a detailed exploration of the dataset and the auction mechanisms, please refer to this excellent notebook: [Optiver Trading at the Close Introduction](https://www.kaggle.com/code/tomforbes/optiver-trading-at-the-close-introduction).

In this tutorial, we prepare training and test sets that can be directly used with the NeuralForecast library, formatted as Polars DataFrames for efficient processing. The `AuctionDataset` is implemented as a PyTorch dataset, which returns tensors through the `__getitem__` method when iterating over the dataset. This design allows seamless integration with PyTorch-based models and facilitates efficient data handling during model training and evaluation.

### Objective

Our objective is to use neural forecasting models to predict closing price movements of stocks. Accurate forecasts of this kind can support market efficiency and offer insight into the supply and demand dynamics during the critical closing moments of trading.

## Importing Libraries

First, import the necessary libraries and set up the environment:

In [1]:
import logging
from pathlib import Path

import polars as pl
import torch
from neuralforecast import NeuralForecast
from neuralforecast.models import NBEATS, NHITS, BiTCN, NBEATSx
from neuralforecast.losses.numpy import mae, mse

from fintorch.datasets.auctiondata import AuctionDataset

# Set up logging
logging.basicConfig(level=logging.INFO)
torch.set_float32_matmul_precision("medium")

## Loading the AuctionDataset

Load the `AuctionDataset` from FinTorch:

In [2]:
# Define the data path
data_path = Path("~/.fintorch_data/auctiondata-optiver/").expanduser()

# Load the auction data
auction_data = AuctionDataset(data_path)

INFO:root:Load auction data


## Defining Model Parameters

Set common parameters for the models:

In [3]:
input_size = 30          # Number of past time steps used for prediction
days = 3                 # Number of days to forecast
steps_per_day = 55       # Number of 10-second steps per day
horizon = days * steps_per_day  # Forecast horizon: 3 * 55 = 165 steps
max_steps = 10           # Max training steps

## Initializing the Models

Initialize the models with the defined parameters:

In [4]:
# Initialize the models
models = [
    NHITS(
        input_size=input_size,
        h=horizon,
        futr_exog_list=["wap", "bid_price", "ask_price"],
        scaler_type="robust",
        max_steps=max_steps,
    ),
    BiTCN(
        input_size=input_size,
        h=horizon,
        futr_exog_list=["wap", "bid_price", "ask_price"],
        scaler_type="robust",
        max_steps=max_steps,
    ),
    NBEATS(
        input_size=input_size,
        h=horizon,
        max_steps=max_steps,
    ),
    NBEATSx(
        input_size=input_size,
        futr_exog_list=["wap", "bid_price", "ask_price"],
        h=horizon,
        max_steps=max_steps,
    ),
]

Seed set to 1
Seed set to 1
Seed set to 1
Seed set to 1


## Preparing the Data

Define the validation and test sizes, then select the relevant columns from the training data:

In [5]:
# Define validation and test sizes
val_size = horizon  # Validation set size
test_size = horizon  # Test set size

# Select necessary columns
train_df = auction_data.train.select(
    [
        "y",
        "ds",
        "unique_id",
        "wap",
        "imbalance_size",
        "imbalance_buy_sell_flag",
        "reference_price",
        "matched_size",
        "bid_price",
        "ask_price",
        "ask_size",
    ]
)
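
Because `val_size` and `test_size` both equal the horizon, the last `2 * horizon` time steps of each series are held out from training. The index arithmetic can be sketched as follows; the helper `split_bounds` and the 20-day series length are hypothetical and only serve to illustrate the split.

```python
days = 3
steps_per_day = 55
horizon = days * steps_per_day  # 165 steps
val_size = test_size = horizon


def split_bounds(n_steps, val_size, test_size):
    """Return (train, val, test) index ranges for one series of n_steps rows."""
    train_end = n_steps - val_size - test_size
    if train_end <= 0:
        raise ValueError("series too short for the requested split")
    val_end = n_steps - test_size
    return range(0, train_end), range(train_end, val_end), range(val_end, n_steps)


n_steps = 20 * steps_per_day  # hypothetical series covering 20 days
train, val, test = split_bounds(n_steps, val_size, test_size)
print(len(train), len(val), len(test))  # 770 165 165
```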

## Performing Cross-Validation

Create a `NeuralForecast` object and perform cross-validation:

In [6]:
# Create NeuralForecast object
nf = NeuralForecast(models=models, freq="10s")

# Perform cross-validation
Y_hat_df = nf.cross_validation(
    df=train_df,
    val_size=val_size,
    test_size=test_size,
    n_windows=None,  # With None, the evaluation windows are derived from test_size
).to_pandas()

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name         | Type          | Params | Mode 
-------------------------------------------------------
0 | loss         | MAE           | 0      | train
1 | padder_train | ConstantPad1d | 0      | train
2 | scaler       | TemporalNorm  | 0      | train
3 | blocks       | ModuleList    | 3.2 M  | train
-------------------------------------------------------
3.2 M     Trainable params
0         Non-trainable params
3.2 M     Total params
12.763    Total estimated model params size (MB)


Epoch 1:  43%|████▎     | 3/7 [00:00<00:00,  5.71it/s, v_num=8, train_loss_step=2.440, train_loss_epoch=2.320, valid_loss=7.210]

`Trainer.fit` stopped: `max_steps=10` reached.


Epoch 1:  43%|████▎     | 3/7 [00:00<00:00,  5.70it/s, v_num=8, train_loss_step=2.440, train_loss_epoch=2.320, valid_loss=7.210]


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting DataLoader 0: 100%|██████████| 7/7 [00:00<00:00, 27.58it/s] 


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

   | Name          | Type          | Params | Mode 
---------------------------------------------------------
0  | loss          | MAE           | 0      | train
1  | padder_train  | ConstantPad1d | 0      | train
2  | scaler        | TemporalNorm  | 0      | train
3  | lin_hist      | Linear        | 80     | train
4  | drop_hist     | Dropout       | 0      | train
5  | net_bwd       | Sequential    | 5.4 K  | train
6  | lin_futr      | Linear        | 64     | train
7  | drop_futr     | Dropout       | 0      | train
8  | net_fwd       | Sequential    | 8.6 K  | train
9  | drop_temporal | Dropout       | 0      | train
10 | temporal_lin1 | Linear        | 496    | train
11 | temporal_lin2 | Linear        | 2.8 K  | train
12 | output_lin    | Linear        | 49     | train
-------------------------------------------------------

Epoch 1:  43%|████▎     | 3/7 [00:00<00:00,  4.95it/s, v_num=10, train_loss_step=2.800, train_loss_epoch=2.320, valid_loss=6.510]

`Trainer.fit` stopped: `max_steps=10` reached.


Epoch 1:  43%|████▎     | 3/7 [00:00<00:00,  4.94it/s, v_num=10, train_loss_step=2.800, train_loss_epoch=2.320, valid_loss=6.510]


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting DataLoader 0: 100%|██████████| 7/7 [00:00<00:00, 25.55it/s] 

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]






  | Name         | Type          | Params | Mode 
-------------------------------------------------------
0 | loss         | MAE           | 0      | train
1 | padder_train | ConstantPad1d | 0      | train
2 | scaler       | TemporalNorm  | 0      | train
3 | blocks       | ModuleList    | 2.9 M  | train
-------------------------------------------------------
2.9 M     Trainable params
64.5 K    Non-trainable params
2.9 M     Total params
11.663    Total estimated model params size (MB)


Epoch 1:  43%|████▎     | 3/7 [00:00<00:00,  5.93it/s, v_num=12, train_loss_step=7.140, train_loss_epoch=7.030, valid_loss=6.290]

`Trainer.fit` stopped: `max_steps=10` reached.


Epoch 1:  43%|████▎     | 3/7 [00:00<00:00,  5.91it/s, v_num=12, train_loss_step=7.140, train_loss_epoch=7.030, valid_loss=6.290]


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting DataLoader 0: 100%|██████████| 7/7 [00:00<00:00, 29.18it/s] 


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name         | Type          | Params | Mode 
-------------------------------------------------------
0 | loss         | MAE           | 0      | train
1 | padder_train | ConstantPad1d | 0      | train
2 | scaler       | TemporalNorm  | 0      | train
3 | blocks       | ModuleList    | 3.8 M  | train
-------------------------------------------------------
3.7 M     Trainable params
64.5 K    Non-trainable params
3.8 M     Total params
15.257    Total estimated model params size (MB)


Epoch 1:  43%|████▎     | 3/7 [00:00<00:00,  5.62it/s, v_num=14, train_loss_step=7.750, train_loss_epoch=7.600, valid_loss=6.450]

`Trainer.fit` stopped: `max_steps=10` reached.


Epoch 1:  43%|████▎     | 3/7 [00:00<00:00,  5.61it/s, v_num=14, train_loss_step=7.750, train_loss_epoch=7.600, valid_loss=6.450]


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting DataLoader 0: 100%|██████████| 7/7 [00:00<00:00, 28.11it/s] 


## Evaluating the Models

Compute MSE and MAE for each model:

In [8]:
# List of model names
model_names = ["NHITS", "BiTCN", "NBEATS", "NBEATSx"]

# Number of unique series
n_series = len(auction_data.train["unique_id"].unique())

# Iterate over models to compute metrics
for model_name in model_names:
    # Extract true values and predictions
    y_true = Y_hat_df.y.values
    y_hat = Y_hat_df[model_name].values

    # Reshape arrays
    y_true = y_true.reshape(n_series, -1, horizon)
    y_hat = y_hat.reshape(n_series, -1, horizon)

    # Compute metrics
    mse_score = mse(y_true, y_hat)
    mae_score = mae(y_true, y_hat)

    # Print results
    print(f"\nModel: {model_name}")
    print(f"MSE: {mse_score}")
    print(f"MAE: {mae_score}")


Model: NHITS
MSE: 103.54132703286024
MAE: 7.068754453999026

Model: BiTCN
MSE: 100.66011143269779
MAE: 6.521046070610383

Model: NBEATS
MSE: 63.95072834895031
MAE: 5.594107221445525

Model: NBEATSx
MSE: 73.7288435229088
MAE: 5.9910440166960015
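
For reference, both metrics reduce to simple averages over the forecast errors. The plain-Python sketch below is only illustrative; the names `mae_plain` and `mse_plain` are hypothetical and chosen so that they do not shadow the `mae` and `mse` functions imported from NeuralForecast.

```python
def mae_plain(y_true, y_hat):
    """Mean absolute error: average of |y - y_hat|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_hat)) / len(y_true)


def mse_plain(y_true, y_hat):
    """Mean squared error: average of (y - y_hat)^2."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_hat)) / len(y_true)


y_true = [1.0, -2.0, 3.0, 0.0]
y_hat = [2.0, -2.0, 0.0, 1.0]

print(mae_plain(y_true, y_hat))  # 1.25
print(mse_plain(y_true, y_hat))  # 2.75
```

Because MSE squares each error, the single miss of 3 in this toy example contributes 9 of the 11 squared-error units, which is why MSE reacts more strongly to large deviations than MAE.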


## Making Predictions on the Test Set

Prepare the test data and make predictions:


In [9]:
# Expected (unique_id, ds) combinations for the forecast horizon
expected_future_df = nf.make_future_dataframe()

# Select columns from test data
selected_data = auction_data.test.select(
    [
        "unique_id",
        "ds",
        "wap",
        "imbalance_size",
        "imbalance_buy_sell_flag",
        "reference_price",
        "matched_size",
        "bid_price",
        "ask_price",
        "ask_size",
        "row_id",
        "time_id",
    ]
)

# Add 'y' column filled with zeros due to a requirement in NeuralForecast
selected_data = selected_data.with_columns(pl.lit(0).alias("y"))

# Make predictions
fcsts_df = nf.predict(futr_df=selected_data)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting DataLoader 0: 100%|██████████| 7/7 [00:00<00:00, 37.01it/s] 


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting DataLoader 0: 100%|██████████| 7/7 [00:00<00:00, 34.14it/s] 


GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Predicting DataLoader 0: 100%|██████████| 7/7 [00:00<00:00, 33.87it/s] 

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]



Predicting DataLoader 0: 100%|██████████| 7/7 [00:00<00:00, 36.93it/s] 
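
Models that use future exogenous features need `futr_df` to cover every series and time step in the forecast horizon. A quick way to spot gaps is to compare the expected and provided `(unique_id, ds)` pairs. The sketch below does this with plain Python sets on made-up data; `missing_pairs` is a hypothetical helper, not a NeuralForecast function.

```python
def missing_pairs(expected, provided):
    """Return expected (unique_id, ds) pairs that are absent from provided."""
    return sorted(set(expected) - set(provided))


# Made-up example: two series, three future steps each
expected = [(uid, step) for uid in ("A", "B") for step in (1, 2, 3)]
provided = [("A", 1), ("A", 2), ("A", 3), ("B", 1), ("B", 3)]

print(missing_pairs(expected, provided))  # [('B', 2)]
```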


### Displaying Predictions

Join predictions with test data:

In [10]:
# Join predictions with test data
fcsts_df = fcsts_df.join(selected_data, on=["unique_id", "ds"], how="right")

# Display predictions
print("Predictions on the test dataset by all models:")
print(fcsts_df)

Predictions on the test dataset by all models:
shape: (33_000, 17)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬─────────────┬─────────┬─────┐
│ NHITS     ┆ BiTCN     ┆ NBEATS    ┆ NBEATSx   ┆ … ┆ ask_size  ┆ row_id      ┆ time_id ┆ y   │
│ ---       ┆ ---       ┆ ---       ┆ ---       ┆   ┆ ---       ┆ ---         ┆ ---     ┆ --- │
│ f32       ┆ f32       ┆ f32       ┆ f32       ┆   ┆ f64       ┆ str         ┆ i64     ┆ i32 │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═════════════╪═════════╪═════╡
│ 1.783648  ┆ -0.212747 ┆ 0.895329  ┆ 1.31668   ┆ … ┆ 9177.6    ┆ 478_0_0     ┆ 26290   ┆ 0   │
│ 0.22948   ┆ 6.136395  ┆ -0.696216 ┆ 0.21183   ┆ … ┆ 19692.0   ┆ 478_0_1     ┆ 26290   ┆ 0   │
│ 13.327369 ┆ -3.136321 ┆ 12.290155 ┆ 10.873125 ┆ … ┆ 34955.12  ┆ 478_0_2     ┆ 26290   ┆ 0   │
│ 0.986083  ┆ -1.599667 ┆ 1.061661  ┆ 1.216592  ┆ … ┆ 10314.0   ┆ 478_0_3     ┆ 26290   ┆ 0   │
│ 1.251347  ┆ -0.632016 ┆ 0.878551  ┆ 0.20115   ┆ … ┆ 7245.6    ┆ 478
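
The `how="right"` join keeps every row of the test data and attaches predictions where the `(unique_id, ds)` keys match. The plain-Python sketch below mimics that behavior on made-up rows to show what happens to unmatched test rows; `right_join` is a hypothetical helper and assumes the keys are unique in the left table.

```python
def right_join(left, right, keys):
    """Keep every row of `right`; attach matching `left` columns or None."""
    index = {tuple(row[k] for k in keys): row for row in left}
    left_cols = [c for c in (left[0] if left else {}) if c not in keys]
    joined = []
    for row in right:
        match = index.get(tuple(row[k] for k in keys))
        extra = {c: (match[c] if match else None) for c in left_cols}
        joined.append({**extra, **row})
    return joined


preds = [{"unique_id": "A", "ds": 1, "NHITS": 0.5}]
test_rows = [
    {"unique_id": "A", "ds": 1, "wap": 1.001},
    {"unique_id": "A", "ds": 2, "wap": 0.999},
]

joined_rows = right_join(preds, test_rows, ["unique_id", "ds"])
for row in joined_rows:
    print(row)
```

The second test row has no matching prediction, so its `NHITS` value is `None` while the row itself is preserved.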

## Conclusion

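In this tutorial, we demonstrated how to use FinTorch and NeuralForecast to perform time series forecasting on auction data. We initialized multiple models, performed cross-validation, evaluated their performance, and made predictions on the test set. In this short run with `max_steps = 10`, `NBEATS` achieved the lowest cross-validation error (MSE 63.95, MAE 5.59), followed by `NBEATSx`, `BiTCN`, and `NHITS`.

Feel free to experiment with different models and parameters, such as a larger `max_steps` or a different `input_size`, to improve forecasting accuracy.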