# Implementation of a Spatiotemporal Self-Attention Based LSTNet

## Imports

In [1]:
import pandas as pd
import yfinance as yf
import tensorflow as tf
from sklearn.model_selection import train_test_split
from data_processing import DataProcessor
from models.simple_lstm import SimpleLSTM

## Data Import and Formatting

We are working with multivariate time series data, which arrives as a 2D array of shape (number of timesteps) x (number of variables). We need to split this data into a train set and a test set, then use a sliding window to divide each set into sequences. The last x timesteps of every sequence become the labels. We will also normalize the data with z-score standardization.
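To make the windowing step concrete, here is a minimal NumPy sketch of a sliding window followed by a feature/label split. It is not the `DataProcessor` implementation (which is not shown here). The function names `sliding_windows` and `split_features_labels` are hypothetical, and a stride of one timestep is assumed, which may differ from what the project code does.

```python
import numpy as np

def sliding_windows(data: np.ndarray, window: int) -> np.ndarray:
    """Return overlapping windows of shape (num_windows, window, num_variables).

    Assumes a stride of 1 between consecutive windows.
    """
    if window > len(data):
        raise ValueError("window is longer than the series")
    # sliding_window_view appends the window axis last: (num_windows, num_variables, window)
    views = np.lib.stride_tricks.sliding_window_view(data, window, axis=0)
    # reorder to (num_windows, window, num_variables) and copy into a normal array
    return np.ascontiguousarray(views.transpose(0, 2, 1))

def split_features_labels(windows: np.ndarray, target_size: int):
    """Use the last `target_size` timesteps of each window as labels."""
    x = windows[:, :-target_size, :]
    y = windows[:, -target_size:, :]
    return x, y

# toy series: 10 timesteps, 3 variables
series = np.arange(30, dtype=float).reshape(10, 3)
windows = sliding_windows(series, window=4)
x, y = split_features_labels(windows, target_size=1)
print(windows.shape, x.shape, y.shape)
```

With a window of 20 and a target size of 1, the same split leaves 19 timesteps of features per sequence.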

First we load the data. Change the code block below to match the data source. This block should also include any variable-specific feature extraction that needs to be done, such as calculating returns from prices.

In [2]:
raw_data = pd.read_csv("DATA/nasdaq100/full/full_non_padding.csv")

#display info about loaded data
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74501 entries, 0 to 74500
Columns: 105 entries, AAL to NDX
dtypes: float64(105)
memory usage: 59.7 MB


Then we do a train/test split, create sequences with a sliding window, normalize the data, and designate labels (what we are trying to predict).

One question here is whether to normalize the target values. In other words, should we normalize the data and then split each sequence into features and labels, or split first and then normalize only the features? From past work, I have seen better Mean Absolute Percentage Error (MAPE) by doing the former, but intuition says it shouldn't matter.
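One possible reason the ordering shows up in MAPE specifically is that MAPE divides by the true value, so it changes when the targets are shifted and rescaled. The sketch below uses made-up numbers and a hypothetical `mape` helper to compare the metric on raw and on z-scored targets for the same predictions. It only illustrates the scale sensitivity and does not reproduce the earlier results.

```python
import numpy as np

def mape(y_true, y_pred) -> float:
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)

# made-up target values and hypothetical predictions that are each off by 1
prices = np.array([100.0, 102.0, 101.0, 105.0, 107.0])
preds = prices + 1.0

# apply the same z-score transform to targets and predictions
mean, std = prices.mean(), prices.std()
prices_z = (prices - mean) / std
preds_z = (preds - mean) / std

mape_raw = mape(prices, preds)
mape_z = mape(prices_z, preds_z)
print(f"MAPE on raw targets:          {mape_raw:.2f}%")
print(f"MAPE on standardized targets: {mape_z:.2f}%")
```

The predictions are identical in both cases, yet the two MAPE values differ, so MAPE figures are only comparable when the targets are on the same scale.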

We also need to deal with missing values. For now, we will just drop the rows (timesteps) that have missing values, but we should consider the effects of this further down the line.
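One effect worth checking is that dropping rows leaves holes in the time axis, so a window built afterwards can bridge timesteps that were not adjacent in the original series. The sketch below counts such gaps on a toy frame. The column names and values are invented, and it assumes the row index is a consecutive integer timestep counter, as a `RangeIndex` is.

```python
import numpy as np
import pandas as pd

# toy frame with a consecutive integer index standing in for timesteps
frame = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "b": [1.0, np.nan, 3.0, 4.0, 5.0, 6.0],
})

cleaned = frame.dropna()

# a gap is any pair of surviving neighbours whose original indices differ by more than 1
steps = np.diff(cleaned.index.to_numpy())
num_gaps = int((steps > 1).sum())

print(f"dropped {len(frame) - len(cleaned)} of {len(frame)} rows")
print(f"surviving rows contain {num_gaps} gap(s) in the time axis")
```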

In [3]:
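train_data, test_data = train_test_split(raw_data, test_size=0.2, shuffle=False)  # it is important that the data is not shuffled here, since this is time series data

# drop timesteps with missing values; reassigning instead of using inplace=True
# avoids a possible SettingWithCopyWarning on the frames returned by the split
train_data = train_data.dropna()
test_data = test_data.dropna()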

train_sequences = DataProcessor.zscore_standardization(DataProcessor.sliding_window_sequence(train_data, 20))
test_sequences = DataProcessor.zscore_standardization(DataProcessor.sliding_window_sequence(test_data, 20))

train_x, train_y = DataProcessor.sequence_target_split(train_sequences, target_size=1)
test_x, test_y = DataProcessor.sequence_target_split(test_sequences, target_size=1)

We then need to convert this MultiIndex DataFrame into a three-dimensional NumPy array to feed into the model. Each of x_train and x_test will have the 3D shape num_sequences x timesteps_per_sequence x num_variables. Because we predict only one timestep into the future for now, for simplicity's sake, y_train and y_test have the 2D shape num_sequences x num_variables.
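As an illustration of what such a fold might look like, the sketch below reshapes a MultiIndex DataFrame into a 3D array. It is not the `DataProcessor.fold_sequences` implementation. It assumes a two-level index of (sequence, timestep), equal-length sequences, and the hypothetical names `fold`, `var_a`, and `var_b`.

```python
import numpy as np
import pandas as pd

num_sequences, timesteps, num_variables = 3, 4, 2

# toy MultiIndex frame: one row per (sequence, timestep) pair
index = pd.MultiIndex.from_product(
    [range(num_sequences), range(timesteps)], names=["sequence", "timestep"]
)
values = np.arange(num_sequences * timesteps * num_variables, dtype=float)
frame = pd.DataFrame(
    values.reshape(-1, num_variables), index=index, columns=["var_a", "var_b"]
)

def fold(frame: pd.DataFrame) -> np.ndarray:
    """Reshape a (sequence, timestep)-indexed frame to (sequences, timesteps, variables)."""
    frame = frame.sort_index()  # make sure rows are grouped by sequence, then timestep
    n_seq = frame.index.get_level_values(0).nunique()
    if len(frame) % n_seq != 0:
        raise ValueError("sequences do not all have the same length")
    steps = len(frame) // n_seq
    return frame.to_numpy().reshape(n_seq, steps, frame.shape[1])

folded = fold(frame)
print(folded.shape)
```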

In [4]:
x_train = DataProcessor.fold_sequences(train_x)
x_test = DataProcessor.fold_sequences(test_x)
y_train = DataProcessor.fold_sequences(train_y)
y_test = DataProcessor.fold_sequences(test_y)

print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(9686, 19, 105)
(1948, 19, 105)
(9686, 105)
(1948, 105)


We also likely want to predict only one of the time series variables, so we select which one that will be. In the code block below, set the index of the variable to predict.
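If it is easier to pick the target by column name than by position, the index can be looked up from the column labels. The sketch below uses a hypothetical three-column subset. In the notebook the analogous call would be on `raw_data.columns`, assuming the column order is preserved through the processing steps.

```python
import pandas as pd

# hypothetical subset of column labels, for illustration only
columns = pd.Index(["AAL", "AAPL", "NDX"])

# get_loc returns the integer position of a unique label
y_var = columns.get_loc("NDX")
print(y_var, columns[y_var])
```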

In [5]:
Y_VAR = 1   #index of variable to be predicted

y_train = y_train[:,Y_VAR]
y_test = y_test[:,Y_VAR]

print(y_train)
print(y_test)

[-2.08270694 -2.08863787 -2.07784358 ...  2.96908063  2.95976907
  2.98170165]
[-1.42823709 -1.42534895 -1.46422771 ...  0.90543229  0.93236972
  0.8851598 ]


## Training

Now that we have our data properly formatted, we initialize the model, train it, and test it.

In [7]:
model = SimpleLSTM() # replace this with the model to train, test, and run

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.005),
    loss=tf.keras.losses.MeanSquaredError(),
    metrics=[tf.keras.metrics.mean_absolute_percentage_error] 
)

#train
model.fit(
    x=x_train,
    y=y_train,
    epochs=500,
    validation_data=(x_test, y_test),
    batch_size=16,
    shuffle=False
)

Epoch 1/500
  3/606 [..............................] - ETA: 1:50 - loss: 1.6371 - mean_absolute_percentage_error: 77.4676 

KeyboardInterrupt: 