# Implementation of a Spatiotemporal Self-Attention Based LSTNet

## Imports

In [None]:
import pandas as pd
import yfinance as yf
import tensorflow as tf
import numpy as np
from sklearn.model_selection import train_test_split
from data_processing import DataProcessor
from models.simple_lstm import SimpleLSTM
import matplotlib.pyplot as plt

## Data Import and Formatting

We work with multivariate time series data, which is given to us as a 2D array of shape (number of timesteps) x (number of variables). We format this data by splitting it into a train and a test set and using a sliding-window approach to divide each set into sequences. The last x timesteps of every sequence then become the labels. We also normalize the data with z-score standardization.
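
As a rough illustration of the sliding-window step, here is a minimal NumPy sketch on a toy array. The helper names `make_windows` and `split_target` are hypothetical; they are not the `DataProcessor` methods used later in this notebook, whose exact behavior may differ.

```python
import numpy as np

def make_windows(data, seq_len):
    """Return all overlapping windows of length seq_len (stride 1).

    data: array of shape (num_timesteps, num_variables)
    result: array of shape (num_timesteps - seq_len + 1, seq_len, num_variables)
    """
    num_windows = data.shape[0] - seq_len + 1
    return np.stack([data[i:i + seq_len] for i in range(num_windows)])

def split_target(windows, target_size):
    """Use the last target_size timesteps of each window as labels."""
    return windows[:, :-target_size], windows[:, -target_size:]

toy = np.arange(12, dtype=float).reshape(6, 2)  # 6 timesteps, 2 variables
windows = make_windows(toy, seq_len=4)
x, y = split_target(windows, target_size=1)
print(windows.shape, x.shape, y.shape)
```

With 6 timesteps and a window length of 4 there are 3 windows; each window contributes 3 feature timesteps and 1 label timestep.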

First we load the data. Change the code block below to match your data source. This block should also contain any variable-specific feature extraction that needs to be done, such as calculating returns from prices.

In [None]:
raw_data = pd.read_csv("DATA/nasdaq100/full/full_non_padding.csv")

#display info about loaded data
raw_data.info()
raw_data.dropna(inplace=True)

#we need to calculate returns, as these are more useful than the actual price (price is a function of returns and the previous price, so if we know returns we can always recover price)
#raw_data = DataProcessor.calc_log_returns(raw_data)
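
The relationship between prices and returns mentioned in the comment above can be checked on toy numbers. This is only a sketch with made-up prices; it does not show how `DataProcessor.calc_log_returns` is implemented, since that code is not part of this notebook.

```python
import numpy as np

prices = np.array([100.0, 102.0, 101.0, 105.0])  # toy prices, made up for illustration

# log return at time t: r_t = ln(p_t / p_{t-1}); the first timestep has no return
log_returns = np.diff(np.log(prices))

# recover later prices from the first price and the cumulative sum of log returns
recovered = prices[0] * np.exp(np.cumsum(log_returns))

print(np.allclose(recovered, prices[1:]))
```

Note that converting to returns shortens the series by one timestep, because the first price has no predecessor.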

Then we do a train/test split, create sequences with a sliding window, normalize the data, and designate labels (what we are trying to predict). The length of each sequence is set by SEQ_LEN.

One question here is whether to normalize the target values. In other words, should we normalize the data and then split the time series into features and labels, or split first and then normalize only the features? From past work, ~~I have seen better Mean Absolute Percentage Error (MAPE) by doing the former, but intuition says it shouldn't matter.~~ From some reading, I have learned that normalizing all of the data, and not just the features, is a form of train-test contamination, as it allows some information from the test set to be encoded in the train set. Only features should be normalized.
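
One common way to limit this kind of leakage is to compute the mean and standard deviation on the training features only and reuse them for the test features. The sketch below illustrates that idea with NumPy on random toy data. It is an assumption about how the standardization could be done, not a description of `DataProcessor.zscore_standardization`, whose implementation is not shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
# toy feature arrays with shape (sequences, timesteps, variables)
train_x = rng.normal(loc=5.0, scale=2.0, size=(100, 19, 3))
test_x = rng.normal(loc=5.0, scale=2.0, size=(30, 19, 3))

# statistics come from the training features only, one value per variable
mean = train_x.mean(axis=(0, 1), keepdims=True)
std = train_x.std(axis=(0, 1), keepdims=True)

train_x_std = (train_x - mean) / std
test_x_std = (test_x - mean) / std  # the test data never influences mean or std

print(train_x_std.mean(axis=(0, 1)).round(6), train_x_std.std(axis=(0, 1)).round(6))
```

The targets are left untouched in this sketch, in line with the conclusion above that only features should be normalized.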

We also need to deal with missing values. For now, we simply drop the rows (timesteps) that have missing values, but I think we should look more closely at the effects of this later on.

In [None]:
SEQ_LEN = 20

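#it is important that the data is not shuffled here, since this is time series data
train_data, test_data = train_test_split(raw_data, test_size=0.3, shuffle=False)

#reassign instead of using inplace=True, which may trigger a SettingWithCopyWarning on the split results
train_data = train_data.dropna()
test_data = test_data.dropna()

#split the held-out 30% chronologically: the first 80% is used for validation, the last 20% for the final test
val_data, test_data = train_test_split(test_data, test_size=0.2, shuffle=False)

train_sequences = DataProcessor.sliding_window_sequence(train_data, SEQ_LEN)
#note: the "test_*" variables below are built from val_data and serve as the validation set during training
test_sequences = DataProcessor.sliding_window_sequence(val_data, SEQ_LEN)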

train_x, train_y = DataProcessor.sequence_target_split(train_sequences, target_size=1)
test_x, test_y = DataProcessor.sequence_target_split(test_sequences, target_size=1)

train_x = DataProcessor.zscore_standardization(train_x)
test_x = DataProcessor.zscore_standardization(test_x)

We then need to convert the MultiIndex DataFrame into a 3-dimensional NumPy array to feed into the model. Each of x_train, x_test, y_train, y_test will have the 3D shape num_sequences x timesteps_per_sequence x num_variables. For our y-values, we will only predict one timestep into the future for now, for simplicity's sake.

In [None]:
x_train = DataProcessor.fold_sequences(train_x)
x_test = DataProcessor.fold_sequences(test_x)
y_train = DataProcessor.fold_sequences(train_y)
y_test = DataProcessor.fold_sequences(test_y)

print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
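
To make the folding step more concrete, here is a minimal sketch of turning a MultiIndex DataFrame into a 3D array. The index level names `sequence` and `timestep` and the column names are hypothetical, and the sketch assumes that every sequence has the same length and that rows are sorted sequence by sequence. The real `DataProcessor.fold_sequences` may work differently.

```python
import numpy as np
import pandas as pd

num_sequences, timesteps, num_variables = 3, 4, 2

# toy MultiIndex frame: outer level = sequence id, inner level = timestep within the sequence
index = pd.MultiIndex.from_product(
    [range(num_sequences), range(timesteps)], names=["sequence", "timestep"]
)
values = np.arange(num_sequences * timesteps * num_variables, dtype=float)
frame = pd.DataFrame(values.reshape(-1, num_variables), index=index, columns=["a", "b"])

# rows are ordered sequence by sequence, so a plain reshape recovers the 3D layout
folded = frame.to_numpy().reshape(num_sequences, timesteps, num_variables)
print(folded.shape)
```

Here `folded[i]` holds the same values as `frame.loc[i]`, one 2D block per sequence.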

We also likely want to predict only one of the time series variables, so we select which one that should be. In the code block below, set the index of the variable to predict.

In [None]:
Y_VAR = 1   #index of variable to be predicted

#index the last (variable) axis so that this works whether the targets are 2D or 3D
y_train = y_train[..., Y_VAR]
y_test = y_test[..., Y_VAR]
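
Which index expression selects the variable depends on how many axes the target array has. The following sketch uses toy arrays rather than the real output of `DataProcessor.fold_sequences`, and shows that indexing the last axis with an Ellipsis works for both 2D and 3D targets.

```python
import numpy as np

Y_VAR = 1
y_2d = np.arange(12, dtype=float).reshape(4, 3)  # (num_sequences, num_variables)
y_3d = y_2d[:, np.newaxis, :]                    # (num_sequences, 1, num_variables)

# an Ellipsis indexes the last (variable) axis regardless of the number of leading axes
target_from_2d = y_2d[..., Y_VAR]
target_from_3d = y_3d[..., Y_VAR]

print(target_from_2d.shape, target_from_3d.shape)
```

By contrast, `y_3d[:, Y_VAR]` would index the timestep axis, which has length 1 here, and raise an `IndexError`.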

## Training

Now that our data is properly formatted, we initialize the model, train it, and test it. Set the model name appropriately, as it is used as the filename when saving the model.

In [None]:
model = SimpleLSTM() # replace this with the model to train, test, and run
model._name = "simpleLSTM"

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.005),
    loss=tf.keras.losses.MeanSquaredError(),
    metrics=[tf.keras.metrics.mean_absolute_error] 
)

checkpoint_path = "SAVED_MODELS/{mn}.ckpt".format(mn=model.name)

cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path, save_weights_only=True, verbose=1)

#train
model.fit(
    x=x_train,
    y=y_train,
    epochs=5,
    validation_data=(x_test, y_test),
    batch_size=128,
    callbacks=[cp_callback],
    shuffle=False
)

## Testing

The code below runs the trained model on the held-out test split and plots its predictions against the actual values of the target variable.

In [None]:
#make sequences, x/y split, normalize, and fold
test_seqs = DataProcessor.sliding_window_sequence(test_data, SEQ_LEN)
td = DataProcessor.sequence_target_split(test_seqs, target_size=1)[0]
td = DataProcessor.fold_sequences(DataProcessor.zscore_standardization(td))

#actual values of the target variable over the test period
predicted_var = test_data.to_numpy()[:,Y_VAR]

plt.plot(np.arange(0, predicted_var.shape[0]), predicted_var, label="actual")
#one prediction per sequence, so the predictions are plotted starting at timestep SEQ_LEN - 1
plt.plot(np.arange(SEQ_LEN - 1, predicted_var.shape[0]), model.predict(td).reshape(td.shape[0]), label="predicted")
plt.legend()

plt.show()
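
Alongside the plot, a numeric error measure can help when comparing models. The sketch below computes the mean absolute error and the MAPE mentioned earlier on made-up toy values. The function names `mae` and `mape` are hypothetical helpers, not part of this project.

```python
import numpy as np

def mae(actual, predicted):
    """Mean absolute error."""
    return float(np.mean(np.abs(actual - predicted)))

def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Undefined when actual contains zeros, e.g. for returns close to zero.
    """
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100.0)

actual = np.array([100.0, 110.0, 120.0, 130.0])   # toy values, made up for illustration
predicted = np.array([98.0, 112.0, 117.0, 135.0])

print(mae(actual, predicted))
print(mape(actual, predicted))
```

In this notebook, `actual` would presumably correspond to `predicted_var[SEQ_LEN - 1:]` and `predicted` to the reshaped output of `model.predict(td)`, matching the alignment used in the plot.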