# Implementation of a Spatiotemporal Self-Attention Based LSTNet

## Imports

In [1]:
import pandas as pd
import yfinance as yf
from sklearn.model_selection import train_test_split
from data_processing import DataProcessor

## Data Import and Formatting

We are working with multivariate time series data, which arrives as a 2D array of shape (number of timesteps) x (number of variables). We need to split this data into a train set and a test set, then use a sliding-window approach to divide each set into sequences. The last x timesteps of every sequence become the labels. We also normalize the data with z-score standardization.
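The windowing and label split described above can be sketched with plain NumPy. This is a minimal illustration, not the `DataProcessor` implementation: the helper names `sliding_windows` and `split_features_labels` and the toy array are assumptions made for the example.

```python
import numpy as np


def sliding_windows(values: np.ndarray, window: int) -> np.ndarray:
    """Return overlapping windows shaped (num_windows, window, num_variables).

    Hypothetical helper for illustration only.
    """
    # sliding_window_view appends the window axis last:
    # (num_windows, num_variables, window)
    views = np.lib.stride_tricks.sliding_window_view(values, window, axis=0)
    # Move the window axis to the middle so each window reads time x variables.
    return views.transpose(0, 2, 1)


def split_features_labels(windows: np.ndarray, target_size: int):
    """Use the last `target_size` timesteps of each window as labels."""
    return windows[:, :-target_size, :], windows[:, -target_size:, :]


# Toy data: 6 timesteps, 2 variables.
data = np.arange(12, dtype=float).reshape(6, 2)
windows = sliding_windows(data, window=4)
x, y = split_features_labels(windows, target_size=1)
print(windows.shape, x.shape, y.shape)
```

Consecutive windows overlap by `window - 1` timesteps, so the number of sequences is roughly the number of timesteps rather than the number of timesteps divided by the window length.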

First we load the data. Change the code block below to match your data source.

In [2]:
raw_data = pd.read_csv("DATA/nasdaq100/full/full_non_padding.csv")

# Display info about the loaded data
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74501 entries, 0 to 74500
Columns: 105 entries, AAL to NDX
dtypes: float64(105)
memory usage: 59.7 MB


Then we do a train/test split, create sequences with a sliding window, normalize the data, and designate the labels (what we are trying to predict).

One question here is whether to normalize the target values. In other words, should we normalize the data and then split the time series into features and labels, or split first and normalize only the features? In past work, I have seen a better Mean Absolute Percentage Error (MAPE) with the former, but intuition says it shouldn't matter.
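One possible reason the choice shows up in MAPE: the metric divides by the true value, so it is not invariant to the shift and scale that z-score standardization applies. The sketch below uses made-up numbers and a hypothetical `mape` helper (neither comes from this notebook) to compute the metric for the same predictions on both scales.

```python
import numpy as np


def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean Absolute Percentage Error, in percent (hypothetical helper)."""
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100)


# Made-up targets and predictions on the original scale.
y_true = np.array([100.0, 102.0, 104.0, 106.0])
y_pred = np.array([101.0, 101.0, 105.0, 105.0])

# Standardize both with the statistics of the true values.
mean, std = y_true.mean(), y_true.std()
z_true = (y_true - mean) / std
z_pred = (y_pred - mean) / std

mape_original = mape(y_true, y_pred)
mape_standardized = mape(z_true, z_pred)
print(f"original scale: {mape_original:.2f}%")
print(f"standardized scale: {mape_standardized:.2f}%")
```

The absolute errors are identical in both cases up to the factor `std`, but the standardized targets sit close to zero, so the percentage error is much larger. When comparing runs, it helps to compute MAPE on the same scale for every run, for example after mapping predictions back with `z * std + mean`.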

We also need to deal with missing values. For now, we simply drop the rows (timesteps) that contain missing values, but we should examine the effects of this choice later on.
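To get a feel for what dropping rows does, the following sketch counts the affected timesteps in a toy frame and compares dropping with forward filling. Forward fill is shown only as one common alternative, not as a choice made in this notebook, and the toy values are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two different columns.
toy = pd.DataFrame(
    {
        "A": [1.0, 2.0, np.nan, 4.0, 5.0],
        "B": [10.0, np.nan, 30.0, 40.0, 50.0],
    }
)

# Number of timesteps that have at least one missing value.
rows_with_missing = int(toy.isna().any(axis=1).sum())

# Option 1: drop those timesteps (leaves gaps in the time index).
dropped = toy.dropna()

# Option 2: carry the last observed value forward (keeps every timestep).
filled = toy.ffill()

print(rows_with_missing, len(dropped), len(filled))
print(dropped.index.tolist())
```

Dropping removes a whole timestep even when only one variable is missing, and the surviving rows are no longer evenly spaced in time, so a sliding window may span a gap without any indication.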

In [3]:
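# It is important that the data is not shuffled here, since this is time series data.
train_data, test_data = train_test_split(raw_data, test_size=0.2, shuffle=False)

# Drop timesteps with missing values. Reassigning avoids calling an in-place
# method on a slice of raw_data, which can raise SettingWithCopyWarning.
train_data = train_data.dropna()
test_data = test_data.dropna()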

train_sequences = DataProcessor.zscore_standardization(DataProcessor.sliding_window_sequence(train_data, 20))
test_sequences = DataProcessor.zscore_standardization(DataProcessor.sliding_window_sequence(test_data, 20))

x_train, y_train = DataProcessor.sequence_target_split(train_sequences, target_size=1)
x_test, y_test = DataProcessor.sequence_target_split(test_sequences, target_size=1)

               AAL      AAPL      ADBE       ADI       ADP      ADSK  \
0     0  -1.188402 -2.084871 -1.484487 -1.187783  0.199544 -1.992492   
      1  -1.198082 -2.075032 -1.484487 -1.116616  0.201428 -1.991189   
      2  -1.203336 -2.068472 -1.522971 -1.059373  0.217572       NaN   
      3  -1.206181 -2.066286 -1.526317 -1.042355  0.219210       NaN   
      4  -1.182871 -2.067379 -1.496200 -1.073297  0.220849 -1.985977   
...            ...       ...       ...       ...       ...       ...   
59580 15  0.605093  2.574539  2.251760  2.361265  1.743386  1.234960   
      16  0.603296  2.574801  2.251760  2.364359  1.741747  1.233657   
      17  0.598377  2.574550  2.246740  2.362812  1.742566  1.232354   
      18  0.588460  2.570177  2.244231  2.364359  1.744205  1.232354   
      19  0.593438  2.572910  2.248414  2.362812  1.753219  1.238869   

              AKAM      ALXN      AMAT      AMGN  ...       VOD      VRSK  \
0     0  -0.414227 -0.079535 -1.530343  0.414444  ...  1.5

We then need to convert this MultiIndex DataFrame into a three-dimensional NumPy array to feed into the model. Each of x_train, x_test, y_train, and y_test will have the 3D shape num_sequences x timesteps_per_sequence x num_variables.

In [4]:
x_train = DataProcessor.fold_sequences(x_train)
x_test = DataProcessor.fold_sequences(x_test)
y_train = DataProcessor.fold_sequences(y_train)
y_test = DataProcessor.fold_sequences(y_test)

AttributeError: type object 'DataProcessor' has no attribute 'fold_sequences'
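The error above indicates that `DataProcessor` does not define `fold_sequences` in the version being imported. As a possible workaround, the conversion can be sketched directly with pandas and NumPy. The function below is a hypothetical stand-in, not the missing method: it assumes the first index level identifies the sequence, the second level is the timestep within it, and every sequence has the same number of timesteps.

```python
import numpy as np
import pandas as pd


def fold_sequences(frame: pd.DataFrame) -> np.ndarray:
    """Fold a (sequence, timestep)-indexed frame into a 3D array.

    Hypothetical stand-in; assumes equal-length sequences.
    Returns shape (num_sequences, timesteps_per_sequence, num_variables).
    """
    # Sort so rows are grouped by sequence, then ordered by timestep.
    frame = frame.sort_index()
    num_sequences = frame.index.get_level_values(0).nunique()
    timesteps, remainder = divmod(len(frame), num_sequences)
    if remainder:
        raise ValueError("row count is not a multiple of the number of sequences")
    return frame.to_numpy().reshape(num_sequences, timesteps, frame.shape[1])


# Toy frame: 3 sequences, 4 timesteps each, 2 variables.
index = pd.MultiIndex.from_product(
    [range(3), range(4)], names=["sequence", "timestep"]
)
toy = pd.DataFrame(
    np.arange(24, dtype=float).reshape(12, 2), index=index, columns=["A", "B"]
)
folded = fold_sequences(toy)
print(folded.shape)
```

The reshape relies on row order, which is why the frame is sorted first. If the sequences can differ in length, a group-by over the first index level with padding or filtering would be needed instead.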