<a href="https://colab.research.google.com/github/ArturAzarskyy/CSC413-Stock-Prediction/blob/main/transformer_prepros.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model Preprocessing

Note: parts of the ideas and code are adapted from Jan Schmitz's notebook on IBM stock prediction, [IBM stock predictor](https://github.com/JanSchm/CapMarket/blob/master/bot_experiments/IBM_Transformer%2BTimeEmbedding.ipynb). Jan Schmitz worked with a single stock; we extend the idea to multiple stocks, use a different dataset, and look at a slightly different model.


In [1]:
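# Replace the placeholders with your own Kaggle credentials; never commit a real API key.
!mkdir -p ~/.kaggle
!echo '{"username":"<KAGGLE_USERNAME>","key":"<KAGGLE_API_KEY>"}' > /content/kaggle.json
!cp /content/kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json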
# !kaggle datasets download -d borismarjanovic/price-volume-data-for-all-us-stocks-etfs
!kaggle datasets download -d qks1lver/amex-nyse-nasdaq-stock-histories
# !unzip price-volume-data-for-all-us-stocks-etfs.zip
!unzip amex-nyse-nasdaq-stock-histories.zip

zsh:1: no such file or directory: /content/kaggle.json
cp: directory /Users/lorybuttazzoni/.kaggle does not exist
zsh:1: command not found: kaggle
unzip:  cannot find or open amex-nyse-nasdaq-stock-histories.zip, amex-nyse-nasdaq-stock-histories.zip.zip or amex-nyse-nasdaq-stock-histories.zip.ZIP.


In [1]:
import tables
import pandas as pd
import torch as ty
import os.path
import numpy as np

In [2]:
# sets: training, validation, test
train_file = tables.open_file("train_data.hdf5", mode='w')
val_file = tables.open_file("val_data.hdf5", mode='w')
test_file = tables.open_file("test_data.hdf5", mode='w')

We create extendable, compressed arrays for the data and labels of each set.

In [3]:
time_step = 1
stock_history_length = 128 # this does not include the label
num_params = 5
do_moving_avg = True # case for considering the moving average
moving_dist = 5
# stock_history_length = 63 # this does not include the label

filters = tables.Filters(complevel=5, complib='blosc')

train_data  = train_file.create_earray(train_file.root, 'data',
                                      tables.Atom.from_dtype(np.dtype('float64')),
                                      shape=(0, stock_history_length, num_params),
                                      filters=filters,
                                      expectedrows=10e6)
train_labels = train_file.create_earray(train_file.root, 'labels',
                                       tables.Atom.from_dtype(np.dtype('float64')),
                                       shape=(0,),
                                       filters=filters,
                                       expectedrows=10e6)
val_data = val_file.create_earray(val_file.root, 'data',
                                  tables.Atom.from_dtype(np.dtype('float64')),
                                  shape=(0, stock_history_length, num_params),
                                  filters=filters,
                                  expectedrows=4e6)
val_labels = val_file.create_earray(val_file.root, 'labels',
                                   tables.Atom.from_dtype(np.dtype('float64')),
                                   shape=(0,),
                                   filters=filters,
                                   expectedrows=4e6)
test_data = test_file.create_earray(test_file.root, 'data',
                                   tables.Atom.from_dtype(np.dtype('float64')),
                                   shape=(0, stock_history_length, num_params),
                                   filters=filters,
                                   expectedrows=1e6)
test_labels = test_file.create_earray(test_file.root, 'labels',
                                     tables.Atom.from_dtype(np.dtype('float64')),
                                     shape=(0,),
                                     filters=filters,
                                     expectedrows=1e6)

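For each stock we build a data frame, smooth and normalize it, split it chronologically into train/valid/test, and append the resulting windows to the corresponding files.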
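The split and scaling used in the next cell can be sketched on a toy series. This is a minimal illustration, not the notebook's code: `split_points` and `min_max_from_train` are hypothetical helper names, the `returns` array is made up, and the sketch works with row positions rather than data frame index labels. It assumes the same fractions as the cell below (the last 30% of rows held out, the last 10% used for testing) and takes the minimum and maximum from the training rows only.

```python
import numpy as np

def split_points(n_rows, holdout_frac=0.3, test_frac=0.1):
    """Positions where validation and test start in a chronological split.

    Mirrors a 70/20/10 layout: the last `holdout_frac` of the rows is held
    out, and the last `test_frac` of the rows forms the test set.
    """
    valid_start = n_rows - int(holdout_frac * n_rows)
    test_start = n_rows - int(test_frac * n_rows)
    return valid_start, test_start

def min_max_from_train(values, valid_start):
    """Scale all rows with the min and max of the training rows only."""
    train = values[:valid_start]
    lo, hi = train.min(), train.max()
    return (values - lo) / (hi - lo)

# Toy "returns" series, made up for illustration.
returns = np.linspace(-1.0, 1.0, 1000)
valid_start, test_start = split_points(len(returns))
scaled = min_max_from_train(returns, valid_start)

print(valid_start, test_start)  # 700 900
print(scaled[:valid_start].min(), scaled[:valid_start].max())
print(scaled[test_start:].max() > 1.0)  # later rows may fall outside [0, 1]
```

Because the statistics come from the training rows alone, validation and test values can land outside [0, 1]. That is expected: it keeps information from the held-out periods from leaking into the scaling.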

In [None]:
with open('all_symbols.txt') as topo_file:
    for line in topo_file:
        
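        symbol = line.strip()  # robust to a missing trailing newline
        if os.path.isfile("full_history/"+symbol+".csv"):
            df = pd.read_csv("full_history/"+symbol+".csv",
                             delimiter=',', 
                             usecols=['date', 'open', 'high', 'low', 'close', 'volume'])
            
            if len(df.index.values) == 0:
                continue
            # Sort chronologically first so the forward fill uses the previous trading day
            df.sort_values('date', inplace=True)
            df = df.reset_index(drop=True)
            # Replace zero volumes with the last non-zero volume
            df['volume'] = df['volume'].replace(0, np.nan).ffill()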

            if do_moving_avg:
                df[['open', 'high', 'low', 'close', 
                    'volume']] = df[['open', 'high',
                                     'low', 'close',
                                     'volume']].rolling(moving_dist).mean() 

            # - Convert to Percentage Change -
            df.dropna(how='any', axis=0, inplace=True)
            df['open'] = df['open'].pct_change()
            df['high'] = df['high'].pct_change()
            df['low'] = df['low'].pct_change()
            df['close'] = df['close'].pct_change()
            df['volume'] = df['volume'].pct_change()
            df.dropna(how='any', axis=0, inplace=True)

            # - Skip stocks too short to fill a window in each split, then find the split points and normalize -
            if int(0.2*df.shape[0]) < stock_history_length+1 or int(0.1*df.shape[0]) < stock_history_length+1:
                continue
            valid_start = sorted(df.index.values)[-int(0.3*df.shape[0])]  
            test_start = sorted(df.index.values)[-int(0.1*df.shape[0])]
            min_return = min(df[(df.index < valid_start)][['open', 'high', 'low', 'close']].min(axis=0))
            max_return = max(df[(df.index < valid_start)][['open', 'high', 'low', 'close']].max(axis=0))

            df['open'] = (df['open'] - min_return) / (max_return - min_return)
            df['high'] = (df['high'] - min_return) / (max_return - min_return)
            df['low'] =  (df['low'] - min_return) / (max_return - min_return)
            df['close']= (df['close'] - min_return) / (max_return - min_return)

            min_volume = df[(df.index < valid_start)]['volume'].min(axis=0)
            max_volume = df[(df.index < valid_start)]['volume'].max(axis=0)

            df['volume'] = (df['volume'] - min_volume) / (max_volume - min_volume)

            # - Partition the data frame into train/valid/test -
            df.drop(columns=['date'], inplace=True)
            df_train = df[(df.index < valid_start)]
            df_val = df[(df.index >= valid_start) & (df.index < test_start)]
            df_test = df[(df.index >= test_start)]

            c_train_data = df_train.values
            c_val_data = df_val.values
            c_test_data = df_test.values
            
            # -Add the data frame data to the train/valid/test-
            for i in range(stock_history_length, len(c_train_data), time_step):
                train_data.append(c_train_data[i-stock_history_length:i][None])
                train_labels.append(c_train_data[i, 3][None])
            for i in range(stock_history_length, len(c_val_data), time_step):
                val_data.append(c_val_data[i-stock_history_length:i][None])
                val_labels.append(c_val_data[i, 3][None])
            for i in range(stock_history_length, len(c_test_data), time_step):
                test_data.append(c_test_data[i-stock_history_length:i][None])
                test_labels.append(c_test_data[i, 3][None])
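The windowing done by the three loops above can be checked on a small array. The sketch below is illustrative only: `make_windows` is a hypothetical helper, and the toy array stands in for one stock's normalized values, with column 1 playing the role of `close` (column 3 in the notebook). Each window holds `history_length` consecutive rows, and its label is the chosen column of the row immediately after the window.

```python
import numpy as np

def make_windows(values, history_length, label_col, step=1):
    """Build (windows, labels) the same way as the notebook's loops.

    The window ending at row i holds rows [i - history_length, i); its label
    is values[i, label_col], taken from the first row after the window.
    """
    windows, labels = [], []
    for i in range(history_length, len(values), step):
        windows.append(values[i - history_length:i])
        labels.append(values[i, label_col])
    if not windows:  # series too short to form a single window
        return np.empty((0, history_length, values.shape[1])), np.empty((0,))
    return np.stack(windows), np.array(labels)

# Toy series: 6 rows, 2 columns.
toy = np.arange(12, dtype=float).reshape(6, 2)
X, y = make_windows(toy, history_length=3, label_col=1)

print(X.shape)  # (3, 3, 2): 3 windows of 3 rows and 2 columns
print(y)        # labels 7, 9 and 11, one per window
```

A series of `n` rows yields `n - history_length` windows when `step` is 1, which is why stocks with too few rows in a split contribute nothing to that split.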




Contents of the train, valid and test files:

In [None]:
train_file

File(filename=train_data.hdf5, title='', mode='w', root_uep='/', filters=Filters(complevel=0, shuffle=False, bitshuffle=False, fletcher32=False, least_significant_digit=None))
/ (RootGroup) ''
/data (EArray(12573778, 128, 5)shuffle, blosc(5)) ''
  atom := Float64Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := (204, 128, 5)
/labels (EArray(12573778,)shuffle, blosc(5)) ''
  atom := Float64Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := (16384,)

In [None]:
val_file

File(filename=val_data.hdf5, title='', mode='w', root_uep='/', filters=Filters(complevel=0, shuffle=False, bitshuffle=False, fletcher32=False, least_significant_digit=None))
/ (RootGroup) ''
/data (EArray(3204370, 128, 5)shuffle, blosc(5)) ''
  atom := Float64Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := (204, 128, 5)
/labels (EArray(3204370,)shuffle, blosc(5)) ''
  atom := Float64Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := (16384,)

In [None]:
test_file

File(filename=test_data.hdf5, title='', mode='w', root_uep='/', filters=Filters(complevel=0, shuffle=False, bitshuffle=False, fletcher32=False, least_significant_digit=None))
/ (RootGroup) ''
/data (EArray(1328977, 128, 5)shuffle, blosc(5)) ''
  atom := Float64Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := (204, 128, 5)
/labels (EArray(1328977,)shuffle, blosc(5)) ''
  atom := Float64Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := (8192,)

We close the files to flush everything to disk, reopen them in read mode, and output the shape of the data and labels for each set.

In [10]:
train_file.close()
val_file.close()
test_file.close()

In [None]:
# reopen in read mode
train_file = tables.open_file("train_data.hdf5", mode='r')
val_file = tables.open_file("val_data.hdf5", mode='r')
test_file = tables.open_file("test_data.hdf5", mode='r')


In [None]:
train_file.root.data.shape, train_file.root.labels.shape

((12573778, 128, 5), (12573778,))

In [None]:
val_file.root.data.shape, val_file.root.labels.shape

((3204370, 128, 5), (3204370,))

In [None]:
test_file.root.data.shape, test_file.root.labels.shape

((1328977, 128, 5), (1328977,))

In [None]:
train_file.close()
val_file.close()
test_file.close()


## Zip and upload data to Google Drive

In [None]:
if do_moving_avg:
    !zip sp_data_orig_m_avg.zip test_data.hdf5 train_data.hdf5 val_data.hdf5
else:
    !zip sp_data_orig.zip test_data.hdf5 train_data.hdf5 val_data.hdf5

  adding: test_data.hdf5 (deflated 42%)
  adding: train_data.hdf5 (deflated 48%)
  adding: val_data.hdf5 (deflated 46%)


In [None]:
from google.colab import drive
drive.mount('/amd/')

Mounted at /amd/


In [None]:
if do_moving_avg:
    !cp sp_data_orig_m_avg.zip /amd/My\ Drive/CSC413/Data
else:   
    !cp sp_data_orig.zip /amd/My\ Drive/CSC413/Data