<a name = 'top'></a>

**SECTIONS: [Top](#top)  |  [Compile Stocks](#compile_stocks)  |  [The Data](#the_data)  |  [Add Change Column](#change_column)  |  [Feature Engineering](#feature_engineering)  |  [Splitting & Scaling](#splitting)  |  [Creating Sequences](#sequences)**

<font size = 7>**Multivariate Forecasting**
<br>

<font size = 5> [**Part 1: Video**](https://www.youtube.com/watch?v=jR0phoeXjrc)  |  [**Part 2: Video**](https://www.youtube.com/watch?v=ODEGJ_kh2aA)

In [1]:
import pandas as pd
from pylab import rcParams
import numpy as np
import seaborn as sns
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.autograd as ag
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint, EarlyStopping
from pytorch_lightning.loggers import TensorBoardLogger
from torch.utils.data import Dataset, DataLoader
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt
from matplotlib import rc
%matplotlib inline
from helpers import *
import tqdm as tq
from collections import defaultdict

In [2]:
sns.set(style='whitegrid', 
		palette='muted', 
		font_scale=1.2)

HAPPY_COLORS = ['#01BEFE', '#FFDD00', '#FF7D00', '#FF006D', 
				'#ADFF02', '#8F00FF']

sns.set_palette(sns.color_palette(HAPPY_COLORS))

rcParams['figure.figsize'] = 13, 7

pl.seed_everything(123)
import yfinance as yf
tq.tqdm.pandas()

Global seed set to 123


<font size = 4><span style = 'background-color: #ddddff; padding: 5px 5px 3px 5px; line-height: 1.5; color:black;border-radius: 3px;'>**compile_stocks()**
* Yahoo Finance only returns 7 days' worth of data per request at 1-minute intervals
* This function compiles dataframes in 7-day increments across an entire date range
* 1-minute data is also only available for roughly the 30 days before today
* The same function can be used with other intervals that allow longer time periods
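
The windowing idea can be sketched on its own, without any network calls. This is a minimal sketch: `date_windows` is a hypothetical helper (it is not part of `yfinance`), and it only computes the consecutive start/end pairs that a chunked download would request.

```python
import pandas as pd

def date_windows(start, end, day_window=7):
    """Hypothetical helper: split [start, end) into consecutive windows
    of at most `day_window` days, walking backwards from `end`."""
    start, end = pd.to_datetime(start), pd.to_datetime(end)
    step = pd.Timedelta(days=day_window)
    windows = []
    while end > start:
        # clamp the window start so it never goes before `start`
        window_start = max(start, end - step)
        windows.append((window_start, end))
        end = window_start
    return windows[::-1]  # oldest window first

windows = date_windows('2022-12-25', '2023-01-22', day_window=7)
for window_start, window_end in windows:
    print(window_start.date(), '->', window_end.date())
```

For the 28-day range used in this notebook, this yields four 7-day windows, which matches the four download progress bars shown further down.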

<a name = 'compile_stocks'></a>


## **`compile_stocks()`**

In [3]:
def compile_stocks(symbol, end, start, day_window, interval):
	import datetime
	import yfinance as yf
	
	# work backwards from `end` in windows of `day_window` days
	first_date = pd.to_datetime(start)
	end_date = pd.to_datetime(end)
	start_date = end_date - datetime.timedelta(days = day_window)
	
	dfs = []
	
	while start_date >= first_date:
		df = yf.download(symbol, 
						 start = start_date,
						 end = end_date, 
						 interval = interval)

		dfs.append(df)
		
		# slide the window back, never going earlier than `first_date`
		end_date = start_date
		start_date = max(start_date - datetime.timedelta(days = day_window), 
						 first_date)

		# the whole range has been covered once the window is empty
		if start_date == end_date:
			break
		
	# windows were collected newest-first, so sort by the Datetime index
	master_df = pd.concat(dfs).sort_index()
		
	return master_df

In [4]:
data = compile_stocks('BTC-USD', 
					 '2023-01-22', 
					 '2022-12-25',
					 7, '1m')

[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed
[*********************100%***********************]  1 of 1 completed


<a name = 'the_data'></a>


<font size = 4><span style = 'background-color: #ddddff; padding: 5px 5px 3px 5px; line-height: 1.5; color:black;border-radius: 3px;'>**The Data**

In [5]:
df = data.copy()
df.columns = [x.lower() for x in df.columns]
head_tail_vert(df, 3, 'raw bitcoin data', intraday=True)




Datetime,open,high,low,close,adj close,volume
2022-12-25 00:00:00+00:00,16847.51,16847.51,16847.51,16847.51,16847.51,0
2022-12-25 00:01:00+00:00,16846.72,16846.72,16846.72,16846.72,16846.72,0
2022-12-25 00:02:00+00:00,16846.7,16846.7,16846.7,16846.7,16846.7,0





Datetime,open,high,low,close,adj close,volume
2023-01-21 23:56:00+00:00,22780.84,22780.84,22780.84,22780.84,22780.84,0
2023-01-21 23:57:00+00:00,22784.79,22784.79,22784.79,22784.79,22784.79,0
2023-01-21 23:58:00+00:00,22777.93,22777.93,22777.93,22777.93,22777.93,10856448





In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 40074 entries, 2022-12-25 00:00:00+00:00 to 2023-01-21 23:58:00+00:00
Data columns (total 6 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   open       40074 non-null  float64
 1   high       40074 non-null  float64
 2   low        40074 non-null  float64
 3   close      40074 non-null  float64
 4   adj close  40074 non-null  float64
 5   volume     40074 non-null  int64  
dtypes: float64(5), int64(1)
memory usage: 2.1 MB


In [7]:
df.shape

(40074, 6)

<a name = 'change_column'></a>


<font size = 4><span style = 'background-color: #ddddff; padding: 5px 5px 3px 5px; line-height: 1.5; color:black;border-radius: 3px;'>**Shift Method**
* using `shift()` to add the previous timestamp value
* then creating a difference column to show change since last
* creating a function to do all this
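
As a small illustration of the idea, here is a sketch on made-up prices (the values are invented for the example). It also shows that pandas' built-in `diff()` gives the same result as subtracting the shifted column.

```python
import pandas as pd

# toy closing prices, invented for illustration
prices = pd.DataFrame({'close': [100.0, 101.5, 101.0, 103.0]})

# shift() moves every value down one row, so each row sees the previous close
prices['previous'] = prices['close'].shift()
prices['change'] = prices['close'] - prices['previous']

# diff() computes the same difference in a single call
via_diff = prices['close'].diff()

print(prices)
```

The first row has no previous value, so its change is `NaN`, which is why the function below drops it.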

## **`add_change_column()`**

In [8]:
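def add_change_column(df, column_changing, new_col_name):
	# work on a copy so the caller's dataframe is not modified
	df = df.copy()
	# difference between each value and the previous timestamp's value
	df[new_col_name] = df[column_changing] - df[column_changing].shift()
	# the first row has no previous value, so drop it
	return df.drop(df.index[0])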

In [9]:
df = add_change_column(df, 'close', 'change')

In [10]:
head_tail_vert(df, 3, "Change column added", intraday=True)




Datetime,open,high,low,close,adj close,volume,change
2022-12-25 00:01:00+00:00,16846.72,16846.72,16846.72,16846.72,16846.72,0,-0.79
2022-12-25 00:02:00+00:00,16846.7,16846.7,16846.7,16846.7,16846.7,0,-0.02
2022-12-25 00:03:00+00:00,16846.54,16846.54,16846.54,16846.54,16846.54,0,-0.17





Datetime,open,high,low,close,adj close,volume,change
2023-01-21 23:56:00+00:00,22780.84,22780.84,22780.84,22780.84,22780.84,0,3.23
2023-01-21 23:57:00+00:00,22784.79,22784.79,22784.79,22784.79,22784.79,0,3.95
2023-01-21 23:58:00+00:00,22777.93,22777.93,22777.93,22777.93,22777.93,10856448,-6.87





<a name = 'feature_engineering'></a>


<font size = 4><span style = 'background-color: #ddddff; padding: 5px 5px 3px 5px; line-height: 1.5; color:black;border-radius: 3px;'>**Feature Engineering**

<font size = 4><b> The following is Valkov's (tutorial guide) method of adding these features, which loops through the entire dataframe with `iterrows()`. Row-by-row iteration is slow in pandas and is not needed here.

In [11]:
def his_featurize(df):
	rows = []
	df['date'] = df.index
	
	for item, row in df.iterrows():
		row_data = dict(
			weekday = row.date.dayofweek,
			month_day = row.date.day,
			year_week = row.date.week,
			month = row.date.month,
			open = row.open,
			high = row.high,
			low = row.low,
			close = row.close,
			change = row.change)

		rows.append(row_data)
		
	features_df = pd.DataFrame(rows)
	
	return features_df

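<font size = 4><b> This is my version, which uses the vectorized `DatetimeIndex` accessors. It is more than 100 times faster in the timings below (16.4 ms vs. 2.23 s).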

## **`featurize_stocks()`**

In [12]:
def featurize_stocks(df):
	df['weekday'] = df.index.dayofweek
	df['month_day'] = df.index.day
	df['year_week'] = df.index.isocalendar().week
	df['month'] = df.index.month
	
	return df
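
To check that the two approaches agree, here is a minimal sketch on a made-up five-day index that crosses the year boundary (the `toy` frame and its dates are assumptions chosen for illustration, not data from this notebook).

```python
import pandas as pd

# toy index crossing the 2022/2023 boundary, invented for illustration
idx = pd.date_range('2022-12-30', periods=5, freq='D', tz='UTC')
toy = pd.DataFrame({'close': range(5)}, index=idx)

# vectorized: one accessor call per feature
vectorized = pd.DataFrame({
    'weekday': toy.index.dayofweek.to_numpy(),
    'month_day': toy.index.day.to_numpy(),
    'year_week': toy.index.isocalendar().week.astype('int64').to_numpy(),
    'month': toy.index.month.to_numpy(),
}, index=toy.index)

# row-wise: one Python-level step per timestamp
looped = pd.DataFrame(
    [dict(weekday=ts.dayofweek, month_day=ts.day,
          year_week=ts.week, month=ts.month) for ts in toy.index],
    index=toy.index)

same = bool((vectorized.to_numpy() == looped.to_numpy()).all())
print(same)
print(vectorized)
```

Note that `year_week` is the ISO week number, so 1 January 2023 (a Sunday) still belongs to week 52.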

In [13]:
%%time
test = his_featurize(df.copy())

CPU times: user 2.21 s, sys: 18.8 ms, total: 2.23 s
Wall time: 2.23 s


In [14]:
%%time
df = featurize_stocks(df)

CPU times: user 14.7 ms, sys: 1.18 ms, total: 15.9 ms
Wall time: 16.4 ms


In [15]:
head_tail_vert(df, 3, 'df with added features')




Datetime,open,high,low,close,adj close,volume,change,weekday,month_day,year_week,month
2022-12-25,16846.72,16846.72,16846.72,16846.72,16846.72,0,-0.79,6,25,51,12
2022-12-25,16846.7,16846.7,16846.7,16846.7,16846.7,0,-0.02,6,25,51,12
2022-12-25,16846.54,16846.54,16846.54,16846.54,16846.54,0,-0.17,6,25,51,12





Datetime,open,high,low,close,adj close,volume,change,weekday,month_day,year_week,month
2023-01-21,22780.84,22780.84,22780.84,22780.84,22780.84,0,3.23,5,21,3,1
2023-01-21,22784.79,22784.79,22784.79,22784.79,22784.79,0,3.95,5,21,3,1
2023-01-21,22777.93,22777.93,22777.93,22777.93,22777.93,10856448,-6.87,5,21,3,1





In [16]:
describe_em(df, ['close', 'volume', 'change'])

statistic,close
count,40073.0
mean,18235.33
std,2011.31
min,16408.47
25%,16735.81
50%,16944.59
75%,20756.78
max,23282.35

statistic,volume
count,40073.0
mean,5432733.62
std,18446179.2
min,0.0
25%,0.0
50%,501760.0
75%,4270080.0
max,640858112.0

statistic,change
count,40073.0
mean,0.15
std,6.17
min,-233.53
25%,-1.14
50%,0.04
75%,1.33
max,251.79





In [17]:
df.shape

(40073, 11)

<a name = 'splitting'></a>


<font size = 4><span style = 'background-color: #ddddff; padding: 5px 5px 3px 5px; line-height: 1.5; color:black;border-radius: 3px;'>**Splitting & Scaling the Data**

In [18]:
train_size = int(len(df) * .8)
pretty(f'number of training inputs: {train_size:,}.')
test_size = int(len(df) * .2)
pretty(f'number of testing inputs: {test_size:,}.')

In [19]:
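# note: `train_size + 1` leaves out the row at position `train_size`;
# `df[train_size:]` would keep it in the test set
train_df, test_df = df[:train_size], df[train_size + 1:]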

In [20]:
pretty(f'train_df.shape: {train_df.shape}  |  test_df.shape: {test_df.shape}')
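
A chronological split that keeps every row can be sketched as follows. The `toy` frame is invented for illustration; the point is that slicing with `train_size` on both sides leaves no gap, whereas `train_size + 1` skips one row (in the scaled output further down, the train set ends at 09:20 and the test set starts at 09:22).

```python
import pandas as pd

# toy frame standing in for the real data
toy = pd.DataFrame({'close': range(10)})

train_size = int(len(toy) * 0.8)

# earlier rows for training, later rows for testing, no row skipped
train_part = toy.iloc[:train_size]
test_part = toy.iloc[train_size:]

print(len(train_part), len(test_part))
```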

<font size = 4><span style = 'background-color: #ddddff; padding: 5px 5px 3px 5px; line-height: 1.5; color:black;border-radius: 3px;'>**Scaling the data**

In [21]:
scaler = MinMaxScaler(feature_range = (-1, 1))
scaler = scaler.fit(train_df)

In [22]:
train_df = pd.DataFrame(scaler.transform(train_df), 
										  index = train_df.index, 
										  columns = train_df.columns)

test_df = pd.DataFrame(scaler.transform(test_df), 
										 index = test_df.index, 
										 columns = test_df.columns)

<font size = 4><span style = 'background-color: #ddddff; padding: 5px 5px 3px 5px; line-height: 1.5; color:black;border-radius: 3px;'>**Viewing the scaled data**

In [23]:
multi([(train_df.head(5),'train_df.head(5)'),
	   (train_df.tail(5),'train_df.tail(5)'),
	   (test_df.head(5),'test_df.head(5)'),
	   (test_df.tail(5),'test_df.tail(5)')], intraday=True)

Datetime,open,high,low,close,adj close,volume,change,weekday,month_day,year_week,month
2022-12-25 00:01:00+00:00,-0.82,-0.82,-0.82,-0.82,-0.82,-1.0,-0.22,1.0,0.6,0.96,1.0
2022-12-25 00:02:00+00:00,-0.82,-0.82,-0.82,-0.82,-0.82,-1.0,-0.22,1.0,0.6,0.96,1.0
2022-12-25 00:03:00+00:00,-0.82,-0.82,-0.82,-0.82,-0.82,-1.0,-0.22,1.0,0.6,0.96,1.0
2022-12-25 00:04:00+00:00,-0.82,-0.82,-0.82,-0.82,-0.82,-1.0,-0.22,1.0,0.6,0.96,1.0
2022-12-25 00:05:00+00:00,-0.82,-0.82,-0.82,-0.82,-0.82,-1.0,-0.22,1.0,0.6,0.96,1.0

Datetime,open,high,low,close,adj close,volume,change,weekday,month_day,year_week,month
2023-01-16 09:16:00+00:00,0.78,0.78,0.78,0.78,0.78,-0.97,-0.17,-1.0,0.0,-0.92,-1.0
2023-01-16 09:17:00+00:00,0.78,0.78,0.78,0.78,0.78,-0.99,-0.22,-1.0,0.0,-0.92,-1.0
2023-01-16 09:18:00+00:00,0.78,0.78,0.78,0.78,0.78,-0.98,-0.22,-1.0,0.0,-0.92,-1.0
2023-01-16 09:19:00+00:00,0.78,0.78,0.78,0.78,0.78,-0.98,-0.21,-1.0,0.0,-0.92,-1.0
2023-01-16 09:20:00+00:00,0.78,0.78,0.78,0.78,0.78,-1.0,-0.22,-1.0,0.0,-0.92,-1.0

Datetime,open,high,low,close,adj close,volume,change,weekday,month_day,year_week,month
2023-01-16 09:22:00+00:00,0.78,0.78,0.78,0.78,0.78,-1.0,-0.25,-1.0,0.0,-0.92,-1.0
2023-01-16 09:23:00+00:00,0.78,0.78,0.78,0.78,0.78,-0.97,-0.18,-1.0,0.0,-0.92,-1.0
2023-01-16 09:24:00+00:00,0.79,0.79,0.79,0.79,0.79,-1.0,-0.21,-1.0,0.0,-0.92,-1.0
2023-01-16 09:25:00+00:00,0.79,0.79,0.79,0.79,0.79,-0.95,-0.19,-1.0,0.0,-0.92,-1.0
2023-01-16 09:26:00+00:00,0.79,0.79,0.79,0.79,0.79,-0.99,-0.22,-1.0,0.0,-0.92,-1.0

Datetime,open,high,low,close,adj close,volume,change,weekday,month_day,year_week,month
2023-01-21 23:54:00+00:00,1.59,1.59,1.59,1.59,1.59,-0.99,-0.24,0.67,0.33,-0.92,-1.0
2023-01-21 23:55:00+00:00,1.58,1.58,1.58,1.58,1.58,-0.91,-0.28,0.67,0.33,-0.92,-1.0
2023-01-21 23:56:00+00:00,1.58,1.58,1.58,1.58,1.58,-1.0,-0.2,0.67,0.33,-0.92,-1.0
2023-01-21 23:57:00+00:00,1.58,1.58,1.58,1.58,1.58,-1.0,-0.2,0.67,0.33,-0.92,-1.0
2023-01-21 23:58:00+00:00,1.58,1.58,1.58,1.58,1.58,-0.97,-0.25,0.67,0.33,-0.92,-1.0
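
The test prices above scale to values greater than 1 (for example 1.58) because the scaler was fitted on the training data only, and the test period contains higher prices than anything seen in training. A minimal sketch with invented numbers (`toy_scaler` and both arrays are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy prices: the test period goes above the training maximum
train_prices = np.array([[100.0], [110.0], [120.0]])
test_prices = np.array([[115.0], [130.0]])

# fit on training data only, so no information leaks from the test set
toy_scaler = MinMaxScaler(feature_range=(-1, 1)).fit(train_prices)

scaled_train = toy_scaler.transform(train_prices)
scaled_test = toy_scaler.transform(test_prices)

print(scaled_train.ravel())  # stays inside [-1, 1]
print(scaled_test.ravel())   # 130 is above the training max, so it exceeds 1

# inverse_transform maps scaled values back to prices
restored = toy_scaler.inverse_transform(scaled_test)
```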





<a name = 'sequences'></a>


<font size = 4><span style = 'background-color: #ddddff; padding: 5px 5px 3px 5px; line-height: 1.5; color:black;border-radius: 3px;'>**Creating sequences for training the model**
	
<font size = 4>**`create_sequences()`**<br>
* takes the input data as a dataframe, the name of the target column, and the length we want the sequences to be
	
	- It iterates over the data from the beginning up to the last position that still leaves a full sequence plus one following row for the label.<br>
	- The same result can be produced more compactly with sliding windows.
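
One possible sliding-window version is sketched below using NumPy's `sliding_window_view`. The function name `create_sequences_windowed` is hypothetical, and unlike `create_sequences()` it returns two NumPy arrays rather than a list of `(dataframe, label)` tuples, so it is not a drop-in replacement.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def create_sequences_windowed(input_data, target_column, sequence_length):
    """Hypothetical sliding-window variant returning (windows, labels) arrays."""
    values = input_data.to_numpy()
    # sliding_window_view puts the window axis last:
    # (n_windows, n_features, sequence_length) -> (n_windows, sequence_length, n_features)
    windows = sliding_window_view(values, sequence_length, axis=0).transpose(0, 2, 1)
    # the label is the target value in the row right after each window
    labels = input_data[target_column].to_numpy()[sequence_length:]
    # the last window has no following row, so it is dropped
    return windows[:-1], labels

toy = pd.DataFrame(dict(feature1=[1, 2, 3, 4, 5, 6, 7, 8],
                        label=[6, 7, 8, 9, 10, 11, 12, 13]))
X, y = create_sequences_windowed(toy, 'label', 3)
print(X.shape, y)
```

The windows are read-only views into the original array, so no data is copied until they are converted or modified.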




## **`create_sequences()`**

In [24]:
def create_sequences(input_data, target_column, sequence_length):
					 
	sequences = []
	data_size = len(input_data)
	
	for item in tq.tqdm(range(data_size - sequence_length)):
	
		# the slice end is exclusive, so the row at that position becomes the label
		sequence = input_data[item : item + sequence_length]
		label_position = item + sequence_length
		
		# defining the label with the value in the label index position
		label = input_data.iloc[label_position][target_column]
		sequences.append((sequence, label))
	return sequences

<font size = 4><span style = 'background-color: #ddddff; padding: 5px 5px 3px 5px; line-height: 1.5; color:black;border-radius: 3px;'>**How creating sequences works**

In [25]:
sample_data = pd.DataFrame(dict(
							feature1 = [1, 2, 3, 4, 5, 6, 7, 8],
							label = [6, 7, 8, 9, 10, 11, 12, 13]))

In [26]:
sample_data.head(5)

index,feature1,label
0,1,6
1,2,7
2,3,8
3,4,9
4,5,10


In [27]:
sample_sequences = create_sequences(sample_data, 'label', 3)

100%|████████████████████████████████████████████████████████████████████████| 5/5 [00:00<00:00, 3052.18it/s]


## **`see_sequence_samples()`**

In [28]:
def see_sequence_samples(data, feature_list, num_samples, 
						 sequence_portion=5):
	
	total_sequences = len(data)
	
	div_print(f'There are {total_sequences:,} total sequences in this data.', 
			  fontsize = 4)
	
	# each record is a (sequence, label) tuple
	for record in range(num_samples):
		see(data[record][0][feature_list].head(sequence_portion), 
			f'sequence no.{record + 1}   |   target / label -> {data[record][1]}',
			fontsize = 3)

In [29]:
see_sequence_samples(sample_sequences, ['feature1'], num_samples = 3)

index,feature1
0,1
1,2
2,3





index,feature1
1,2
2,3
3,4





index,feature1
2,3
3,4
4,5





<font size = 4><span style = 'background-color: #ddddff; padding: 5px 5px 3px 5px; line-height: 1.5; color:black;border-radius: 3px;'>**Creating sequences from the data**

In [30]:
sequence_length = 60

train_sequences = create_sequences(train_df,
								   'close',
								   sequence_length)

test_sequences = create_sequences(test_df,
								   'close',
								   sequence_length)

100%|███████████████████████████████████████████████████████████████| 31998/31998 [00:01<00:00, 19708.27it/s]
100%|█████████████████████████████████████████████████████████████████| 7954/7954 [00:00<00:00, 18104.46it/s]


In [31]:
see_sequence_samples(train_sequences,
					['close', 'volume', 'change'],
					num_samples = 2,
					sequence_portion = 3)

Datetime,close,volume,change
2022-12-25,-0.82,-1.0,-0.22
2022-12-25,-0.82,-1.0,-0.22
2022-12-25,-0.82,-1.0,-0.22





Datetime,close,volume,change
2022-12-25,-0.82,-1.0,-0.22
2022-12-25,-0.82,-1.0,-0.22
2022-12-25,-0.82,-1.0,-0.22





In [32]:
see_sequence_samples(test_sequences,
					['close', 'volume', 'change'],
					num_samples = 2,
					sequence_portion = 3)

Datetime,close,volume,change
2023-01-16,0.78,-1.0,-0.25
2023-01-16,0.78,-0.97,-0.18
2023-01-16,0.79,-1.0,-0.21





Datetime,close,volume,change
2023-01-16,0.78,-0.97,-0.18
2023-01-16,0.79,-1.0,-0.21
2023-01-16,0.79,-0.95,-0.19





<font size = 4><span style = 'background-color: #ddddff; padding: 5px 5px 3px 5px; line-height: 1.5; color:black;border-radius: 3px;'>**Sequences to PyTorch Datasets**
	
* `__init__()` - takes the sequences it will work with
* `__len__()` - returns the number of sequences in the dataset
* `__getitem__()` - takes the index of the item we are interested in
	* within each sequence item is a tuple with the input data as the first tuple value and the target data or label as the second
	* sequences are converted from pandas to numpy to Tensor
	* labels are converted into floats, since we are doing regression and wanting to predict floating point numbers 

In [33]:
class TickerDataset(Dataset):
	def __init__(self, sequences):
		self.sequences = sequences
		
	def __len__(self):
		return len(self.sequences)
	
	def __getitem__(self, idx):
		sequence, label = self.sequences[idx]
		return dict(sequence = torch.Tensor(sequence.to_numpy()),
					label = torch.tensor(label).float())

<font size = 4><span style = 'background-color: #ddddff; padding: 5px 5px 3px 5px; line-height: 1.5; color:black;border-radius: 3px;'>**Wrapping data with PyTorch Lightning**
	
* `setup()` converts sequences into the dataset class defined above
* `train_dataloader()`, `val_dataloader()`, `test_dataloader()`
	* create the three different kinds of dataloaders for the data (`val_dataloader()` is the hook name Lightning looks for)
	* `train_dataloader()` -> `shuffle = False` - we do not shuffle the data because it is time series data, so the order is important
		* `num_workers = 2` - loads batches in two background worker processes, which can speed up data loading
	* `val_dataloader()` and `test_dataloader()` - `batch_size = 1` - for testing and making predictions, we would usually do one record at a time

In [34]:
class TickerDataModule(pl.LightningDataModule):
	
	def __init__(self, 
				 train_sequences, 
				 test_sequences, 
				 batch_size = 8):
		super().__init__()
		self.train_sequences = train_sequences
		self.test_sequences = test_sequences
		self.batch_size = batch_size
	
	def setup(self, stage = None):
		self.train_dataset = TickerDataset(self.train_sequences)
		self.test_dataset = TickerDataset(self.test_sequences)
		
	def train_dataloader(self):
		return DataLoader(self.train_dataset,
						 batch_size = self.batch_size,
						 shuffle = False,
						 num_workers = 2)
	
	def val_dataloader(self):
		return DataLoader(self.test_dataset,
				 batch_size = 1,
				 shuffle = False,
				 num_workers = 1)
	
	def test_dataloader(self):
		return DataLoader(self.test_dataset,
				 batch_size = 1,
				 shuffle = False,
				 num_workers = 1)

<font size = 4><span style = 'background-color: #ddddff; padding: 5px 5px 3px 5px; line-height: 1.5; color:black;border-radius: 3px;'>**Defining hyperparameters & the data module**
* instantiating the data module with the sequences, then calling `setup()`, which creates the PyTorch Datasets

In [35]:
num_epochs = 8
batch_size = 64

data_module = TickerDataModule(train_sequences, 
							   test_sequences, 
							   batch_size = batch_size)

data_module.setup()

<font size = 4><span style = 'background-color: #ddddff; padding: 5px 5px 3px 5px; line-height: 1.5; color:black;border-radius: 3px;'>**Investigating a single item from TickerDataset**

In [36]:
train_dataset = TickerDataset(train_sequences)

In [37]:
for item in train_dataset:
	print(item['sequence'].shape)
	print(item['label'].shape)
	print(item['label'])
	break

torch.Size([60, 11])
torch.Size([])
tensor(-0.8197)


<font size = 4><span style = 'background-color: #ddddff; padding: 5px 5px 3px 5px; line-height: 1.5; color:black;border-radius: 3px;'>**Creating the LSTM model**
* LSTM = long short-term memory neural network
	* Can map a sequence to a single output or to multiple data points
	* Transformers can also be used for time series (as used in text and images)
	* `batch_first = True` - the batch dimension comes first in the input, which matches the dataloader: it passes batches of sequences with 60 data points in each sequence and 11 features for each data point
	* `num_layers` - number of LSTMs stacked on one another
	* `regressor` is the final linear layer that outputs our single prediction
* `forward()`
	* `flatten_parameters()` - compacts the LSTM weights into one contiguous block of memory, which helps on GPUs and with the multi-GPU training that PyTorch Lightning allows for
	* the LSTM returns its output features plus the hidden state and the cell state; the hidden state of the final layer is passed to the regressor
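
The shapes involved can be checked with a short sketch. The batch here is random data, and the sizes (batch of 64, sequence length 60, 11 features, 128 hidden units, 2 layers) are taken from the values used elsewhere in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

lstm = nn.LSTM(input_size=11, hidden_size=128,
               num_layers=2, batch_first=True, dropout=0.2)
regressor = nn.Linear(128, 1)

# random stand-in for one batch: (batch, sequence_length, num_features)
batch = torch.randn(64, 60, 11)

output, (hidden, cell) = lstm(batch)
prediction = regressor(hidden[-1])

print(output.shape)      # every time step of the top layer: (64, 60, 128)
print(hidden.shape)      # final hidden state of each layer: (2, 64, 128)
print(prediction.shape)  # one value per sequence: (64, 1)
```

`hidden[-1]` is the top layer's hidden state at the last time step, so it holds the same values as `output[:, -1, :]`.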

In [38]:
class PricePredictionModel(nn.Module):
	
	def __init__(self, 
				 num_features, 
				 num_hidden = 128, 
				 num_layers = 2):

		super().__init__()
		self.num_hidden = num_hidden

		self.lstm = nn.LSTM(
						input_size = num_features,
						hidden_size = num_hidden,
						batch_first = True,
						num_layers = num_layers,
						dropout = 0.2)

		self.regressor = nn.Linear(num_hidden, 1)
	
	def forward(self, inputs):
		self.lstm.flatten_parameters()
		
		# getting states from the LSTM, retrieving the last layer of hidden
		# passing the last layer to the regressor
		_, (hidden, _) = self.lstm(inputs)
		out = hidden[-1]
		return self.regressor(out)

<font size = 4><span style = 'background-color: #ddddff; padding: 5px 5px 3px 5px; line-height: 1.5; color:black;border-radius: 3px;'>**Creating the lightning module for the model**
* `labels.unsqueeze(dim=1)` - reshapes the labels from `(batch_size,)` to `(batch_size, 1)` so that they match the shape of the model's predictions
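
A small sketch with invented numbers shows why the shapes need to match. Without `unsqueeze`, a `(3, 1)` tensor and a `(3,)` tensor broadcast to `(3, 3)`, so every prediction would be compared against every label.

```python
import torch
import torch.nn as nn

loss_function = nn.MSELoss()

predictions = torch.tensor([[1.0], [2.0], [3.0]])  # model output: (3, 1)
labels = torch.tensor([1.0, 2.0, 4.0])             # batch labels: (3,)

# aligned shapes: each prediction is compared with its own label
aligned_loss = loss_function(predictions, labels.unsqueeze(dim=1))

# mismatched shapes broadcast into a 3 x 3 grid of differences
broadcast_diff = predictions - labels

print(aligned_loss)
print(broadcast_diff.shape)
```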

In [39]:
class TickerPricePredictor(pl.LightningModule):
	
	def __init__(self, num_features):
		super().__init__()
		
		self.model = PricePredictionModel(num_features)
		self.loss_function = nn.MSELoss()
		
	def forward(self, inputs, labels = None):
		output = self.model(inputs)
		loss = 0
		if labels is not None:
			loss = self.loss_function(output, labels.unsqueeze(dim=1))
		return loss, output
	
	def training_step(self, batch, batch_idx):
		sequences = batch['sequence']
		labels = batch['label']
		loss, outputs = self(sequences, labels)
		self.log('training loss', loss, prog_bar = True, logger = True)
		return loss
	
	def validation_step(self, batch, batch_idx):
		sequences = batch['sequence']
		labels = batch['label']
		loss, outputs = self(sequences, labels)
		self.log('validation loss', loss, prog_bar = True, logger = True)
		return loss
	
	def test_step(self, batch, batch_idx):
		sequences = batch['sequence']
		labels = batch['label']
		loss, outputs = self(sequences, labels)
		self.log('test loss', loss, prog_bar = True, logger = True)
		return loss
	
	def configure_optimizers(self):
		return torch.optim.AdamW(self.parameters(), lr = 0.0001)

In [40]:
model = TickerPricePredictor(num_features = train_df.shape[1])

In [41]:
# for item in data_module.train_dataloader():
# 	print(item['sequence'].shape)
# 	print(item['label'].shape)
# 	# print(item['label'])
# 	break

<font size = 4><span style = 'background-color: #ddddff; padding: 5px 5px 3px 5px; line-height: 1.5; color:black;border-radius: 3px;'>**Inspecting a batch from the train dataloader**

In [None]:
for item in data_module.train_dataloader():
	print(item)
	break

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/evancarr/opt/anaconda3/envs/stock_lstm_tutorials/lib/python3.10/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/Users/evancarr/opt/anaconda3/envs/stock_lstm_tutorials/lib/python3.10/multiprocessing/spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
AttributeError: Can't get attribute 'TickerDataset' on <module '__main__' (built-in)>
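
The traceback comes from a spawned worker process that cannot find `TickerDataset` in `__main__`. This appears to be the usual problem with `num_workers > 0` when the dataset class is defined inside a notebook on a platform that starts workers with `spawn` (macOS, Windows). Two common workarounds are to move the class into an importable module or to load data in the main process with `num_workers = 0`. The sketch below shows the second option with a stand-in dataset; `ToySequenceDataset` and its random data are assumptions for illustration.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToySequenceDataset(Dataset):
    """Hypothetical stand-in for TickerDataset, filled with random data."""

    def __init__(self, n_items=10, sequence_length=60, n_features=11):
        self.sequences = torch.randn(n_items, sequence_length, n_features)
        self.labels = torch.randn(n_items)

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        return dict(sequence=self.sequences[idx], label=self.labels[idx])

# num_workers=0 keeps loading in the main process, so nothing is pickled
loader = DataLoader(ToySequenceDataset(), batch_size=4,
                    shuffle=False, num_workers=0)

first_batch = next(iter(loader))
print(first_batch['sequence'].shape)
print(first_batch['label'].shape)
```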

