# <center>SPY Classification with Intraday Data Based Variables</center>

In this study, we use classification to assign 'Buy', 'Flat', or 'Sell' to each row of data.

Each row holds one trading day of SPY values.

Unlike my peers' studies, this one focuses on volume and explores its value in classifying the data. 
It includes values derived from intraday data, in addition to daily data.  
In fact, the primary focus of this study is the variables derived from the intraday data. 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# This import is a custom module we created to support our analysis.
from mlUtilities import ml_utils, data_utils

In [3]:
from importlib import reload

In [4]:
import warnings
warnings.filterwarnings('ignore')

# Load Data

<hr size='3'>

Our primary dataframes are i_data and d_data (i for intraday, d for daily).
Both variables are pandas DataFrame objects.

### Intraday Data Load and Preparation

In [5]:
i_data = pd.read_csv('Data/5m_SPY')

In [6]:
# We perform the feature engineering with this function call
i_data = data_utils.transform_data(i_data, period='intraday')

In [7]:
i_data.head()

Unnamed: 0,Datetime,Open,High,Low,Close,Adj Close,Volume,Vol_Direction,Dir_to_Vol,Vol_Tide
0,2021-11-17 09:30:00-05:00,469.0,469.040009,468.450012,468.519989,468.519989,0.737352,,,
1,2021-11-17 09:35:00-05:00,468.529999,468.540009,468.160004,468.320007,468.320007,0.244347,-1.0,-0.244347,-0.244347
2,2021-11-17 09:40:00-05:00,468.320007,468.524994,468.160004,468.450012,468.450012,0.168764,1.0,0.168764,-0.075583
3,2021-11-17 09:45:00-05:00,468.450012,468.640015,468.130005,468.13501,468.13501,0.2529,-1.0,-0.2529,-0.328483
4,2021-11-17 09:50:00-05:00,468.13501,468.13501,467.579987,467.670013,467.670013,0.278044,-1.0,-0.278044,-0.606527


The End Of Day (EOD) volume tide, computed in the next cell as the per-day sum of Dir_to_Vol, is the variable we wanted to derive from our intraday data.

In [8]:
eod_vol_tide = i_data.groupby([i_data.Datetime.dt.date])['Dir_to_Vol'].sum()

### Daily Data Load and Preparation

In [9]:
d_data = pd.read_csv('Data/1d_SPY')
d_data['Volume'] = data_utils.min_max_normalize(d_data['Volume'], 0, 5)
d_data = data_utils.transform_data(d_data, eod_tide=eod_vol_tide, create_targets=True)
short_d_data = data_utils.transform_tide(d_data, eod_tide=eod_vol_tide)
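The source of `data_utils.min_max_normalize` is not shown here. As a minimal sketch of what a min-max rescale to the range 0 to 5 could look like, assuming a plain linear mapping (the function name `min_max_scale` and the sample volumes are made up for illustration):

```python
import pandas as pd


def min_max_scale(values: pd.Series, new_min: float = 0.0, new_max: float = 5.0) -> pd.Series:
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    old_min, old_max = values.min(), values.max()
    if old_max == old_min:
        # Constant series: return new_min everywhere to avoid dividing by zero
        return pd.Series(new_min, index=values.index, dtype=float)
    return (values - old_min) / (old_max - old_min) * (new_max - new_min) + new_min


# Hypothetical raw daily volumes
volume = pd.Series([50_000_000, 75_000_000, 100_000_000])
scaled = min_max_scale(volume)
print(scaled.tolist())  # [0.0, 2.5, 5.0]
```

Because the result depends on the minimum and maximum of whatever series is passed in, two files scaled separately are not necessarily on the same scale.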

Drop the rows that contain missing (NaN) values.

In [10]:
short_d_data.dropna(inplace=True)

Let's drop the columns that are not being tested.

In [11]:
target = short_d_data.pop('Target')
short_d_data.set_index('Date', inplace=True)
study_columns = ['Volume', 'Directional_Vol', 'EOD_Vol_Tide',
                 'Vol_Adv_Dec', 'Tide_Adv_Dec', 'Vol_Diff',
                 'Change_1', 'Change_3', 'Change_5', 'Change_10'
                ]
short_d_data = short_d_data[study_columns]

## Feature Engineering

The following features were engineered.  
These are features that I created from scratch.  
I do not know why the Jupyter notebook is not converting the LaTeX notation.

-     Normalize the volume using min max 0 to 5
-     $i$= day
-     $m$ = 1 period within day $i$
-     $k$ = number of five minute periods in a day $i$
-     Vol\_Direction$_i$ = $ \dfrac{Change\_n_i}{|Change\_n_i|} $
-     Directional Vol = Vol\_Direction * Volume
-     Vol Tide$_k$ = $\sum_{m=1}^{k} DirToVol_{m}$ 
-     EOD Vol Tide$_i$ = $\sum_{m=1}^{k} DirToVol_{m}$ over the periods of day $i$, i.e. the value of Vol Tide at the last period of the day
-     Vol Adv Dec = cumsum(DirectionalVol)
-     Tide Adv Dec = cumsum(EODVolTide)
-     Vol Diff = (Vol Adv Dec) - (EOD Vol Tide) 
-     Change\_n = $\dfrac{Close_i - Close_{i-n}}{Close_i}$
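The intraday definitions in the list above can be sketched on a toy frame. This is not the code inside `data_utils.transform_data`, which is not shown; it only reuses the column names from the `i_data.head()` output and the per-day sum from cell In [8]. The prices and volumes are made up, and taking the direction from the bar-to-bar change in Close within each day is an assumption.

```python
import numpy as np
import pandas as pd

# Toy 5-minute bars for two days (made-up values)
bars = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2021-11-17 09:30", "2021-11-17 09:35", "2021-11-17 09:40",
        "2021-11-18 09:30", "2021-11-18 09:35", "2021-11-18 09:40",
    ]),
    "Close": [468.5, 468.3, 468.45, 470.0, 470.5, 470.2],
    "Volume": [0.7, 0.2, 0.1, 0.5, 0.3, 0.1],
})

day = bars["Datetime"].dt.date

# Assumption: direction is the sign of the close-to-close change within the day.
# +1 for an up bar, -1 for a down bar, NaN for the first bar of each day.
# (np.sign gives 0 for an unchanged close, where Change/|Change| is undefined.)
bars["Vol_Direction"] = np.sign(bars.groupby(day)["Close"].diff())

# Signed volume for each bar
bars["Dir_to_Vol"] = bars["Vol_Direction"] * bars["Volume"]

# Running "tide" within each day
bars["Vol_Tide"] = bars.groupby(day)["Dir_to_Vol"].cumsum()

# End-of-day tide: the per-day sum of the signed volume (NaN is skipped)
eod_vol_tide = bars.groupby(day)["Dir_to_Vol"].sum()
print(eod_vol_tide.round(3))
```

With this construction the last Vol_Tide value of each day equals that day's EOD value.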

### Thoughts on the use of intraday data

The focus of this study was to see the impact of pulling the volume "tide" from intraday data and comparing 
it to the overall volume for the day.  
The thinking is that during a trading day the price of a stock moves up and down, much like a tide.  
The strength of that tide is determined by the number of shares (volume) traded over a specific time period.
The tide is then placed into the daily data frame and compared with other variables derived from the daily data.
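As a rough sketch of how the end-of-day tide might sit next to the daily variables, the snippet below builds a few of the daily columns on made-up data. It is not the code inside `data_utils.transform_data` or `data_utils.transform_tide`, which are not shown; the values are hypothetical and Change_1 follows the definition given in the feature list.

```python
import datetime as dt

import numpy as np
import pandas as pd

dates = [dt.date(2022, 2, 7), dt.date(2022, 2, 8), dt.date(2022, 2, 9)]

# Made-up daily bars; Volume is assumed to be already normalized
daily = pd.DataFrame({"Close": [100.0, 102.0, 101.0],
                      "Volume": [1.0, 2.0, 1.5]}, index=dates)

# Made-up end-of-day tide values, one per trading day
eod_vol_tide = pd.Series([-3.0, 1.0, 2.5], index=dates)

# Change_1 = (Close_i - Close_{i-1}) / Close_i, as defined in the feature list
daily["Change_1"] = daily["Close"].diff(1) / daily["Close"]

# Daily volume signed by the direction of the one-day change
daily["Directional_Vol"] = np.sign(daily["Change_1"]) * daily["Volume"]

# The intraday-derived tide is aligned to the daily frame by date
daily["EOD_Vol_Tide"] = eod_vol_tide

# Running totals of both series
daily["Vol_Adv_Dec"] = daily["Directional_Vol"].cumsum()
daily["Tide_Adv_Dec"] = daily["EOD_Vol_Tide"].cumsum()

print(daily[["Directional_Vol", "EOD_Vol_Tide", "Vol_Adv_Dec", "Tide_Adv_Dec"]])
```

The first row has no previous close, so its Change_1, Directional_Vol, and Vol_Adv_Dec are NaN; rows like that are what the `dropna` call removes.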

# Final Data Set

In [25]:
short_d_data.tail()

Date,Volume,Directional_Vol,EOD_Vol_Tide,Vol_Adv_Dec,Tide_Adv_Dec,Vol_Diff,Change_1,Change_3,Change_5,Change_10
2022-02-04,1.318261,1.318261,-0.574958,-2.839388,-7.177636,-2.26443,0.004702,-0.009383,0.015273,0.024476
2022-02-07,0.863057,-0.863057,-3.657778,-3.702445,-10.835414,-0.044667,-0.003209,-0.022062,-0.00589,0.01687
2022-02-08,0.816533,0.816533,0.686838,-2.885912,-10.148576,-3.57275,0.008228,0.009718,-0.004438,0.037908
2022-02-09,0.972171,0.972171,3.437074,-1.913742,-6.711502,-5.350816,0.014636,0.019701,0.000415,0.055748
2022-02-10,1.610881,-1.610881,2.13879,-3.524623,-4.572712,-5.663413,-0.017966,0.004606,0.00609,0.041926


In [13]:
print(short_d_data.shape)
print(short_d_data.describe().T)

(59, 10)
                 count      mean       std        min       25%       50%  \
Volume            59.0  1.094183  0.553837   0.363013  0.691473  1.011084   
Directional_Vol   59.0  0.022518  1.234543  -2.446577 -1.042961  0.460182   
EOD_Vol_Tide      59.0 -0.077504  4.308697 -10.519645 -2.992513 -0.216242   
Vol_Adv_Dec       59.0 -5.342571  1.662669 -10.366840 -6.313487 -5.268837   
Tide_Adv_Dec      59.0 -2.927733  7.919824 -22.431489 -7.739327 -1.207872   
Vol_Diff          59.0 -5.265067  4.193186 -16.414500 -7.865187 -4.728692   
Change_1          59.0 -0.000671  0.011529  -0.023505 -0.008838 -0.000939   
Change_3          59.0 -0.001502  0.020116  -0.040548 -0.014675 -0.002305   
Change_5          59.0 -0.002530  0.026156  -0.057155 -0.019218 -0.000647   
Change_10         59.0 -0.005602  0.031590  -0.084455 -0.023683 -0.000336   

                      75%        max  
Volume           1.352341   3.112158  
Directional_Vol  0.991627   3.112158  
EOD_Vol_Tide     2.769017 

## Target Values/Classes

Our target values are the classes 'Buy', 'Flat', and 'Sell'.
The classes are assigned as follows:

For Change_5 as $C$

Buy if $C \geq 0.3\%$

Flat if $-0.3\% < C < 0.3\%$

Sell if $C \leq -0.3\%$
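The thresholds above can be sketched as a small labeling function. The target creation itself happens inside `data_utils.transform_data` (with `create_targets=True`), which is not shown, so the function name `label_change` and the sample values are hypothetical. The sketch also assumes the change is stored as a fraction, so that 0.3% corresponds to 0.003.

```python
import numpy as np
import pandas as pd


def label_change(change: pd.Series, threshold: float = 0.003) -> pd.Series:
    """Label each change as 'Buy', 'Flat', or 'Sell' using a symmetric threshold."""
    conditions = [change >= threshold, change <= -threshold]
    # Anything strictly between -threshold and +threshold falls through to 'Flat'.
    # NaN compares False in both conditions, so drop NaN rows before labeling.
    labels = np.select(conditions, ["Buy", "Sell"], default="Flat")
    return pd.Series(labels, index=change.index, name="Target")


# Made-up changes, including both boundary values
example = pd.Series([0.012, -0.008, 0.001, 0.003, -0.003])
print(label_change(example).tolist())
# ['Buy', 'Sell', 'Flat', 'Buy', 'Sell']
```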

# Load Test Data

In [14]:
i_test_data = pd.read_csv('Data/5m_SPY_test')
i_test_data = data_utils.transform_data(i_test_data, period='intraday')
eod_value = i_test_data.groupby([i_test_data.Datetime.dt.date])['Dir_to_Vol'].sum()

In [15]:
d_test_data = pd.read_csv('Data/1d_SPY_test')
d_test_data['Volume'] = data_utils.min_max_normalize(d_test_data['Volume'], 0, 5)
d_test_data = data_utils.transform_data(d_test_data, period='daily', eod_tide=eod_value, create_targets=True)
test_data = data_utils.transform_tide(d_test_data, eod_tide=eod_value)
test_data.dropna(inplace=True)

In [16]:
test_target = test_data.pop('Target')
test_data.set_index('Date', inplace=True)
test_data = test_data[study_columns]

In [17]:
test_data.tail()

Date,Volume,Directional_Vol,EOD_Vol_Tide,Vol_Adv_Dec,Tide_Adv_Dec,Vol_Diff,Change_1,Change_3,Change_5,Change_10
2022-03-10,0.990759,0.990759,-2.518018,-9.537802,-20.949954,-7.019784,-0.004516,0.014424,-0.023479,-0.006584
2022-03-11,1.011688,-1.011688,-7.799765,-10.549489,-28.749719,-2.749724,-0.012715,0.009177,-0.027998,-0.040388
2022-03-14,1.014371,-1.014371,-3.062724,-11.56386,-31.812443,-8.501136,-0.007308,-0.024356,-0.005794,-0.044958
2022-03-15,1.155383,1.155383,8.726922,-10.408478,-23.085521,-19.1354,0.02199,0.001622,0.023832,-0.008861
2022-03-16,1.676093,1.676093,11.469683,-8.732384,-11.615838,-20.202067,0.022174,0.037018,0.019209,-0.005184


# Run Naive Bayes

In [18]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score

In [19]:
nB = GaussianNB()
nB.fit(short_d_data, target)

GaussianNB()

## Results

In [20]:
train_prediction = nB.predict(short_d_data)

In [21]:
training_cm = confusion_matrix(target, train_prediction, labels=['Buy', 'Flat', 'Sell'])
training_accuracy = accuracy_score(target, train_prediction)

In [22]:
test_prediction = nB.predict(test_data)
test_cm = confusion_matrix(test_target, test_prediction, labels=['Buy', 'Flat', 'Sell'])
test_accuracy = accuracy_score(test_target, test_prediction)

### Confusion Matrix

In [23]:
# TODO: use Kenneth's heat matrix for the confusion matrix, if it can handle 3 classes.
# scikit-learn's confusion_matrix puts the true labels on the rows
# and the predicted labels on the columns.
pred_column = pd.MultiIndex.from_tuples([('Predicted', 'Buy'),('Predicted', 'Flat'),('Predicted', 'Sell')])
true_index = pd.MultiIndex.from_tuples([('Actual', 'Buy'),('Actual', 'Flat'),('Actual', 'Sell')])
train_cm_as_df = pd.DataFrame(training_cm, columns=pred_column,
                        index=true_index)
print('Confusion Matrix for Training Data Set:')
print(train_cm_as_df)
print()

test_cm_as_df = pd.DataFrame(test_cm, columns=pred_column,
                        index=true_index)
print('Confusion Matrix for Test Data Set:')
print(test_cm_as_df)

Confusion Matrix for Training Data Set:
            Predicted          
                  Buy Flat Sell
Actual Buy         15    3    2
       Flat         1   10    1
       Sell         1    8   18

Confusion Matrix for Test Data Set:
            Predicted          
                  Buy Flat Sell
Actual Buy          4    0    2
       Flat         1    0    1
       Sell         1    0   10

In [24]:
print(f'Training Data Accuracy: {round(training_accuracy, 3)}\n')
print(f'Test Data Accuracy: {round(test_accuracy, 3)}')

Training Data Accuracy: 0.729

Test Data Accuracy: 0.737
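As a quick cross-check, the reported accuracies can be recomputed from the confusion matrices printed above: accuracy is the sum of the diagonal (correct predictions) divided by the total count, which does not depend on which axis holds the actual labels. The arrays below are copied from the printed output; the helper name `accuracy_from_cm` is only for illustration.

```python
import numpy as np

# Counts copied from the printed confusion matrices
train_cm = np.array([[15, 3, 2],
                     [1, 10, 1],
                     [1, 8, 18]])
test_cm = np.array([[4, 0, 2],
                    [1, 0, 1],
                    [1, 0, 10]])


def accuracy_from_cm(cm: np.ndarray) -> float:
    """Correct predictions (the diagonal) divided by all predictions."""
    return float(np.trace(cm) / cm.sum())


print(round(accuracy_from_cm(train_cm), 3))  # 0.729  (43 of 59)
print(round(accuracy_from_cm(test_cm), 3))   # 0.737  (14 of 19)
```

The test figure rests on only 19 rows, so each prediction moves the accuracy by about five percentage points.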
