# <center>SPY Classification with Intraday Data Based Variables</center>

For this study, we are going to use classification to assign 'Buy', 'Flat', or 'Sell' to each row of data.

Each row of data is one day of SPY values.

This study differs from those of my peers in that it focuses on volume and explores its value in classifying the data.
It includes values derived from intraday data, in addition to daily data.
In fact, the primary focus is the variables derived from the intraday data and their impact on the variables derived from them.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from mlUtilities import ml_utils, data_utils

In [2]:
from importlib import reload

In [3]:
import warnings
warnings.filterwarnings('ignore')

# Load Data

<hr size='3'>

Our primary dataframes will be i_data and d_data--i for intraday and d for daily.
Both variables are pandas DataFrame objects.

### Intraday Data Load and Preparation

In [4]:
i_data = pd.read_csv('Data/5m_SPY')

In [5]:
i_data = data_utils.transform_data(i_data, period='intraday')

In [6]:
i_data.head()

Unnamed: 0,Datetime,Open,High,Low,Close,Adj Close,Volume,Vol_Direction,Dir_to_Vol,Vol_Tide
0,2021-11-17 09:30:00-05:00,469.0,469.040009,468.450012,468.519989,468.519989,0.737352,,,
1,2021-11-17 09:35:00-05:00,468.529999,468.540009,468.160004,468.320007,468.320007,0.244347,-1.0,-0.244347,-0.244347
2,2021-11-17 09:40:00-05:00,468.320007,468.524994,468.160004,468.450012,468.450012,0.168764,1.0,0.168764,-0.075583
3,2021-11-17 09:45:00-05:00,468.450012,468.640015,468.130005,468.13501,468.13501,0.2529,-1.0,-0.2529,-0.328483
4,2021-11-17 09:50:00-05:00,468.13501,468.13501,467.579987,467.670013,467.670013,0.278044,-1.0,-0.278044,-0.606527


The End Of Day (EOD) value is the variable we want to derive from the intraday data: the sum of Dir_to_Vol over each trading day.

In [7]:
eod_vol_tide = i_data.groupby([i_data.Datetime.dt.date])['Dir_to_Vol'].sum()
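
The intraday columns in the preview above can be reproduced with a few pandas operations. The sketch below is an assumption about what `data_utils.transform_data` computes, not its actual code: it takes `Vol_Direction` as the sign of the change in `Close` from the previous bar, `Dir_to_Vol` as that sign times `Volume`, and `Vol_Tide` as the running sum. The `bars` frame is a hand-copied stand-in for `i_data`, using the five rows printed above.

```python
import numpy as np
import pandas as pd

# Five 5-minute bars copied from the i_data.head() preview (Close and Volume only).
bars = pd.DataFrame({
    "Datetime": pd.date_range("2021-11-17 09:30", periods=5, freq="5min"),
    "Close": [468.519989, 468.320007, 468.450012, 468.13501, 467.670013],
    "Volume": [0.737352, 0.244347, 0.168764, 0.2529, 0.278044],
})

# Assumed rule: +1 when Close rises from the previous bar, -1 when it falls.
# The first bar has no previous bar, so its direction is NaN.
bars["Vol_Direction"] = np.sign(bars["Close"].diff())

# Signed volume per bar, and its running total within the session.
bars["Dir_to_Vol"] = bars["Vol_Direction"] * bars["Volume"]
bars["Vol_Tide"] = bars["Dir_to_Vol"].cumsum()

# End-of-day value: one number per calendar date (NaN rows are skipped by sum).
eod = bars.groupby(bars["Datetime"].dt.date)["Dir_to_Vol"].sum()
print(bars[["Vol_Direction", "Dir_to_Vol", "Vol_Tide"]])
print(eod)
```

Under these assumptions the computed `Vol_Tide` agrees with the values printed in the preview, which suggests the EOD value is the day's final `Vol_Tide`.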

### Daily Data Load and Preparation

## Target Values/Classes

Our target values are the classes 'Buy', 'Flat', and 'Sell'.
The classes are assigned as follows, with Change_5 denoted $C$:

Buy if $C \geq 0.3$

Flat if $-0.3 < C < 0.3$

Sell if $C \leq -0.3$
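
The thresholds above can be written as a small labeling function. This is a minimal sketch, not the code inside `data_utils`: the function name `assign_target` and its `threshold` parameter are hypothetical, and it assumes $C$ is expressed in the same units as the 0.3 threshold.

```python
import numpy as np
import pandas as pd

def assign_target(change: pd.Series, threshold: float = 0.3) -> pd.Series:
    """Label each value 'Buy', 'Flat', or 'Sell' using symmetric thresholds."""
    conditions = [change >= threshold, change <= -threshold]
    # Anything matching neither condition falls through to 'Flat'.
    # Note that NaN matches neither condition, so drop missing values first.
    labels = np.select(conditions, ["Buy", "Sell"], default="Flat")
    return pd.Series(labels, index=change.index, name="Target")

example = pd.Series([0.5, 0.3, 0.0, -0.3, -0.29])
print(assign_target(example).tolist())
# ['Buy', 'Buy', 'Flat', 'Sell', 'Flat']
```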

In [8]:
reload(data_utils)

<module 'mlUtilities.data_utils' from '/home/gsandoval/Documents/Classes/depaul/ML_Programming/Project/MLProgramming/mlUtilities/data_utils.py'>

In [9]:
d_data = pd.read_csv('Data/1d_SPY')
d_data['Volume'] = data_utils.min_max_normalize(d_data['Volume'], 0, 5)
d_data = data_utils.transform_data(d_data, eod_tide=eod_vol_tide, create_targets=True)
short_d_data = data_utils.transform_tide(d_data, eod_tide=eod_vol_tide)
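
For readers without access to `mlUtilities`, the volume scaling step can be sketched as a standard min-max rescale onto the range [0, 5]. The function `min_max_scale` below is a hypothetical stand-in; whether `data_utils.min_max_normalize` uses exactly this formula is an assumption.

```python
import pandas as pd

def min_max_scale(values: pd.Series, low: float, high: float) -> pd.Series:
    """Linearly map values so the minimum becomes `low` and the maximum `high`."""
    span = values.max() - values.min()
    if span == 0:
        # A constant series has no spread; map every value to the lower bound.
        return pd.Series(float(low), index=values.index)
    return low + (values - values.min()) / span * (high - low)

volume = pd.Series([50_000_000, 75_000_000, 100_000_000])
scaled = min_max_scale(volume, 0, 5)
print(scaled.tolist())  # [0.0, 2.5, 5.0]
```

If the scaling does work this way, the result depends on the minimum and maximum of whichever series is passed in, so two data sets scaled separately are not on a shared scale.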

Drop the rows that contain missing (NaN) values.

In [10]:
short_d_data.dropna(inplace=True)

In [11]:
short_d_data.tail()

Unnamed: 0,Date,High,Low,Open,Close,Volume,Adj Close,Directional_Vol,Vol_Adv_Dec,Change_1,Change_3,Change_5,Change_10,EOD_Vol_Tide,Tide_Adv_Dec,Vol_Diff,Target
54,2022-02-04,452.779999,443.829987,446.350006,448.700012,1.318261,448.700012,1.318261,-2.839388,0.004702,-0.009383,0.015273,0.024476,-0.574958,-7.177636,-2.26443,Sell
55,2022-02-07,450.98999,445.850006,449.51001,447.26001,0.863057,447.26001,-0.863057,-3.702445,-0.003209,-0.022062,-0.00589,0.01687,-3.657778,-10.835414,-0.044667,Flat
56,2022-02-08,451.920013,445.220001,446.730011,450.940002,0.816533,450.940002,0.816533,-2.885912,0.008228,0.009718,-0.004438,0.037908,0.686838,-10.148576,-3.57275,Flat
57,2022-02-09,457.880005,455.01001,455.220001,457.540009,0.972171,457.540009,0.972171,-1.913742,0.014636,0.019701,0.000415,0.055748,3.437074,-6.711502,-5.350816,Buy
58,2022-02-10,457.709991,447.200012,451.339996,449.320007,1.610881,449.320007,-1.610881,-3.524623,-0.017966,0.004606,0.00609,0.041926,2.13879,-4.572712,-5.663413,Sell


### Let's drop the columns that are not being tested.

In [12]:
target = short_d_data.pop('Target')
short_d_data.set_index('Date', inplace=True)
study_columns = ['Volume', 'Directional_Vol', 'EOD_Vol_Tide',
                 'Vol_Adv_Dec', 'Tide_Adv_Dec', 'Vol_Diff',
                 'Change_1', 'Change_3', 'Change_5', 'Change_10'
                ]
short_d_data = short_d_data[study_columns]

In [13]:
short_d_data.head()

Date,Volume,Directional_Vol,EOD_Vol_Tide,Vol_Adv_Dec,Tide_Adv_Dec,Vol_Diff,Change_1,Change_3,Change_5,Change_10
2021-11-17,0.37086,-0.37086,-0.868448,-5.224022,-0.868448,-4.355573,-0.002429,0.001862,0.009749,0.007359
2021-11-18,0.408059,0.408059,2.212248,-4.815962,1.343799,-7.02821,0.003396,0.004921,0.012851,0.00604
2021-11-19,0.497991,-0.497991,1.092567,-5.313953,2.436366,-6.40652,-0.001788,-0.000831,0.003467,0.000768
2021-11-22,0.705631,-0.705631,-3.064953,-6.019584,-0.628587,-2.954631,-0.002815,-0.001218,0.0003,-0.0029
2021-11-23,0.711606,0.711606,-0.579285,-5.307978,-1.207872,-4.728692,0.001326,-0.003278,-0.002323,0.001733


In [14]:
print(short_d_data.shape)
print(target.shape)

(59, 10)
(59,)


# Load Test Data

In [15]:
i_test_data = pd.read_csv('Data/5m_SPY_test')
i_test_data = data_utils.transform_data(i_test_data, period='intraday')
eod_value = i_test_data.groupby([i_test_data.Datetime.dt.date])['Dir_to_Vol'].sum()

In [16]:
d_test_data = pd.read_csv('Data/1d_SPY_test')
d_test_data['Volume'] = data_utils.min_max_normalize(d_test_data['Volume'], 0, 5)
d_test_data = data_utils.transform_data(d_test_data, period='daily', eod_tide=eod_value, create_targets=True)
test_data = data_utils.transform_tide(d_test_data, eod_tide=eod_value)
test_data.dropna(inplace=True)

In [17]:
test_target = test_data.pop('Target')
test_data.set_index('Date', inplace=True)
test_data = test_data[study_columns]

# Run Naive Bayes

In [18]:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score

In [19]:
nB = GaussianNB()
nB.fit(short_d_data, target)

GaussianNB()

## Results

In [20]:
train_prediction = nB.predict(short_d_data)

In [21]:
training_cm = confusion_matrix(target, train_prediction, labels=['Buy', 'Flat', 'Sell'])
training_accuracy = accuracy_score(target, train_prediction)

In [22]:
test_prediction = nB.predict(test_data)
test_cm = confusion_matrix(test_target, test_prediction, labels=['Buy', 'Flat', 'Sell'])
test_accuracy = accuracy_score(test_target, test_prediction)
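
Passing `labels` to `confusion_matrix` pins the row and column order and keeps the matrix 3x3 even when a class never appears in a given data set. The toy labels below are made up for illustration only. In scikit-learn's output, rows correspond to the actual classes and columns to the predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels in which 'Flat' never occurs, neither actual nor predicted.
y_true = ["Buy", "Sell", "Sell", "Buy"]
y_pred = ["Buy", "Sell", "Buy", "Buy"]

# Without labels, only the classes that are present are used: a 2x2 matrix.
cm_default = confusion_matrix(y_true, y_pred)

# With labels, the shape and ordering are fixed: rows and columns follow
# 'Buy', 'Flat', 'Sell', with an all-zero row and column for 'Flat'.
cm_fixed = confusion_matrix(y_true, y_pred, labels=["Buy", "Flat", "Sell"])

print(cm_default.shape)  # (2, 2)
print(cm_fixed)
```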

### Confusion Matrix

In [30]:
# Use Kenneth's heat matrix for the confusion matrix.
# Can it handle 3 classes?
# scikit-learn's confusion_matrix puts the actual classes on the rows
# and the predicted classes on the columns.
predicted_columns = pd.MultiIndex.from_tuples([('Predicted', 'Buy'),('Predicted', 'Flat'),('Predicted', 'Sell')])
actual_index = pd.MultiIndex.from_tuples([('Actual', 'Buy'),('Actual', 'Flat'),('Actual', 'Sell')])
train_cm_as_df = pd.DataFrame(training_cm, columns=predicted_columns,
                        index=actual_index)
print('Confusion Matrix for Training Data Set:')
print(train_cm_as_df)
print()

test_cm_as_df = pd.DataFrame(test_cm, columns=predicted_columns,
                        index=actual_index)
print('Confusion Matrix for Test Data Set:')
print(test_cm_as_df)

Confusion Matrix for Training Data Set:
            Predicted
                  Buy Flat Sell
Actual Buy         15    3    2
       Flat         1   10    1
       Sell         1    8   18

Confusion Matrix for Test Data Set:
            Predicted
                  Buy Flat Sell
Actual Buy          4    0    2
       Flat         1    0    1
       Sell         1    0   10


In [33]:
print(f'Training Data Accuracy: {round(training_accuracy, 3)}\n')
print(f'Test Data Accuracy: {round(test_accuracy, 3)}')

Training Data Accuracy: 0.729

Test Data Accuracy: 0.737
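
The accuracy figures can be checked by hand from the confusion matrices printed earlier: accuracy is the sum of the diagonal divided by the total count. The sketch below copies the two matrices as literals. The per-class figure assumes scikit-learn's convention that rows are the actual classes, in the order Buy, Flat, Sell.

```python
import numpy as np

# Counts copied from the printed confusion matrices.
train_cm = np.array([[15, 3, 2], [1, 10, 1], [1, 8, 18]])
test_cm = np.array([[4, 0, 2], [1, 0, 1], [1, 0, 10]])

def accuracy_from_cm(cm: np.ndarray) -> float:
    """Fraction of samples on the diagonal (correctly classified)."""
    return float(np.trace(cm) / cm.sum())

def recall_per_class(cm: np.ndarray) -> np.ndarray:
    """Diagonal divided by row totals; assumes rows are the actual classes."""
    return np.diag(cm) / cm.sum(axis=1)

print(round(accuracy_from_cm(train_cm), 3))  # 0.729  (43 of 59)
print(round(accuracy_from_cm(test_cm), 3))   # 0.737  (14 of 19)
print(recall_per_class(test_cm))
```

On the test set the middle class, Flat, has no correct predictions, so overall accuracy alone does not show how each class fares.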
