# Machine Learning Stock Market Analyzer (Experiments)
Actual trading scripts will be implemented later (potentially in a separate program). For now, this program only aims to use shallow learning models to predict future price trends.

Potential Limitations/Bottlenecks: Yahoo API support for small intervals & rate limiting for large datasets

#### Necessary Packages


In [3]:
# -- Data Collection/Cleaning/Preparation -- 
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from typing import List
from sklearn.preprocessing import StandardScaler

''' Shallow Learning Models '''

# Testing Accuracy
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score #Regression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix #Classification

# Simple Linear Regression (Predicting Values)
from sklearn.linear_model import LinearRegression

# SGD Regression (Predicting Values)
from sklearn.linear_model import SGDRegressor

# Kernel Ridge Regression (Predicting Values, assuming <100k samples)
from sklearn.kernel_ridge import KernelRidge

# SVR Regression (Predicting Values, kernel dependent on sample size)
from sklearn.svm import SVR

# SGD Classifier (Classifying movement, assuming >100k samples)
from sklearn.linear_model import SGDClassifier #(Be careful with feature scaling)

# Kernel Approximation (Classifying movement, assuming >100k samples)
from sklearn.kernel_approximation import RBFSampler
from sklearn.kernel_approximation import PolynomialCountSketch #*

# SVC (Classifying movement, assuming <100k samples)
from sklearn.svm import SVC

# KNeighbors Classification (Classifying Movement, assuming <100k samples)
from sklearn.neighbors import KNeighborsClassifier #Could also add regressor

#Decision Trees (Predicting Price & Classifying Movement, Good Visual)
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor

''' Deep Learning Models - Will be held in a separate Jupyter Notebook for organizational purposes''' 
#Will utilize TensorFlow; I need to learn more about it first
#Recurrent Neural Network & Convolutional Neural Network Hybrid
#Temporal Convolutional Networks
#Long Short-Term Memory Networks (Little Data)
#Gated Recurrent Unit (GRU) Networks (More Data)


#### Basic Data Collection

In [13]:
# Load the dataset
historical_data = pd.read_csv('ES5m30d.csv')
#TESTING FIRST FOR 15 MINUTE TRADES
historical_data = historical_data.drop(columns='30mSD')

x = historical_data.iloc[:, :-1].values  # All features
target = historical_data.iloc[:, -1].values  # Getting the last value

Converting to a NumPy array for efficiency. Although most models do this internally anyway, converting explicitly can reduce overhead.

In [5]:
''' ! We could totally convert to a numpy array later, but for now it's fine as a pandas dataframe; 
tensorflow does it internally anyway !
 
x_df = historical_data[['open', 'prev_high', 'prev_low', 'prev_volume', 'prev_open', 'prev_close']]
y_df = historical_data[['Close']] #Target Value to be Specified

x = x_df.to_numpy()
target = y_df.to_numpy().ravel() #Ensuring it is a 1D array '''

#### Getting Open Price to Predict
Here, we will gather the most recent data to run predictions on. This still needs to be implemented.

In [6]:
# Get Info Here (We have most_recent to work with previous values)

## Shallow Models - Regression

#### Benchmarking Functions

In [14]:
#Splitting data into consistent training and testing sets
scaler = StandardScaler()
x = scaler.fit_transform(x)
rand_state: int = 42 #Seeding the random state, getting equal splits
x_train, x_test, y_train, y_test = train_test_split(x, target, test_size=0.2, random_state=rand_state)
def test_model(y_pred):
    # start_time = time.time()
    # end_time = time.time()
    # elapsed_time = end_time - start_time
    # print(f"Duration of Execution: {elapsed_time} seconds")
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    print(f'Mean Squared Error: {mse}')
    print(f'R-Squared: {r2}')
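The cell above scales the full dataset before splitting and uses a shuffled split, so statistics from the test rows reach the scaler and future bars can end up in the training set. A minimal sketch of one alternative, assuming the rows are in chronological order: scikit-learn's `TimeSeriesSplit` combined with a pipeline that fits the scaler on each training fold only. The synthetic `x_demo`/`y_demo` arrays and the `evaluate_chronologically` helper are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target (assumption: the real
# values would come from the CSV loaded earlier, in chronological order).
rng = np.random.default_rng(42)
x_demo = rng.normal(size=(500, 6))
y_demo = x_demo @ rng.normal(size=6) + rng.normal(scale=0.5, size=500)

def evaluate_chronologically(model, x, y, n_splits=5):
    """Return per-fold (mse, r2), always training on earlier rows than it tests on."""
    scores = []
    for train_idx, test_idx in TimeSeriesSplit(n_splits=n_splits).split(x):
        # The pipeline fits the scaler on the training fold only.
        pipeline = make_pipeline(StandardScaler(), model)
        pipeline.fit(x[train_idx], y[train_idx])
        y_pred = pipeline.predict(x[test_idx])
        scores.append((mean_squared_error(y[test_idx], y_pred),
                       r2_score(y[test_idx], y_pred)))
    return scores

fold_scores = evaluate_chronologically(LinearRegression(), x_demo, y_demo)
for fold, (mse, r2) in enumerate(fold_scores, start=1):
    print(f"Fold {fold}: MSE={mse:.4f}, R2={r2:.4f}")
```

Scores from this scheme are not directly comparable to the shuffled-split numbers in this notebook, since each fold is tested on later data than it was trained on.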

#### Simple Linear Regression Model

In [70]:
#Fitting/Training the model
lg_model = LinearRegression()
lg_model.fit(x_train, y_train)
#Getting & Evaluating Results
y_pred = lg_model.predict(x_test)
test_model(y_pred)

Mean Squared Error: 1.2678294450864929
R-Squared: 0.002285095591512021


#### SGD Regression Model

In [71]:
#Creating & Fitting
sgd_model = SGDRegressor()
sgd_model.fit(x_train, y_train)
#Evaluating the Model
y_pred = sgd_model.predict(x_test)
test_model(y_pred)

Mean Squared Error: 1.2671357956226852
R-Squared: 0.0028309611345138652


#### Kernel Ridge Model
Will very likely not be the best solution; however, it could serve as an interesting benchmark.

In [73]:
kr_model = KernelRidge()
kr_model.fit(x_train, y_train)
#Evaluating
y_pred = kr_model.predict(x_test)
test_model(y_pred)

Mean Squared Error: 1.2700943380430596
R-Squared: 0.0005027442914896652


#### SVR Regression Model
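Due to its computational cost (fit time grows more than quadratically with the number of samples, roughly between O(n^2) and O(n^3)), SVR may not be practical for exceptionally large datasets (>50k samples).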

In [20]:
svr_model = SVR() #All default values, rbf kernel type
svr_model.fit(x_train, y_train)
#Evaluation
y_pred = svr_model.predict(x_test)
test_model(y_pred) 

Mean Squared Error: 1.2786741679810647
R-Squared: -0.006249129345467619
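If the dataset grows past the point where SVR is practical, one option is the kernel approximation route already imported at the top of this notebook. The following is a rough sketch, not a tuned model: `RBFSampler` maps the features into an approximate RBF feature space, and `SGDRegressor` with the epsilon-insensitive loss then fits a linear SVR-style model in that space. The synthetic data, `gamma`, and `n_components` values are placeholder assumptions.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): six features, nonlinear target.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(2000, 6))
y_demo = np.sin(x_demo[:, 0]) + 0.1 * rng.normal(size=2000)

approx_svr = make_pipeline(
    StandardScaler(),
    # Random Fourier features approximating an RBF kernel.
    RBFSampler(gamma=0.1, n_components=200, random_state=42),
    # The epsilon-insensitive loss is the loss used by SVR.
    SGDRegressor(loss="epsilon_insensitive", epsilon=0.1,
                 max_iter=1000, random_state=42),
)

# Train on the first 1600 rows, predict the remaining 400.
approx_svr.fit(x_demo[:1600], y_demo[:1600])
y_pred_demo = approx_svr.predict(x_demo[1600:])
print(y_pred_demo[:5])
```

Training cost here grows roughly linearly with the number of samples, at the price of an approximate kernel.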


## Shallow Models - Classification
# NEEDS TO BE REFORMATTED TO WORK WITH THE NEW DATASET. The way I did it doesn't actually make any sense, let alone work.

#### Refining Data for Classification
To make classification possible, a column holding the class label must be added. For this project, I am thinking of five scenarios:
- Neutral (less than 0.5% change)
- Weak Bullish (0.5-4% increase)
- Weak Bearish (0.5-4% decrease)
- Strong Bullish (4%+ increase)
- Strong Bearish (4%+ decrease)

Therefore, we need to restructure our data to include this. The specific classification conditions will need to be refined based on the interval. For example, a 3% movement over a day could be considered strong, while a 3% movement over a month could be weak. Additionally, these will likely need to be redefined using measures such as average volume and average volatility.
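One way the interval-dependent thresholds could be sketched is a labeling function that takes the weak and strong cutoffs as parameters, with a lookup table per interval. The `label_moves` helper and the `thresholds_by_interval` values are hypothetical placeholders, not tuned numbers; only the daily cutoffs match the five scenarios listed above.

```python
import numpy as np
import pandas as pd

def label_moves(open_price: pd.Series, close_price: pd.Series,
                weak: float = 0.005, strong: float = 0.04) -> pd.Series:
    """Label each bar by its open-to-close return.

    weak and strong are fractional thresholds (0.005 means 0.5%).
    """
    returns = close_price / open_price - 1.0
    # Conditions are checked in order, so strong moves win over weak ones.
    conditions = [returns >= strong, returns <= -strong,
                  returns >= weak, returns <= -weak]
    labels = ["s_bull", "s_bear", "w_bull", "w_bear"]
    return pd.Series(np.select(conditions, labels, default="n"),
                     index=returns.index)

# Hypothetical per-interval thresholds; the monthly numbers are placeholders.
thresholds_by_interval = {
    "1d": {"weak": 0.005, "strong": 0.04},
    "1mo": {"weak": 0.02, "strong": 0.10},
}

bars = pd.DataFrame({"open": [100.0] * 5,
                     "Close": [100.1, 101.0, 99.0, 105.0, 95.0]})
daily_labels = label_moves(bars["open"], bars["Close"],
                           **thresholds_by_interval["1d"])
monthly_labels = label_moves(bars["open"], bars["Close"],
                             **thresholds_by_interval["1mo"])
print(pd.DataFrame({"daily": daily_labels, "monthly": monthly_labels}))
```

The same 5% move is labeled strong under the daily thresholds and only weak under the monthly ones, which is the behavior the paragraph above describes.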

--- Dev Note ---

Some sort of unsupervised model could be used to group stuff together, which could then potentially be classified? Maybe. 
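As a rough illustration of that idea, and only under the assumption that clustering on the raw return is a sensible starting point, `KMeans` could group returns into three clusters that are then named by the order of their centers. The synthetic returns and the cluster names are made up for the sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic returns (assumption): 100 bearish, 300 flat, 100 bullish bars.
rng = np.random.default_rng(1)
returns = np.concatenate([
    rng.normal(-0.03, 0.005, 100),
    rng.normal(0.0, 0.002, 300),
    rng.normal(0.03, 0.005, 100),
])

# KMeans expects a 2D array, so the returns become a single-column matrix.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
kmeans.fit(returns.reshape(-1, 1))

# Name clusters by sorting their centers from lowest to highest return.
order = np.argsort(kmeans.cluster_centers_.ravel())
names = {int(order[0]): "bearish", int(order[1]): "neutral", int(order[2]): "bullish"}
cluster_labels = np.array([names[int(c)] for c in kmeans.labels_])

print(dict(zip(*np.unique(cluster_labels, return_counts=True))))
```

The resulting labels could then serve as the target column for the classifiers below, replacing the hardcoded thresholds.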

In [21]:
'''
Determining classification
'''
#This will need to be redone based on the ideas outlined above. For now, however, it can just be hardcoded.
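s_bull_indicator: float = 1.04 #Close/open ratio needed to be considered strong bullish
s_bear_indicator: float = .96 #Ratio at or below which the move is strong bearish
w_bull_indicator: float = 1.005 #Ratio needed to be considered weak bullish
w_bear_indicator: float = .995 #Ratio at or below which the move is weak bearish
#Any ratio strictly between the two weak indicators will be considered neutral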

In [13]:
classified_data = historical_data.copy()
def assign_classifier(x) -> str:
    percent_change: float = x.Close / x.open #Close/open ratio (1.0 means no change)
    if percent_change >= s_bull_indicator:
        return 's_bull'
    elif percent_change <= s_bear_indicator:
        return 's_bear'
    elif percent_change >= w_bull_indicator:
        return 'w_bull'
    elif percent_change <= w_bear_indicator:
        return 'w_bear'
    else:
        return 'n' #Neutral 
    
classified_data['classifier'] = classified_data.apply(assign_classifier, axis = 1)
classified_data.head(30)
#classified_data.describe()

| Unnamed: 0 | open | prev_high | prev_low | Close | prev_volume | prev_open | prev_close | classifier |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.094211 | 0.099373 | 0.098943 | 0.093781 | 469033600.0 | 0.098943 | 0.098943 | n |
| 2 | 0.087328 | 0.094211 | 0.093781 | 0.086898 | 175884800.0 | 0.094211 | 0.093781 | n |
| 3 | 0.089049 | 0.087328 | 0.086898 | 0.089049 | 105728000.0 | 0.087328 | 0.086898 | n |
| 4 | 0.09163 | 0.089479 | 0.089049 | 0.09163 | 86441600.0 | 0.089049 | 0.089049 | n |
| 5 | 0.097223 | 0.092061 | 0.09163 | 0.097223 | 73449600.0 | 0.09163 | 0.09163 | n |
| 6 | 0.101954 | 0.097653 | 0.097223 | 0.101954 | 48630400.0 | 0.097223 | 0.097223 | n |
| 7 | 0.106257 | 0.102385 | 0.101954 | 0.106257 | 37363200.0 | 0.101954 | 0.101954 | n |
| 8 | 0.111849 | 0.106687 | 0.106257 | 0.111849 | 46950400.0 | 0.106257 | 0.106257 | n |
| 9 | 0.122173 | 0.112279 | 0.111849 | 0.122173 | 48003200.0 | 0.111849 | 0.111849 | n |
| 10 | 0.123894 | 0.122604 | 0.122173 | 0.123894 | 55574400.0 | 0.122173 | 0.122173 | n |


In [14]:
#Splitting into features and target
x_df = classified_data[['open', 'prev_high', 'prev_low', 'prev_volume', 'prev_open', 'prev_close']]
y_df = classified_data[['classifier']] #Target Value to be Specified

x = x_df.to_numpy()
target = y_df.to_numpy().ravel() #Ensuring it is a 1D array

scaler = StandardScaler()
x = scaler.fit_transform(x)
rand_state: int = 42 #Seeding the random state, getting equal splits
x_train, x_test, y_train, y_test = train_test_split(x, target, test_size=0.2, random_state=rand_state)

def test_accuracy(y_pred):
    accuracy = accuracy_score(y_test, y_pred)
        
    print(f'Accuracy: {accuracy:.2f}')
    print('Classification Report:')
    print(classification_report(y_test, y_pred))
    print('Confusion Matrix:')
    print(confusion_matrix(y_test, y_pred))

#### SVC (SVM) Classification Model
Notes: Testing shows very low accuracy (0.37). The model never predicts either strong class and assigns most samples to w_bear.

In [15]:
svc_model = SVC(kernel='rbf', C=1.0, gamma='scale') #Default arguments
svc_model.fit(x_train, y_train)
#Predict & Test
y_pred = svc_model.predict(x_test)
test_accuracy(y_pred)

Accuracy: 0.37
Classification Report:
              precision    recall  f1-score   support

           n       0.44      0.25      0.32       584
      s_bear       0.00      0.00      0.00        75
      s_bull       0.00      0.00      0.00        80
      w_bear       0.35      0.79      0.49       734
      w_bull       0.38      0.11      0.17       733

    accuracy                           0.37      2206
   macro avg       0.23      0.23      0.19      2206
weighted avg       0.36      0.37      0.30      2206

Confusion Matrix:
[[144   0   0 377  63]
 [  1   0   0  72   2]
 [  0   0   0  75   5]
 [ 88   0   0 581  65]
 [ 96   0   0 556  81]]
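A class that is never predicted has undefined precision, which scikit-learn reports as 0 along with a warning. The sketch below shows two knobs that may help with this, on synthetic imbalanced data rather than the notebook's dataset: `class_weight='balanced'` makes the SVC penalize mistakes on rare classes more heavily, and `zero_division=0` in `classification_report` sets the undefined scores explicitly instead of warning. The class proportions are arbitrary assumptions, and neither setting is guaranteed to improve accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic imbalanced 3-class problem (assumption: stands in for the
# neutral / weak / strong labels, where strong moves are rare).
x_demo, y_demo = make_classification(
    n_samples=1000, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, weights=[0.6, 0.35, 0.05], random_state=42,
)
# stratify keeps the class proportions equal in both splits.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42,
)

# class_weight='balanced' reweights errors inversely to class frequency.
balanced_svc = SVC(kernel="rbf", class_weight="balanced")
balanced_svc.fit(x_tr, y_tr)

# zero_division=0 reports 0 for undefined precision without a warning.
report = classification_report(y_te, balanced_svc.predict(x_te), zero_division=0)
print(report)
```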




#### K-Neighbors Classification Model

In [16]:
k_neighbors_c_model = KNeighborsClassifier()
k_neighbors_c_model.fit(x_train, y_train)
#Predicting and testing
y_pred = k_neighbors_c_model.predict(x_test)
test_accuracy(y_pred)

Accuracy: 0.35
Classification Report:
              precision    recall  f1-score   support

           n       0.35      0.46      0.39       584
      s_bear       0.10      0.05      0.07        75
      s_bull       0.04      0.03      0.03        80
      w_bear       0.36      0.42      0.39       734
      w_bull       0.38      0.25      0.30       733

    accuracy                           0.35      2206
   macro avg       0.24      0.24      0.24      2206
weighted avg       0.34      0.35      0.34      2206

Confusion Matrix:
[[266   8   8 194 108]
 [ 16   4   5  35  15]
 [ 26   3   2  31  18]
 [235  13  12 307 167]
 [223  12  24 289 185]]


#### Decision Tree Classifier & Regressor
Can be used as a good aid for visualization

In [17]:
#Regressor (fails here because y_train now holds string class labels; must be moved before the classification relabeling)
dc_r_model = DecisionTreeRegressor()
dc_r_model.fit(x_train, y_train)
y_pred = dc_r_model.predict(x_test)
test_model(y_pred)

ValueError: could not convert string to float: 'w_bull'

In [20]:
#Classifier
dc_c_model = DecisionTreeClassifier()
dc_c_model.fit(x_train, y_train)
y_pred = dc_c_model.predict(x_test)
test_accuracy(y_pred)
#dc_c_model.decision_path

Accuracy: 0.34
Classification Report:
              precision    recall  f1-score   support

           n       0.35      0.36      0.36       565
      s_bear       0.07      0.07      0.07        85
      s_bull       0.02      0.02      0.02        86
      w_bear       0.33      0.37      0.35       698
      w_bull       0.41      0.35      0.38       762

    accuracy                           0.34      2196
   macro avg       0.24      0.24      0.24      2196
weighted avg       0.34      0.34      0.34      2196

Confusion Matrix:
[[206  14  13 199 133]
 [ 17   6  11  28  23]
 [ 17   6   2  34  27]
 [173  27  35 259 204]
 [171  33  30 258 270]]
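Since decision trees are noted above as a visualization aid, here is a minimal sketch of one way to inspect a fitted tree as text with scikit-learn's `export_text`. The toy data and the `max_depth=2` cap are assumptions made to keep the printout short; the feature names reuse the column names from this notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

feature_names = ['open', 'prev_high', 'prev_low',
                 'prev_volume', 'prev_open', 'prev_close']

# Toy data (assumption): the label depends only on the first feature.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(300, 6))
y_demo = np.where(x_demo[:, 0] > 0, "w_bull", "w_bear")

# A shallow tree keeps the printed rules readable.
small_tree = DecisionTreeClassifier(max_depth=2, random_state=42)
small_tree.fit(x_demo, y_demo)

# export_text renders the learned split rules as an indented outline.
tree_text = export_text(small_tree, feature_names=feature_names)
print(tree_text)
```

For a graphical version, `sklearn.tree.plot_tree` draws the same structure with matplotlib.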

