Author: Philip Borozenets

This Python notebook runs a binary classification neural network analysis on our 'Master CSV' file. A total of 12 model configurations were run for each stock ticker: 6 using the current-day stock price indicator (a 0/1 column per ticker) and 6 using the next-day indicator. Because the mean, median, and min/max of each day's news sentiment scores were highly correlated with one another, we ran each of them as a separate set of inputs, once for headlines and once for article bodies. The evaluation metric is the accuracy score.

In [1]:
# Removes any previously cloned copy of the repo so the next cell pulls the latest CSV files from GitHub
%rm -rf stonks/

In [2]:
%%capture
## Retrieve csv source data from the 'stonks' GitHub repo
## The repo is cloned into the stonks/ folder of the working directory
!git clone https://github.com/IS737StockPicker/stonks.git

In [3]:
# import required packages
%matplotlib inline
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPClassifier #this is the neural network part
import matplotlib.pylab as plt
from sklearn.metrics import accuracy_score

In [4]:
# Special package from the class book to import regression summary statistics
!pip install -U dmba;
from dmba import regressionSummary

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting dmba
  Downloading dmba-0.2.3-py3-none-any.whl (11.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11.8/11.8 MB 18.3 MB/s eta 0:00:00
Installing collected packages: dmba
Successfully installed dmba-0.2.3
no display found. Using non-interactive Agg backend


In [5]:
# load the data
stocks_df = pd.read_csv('/content/stonks/master_data.csv')
stocks_df

index,Date,XLE,XLF,XLU,XLI,XLK,XLV,XLY,IYR,AAPL,...,Huff_headline_min,Huff_body_min,NYT_headline_mean,NYT_headline_median,NYT_body_mean,NYT_body_median,NYT_headline_max,NYT_headline_min,NYT_body_max,NYT_body_min
0,2022-03-14,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.80,-0.67,-0.05,0.0,-0.03,0.00,0.88,-0.73,0.91,-0.96
1,2022-03-15,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,...,-0.76,-0.85,-0.09,0.0,0.02,0.00,0.84,-0.86,0.88,-0.91
2,2022-03-16,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,...,-0.25,-0.56,-0.06,0.0,-0.00,0.00,0.86,-0.80,0.88,-0.90
3,2022-03-17,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,-0.90,-0.49,-0.02,0.0,-0.05,0.00,0.61,-0.84,0.82,-0.89
4,2022-03-18,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,...,-0.60,-0.76,-0.06,0.0,-0.05,-0.08,0.83,-0.82,0.86,-0.97
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
120,2022-09-07,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,-0.78,-0.57,0.09,0.0,-0.07,0.00,0.65,-0.86,0.84,-0.93
121,2022-09-08,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,...,-0.77,-0.83,-0.04,0.0,-0.02,0.00,0.75,-0.90,0.92,-0.93
122,2022-09-09,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,...,-0.60,0.00,-0.01,0.0,0.14,0.14,0.81,-0.80,0.85,-0.91
123,2022-09-12,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,-0.79,0.00,-0.10,0.0,-0.06,-0.13,0.75,-0.90,0.89,-0.89


Specifying which columns in the dataframe are our target variables

In [6]:
tickers = stocks_df.iloc[:,:15]
tickers = tickers.drop(columns=['Date'])
tickers

index,XLE,XLF,XLU,XLI,XLK,XLV,XLY,IYR,AAPL,XLB,XLP,SPY,^DJI,NDX
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0
2,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0
3,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
4,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
120,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
121,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0
122,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
123,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0


Baseline accuracy scores for each stock ticker: the share of days labeled 1, which is the accuracy of always predicting 1

In [7]:
baseline_acc = {}
for ticker in tickers:
  baseline_acc[ticker] = round(sum(stocks_df[ticker])/125,2)
  print(sum(stocks_df[ticker])/125)

0.544
0.536
0.52
0.592
0.544
0.568
0.504
0.504
0.568
0.48
0.544
0.568
0.536
0.52
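The cell above divides by a hard-coded 125. As a minimal sketch, assuming the intent is "share of 1s over all rows", `Series.mean()` gives the same quantity without fixing the row count in the code. The sketch also shows a majority-class variant, which matters for a ticker such as XLB where the share of 1s (0.48) is below one half. The toy frame and the names `share_up` and `majority_baseline` are illustrative and not part of the notebook.

```python
import pandas as pd

# Hypothetical toy labels (1 or 0 per day); not the project's data.
toy = pd.DataFrame({
    "XLE": [0, 1, 0, 1, 1],
    "XLB": [0, 1, 0, 0, 1],
})

# Share of 1s per ticker: the accuracy of always predicting 1.
share_up = toy.mean()

# Majority-class baseline: accuracy of always predicting the more common label.
majority_baseline = share_up.where(share_up >= 0.5, 1 - share_up)

print(share_up.round(2).to_dict())
print(majority_baseline.round(2).to_dict())
```

Note that either baseline here is computed over all rows, while the model accuracies later in the notebook are computed on the held-out validation rows only.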


Neural net classification run for each of the target variables using 6 sets of input data (mean, median, and min/max of the headline sentiment scores and of the body sentiment scores)

In [18]:
present_acc_scores={}
baseline_present = {}
# This loop runs a neural net analysis for each selected stock ticker using the mean, median, or min/max news sentiment scores from our three news sources
for ticker in tickers:
  # Separates the input variables from the master csv file
  mean_headline_var = [ticker,'guardian_headline_mean','Huff_headline_mean','NYT_headline_mean']
  median_body_var = [ticker,'guardian_body_median','Huff_body_median','NYT_body_median']
  mean_body_var = [ticker,'guardian_body_mean','Huff_body_mean','NYT_body_mean']
  median_headline_var = [ticker,'guardian_headline_median','Huff_headline_median','NYT_headline_median']
  min_max_headline_var = [ticker,'guardian_headline_min','guardian_headline_max','Huff_headline_min','Huff_headline_max','NYT_headline_min','NYT_headline_max']
  min_max_body_var = [ticker,'guardian_body_min','guardian_body_max','Huff_body_min','Huff_body_max','NYT_body_min','NYT_body_max']
  model_inputs = [mean_headline_var,mean_body_var,median_headline_var,median_body_var,min_max_headline_var,min_max_body_var]
  input_arr = []
  #run neural net analysis for each of the model inputs
  for inputs in range(len(model_inputs)):
    accuracy_df = stocks_df[model_inputs[inputs]]
    y_nonscaled = accuracy_df[[ticker]]
    X_nonscaled = accuracy_df.drop(columns=[ticker])
    scaleOutput = MinMaxScaler()
    scaleInput = MinMaxScaler()
    y = scaleOutput.fit_transform(y_nonscaled)
    X = scaleInput.fit_transform(X_nonscaled)
    #28% of dataset used for testing which equates to 35 days of stock and news sentiment scores
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.28, random_state=5)
    stock_nnet = MLPClassifier(
      hidden_layer_sizes=(2,),  # one hidden layer with 2 nodes
      activation='logistic',
      solver='lbfgs',
      random_state=1)
    stock_nnet.fit(X_train, y_train.ravel())
    y_pred = stock_nnet.predict(X_valid)
    input_arr.append(round(accuracy_score(y_valid, y_pred),2))
    
  #adds the accuracy scores for each of the models to accuracy scores dictionary
  present_acc_scores[ticker] = input_arr
  #adds baseline accuracy scores to the baseline scores dictionary
  baseline_present[ticker] = round(sum(accuracy_df[ticker])/125,2)
  


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)

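The lbfgs warning in the output suggests raising `max_iter`. A minimal sketch of one way to do that is shown here, with the scaler placed in a scikit-learn pipeline so that it is fit on the training rows only. This is an alternative arrangement, not the notebook's method: the data is synthetic, `max_iter=2000` is an arbitrary assumption, and the labels are left unscaled because they are already 0/1.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data: 125 rows, 3 sentiment-like features in [-1, 1].
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(125, 3))
y = rng.integers(0, 2, size=125)

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.28, random_state=5)

# The pipeline fits MinMaxScaler on X_train only, then trains the network.
model = make_pipeline(
    MinMaxScaler(),
    MLPClassifier(hidden_layer_sizes=(2,), activation='logistic',
                  solver='lbfgs', max_iter=2000, random_state=1),
)
model.fit(X_train, y_train)

acc = accuracy_score(y_valid, model.predict(X_valid))
print(len(y_valid), round(acc, 2))
```

Whether a higher iteration limit changes the accuracy scores reported in this notebook has not been checked here.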
Creating a dataframe from the neural network analysis 

In [19]:
present_outcomes_df = pd.DataFrame.from_dict(present_acc_scores).T

baseline_present_df = pd.DataFrame.from_dict([baseline_present]).T
present_outcomes_df = pd.concat([baseline_present_df, present_outcomes_df], axis=1)
present_outcomes_df.columns = ['baseline_acc','headline_mean_acc', 'body_mean_acc', 'headline_median_acc', 'body_median_acc', 'headline_min_max', 'body_min_max']
present_outcomes_df

ticker,baseline_acc,headline_mean_acc,body_mean_acc,headline_median_acc,body_median_acc,headline_min_max,body_min_max
XLE,0.54,0.51,0.4,0.46,0.49,0.57,0.29
XLF,0.54,0.4,0.54,0.63,0.4,0.34,0.43
XLU,0.52,0.46,0.4,0.49,0.49,0.54,0.54
XLI,0.59,0.63,0.6,0.6,0.6,0.54,0.63
XLK,0.54,0.63,0.54,0.51,0.6,0.31,0.54
XLV,0.57,0.57,0.49,0.6,0.6,0.43,0.57
XLY,0.5,0.46,0.54,0.49,0.63,0.49,0.54
IYR,0.5,0.54,0.37,0.69,0.6,0.34,0.51
AAPL,0.57,0.6,0.57,0.51,0.57,0.49,0.49
XLB,0.48,0.51,0.4,0.49,0.23,0.34,0.43




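Check which model did best by summing each model's accuracy scores across all tickers and subtracting the summed baseline score. The best model used the median headline news sentiment: it was the only one whose summed accuracy exceeded the summed baseline (by 0.12 across the 14 tickers).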



In [20]:
for outcomes in present_outcomes_df:
  print(sum(present_outcomes_df[outcomes])-sum(present_outcomes_df['baseline_acc']))

0.0
-0.15000000000000036
-0.6499999999999995
0.120000000000001
-0.17999999999999883
-1.1900000000000004
-0.41999999999999993
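The same comparison can be expressed per ticker rather than as a difference of sums. The sketch below is illustrative only: it copies the first three rows and two of the model columns from the results table above, subtracts the baseline row by row with `DataFrame.sub`, and averages the result. The names `lift` and `mean_lift` are assumptions, not notebook variables.

```python
import pandas as pd

# First three rows of the results table above, copied for illustration.
results = pd.DataFrame(
    {
        "baseline_acc": [0.54, 0.54, 0.52],
        "headline_mean_acc": [0.51, 0.40, 0.46],
        "headline_median_acc": [0.46, 0.63, 0.49],
    },
    index=["XLE", "XLF", "XLU"],
)

# Subtract the baseline from every model column, row by row.
lift = results.drop(columns="baseline_acc").sub(results["baseline_acc"], axis=0)

# Average lift per model; a positive value means the model beat the baseline on average.
mean_lift = lift.mean().round(3)
print(mean_lift.to_dict())
```

Averaging instead of summing keeps the result on the same 0 to 1 scale as the accuracy scores, which makes it easier to read as "points of accuracy per ticker".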


Importing stock data shifted up by one day, in an attempt to predict the next day's stock price indicator from the previous day's input data

In [12]:
future_df = pd.read_csv('/content/stonks/master_data_future.csv')
future_df

index,Date,XLE,XLF,XLU,XLI,XLK,XLV,XLY,IYR,AAPL,...,Huff_headline_min,Huff_body_min,NYT_headline_mean,NYT_headline_median,NYT_body_mean,NYT_body_median,NYT_headline_max,NYT_headline_min,NYT_body_max,NYT_body_min
0,2022-03-14,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,...,-0.80,-0.67,-0.05,0.0,-0.03,0.00,0.88,-0.73,0.91,-0.96
1,2022-03-15,0.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,...,-0.76,-0.85,-0.09,0.0,0.02,0.00,0.84,-0.86,0.88,-0.91
2,2022-03-16,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,-0.25,-0.56,-0.06,0.0,-0.00,0.00,0.86,-0.80,0.88,-0.90
3,2022-03-17,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,...,-0.90,-0.49,-0.02,0.0,-0.05,0.00,0.61,-0.84,0.82,-0.89
4,2022-03-18,1.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,...,-0.60,-0.76,-0.06,0.0,-0.05,-0.08,0.83,-0.82,0.86,-0.97
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
120,2022-09-07,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,...,-0.78,-0.57,0.09,0.0,-0.07,0.00,0.65,-0.86,0.84,-0.93
121,2022-09-08,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,...,-0.77,-0.83,-0.04,0.0,-0.02,0.00,0.75,-0.90,0.92,-0.93
122,2022-09-09,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,-0.60,0.00,-0.01,0.0,0.14,0.14,0.81,-0.80,0.85,-0.91
123,2022-09-12,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-0.79,0.00,-0.10,0.0,-0.06,-0.13,0.75,-0.90,0.89,-0.89
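How `master_data_future.csv` was built is not shown in this notebook. As a rough sketch of the "shifted up by one day" idea, assuming a plain pandas `shift(-1)` on the label column, the example below uses the first four XLE and NYT_headline_mean values from the current-day table. In this sketch the final row, which has no next-day label, is dropped; the file above instead keeps that row with all ticker values at 0.0.

```python
import pandas as pd

# First four rows of two columns from the current-day table, copied for illustration.
toy = pd.DataFrame({
    "Date": ["2022-03-14", "2022-03-15", "2022-03-16", "2022-03-17"],
    "XLE": [0.0, 1.0, 0.0, 1.0],
    "NYT_headline_mean": [-0.05, -0.09, -0.06, -0.02],
})

future = toy.copy()
# Move each label up one row so day t's features pair with day t+1's label.
future["XLE"] = toy["XLE"].shift(-1)
# The final row has no next-day label; drop it rather than fill it.
future = future.dropna(subset=["XLE"]).reset_index(drop=True)
print(future)
```

The shifted XLE values for the first three dates match the first three XLE values in the next-day table above.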


Baseline accuracy scores for the next-day stock ticker dataset

In [13]:
future_baseline_acc = {}
for ticker in tickers:
  future_baseline_acc[ticker] = round(sum(future_df[ticker])/125,2)
print(future_baseline_acc)

{'XLE': 0.56, 'XLF': 0.54, 'XLU': 0.54, 'XLI': 0.59, 'XLK': 0.54, 'XLV': 0.57, 'XLY': 0.51, 'IYR': 0.5, 'AAPL': 0.58, 'XLB': 0.48, 'XLP': 0.54, 'SPY': 0.58, '^DJI': 0.54, 'NDX': 0.53}


Neural net classification run for each of the target ticker variables using the next-day stock indicator and 6 sets of input data (mean, median, and min/max of the headline sentiment scores and of the body sentiment scores)

In [21]:
future_acc_scores={}
baseline_future = {}
for ticker in tickers:
  # Separates the input variables from the master csv file
  mean_headline_var = [ticker,'guardian_headline_mean','Huff_headline_mean','NYT_headline_mean']
  median_body_var = [ticker,'guardian_body_median','Huff_body_median','NYT_body_median']
  mean_body_var = [ticker,'guardian_body_mean','Huff_body_mean','NYT_body_mean']
  median_headline_var = [ticker,'guardian_headline_median','Huff_headline_median','NYT_headline_median']
  min_max_headline_var = [ticker,'guardian_headline_min','guardian_headline_max','Huff_headline_min','Huff_headline_max','NYT_headline_min','NYT_headline_max']
  min_max_body_var = [ticker,'guardian_body_min','guardian_body_max','Huff_body_min','Huff_body_max','NYT_body_min','NYT_body_max']
  model_inputs = [mean_headline_var,mean_body_var,median_headline_var,median_body_var,min_max_headline_var,min_max_body_var]
  input_arr = []
  #run neural net analysis for each of the model inputs
  for inputs in range(len(model_inputs)):
    accuracy_df = future_df[model_inputs[inputs]]
    y_nonscaled = accuracy_df[[ticker]]
    X_nonscaled = accuracy_df.drop(columns=[ticker])
    scaleOutput = MinMaxScaler()
    scaleInput = MinMaxScaler()
    y = scaleOutput.fit_transform(y_nonscaled)
    X = scaleInput.fit_transform(X_nonscaled)
    X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.28, random_state=5)
    stock_nnet = MLPClassifier(
      hidden_layer_sizes=(2,),  # one hidden layer with 2 nodes
      activation='logistic',
      solver='lbfgs',
      random_state=1)
    stock_nnet.fit(X_train, y_train.ravel())
    y_pred = stock_nnet.predict(X_valid)
    input_arr.append(round(accuracy_score(y_valid, y_pred),2))
  #adds the accuracy scores for each of the models to accuracy scores dictionary  
  future_acc_scores[ticker] = input_arr
  #adds the baseline accuracy score for each stock ticker to baseline scores dictionary
  baseline_future[ticker] = round(sum(accuracy_df[ticker])/125,2)
  
  


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)

Converting the results from the previous code block into a dataframe

In [22]:
future_outcomes_df = pd.DataFrame.from_dict(future_acc_scores).T
future_baseline_df = pd.DataFrame.from_dict([baseline_future]).T
future_outcomes_df = pd.concat([future_baseline_df, future_outcomes_df], axis=1)
future_outcomes_df.columns = ['future_baseline_acc', 'headline_mean_acc', 'body_mean_acc', 'headline_median_acc', 'body_median_acc', 'headline_min_max', 'body_min_max']
future_outcomes_df


ticker,future_baseline_acc,headline_mean_acc,body_mean_acc,headline_median_acc,body_median_acc,headline_min_max,body_min_max
XLE,0.56,0.6,0.6,0.63,0.6,0.37,0.54
XLF,0.54,0.6,0.57,0.51,0.51,0.57,0.37
XLU,0.54,0.51,0.63,0.46,0.69,0.69,0.51
XLI,0.59,0.6,0.54,0.54,0.57,0.51,0.54
XLK,0.54,0.51,0.57,0.54,0.6,0.4,0.4
XLV,0.57,0.54,0.63,0.69,0.66,0.46,0.54
XLY,0.51,0.51,0.43,0.57,0.49,0.37,0.43
IYR,0.5,0.51,0.49,0.51,0.43,0.51,0.54
AAPL,0.58,0.66,0.63,0.57,0.51,0.6,0.57
XLB,0.48,0.54,0.54,0.43,0.54,0.34,0.37


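Check which model did best by summing each model's accuracy scores across all tickers and subtracting the summed baseline score. The best model used the mean body news sentiment: it was the only one whose summed accuracy exceeded the summed baseline (by 0.12 across the 14 tickers).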

In [23]:
for outcomes in future_outcomes_df:
  print(sum(future_outcomes_df[outcomes])-sum(future_outcomes_df['future_baseline_acc']))

0.0
-0.05000000000000071
0.1200000000000001
-0.08999999999999986
-0.060000000000001386
-1.1800000000000015
-1.4200000000000008
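The current-day and next-day training cells differ only in the dataframe they read. One possible way to avoid the duplication is a small helper, sketched below under assumptions: `score_feature_sets` is a hypothetical name, the demo frame is synthetic, and the 0/1 labels are passed through unscaled (min-max scaling leaves 0/1 values unchanged). The split and network settings mirror the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler


def score_feature_sets(df, ticker, feature_sets):
    """Return validation accuracy per named feature set (hypothetical helper)."""
    scores = {}
    y = df[ticker].to_numpy()
    for name, cols in feature_sets.items():
        # Scale the inputs to [0, 1], as in the notebook cells.
        X = MinMaxScaler().fit_transform(df[cols])
        X_tr, X_va, y_tr, y_va = train_test_split(
            X, y, test_size=0.28, random_state=5)
        net = MLPClassifier(hidden_layer_sizes=(2,), activation='logistic',
                            solver='lbfgs', random_state=1)
        net.fit(X_tr, y_tr)
        scores[name] = round(accuracy_score(y_va, net.predict(X_va)), 2)
    return scores


# Synthetic stand-in for the master CSV; column names follow the notebook.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'XLE': rng.integers(0, 2, 125).astype(float),
    'guardian_headline_mean': rng.uniform(-1, 1, 125),
    'Huff_headline_mean': rng.uniform(-1, 1, 125),
    'NYT_headline_mean': rng.uniform(-1, 1, 125),
})
feature_sets = {
    'headline_mean_acc': ['guardian_headline_mean', 'Huff_headline_mean',
                          'NYT_headline_mean'],
}
scores = score_feature_sets(demo, 'XLE', feature_sets)
print(scores)
```

With such a helper, the same call could be made once with `stocks_df` and once with `future_df`, passing all six feature sets in a single dictionary.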
