The goal of this notebook is not to predict exact stock prices but to predict daily fluctuations, classifying whether we think a stock will rise or fall. The idea is quite simple: since stock prices are inherently driven by investors' perception of the company, we can use data such as the sentiment of recent articles and trading volume to estimate the probability that the stock price increases the following day.

To start, we will use a simple Naive Bayes classifier to implement the following formula:

$Pr(Stock+ | Sent+, Vol+) \propto Pr(Sent+ | Stock+) \cdot Pr(Vol+ | Stock+) \cdot Pr(Stock+)$

$Stock+$ is a positive change in stock prices.

$Sent+$ is a positive sentiment score. The sentiment is the average of that day's news sentiment (further fine-tuning could weight articles by how impactful they are).

$Vol+$ is an above-average trading volume.

If we get a high probability, say above 60%, we would decide to invest in the stock for the following day.
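
To make the proportionality and the 60% rule concrete, here is a minimal sketch that normalizes the two class scores into a probability and applies the threshold. The function name `posterior_up` and all of the numbers are hypothetical, chosen only for illustration; they are not estimates from the data.

```python
def posterior_up(lik_sent_up, lik_vol_up, prior_up, lik_sent_down, lik_vol_down):
    """Normalized Pr(Stock+ | features) under the Naive Bayes factorization."""
    prior_down = 1.0 - prior_up
    # Unnormalized scores: product of likelihoods times the class prior
    score_up = lik_sent_up * lik_vol_up * prior_up
    score_down = lik_sent_down * lik_vol_down * prior_down
    # Dividing by the sum of both scores turns the "proportional to" into a probability
    return score_up / (score_up + score_down)


# Hypothetical likelihoods and prior (not taken from the dataset)
p_up = posterior_up(lik_sent_up=0.7, lik_vol_up=0.5, prior_up=0.5,
                    lik_sent_down=0.3, lik_vol_down=0.5)

# Decision rule from the text: invest only above a 60% probability
invest = p_up > 0.60
print(p_up, invest)
```

Normalizing matters for the threshold: the raw products on the right-hand side of the formula are not probabilities, so comparing them to 60% directly would not be meaningful.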

In [1]:
import yfinance as yf
import pandas as pd
import numpy as np
import itertools

In [3]:
aapl_news = pd.read_csv("data/AppleFinalData.csv")
aapl_news = aapl_news.drop(['Open', 'High', 'Low', 'Close', 'Adj Close', 'neg', 'neu', 'pos'], axis=1)
aapl_news.head()

Unnamed: 0,Date,compound
0,2006-12-01,0.7707
1,2006-12-04,0.872
2,2006-12-05,0.0
3,2006-12-06,0.6858
4,2006-12-07,-0.6712


In [83]:
print("first observation date:", min(aapl_news['Date']))
print("last observation date:", max(aapl_news['Date']))

first observation date: 2006-12-01
last observation date: 2016-11-30


This dataset was found on Kaggle. It contains the date, pricing information, and sentiment scores for New York Times articles computed with the NLTK VADER algorithm. For this project we will use the compound score, where 1.00 is the most positive and -1.00 the most negative possible value. We will not use the pricing data because it may not be accurate due to stock splits. I also cannot guarantee the accuracy of this data or how it was collected; this project is more of a proof of concept / test of my idea. If the idea is viable, it would be best to create a language model to perform the sentiment classification.
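
The Kaggle file already holds one compound score per day. If we instead started from per-article scores, the daily average described earlier could be computed roughly as follows. This is a sketch on made-up rows; the `articles` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical per-article compound scores (two articles on the first day)
articles = pd.DataFrame({
    "Date": ["2006-12-01", "2006-12-01", "2006-12-04"],
    "compound": [0.8, 0.4, -0.2],
})

# One row per day: the unweighted mean of that day's article scores
daily_sentiment = articles.groupby("Date", as_index=False)["compound"].mean()
print(daily_sentiment)
```

Weighting articles by impact would replace the plain `mean` with a weighted average, which needs a weight column this sketch does not assume.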

In [84]:
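# Download AAPL daily history for the same date range as the news data
aapl = yf.Ticker('AAPL')
aapl_data = aapl.history(start='2006-12-01', end='2016-12-01')

# Move the Date index into a regular column and keep only Close and Volume
aapl_data.reset_index(inplace=True)
aapl_data = aapl_data.drop(['Open', 'High', 'Low', 'Dividends', 'Stock Splits'], axis=1)

# Reduce timestamps to plain dates so they can be matched against the news data
aapl_data['Date'] = pd.to_datetime(aapl_data['Date']).dt.date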

In [85]:
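# Match the Date type used in aapl_data, then keep only days present in both tables
aapl_news['Date'] = pd.to_datetime(aapl_news['Date']).dt.date
aapl_data = aapl_data.merge(aapl_news, on='Date')

# Difference in closing price versus the previous row
aapl_data['Change'] = aapl_data['Close'].diff()

# Label: 1 if the close rose versus the previous row, else 0
# (the first row has no previous close, so its NaN difference maps to 0)
aapl_data['Change'] = aapl_data['Change'].apply(lambda x: 1 if x > 0 else 0)

aapl_data = aapl_data.drop('Close', axis=1)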

Here a positive change is labeled 1 and a negative (or zero) change is labeled 0. We will first do a simple 80/20 train/test split.
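
Note that `diff()` compares each close with the previous one, so `Change` describes the same day as that row's volume and sentiment. If the aim is to predict the following day's move from today's features, one option is to shift the label back by one row. The sketch below shows both labels side by side on toy prices; the `prices` frame and the `NextChange` column are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Toy closing prices, made up for illustration
prices = pd.DataFrame({
    "Date": pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04", "2024-01-05"]),
    "Close": [10.0, 11.0, 10.5, 10.5],
})

# Same-day label: did today's close exceed the previous close?
prices["Change"] = (prices["Close"].diff() > 0).astype(int)

# Next-day label: will the next close exceed today's close?
next_diff = prices["Close"].shift(-1) - prices["Close"]
prices["NextChange"] = (next_diff > 0).astype(int)

# The last row has no following day, so it cannot be labeled and is dropped
labeled = prices.iloc[:-1]
print(labeled)
```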

In [95]:
aapl_data.head()

Unnamed: 0,Date,Volume,compound,Change
0,2006-12-01,795079600,0.7707,0
1,2006-12-04,709536800,0.872,0
2,2006-12-05,662838400,0.0,1
3,2006-12-06,638184400,0.6858,0
4,2006-12-07,1004827600,-0.6712,0


Because stock prices and their relevant trends vary over time, we will also test a sort of short-term memory in which only the past year, 6 months, 3 months, or 1 month of data is used to predict the next outcome, using a sliding window and walk-forward validation.
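
As a small illustration of walk-forward validation, the sketch below lists which row positions would be used for fitting and which single row would be predicted at each step. The helper `walk_forward_splits` is hypothetical, and it assumes exactly `window` training rows per step, which may differ by one row from other indexing choices.

```python
def walk_forward_splits(n_rows, window):
    """Yield (train_positions, target_position) pairs for a sliding window."""
    for start in range(n_rows - window):
        # Fit on `window` consecutive rows ...
        train_positions = range(start, start + window)
        # ... and predict the row immediately after them
        target_position = start + window
        yield train_positions, target_position


splits = list(walk_forward_splits(n_rows=6, window=3))
for train_positions, target_position in splits:
    print(list(train_positions), "->", target_position)
```

Each target row is only ever predicted from rows that come before it, so no future information leaks into the fit.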

In [202]:
def gaussian_pdf(x, mean, var):
    # Density of a normal distribution with the given mean and variance at x
    exponent = (-((x - mean) ** 2) / (2 * var))
    return (1 / (np.sqrt(2 * np.pi * var))) * np.exp(exponent)


def NaiveBayes(data, window):
    # Walk-forward Gaussian Naive Bayes: fit on a sliding window of rows,
    # then predict the 'Change' label of the row right after the window.
    # Assumes data has the default 0..n-1 integer index.
    preds = []
    targets = []

    for i in range(0, len(data) - window - 1):
        # .loc slices by label and includes both endpoints,
        # so each training window holds window + 1 rows
        train = data.loc[i: i + window]
        target = data.loc[i + window + 1]

        posteriors = []
        for change in [0, 1]:
            in_class = train[train['Change'] == change]

            # Prior: share of window rows with this label
            pr_change = len(in_class) / len(train)

            # Class-conditional Gaussian likelihood of the target's volume
            vol = in_class['Volume']
            vol_likelihood = gaussian_pdf(target['Volume'], np.mean(vol), np.var(vol))

            # Class-conditional Gaussian likelihood of the target's sentiment
            sent = in_class['compound']
            sent_likelihood = gaussian_pdf(target['compound'], np.mean(sent), np.var(sent))

            # Unnormalized posterior for this label
            posteriors.append(vol_likelihood * sent_likelihood * pr_change)

        # Predict the label with the larger unnormalized posterior
        preds.append(np.argmax(posteriors))
        targets.append(target['Change'])
    return preds, targets
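
In formula form, the function above scores each label $c \in \{0, 1\}$ for a target day with volume $v$ and compound sentiment $s$ using Gaussian class-conditional densities fitted on the window. The symbols $c$, $v$, $s$, $\mu$, and $\sigma^2$ are notation introduced here for illustration:

$Pr(c | v, s) \propto \mathcal{N}(v; \mu_{v,c}, \sigma^2_{v,c}) \cdot \mathcal{N}(s; \mu_{s,c}, \sigma^2_{s,c}) \cdot Pr(c)$

$\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2 \sigma^2}\right)$

So the features enter as continuous values rather than as the binary events $Sent+$ and $Vol+$, and the prediction is the label with the larger right-hand side.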

In [242]:
X = aapl_data[['Volume', 'compound']].values

# Window lengths in trading days (roughly 21 per month, 252 per year)
one_month = 21
three_months = 63
six_months = 126
one_year = 252

one_month_preds, one_month_targets = NaiveBayes(aapl_data, one_month)
three_months_preds, three_months_targets = NaiveBayes(aapl_data, three_months)
six_months_preds, six_months_targets = NaiveBayes(aapl_data, six_months)
one_year_preds, one_year_targets = NaiveBayes(aapl_data, one_year)

In [243]:
accuracy = sum(p == t for p, t in zip(one_month_preds, one_month_targets)) / len(one_month_targets)
print(f"One Month Accuracy: {accuracy}")
accuracy = sum(p == t for p, t in zip(three_months_preds, three_months_targets)) / len(three_months_targets)
print(f"Three Months Accuracy: {accuracy}")
accuracy = sum(p == t for p, t in zip(six_months_preds, six_months_targets)) / len(six_months_targets)
print(f"Six Months Accuracy: {accuracy}")
accuracy = sum(p == t for p, t in zip(one_year_preds, one_year_targets)) / len(one_year_targets)
print(f"One Year Accuracy: {accuracy}")

One Month Accuracy: 0.5118236472945892
Three Months Accuracy: 0.5205870362821036
Six Months Accuracy: 0.5280334728033472
One Year Accuracy: 0.5300353356890459


In [244]:
true_positives = sum(p == t == 1 for p, t in zip(one_month_preds, one_month_targets))
predicted_positives = sum(p == 1 for p in one_month_preds)
precision = true_positives / predicted_positives
print(f"One Month Precision: {precision}")

true_positives = sum(p == t == 1 for p, t in zip(three_months_preds, three_months_targets))
predicted_positives = sum(p == 1 for p in three_months_preds)
precision = true_positives / predicted_positives
print(f"Three Months Precision: {precision}")

true_positives = sum(p == t == 1 for p, t in zip(six_months_preds, six_months_targets))
predicted_positives = sum(p == 1 for p in six_months_preds)
precision = true_positives / predicted_positives
print(f"Six Months Precision: {precision}")

true_positives = sum(p == t == 1 for p, t in zip(one_year_preds, one_year_targets))
predicted_positives = sum(p == 1 for p in one_year_preds)
precision = true_positives / predicted_positives
print(f"One Year Precision: {precision}")

One Month Precision: 0.5330960854092527
Three Months Precision: 0.5373134328358209
Six Months Precision: 0.5363528009535161
One Year Precision: 0.5331088664421998
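
The repeated accuracy and precision calculations could be folded into one helper. The sketch below is one way to do it; `summarize` is a hypothetical name, and the inputs are toy lists rather than the notebook's predictions. It also reports the accuracy of always predicting 1, which gives a reference point for judging how far a model is from a trivial strategy.

```python
def summarize(preds, targets):
    """Return accuracy, precision for label 1, and always-predict-1 accuracy."""
    pairs = list(zip(preds, targets))
    accuracy = sum(p == t for p, t in pairs) / len(pairs)

    predicted_positives = sum(p == 1 for p, _ in pairs)
    true_positives = sum(p == 1 and t == 1 for p, t in pairs)
    # Precision is undefined when nothing was predicted positive
    precision = true_positives / predicted_positives if predicted_positives else float("nan")

    # Accuracy of a trivial model that predicts "up" every day
    always_up_accuracy = sum(t == 1 for _, t in pairs) / len(pairs)
    return accuracy, precision, always_up_accuracy


# Toy predictions and targets, made up for illustration
acc, prec, base = summarize([1, 0, 1, 1], [1, 0, 0, 1])
print(acc, prec, base)
```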


In [245]:
print('First 20 results from our most precise model:')
for p, t in itertools.islice(zip(three_months_preds, three_months_targets), 20):
    print(f"Prediction: {p}, Target: {t}")

First 20 results from our most precise model:
Prediction: 0, Target: 1
Prediction: 0, Target: 0
Prediction: 1, Target: 1
Prediction: 0, Target: 0
Prediction: 0, Target: 1
Prediction: 0, Target: 0
Prediction: 0, Target: 1
Prediction: 1, Target: 1
Prediction: 0, Target: 1
Prediction: 0, Target: 1
Prediction: 1, Target: 1
Prediction: 0, Target: 0
Prediction: 0, Target: 1
Prediction: 1, Target: 0
Prediction: 0, Target: 0
Prediction: 0, Target: 1
Prediction: 0, Target: 0
Prediction: 0, Target: 1
Prediction: 1, Target: 1
Prediction: 0, Target: 0


This model's simplicity brings some drawbacks, which show in the accuracy and precision of our best models being only slightly better than a coin flip. We are assuming that sentiment and trading volume are independent given the stock change, hence the name Naive Bayes, which is likely not the case. With this in mind, we will look at some more complex models that may help overcome this drawback, such as MCMC, where we can use correlated and non-identically distributed samples in the form of Markov chains and still have the Law of Large Numbers and the Central Limit Theorem hold.
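
One rough way to probe the conditional independence assumption is to look at the correlation between the two features within each class; values far from zero suggest the assumption is strained. The sketch below runs on synthetic data in which sentiment is deliberately built from volume, so the frame `toy` and its values are hypothetical and say nothing about the AAPL data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic labels and features; compound is constructed to depend on Volume
change = rng.integers(0, 2, size=n)
volume = rng.normal(size=n)
compound = 0.8 * volume + 0.2 * rng.normal(size=n)
toy = pd.DataFrame({"Change": change, "Volume": volume, "compound": compound})

# Pearson correlation between the two features, computed separately per class
within_class_corr = {
    label: group["Volume"].corr(group["compound"])
    for label, group in toy.groupby("Change")
}
print(within_class_corr)
```

Pearson correlation only captures linear dependence, so a value near zero would not by itself prove independence.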

DO ONE WITH BAYESIAN LOGISTIC REGRESSION AND MCMC

DO ONE WITH BAYESIAN NEURAL NETWORK

ONE WITH LSTM (MENTION HOW THIS IS TYPICALLY GOOD FOR PREDICTING STOCK PRICING BUT WHAT IF WE USE IT TO PREDICT STOCK PRICE CHANGES)