# Machine learning to predict stock price

In [249]:
# Set it up again
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
import seaborn as sns
import sklearn
from afinn import Afinn
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LogisticRegression
%matplotlib inline

To do: discuss the different models used for machine learning and the hypotheses about using sentiment analysis for prediction, then compare Top1 and Top25.

In [250]:
reddit_news_df = pd.read_csv("Combined_News_DJIA.csv")
# reddit_news_df.head()

In phase 2 we found that only 3 rows contain null values, so I'm dropping those rows again here.

In [251]:
reddit_news_df = reddit_news_df.dropna()

## Question 1. Are sentiment analysis values of the more popular headlines in the r/worldnews subreddit better stock market predictors than those of less popular headlines?
In the following cells I'll train logistic regression models on the sentiment scores of headlines at different popularity ranks (Top1, Top5, Top15, Top25) to predict whether the Dow Jones Industrial Average (DJIA) goes up or down.
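To make the feature concrete: a lexicon-based scorer such as AFINN assigns each known word an integer valence and sums the valences of the words found in a text. The sketch below imitates that idea in plain Python. The three-word `toy_lexicon` and the `toy_score` helper are made up for illustration; they are not the real AFINN word list or the `afinn` package's implementation.

```python
import re

# Hypothetical mini-lexicon: word -> valence (illustrative values only)
toy_lexicon = {"attack": -1, "crisis": -3, "win": 4}

def toy_score(text, lexicon=toy_lexicon):
    """Sum the valences of the lexicon words that appear in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(lexicon.get(word, 0) for word in words)

negative = toy_score("Debt crisis deepens after attack")
positive = toy_score("Athletes win gold")
print(negative, positive)
```

A headline with no lexicon words scores 0, so a single headline often carries little signal on its own.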

In [252]:
af = Afinn()

def clean_headline(headline):
    # Headlines were stored as the text form of Python bytes (b"..." or b'...');
    # drop the b" marker and turn b' into a plain quote, as before.
    return headline.replace('b"', '').replace('b\'', "'")

# Score the headline at each popularity rank and store the result in a new column
for rank in (1, 5, 15, 25):
    headlines = reddit_news_df[f'Top{rank}']
    reddit_news_df[f'Top{rank} Sentiment'] = [af.score(clean_headline(h)) for h in headlines]
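One thing to watch with the chained `.replace` calls is that they match `b"` anywhere in the string, not only at the start. The sketch below compares that behavior with a regex anchored to the whole string. The `strip_bytes_repr` helper and the sample headline are hypothetical, and whether this matters for the actual dataset is an assumption that would need checking.

```python
import re

def strip_bytes_repr(headline):
    """Remove a leading b" or b' and the matching closing quote, if present."""
    match = re.fullmatch(r"b([\"'])(.*)\1", headline, flags=re.DOTALL)
    return match.group(2) if match else headline

# Made-up headline containing the sequence b" in the middle of the text
raw = 'b"Officials say "club" fire was arson"'

chained = raw.replace('b"', '').replace("b'", "'")
anchored = strip_bytes_repr(raw)

print(chained)   # the b" inside club" is removed as well
print(anchored)  # only the wrapper is removed
```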

In [253]:
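# .loc is label-based and inclusive: train ends at 943 so it no longer overlaps the test set
# (the recorded outputs below used .loc[:994], which shared labels 944-994 with the test set).
reddit_train = reddit_news_df.loc[:943].copy()
reddit_test = reddit_news_df.loc[944:].copy()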
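Since `.loc` slices by index label and includes both endpoints, a quick check that the training and test indices do not overlap can catch leakage between the two sets. Here is a sketch on a made-up ten-row frame (`toy` and its `value` column are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame with integer labels 0..9
toy = pd.DataFrame({"value": range(10)})

# .loc includes both endpoints, so label 5 lands in both slices here
overlap_train = toy.loc[:5]
overlap_test = toy.loc[5:]
shared = overlap_train.index.intersection(overlap_test.index)
print(list(shared))

# Ending the training slice one label earlier keeps the sets disjoint
train = toy.loc[:4]
test = toy.loc[5:]
print(train.index.intersection(test.index).empty)
```

The same `index.intersection` check can be run on `reddit_train` and `reddit_test` directly.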

### Using Top1 headlines sentiment analysis values

In [254]:
# print(len(top1_sentiment_scores))
x_train = reddit_train[['Top1 Sentiment']]
x_test = reddit_test[['Top1 Sentiment']]

y_train = reddit_train[['Label']]
y_test = reddit_test[['Label']]

In [255]:
logReg = LogisticRegression()
logReg.fit(x_train, y_train.values.ravel())

y_pred = logReg.predict(x_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logReg.score(x_test, y_test)))

Accuracy of logistic regression classifier on test set: 0.53


In [256]:
# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('True negatives:', tn)
print('False positives:', fp)
print('False negatives:', fn)
print('True positives:', tp)

True negatives: 3
False positives: 488
False negatives: 3
True positives: 551


In [257]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

             precision    recall  f1-score   support

          0       0.50      0.01      0.01       491
          1       0.53      0.99      0.69       554

avg / total       0.52      0.53      0.37      1045



<b>Accuracy of model using Top1 article sentiment analysis values</b>

As the output above shows, the accuracy of the model isn't great: it predicted a little over half (53%) of the DJIA labels correctly. Furthermore, the classification report shows that the model predicts class 1 (the DJIA goes up or stays the same) on almost every day. It therefore catches nearly all of the up days (recall 0.99) but almost none of the down days (recall 0.01).
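The per-class figures in the classification report can be re-derived from the four confusion-matrix counts. The sketch below does that arithmetic by hand, assuming scikit-learn's ordering for binary labels, where `confusion_matrix(...).ravel()` returns `tn, fp, fn, tp`. The counts are the ones recorded in the output above.

```python
# Counts for the Top1 model in ravel() order: tn, fp, fn, tp
tn, fp, fn, tp = 3, 488, 3, 551

recall_down = tn / (tn + fp)       # share of actual 0 days predicted as 0
recall_up = tp / (tp + fn)         # share of actual 1 days predicted as 1
precision_down = tn / (tn + fn)    # share of predicted 0 days that were 0
precision_up = tp / (tp + fp)      # share of predicted 1 days that were 1
accuracy = (tn + tp) / (tn + fp + fn + tp)

print(f"class 0: precision {precision_down:.2f}, recall {recall_down:.2f}")
print(f"class 1: precision {precision_up:.2f}, recall {recall_up:.2f}")
print(f"accuracy: {accuracy:.2f}")
```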

### Using Top5 headlines sentiment analysis values

In [258]:
x_train = reddit_train[['Top5 Sentiment']]
x_test = reddit_test[['Top5 Sentiment']]
y_train = reddit_train[['Label']]
y_test = reddit_test[['Label']]

logReg = LogisticRegression()
logReg.fit(x_train, y_train.values.ravel())
y_pred = logReg.predict(x_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logReg.score(x_test, y_test)))

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('True negatives:', tn)
print('False positives:', fp)
print('False negatives:', fn)
print('True positives:', tp)

Accuracy of logistic regression classifier on test set: 0.53
True negatives: 0
False positives: 491
False negatives: 0
True positives: 554


### Using Top15 headlines sentiment analysis values

In [259]:
x_train = reddit_train[['Top15 Sentiment']]
x_test = reddit_test[['Top15 Sentiment']]
y_train = reddit_train[['Label']]
y_test = reddit_test[['Label']]

logReg = LogisticRegression()
logReg.fit(x_train, y_train.values.ravel())
y_pred = logReg.predict(x_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logReg.score(x_test, y_test)))

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('True negatives:', tn)
print('False positives:', fp)
print('False negatives:', fn)
print('True positives:', tp)

Accuracy of logistic regression classifier on test set: 0.52
True negatives: 35
False positives: 456
False negatives: 43
True positives: 511


### Using Top25 headlines sentiment analysis values

In [260]:
x_train = reddit_train[['Top25 Sentiment']]
x_test = reddit_test[['Top25 Sentiment']]
y_train = reddit_train[['Label']]
y_test = reddit_test[['Label']]

logReg = LogisticRegression()
logReg.fit(x_train, y_train.values.ravel())
y_pred = logReg.predict(x_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logReg.score(x_test, y_test)))

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('True negatives:', tn)
print('False positives:', fp)
print('False negatives:', fn)
print('True positives:', tp)

Accuracy of logistic regression classifier on test set: 0.53
True negatives: 0
False positives: 491
False negatives: 0
True positives: 554


### Thoughts
Models built on the sentiment scores of headlines at different popularity ranks predict the DJIA with roughly the same accuracy: all four score 52-53%. At first glance, then, headline sentiment doesn't appear to be a good feature. That may still be reasonable for a single feature, but note that 554 of the 1,045 test days (53%) are labeled 1, so a model that always predicts 1 reaches the same accuracy. Predicting the stock market is hard, and other features need to be considered before disregarding sentiment analysis.
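To put the 52-53% figure in context, the accuracies can be recomputed from the confusion-matrix counts recorded in the outputs above and compared with a majority-class baseline, that is, always predicting the more common test label. This is a sketch of the arithmetic only; the `recorded` dictionary is simply a transcription of those outputs in `tn, fp, fn, tp` order.

```python
# Recorded counts per model in ravel() order: tn, fp, fn, tp
recorded = {
    "Top1": (3, 488, 3, 551),
    "Top5": (0, 491, 0, 554),
    "Top15": (35, 456, 43, 511),
    "Top25": (0, 491, 0, 554),
}

accuracies = {}
for name, (tn, fp, fn, tp) in recorded.items():
    accuracies[name] = (tn + tp) / (tn + fp + fn + tp)

# Baseline: always predict the more common test label (491 zeros, 554 ones)
n_down, n_up = 491, 554
baseline = max(n_down, n_up) / (n_down + n_up)

for name, acc in accuracies.items():
    print(f"{name:>5}: accuracy {acc:.3f}, vs. baseline {acc - baseline:+.3f}")
```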

## Question 2. Are there better features we can use to predict the DJIA?

#### Notes below, will be deleted

Use sentiment analysis on headlines with similar topics.

Maybe sentiment analysis of the headlines isn't a good feature to use. If so, use other features listed here: https://blog.quantinsti.com/machine-learning-logistic-regression-python/

In [261]:
# Maybe set up a comparison with separately named variables per model and then combine all the results, like the cells below:
x5_train = reddit_train[['Top5 Sentiment']]
x5_test = reddit_test[['Top5 Sentiment']]

y5_train = reddit_train[['Label']]
y5_test = reddit_test[['Label']]

### Top25 model results

In [262]:
# y_test and y_pred still hold the Top25 model's values from In [260]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
# Store the matrix under a new name so the imported confusion_matrix function is not overwritten
cm = confusion_matrix(y_test, y_pred)
print('Top25 model True negatives:', tn)
print('Top25 model False positives:', fp)
print('Top25 model False negatives:', fn)
print('Top25 model True positives:', tp)
print(cm)

Top25 model True negatives: 0
Top25 model False positives: 491
Top25 model False negatives: 0
Top25 model True positives: 554
[[  0 491]
 [  0 554]]
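One way to combine the results for several models is to wrap the fit-and-evaluate steps in a function and collect one row per feature. The sketch below runs on randomly generated stand-in data, so its numbers mean nothing. The `evaluate_feature` helper and the `toy` frame are hypothetical names, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

def evaluate_feature(train, test, feature, label="Label"):
    """Fit a one-feature logistic regression and return its test-set counts."""
    model = LogisticRegression()
    model.fit(train[[feature]], train[label])
    y_pred = model.predict(test[[feature]])
    # labels=[0, 1] keeps the matrix 2x2 even if only one class is predicted
    tn, fp, fn, tp = confusion_matrix(test[label], y_pred, labels=[0, 1]).ravel()
    return {"feature": feature, "accuracy": (tn + tp) / len(test),
            "tn": tn, "fp": fp, "fn": fn, "tp": tp}

# Random stand-in data with the same column names as the notebook
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Top1 Sentiment": rng.integers(-5, 6, size=n),
    "Top25 Sentiment": rng.integers(-5, 6, size=n),
    "Label": rng.integers(0, 2, size=n),
})
toy_train, toy_test = toy.iloc[:150], toy.iloc[150:]

features = ["Top1 Sentiment", "Top25 Sentiment"]
results = pd.DataFrame([evaluate_feature(toy_train, toy_test, f) for f in features])
print(results)
```

With the real data, `reddit_train` and `reddit_test` would take the place of `toy_train` and `toy_test`, and each row would be labeled by the feature that produced it.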
