We built a tool to predict stock prices from discussion on Reddit's [/r/WallStreetBets](https://www.reddit.com/r/WallStreetBets) forum by scraping the previous 100 days' posts and comments from the subreddit and collecting historical stock prices from the [Marketstack API](https://marketstack.com/). Our predictor works by identifying days with similar changes in the number of mentions of a stock's symbol, measured as a [z-score](https://www.statisticshowto.com/probability-and-statistics/z-score/). To validate our method, we compare the predicted change in price (rise, fall, or no change) with the actual change in price. Overall, our model struggles to achieve accurate results. We suspect this is because a stock's price depends on many factors, and the change in how often it is mentioned on a single forum is too little information to predict its movement.

# Data Description

## Reddit posts and comments
After scraping and cleaning the posts and comments, we have DataFrames containing posts and comments on WallStreetBets from the last 100 days. To cut down on spam and joke posts, we kept only posts whose user-assigned flair marks them either as `Discussion`, where users discuss what is happening in the stock market, or as `DD`, where the author shares their analysis of a particular stock.
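
The flair filter described above can be sketched as follows. This is a minimal illustration on toy data, not the project's cleaning code: the DataFrame contents and the column names `title` and `flair` are assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the scraped posts; column names and rows are hypothetical
df_posts_demo = pd.DataFrame({
    'title': ['TSLA earnings thoughts', 'rocket emoji', 'SNDL deep dive', 'YOLO update'],
    'flair': ['Discussion', 'Meme', 'DD', 'YOLO'],
})

# flairs kept for analysis, as described in the text
keep_flairs = ['Discussion', 'DD']

# boolean mask selects the rows whose flair is in the allowed list
df_filtered = df_posts_demo[df_posts_demo['flair'].isin(keep_flairs)]
print(df_filtered)
```

Filtering on flair is cheap, but it relies on users labelling their own posts correctly, so some joke posts may still get through.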

In [None]:
import pandas as pd

df_posts = pd.read_csv('posts_last_100_days.csv', index_col = 'Unnamed: 0')
df_posts.head()

In [None]:
df_comments = pd.read_csv('comments_last_100_days.csv', index_col = 'Unnamed: 0')
df_comments.head()

We then computed the number of times each stock was mentioned per day.
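
One way such a count could be produced is sketched below. It is an illustration under assumptions rather than the counting code we actually ran: the toy comments, the column names `date` and `body`, and the whole-word, case-sensitive matching rule are all choices made for this example.

```python
import re
import pandas as pd

# Toy comments; the column names 'date' and 'body' are hypothetical
df_text = pd.DataFrame({
    'date': ['2021-03-01', '2021-03-01', '2021-03-02'],
    'body': ['TSLA to the moon, TSLA!', 'Thoughts on SNDL and $TSLA?', 'SNDL looks cheap'],
})
symbols = ['TSLA', 'SNDL']

def count_symbol(text, symbol):
    """Count whole-word, case-sensitive occurrences of symbol in text."""
    return len(re.findall(rf'\b{re.escape(symbol)}\b', text))

counts = {}
for symbol in symbols:
    # mentions in each comment, then summed over all comments from the same day
    per_comment = df_text['body'].map(lambda text: count_symbol(text, symbol))
    counts[symbol] = per_comment.groupby(df_text['date']).sum()

# stack into one frame indexed by (symbol, date)
df_counts_demo = pd.concat(counts, names=['symbol', 'date']).to_frame('mentions')
print(df_counts_demo)
```

Whole-word matching avoids counting a symbol inside a longer word, but short tickers that are also ordinary words would still be over-counted.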

In [None]:
df_symbol_counts = pd.read_csv('symbol_counts.csv', index_col=[0,1])
df_symbol_counts.head()

This allows us to see how many times each stock symbol was mentioned on each day that we have data for:

In [None]:
df_counts_tsla = df_symbol_counts.unstack(level=1).loc[['TSLA']]

df_counts_tsla.head()

For the model, we converted the raw number of mentions to a z-score in order to have a standardized measure of how far each day's level of discussion is from the mean. When we apply this, the DataFrame above looks like this:

In [None]:
# convert mentions to z-scores
# transpose df_counts_tsla to make the z-score operation easier; we transpose back below
df_counts_tsla_t = df_counts_tsla.T

# find zscore of mentions of each day
from scipy.stats import zscore
df_counts_tsla_t = df_counts_tsla_t.apply(zscore)

# undo transpose
df_counts_tsla = df_counts_tsla_t.T
df_counts_tsla.head()
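
To make the transformation concrete, here is a small worked example on invented mention counts. It shows that `scipy.stats.zscore` is the usual "subtract the mean, divide by the standard deviation" calculation, using the population standard deviation (`ddof=0`) by default.

```python
import numpy as np
from scipy.stats import zscore

# invented daily mention counts, for illustration only
mentions = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# scipy's default uses the population standard deviation (ddof=0)
z_scipy = zscore(mentions)

# the same calculation written out by hand
z_manual = (mentions - mentions.mean()) / mentions.std(ddof=0)

print(z_scipy)
```

A z-score of 2 therefore means a day with two standard deviations more mentions than that stock's average day. Note that pandas' `.std()` defaults to `ddof=1`, so it gives slightly different values from the `ddof=0` used here.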

## Symbols with the highest variance in discussion
After finding how many times each stock was mentioned each day, we found the 10 stocks whose daily mention counts varied the most, using standard deviation as the measure of variation.

In [None]:
df_count_unstack = df_symbol_counts.unstack(level=1)
df_symbol_counts_t = df_count_unstack.T

In [None]:
# figure out which symbols have the most variance in mentions
std_series = df_symbol_counts_t.std()

# get top 10 symbols with highest standard deviations
high_std_symbol = std_series.sort_values(ascending=False)[:10]

# list of symbols with highest variance
symbol_list = high_std_symbol.index.tolist()

print(symbol_list)

## Marketstack Data

To collect data from Marketstack's API, we wrote a function that would return a specified stock's price history as a DataFrame.

In [None]:
import requests

def ticker_history(ticker):
    """Return end-of-day trading data for a stock over the last 100 trading days.

    Args:
        ticker (str): ticker symbol of the stock to look up

    Returns:
        df_history (DataFrame): end-of-day data, one row per trading day
    """
    # define API access key
    params = {'access_key': 'fb8442e7289f98db111f57de2a4c1d75'}
    
    # call the API with the given parameters
    api_result = requests.get(f'https://api.marketstack.com/v1/tickers/{ticker}/eod', params)
    
    # convert data to json format
    api_response = api_result.json()
    
    # DataFrame composed of only end of day stock data
    df_history = pd.DataFrame(api_response['data']['eod'])
    
    return df_history
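
The function above embeds the access key in the notebook and assumes the request always succeeds. The sketch below shows one possible hardening, under stated assumptions: the environment variable name `MARKETSTACK_ACCESS_KEY` and both helper names are hypothetical, the `error` field check is an assumption about how failures are reported, and the demo payload is invented data that only mimics the `data` → `eod` shape that `ticker_history` reads.

```python
import os
import pandas as pd

def load_access_key(env_var='MARKETSTACK_ACCESS_KEY'):
    """Read the API key from an environment variable (variable name is hypothetical)."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f'Set the {env_var} environment variable first')
    return key

def eod_payload_to_frame(payload):
    """Turn a decoded JSON payload into a DataFrame of end-of-day records.

    Assumes the shape used by ticker_history: payload['data']['eod'] is a
    list of per-day records, and failures are reported under an 'error' key.
    """
    if 'error' in payload:
        raise ValueError(f"API returned an error: {payload['error']}")
    records = payload.get('data', {}).get('eod', [])
    return pd.DataFrame(records)

# invented payload, used only to exercise the parsing step without a network call
demo_payload = {'data': {'symbol': 'DEMO', 'eod': [
    {'date': '2021-03-05', 'close': 100.0},
    {'date': '2021-03-08', 'close': 101.5},
]}}
df_demo_history = eod_payload_to_frame(demo_payload)
print(df_demo_history)
```

Keeping the key out of the notebook means the notebook can be shared without also sharing the credential.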

In [None]:
df_tsla_history = ticker_history('tsla')
df_tsla_history.head()

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold

from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

# Method

## 5-NN classifier
To predict price movements, we use a 5-NN classifier. In essence, we predict the movement for a given day by looking at the 5 days with the most similar z-scores and taking the most common movement among them. We chose 5 as the number of neighbors after experimenting with different values to find a balance between accuracy and skewed predictions (i.e., the model predicting everything as one category).

We decided to use the K-Nearest Neighbors algorithm because it is the closest match to our original idea of finding a correlation between days with similar amounts of change in discussion and movements in stock prices.
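
The idea can be sketched on a handful of synthetic points. The z-scores and labels below are invented for illustration and are not taken from our dataset; the sketch does the neighbor lookup by hand and then checks that scikit-learn's `KNeighborsClassifier` reaches the same answer.

```python
from collections import Counter

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# synthetic training days: mention z-score and observed price movement (invented)
z_train = np.array([-1.2, -0.8, -0.5, -0.1, 0.0, 0.3, 0.6, 1.1, 1.5, 2.0])
labels = np.array(['fall', 'fall', 'rise', 'fall', 'rise',
                   'rise', 'fall', 'rise', 'rise', 'rise'])

# the new day we want a prediction for
z_query = 0.4

# by hand: take the 5 training days with the closest z-scores and let them vote
nearest_idx = np.argsort(np.abs(z_train - z_query))[:5]
vote = Counter(labels[nearest_idx]).most_common(1)[0][0]

# the same prediction from scikit-learn (features must be 2-D, hence the reshape)
knn = KNeighborsClassifier(n_neighbors=5).fit(z_train.reshape(-1, 1), labels)
pred = knn.predict([[z_query]])[0]

print(vote, pred)
```

With a single feature, "nearest" simply means the smallest absolute difference in z-score, which is why the model can only ever learn a relationship between the level of discussion and the direction of the price.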

## Estimation 1 - SNDL
We chose to train our first model on SNDL's stock because that was the stock with the highest variance in the number of daily mentions, so we thought that it had the best potential to show a relationship between price movements and changes in discussion. However, it also had many days where it was not mentioned at all, as can be seen below:

In [None]:
df_counts_sndl = df_count_unstack.loc[['SNDL']]

df_counts_sndl.head(1)

As mentioned previously, we computed the z-score of each day's number of mentions in order to have a standardized metric for how much each day's level of discussion varies from the mean, rather than just using the raw number of mentions. After computing the z-score for each day and joining it with the direction of the change in SNDL's closing price, we had the DataFrame below, which we used to train our K-Nearest Neighbors classifier.

In [None]:
# convert mentions to z-scores
# transpose df_counts_sndl to make the z-score operation easier; we transpose back below
df_counts_sndl_t = df_counts_sndl.T

# find zscore of mentions of each day
from scipy.stats import zscore
df_counts_sndl_t = df_counts_sndl_t.apply(zscore)

# undo transpose
df_counts_sndl = df_counts_sndl_t.T
# transpose so that each date becomes a row, like df_sndl_history, and
# turn the hierarchical index into ordinary columns
df_sndl_no_multi_index = df_counts_sndl.T.reset_index()
df_sndl = df_sndl_no_multi_index.reset_index()

# remove unnecessary columns
del df_sndl['level_0']
del df_sndl['index']

df_sndl.head()

In [None]:
# Actual Price movements -- if the close price rose or fell each day
# get price history
df_sndl_history = ticker_history('sndl')

# make sure that it is sorted by date so that percent change in price is a meaningful measure
df_sndl_history['date'] = pd.to_datetime(df_sndl_history['date']).dt.date
df_sndl_history = df_sndl_history.sort_values(by='date')

# create column for the fractional change in closing price (0.05 means +5%)
df_sndl_history['close % change'] = df_sndl_history['close'].pct_change()

# create categories from percent change -- rise, fall, no change
df_sndl_history.loc[df_sndl_history['close % change'] > 0, 'rise/fall'] = 'rise'
df_sndl_history.loc[df_sndl_history['close % change'] < 0, 'rise/fall'] = 'fall'
df_sndl_history.loc[df_sndl_history['close % change'] == 0, 'rise/fall'] = 'no change'

df_sndl_history.head()
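
The labelling step can be illustrated on a few invented closing prices. This sketch uses `np.sign` plus a mapping, which is a swapped-in alternative to the three `.loc` assignments above rather than the code we ran. It also makes visible that the earliest day has no previous close, so it gets no label.

```python
import numpy as np
import pandas as pd

# invented closing prices, oldest first
close = pd.Series([10.0, 10.5, 10.5, 9.9], name='close')

# fractional change from the previous day; the first value is NaN
pct = close.pct_change()

# the sign of the change maps directly onto the three categories
movement = np.sign(pct).map({1.0: 'rise', -1.0: 'fall', 0.0: 'no change'})

print(pd.DataFrame({'close': close, 'pct': pct, 'movement': movement}))
```

Rows without a label would need to be dropped (for example with `dropna`) before being used as training targets.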

In [None]:
# combine datasets (reddit + marketstack)
# ensure both date columns are datetime.date objects for consistency
df_sndl['date'] = pd.to_datetime(df_sndl['date']).dt.date
df_sndl_history['date'] = pd.to_datetime(df_sndl_history['date']).dt.date


# remove dates that aren't in both dataframes -- account for markets being closed on weekends and other factors
date_list = df_sndl['date'].tolist()

df_sndl_history = df_sndl_history[df_sndl_history['date'].isin(date_list)]

sndl_date_list = df_sndl_history['date'].tolist()
df_sndl = df_sndl[df_sndl['date'].isin(sndl_date_list)]

df_sndl_history = df_sndl_history.sort_values(by='date')
df_sndl = df_sndl.sort_values(by='date')

# add price movements column - we can just assign a copy since we know both dataframes 
# only contain the same dates and are sorted in the same way
df_sndl['price movement'] = df_sndl_history['rise/fall'].values

df_sndl.head()
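
An alternative way to align the two sources is an inner join on the date, which does not rely on both frames being filtered and sorted identically. The sketch below uses invented rows and a hypothetical column name `z`; it is not the code used for our results.

```python
import datetime as dt

import pandas as pd

# invented mention z-scores, including a Saturday when markets are closed
df_mentions = pd.DataFrame({
    'date': [dt.date(2021, 3, 5), dt.date(2021, 3, 6), dt.date(2021, 3, 8)],
    'z': [0.4, -1.0, 1.3],
})

# invented price labels, deliberately out of order and with an extra day
df_prices = pd.DataFrame({
    'date': [dt.date(2021, 3, 8), dt.date(2021, 3, 5), dt.date(2021, 3, 9)],
    'rise/fall': ['rise', 'fall', 'fall'],
})

# the inner join keeps only dates present in both frames and pairs rows by date
df_joined = (df_mentions.merge(df_prices, on='date', how='inner')
             .sort_values('date')
             .reset_index(drop=True))
print(df_joined)
```

Because rows are matched on the key itself, a duplicated or missing date in either source cannot silently shift the labels by a row.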

In [None]:
import warnings
# ignore warnings
warnings.filterwarnings('ignore')

# split columns into x (features) and y (labels)
x_feat_list = ['SNDL']
y_feat = 'price movement'


# get x and y for classifiers
x = df_sndl.loc[:, x_feat_list].values
y_true = df_sndl.loc[:, y_feat].values

# we will be using 10 fold cross validation
n_splits = 10

# dictionary for confusion matrix
conf_matrix_sndl_dict = {}

# number of neighbors to use
k = 5

# initialize knn_classifier
knn_classifier = KNeighborsClassifier(n_neighbors=k)

# cross validation
kfold = StratifiedKFold(n_splits=n_splits)

# y_pred is empty array same size as y_true
y_pred = np.empty_like(y_true)

# for plotting
x_test_total = np.empty_like(x)

for train_idx, test_idx in kfold.split(x, y_true):
    # training data
    x_train = x[train_idx, :]
    y_true_train = y_true[train_idx]

    # testing data
    x_test = x[test_idx, :]
    y_true_test = y_true[test_idx]
    
    x_test_total[test_idx] = x_test

    # train on training data
    knn_classifier.fit(x_train, y_true_train)

    # predict the price movement for each day in the test fold
    y_pred[test_idx] = knn_classifier.predict(x_test)


# generate confusion matrix
conf_matrix_sndl = confusion_matrix(y_true=y_true, y_pred=y_pred)
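
The manual loop above can also be written with scikit-learn's `cross_val_predict`, and the resulting accuracy is easier to judge next to a trivial baseline that always predicts the most common class. The sketch below runs on randomly generated data, so its numbers say nothing about SNDL. It only shows the shape of the comparison, and the baseline is an addition for illustration, not part of our original analysis.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# random stand-in data: one z-score-like feature and unrelated labels
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(60, 1))
y_demo = rng.choice(['rise', 'fall'], size=60)

cv = StratifiedKFold(n_splits=10)

# out-of-fold predictions: each day is predicted by a model that never saw it
y_pred_knn = cross_val_predict(KNeighborsClassifier(n_neighbors=5), x_demo, y_demo, cv=cv)
y_pred_base = cross_val_predict(DummyClassifier(strategy='most_frequent'), x_demo, y_demo, cv=cv)

acc_knn = accuracy_score(y_demo, y_pred_knn)
acc_base = accuracy_score(y_demo, y_pred_base)
print(f'5-NN accuracy: {acc_knn:.2f}, majority-class baseline: {acc_base:.2f}')
```

If the classifier does not clearly beat the baseline, the feature is carrying little usable signal.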

In [None]:
# plot confusion matrix for k = 5
conf_matrix_sndl_plot = ConfusionMatrixDisplay(conf_matrix_sndl, display_labels=np.unique(y_true))
conf_matrix_sndl_plot.plot()
plt.suptitle('K=5 NN Classifier for SNDL\'s price movement');
# save figure
plt.savefig('sndl_k5_conf_matrix')

In [None]:
fig, ax = plt.subplots(1, 2, figsize=(12,10))
# order for violins
order = ['rise', 'no change', 'fall']

# variables for plotting actual values
price_movement_actual_sndl = np.array(df_sndl['price movement'].values)
z_score_actual_sndl = np.array(df_sndl['SNDL'].values, dtype=float)

# plot actual values
sns.violinplot(x=price_movement_actual_sndl, 
               y=z_score_actual_sndl, 
               ax=ax[0], 
               order=order)

# variables for plotting predicted values
price_movement_pred_sndl = y_pred
z_score_pred_sndl = x_test_total.ravel()
sns.violinplot(x=price_movement_pred_sndl, 
               y=z_score_pred_sndl, 
               ax=ax[1], 
               order=order)

# labelling
ax[0].set_title("Actual distribution of SNDL daily mention z-scores per category")
ax[0].set_ylabel('z-score')
ax[1].set_title("Predicted distribution of SNDL daily mention z-scores per category")
ax[1].set_ylabel('z-score')

# formatting and saving
fig.tight_layout()
fig.savefig('sndl_violinplot')

## Estimation 2 - TSLA
After seeing our initial model's poor performance in predicting SNDL's price movements, we decided to train a new model, this time on TSLA's price movements and daily mentions. We chose TSLA because it had a similarly high variance to SNDL but more daily mentions overall, as can be seen below.

In [None]:
df_counts_tsla = df_count_unstack.loc[['TSLA']]

df_counts_tsla.head(10)

In [None]:
# convert mentions to z-scores
# transpose df_counts_tsla to make the z-score operation easier; we transpose back below
df_counts_tsla_t = df_counts_tsla.T

# find zscore of mentions of each day
from scipy.stats import zscore
df_counts_tsla_t = df_counts_tsla_t.apply(zscore)

# undo transpose
df_counts_tsla = df_counts_tsla_t.T

# transpose so that each date becomes a row, like df_tsla_history, and
# turn the hierarchical index into ordinary columns
df_tsla_no_multi_index = df_counts_tsla.T.reset_index()
df_tsla = df_tsla_no_multi_index.reset_index()

# remove unnecessary columns
del df_tsla['level_0']
del df_tsla['index']

df_tsla.head()

In [None]:
# Actual Price movements -- if the close price rose or fell each day
# get price history
df_tsla_history = ticker_history('tsla')

# make sure that it is sorted by date so that percent change in price is a meaningful measure
df_tsla_history['date'] = pd.to_datetime(df_tsla_history['date']).dt.date
df_tsla_history = df_tsla_history.sort_values(by='date')

# create column for the fractional change in closing price (0.05 means +5%)
df_tsla_history['close % change'] = df_tsla_history['close'].pct_change()

# create categories from percent change -- rise, fall, no change
df_tsla_history.loc[df_tsla_history['close % change'] > 0, 'rise/fall'] = 'rise'
df_tsla_history.loc[df_tsla_history['close % change'] < 0, 'rise/fall'] = 'fall'
df_tsla_history.loc[df_tsla_history['close % change'] == 0, 'rise/fall'] = 'no change'

df_tsla_history.head()

In [None]:
# combine datasets (reddit + marketstack)
# ensure both date columns are datetime.date objects for consistency
df_tsla['date'] = pd.to_datetime(df_tsla['date']).dt.date
df_tsla_history['date'] = pd.to_datetime(df_tsla_history['date']).dt.date


# remove dates that aren't in both dataframes -- account for markets being closed on weekends and other factors
date_list = df_tsla['date'].tolist()

df_tsla_history = df_tsla_history[df_tsla_history['date'].isin(date_list)]

tsla_date_list = df_tsla_history['date'].tolist()
df_tsla = df_tsla[df_tsla['date'].isin(tsla_date_list)]

df_tsla_history = df_tsla_history.sort_values(by='date')
df_tsla = df_tsla.sort_values(by='date')

# add price movements column - we can just assign a copy since we know both dataframes 
# only contain the same dates and are sorted in the same way
df_tsla['price movement'] = df_tsla_history['rise/fall'].values

df_tsla.head()

In [None]:
# split columns into x (features) and y (labels)
x_feat_list = ['TSLA']
y_feat = 'price movement'


# get x and y for classifiers
x = df_tsla.loc[:, x_feat_list].values
y_true = df_tsla.loc[:, y_feat].values

# we will be using 10 fold cross validation
n_splits = 10

# dictionary for confusion matrix
conf_matrix_tsla_dict = {}

# number of neighbors to use
k = 5

# initialize knn_classifier
knn_classifier = KNeighborsClassifier(n_neighbors=k)

# cross validation
kfold = StratifiedKFold(n_splits=n_splits)

# y_pred is empty array same size as y_true
y_pred = np.empty_like(y_true)

# for plotting
x_test_total = np.empty_like(x)

for train_idx, test_idx in kfold.split(x, y_true):
    # training data
    x_train = x[train_idx, :]
    y_true_train = y_true[train_idx]

    # testing data
    x_test = x[test_idx, :]
    y_true_test = y_true[test_idx]
    
    x_test_total[test_idx] = x_test

    # train on training data
    knn_classifier.fit(x_train, y_true_train)

    # predict the price movement for each day in the test fold
    y_pred[test_idx] = knn_classifier.predict(x_test)


# generate confusion matrix
conf_matrix_tsla = confusion_matrix(y_true=y_true, y_pred=y_pred)

In [None]:
# plot confusion matrix for k = 5
conf_matrix_tsla_plot = ConfusionMatrixDisplay(conf_matrix_tsla, display_labels=np.unique(y_true))
conf_matrix_tsla_plot.plot()
plt.suptitle('K=5 NN Classifier for TSLA\'s price movement');
# save figure
plt.savefig('tsla_k5_conf_matrix')

In [None]:
# set up subplots
fig, ax = plt.subplots(1, 2, figsize=(12,10))

# order for violinplots
order = ['rise', 'no change', 'fall']

# true values for plotting
price_movement_actual = np.array(df_tsla['price movement'].values)
z_score_actual = np.array(df_tsla['TSLA'].values, dtype=float)

# plot actual values
sns.violinplot(x=price_movement_actual, 
               y=z_score_actual, 
               ax=ax[0], 
               order=order)

# predicted values for plotting
price_movement_pred = y_pred
z_score_pred = x_test_total.ravel()

# plot predicted values
sns.violinplot(x=price_movement_pred, 
               y=z_score_pred, 
               ax=ax[1],
               order=order)

# labeling
ax[0].set_title("Actual distribution of TSLA daily mention z-scores per category")
ax[0].set_ylabel('z-score')
ax[1].set_title("Predicted distribution of TSLA daily mention z-scores per category")
ax[1].set_ylabel('z-score')

# formatting and saving
fig.tight_layout()
fig.savefig('tsla_violinplot')

# Discussion

As can be seen above, neither of our models was able to consistently make good predictions about the movement of \\$SNDL's or \\$TSLA's price based solely on the change in the amount of discussion on /r/WallStreetBets. By the time we were training our models and validating their results, we had come to expect an outcome like this for several reasons:
- Stock prices can be highly volatile, and the change in discussion on one subreddit did not give us enough information to predict their movements accurately.
- We only considered the number of times each stock symbol was mentioned, without the context of whether it was being discussed positively or negatively. On days when the price dropped considerably there could have been more negative discussion, and on days when it rose considerably there could have been more positive discussion, but our model has no way of telling the two apart.
- /r/WallStreetBets as a whole is more focused on memes and jokes than on serious discussion of the stock market. While there definitely are instances of high-quality analysis and people who invest intelligently, filtering out the memes and spam posts proved to be more of a challenge than we were expecting.

Additionally, early on in our data collection, we focused on collecting all available information from the subreddit: as many posts as we could reasonably scrape, plus all the comments on those posts. Because of the time taken to scrape and clean this data, we did not have data going back more than a few months. If we had instead scraped fewer posts per day over a longer time period, we may have achieved better results, as we would have had a more robust dataset.

We also decided to look only at stocks listed on the NASDAQ, again because of the time taken to count how many times each symbol was mentioned. It is possible that we would have achieved better results for a popular stock like \\$GME, which isn't listed on the NASDAQ.

To remedy these shortcomings in a future version of this project, we would like to source information from a wider range of investing-related subreddits that are more serious than /r/WallStreetBets, to see whether more serious discussion would yield better results.

We would also like to account for the fact that discussion of a stock can increase both when the price rises and when it falls. We could do this by including information about the sentiment of each day's comments, for example whether they are overwhelmingly positive, overwhelmingly negative, or somewhere in between.

### Takeaway

We do not believe that anyone should pick which stocks to buy or sell based on the results of our project, due to the low accuracy of both our models. However, it could provide a good basis for someone looking to scrape online discussion for another project, whether or not it is about the stock market. Before doing so, they should carefully consider the source of the information they are scraping, specifically whether the people who post on the site or forum tend to make serious remarks or jokes, as this could affect the quality of their final results.

As far as our own work is concerned, we do not believe that there are any major ethical concerns with the data we collected, as it is all posted publicly under the users' usernames rather than their real names, and we do not consider the content of the posts or comments beyond how many times they mention a particular stock's ticker. Similarly, we do not believe that there are ethical concerns with our results, as we are recommending that they not be used as the basis for buying and selling stocks due to the low accuracy.