In [58]:
import os
import sys
import datetime as dt

os.chdir("../Automated/")
from DataGathering import RedditScraper
from ChangePointAnalysis import ChangePointAnalysis
from NeuralNets import CreateNeuralNets
from matplotlib import pyplot as plt
import numpy as np

subreddit = "WallStreetBets"

# Data Scraping

To analyze WallStreetBets data, we recommend downloading full.csv from [url] and placing it in ../Data/subreddit_WallStreetBets/.

If you want to scrape a different subreddit, you can use the following cell. You will need an API.env file with the appropriate credentials in /Automated/.

In [2]:

start = dt.datetime(2020, 1, 1)
end = dt.datetime(2020, 1, 30)

if not os.path.exists(f"../Data/subreddit_{subreddit}/full.csv"):
    print("Did not find scraped data, scraping.")

    RedditScraper.scrape_data(subreddits = [subreddit], start = start, end = end)

# Change Point Analysis

The next cell opens full.csv, finds every word that is among the daily_words most popular words on at least one day, and then runs the change point analysis model on each of those words.


The first time this is run, a cleaned-up version of the dataframe is created for ease of processing.
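
As a rough illustration of the word selection described above, here is a minimal sketch. The function name `popular_words`, the lowercase whitespace tokenization, and the toy titles are assumptions made for this example; the actual cleaning done inside ChangePointAnalysis is not shown in this notebook and may differ.

```python
from collections import Counter


def popular_words(titles_by_day, daily_words):
    """Return the union, over all days, of each day's most frequent words.

    titles_by_day maps a day label to the list of post titles from that day.
    Tokenization here is a plain lowercase whitespace split (an assumption).
    """
    popular = set()
    for titles in titles_by_day.values():
        counts = Counter(
            word for title in titles for word in title.lower().split()
        )
        # Keep the `daily_words` most common words of this day.
        popular.update(word for word, _ in counts.most_common(daily_words))
    return popular


# Toy data, invented for illustration only.
titles_by_day = {
    "2020-01-01": ["gme to the moon", "gme calls", "buy gme"],
    "2020-01-02": ["tsla puts", "tsla earnings", "short tsla"],
}
print(sorted(popular_words(titles_by_day, daily_words=1)))  # ['gme', 'tsla']
```

Because the result is a union over days, the number of selected words can be much larger than daily_words.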




In [None]:
up_to = 1 # Only calculate change points for up_to of the popular words. Set to None to do all of them.
daily_words = 2 # Get the daily_words most popular words on each day.


# Compute the changepoints
ChangePointAnalysis.changepointanalysis([subreddit], up_to = up_to, daily_words = daily_words)


Computing the changepoints:
working on  WallStreetBets
['log', 'kodak', 'short', 'mnmd', 'monday', 'snap', 'month', 'wsb', 'next', 'slv', 'questrade', 'rh', 'gold', 'iran', 'mascot', 'ev', 'covid', 'rkt', 'dd', 'fed', 'biden', 'may', 'apes', 'nikola', 'robinhood', 'nkla', 'elon', 'options', 'buy', 'coronavirus', 'clov', 'msft', 'azn', 'year', 'july', 'suicide', 'citron', 'trevor', 'merry', 'gme', 'detroit', 'plug', 'debate', 'mt', 'tomorrow', 'wkhs', 'test', 'bezos', 'new', 'trading', 'spy', 'two', 'one', 'twitter', 'election', 'get', 'christmas', 'tendies', 'sndl', 'tsla', 'puts', 'oil', 'sos', 'virus', 'war', 'moon', 'spce', 'k', 'week', 'today', 'hold', 'dogecoin', 'uso', 'amd', 'trump', 'earnings', 'mvis', 'jpow', 'im', 'amc', 'bear', 'currency', 'day', 'market', 'pltr', 'tesla', 'kodk', 'stock', 'gang', 'go', 'stocks', 'bb', 'money', 'closed', 'nio', 'coin', 'prpl', 'calls', 'like', 'fortnite'] pop_words
C:\Users\lnajt\Documents\GitHub\ErdosInstitute\ErdosInstitute-SIG_Project\Aut

  trace = pm.sample(steps, tune=tune, step = step)
Multiprocess sampling (4 chains in 4 jobs)
CompoundStep
>Metropolis: [delta]
>Metropolis: [change_point_beta]
>Metropolis: [tau_beta]
>Metropolis: [beta_2]
>Metropolis: [alpha_2]
>Metropolis: [beta_1]
>Metropolis: [alpha_1]


Sampling 4 chains for 5_000 tune and 30_000 draw iterations (20_000 + 120_000 draws total) took 395 seconds.
The rhat statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.


Change Point Guess 0.25
{'change_point_confidence': [0.25], 'mus': [(0.009015622219098773, 0.011524003066574464)], 'mu_diff': [0.0025083808474756913], 'tau_map': ['2020-01-01'], 'tau_std': [137.09623688628048], 'entropy': [5.218872264011381], 'change_point_guess': [0.25]}
appending
working on kodak


  trace = pm.sample(steps, tune=tune, step = step)
Multiprocess sampling (4 chains in 4 jobs)
CompoundStep
>Metropolis: [delta]
>Metropolis: [change_point_beta]
>Metropolis: [tau_beta]
>Metropolis: [beta_2]
>Metropolis: [alpha_2]
>Metropolis: [beta_1]
>Metropolis: [alpha_1]


After running, the output files will be in ../Data/subreddit_{subreddit}/Changepoints/Metropolis_30000Draws_5000Tune

(The final folder corresponds to the parameters of the Markov chain used by pymc3 for the inference.)

For instance: 
    
![title](../Data/subreddit_WallStreetBets/Changepoints/Metropolis_30000Draws_5000Tune/ChangePoint_virus.png)

### Brief explanation of how this works:

The Bayesian model is as follows:

1. A coin that lands heads with probability p is flipped.
2. If the coin comes up heads, then there is a change point. Otherwise, there is no change point.
3. The observed daily frequencies are assumed to be independent draws from a beta distribution. If the coin decided there would be no change point, the same beta distribution is used at all times. Otherwise, a different beta distribution is used on each side of the change point.

The posterior distribution of p is the model's confidence that there is a change point, and the posterior distribution of tau represents its guess about when the change occurred.

Of course, this is not a realistic picture of the process; the independence of the different draws from the betas is especially unlike the data. However, it appears to be good enough to discover change points.

As currently written, the model handles only one change point, but it could be extended to handle several.
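
To make the generative story in steps 1 to 3 concrete, here is a minimal simulation sketch using only the Python standard library. It is not the pymc3 model used by ChangePointAnalysis. The function name `simulate_frequencies`, the uniform choice of the change point location, and the beta parameters are all assumptions chosen for illustration.

```python
import random


def simulate_frequencies(n_days, p, rng, before=(2.0, 200.0), after=(6.0, 200.0)):
    """Draw one synthetic frequency series from the coin-flip change point story.

    before/after are (alpha, beta) parameters of the two beta distributions;
    the defaults are hypothetical values, not fitted ones.
    """
    # Steps 1 and 2: flip the coin to decide whether a change point exists.
    has_change_point = rng.random() < p
    # Assumption: if there is a change point, its day is uniform on 1..n_days-1.
    # tau == n_days means the "before" distribution is used on every day.
    tau = rng.randrange(1, n_days) if has_change_point else n_days
    # Step 3: independent beta draws, switching parameters at tau.
    freqs = []
    for day in range(n_days):
        alpha, beta = before if day < tau else after
        freqs.append(rng.betavariate(alpha, beta))
    return has_change_point, tau, freqs


rng = random.Random(0)
changed, tau, freqs = simulate_frequencies(n_days=60, p=0.5, rng=rng)
print(changed, tau, len(freqs))
```

Inference runs this story in reverse: given an observed frequency series, the sampler estimates how plausible each value of p and tau is.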

# Neural Nets

The following code will train a neural net that predicts, given a submission's title text and time of posting, whether that submission's score will be above the median. 

We use pre-trained GloVe word embeddings in order to convert the title text into a vector that can be used in the neural net. These word embeddings are tuned along with the model parameters as the model is being trained. 

This technique and the neural net's architecture are taken from a blog post of Max Woolf, https://minimaxir.com/2017/06/reddit-deep-learning/.
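
The prediction target, "score above the median", can be sketched as follows. The helper names `label_above_median` and `dummy_accuracy` and the toy scores are assumptions for this example; the labeling inside CreateNeuralNets is not shown here and may differ in detail.

```python
from statistics import median


def label_above_median(scores):
    """Label each score 1 if it is strictly above the median, else 0."""
    cutoff = median(scores)
    return [1 if score > cutoff else 0 for score in scores]


def dummy_accuracy(labels):
    """Accuracy of a baseline that always predicts 0 (not above the median)."""
    return labels.count(0) / len(labels)


# Toy scores, invented for illustration only.
labels = label_above_median([1, 1, 2, 5, 40, 300])
print(labels)                  # [0, 0, 0, 1, 1, 1]
print(dummy_accuracy(labels))  # 0.5

# Ties at the median push the baseline above 0.5.
print(dummy_accuracy(label_above_median([1, 1, 1, 1, 5])))  # 0.8
```

Ties at the median are one way a constant baseline can score slightly above one half.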


In [5]:
model, accuracies, word_tokenizer, df = CreateNeuralNets.buildnets(['wallstreetbets'])[0]

Starting Post Classification Model.


  exec(code_obj, self.user_global_ns, self.user_ns)


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Using the dummy classifier (assuming all posts are less than or equal to the median), the accuracy is: 
0.5047719867012919
The accuracy of the model on the validation set is: 
0.7247766852378845
The accuracy of the model on the test set is: 
0.7211379408836365


## Predicted popularity as a time series

We now show how the predicted popularity of a post depends on the day on which it was posted. 
We plot the prediction for the same title, "GME GME GME GME GME GME", as if it were posted at noon each day. 
The variance of the prediction seems to decrease after the GameStop short squeeze of early 2021.

In [60]:
text = "GME GME GME GME GME GME"
CreateNeuralNets.timeseries(df, text, model, word_tokenizer)

This will produce a picture like the following:
![title](../Data/subreddit_WallStreetBets/6_GME.png)


## Workshopping example
Here we start with a potential title (to be posted at noon on April 1, 2021) and attempt to improve it based on the model's prediction. 

In [43]:
# Date information for noon on Thursday, April 1, 2021.
# Note: we normalize so that the earliest year in our data set (2020)
# and the earliest day of the year both correspond to the number 0.
input_hour = np.array([12])
input_dayofweek = np.array([3])
input_minute = np.array([0])
input_dayofyear = np.array([91])
input_year = np.array([0])
input_info = [input_hour, input_dayofweek, input_minute, input_dayofyear, input_year]
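
For reference, the date features can also be derived from a datetime instead of being typed by hand. The sketch below is an assumption-laden illustration: the helper name `date_features` is hypothetical, and the normalization (zero-based day of year, years counted from 2020) follows a literal reading of the comment in the cell above rather than the preprocessing code inside CreateNeuralNets, which is not shown here.

```python
import datetime as dt


def date_features(when, first_year=2020):
    """Return [hour, dayofweek, minute, dayofyear, year] for one posting time.

    Assumed normalization: Monday is day 0 of the week, January 1 is day 0
    of the year, and first_year maps to year 0.
    """
    hour = when.hour
    dayofweek = when.weekday()                 # Monday = 0
    minute = when.minute
    dayofyear = when.timetuple().tm_yday - 1   # zero-based day of the year
    year = when.year - first_year
    return [hour, dayofweek, minute, dayofyear, year]


print(date_features(dt.datetime(2021, 4, 1, 12, 0)))  # [12, 3, 0, 90, 1]
```

Under this reading, April 1, 2021 maps to day of year 90 and year 1, whereas the cell above uses 91 and 0. Which convention is correct depends on how the training data was normalized.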

In [19]:
# Given a list of potential titles, print the model's predicted probability
# that each one scores above the median.
def CheckPopularity(potential_titles):
    for title in potential_titles:
        print(model.predict([CreateNeuralNets.encode_text(title,word_tokenizer)] + input_info)[0][0][0])

In [21]:
potential_titles = ["Buy TSLA", "Buy TSLA! I like the stock", "Buy TSLA! Elon likes the stock",
                    "TSLA is the next GME. Elon likes the stock", 
                    "TSLA is the next GME. To the moon! Elon likes the stock"]

In [22]:
CheckPopularity(potential_titles)

0.9536921
0.957647
0.9620316
0.98298347
0.983858


We see that the predicted popularity increases after each change we make. 

Disclaimer: we are investigating a last-minute issue. These probabilities are higher than expected from earlier experimentation, and there is the possibility of a bug in our code. 