# Wall Street Bets (WSB) sub-reddit post analysis

[Wall Street Bets](https://www.reddit.com/r/wallstreetbets/), according to [Wikipedia](https://en.wikipedia.org/wiki/R/wallstreetbets), is:

 > A subreddit where participants discuss stock and option trading. It has become notable for its profane nature, aggressive trading strategies, and role in the GameStop short squeeze that caused losses on short positions in U.S. firms topping US$70 billion in a few days in early 2021. The subreddit is famous for its colorful jargon and terms.

Following on from the Wikipedia summary, we will examine the short squeeze that occurred from around the 12th of Jan 2021 to the 4th of Feb 2021 in GameStop (NYSE: GME) and AMC Entertainment Holdings (NYSE: AMC).

For more information see [this Bloomberg article](https://www.bloomberg.com/news/features/2021-02-04/gamestop-gme-how-wallstreetbets-and-robinhood-created-bonkers-stock-market) and [this blog post](https://www.wallstreetbets.shop/blogs/news/dissecting-the-unique-lingo-and-terminology-used-in-the-subreddit-r-wallstreetbets) from the official WSB merch shop on the terms and emojis used on the sub-reddit.


 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns 
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize


In [2]:
wsb = pd.read_csv('data/reddit_wsb.csv')
wsb.head()

Unnamed: 0,title,score,id,url,comms_num,created,body,timestamp
0,"It's not about the money, it's about sending a...",55,l6ulcx,https://v.redd.it/6j75regs72e61,6,1611863000.0,,2021-01-28 21:37:41
1,Math Professor Scott Steiner says the numbers ...,110,l6uibd,https://v.redd.it/ah50lyny62e61,23,1611862000.0,,2021-01-28 21:32:10
2,Exit the system,0,l6uhhn,https://www.reddit.com/r/wallstreetbets/commen...,47,1611862000.0,The CEO of NASDAQ pushed to halt trading “to g...,2021-01-28 21:30:35
3,NEW SEC FILING FOR GME! CAN SOMEONE LESS RETAR...,29,l6ugk6,https://sec.report/Document/0001193125-21-019848/,74,1611862000.0,,2021-01-28 21:28:57
4,"Not to distract from GME, just thought our AMC...",71,l6ufgy,https://i.redd.it/4h2sukb662e61.jpg,156,1611862000.0,,2021-01-28 21:26:56


In [3]:
wsb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35110 entries, 0 to 35109
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   title      35110 non-null  object 
 1   score      35110 non-null  int64  
 2   id         35110 non-null  object 
 3   url        35110 non-null  object 
 4   comms_num  35110 non-null  int64  
 5   created    35110 non-null  float64
 6   body       18122 non-null  object 
 7   timestamp  35110 non-null  object 
dtypes: float64(1), int64(2), object(5)
memory usage: 2.1+ MB


Parse the timestamp column as a datetime and extract the date, weekday, and hour.

In [8]:
# Parse the timestamp once, then derive the date parts through the .dt accessor
wsb['timestamp'] = pd.to_datetime(wsb['timestamp'])
wsb['date'] = wsb['timestamp'].dt.normalize()  # midnight of the same day, kept as datetime64[ns]
wsb['weekday'] = wsb['timestamp'].dt.weekday   # Monday=0 ... Sunday=6
wsb['hour'] = wsb['timestamp'].dt.hour
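
To make the date extraction easier to check by eye, here is a minimal sketch on a two-row toy frame. The frame `toy` and the extra `day_name` column are illustrative assumptions and not part of the notebook; the two timestamps are copied from the `head()` preview above.

```python
import pandas as pd

# Toy frame standing in for wsb (assumption: timestamps taken from the head() preview)
toy = pd.DataFrame({"timestamp": ["2021-01-28 21:37:41", "2021-01-28 21:32:10"]})

# Parse the strings once, then read the parts off the .dt accessor
toy["timestamp"] = pd.to_datetime(toy["timestamp"])
toy["date"] = toy["timestamp"].dt.normalize()     # same day at 00:00:00, still a datetime
toy["weekday"] = toy["timestamp"].dt.weekday      # Monday=0 ... Sunday=6
toy["hour"] = toy["timestamp"].dt.hour
toy["day_name"] = toy["timestamp"].dt.day_name()  # human-readable weekday (illustrative extra)

print(toy)
```

28 Jan 2021 was a Thursday, so `weekday` should come out as 3 for both rows.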

In [9]:
# Sanity check
wsb.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35110 entries, 0 to 35109
Data columns (total 11 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   title      35110 non-null  object        
 1   score      35110 non-null  int64         
 2   id         35110 non-null  object        
 3   url        35110 non-null  object        
 4   comms_num  35110 non-null  int64         
 5   created    35110 non-null  float64       
 6   body       18122 non-null  object        
 7   timestamp  35110 non-null  datetime64[ns]
 8   date       35110 non-null  datetime64[ns]
 9   weekday    35110 non-null  int64         
 10  hour       35110 non-null  int64         
dtypes: datetime64[ns](2), float64(1), int64(4), object(4)
memory usage: 2.9+ MB


In [10]:
# Examine missing data
for column in wsb.columns:
    print(f'Column {column}', f'has {100 * sum(wsb[column].isnull())/len(wsb):.2f}% missing data')
    print()

Column title has 0.00% missing data

Column score has 0.00% missing data

Column id has 0.00% missing data

Column url has 0.00% missing data

Column comms_num has 0.00% missing data

Column created has 0.00% missing data

Column body has 48.39% missing data

Column timestamp has 0.00% missing data

Column date has 0.00% missing data

Column weekday has 0.00% missing data

Column hour has 0.00% missing data



From the output above we see that the only column with missing values is 'body', with ~48% of its values missing.
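
The same percentages can be computed without an explicit loop. The sketch below uses a made-up three-row frame (`toy`, with invented titles and bodies) and also shows one possible way to cope with the empty bodies: falling back to the title by building a combined `text` column. That column is an assumption for illustration and is not something the notebook defines.

```python
import pandas as pd

# Made-up rows for illustration only; two of the three bodies are missing
toy = pd.DataFrame({
    "title": ["GME to the moon", "Exit the system", "AMC gang"],
    "body": [None, "Some longer post text", None],
})

# isnull() gives a boolean frame; the column-wise mean is the missing fraction
missing_pct = toy.isnull().mean().mul(100).round(2)
print(missing_pct)

# One option (assumption): join title and body so every post has some text
toy["text"] = (toy["title"] + " " + toy["body"].fillna("")).str.strip()
print(toy["text"].tolist())
```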

Before going any further we will define some of the terms/words/abbreviations that we may see in these posts:
- GME --> The ticker code for GameStop
- AMC --> The ticker code for AMC Entertainment Holdings
- Robinhood (RH) --> A US brokerage firm offering 'free' trading on ETFs and equities. It is very popular with millennials (Gen Y) and Gen Z. [See here](https://robinhood.com/us/en/)
- NOK --> The ticker code for Nokia. This stock was also a target of the WSB community

For more info see [this blog post](https://www.wallstreetbets.shop/blogs/news/dissecting-the-unique-lingo-and-terminology-used-in-the-subreddit-r-wallstreetbets).
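
With the tickers defined, a rough way to see how often each one comes up is to count the posts whose title mentions it. This is only a sketch under stated assumptions: the titles are invented, the `TICKERS` list and the `mentions` dictionary are hypothetical names, and the match is case-sensitive on whole words, so lowercase mentions such as "gme" are not counted.

```python
import re
import pandas as pd

TICKERS = ["GME", "AMC", "NOK"]  # tickers from the list above

# Invented titles for illustration; in the notebook this would be wsb['title']
titles = pd.Series([
    "GME and AMC holding strong",
    "Bought more NOK today",
    "Exit the system",
    "GME GME GME",
])

mentions = {}
for ticker in TICKERS:
    # \b word boundaries stop the ticker matching inside a longer word
    pattern = rf"\b{re.escape(ticker)}\b"
    # Count posts that mention the ticker at least once (not total occurrences)
    mentions[ticker] = int(titles.str.contains(pattern, regex=True).sum())

print(mentions)
```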



In [23]:
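# discrete=True centres one bar on each integer weekday code (Monday=0 ... Sunday=6)
sns.histplot(x='weekday', hue='weekday', data=wsb, discrete=True)
# Pass tick positions and labels together; labels alone raise
# "ConversionError: Failed to convert value(s) to axis units" on a numeric axis
plt.xticks(ticks=range(7), labels=['Mon','Tues','Wed','Thurs','Fri','Sat','Sun'])
plt.show()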
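
The weekday codes follow the pandas convention of Monday=0 through Sunday=6. As a plot-free cross-check of the same distribution, the counts can be tabulated and relabelled directly. The sketch below uses a short made-up series of codes (`weekday`) in place of `wsb['weekday']`; the names `weekday_names` and `counts` are illustrative.

```python
import pandas as pd

weekday_names = ["Mon", "Tues", "Wed", "Thurs", "Fri", "Sat", "Sun"]

# Made-up weekday codes for illustration; in the notebook this would be wsb['weekday']
weekday = pd.Series([3, 3, 4, 0, 6, 3])

# Count posts per code, keep all seven days in calendar order, fill absent days with 0
counts = weekday.value_counts().reindex(range(7), fill_value=0)
counts.index = weekday_names

print(counts)
```

Reindexing on `range(7)` keeps days with no posts in the table, which a bare `value_counts()` would drop.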