# INFO 2950 - Project Phase IV
---

By: David Fleurantin (djf252) and Meredith Hu (mmh264)
<br>
GitHub: https://github.com/DavidFleurantin/INFO-2950-Final-Project
<br>
Cornell Box Link: https://cornell.box.com/s/hz8d86e2q8apoj28yz374zz9o8ukzvk6

In [1]:
# load libraries
import sys
import pandas as pd
import numpy as np
from IPython.display import Image, HTML
pd.options.mode.chained_assignment = None  # default='warn'
Image(url= "https://cdn.shopify.com/s/files/1/0072/7315/2579/articles/wallstreet_blog_grande.jpg?v=1598894414")

---
## Appendix: Data Cleaning Description


This project employs one primary dataset, which can be found on [Cornell Box](https://cornell.box.com/s/hz8d86e2q8apoj28yz374zz9o8ukzvk6).
- #### Reddit posts and necessary metadata sourced from r/wallstreetbets over a 2-month span (Jan 28 - Present)
Below we document every step that takes the raw data file(s) and turns them into the analysis-ready data used throughout the remainder of our project.

#### r/wallstreetbets Posts and Metadata

The dataset is sourced directly from [Kaggle](https://www.kaggle.com/gpreda/reddit-wallstreetsbets-posts). We first attempted to source the data from Reddit's own API. However, this approach proved complicated because Reddit does not allow access to posts too far back or allow direct query search. Likewise, collecting thousands of posts would have required too many API calls. The [Kaggle](https://www.kaggle.com/gpreda/reddit-wallstreetsbets-posts) dataset solves these issues directly and allowed us to focus on the time frame that captured GameStop's volatile and fluctuating stock price.

After downloading the CSV from [Kaggle](https://www.kaggle.com/gpreda/reddit-wallstreetsbets-posts), let's load it into a pandas dataframe (`reddit_data`) and print out the first few rows to better gauge what we are dealing with.

In [2]:
## load data
reddit_data = pd.read_csv("reddit_wsb.csv")

print([x for x in reddit_data.columns])
reddit_data.head()

['title', 'score', 'id', 'url', 'comms_num', 'created', 'body', 'timestamp']


Unnamed: 0,title,score,id,url,comms_num,created,body,timestamp
0,"It's not about the money, it's about sending a...",55,l6ulcx,https://v.redd.it/6j75regs72e61,6,1611863000.0,,2021-01-28 21:37:41
1,Math Professor Scott Steiner says the numbers ...,110,l6uibd,https://v.redd.it/ah50lyny62e61,23,1611862000.0,,2021-01-28 21:32:10
2,Exit the system,0,l6uhhn,https://www.reddit.com/r/wallstreetbets/commen...,47,1611862000.0,The CEO of NASDAQ pushed to halt trading “to g...,2021-01-28 21:30:35
3,NEW SEC FILING FOR GME! CAN SOMEONE LESS RETAR...,29,l6ugk6,https://sec.report/Document/0001193125-21-019848/,74,1611862000.0,,2021-01-28 21:28:57
4,"Not to distract from GME, just thought our AMC...",71,l6ufgy,https://i.redd.it/4h2sukb662e61.jpg,156,1611862000.0,,2021-01-28 21:26:56


The data appears to be nicely organized. However, the 'created', 'url', and 'id' columns are not needed for our planned analysis, so let's remove them from the dataframe.

In [3]:
reddit_data = reddit_data.drop(columns=['url', 'id', 'created'])

reddit_data.head()

Unnamed: 0,title,score,comms_num,body,timestamp
0,"It's not about the money, it's about sending a...",55,6,,2021-01-28 21:37:41
1,Math Professor Scott Steiner says the numbers ...,110,23,,2021-01-28 21:32:10
2,Exit the system,0,47,The CEO of NASDAQ pushed to halt trading “to g...,2021-01-28 21:30:35
3,NEW SEC FILING FOR GME! CAN SOMEONE LESS RETAR...,29,74,,2021-01-28 21:28:57
4,"Not to distract from GME, just thought our AMC...",71,156,,2021-01-28 21:26:56


Reddit gives users the option to add a body to whatever they post. Sometimes posts that generate a lot of traffic have no body at all! The text in the 'body' column is not of much direct interest to us. However, we felt it was necessary to capture the length of each post body, when one was provided, as a numerical value.

In [4]:
reddit_data['body'][2]

'The CEO of NASDAQ pushed to halt trading “to give investors a chance to recalibrate their positions”.\n\n[https://mobile.twitter.com/Mediaite/status/1354504710695362563](https://mobile.twitter.com/Mediaite/status/1354504710695362563)\n\nNow SEC is investigating, brokers are disallowing buying more calls. This is the institutions flat out admitting they will change the rules to bail out the rich but if it happens to us, we get a “well shucks you should have known investing is risky! have you tried cutting out avocados and coffee, maybe doing Uber on the side?”\n\nWe may have collectively driven up enough sentiment in wall street to make other big players go long on GME with us (we do not have the money to move the stock as much as it did alone). we didn’t hurt wall street as a whole, just a few funds went down while others went up and profited off the shorts the same as us. The media wants to pin the blame on us.\n\nIt should be crystal clear that this is a rigged game by now

In [5]:
print("Length of the post body is {}".format(len(reddit_data['body'][2])))

Length of the post body is 1319


We can start by converting each string to its character length using the [str.len()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.len.html) method for pandas Series. Notice that for the third row (index 2) the length of the post body is again 1319, as before. We will create a new column, `body_len`, to store this information.

In [6]:
reddit_data['body_len'] = reddit_data['body'].str.len()

reddit_data.head()

Unnamed: 0,title,score,comms_num,body,timestamp,body_len
0,"It's not about the money, it's about sending a...",55,6,,2021-01-28 21:37:41,
1,Math Professor Scott Steiner says the numbers ...,110,23,,2021-01-28 21:32:10,
2,Exit the system,0,47,The CEO of NASDAQ pushed to halt trading “to g...,2021-01-28 21:30:35,1319.0
3,NEW SEC FILING FOR GME! CAN SOMEONE LESS RETAR...,29,74,,2021-01-28 21:28:57,
4,"Not to distract from GME, just thought our AMC...",71,156,,2021-01-28 21:26:56,


Next we can convert the NaN values in the `body_len` column into 0s. This will indicate that the post has no body. For the NaN values in the `body` column, we will convert them to empty strings.

In [7]:
reddit_data['body_len'] = reddit_data['body_len'].fillna(0)

reddit_data['body'] = reddit_data['body'].fillna("")

reddit_data.head()

Unnamed: 0,title,score,comms_num,body,timestamp,body_len
0,"It's not about the money, it's about sending a...",55,6,,2021-01-28 21:37:41,0.0
1,Math Professor Scott Steiner says the numbers ...,110,23,,2021-01-28 21:32:10,0.0
2,Exit the system,0,47,The CEO of NASDAQ pushed to halt trading “to g...,2021-01-28 21:30:35,1319.0
3,NEW SEC FILING FOR GME! CAN SOMEONE LESS RETAR...,29,74,,2021-01-28 21:28:57,0.0
4,"Not to distract from GME, just thought our AMC...",71,156,,2021-01-28 21:26:56,0.0
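The `body_len` values above display as floats (`0.0`, `1319.0`) because `str.len()` returns NaN for missing bodies, and NaN forces a float column. The following is a minimal sketch, using a hypothetical three-row stand-in rather than the real posts, of how the same two steps behave and how the lengths could optionally be cast to integers afterwards. The integer cast is our assumption and is not part of the cleaning steps above.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row stand-in for the 'body' column (not real posts).
toy = pd.DataFrame({"body": ["to the moon", np.nan, ""]})

# str.len() propagates missing values, so a missing body gives NaN, not 0.
toy["body_len"] = toy["body"].str.len()

# Fill the gaps with 0, then (optionally) cast the lengths to integers.
toy["body_len"] = toy["body_len"].fillna(0).astype(int)

print(toy["body_len"].tolist())  # [11, 0, 0]
```

Note that after this step a missing body and an empty-string body both have length 0, so the two cases can no longer be told apart from `body_len` alone.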


Let's now examine the 'timestamp' column.

In [8]:
time = reddit_data['timestamp'][0]
print(type(time))

<class 'str'>


Each timestamp is a string object. It would be more useful to convert each one to a datetime object, as this makes comparisons easier.

In [9]:
reddit_data['timestamp'] = pd.to_datetime(reddit_data['timestamp'], format = '%Y-%m-%d %H:%M:%S')

In [10]:
time = reddit_data['timestamp'][0]
print(type(time))

<class 'pandas._libs.tslibs.timestamps.Timestamp'>


Let's do a simple comparison to test out the power of datetime.

In [11]:
time_2 = reddit_data['timestamp'][1]
time > time_2

True
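Beyond simple comparisons, parsed timestamps also support arithmetic and filtering. The sketch below reuses the two timestamps shown in the first rows of the table; the `cutoff` value is a hypothetical threshold chosen only for illustration.

```python
import pandas as pd

# The two timestamps from the first two rows shown earlier.
stamps = pd.Series(["2021-01-28 21:37:41", "2021-01-28 21:32:10"])
parsed = pd.to_datetime(stamps, format="%Y-%m-%d %H:%M:%S")

# Subtracting two Timestamps yields a Timedelta.
gap = parsed[0] - parsed[1]
print(gap)  # 0 days 00:05:31

# A boolean mask against a cutoff keeps only the later posts.
cutoff = pd.Timestamp("2021-01-28 21:35:00")  # hypothetical cutoff
recent = parsed[parsed > cutoff]
print(len(recent))  # 1
```

Neither operation would behave correctly on the raw strings: subtracting two strings raises an error, and comparing them only works here because the format happens to sort lexicographically.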

In [12]:
reddit_data.head()

Unnamed: 0,title,score,comms_num,body,timestamp,body_len
0,"It's not about the money, it's about sending a...",55,6,,2021-01-28 21:37:41,0.0
1,Math Professor Scott Steiner says the numbers ...,110,23,,2021-01-28 21:32:10,0.0
2,Exit the system,0,47,The CEO of NASDAQ pushed to halt trading “to g...,2021-01-28 21:30:35,1319.0
3,NEW SEC FILING FOR GME! CAN SOMEONE LESS RETAR...,29,74,,2021-01-28 21:28:57,0.0
4,"Not to distract from GME, just thought our AMC...",71,156,,2021-01-28 21:26:56,0.0


Our next step was to conduct a simple boolean search to make sure that the posts we analyze are relevant to GME/GameStop, since our dataset contains over 40,000 posts, as shown below. We achieved this by checking whether each post title contained 'GME', 'GameStop', or 'Game Stop', ignoring case.

In [13]:
reddit_data.describe()

Unnamed: 0,score,comms_num,body_len
count,42553.0,42553.0,42553.0
mean,1377.25453,203.011774,494.531008
std,8381.8057,2441.833305,1494.372728
min,0.0,0.0,0.0
25%,1.0,2.0,0.0
50%,25.0,11.0,0.0
75%,166.0,44.0,303.0
max,348241.0,93268.0,34984.0


In [14]:
#https://stackoverflow.com/questions/22909082/pandas-converting-string-object-to-lower-case-and-checking-for-string
reddit_data_gme_only = reddit_data[reddit_data["title"].str.contains("(?i)gme|Gamestop|Game Stop")]

In [15]:
reddit_data_gme_only.describe()

Unnamed: 0,score,comms_num,body_len
count,8809.0,8809.0,8809.0
mean,1819.785787,343.753548,526.592349
std,9491.172973,3693.024629,1501.850955
min,0.0,0.0,0.0
25%,1.0,2.0,0.0
50%,37.0,15.0,0.0
75%,248.0,62.0,323.0
max,225870.0,93268.0,34984.0


Using this boolean search method, we cut the dataset down to roughly 1/5 of its original size (8,809 of 42,553 posts) while also ensuring that the remaining posts were relevant to GameStop.

In [16]:
reddit_data_gme_only.head()

Unnamed: 0,title,score,comms_num,body,timestamp,body_len
1,Math Professor Scott Steiner says the numbers ...,110,23,,2021-01-28 21:32:10,0.0
3,NEW SEC FILING FOR GME! CAN SOMEONE LESS RETAR...,29,74,,2021-01-28 21:28:57,0.0
4,"Not to distract from GME, just thought our AMC...",71,156,,2021-01-28 21:26:56,0.0
8,Currently Holding AMC and NOK - Is it retarded...,200,161,,2021-01-28 21:19:16,0.0
11,GME Premarket 🍁 Musk approved 🎮🛑💎✋,562,97,,2021-01-28 21:17:28,0.0
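One caveat of the title search is that it matches substrings, so a word such as "judgment" also contains "gme". The sketch below, using invented titles rather than rows from the dataset, compares a substring pattern equivalent to the one above with a stricter word-boundary variant. The stricter pattern is an alternative we are suggesting for illustration, not the filter used to produce the counts above.

```python
import pandas as pd

# Hypothetical titles, invented to illustrate the matching behaviour.
titles = pd.Series([
    "GME to the moon",
    "Holding my gamestop shares",
    "Poor judgment on my AMC calls",
    "Game Stop earnings thread",
])

# Substring matching, case-insensitive (same behaviour as the search above).
loose = titles.str.contains("gme|gamestop|game stop", case=False, regex=True)

# Stricter variant: \b requires a word boundary on both sides of the match.
strict = titles.str.contains(r"\b(?:gme|gamestop|game stop)\b", case=False, regex=True)

print(loose.tolist())   # [True, True, True, True]
print(strict.tolist())  # [True, True, False, True]
```

The word-boundary version trades recall for precision: it drops "judgment" but would also miss run-together forms such as "GMEs", so which pattern is preferable depends on the analysis.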


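Now let's reset the index so that the remaining posts are numbered consecutively.

In [17]:
# reset_index returns a new dataframe, so assign the result to keep it
reddit_data_gme_only = reddit_data_gme_only.reset_index(drop=True)
reddit_data_gme_only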

Unnamed: 0,title,score,comms_num,body,timestamp,body_len
0,Math Professor Scott Steiner says the numbers ...,110,23,,2021-01-28 21:32:10,0.0
1,NEW SEC FILING FOR GME! CAN SOMEONE LESS RETAR...,29,74,,2021-01-28 21:28:57,0.0
2,"Not to distract from GME, just thought our AMC...",71,156,,2021-01-28 21:26:56,0.0
3,Currently Holding AMC and NOK - Is it retarded...,200,161,,2021-01-28 21:19:16,0.0
4,GME Premarket 🍁 Musk approved 🎮🛑💎✋,562,97,,2021-01-28 21:17:28,0.0
...,...,...,...,...,...,...
8804,GME Blood Money - Hedgies waited until stimmie...,75,19,&#x200B;\n\nhttps://preview.redd.it/sdyvo26u19...,2021-03-16 06:14:46,126.0
8805,GME update 3/15: honestly.... -$362K?? c'mon h...,6821,491,,2021-03-16 06:12:34,0.0
8806,Hey Elon When you take us to the moon... Let‘s...,180,10,,2021-03-16 06:00:15,0.0
8807,Did y’all really think the hedgies would just ...,19577,1725,"All weekend, aside from creating a WSB Zoo, (w...",2021-03-16 05:50:51,3066.0
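By default, `reset_index` returns a new dataframe instead of modifying the existing one in place. A minimal sketch with a hypothetical three-row frame (the index values mimic the gaps left by filtering) shows the difference between calling it with and without assignment:

```python
import pandas as pd

# Hypothetical filtered frame whose index still has gaps (1, 3, 4).
subset = pd.DataFrame({"score": [110, 29, 71]}, index=[1, 3, 4])

# Without assignment, the reset copy is discarded and `subset` is unchanged.
subset.reset_index(drop=True)
index_before = subset.index.tolist()

# Assigning the result keeps the new consecutive index.
subset = subset.reset_index(drop=True)
index_after = subset.index.tolist()

print(index_before, index_after)  # [1, 3, 4] [0, 1, 2]
```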


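The r/wallstreetbets dataset is ready for use! We can export the dataframe as a CSV for later use. **(Note: the export line is commented out; uncomment it to write the file.)**

In [19]:
# a plain relative path works on Windows, macOS, and Linux alike
# reddit_data_gme_only.to_csv('reddit_wsb_gme.csv', index_label=False)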
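A CSV file stores timestamps as plain text, so the datetime conversion done earlier has to be repeated when the file is loaded again. The sketch below round-trips a hypothetical one-row frame through an in-memory buffer instead of a file on disk. It uses `index=False`, which omits the index column entirely; this is our assumption for the sketch and differs from the `index_label=False` option in the export call above.

```python
import io

import pandas as pd

# Hypothetical one-row frame with the same column types as the cleaned data.
frame = pd.DataFrame({
    "title": ["GME to the moon"],
    "timestamp": pd.to_datetime(["2021-01-28 21:32:10"]),
    "body_len": [0.0],
})

# Write to an in-memory buffer rather than a real file.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime type that the CSV text does not preserve.
restored = pd.read_csv(buffer, parse_dates=["timestamp"])
print(restored.dtypes)
```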