## 02 Data Collection

In [1]:
import pandas as pd
import datetime as dt
import re
import numpy as np
from pmaw import PushshiftAPI

In [2]:
#PMAW is a wrapper for the Pushshift API which uses multithreading to retrieve Reddit comments and submissions. 
api = PushshiftAPI()

Initial tests suggest that the two groups differ in the number of posts by a factor of more than five on average, so the traffic of the group with fewer posts (r/stocks) determines the timeframe of the scraping process. As r/stocks averages 1,600 to 2,000 posts per month, a one-month timeframe is taken to guarantee enough posts from r/stocks. The timeframe chosen is January 1, 2022 to February 1, 2022.

In [3]:
# Set the parameters for scraping.
limit=10_000 # caps the number of posts scraped from r/wallstreetbets while keeping the imbalance between the two classes
after=int(dt.datetime(2022,1,1,0,0).timestamp()) 
before=int(dt.datetime(2022,2,1,0,0).timestamp()) #2022 Jan 1st - 2022 Feb 1st.
before,after

(1656604800, 1609430400)
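Calling `.timestamp()` on a naive `datetime` interprets it in the machine's local timezone, so the epoch values depend on where the notebook is run. The following is a minimal sketch, assuming the window is meant to start and end at midnight UTC (the notebook does not say); `utc_epoch` is a hypothetical helper name, not part of PMAW.

```python
import datetime as dt

def utc_epoch(year, month, day):
    """Return the Unix timestamp for midnight UTC on the given date."""
    moment = dt.datetime(year, month, day, tzinfo=dt.timezone.utc)
    return int(moment.timestamp())

# Assumed window: 2022 Jan 1st to 2022 Feb 1st, both at 00:00 UTC.
after = utc_epoch(2022, 1, 1)
before = utc_epoch(2022, 2, 1)
print(after, before)  # 1640995200 1643673600
```

Pinning the timezone makes the scraped window reproducible across machines.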

Next, the search_submissions method is used to scrape the posts, as it is not interrupted after a certain number of requests. The documentation for search_submissions in the PMAW library is at https://github.com/mattpodolak/pmaw/blob/master/README.md#pushshiftapi

In [4]:
%%time
submissions = api.search_submissions(subreddit="wallstreetbets",limit=limit,after=after,before=before,safe_exit=True)
df1 = pd.DataFrame(submissions)

CPU times: total: 469 ms
Wall time: 476 ms


In [5]:
%%time
submission2=api.search_submissions(subreddit='stocks',limit=limit, after=after,before=before,safe_exit=True)
df2=pd.DataFrame(submission2)

CPU times: total: 78.1 ms
Wall time: 82.2 ms
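The two scraping cells differ only in the subreddit name, so they could be folded into one helper. This is a sketch only: `fetch_submissions` is a hypothetical name, and `StubAPI` is a made-up offline stand-in that mimics the keyword arguments used above so the example runs without network access. It is not part of PMAW.

```python
import pandas as pd

def fetch_submissions(api, subreddit, after, before, limit):
    """Scrape one subreddit and return its submissions as a DataFrame."""
    submissions = api.search_submissions(
        subreddit=subreddit, limit=limit, after=after, before=before
    )
    return pd.DataFrame(submissions)

class StubAPI:
    """Hypothetical stand-in for PushshiftAPI, for illustration only."""
    def search_submissions(self, subreddit, limit, after, before):
        rows = [
            {"subreddit": subreddit, "title": "post a", "created_utc": after},
            {"subreddit": subreddit, "title": "post b", "created_utc": before - 1},
        ]
        return rows[:limit]

# One DataFrame per subreddit, keyed by subreddit name.
frames = {
    name: fetch_submissions(StubAPI(), name, after=1640995200,
                            before=1643673600, limit=10_000)
    for name in ("wallstreetbets", "stocks")
}
print({name: frame.shape for name, frame in frames.items()})
```

With the real client, `StubAPI()` would be replaced by the `api` object created earlier in the notebook.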


In [6]:
df1.head(2)

| | all_awardings | allow_live_comments | author | author_flair_css_class | author_flair_richtext | author_flair_text | author_flair_type | author_fullname | author_is_blocked | author_patreon_flair | ... | media | secure_media | media_embed | secure_media_embed | edited | banned_by | author_cakeday | live_audio | gilded | poll_data |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | False | Yeet_Me_N_The_Trash | | [] | | text | t2_7087ibf9 | False | False | ... | | | | | | | | | | |
| 1 | [] | False | Major-Attempt-8687 | | [] | | text | t2_9vzajejo | False | False | ... | | | | | | | | | | |


In [7]:
df1.shape, df2.shape

((10000, 85), (1639, 82))

The number of posts in these two subreddits differs greatly (10,000 versus 1,639), so we will be working with an imbalanced classification problem later.
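To put a number on the imbalance, the row counts from the shape output can be turned into a ratio and a majority-class share. This is a small worked calculation, not part of the original notebook. The share is roughly what a classifier that always predicts r/wallstreetbets would score on accuracy, before any rows are dropped.

```python
# Row counts taken from the df1.shape and df2.shape output.
n_wsb = 10_000
n_stocks = 1_639

ratio = n_wsb / n_stocks                       # posts in the larger class per post in the smaller
majority_share = n_wsb / (n_wsb + n_stocks)    # fraction of all posts from r/wallstreetbets

print(f"r/wallstreetbets to r/stocks ratio: {ratio:.2f}")        # 6.10
print(f"share of r/wallstreetbets posts: {majority_share:.1%}")  # 85.9%
```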

In [1]:
df1.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_is_blocked',
       'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post',
       'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_created_from_ads_ui', 'is_crosspostable', 'is_meta',
       'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable',
       'is_self', 'is_video', 'link_flair_background_color',
       'link_flair_css_class', 'link_flair_richtext', 'link_flair_template_id',
       'link_flair_text', 'link_flair_text_color', 'link_flair_type', 'locked',
       'media_only', 'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'removed_by_category', 'retrieved_on', 'score', 'selftext',
       'send_replies', 'spoiler', 'stickied', 'subreddit', 'subred

The titles of the posts are stored in the 'title' column, while the body text of each post is stored in the 'selftext' column.

In [9]:
# Combine the title and body text into a 'text' column, and label the data.
# The space separator keeps the last word of the title from merging with the first word of the body.
df1['text']=df1['title']+' '+df1['selftext']
df1['label']='wallstreetbet'

In [10]:
df2['text']=df2['title']+' '+df2['selftext']
df2['label']='stock'

In [11]:
# Keep only the label and text columns.
df1=df1[['label','text']]
df2=df2[['label','text']]

In [12]:
print(f'percentage of empty text in r/wallstreetbets: {np.round(df1["text"].isna().sum()/df1.shape[0]*100,2)}%.')
print(f'percentage of empty text in r/stocks: {np.round(df2["text"].isna().sum()/df2.shape[0]*100,2)}%.')

percentage of empty text in r/wallstreetbets: 0.62%.
percentage of empty text in r/stocks: 0.92%.


The percentage of null values in text is quite small, so we can drop these rows.
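The null values in 'text' arise because concatenating a string with a missing value yields a missing value in pandas. The following toy example (the data is made up for illustration) sketches that behaviour and the effect of dropping on the 'text' column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the second post has no body text.
toy = pd.DataFrame({
    "title": ["GME to the moon", "Which ETF?"],
    "selftext": ["Holding since January.", np.nan],
})

# Concatenating with a missing value produces a missing value.
toy["text"] = toy["title"] + " " + toy["selftext"]
print(toy["text"].isna().tolist())

# Dropping on 'text' removes the row whose body was missing.
cleaned = toy.dropna(subset=["text"])
print(len(cleaned))
```

If posts with a title but no body should be kept instead, filling 'selftext' with an empty string before concatenating would be one option; the notebook drops them.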

In [13]:
# Stack the two DataFrames together.
df_stack=pd.concat([df1,df2], axis=0)

In [14]:
df_stack.dropna(subset=['text'], inplace=True)

In [15]:
df_stack.shape

(11562, 2)
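As a sanity check, the final row count can be reproduced from the figures reported earlier. This is a worked calculation under the assumption that the missing-row counts can be recovered by rounding the reported percentages to whole rows.

```python
n_wsb, n_stocks = 10_000, 1_639                    # rows scraped per subreddit
pct_missing_wsb, pct_missing_stocks = 0.62, 0.92   # reported percentages of null text

# Convert the rounded percentages back to whole rows.
missing_wsb = round(n_wsb * pct_missing_wsb / 100)
missing_stocks = round(n_stocks * pct_missing_stocks / 100)

expected_rows = n_wsb + n_stocks - missing_wsb - missing_stocks
print(missing_wsb, missing_stocks, expected_rows)  # 62 15 11562
```

The result agrees with the 11,562 rows left after dropping nulls.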

In [17]:
df=df_stack
%store df

Stored 'df' (DataFrame)


In [6]:
df.to_csv('./data/unprocessed_data.csv')
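Note that `to_csv` writes the row index by default, and after `pd.concat` the index restarts at 0 for the second DataFrame, so the saved index is not unique. The sketch below uses made-up toy data to show what reading the file back looks like with and without the index. Whether the index is wanted downstream is not stated in the notebook, so `index=False` is a suggestion only.

```python
import io
import pandas as pd

# Hypothetical toy frames standing in for df1 and df2.
toy = pd.concat([
    pd.DataFrame({"label": ["wallstreetbet"], "text": ["first post"]}),
    pd.DataFrame({"label": ["stock"], "text": ["second post"]}),
], axis=0)

# Round-trip through CSV in memory, with and without the index.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))     # ['Unnamed: 0', 'label', 'text']
print(list(without_index.columns))  # ['label', 'text']
```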