## 02 Data Collection

In [1]:
import pandas as pd
import datetime as dt
import re
import numpy as np
from pmaw import PushshiftAPI

In [2]:
# PMAW is a wrapper for the Pushshift API that uses multithreading to retrieve Reddit comments and submissions.
api = PushshiftAPI()

Initial tests suggest that the two subreddits differ considerably in the number of posts, so the traffic of the subreddit with fewer posts (r/stocks) is what determines the timeframe of the scraping process. As there are on average about 4000 posts in r/wallstreetbets and 1000 posts in r/stocks every week, a one-month timeframe is used to guarantee enough posts from r/stocks. The chosen timeframe is January 1, 2022 to February 1, 2022.

In [3]:
# Set the scraping window as Unix timestamps (seconds since the epoch).
after=int(dt.datetime(2022,1,1,0,0).timestamp()) 
before=int(dt.datetime(2022,2,1,0,0).timestamp()) #2022 Jan 1st - 2022 Feb 1st.
before,after

(1643644800, 1640966400)
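
A note on the values above: `datetime.timestamp()` on a naive `datetime` is interpreted in the machine's local timezone, so the same cell yields different epochs on different machines. The printed values are 8 hours earlier than UTC midnight, which suggests the notebook ran on a machine set to UTC+8. A minimal sketch of pinning the window to UTC instead, assuming UTC boundaries are what is wanted (the names `after_utc` and `before_utc` are illustrative and not part of the notebook):

```python
import datetime as dt

# Attaching tzinfo=UTC makes the result independent of the local timezone.
after_utc = int(dt.datetime(2022, 1, 1, tzinfo=dt.timezone.utc).timestamp())
before_utc = int(dt.datetime(2022, 2, 1, tzinfo=dt.timezone.utc).timestamp())

print(before_utc, after_utc)  # 1643673600 1640995200

# Gap between UTC midnight and the notebook's `after` value, in hours.
offset_hours = (after_utc - 1640966400) / 3600  # 8.0
```

Whether the shift matters depends on the analysis; for a one-month window it moves only a few hours of posts at each boundary.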

Next, the search_submissions method is used to scrape the posts, as it is not interrupted after a certain number of requests. The documentation for search_submissions in the PMAW library is at https://github.com/mattpodolak/pmaw/blob/master/README.md#pushshiftapi

In [4]:
%%time
submissions = api.search_submissions(subreddit="wallstreetbets",after=after,before=before,safe_exit=True)
df1 = pd.DataFrame(submissions)

CPU times: total: 3.81 s
Wall time: 1min 49s


The wall time for scraping one month of data is under 2 minutes, which is acceptable.

In [5]:
%%time
submission2=api.search_submissions(subreddit='stocks',after=after,before=before,safe_exit=True)
df2=pd.DataFrame(submission2)

CPU times: total: 516 ms
Wall time: 7.84 s


In [6]:
df1.head(2)

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_is_blocked,author_patreon_flair,...,media_embed,secure_media,secure_media_embed,gallery_data,media_metadata,author_cakeday,poll_data,live_audio,banned_by,tournament_data
0,[],False,inkollusekar,,[],,text,t2_1dt0l3xs,False,False,...,,,,,,,,,,
1,[],False,Janto_2021,,"[{'e': 'text', 't': 'Two Brains, Uses Neither'}]","Two Brains, Uses Neither",richtext,t2_88ab00y8,False,False,...,,,,,,,,,,


In [7]:
df1.shape, df2.shape

((22631, 84), (5085, 70))

The number of posts in these two subreddits differs considerably, so we will be working with an imbalanced classification problem later.
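
To put a number on the imbalance, the class shares can be computed directly from the two shapes printed above. This is a small sketch using only those row counts; the variable names are illustrative, and the counts are taken before any rows are dropped.

```python
# Row counts from the df1.shape and df2.shape output.
n_wsb, n_stocks = 22631, 5085
total = n_wsb + n_stocks

share_wsb = n_wsb / total
share_stocks = n_stocks / total

# A model that always predicts the majority class reaches this accuracy,
# so later classifiers should be compared against it rather than against 50%.
baseline_accuracy = max(share_wsb, share_stocks)

print(f"r/wallstreetbets: {share_wsb:.1%}, r/stocks: {share_stocks:.1%}")
```

Roughly 82% of the posts come from r/wallstreetbets, so plain accuracy alone would be a misleading metric later on.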

In [8]:
raw_data=pd.concat([df1,df2], axis=0)
raw_data.to_csv('./data/raw_data.csv')
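
Two pandas defaults are worth keeping in mind for the cell above: `pd.concat` keeps each frame's own index, so row labels repeat across the two subreddits, and `to_csv` writes the index as an extra unnamed first column. The sketch below shows one way to avoid both, using small made-up frames (`a`, `b`) as stand-ins for `df1` and `df2` and an in-memory buffer instead of a file.

```python
import io
import pandas as pd

# Hypothetical stand-ins for df1 and df2; each has its own 0..n-1 index.
a = pd.DataFrame({"title": ["post a0", "post a1"]})
b = pd.DataFrame({"title": ["post b0"]})

# ignore_index=True renumbers the rows so that index labels are unique.
combined = pd.concat([a, b], axis=0, ignore_index=True)

# index=False leaves the index out of the file, so nothing extra comes back on reload.
buffer = io.StringIO()
combined.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
```

With these options the reloaded frame has the same columns as the one that was saved.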

In [9]:
df1.columns

Index(['all_awardings', 'allow_live_comments', 'author',
       'author_flair_css_class', 'author_flair_richtext', 'author_flair_text',
       'author_flair_type', 'author_fullname', 'author_is_blocked',
       'author_patreon_flair', 'author_premium', 'awarders', 'can_mod_post',
       'contest_mode', 'created_utc', 'domain', 'full_link', 'gildings', 'id',
       'is_created_from_ads_ui', 'is_crosspostable', 'is_meta',
       'is_original_content', 'is_reddit_media_domain', 'is_robot_indexable',
       'is_self', 'is_video', 'link_flair_background_color',
       'link_flair_css_class', 'link_flair_richtext', 'link_flair_template_id',
       'link_flair_text', 'link_flair_text_color', 'link_flair_type', 'locked',
       'media_only', 'no_follow', 'num_comments', 'num_crossposts', 'over_18',
       'parent_whitelist_status', 'permalink', 'pinned', 'pwls',
       'retrieved_on', 'score', 'selftext', 'send_replies', 'spoiler',
       'stickied', 'subreddit', 'subreddit_id', 'subreddit_sub

The titles of the posts are stored in the 'title' column, while the body text of each post is stored in the 'selftext' column.

In [25]:
#check on the nan values of 'title' column
df1.isna().sum()['title']

0

In [30]:
# check on the nan values of 'selftext' column
df1.isna().sum()['selftext']

11298

In [28]:
df2.isna().sum()['title']

0

In [31]:
df2.isna().sum()['selftext']

2041

The data returned by the PMAW API appears to be pre-filtered: no 'title' value is empty. There are many NaN values in the 'selftext' column, though. Since the text corpus will combine 'title' and 'selftext', a row with a missing 'selftext' still carries usable text from its 'title', so we keep these rows.

Next we keep only the textual data from these two columns.

In [11]:
# Combine the title and body text into a 'text' column, and label the data
df1['text']=df1['title']+df1['selftext']
df1['label']='wallstreetbet'

In [12]:
df2['text']=df2['title']+df2['selftext']
df2['label']='stock'
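
One caveat with `+` on string columns: in pandas, adding a string to a missing value gives a missing value, and no separator is inserted between title and body. The following is a minimal sketch of a NaN-safe variant on a made-up two-row frame (`toy` and its contents are hypothetical); it is an alternative to consider, not the method used in this notebook, and applying it would change the counts reported below.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df1/df2; the rows are made up.
toy = pd.DataFrame({
    "title": ["GME to the moon", "Weekly discussion"],
    "selftext": ["Position inside.", np.nan],
})

# Plain "+" propagates the missing value: the second row becomes NaN.
naive = toy["title"] + toy["selftext"]

# Filling missing bodies with "" keeps title-only posts, and the space
# stops the last title word from fusing with the first body word.
safe = (toy["title"] + " " + toy["selftext"].fillna("")).str.strip()
```

Here `naive` loses the second post entirely, while `safe` keeps it as a title-only document.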

In [13]:
# keep only the label and text columns
df1=df1[['label','text']]
df2=df2[['label','text']]

In [14]:
print(f'percentage of empty text in r/wallstreetbets: {df1["text"].isna().sum()/df1.shape[0]*100}%.')
print(f'percentage of empty text in r/stocks: {np.round(df2["text"].isna().sum()/df2.shape[0]*100,2)}%.')

percentage of empty text in r/wallstreetbets: 0.06628076532190358%.
percentage of empty text in r/stocks: 0.2%.


The percentage of null values in 'text' is quite small, so we can drop these rows.

In [15]:
# stack the two dataframes together
df_stack=pd.concat([df1,df2], axis=0)

In [16]:
df_stack.dropna(subset=['text'], inplace=True)

In [17]:
df_stack.shape

(27691, 2)
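
As a quick consistency check, the row count after `dropna` can be reconciled with the shapes and empty-text percentages printed earlier. The split of dropped rows per subreddit below (15 and 10) is inferred from those percentages and is an assumption, not a value shown in the notebook.

```python
# Rows before dropping (from df1.shape and df2.shape) and after (from df_stack.shape).
n_wsb, n_stocks = 22631, 5085
n_after = 27691

dropped = n_wsb + n_stocks - n_after  # 25 rows removed by dropna

# Inferred per-subreddit split of the dropped rows (assumption).
dropped_wsb, dropped_stocks = 15, 10

pct_wsb = dropped_wsb / n_wsb * 100                      # about 0.0663
pct_stocks = round(dropped_stocks / n_stocks * 100, 2)   # 0.2

print(dropped, dropped_wsb + dropped_stocks)
```

The two numbers agree, so the drop removed only the rows with missing text.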

In [18]:
df=df_stack
%store df

Stored 'df' (DataFrame)


In [33]:
df.to_csv('./data/textual_data_with_labels.csv')