# Approach<br>


1. Use PMAW to extract title and information of WallStreetBets Daily and Weekend Discussion Threads
2. Use PRAW to extract comments (with some obvious initial preprocessing steps) from those posts
3. Create a dataframe to store the data that we would need and export it to a CSV file

# Why This Approach?<br>


* PRAW can extract most recent X lines of submissions and comments but not within a time period that we can specify
* PMAW can extract submissions within a time period that we can specify, but it cannot extract comments at all
* Use both packages to complement their features 






In [1]:
# PRAW supports Python 3.6+
!python --version

Python 3.7.12


In [2]:
pip install pmaw praw

Collecting pmaw
  Downloading pmaw-2.1.0-py3-none-any.whl (25 kB)
Collecting praw
  Downloading praw-7.4.0-py3-none-any.whl (167 kB)
[K     |████████████████████████████████| 167 kB 6.4 MB/s 
Collecting update-checker>=0.18
  Downloading update_checker-0.18.0-py3-none-any.whl (7.0 kB)
Collecting websocket-client>=0.54.0
  Downloading websocket_client-1.2.1-py2.py3-none-any.whl (52 kB)
[K     |████████████████████████████████| 52 kB 1.3 MB/s 
[?25hCollecting prawcore<3,>=2.1
  Downloading prawcore-2.3.0-py3-none-any.whl (16 kB)
Installing collected packages: websocket-client, update-checker, prawcore, praw, pmaw
Successfully installed pmaw-2.1.0 praw-7.4.0 prawcore-2.3.0 update-checker-0.18.0 websocket-client-1.2.1


In [3]:
from pmaw import PushshiftAPI # extracts information (but cannot extract comments) from Reddit posts
from datetime import datetime # converts UTC-formatted time (a piece of information extracted by PMAW) to month-day-year
import random # creates the small sample of WSB posts to extract comments for midterm
import praw # extracts comments from WSB posts; posts can be narrowed down using PMAW
import pandas as pd # creates and stores information in dataframe that will be exported to CSV 

In [4]:
pip freeze

absl-py==0.12.0
alabaster==0.7.12
albumentations==0.1.12
altair==4.1.0
appdirs==1.4.4
argcomplete==1.12.3
argon2-cffi==21.1.0
arviz==0.11.4
astor==0.8.1
astropy==4.3.1
astunparse==1.6.3
atari-py==0.2.9
atomicwrites==1.4.0
attrs==21.2.0
audioread==2.1.9
autograd==1.3
Babel==2.9.1
backcall==0.2.0
beautifulsoup4==4.6.3
bleach==4.1.0
blis==0.4.1
bokeh==2.3.3
Bottleneck==1.3.2
branca==0.4.2
bs4==0.0.1
CacheControl==0.12.6
cached-property==1.5.2
cachetools==4.2.4
catalogue==1.0.0
certifi==2021.5.30
cffi==1.14.6
cftime==1.5.1
chardet==3.0.4
charset-normalizer==2.0.6
clang==5.0
click==7.1.2
cloudpickle==1.3.0
cmake==3.12.0
cmdstanpy==0.9.5
colorcet==2.0.6
colorlover==0.3.0
community==1.0.0b1
contextlib2==0.5.5
convertdate==2.3.2
coverage==3.7.1
coveralls==0.5
crcmod==1.7
cufflinks==0.17.3
cvxopt==1.2.7
cvxpy==1.0.31
cycler==0.10.0
cymem==2.0.5
Cython==0.29.24
daft==0.0.4
dask==2.12.0
datascience==0.10.6
debugpy==1.0.0
decorator==4.4.2
defusedxml==0.7.1
descartes==1.1.0
dill==0.3.4
distributed=

In [5]:
api = PushshiftAPI()

# creates a 'submission' iterable object to search for Daily Discussion Threads with the following query - can edit query
# submission, posts, threads are interchangeable for Reddit
daily_submissions = api.search_submissions(after=int(datetime(2021, 1, 1).timestamp()), # (YYYY, MM, DD); returns data starting at this date
                                           subreddit='wallstreetbets',
                                           title='Daily Discussion Thread for 2021', # what's entered is loosely searched (think Google search)
                                           fields=['created_utc','id','num_comments','title','url'], # remove this line if you want to see all fields after running next cell (for-loop)
                                           sort_type='created_utc',
                                           sort='desc')

INFO:pmaw.PushshiftAPIBase:253 result(s) available in Pushshift
INFO:pmaw.PushshiftAPIBase:Total:: Success Rate: 100.00% - Requests: 10 - Batches: 1 - Items Remaining: 0


In [6]:
# to store the Daily Discussion Threads (dictionaries) initially
posts_raw = []

# enter text in the [] to filter out unwanted posts with titles that include any of the text
for submission in daily_submissions:
  if any(substring in submission['title'] for substring in ['Unpinned','Part','Pt.','#2','version','30th']) == False:
    posts_raw.append(submission)
    print(submission)

{'created_utc': 1617098411, 'id': 'mgcgcy', 'num_comments': 19033, 'title': 'Daily Discussion Thread for March 30, 2021', 'url': 'https://www.reddit.com/r/wallstreetbets/comments/mgcgcy/daily_discussion_thread_for_march_30_2021/'}
{'created_utc': 1617012015, 'id': 'mfm3fg', 'num_comments': 19550, 'title': 'Daily Discussion Thread for March 29, 2021', 'url': 'https://www.reddit.com/r/wallstreetbets/comments/mfm3fg/daily_discussion_thread_for_march_29_2021/'}
{'created_utc': 1615975213, 'id': 'm6wyk2', 'num_comments': 14678, 'title': 'Daily Discussion Thread for March 17, 2021', 'url': 'https://www.reddit.com/r/wallstreetbets/comments/m6wyk2/daily_discussion_thread_for_march_17_2021/'}
{'created_utc': 1615888812, 'id': 'm65k4x', 'num_comments': 24039, 'title': 'Daily Discussion Thread for March 16, 2021', 'url': 'https://www.reddit.com/r/wallstreetbets/comments/m65k4x/daily_discussion_thread_for_march_16_2021/'}
{'created_utc': 1615802413, 'id': 'm5hbpf', 'num_comments': 25356, 'title': 

In [7]:
# number of Daily Discussion Threads
daily_count = len(posts_raw)
daily_count

198

In [8]:
# creates a 'submission' iterable object to search for Weekend Discussion Threads with the following query - can edit query
weekend_submissions = api.search_submissions(after=int(datetime(2021, 1, 1).timestamp()), # (YYYY, MM, DD); returns data at this start date
                                             subreddit='wallstreetbets',
                                             title='Weekend Discussion Thread for 2021', # what's entered is loosely searched (think Google search)
                                             fields=['created_utc','id','num_comments','title','url'], # remove this line if want to see all fields after running next cell (for-loop)
                                             sort_type='created_utc',
                                             sort='desc')

INFO:pmaw.PushshiftAPIBase:43 result(s) available in Pushshift
INFO:pmaw.PushshiftAPIBase:Total:: Success Rate: 100.00% - Requests: 10 - Batches: 1 - Items Remaining: 0


In [9]:
# enter text in the [] to filter out unwanted posts with titles that include any of the text
for submission in weekend_submissions:
  if any(substring in submission['title'] for substring in ['discussion','-']) == False:
      posts_raw.append(submission)
      print(submission)

{'created_utc': 1611954020, 'id': 'l8420s', 'num_comments': 65450, 'title': 'Weekend Discussion Thread for the Weekend of January 29, 2021', 'url': 'https://www.reddit.com/r/wallstreetbets/comments/l8420s/weekend_discussion_thread_for_the_weekend_of/'}
{'created_utc': 1611349217, 'id': 'l2wx2p', 'num_comments': 56548, 'title': 'Weekend Discussion Thread for the Weekend of January 22, 2021', 'url': 'https://www.reddit.com/r/wallstreetbets/comments/l2wx2p/weekend_discussion_thread_for_the_weekend_of/'}
{'created_utc': 1610744415, 'id': 'ky3qg6', 'num_comments': 46933, 'title': 'Weekend Discussion Thread for the Weekend of January 15, 2021', 'url': 'https://www.reddit.com/r/wallstreetbets/comments/ky3qg6/weekend_discussion_thread_for_the_weekend_of/'}
{'created_utc': 1610139619, 'id': 'ktbnq4', 'num_comments': 49137, 'title': 'Weekend Discussion Thread for the Weekend of January 08, 2021', 'url': 'https://www.reddit.com/r/wallstreetbets/comments/ktbnq4/weekend_discussion_thread_for_the_we

In [10]:
# number of Weekend Discussion Threads
len(posts_raw) - daily_count

40

In [11]:
# removes duplicate threads and keeps the one with the larger number of comments
# posts_1 and posts_2 are copies of posts_raw in case posts_raw is needed for reference
# posts_final is the list of all threads with their selected information to be used for extracting comments next
posts_1 = posts_2 = sorted(posts_raw, key=lambda sub: sub['created_utc'])
posts_final = []

for submission_1 in posts_1:
  for submission_2 in posts_2:
    if submission_1['title'] == submission_2['title'] and submission_1['id'] != submission_2['id']:
      if submission_1['num_comments'] > submission_2['num_comments']:
        posts_final.append(submission_1)
      else:
        posts_final.append(submission_2)
      posts_1.remove(submission_1)
      posts_2.remove(submission_2)    
  if submission_1 in posts_1 and submission_1 not in posts_final:
    posts_final.append(submission_1) 

posts_final

[{'created_utc': 1609498816,
  'id': 'ko9i5u',
  'num_comments': 5629,
  'title': 'Daily Discussion Thread for January 01, 2021',
  'url': 'https://www.reddit.com/r/wallstreetbets/comments/ko9i5u/daily_discussion_thread_for_january_01_2021/'},
 {'created_utc': 1609534812,
  'id': 'koiz6w',
  'num_comments': 34799,
  'title': 'Weekend Discussion Thread for the Weekend of January 01, 2021',
  'url': 'https://www.reddit.com/r/wallstreetbets/comments/koiz6w/weekend_discussion_thread_for_the_weekend_of/'},
 {'created_utc': 1609758019,
  'id': 'kq6j3j',
  'num_comments': 26928,
  'title': 'Daily Discussion Thread for January 04, 2021',
  'url': 'https://www.reddit.com/r/wallstreetbets/comments/kq6j3j/daily_discussion_thread_for_january_04_2021/'},
 {'created_utc': 1609930813,
  'id': 'krlpdb',
  'num_comments': 34480,
  'title': 'Daily Discussion Thread for January 06, 2021',
  'url': 'https://www.reddit.com/r/wallstreetbets/comments/krlpdb/daily_discussion_thread_for_january_06_2021/'},
 {'

In [12]:
# number of Daily and Weekend Discussion Threads without duplicates
len(posts_final)

224

In [13]:
# takes 5 samples of threads posted from 1/1/2021 to current time with at least 100 comments (some threads have single-digit number of comments)
# posts_to_sample = [submission for submission in posts_final if submission['num_comments'] >= 100]
# samples = random.sample(posts_to_sample, 5)
# samples

In [14]:
# establish connection to Reddit API using PRAW to create 'submission' iterable object for extracting comments in next cell
reddit = praw.Reddit(client_id='gnbaQGt7tQHNFF1QCie0dA',
                     client_secret='qeG8GsbZRuEm5PHl0fIl43ehK0grkA',
                     user_agent='WSB Scraper (by u/Aggressive-Risotto)')

In [16]:
# dataframe to store data and that will be exported to CSV 
df = pd.DataFrame(columns=['Submission Title','Comment','Date Posted','Author','Score'])

# data stored in this order - thread title, comment, date posted for comment, comment author, comment score
# filtered out - comments made by VisualMod (MOD), deleted comments, comments with 0 score
# score is the sum of upvotes (+) and downvotes (-)
for post in posts_final:
  submission = reddit.submission(id=post['id'])
  submission.comments.replace_more(limit=0)
  for top_level_comment in submission.comments:
    if top_level_comment.author != 'VisualMod' and top_level_comment != '[deleted]' and top_level_comment.score > 0:
      df = df.append({'Submission Title': submission.title,
                      'Comment': top_level_comment.body,
                      'Date Posted': datetime.fromtimestamp(top_level_comment.created_utc).strftime('%m-%d-%Y'),
                      'Author': top_level_comment.author,
                      'Score': top_level_comment.score},
                     ignore_index=True)

df

# specifically to check total number of comments 
df.shape

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/latest/getting_started/multiple_instances.html#discord-bots-and-asynchronous-environments for more info.

It is strongly recommended to use Async PRAW: https://asyncpraw.readthedocs.io.
See https://praw.readthedocs.io/en/l

(100704, 5)

In [17]:
# find the CSV file in 'Files' left panel, download it if needed
df.to_csv('WSB_Comments.csv', index=False)