# Reddit Scrapper

In [1]:
#dependencies (in all cases i assume you have pandas/numpy installed) 
!pip install psaw
!pip install praw #optional 

Collecting psaw
  Downloading https://files.pythonhosted.org/packages/01/fe/e2f43241ff7545588d07bb93dd353e4333ebc02c31d7e0dc36a8a9d93214/psaw-0.1.0-py3-none-any.whl
Installing collected packages: psaw
Successfully installed psaw-0.1.0
Collecting praw
[?25l  Downloading https://files.pythonhosted.org/packages/48/a8/a2e2d0750ee17c7e3d81e4695a0338ad0b3f231853b8c3fa339ff2d25c7c/praw-7.2.0-py3-none-any.whl (159kB)
[K     |████████████████████████████████| 163kB 4.3MB/s 
[?25hCollecting prawcore<3,>=2
  Downloading https://files.pythonhosted.org/packages/7d/df/4a9106bea0d26689c4b309da20c926a01440ddaf60c09a5ae22684ebd35f/prawcore-2.0.0-py3-none-any.whl
Collecting update-checker>=0.18
  Downloading https://files.pythonhosted.org/packages/0c/ba/8dd7fa5f0b1c6a8ac62f8f57f7e794160c1f86f31c6d0fb00f582372a3e4/update_checker-0.18.0-py3-none-any.whl
Collecting websocket-client>=0.54.0
[?25l  Downloading https://files.pythonhosted.org/packages/4c/5f/f61b420143ed1c8dc69f9eaec5ff1ac36109d52c80de49d66

In [55]:
import pandas as pd
import numpy as np
import datetime as dt
from psaw import PushshiftAPI
import os

## Scrap Reddit 

##### Praw is the most popular tool to scrap Reddit, however it is best with realtime data. For historical data PushshiftAPI is best to use. 
##### Submissions are the main "posts", comments are replies to the submissions. 
##### Following script scraps chosen subreddiat, and returns two dataframes one with comments in the subreddit and one with submissions in the subreddit.  Scripts stores both dataframes as pickle files. Since some comments are added to previous submissions (added before the set endDate) not all comments have submission in "submission title" field. However all comments in column 'link_id' have id of the submission.  Both dataframes can be easily combined to "rebuild threads". 

Arguments: 

startDate - starting date of query, format: DD-MM-YYYY

endDate - the query will download everything before this date, format: DD-MM-YYYY

subreddit - name of the subreddit you want to scrap

fileName1 - name of the first file without extension (it stores submissions) 

fileName2 - name of the file without extension (it stores comments) 

hs-starting hour 

ms-starting minute

he-end hour

me-end minute 

In [38]:
def scrapReddit(startDate,endDate,subreddit,fileName1,fileName2,hs,ms,he,me):

    api = PushshiftAPI()

    start_d=startDate[:2]
    end_d=endDate[:2]

    start_m=startDate[3:5]
    end_m=endDate[3:5]

    start_y=startDate[6:10]
    end_y=endDate[6:10]

    start_time = int(dt.datetime(int(start_y), int(start_m), int(start_d),int(hs),int(ms)).timestamp())
    end_time = int(dt.datetime(int(end_y), int(end_m), int(end_d), int(he),int(me)).timestamp())

    searchSubmissions=api.search_submissions(after=start_time, before=end_time, subreddit=subreddit)
    searchComments=api.search_comments(after=start_time, before=end_time, subreddit=subreddit)
    print(searchComments)
    df1 = pd.DataFrame([obj.d_ for obj in searchSubmissions])
    df2 = pd.DataFrame([obj.d_ for obj in searchComments])

    for i in range(len(df2)):
        df2.loc[i,'link_id']=df2.loc[i,'link_id'][3:]

    rows=[]
    for i in range (len(df2)):
        row=df1.loc[df1['id'] == df2.loc[i,'link_id'],'title']
        if len(row)<1:
            rows.append('no submission title')
        else:
            rows.append(row)
    df2['submission_title']=rows
  
    print("Submissions dataframe shape:")
    print(df1.shape)
    
    try:
        saveName1=fileName1+'.pkl'
        df1.to_pickle(saveName1)
        print("Submissions file saved")
    except:
        print("there was problem with saving the file")
        print("just in case you can still use the variable you stored response in to save it")
    
    print("Comments dataframe shape:")
    print(df2.shape)
    
    try:
        saveName2=fileName2+'.pkl'
        df2.to_pickle(saveName2)
        print("Comments file saved")
    except:
        print("there was problem with saving the file")
        print("just in case you can still use the variable you stored response in to save it")
    

    return df1,df2

## readRedditData
##### Takes path where data is stored as argument and returns data loaded into pandas dataframe

Arguments: 

path - path to the file 

In [7]:
def readRedditData(path):
    reddit_df=pd.read_pickle(path)
    print('shape of dataframe:')
    print(reddit_df.shape)
    return reddit_df

## intervalScrapReddit
##### Use it to scrap popular subreddits or for longer periods of time. It divides downloded data, so the api can handle it. It stores all files in directory name you provided, if there isn't such directory method will create it. You can specify interval. Choose from month, day,6 hours, hour, 30 minutes, 10 minutes. 

##### **For every interval you can see the shape of dataframes saved and the message if they were saved correctly. Pay attention to it. If shape is (0,1) and it reapeats multiple time, It means that period was either so busy, that you need to lower interval or that there were actually no posts. If you want to find out what period was it do it this way:**
1. Find starting timestamp (use datetime to timestamp converter) and find the interval when it broke. (Interval numbers are displayed starting with 0) 

2. Calcuate single interval timestamp (multiply seconds*minutes*hours*days*months) E.g. 6 hours interval would be equal to: 60*60*6

3. Then use this pattern to find the timestamp when things broke 
NumOfIntervals=Intevarval_Number+1
Step_timestamp=60*60*6 (example)
startTimeStamp=convertedStartDate_andHourMinutes
brokenTimeStamp=startTimeStamp+NumOfIntervals*Step_timestamp

4. Use timestamp to datetime converter to convert it to date and time 

**UPDATE For every interval it prints startDateTime and endDateTime so you don't need count anymore**

Arguments: 

startDate - starting date of query, format: DD-MM-YYYY

endDate - the query will download everything before this date, format: DD-MM-YYYY

subreddit - name of the subreddit you want to scrap

fileName1 - name of the first file without extension (it stores submissions) 

fileName2 - name of the file without extension (it stores comments) 

hs-starting hour 

ms-starting minute

he-end hour

me-end minute 

interval-Period of time for which algorithm downloads data (e.g. 1 hour, between the start date and end date, algorithm will download and store data for every hour) 
**Choose from following: "month", "day","6hours","hour","30minutes","10minutes"

directoryName - directory where all the files are stored (can be existing one or enter name for new one)


In [83]:
#To do: add different intervals option (month, day,6 hours, hour, 30 minutes, 10 minutes)
def intervalScrapReddit(startDate,endDate,subreddit,fileName1,fileName2,hs,ms,he,me,interval,directoryName):
  try:
    os.mkdir(directoryName)
  except OSError:
    print ("Creation of the directory failed or it already exists")
  else:
    print ("Successfully created the directory ")

  start_d=startDate[:2]
  end_d=endDate[:2]

  start_m=startDate[3:5]
  end_m=endDate[3:5]

  start_y=startDate[6:10]
  end_y=endDate[6:10]

  start_time = int(dt.datetime(int(start_y), int(start_m), int(start_d),int(hs),int(ms)).timestamp())
  end_time = int(dt.datetime(int(end_y), int(end_m), int(end_d), int(he),int(me)).timestamp())
  if interval=="month":
    stepTimestamp=60*60*24*30
  elif interval=="day":
    stepTimestamp=60*60*24
  elif interval=="6hours":
    stepTimestamp=60*60*6
  elif interval=="30minutes":
    stepTimestamp=60*30
  elif interval=="10minutes":
    stepTimestamp=60*10
  else:
    #one hour by default
    stepTimestamp=60*60
  name=0
  previous=start_time
  for i in range(stepTimestamp,end_time,stepTimestamp):

    fileNameSub=fileName1+str(name)
    fileNameCom=fileName2+str(name)
    endDate=start_time+i
    print("end date:"+str(endDate))
    print("startDate:"+str(previous))
    message=scrapReddit2(previous,endDate,subreddit,fileNameSub,fileNameCom,directoryName)
    print("Interval:"+str(name)+"\n"+message)
    previous=endDate
    if previous>=end_time:
      print("finished")
      break
    name+=1


def scrapReddit2(timestamp,timestamp2,subreddit,fileName1,fileName2,directoryName):

    api = PushshiftAPI()

    start_time = timestamp
    end_time = timestamp2

    searchSubmissions=api.search_submissions(after=start_time, before=end_time, subreddit=subreddit)
    searchComments=api.search_comments(after=start_time, before=end_time, subreddit=subreddit)
    df1 = pd.DataFrame([obj.d_ for obj in searchSubmissions])
    df2 = pd.DataFrame([obj.d_ for obj in searchComments])

    for i in range(len(df2)):
        df2.loc[i,'link_id']=df2.loc[i,'link_id'][3:]

    rows=[]
    for i in range (len(df2)):
        row=df1.loc[df1['id'] == df2.loc[i,'link_id'],'title']
        if len(row)<1:
            rows.append('no submission title')
        else:
            rows.append(row)
    df2['submission_title']=rows
  
    print("Submissions dataframe shape:")
    print(df1.shape)
    message=""
    try:
        saveName1=directoryName+'/'+fileName1+'.pkl'
        df1.to_pickle(saveName1)
        message+="Submissions Saved correctly \n"
    except:
        message+="Submissions not saved \n"
    
    print("Comments dataframe shape:")
    print(df2.shape)
    
    try:
        saveName2=directoryName+'/'+fileName2+'.pkl'
        df2.to_pickle(saveName2)
        message+="Comments Saved correctly \n"
    except:
        message+="Comments not saved \n"

    return message

In [111]:
#Load interval

def loadIntervalReddit(path):
  fileNames= arr = os.listdir(path)

  print(fileNames)
  fileNames.sort()
  print(fileNames)
  for i in range(len(fileNames)):
    p=path+"/"+fileNames[i]
    print(fileNames[i])
    
    if i>(len(fileNames)/2)-1:
      dfTemp=readRedditData(p)
      print("temp")
      print(dfTemp.shape)
      dfSub.append(dfTemp)
    elif i==0:
      dfTemp=readRedditData(p)
      print("temp")
      dfCom=dfTemp
      print(dfCom.shape)
    elif i==(len(fileNames)/2)-1:
      dfTemp=readRedditData(p)
      print("temp")
      dfSub=dfTemp
      print(dfSub.shape)
    else:
      dfTemp=readRedditData(p)
      print('temp')
      print(dfTemp.shape)
      dfCom.append(dfTemp)

  print("Submissions dataframe shape:")
  print(dfSub.shape)
  print("Comments dataframe shape:")
  print(dfCom.shape)
  return dfSub, dfCom
      



# CLARIFICATION

1.Which method to use

Pushfit api is limited to 200 requests per minute and reddit limits amount of data it sends back, therefore we can't download large amounts of data at once. If that's the case you will probably see comments dataframe to have shape zero. Use the interval way of downloading data, it will make separate requests for each interval and the amount will be much smaller divided into many files in single directory, which can be easily combined into one dataframe. 

2.Error

generator object PushshiftAPIMinimal._search at 0x7fc963e46bd0>
/usr/local/lib/python3.7/dist-packages/psaw/PushshiftAPI.py:192: UserWarning: Got non 200 code 429
  warnings.warn("Got non 200 code %s" % response.status_code)
/usr/local/lib/python3.7/dist-packages/psaw/PushshiftAPI.py:180: UserWarning: Unable to connect to pushshift.io. Retrying after backoff.
  warnings.warn("Unable to connect to pushshift.io. Retrying after backoff.

  **If you get this error it means api made too many requests and it's waiting to send next one, please wait if that's the case and in 2-3 minutes it should work (2-3 because it needs also time to download data and show you output)**

3.Contents of Dataframe

The scrapper gets all the data it can, but of course not everything is relevant to us. The reason i didn't add "drop columns" there is that sometimes in process of research it turns out something is actually relevant or we need other columns to clean data/filter something out. E.g. We need language column to get rid of rows in language we're not interested in. 

Scrapper gets data for comments and submissions, it's the way api works. Submission is the "main post" someone added. Comments other users replies to the main post (submission). By default there's no information in comments dataframe about which submission comment refers to, only id of submission. Therefore what i did is i added the submission to the comments dataframe as separate columns. 

Most relevant columns: 

1) Comments Dataframe 

-author 

-submission_title (as the name suggests) 

-body (the actual text of the comment)

-created (timestamp)

-subreddit (name of subreddit) 

-score (how many likes comment received




2) Submissions Dataframe: 

-score (number of likes submission received)

-subreddit (name of subreddit) 

-created (timestamp)

-author 

-title (title of subreddit)

-selftext (content of main post)


## Example use 1 (small amount of data)

#### Get all comments and submissions from subreddit "learnmachinelearning" from 25th January to 5th Febuary

In [53]:
startDate='25-01-2021'
endDate='05-02-2021'
subreddit='learnmachinelearning'
fileName1='submissions'
fileName2='comments'
startHour=10 
startMinute=5
endHour=13
endMinute=30
dfSub,dfCom=scrapReddit(startDate,endDate,subreddit,fileName1,fileName2,startHour,startMinute,endHour,endMinute)
dfSub.head()

<generator object PushshiftAPIMinimal._search at 0x7fc963f95cd0>




Submissions dataframe shape:
(406, 78)
Submissions file saved
Comments dataframe shape:
(578, 38)
Comments file saved
(406, 78)


Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_richtext,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,created,link_flair_template_id,link_flair_text,media_metadata,thumbnail_height,thumbnail_width,post_hint,preview,removed_by_category,url_overridden_by_dest,media,media_embed,secure_media,secure_media_embed,gilded,author_flair_background_color,author_flair_text_color,edited,crosspost_parent,crosspost_parent_list,is_gallery
0,[],False,andw1235,,[],,text,t2_zp1j5,False,False,[],False,False,1612472490,self.learnmachinelearning,https://www.reddit.com/r/learnmachinelearning/...,{},lcq1dr,True,False,False,False,True,True,False,,[],dark,text,False,False,True,0,0,False,all_ads,/r/learnmachinelearning/comments/lcq1dr/gibbs_...,False,6,1612675680,1,"Here's a simple explanation, python example co...",True,False,False,learnmachinelearning,t5_3cqa1,211904,public,self,Gibbs sampling,0,[],1.0,https://www.reddit.com/r/learnmachinelearning/...,all_ads,6,1612472000.0,,,,,,,,,,,,,,,,,,,,
1,[],False,lwilson747,,[],,text,t2_36ixj,False,False,[],False,False,1612467552,self.learnmachinelearning,https://www.reddit.com/r/learnmachinelearning/...,{},lco1an,True,False,False,False,True,True,False,#dadada,"[{'e': 'text', 't': 'Tutorial'}]",dark,richtext,False,False,True,0,0,False,all_ads,/r/learnmachinelearning/comments/lco1an/comput...,False,6,1612671880,2,"AI colleagues, this professional certificate s...",True,False,False,learnmachinelearning,t5_3cqa1,211900,public,https://b.thumbs.redditmedia.com/zYROceRZK04iX...,Computer Science for Artificial Intelligence (...,0,[],1.0,https://www.reddit.com/r/learnmachinelearning/...,all_ads,6,1612468000.0,8aeee882-d289-11ea-b4f0-0ed750cbd99b,Tutorial,"{'380ewvrbkif61': {'e': 'Image', 'id': '380ewv...",57.0,140.0,,,,,,,,,,,,,,,
2,[],False,tylersuard,,[],,text,t2_57i0bime,False,False,[],False,False,1612466065,self.learnmachinelearning,https://www.reddit.com/r/learnmachinelearning/...,{},lcnfu4,True,False,False,False,True,True,False,,[],dark,text,False,False,False,0,0,False,all_ads,/r/learnmachinelearning/comments/lcnfu4/aipowe...,False,6,1612670746,3,[https://colab.research.google.com/drive/1JyAg...,True,False,False,learnmachinelearning,t5_3cqa1,211900,public,self,AI-Powered Sarcastic News Headlines Generator,0,[],1.0,https://www.reddit.com/r/learnmachinelearning/...,all_ads,6,1612466000.0,,,,,,self,"{'enabled': False, 'images': [{'id': 'nkhh65uj...",,,,,,,,,,,,,
3,[],False,korfich,,[],,text,t2_jthip,False,False,[],False,False,1612464359,self.learnmachinelearning,https://www.reddit.com/r/learnmachinelearning/...,{},lcmqky,True,False,False,False,True,True,False,#7193ff,"[{'e': 'text', 't': 'Project'}]",light,richtext,False,False,True,0,0,False,all_ads,/r/learnmachinelearning/comments/lcmqky/projec...,False,6,1612669426,3,Just finished this project! An unfiltered revi...,True,False,False,learnmachinelearning,t5_3cqa1,211898,public,self,[Project] Find Neapolitan pizza with AI help,0,[],1.0,https://www.reddit.com/r/learnmachinelearning/...,all_ads,6,1612464000.0,e21fa83e-accf-11e9-ab9f-0ec7c4b24e8e,Project,,,,,,,,,,,,,,,,,,
4,[],False,ex1us,,[],,text,t2_j1u0r,False,False,[],False,False,1612460837,self.learnmachinelearning,https://www.reddit.com/r/learnmachinelearning/...,{},lclcz1,True,False,False,False,True,True,False,#ea0027,"[{'e': 'text', 't': 'Help'}]",light,richtext,False,False,True,4,0,False,all_ads,/r/learnmachinelearning/comments/lclcz1/is_it_...,False,6,1612666696,0,"If someone is wearing a surgical mask, most fa...",True,False,False,learnmachinelearning,t5_3cqa1,211891,public,self,Is it possible to do face recognition with jus...,0,[],0.5,https://www.reddit.com/r/learnmachinelearning/...,all_ads,6,1612461000.0,af5dfa18-accf-11e9-9669-0ec668ea0cbc,Help,,,,,,,,,,,,,,,,,,


In [54]:
print(dfCom.shape)
dfCom.head()

(578, 38)


Unnamed: 0,all_awardings,associated_award,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,author_patreon_flair,author_premium,awarders,body,collapsed_because_crowd_control,comment_type,created_utc,gildings,id,is_submitter,link_id,locked,no_follow,parent_id,permalink,retrieved_on,score,send_replies,stickied,subreddit,subreddit_id,top_awarded_type,total_awards_received,treatment_tags,created,author_cakeday,submission_title
0,[],,brodimcbroface,,,[],,,,text,t2_2rt62mlc,False,False,[],I've got a sub-question related to this. How f...,,,1612373336,{},glvtwbt,False,lbpu27,False,False,t3_lbpu27,/r/learnmachinelearning/comments/lbpu27/are_hi...,1612681337,3,True,False,learnmachinelearning,t5_3cqa1,,0,[],1612373000.0,,62 Are high level maths like Calc required ...
1,[],,[deleted],,,,,,dark,,,,,[],[removed],,,1612373324,{},glvtv6k,False,lbf4w9,False,True,t3_lbf4w9,/r/learnmachinelearning/comments/lbf4w9/wanted...,1612681321,1,True,False,learnmachinelearning,t5_3cqa1,,0,[],1612373000.0,,82 Wanted to share a free course on Machine...
2,[],,RastputinsBeard,,,[],,,,text,t2_8dt84hp1,False,False,[],Remindme! 1 day,,,1612373185,{},glvtigx,False,lbf4w9,False,True,t3_lbf4w9,/r/learnmachinelearning/comments/lbf4w9/wanted...,1612681128,1,True,False,learnmachinelearning,t5_3cqa1,,0,[],1612373000.0,,82 Wanted to share a free course on Machine...
3,[],,useful4nothin,,,[],,,,text,t2_6k8omr81,False,False,[],"RemindMeRepeat! 3 Days ""Machine Learning Tutor...",,,1612370543,{},glvmxoi,False,lbf4w9,False,False,t3_lbf4w9,/r/learnmachinelearning/comments/lbf4w9/wanted...,1612677743,1,True,False,learnmachinelearning,t5_3cqa1,,0,[],1612371000.0,,82 Wanted to share a free course on Machine...
4,[],,th30rum,,,[],,,,text,t2_8fzb9io,False,False,[],"Know basic of multivariate calc, be really goo...",,,1612369471,{},glvkash,False,lbpu27,False,False,t3_lbpu27,/r/learnmachinelearning/comments/lbpu27/are_hi...,1612676383,5,True,False,learnmachinelearning,t5_3cqa1,,0,[],1612369000.0,,62 Are high level maths like Calc required ...


In [None]:
#load comments
dfCom=readRedditData('comments.pkl')
dfCom.head()

In [51]:
startDate='23-02-2021'
endDate='23-02-2021'
subreddit='wallstreetbets'
fileName1='bitcoin_submission'
fileName2='bitcoin_comments'
dfSub,dfCom=scrapReddit(startDate,endDate,subreddit,fileName1,fileName2,10,2,11,3)
print(dfSub.shape)
dfSub.head()

<generator object PushshiftAPIMinimal._search at 0x7fc963e5f1d0>




Submissions dataframe shape:
(85, 76)
Submissions file saved
Comments dataframe shape:
(3437, 40)
Comments file saved
(85, 76)


Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,awarders,can_mod_post,contest_mode,created_utc,domain,full_link,gildings,id,is_crosspostable,is_meta,is_original_content,is_reddit_media_domain,is_robot_indexable,is_self,is_video,link_flair_background_color,link_flair_css_class,link_flair_richtext,link_flair_template_id,link_flair_text,link_flair_text_color,link_flair_type,locked,media_only,no_follow,num_comments,num_crossposts,over_18,parent_whitelist_status,permalink,pinned,pwls,removed_by_category,retrieved_on,score,selftext,send_replies,spoiler,stickied,subreddit,subreddit_id,subreddit_subscribers,subreddit_type,suggested_sort,thumbnail,title,total_awards_received,treatment_tags,upvote_ratio,url,whitelist_status,wls,created,thumbnail_height,thumbnail_width,url_overridden_by_dest,author_flair_background_color,author_flair_text_color,edited,post_hint,preview,banned_by,media,media_embed,secure_media,secure_media_embed
0,[],False,yesriskk,,[],,text,t2_3nm3oose,False,False,[],False,False,1614078132,self.wallstreetbets,https://www.reddit.com/r/wallstreetbets/commen...,{},lqfskl,False,False,False,False,False,True,False,#800080,question,"[{'e': 'text', 't': 'Discussion'}]",96f6c79e-b853-11e5-a4cb-0ebdf030e05d,Discussion,light,richtext,False,False,True,0,0,False,some_ads,/r/wallstreetbets/comments/lqfskl/bee_is_way_t...,False,7,moderator,1614235753,3,[removed],True,False,False,wallstreetbets,t5_2th52,9263678,public,confidence,default,BEE is way to underrated,0,[],1.0,https://www.reddit.com/r/wallstreetbets/commen...,some_ads,7,1614078000.0,,,,,,,,,,,,,
1,[],False,Financial-Future-875,,[],,text,t2_9tzx7ugs,False,False,[],False,False,1614078123,self.wallstreetbets,https://www.reddit.com/r/wallstreetbets/commen...,{},lqfshq,False,False,False,False,False,True,False,#800080,question,"[{'e': 'text', 't': 'Discussion'}]",96f6c79e-b853-11e5-a4cb-0ebdf030e05d,Discussion,light,richtext,False,False,True,0,0,False,some_ads,/r/wallstreetbets/comments/lqfshq/a_lot_of_red...,False,7,moderator,1614235748,1,[removed],True,False,False,wallstreetbets,t5_2th52,9263676,public,confidence,default,A lot of red premarkets,0,[],1.0,https://www.reddit.com/r/wallstreetbets/commen...,some_ads,7,1614078000.0,,,,,,,,,,,,,
2,[],False,dbdbdb1999,,[],,text,t2_r4gel43,False,False,[],False,False,1614078110,i.redd.it,https://www.reddit.com/r/wallstreetbets/commen...,{},lqfsbr,False,False,False,True,False,False,False,#014980,meme,"[{'e': 'text', 't': 'Meme'}]",0513bea8-4f64-11e9-886d-0e2b4fe7300c,Meme,light,richtext,False,False,True,0,0,False,some_ads,/r/wallstreetbets/comments/lqfsbr/i_like_the_s...,False,7,moderator,1614235738,1,,True,False,False,wallstreetbets,t5_2th52,9263672,public,confidence,default,I like the stock,0,[],1.0,https://i.redd.it/qgfh02bel7j61.jpg,some_ads,7,1614078000.0,88.0,140.0,https://i.redd.it/qgfh02bel7j61.jpg,,,,,,,,,,
3,"[{'award_sub_type': 'GLOBAL', 'award_type': 'g...",True,AutoModerator,,[],,text,t2_6l4z3,False,True,[],False,False,1614078012,self.wallstreetbets,https://www.reddit.com/r/wallstreetbets/commen...,{'gid_1': 1},lqfr5e,True,False,False,False,True,True,False,#800080,question,"[{'e': 'text', 't': 'Discussion'}]",96f6c79e-b853-11e5-a4cb-0ebdf030e05d,Discussion,light,richtext,False,False,True,1269,0,False,some_ads,/r/wallstreetbets/comments/lqfr5e/unpinned_dai...,False,7,,1614235666,0,Your daily trading discussion thread. Please k...,False,False,False,wallstreetbets,t5_2th52,9263649,public,new,self,Unpinned Daily Discussion Thread for February ...,1,[],0.47,https://www.reddit.com/r/wallstreetbets/commen...,some_ads,7,1614078000.0,,,,,,,,,,,,,
4,"[{'award_sub_type': 'PREMIUM', 'award_type': '...",True,OPINION_IS_UNPOPULAR,,"[{'e': 'text', 't': 'top notch guava flavored ...",top notch guava flavored mango eggplant,richtext,t2_bd6q5,False,True,[],False,False,1614078012,self.wallstreetbets,https://www.reddit.com/r/wallstreetbets/commen...,{'gid_1': 18},lqfr4s,True,False,False,False,True,True,False,#ffd635,daily,"[{'e': 'text', 't': 'Daily Discussion'}]",7a32c644-8394-11e8-87f6-0ee6340c53d4,Daily Discussion,dark,richtext,False,False,False,75971,1,False,some_ads,/r/wallstreetbets/comments/lqfr4s/daily_discus...,False,7,,1614235665,1122,[JPow Semiannual Monetary Policy Report to the...,False,False,False,wallstreetbets,t5_2th52,9263649,public,new,self,"Daily Discussion Thread for February 23, 2021",48,[],0.9,https://www.reddit.com/r/wallstreetbets/commen...,some_ads,7,1614078000.0,,,,,dark,1614094000.0,self,"{'enabled': False, 'images': [{'id': '_vlYgaOL...",,,,,


# Example 2 

##It takes very long to run!!! Below this, you can find example three that is quicker to run. 

It scraps all comments and submissions for very busy subreddit "wallstreetbets" and saves all the files to new directory. 



In [85]:
startDate='21-02-2021'
endDate='23-02-2021'
subreddit='wallstreetbets'
startHour=10 
startMinute=5
endHour=13
endMinute=30
directoryName="Test"
interval="30minutes"
fileName1='test_submission'
fileName2='test_comments'
intervalScrapReddit(startDate,endDate,subreddit,fileName1,fileName2,startHour,startMinute,endHour,endMinute,interval,directoryName)

Creation of the directory failed or it already exists
end date:1613903700
startDate:1613901900




Submissions dataframe shape:
(13, 68)
Comments dataframe shape:
(442, 40)
Interval:0
Submissions Saved correctly 
Comments Saved correctly 

end date:1613905500
startDate:1613903700
Submissions dataframe shape:
(29, 68)
Comments dataframe shape:
(503, 40)
Interval:1
Submissions Saved correctly 
Comments Saved correctly 

end date:1613907300
startDate:1613905500


KeyboardInterrupt: ignored

# Example 3

Same as example 2, just much shorter 

It scraps all comments and submissions for very busy subreddit "wallstreetbets" and saves all the files to new directory. Afterwards all the files are combined into one dataframe. 



In [116]:
startDate='23-02-2021'
endDate='23-02-2021'
subreddit='wallstreetbets'
startHour=10 
startMinute=5
endHour=13
endMinute=30
directoryName="Test2"
interval="30minutes"
fileName1='test2_submission'
fileName2='test2_comments'
intervalScrapReddit(startDate,endDate,subreddit,fileName1,fileName2,startHour,startMinute,endHour,endMinute,interval,directoryName)

Creation of the directory failed or it already exists
end date:1614076500
startDate:1614074700




Submissions dataframe shape:
(37, 69)
Comments dataframe shape:
(1576, 40)
Interval:0
Submissions Saved correctly 
Comments Saved correctly 

end date:1614078300
startDate:1614076500
Submissions dataframe shape:
(44, 77)
Comments dataframe shape:
(1896, 40)
Interval:1
Submissions Saved correctly 
Comments Saved correctly 

end date:1614080100
startDate:1614078300
Submissions dataframe shape:
(50, 77)
Comments dataframe shape:
(1871, 40)
Interval:2
Submissions Saved correctly 
Comments Saved correctly 

end date:1614081900
startDate:1614080100
Submissions dataframe shape:
(73, 75)
Comments dataframe shape:
(1738, 40)
Interval:3
Submissions Saved correctly 
Comments Saved correctly 

end date:1614083700
startDate:1614081900
Submissions dataframe shape:
(43, 76)
Comments dataframe shape:
(1845, 40)
Interval:4
Submissions Saved correctly 
Comments Saved correctly 

end date:1614085500
startDate:1614083700
Submissions dataframe shape:
(51, 78)
Comments dataframe shape:
(1938, 40)
Interval:5

In [117]:
dfSub,dfCom=loadIntervalReddit("Test2")
dfSub.head()

['test2_comments0.pkl', 'test2_comments1.pkl', 'test2_submission0.pkl', 'test2_comments3.pkl', 'test2_comments6.pkl', 'test2_comments5.pkl', 'test2_submission5.pkl', 'test2_submission3.pkl', 'test2_submission2.pkl', 'test2_comments2.pkl', 'test2_submission6.pkl', 'test2_submission1.pkl', 'test2_comments4.pkl', 'test2_submission4.pkl']
['test2_comments0.pkl', 'test2_comments1.pkl', 'test2_comments2.pkl', 'test2_comments3.pkl', 'test2_comments4.pkl', 'test2_comments5.pkl', 'test2_comments6.pkl', 'test2_submission0.pkl', 'test2_submission1.pkl', 'test2_submission2.pkl', 'test2_submission3.pkl', 'test2_submission4.pkl', 'test2_submission5.pkl', 'test2_submission6.pkl']
test2_comments0.pkl
shape of dataframe:
(1576, 40)
temp
(1576, 40)
test2_comments1.pkl
shape of dataframe:
(1896, 40)
temp
(1896, 40)
test2_comments2.pkl
shape of dataframe:
(1871, 40)
temp
(1871, 40)
test2_comments3.pkl
shape of dataframe:
(1738, 40)
temp
(1738, 40)
test2_comments4.pkl
shape of dataframe:
(1845, 40)
temp
(1

Unnamed: 0,all_awardings,associated_award,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,author_patreon_flair,author_premium,awarders,body,collapsed_because_crowd_control,comment_type,created_utc,gildings,id,is_submitter,link_id,locked,no_follow,parent_id,permalink,retrieved_on,score,send_replies,stickied,subreddit,subreddit_id,top_awarded_type,total_awards_received,treatment_tags,created,distinguished,author_cakeday,media_metadata,submission_title
0,[],,duathman,,,[],,,,text,t2_fw3r0,False,False,[],My eardrums are going to explode I’m listening...,,,1614087299,{},gogi772,False,lqfr4s,False,True,t3_lqfr4s,/r/wallstreetbets/comments/lqfr4s/daily_discus...,1614241605,3,True,False,wallstreetbets,t5_2th52,,0,[],1614087000.0,,,,no submission title
1,[],,CASUL_Chris,,,"[{'e': 'text', 't': 'A fucking hero'}]",,A fucking hero,dark,richtext,t2_e7bkj,False,True,[],Fck I lied. I have January 2022 calls for PLTR,,,1614087299,{},gogi76v,False,lqhrqq,False,True,t1_gogi4qg,/r/wallstreetbets/comments/lqhrqq/prayer_threa...,1614241605,2,True,False,wallstreetbets,t5_2th52,,0,[],1614087000.0,,,,no submission title
2,[],,deegethesqueege,,,[],,,,text,t2_11yag4,False,False,[],"All in an effort to take the focus off Melvin,...",,,1614087296,{},gogi707,False,lpweob,False,True,t3_lpweob,/r/wallstreetbets/comments/lpweob/robin_the_ho...,1614241602,1,True,False,wallstreetbets,t5_2th52,,0,[],1614087000.0,,,,no submission title
3,[],,Ireallydontknowbuddy,,,[],,,,text,t2_5vf4yiab,False,False,[],Damn dude that's some balls. Yeah I blew up my...,,,1614087296,{},gogi705,False,lq4udf,False,True,t1_gog7y7m,/r/wallstreetbets/comments/lq4udf/bought_a_car...,1614241602,1,True,False,wallstreetbets,t5_2th52,,0,[],1614087000.0,,,,no submission title
4,[],,_SirCalibur_,,,[],,,,text,t2_qee5g,False,False,[],then i will loose with him \nI will stand wit...,,,1614087296,{},gogi6zc,False,lqhrqq,False,False,t1_gogh3lz,/r/wallstreetbets/comments/lqhrqq/prayer_threa...,1614241602,6,True,False,wallstreetbets,t5_2th52,,0,[],1614087000.0,,,,no submission title


In [118]:
dfCom.head()

Unnamed: 0,all_awardings,associated_award,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,author_fullname,author_patreon_flair,author_premium,awarders,body,collapsed_because_crowd_control,comment_type,created_utc,gildings,id,is_submitter,link_id,locked,no_follow,parent_id,permalink,retrieved_on,score,send_replies,stickied,subreddit,subreddit_id,top_awarded_type,total_awards_received,treatment_tags,created,distinguished,media_metadata,author_cakeday,submission_title
0,[],,baturu,,,[],,,,text,t2_osvqv,False,True,[],"Blessing in disguise, I'm looking for the best...",,,1614076499,{},gog3sct,False,lpl1je,False,False,t1_gog39t8,/r/wallstreetbets/comments/lpl1je/gme_megathre...,1614234688,3,True,False,wallstreetbets,t5_2th52,,0,[],1614076000.0,,,,no submission title
1,[],,el-papes,,,[],,,,text,t2_5dg4qr0,False,True,[],Nah theres always been a shortage and demand i...,,,1614076499,{},gog3scl,False,lpzquu,False,False,t1_gog3mir,/r/wallstreetbets/comments/lpzquu/what_are_you...,1614234688,1,True,False,wallstreetbets,t5_2th52,,0,[],1614076000.0,,,,no submission title
2,[],,Xx360StalinScopedxX,,,[],,,,text,t2_mxh29,False,False,[],Yeah ring the register next time so you can us...,,,1614076498,{},gog3sax,False,lpzquu,False,False,t1_gog3jlj,/r/wallstreetbets/comments/lpzquu/what_are_you...,1614234688,2,True,False,wallstreetbets,t5_2th52,,0,[],1614076000.0,,,,no submission title
3,[],,Krahndaddy,,,[],,,,text,t2_1x0bhm1p,False,False,[],Is it safe to say I'm part of 🐻 gang yet? Or i...,,,1614076497,{},gog3sao,False,lpzquu,False,False,t3_lpzquu,/r/wallstreetbets/comments/lpzquu/what_are_you...,1614234688,2,True,False,wallstreetbets,t5_2th52,,0,[],1614076000.0,,,,no submission title
4,[],,Jessper,,,[],,,,text,t2_7u0lp,False,False,[],You sound like you're celebrating someone wipi...,,,1614076497,{},gog3s9y,False,lpzquu,False,False,t1_gog3omh,/r/wallstreetbets/comments/lpzquu/what_are_you...,1614234688,3,True,False,wallstreetbets,t5_2th52,,0,[],1614076000.0,,,,no submission title
