## Pushshift API Reddit comment scraper
In this notebook we will scrape Reddit comments from Pushshift's API. Pushshift is a database for social media and is particularly known for its broad Reddit archive. Pushshift's API is the more popular choice for scraping larger amounts of data, because Reddit's own API limits how much can be retrieved per time period, which makes it impractical to scrape all data over a longer period. For more information about Pushshift, the database and the API, check out their website (https://pushshift.io/).

The code below iteratively requests comment data from the subreddits Dot and Polkadot, the two most popular subreddits about Polkadot, each with around 17k members at the time of writing. The API returns comments in ascending order of creation time, starting from the 'after' date specified in the URL. Each following request takes the creation date of the most recently scraped comment and uses it as the new 'after' date. In this way the API returns all comments within a timeframe without repetition. From the returned data we extract the useful features and save them to a df. For every month of data, the df is saved to a csv file; this continues until a request returns a comment that exceeds the threshold date. The scraped data covers 18-08-2020 (23:00 UTC) to 18-02-2021 (23:59 UTC).
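
The request loop described above can be sketched without any network access. In this minimal sketch, `make_fake_api` and `collect_timeframe` are hypothetical names, the fake API is an in-memory stand-in for Pushshift, and it is an assumption that 'after' means "created strictly after this epoch".

```python
def make_fake_api(comments, page_size):
    """Hypothetical stand-in for the API (not Pushshift itself).

    Returns a fetch(after) function that yields up to page_size comments
    created strictly after 'after', oldest first (assumed semantics).
    """
    ordered = sorted(comments, key=lambda c: c["created_utc"])

    def fetch(after):
        newer = [c for c in ordered if c["created_utc"] > after]
        return newer[:page_size]

    return fetch


def collect_timeframe(fetch, start_epoch, end_epoch):
    """Page through fetch() until the newest comment passes end_epoch."""
    collected = []
    cursor = start_epoch
    while cursor < end_epoch:
        batch = fetch(cursor)
        if not batch:
            # Nothing newer exists, so stop instead of looping forever.
            break
        collected.extend(batch)
        # The newest comment of this batch becomes the next 'after' value.
        cursor = batch[-1]["created_utc"]
    # The last batch may overshoot the end of the timeframe; trim it.
    return [c for c in collected if c["created_utc"] < end_epoch]


# Seven made-up comments, three per page, timeframe [0, 55).
fake_comments = [{"id": i, "created_utc": t}
                 for i, t in enumerate([10, 20, 30, 40, 50, 60, 70])]
fetch = make_fake_api(fake_comments, page_size=3)
result = collect_timeframe(fetch, start_epoch=0, end_epoch=55)
print([c["created_utc"] for c in result])
```

The second page overshoots the end of the timeframe, which is why the final filter is needed.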


In [1]:
import requests
import ujson as json
import re
import time
import pandas as pd


In [2]:
# This function handles a single request by inserting the parameters into the URL.
# The selected features are then stored in a df, which is returned.

def get_comments(after_date, subreddit="dot", size=1000):
    # Retrieves comments from api.pushshift.io, oldest first, created after 'after_date'.
    PUSHSHIFT_REDDIT_API = (
        "https://api.pushshift.io/reddit/search/comment/"
        f"?subreddit={subreddit}&sort=asc&sort_type=created_utc&after={after_date}&size={size}"
    )

    r = requests.get(PUSHSHIFT_REDDIT_API, timeout=30)

    # Raise on a failed request instead of silently returning None.
    r.raise_for_status()

    # Process the returned data into a DataFrame with the selected features.
    data = json.loads(r.text)['data']
    good_columns = ['author', 'body', 'created_utc', 'id', 'permalink', 'retrieved_on', 'score', 'subreddit']
    if not data:
        # An empty response yields an empty df with the expected columns.
        return pd.DataFrame(columns=good_columns)
    return pd.DataFrame(data)[good_columns]
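
Selecting columns with `[good_columns]` raises a `KeyError` when one of the fields is absent from every record in a batch. The sketch below shows one possible way to tolerate that using `DataFrame.reindex`; the helper name `select_columns` and the sample records are hypothetical, and a shortened column list is used for brevity.

```python
import pandas as pd


def select_columns(records, columns):
    """Build a DataFrame holding exactly 'columns', in that order.

    Fields missing from the records become NaN instead of raising a
    KeyError, and an empty list gives an empty frame with the same columns.
    """
    return pd.DataFrame(records).reindex(columns=columns)


wanted = ['author', 'body', 'created_utc', 'id', 'score']

# Made-up records: neither has a 'score', and both carry an extra field.
sample = [
    {'author': 'alice', 'body': 'first', 'created_utc': 1597791700, 'id': 'a1', 'extra': 1},
    {'author': 'bob', 'body': 'second', 'created_utc': 1597791800, 'id': 'b2', 'extra': 2},
]

df_sample = select_columns(sample, wanted)
df_empty = select_columns([], wanted)
print(df_sample)
```

Whether a missing field should become NaN or stop the run is a design choice. For an unattended scrape, a NaN column is usually easier to inspect afterwards than a crash halfway through a month.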


In [3]:
# This function calls 'get_comments' iteratively until a threshold (end_epoch) is reached.
# It returns a df with the combined results of all 'get_comments' calls.
# Finally, any returned comments that exceed the end_epoch are discarded.

def reddit_comments_timeframe(start_epoch, end_epoch, subreddit):
    # Calls get_comments() iteratively to load all data within the timeframe into a DataFrame.
    df = get_comments(start_epoch, subreddit=subreddit)
    counter = 1
    print(f"{counter} th loop")
    # 'created_utc' of the newest comment so far becomes the next 'after' date.
    new_epoch = str(df['created_utc'].iloc[-1])
    time.sleep(.5)
    while int(new_epoch) < int(end_epoch):
        batch = get_comments(new_epoch, subreddit=subreddit)
        if batch is None or batch.empty:
            # No new rows (or a failed request): stop instead of looping forever.
            break
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
        df = pd.concat([df, batch], ignore_index=True)
        new_epoch = str(df['created_utc'].iloc[-1])
        counter += 1
        print(f"{counter} th loop, {abs(int(new_epoch) - int(end_epoch))} seconds to check for comments")
        time.sleep(.5)
    return df[df['created_utc'] < int(end_epoch)]
    

In [4]:
# The following lists contain the csv file names that will later be used to save the data.
csv_names_list_dot = ['Reddit_august_dot.csv', 'Reddit_september_dot.csv', 'Reddit_october_dot.csv', 'Reddit_november_dot.csv',
                      'Reddit_december_dot.csv', 'Reddit_january_dot.csv', 'Reddit_february_dot.csv']
csv_names_list_Polkadot = ['Reddit_august_Polkadot.csv', 'Reddit_september_Polkadot.csv', 'Reddit_october_Polkadot.csv',
                           'Reddit_november_Polkadot.csv', 'Reddit_december_Polkadot.csv', 'Reddit_january_Polkadot.csv', 'Reddit_february_Polkadot.csv']

# The following list contains the start and end times expressed in Unix epoch seconds (UTC).
# If item i is the start time, item i+1 is the end time. The list therefore contains 8 items that form 7 timeframes.
epoch_times = ["1597791600", "1598918400", "1601510400", "1604188800", "1606780800", "1609459200", "1612137600", "1613692800"]
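
For reference, the epoch values above can be reproduced from calendar dates with the standard library. This is a small sketch in which `to_epoch` and `boundaries` are hypothetical names: the first boundary is 18-08-2020 23:00 UTC, the middle ones are the first of each month at 00:00 UTC, and the last is 19-02-2021 00:00 UTC, which acts as an exclusive end.

```python
from datetime import datetime, timezone


def to_epoch(year, month, day, hour=0):
    """Return the Unix epoch (whole seconds, as a string) of a UTC date."""
    moment = datetime(year, month, day, hour, tzinfo=timezone.utc)
    return str(int(moment.timestamp()))


# Timeframe boundaries as (year, month, day, hour) in UTC.
boundaries = [
    (2020, 8, 18, 23),
    (2020, 9, 1, 0),
    (2020, 10, 1, 0),
    (2020, 11, 1, 0),
    (2020, 12, 1, 0),
    (2021, 1, 1, 0),
    (2021, 2, 1, 0),
    (2021, 2, 19, 0),
]

derived_epochs = [to_epoch(*b) for b in boundaries]

# Consecutive items pair up into (start, end) timeframes.
timeframes = list(zip(derived_epochs, derived_epochs[1:]))
print(derived_epochs)
print(len(timeframes))
```

Passing `tzinfo=timezone.utc` matters here: a naive `datetime` would be interpreted in the local timezone of the machine running the notebook.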


In [5]:
for i, x in enumerate(csv_names_list_dot):
    df = reddit_comments_timeframe(epoch_times[i], epoch_times[i+1], "dot")
    df.to_csv(x, sep=',', index=False)
    df = reddit_comments_timeframe(epoch_times[i], epoch_times[i+1], "Polkadot")
    df.to_csv(csv_names_list_Polkadot[i], sep=',', index=False)

1 th loop
2 th loop, -705162 to go
3 th loop, -525895 to go
4 th loop, -370231 to go
5 th loop, -267736 to go
6 th loop, -142252 to go
7 th loop, 180251 to go
1 th loop
2 th loop, -210527 to go
3 th loop, 815497 to go
1 th loop
2 th loop, -2161485 to go
3 th loop, -1921859 to go
4 th loop, -1754981 to go
5 th loop, -1401389 to go
6 th loop, -1016329 to go
7 th loop, -596352 to go
8 th loop, -198037 to go
9 th loop, 290757 to go
1 th loop
2 th loop, 75084 to go
1 th loop
2 th loop, -2001460 to go
3 th loop, -1534065 to go
4 th loop, -1312709 to go
5 th loop, -863528 to go
6 th loop, -343082 to go
7 th loop, 105598 to go
1 th loop
2 th loop, 1381597 to go
1 th loop
2 th loop, -1581976 to go
3 th loop, -596740 to go
4 th loop, 198618 to go
1 th loop
2 th loop, -188390 to go
3 th loop, 662139 to go
1 th loop
2 th loop, -1966756 to go
3 th loop, -1265713 to go
4 th loop, -792680 to go
5 th loop, -178147 to go
6 th loop, 12911 to go
1 th loop
2 th loop, -1292021 to go
3 th loop, -402348 to g