## Pushshift API Reddit comment scraper
In this notebook we will scrape Reddit comments from Pushshift's API, Pushshift is a database for social media and is particularly known for having a broad Reddit database. Pushshift's API is more popular for scraping larger amounts of data, while Reddit's API restricts the scraping per time period and such makes it impossible to scrape all data from a larger time period. For more information about Pushshift, the database and the API, check out their website (https://pushshift.io/).

The following code will consist out of iteratively requesting comment data from Cardano's subreddit, the most popular subreddit about Cardano with around 237k members at the time of writing. The API returns all the newest comments ascendingly from a the 'after date', specified in the link. A following request will take the date from the most recent scraped post and add this to the request. In this way the API returns all post between a timeframe without repition. From the requested data, we will extract the usefull features and save that to a df. For every month of data, the df will be saved to a csv file iteratively, this continues until the last request will return a comment that exceeds the threshold date. The scraped data will contain data between 01-01-2019 (00:00 UTC) to 28-02-2021 (23:59 UTC).


In [1]:
import requests
import ujson as json
import re
import time
import pandas as pd


In [2]:
# This function handles a single request by inserting the parameters to the URL.
# Continuously the selected features are saved to a df and returned.

def get_comments(after_date, subreddit = "cardano", size = 50):        
    #retrieves comments from api.pushshift.io
    PUSHSHIFT_REDDIT_API = \
    f"https://api.pushshift.io/reddit/search/comment/?subreddit={subreddit}&sort=asc&sort_type=created_utc&after={after_date}&size={size}"
    
    r = requests.get(PUSHSHIFT_REDDIT_API, timeout=40)

    # Check the status code, if successful, process the data to DataFrame
    if r.status_code == 200:
        response = json.loads(r.text)
        data = response['data']
        good_features =   ["author", "body", "created_utc", "id", "score", "subreddit"]
        df = pd.DataFrame(data)[good_features]
        return df


In [3]:
# This function will call the 'get_comments' function iteratively until a threshold (end_epoch) is reached.
# The function returns a df with all combined returns of 'get_comments'
# Finally, the possible comments that were returned that exceeded the end_epoch, are discarted. 

def reddit_comments_timeframe(start_epoch, end_epoch, subreddit):
    #Calls get_comments() iteratively to load all data into DataFrame within timeframe.
    df = get_comments(start_epoch, subreddit = subreddit)
    counter = 1
    print(f"{counter} st loop")
    new_epoch = str(df.iloc[-1,2])
    time.sleep(.5)
    while int(new_epoch) < int(end_epoch):
        df = df.append(get_comments(new_epoch, subreddit = subreddit))
        new_epoch = str(df.iloc[-1,2])
        counter += 1
        print(f"{counter} th loop, {abs(int(new_epoch) - int(end_epoch))} seconds in time period to check for comments")
        time.sleep(1.25)
    return df[df['created_utc'] < int(end_epoch)]
    

In [4]:
# The following lists contain the csv file names that will be used to save the data to later.
csv_names_list = ['R_ada_march_18.csv', 'R_ada_april_18.csv', 'R_ada_may_18.csv', 'R_ada_june_18.csv', 'R_ada_july_18.csv', 'R_ada_august_18.csv',\
                  'R_ada_september_18.csv', 'R_ada_october_18.csv', 'R_ada_november_18.csv', 'R_ada_december_18.csv','R_ada_january_19.csv',\
                  'R_ada_febrauri_19.csv', 'R_ada_march_19.csv', 'R_ada_april_19.csv', 'R_ada_may_19.csv', 'R_ada_june_19.csv', 'R_ada_july_19.csv',\
                  'R_ada_august_19.csv', 'R_ada_september_19.csv', 'R_ada_october_19.csv', 'R_ada_november_19.csv', 'R_ada_december_19.csv',\
                  'R_ada_january_20.csv', 'R_ada_febrauri_20.csv', 'R_ada_march_20.csv', 'R_ada_april_20.csv', 'R_ada_may_20.csv', 'R_ada_june_20.csv',\
                  'R_ada_july_20.csv', 'R_ada_august_20.csv', 'R_ada_september_20.csv', 'R_ada_october_20.csv', 'R_ada_november_20.csv',\
                  'R_ada_december_20.csv', 'R_ada_january_21.csv', 'R_ada_february_21.csv', 'R_ada_march_21.csv','R_ada_april_21.csv', 'R_ada_may_21.csv']


# The follwing list contains the start and end time expressed in Unix epoch. 
# If i is the start time, i+1 is the end time. Therefore the list contains 8 items to form 7 timeframes.
epoch_times = ["1519862400", "1522540800", "1525132800", "1527811200", "1530403200", "1533081600", "1535760000", "1538352000", "1541030400", "1543622400",\
               "1546300800", "1548979200", "1551398400", "1554076800", "1556668800", "1559347200", "1561939200", "1564617600", "1567296000", "1569888000",\
               "1572566400", "1575158400", "1577836800", "1580515200", "1583020800", "1585699200", "1588291200", "1590969600", "1593561600", "1596240000",\
               "1598918400", "1601510400", "1604188800","1606780800", "1609459200", "1612137600", "1614556800", "1617235200", "1619827200","1622505600"]


In [18]:
for i, x in enumerate(csv_names_list):
    df = reddit_comments_timeframe(epoch_times[i], epoch_times[i+1], "cardano")
    df.to_csv(x, sep=',', index=False)

1 st loop
2 th loop, 1255081 seconds in time period to check for comments
3 th loop, 1252010 seconds in time period to check for comments
4 th loop, 1250607 seconds in time period to check for comments
5 th loop, 1247974 seconds in time period to check for comments
6 th loop, 1244626 seconds in time period to check for comments
7 th loop, 1241793 seconds in time period to check for comments
8 th loop, 1238206 seconds in time period to check for comments
9 th loop, 1236102 seconds in time period to check for comments
10 th loop, 1234032 seconds in time period to check for comments
11 th loop, 1231988 seconds in time period to check for comments
12 th loop, 1230062 seconds in time period to check for comments
13 th loop, 1227918 seconds in time period to check for comments
14 th loop, 1224728 seconds in time period to check for comments
15 th loop, 1222728 seconds in time period to check for comments
16 th loop, 1220466 seconds in time period to check for comments
17 th loop, 1217490 sec

128 th loop, 987553 seconds in time period to check for comments
129 th loop, 986185 seconds in time period to check for comments
130 th loop, 982675 seconds in time period to check for comments
131 th loop, 979361 seconds in time period to check for comments
132 th loop, 977512 seconds in time period to check for comments
133 th loop, 975079 seconds in time period to check for comments
134 th loop, 972390 seconds in time period to check for comments
135 th loop, 968920 seconds in time period to check for comments
136 th loop, 966923 seconds in time period to check for comments
137 th loop, 964211 seconds in time period to check for comments
138 th loop, 962842 seconds in time period to check for comments
139 th loop, 959486 seconds in time period to check for comments
140 th loop, 956500 seconds in time period to check for comments
141 th loop, 954458 seconds in time period to check for comments
142 th loop, 952415 seconds in time period to check for comments
143 th loop, 950909 secon

255 th loop, 787122 seconds in time period to check for comments
256 th loop, 786697 seconds in time period to check for comments
257 th loop, 786074 seconds in time period to check for comments
258 th loop, 785586 seconds in time period to check for comments
259 th loop, 785131 seconds in time period to check for comments
260 th loop, 784185 seconds in time period to check for comments
261 th loop, 783348 seconds in time period to check for comments
262 th loop, 782367 seconds in time period to check for comments
263 th loop, 781553 seconds in time period to check for comments
264 th loop, 780881 seconds in time period to check for comments
265 th loop, 780159 seconds in time period to check for comments
266 th loop, 779075 seconds in time period to check for comments
267 th loop, 778416 seconds in time period to check for comments
268 th loop, 777501 seconds in time period to check for comments
269 th loop, 776605 seconds in time period to check for comments
270 th loop, 775724 secon

382 th loop, 600820 seconds in time period to check for comments
383 th loop, 598290 seconds in time period to check for comments
384 th loop, 597109 seconds in time period to check for comments
385 th loop, 595970 seconds in time period to check for comments
386 th loop, 594410 seconds in time period to check for comments
387 th loop, 592459 seconds in time period to check for comments
388 th loop, 590390 seconds in time period to check for comments
389 th loop, 589430 seconds in time period to check for comments
390 th loop, 587787 seconds in time period to check for comments
391 th loop, 586365 seconds in time period to check for comments
392 th loop, 584975 seconds in time period to check for comments
393 th loop, 583896 seconds in time period to check for comments
394 th loop, 582603 seconds in time period to check for comments
395 th loop, 581566 seconds in time period to check for comments
396 th loop, 580763 seconds in time period to check for comments
397 th loop, 579777 secon

509 th loop, 421981 seconds in time period to check for comments
510 th loop, 421114 seconds in time period to check for comments
511 th loop, 420433 seconds in time period to check for comments
512 th loop, 419342 seconds in time period to check for comments
513 th loop, 418357 seconds in time period to check for comments
514 th loop, 416821 seconds in time period to check for comments
515 th loop, 416053 seconds in time period to check for comments
516 th loop, 415224 seconds in time period to check for comments
517 th loop, 414159 seconds in time period to check for comments
518 th loop, 413118 seconds in time period to check for comments
519 th loop, 412328 seconds in time period to check for comments
520 th loop, 411374 seconds in time period to check for comments
521 th loop, 410466 seconds in time period to check for comments
522 th loop, 409541 seconds in time period to check for comments
523 th loop, 408668 seconds in time period to check for comments
524 th loop, 407671 secon

636 th loop, 239739 seconds in time period to check for comments
637 th loop, 238405 seconds in time period to check for comments
638 th loop, 236532 seconds in time period to check for comments
639 th loop, 234943 seconds in time period to check for comments
640 th loop, 233713 seconds in time period to check for comments
641 th loop, 231529 seconds in time period to check for comments
642 th loop, 229353 seconds in time period to check for comments
643 th loop, 228209 seconds in time period to check for comments
644 th loop, 226812 seconds in time period to check for comments
645 th loop, 225649 seconds in time period to check for comments
646 th loop, 223641 seconds in time period to check for comments
647 th loop, 222248 seconds in time period to check for comments
648 th loop, 221238 seconds in time period to check for comments
649 th loop, 220265 seconds in time period to check for comments
650 th loop, 219128 seconds in time period to check for comments
651 th loop, 218331 secon

763 th loop, 142486 seconds in time period to check for comments
764 th loop, 141888 seconds in time period to check for comments
765 th loop, 141271 seconds in time period to check for comments
766 th loop, 140856 seconds in time period to check for comments
767 th loop, 140316 seconds in time period to check for comments
768 th loop, 139685 seconds in time period to check for comments
769 th loop, 138953 seconds in time period to check for comments
770 th loop, 138289 seconds in time period to check for comments
771 th loop, 137701 seconds in time period to check for comments
772 th loop, 137318 seconds in time period to check for comments
773 th loop, 136792 seconds in time period to check for comments
774 th loop, 136150 seconds in time period to check for comments
775 th loop, 135590 seconds in time period to check for comments
776 th loop, 135052 seconds in time period to check for comments
777 th loop, 134385 seconds in time period to check for comments
778 th loop, 133764 secon

KeyError: "None of [Index(['author', 'body', 'created_utc', 'id', 'score', 'subreddit'], dtype='object')] are in the [columns]"