## Pushshift API Reddit comment scraper
In this notebook we scrape Reddit comments through Pushshift's API. Pushshift is a database for social media and is particularly known for its broad Reddit archive. Its API is the more popular choice for scraping large amounts of data, because Reddit's own API restricts how much can be scraped per time period, which makes it impossible to collect all data from a longer time span. For more information about Pushshift, the database and the API, check out their website (https://pushshift.io/).

The following code iteratively requests comment data from Cardano's subreddit, the most popular subreddit about Cardano with around 237k members at the time of writing. The API returns comments in ascending order of creation time, starting from the 'after' date specified in the URL. Each following request takes the timestamp of the most recently scraped comment and uses it as the new 'after' date. In this way the API returns all comments within a timeframe without repetition. From the requested data, we extract the useful features and save them to a DataFrame (df). For every month of data, the df is saved to a CSV file; this continues until a request returns a comment that exceeds the threshold date. The scraped data covers 01-01-2019 (00:00 UTC) to 28-02-2021 (23:59 UTC).
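The cursor-style paging described above can be illustrated without any network access. The sketch below is a minimal example under stated assumptions: `paginate`, `fake_fetch` and `FAKE_COMMENTS` are hypothetical names that stand in for the real API, and the page size of 3 is arbitrary.

```python
from typing import Callable, Dict, List

Comment = Dict[str, int]


def paginate(fetch_page: Callable[[int], List[Comment]], start: int, end: int) -> List[Comment]:
    """Collect comments created after `start` and before `end`, oldest first."""
    collected: List[Comment] = []
    after = start
    while after < end:
        page = fetch_page(after)
        if not page:
            break  # nothing newer is available
        collected.extend(page)
        # The newest timestamp of this page becomes the cursor of the next request.
        after = page[-1]["created_utc"]
    # The last page may overshoot the end of the timeframe, so trim it.
    return [c for c in collected if c["created_utc"] < end]


# Hypothetical stand-in for the API: ten comments, one every 100 seconds.
FAKE_COMMENTS = [{"id": i, "created_utc": 1000 + 100 * i} for i in range(10)]


def fake_fetch(after: int, size: int = 3) -> List[Comment]:
    """Return at most `size` comments strictly newer than `after`, oldest first."""
    newer = [c for c in FAKE_COMMENTS if c["created_utc"] > after]
    return newer[:size]


result = paginate(fake_fetch, start=999, end=1600)
print([c["id"] for c in result])  # [0, 1, 2, 3, 4, 5]
```

Because the cursor is a timestamp and the comparison is strict, comments that share the exact second of a page boundary could be skipped; whether this matters depends on how dense the comments are.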


In [1]:
import requests
import ujson as json
import re
import time
import pandas as pd


In [2]:
# This function handles a single request by inserting the parameters into the URL.
# The selected features are then saved to a df and returned.

def get_comments(after_date, subreddit="cardano", size=1000):
    # Retrieves up to 'size' comments from api.pushshift.io, oldest first, created after 'after_date'.
    PUSHSHIFT_REDDIT_API = (
        "https://api.pushshift.io/reddit/search/comment/"
        f"?subreddit={subreddit}&sort=asc&sort_type=created_utc&after={after_date}&size={size}"
    )

    r = requests.get(PUSHSHIFT_REDDIT_API, timeout=30)

    # Fail loudly on an unsuccessful request instead of silently returning None.
    if r.status_code != 200:
        raise RuntimeError(f"Pushshift request failed with status {r.status_code}")

    # Process the response data into a DataFrame with the selected columns.
    response = json.loads(r.text)
    data = response['data']
    good_columns = ['author', 'body', 'created_utc', 'id', 'permalink', 'retrieved_on', 'score', 'subreddit']
    # reindex keeps the column order and tolerates an empty or incomplete response.
    df = pd.DataFrame(data).reindex(columns=good_columns)
    return df
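As a side note on the URL construction, the query string can also be assembled from a dictionary, which handles escaping of special characters. This is a minimal sketch using only the standard library; `build_comment_url` and `BASE_URL` are hypothetical names, and the parameters are simply the ones used in the request above.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.pushshift.io/reddit/search/comment/"


def build_comment_url(after_date, subreddit="cardano", size=1000):
    """Build the request URL from its query parameters."""
    params = {
        "subreddit": subreddit,
        "sort": "asc",
        "sort_type": "created_utc",
        "after": after_date,
        "size": size,
    }
    # urlencode escapes the values and joins the pairs with '&'.
    return f"{BASE_URL}?{urlencode(params)}"


url = build_comment_url("1546300800")
print(url)
```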


In [3]:
# This function calls the 'get_comments' function iteratively until a threshold (end_epoch) is reached.
# It returns a df that combines all results of 'get_comments'.
# Finally, any returned comments that were created at or after end_epoch are discarded.

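def reddit_comments_timeframe(start_epoch, end_epoch, subreddit):
    # Calls get_comments() iteratively to load all data within the timeframe into one DataFrame.
    df = get_comments(start_epoch, subreddit=subreddit)
    counter = 1
    print(f"{counter} st loop")
    new_epoch = str(df['created_utc'].iloc[-1])
    time.sleep(.5)
    while int(new_epoch) < int(end_epoch):
        batch = get_comments(new_epoch, subreddit=subreddit)
        if batch.empty:
            # No newer comments are available, so stop instead of looping forever.
            break
        # DataFrame.append was removed in pandas 2.0; pd.concat is its replacement.
        df = pd.concat([df, batch], ignore_index=True)
        new_epoch = str(df['created_utc'].iloc[-1])
        counter += 1
        print(f"{counter} th loop, {abs(int(new_epoch) - int(end_epoch))} seconds in time period to check for comments")
        time.sleep(.5)
    # Drop the comments of the last batch that fall outside the timeframe.
    return df[df['created_utc'] < int(end_epoch)]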
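The combine-and-trim step can be tried on toy data. The sketch below assumes two hypothetical batches (`batches`) with made-up ids and timestamps; it collects them in a list, concatenates once with `pd.concat`, and then applies the same `created_utc < end_epoch` filter.

```python
import pandas as pd

# Hypothetical batches, as two consecutive requests might return them.
batches = [
    pd.DataFrame({"id": ["a", "b"], "created_utc": [100, 200]}),
    pd.DataFrame({"id": ["c", "d"], "created_utc": [300, 400]}),
]
end_epoch = 400

# Concatenate all batches in one call and renumber the rows.
combined = pd.concat(batches, ignore_index=True)

# Keep only the comments created before the end of the timeframe.
trimmed = combined[combined["created_utc"] < end_epoch]
print(trimmed["id"].tolist())  # ['a', 'b', 'c']
```

Concatenating once at the end avoids copying the growing DataFrame on every iteration, which may matter for months with hundreds of requests.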
    

In [4]:
# The following list contains the CSV file names that will be used to save the data later.
csv_names_list = ['R_ada_january_19.csv', 'R_ada_febrauri_19.csv', 'R_ada_march_19.csv', 'R_ada_april_19.csv', 'R_ada_may_19.csv', 'R_ada_june_19.csv',\
                  'R_ada_july_19.csv', 'R_ada_august_19.csv', 'R_ada_september_19.csv', 'R_ada_october_19.csv', 'R_ada_november_19.csv',\
                  'R_ada_december_19.csv', 'R_ada_january_20.csv', 'R_ada_febrauri_20.csv', 'R_ada_march_20.csv', 'R_ada_april_20.csv', 'R_ada_may_20.csv',\
                  'R_ada_june_20.csv', 'R_ada_july_20.csv', 'R_ada_august_20.csv', 'R_ada_september_20.csv', 'R_ada_october_20.csv',\
                  'R_ada_november_20.csv', 'R_ada_december_20.csv', 'R_ada_january_21.csv', 'R_ada_february_21.csv']


# The following list contains the start and end times expressed in Unix epoch (seconds, UTC).
# If item i is the start time, item i+1 is the end time. Therefore the list contains 27 items to form 26 monthly timeframes.
epoch_times = ["1546300800", "1548979200", "1551398400", "1554076800", "1556668800", "1559347200", "1561939200", "1564617600", "1567296000", "1569888000",\
               "1572566400", "1575158400", "1577836800", "1580515200", "1583020800", "1585699200", "1588291200", "1590969600", "1593561600", "1596240000",\
               "1598918400", "1601510400", "1604188800","1606780800", "1609459200", "1612137600", "1614556800"]

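The month boundaries above can also be derived instead of typed by hand. This is a minimal sketch using only the standard library; `month_start_epochs` is a hypothetical helper, and it assumes every boundary is the first day of a month at 00:00 UTC, which matches the values in the list.

```python
from datetime import datetime, timezone


def month_start_epochs(start_year, start_month, n_boundaries):
    """Return Unix epochs (as strings) for consecutive month starts at 00:00 UTC."""
    epochs = []
    year, month = start_year, start_month
    for _ in range(n_boundaries):
        dt = datetime(year, month, 1, tzinfo=timezone.utc)
        epochs.append(str(int(dt.timestamp())))
        # Move to the next month, rolling over into the next year after December.
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return epochs


# 27 boundaries give 26 monthly timeframes: January 2019 up to and including February 2021.
generated = month_start_epochs(2019, 1, 27)
print(generated[0], generated[-1], len(generated))  # 1546300800 1614556800 27
```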

In [5]:
for i, x in enumerate(csv_names_list):
    df = reddit_comments_timeframe(epoch_times[i], epoch_times[i+1], "cardano")
    df.to_csv(f"F:/Thesis database/Cardano/reddit/{x}", sep=',', index=False)

1 st loop
2 th loop, 2542724 seconds in time period to check for comments
3 th loop, 2466168 seconds in time period to check for comments
4 th loop, 2435776 seconds in time period to check for comments
5 th loop, 2382073 seconds in time period to check for comments
6 th loop, 2343406 seconds in time period to check for comments
7 th loop, 2280925 seconds in time period to check for comments
8 th loop, 2211804 seconds in time period to check for comments
9 th loop, 2159039 seconds in time period to check for comments
10 th loop, 2116455 seconds in time period to check for comments
11 th loop, 2080285 seconds in time period to check for comments
12 th loop, 2031178 seconds in time period to check for comments
13 th loop, 1987366 seconds in time period to check for comments
14 th loop, 1936373 seconds in time period to check for comments
15 th loop, 1910370 seconds in time period to check for comments
16 th loop, 1856124 seconds in time period to check for comments
17 th loop, 1824182 sec

9 th loop, 2043234 seconds in time period to check for comments
10 th loop, 1969150 seconds in time period to check for comments
11 th loop, 1928714 seconds in time period to check for comments
12 th loop, 1894135 seconds in time period to check for comments
13 th loop, 1827381 seconds in time period to check for comments
14 th loop, 1792595 seconds in time period to check for comments
15 th loop, 1739645 seconds in time period to check for comments
16 th loop, 1684992 seconds in time period to check for comments
17 th loop, 1636770 seconds in time period to check for comments
18 th loop, 1558271 seconds in time period to check for comments
19 th loop, 1483643 seconds in time period to check for comments
20 th loop, 1424072 seconds in time period to check for comments
21 th loop, 1360814 seconds in time period to check for comments
22 th loop, 1244617 seconds in time period to check for comments
23 th loop, 1200996 seconds in time period to check for comments
24 th loop, 1153954 second

33 th loop, 649173 seconds in time period to check for comments
34 th loop, 606702 seconds in time period to check for comments
35 th loop, 525415 seconds in time period to check for comments
36 th loop, 459265 seconds in time period to check for comments
37 th loop, 385445 seconds in time period to check for comments
38 th loop, 352799 seconds in time period to check for comments
39 th loop, 288189 seconds in time period to check for comments
40 th loop, 226941 seconds in time period to check for comments
41 th loop, 173532 seconds in time period to check for comments
42 th loop, 122694 seconds in time period to check for comments
43 th loop, 94763 seconds in time period to check for comments
44 th loop, 13396 seconds in time period to check for comments
45 th loop, 54750 seconds in time period to check for comments
1 st loop
2 th loop, 2519720 seconds in time period to check for comments
3 th loop, 2453981 seconds in time period to check for comments
4 th loop, 2388746 seconds in tim

28 th loop, 82058 seconds in time period to check for comments
29 th loop, 27489 seconds in time period to check for comments
1 st loop
2 th loop, 2416623 seconds in time period to check for comments
3 th loop, 2375629 seconds in time period to check for comments
4 th loop, 2337264 seconds in time period to check for comments
5 th loop, 2260993 seconds in time period to check for comments
6 th loop, 2174507 seconds in time period to check for comments
7 th loop, 2084935 seconds in time period to check for comments
8 th loop, 2029921 seconds in time period to check for comments
9 th loop, 1974225 seconds in time period to check for comments
10 th loop, 1884002 seconds in time period to check for comments
11 th loop, 1795666 seconds in time period to check for comments
12 th loop, 1716374 seconds in time period to check for comments
13 th loop, 1646587 seconds in time period to check for comments
14 th loop, 1587593 seconds in time period to check for comments
15 th loop, 1543730 seconds

11 th loop, 2251463 seconds in time period to check for comments
12 th loop, 2226684 seconds in time period to check for comments
13 th loop, 2204113 seconds in time period to check for comments
14 th loop, 2183571 seconds in time period to check for comments
15 th loop, 2167296 seconds in time period to check for comments
16 th loop, 2122007 seconds in time period to check for comments
17 th loop, 2084092 seconds in time period to check for comments
18 th loop, 2053274 seconds in time period to check for comments
19 th loop, 2013640 seconds in time period to check for comments
20 th loop, 1960989 seconds in time period to check for comments
21 th loop, 1923813 seconds in time period to check for comments
22 th loop, 1883101 seconds in time period to check for comments
23 th loop, 1842400 seconds in time period to check for comments
24 th loop, 1815357 seconds in time period to check for comments
25 th loop, 1779475 seconds in time period to check for comments
26 th loop, 1754028 secon

28 th loop, 525142 seconds in time period to check for comments
29 th loop, 469183 seconds in time period to check for comments
30 th loop, 429796 seconds in time period to check for comments
31 th loop, 355294 seconds in time period to check for comments
32 th loop, 222651 seconds in time period to check for comments
33 th loop, 102522 seconds in time period to check for comments
34 th loop, 73376 seconds in time period to check for comments
35 th loop, 28011 seconds in time period to check for comments
36 th loop, 6907 seconds in time period to check for comments
1 st loop
2 th loop, 2491155 seconds in time period to check for comments
3 th loop, 2424517 seconds in time period to check for comments
4 th loop, 2348316 seconds in time period to check for comments
5 th loop, 2251447 seconds in time period to check for comments
6 th loop, 2176704 seconds in time period to check for comments
7 th loop, 2097607 seconds in time period to check for comments
8 th loop, 2045586 seconds in time

20 th loop, 2044525 seconds in time period to check for comments
21 th loop, 2019773 seconds in time period to check for comments
22 th loop, 1988495 seconds in time period to check for comments
23 th loop, 1972520 seconds in time period to check for comments
24 th loop, 1937902 seconds in time period to check for comments
25 th loop, 1925069 seconds in time period to check for comments
26 th loop, 1908016 seconds in time period to check for comments
27 th loop, 1881330 seconds in time period to check for comments
28 th loop, 1848944 seconds in time period to check for comments
29 th loop, 1834313 seconds in time period to check for comments
30 th loop, 1808937 seconds in time period to check for comments
31 th loop, 1768458 seconds in time period to check for comments
32 th loop, 1721451 seconds in time period to check for comments
33 th loop, 1684263 seconds in time period to check for comments
34 th loop, 1652067 seconds in time period to check for comments
35 th loop, 1629173 secon

59 th loop, 1367759 seconds in time period to check for comments
60 th loop, 1348132 seconds in time period to check for comments
61 th loop, 1311334 seconds in time period to check for comments
62 th loop, 1281966 seconds in time period to check for comments
63 th loop, 1247119 seconds in time period to check for comments
64 th loop, 1231517 seconds in time period to check for comments
65 th loop, 1210766 seconds in time period to check for comments
66 th loop, 1172131 seconds in time period to check for comments
67 th loop, 1148741 seconds in time period to check for comments
68 th loop, 1126630 seconds in time period to check for comments
69 th loop, 1092829 seconds in time period to check for comments
70 th loop, 1064876 seconds in time period to check for comments
71 th loop, 1028601 seconds in time period to check for comments
72 th loop, 978136 seconds in time period to check for comments
73 th loop, 953432 seconds in time period to check for comments
74 th loop, 906187 seconds 

64 th loop, 1841634 seconds in time period to check for comments
65 th loop, 1832453 seconds in time period to check for comments
66 th loop, 1825819 seconds in time period to check for comments
67 th loop, 1816945 seconds in time period to check for comments
68 th loop, 1806833 seconds in time period to check for comments
69 th loop, 1791536 seconds in time period to check for comments
70 th loop, 1770743 seconds in time period to check for comments
71 th loop, 1758374 seconds in time period to check for comments
72 th loop, 1746097 seconds in time period to check for comments
73 th loop, 1737159 seconds in time period to check for comments
74 th loop, 1722689 seconds in time period to check for comments
75 th loop, 1688081 seconds in time period to check for comments
76 th loop, 1673424 seconds in time period to check for comments
77 th loop, 1661411 seconds in time period to check for comments
78 th loop, 1653646 seconds in time period to check for comments
79 th loop, 1644012 secon

29 th loop, 1615555 seconds in time period to check for comments
30 th loop, 1563293 seconds in time period to check for comments
31 th loop, 1506760 seconds in time period to check for comments
32 th loop, 1477505 seconds in time period to check for comments
33 th loop, 1418053 seconds in time period to check for comments
34 th loop, 1385206 seconds in time period to check for comments
35 th loop, 1333281 seconds in time period to check for comments
36 th loop, 1308126 seconds in time period to check for comments
37 th loop, 1256674 seconds in time period to check for comments
38 th loop, 1235074 seconds in time period to check for comments
39 th loop, 1146114 seconds in time period to check for comments
40 th loop, 1112242 seconds in time period to check for comments
41 th loop, 1067955 seconds in time period to check for comments
42 th loop, 1021078 seconds in time period to check for comments
43 th loop, 966389 seconds in time period to check for comments
44 th loop, 911103 seconds

27 th loop, 1395234 seconds in time period to check for comments
28 th loop, 1370033 seconds in time period to check for comments
29 th loop, 1338484 seconds in time period to check for comments
30 th loop, 1321615 seconds in time period to check for comments
31 th loop, 1300664 seconds in time period to check for comments
32 th loop, 1275340 seconds in time period to check for comments
33 th loop, 1247059 seconds in time period to check for comments
34 th loop, 1225822 seconds in time period to check for comments
35 th loop, 1192757 seconds in time period to check for comments
36 th loop, 1148568 seconds in time period to check for comments
37 th loop, 1121730 seconds in time period to check for comments
38 th loop, 1083500 seconds in time period to check for comments
39 th loop, 1047193 seconds in time period to check for comments
40 th loop, 1010834 seconds in time period to check for comments
41 th loop, 976500 seconds in time period to check for comments
42 th loop, 957715 seconds

74 th loop, 725692 seconds in time period to check for comments
75 th loop, 716259 seconds in time period to check for comments
76 th loop, 706204 seconds in time period to check for comments
77 th loop, 690822 seconds in time period to check for comments
78 th loop, 653262 seconds in time period to check for comments
79 th loop, 629308 seconds in time period to check for comments
80 th loop, 606876 seconds in time period to check for comments
81 th loop, 573416 seconds in time period to check for comments
82 th loop, 532808 seconds in time period to check for comments
83 th loop, 509444 seconds in time period to check for comments
84 th loop, 488249 seconds in time period to check for comments
85 th loop, 462663 seconds in time period to check for comments
86 th loop, 446890 seconds in time period to check for comments
87 th loop, 428234 seconds in time period to check for comments
88 th loop, 396793 seconds in time period to check for comments
89 th loop, 376072 seconds in time perio

91 th loop, 1652650 seconds in time period to check for comments
92 th loop, 1642141 seconds in time period to check for comments
93 th loop, 1630947 seconds in time period to check for comments
94 th loop, 1611122 seconds in time period to check for comments
95 th loop, 1595160 seconds in time period to check for comments
96 th loop, 1581690 seconds in time period to check for comments
97 th loop, 1571309 seconds in time period to check for comments
98 th loop, 1563291 seconds in time period to check for comments
99 th loop, 1548395 seconds in time period to check for comments
100 th loop, 1532508 seconds in time period to check for comments
101 th loop, 1514501 seconds in time period to check for comments
102 th loop, 1504974 seconds in time period to check for comments
103 th loop, 1499977 seconds in time period to check for comments
104 th loop, 1492394 seconds in time period to check for comments
105 th loop, 1485366 seconds in time period to check for comments
106 th loop, 147773

217 th loop, 144025 seconds in time period to check for comments
218 th loop, 126683 seconds in time period to check for comments
219 th loop, 112175 seconds in time period to check for comments
220 th loop, 105422 seconds in time period to check for comments
221 th loop, 96248 seconds in time period to check for comments
222 th loop, 90291 seconds in time period to check for comments
223 th loop, 85082 seconds in time period to check for comments
224 th loop, 77541 seconds in time period to check for comments
225 th loop, 68352 seconds in time period to check for comments
226 th loop, 57517 seconds in time period to check for comments
227 th loop, 47812 seconds in time period to check for comments
228 th loop, 38224 seconds in time period to check for comments
229 th loop, 33882 seconds in time period to check for comments
230 th loop, 27335 seconds in time period to check for comments
231 th loop, 20504 seconds in time period to check for comments
232 th loop, 12705 seconds in time p

111 th loop, 1665583 seconds in time period to check for comments
112 th loop, 1663806 seconds in time period to check for comments
113 th loop, 1661876 seconds in time period to check for comments
114 th loop, 1658119 seconds in time period to check for comments
115 th loop, 1630724 seconds in time period to check for comments
116 th loop, 1627942 seconds in time period to check for comments
117 th loop, 1623556 seconds in time period to check for comments
118 th loop, 1621175 seconds in time period to check for comments
119 th loop, 1619378 seconds in time period to check for comments
120 th loop, 1617135 seconds in time period to check for comments
121 th loop, 1615180 seconds in time period to check for comments
122 th loop, 1613198 seconds in time period to check for comments
123 th loop, 1611637 seconds in time period to check for comments
124 th loop, 1609426 seconds in time period to check for comments
125 th loop, 1606980 seconds in time period to check for comments
126 th loo

236 th loop, 1464298 seconds in time period to check for comments
237 th loop, 1462977 seconds in time period to check for comments
238 th loop, 1461820 seconds in time period to check for comments
239 th loop, 1460429 seconds in time period to check for comments
240 th loop, 1458576 seconds in time period to check for comments
241 th loop, 1456686 seconds in time period to check for comments
242 th loop, 1454544 seconds in time period to check for comments
243 th loop, 1452691 seconds in time period to check for comments
244 th loop, 1450435 seconds in time period to check for comments
245 th loop, 1447822 seconds in time period to check for comments
246 th loop, 1445185 seconds in time period to check for comments
247 th loop, 1442593 seconds in time period to check for comments
248 th loop, 1440464 seconds in time period to check for comments
249 th loop, 1438093 seconds in time period to check for comments
250 th loop, 1435778 seconds in time period to check for comments
251 th loo

361 th loop, 1204371 seconds in time period to check for comments
362 th loop, 1201984 seconds in time period to check for comments
363 th loop, 1200005 seconds in time period to check for comments
364 th loop, 1198711 seconds in time period to check for comments
365 th loop, 1196845 seconds in time period to check for comments
366 th loop, 1194156 seconds in time period to check for comments
367 th loop, 1190482 seconds in time period to check for comments
368 th loop, 1187642 seconds in time period to check for comments
369 th loop, 1184666 seconds in time period to check for comments
370 th loop, 1180125 seconds in time period to check for comments
371 th loop, 1177045 seconds in time period to check for comments
372 th loop, 1174231 seconds in time period to check for comments
373 th loop, 1170916 seconds in time period to check for comments
374 th loop, 1168368 seconds in time period to check for comments
375 th loop, 1165650 seconds in time period to check for comments
376 th loo

487 th loop, 844512 seconds in time period to check for comments
488 th loop, 842068 seconds in time period to check for comments
489 th loop, 838887 seconds in time period to check for comments
490 th loop, 836690 seconds in time period to check for comments
491 th loop, 832700 seconds in time period to check for comments
492 th loop, 829652 seconds in time period to check for comments
493 th loop, 826309 seconds in time period to check for comments
494 th loop, 824123 seconds in time period to check for comments
495 th loop, 821664 seconds in time period to check for comments
496 th loop, 819763 seconds in time period to check for comments
497 th loop, 818046 seconds in time period to check for comments
498 th loop, 816119 seconds in time period to check for comments
499 th loop, 813681 seconds in time period to check for comments
500 th loop, 811991 seconds in time period to check for comments
501 th loop, 809681 seconds in time period to check for comments
502 th loop, 807782 secon

614 th loop, 615922 seconds in time period to check for comments
615 th loop, 613860 seconds in time period to check for comments
616 th loop, 612253 seconds in time period to check for comments
617 th loop, 610794 seconds in time period to check for comments
618 th loop, 608543 seconds in time period to check for comments
619 th loop, 606247 seconds in time period to check for comments
620 th loop, 604481 seconds in time period to check for comments
621 th loop, 602082 seconds in time period to check for comments
622 th loop, 599771 seconds in time period to check for comments
623 th loop, 596709 seconds in time period to check for comments
624 th loop, 594483 seconds in time period to check for comments
625 th loop, 591471 seconds in time period to check for comments
626 th loop, 589350 seconds in time period to check for comments
627 th loop, 587326 seconds in time period to check for comments
628 th loop, 585616 seconds in time period to check for comments
629 th loop, 583313 secon

741 th loop, 314315 seconds in time period to check for comments
742 th loop, 312388 seconds in time period to check for comments
743 th loop, 310220 seconds in time period to check for comments
744 th loop, 308274 seconds in time period to check for comments
745 th loop, 306393 seconds in time period to check for comments
746 th loop, 304582 seconds in time period to check for comments
747 th loop, 302683 seconds in time period to check for comments
748 th loop, 300691 seconds in time period to check for comments
749 th loop, 298603 seconds in time period to check for comments
750 th loop, 296701 seconds in time period to check for comments
751 th loop, 294909 seconds in time period to check for comments
752 th loop, 292421 seconds in time period to check for comments
753 th loop, 290468 seconds in time period to check for comments
754 th loop, 289432 seconds in time period to check for comments
755 th loop, 287546 seconds in time period to check for comments
756 th loop, 285890 secon