This notebook collects Reddit data. It first retrieves comments from a subreddit and then retrieves the submissions that those comments belong to.

The functions use the Pushshift API, so check that the API is up: https://stats.uptimerobot.com/l8RZDu1gBG

The cell below should run as is. It loads the packages and defines the functions.

In [None]:
import os
from os.path import join
import requests
import json
import re
import time
from datetime import datetime
import random

# where to store data
data_p = join('..', 'data')
os.makedirs(data_p, exist_ok=True) # create the folder if it does not exist yet

# pushshift endpoints
sub_end = "https://api.pushshift.io/reddit/search/submission/"
com_end = "https://api.pushshift.io/reddit/search/comment/"

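# function for comments
def comment_get(subreddit, start, end, com_end = com_end):
    # request up to 499 comments from `subreddit` posted between `start` and `end` (Unix timestamps)
    parameters = {'subreddit': subreddit,
                  'after': start,
                  'before': end,
                  'size': 499}
    
    r = requests.get(com_end, params = parameters)
    r.raise_for_status() # stop with a clear error if the API answers with an HTTP error
    
    data = r.json().get('data') or [] # empty list when the response contains no data
    
    return data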

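# function for submissions
def submission_get(subreddit, ids, start, end, sub_end = sub_end):
    # request the submissions with the given `ids` from `subreddit` between `start` and `end` (Unix timestamps)
    
    ids = ','.join(ids) # the ids are sent as one comma-separated string
    
    parameters = {'subreddit': subreddit,
                  'after': start,
                  'before': end,
                  'size': 499,
                  'ids': ids}
    
    r = requests.get(sub_end, params = parameters)
    r.raise_for_status() # stop with a clear error if the API answers with an HTTP error
    
    data = r.json().get('data') or [] # empty list when the response contains no data
    
    return data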
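
To see what a request from the functions above looks like without contacting the API, the query can be assembled by hand. This is only a minimal sketch: `build_query` is a hypothetical helper that is not part of the notebook, the subreddit name is a placeholder, and `requests` builds its own (roughly equivalent) query string internally.

```python
from urllib.parse import urlencode

# Endpoint copied from the notebook; build_query is a hypothetical helper.
COM_END = "https://api.pushshift.io/reddit/search/comment/"

def build_query(endpoint, subreddit, start, end, size=499, ids=None):
    """Return the full URL for one search request, without sending it."""
    parameters = {'subreddit': subreddit,
                  'after': start,
                  'before': end,
                  'size': size}
    if ids:
        # ids are sent as one comma-separated string, as in submission_get
        parameters['ids'] = ','.join(ids)
    return endpoint + '?' + urlencode(parameters)

# One 6-hour window (21600 seconds) for a placeholder subreddit
url = build_query(COM_END, 'example_subreddit', 1570466763, 1570466763 + 21600)
print(url)
```

Printing the URL this way can help when checking a single request in a browser before starting a long collection run.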

The cell below sets the search parameters.

In [None]:
# overall parameters for search
subreddit = '' # which subreddit? (omit r/)
#collect_start = int(datetime(2019, 2, 27, 0, 0).timestamp()) # start of collection
collect_start = 1570466763 # start of collection (Unix timestamp in seconds)
collect_end = int(datetime(2022, 11, 1, 0, 0).timestamp()) # end of collection

comment_out = 'comments.json' # name of the file with comments
submission_out = 'submissions.json' # name of the file with submissions
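
A naive `datetime(...)` is interpreted in the local time zone of the machine when `.timestamp()` is called, so the same cell can produce different timestamps on different computers. A minimal sketch of a time-zone-independent alternative is shown here; `to_utc_timestamp` is a hypothetical helper, and treating the dates as UTC is an assumption, not something the notebook states.

```python
from datetime import datetime, timezone

def to_utc_timestamp(year, month, day, hour=0, minute=0, second=0):
    """Unix timestamp (whole seconds) for a date and time interpreted as UTC."""
    moment = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return int(moment.timestamp())

# The hard-coded start value corresponds to 2019-10-07 16:46:03 UTC
collect_start = to_utc_timestamp(2019, 10, 7, 16, 46, 3)
collect_end = to_utc_timestamp(2022, 11, 1)

# Converting back is a quick way to check what a timestamp means
print(datetime.fromtimestamp(collect_start, tz=timezone.utc).isoformat())
```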

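The next cell retrieves comments. It requests 6 hours at a time, from `collect_start` to `collect_end`. Each request returns at most 499 comments (the `size` parameter), so any further comments in a busy 6-hour window are not retrieved. The comment data is saved in the "data" folder.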

In [None]:
# get comment data
comment_data = [] # list for storing comment data (comment this line out to keep data from an earlier, partial run)
request_start = collect_start # setting start time of first request
while request_start < collect_end:
    request_end = min(request_start + 21600, collect_end) # 21600 seconds correspond to 6 hours; never request past collect_end
    
    request_data = comment_get(subreddit = subreddit, start = request_start, end = request_end) # getting data for request
    comment_data = comment_data + (request_data or []) # adding to comment data list; a response without data adds nothing
    
    request_start = request_end + 1 # next request should start right after previous request end
    
    time.sleep(random.uniform(3, 4)) # wait between 3 to 4 seconds between requests
    

# save data
data_out = join(data_p, comment_out)
with open(data_out, 'w') as f:
    json.dump(comment_data, f)
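
The 6-hour windowing used in the comment cell can be tried out on its own, without any requests. The sketch below uses two hypothetical helpers that are not part of the notebook: `time_windows` yields the (start, end) pair of each request and cuts the last window at `end`, and `maybe_truncated` flags a response that hit the `size` limit, which suggests (but does not prove) that the window held more comments than were returned.

```python
def time_windows(start, end, step=21600):
    """Yield (window_start, window_end) pairs covering start..end in steps of `step` seconds."""
    window_start = start
    while window_start < end:
        window_end = min(window_start + step, end)  # the last window is cut at `end`
        yield window_start, window_end
        window_start = window_end + 1  # same +1 convention as the notebook cell

def maybe_truncated(request_data, size=499):
    """True if a response is as long as the size limit, so comments may be missing."""
    return len(request_data) >= size

# Small made-up range to show the pattern: two full windows and a shorter last one
windows = list(time_windows(0, 50000))
print(windows)
```

If `maybe_truncated` returns `True` for a window, one option is to request that window again with a smaller `step`.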

The next cell retrieves submissions. It requests 450 submissions at a time until it has requested every submission id that appears in the comment data. The submission data is saved in the "data" folder.

In [None]:
# get submission ids in comment data
subids = [entry.get('link_id') for entry in comment_data] # get submission ids for comment
subids = list(set(subids)) # convert to set - keeping only unique ids
subids = [link_id.replace('t3_', '') for link_id in subids] # remove prefix t3_ to get id only


# get submission data
submission_data = [] # list for storing submission data
start_index = 0 # starting with first id (index 0 in list)
while start_index < len(subids): # stop once every id has been requested
    end_index = min(start_index + 450, len(subids)) # end of this batch, not included in the slice (450 ids per request, max 500 per request)
    
    ids_get = subids[start_index:end_index]
    
    request_data = submission_get(subreddit = subreddit, ids = ids_get, start = collect_start, end = collect_end) # getting data for request
    submission_data = submission_data + (request_data or []) # adding to submission data list
    
    start_index = end_index # the slice excludes end_index, so the next batch starts exactly there
    
    time.sleep(random.uniform(3, 4)) # wait between 3 to 4 seconds between requests
    

# save data
data_out = join(data_p, submission_out)
with open(data_out, 'w') as f:
    json.dump(submission_data, f)
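
The id extraction and the batching in the submission cell can be checked without calling the API. The following is a minimal sketch with made-up data; `submission_ids` and `chunks` are hypothetical helpers that are not part of the notebook.

```python
def submission_ids(comments):
    """Unique submission ids referenced by a list of comment dicts, sorted for reproducibility."""
    ids = set()
    for entry in comments:
        link_id = entry.get('link_id')
        if link_id:  # skip comments without a link_id
            # drop the t3_ prefix to keep the id only
            ids.add(link_id[3:] if link_id.startswith('t3_') else link_id)
    return sorted(ids)

def chunks(items, size=450):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Made-up comments: two point to the same submission, one lacks a link_id
example_comments = [{'link_id': 't3_abc'}, {'link_id': 't3_abc'},
                    {'link_id': 't3_xyz'}, {}]
ids = submission_ids(example_comments)

# 1000 made-up ids are split into batches of 450, 450 and 100
all_ids = [str(i) for i in range(1000)]
batches = list(chunks(all_ids))
print(ids, [len(batch) for batch in batches])
```

Because each slice starts where the previous one ended, every id lands in exactly one batch.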