1. Introduction

This notebook demonstrates how to connect to and retrieve data from the Reddit Application Programming Interface (API). The API uses Open Authorization (OAuth) to control access to its data. OAuth is a standard that provides a way to delegate access to internal data to external parties. This access method is useful for automated communication and for enforcing efficient query methods, which lowers server costs.

2. Imports

Imports serve a purpose similar to include directives in the C programming language. Loading every available library would make the program unreasonably large, so imports bring in only what is needed, which keeps the program small.

In [2]:
import requests 
import json
import pandas as pd
from math import ceil
import time
from datetime import timedelta

*** Add Reddit API Client ID and Secret Key here. 

Reddit has limitations on the timespan for which historical data is available. Several researchers have curated more complete datasets by constantly monitoring Reddit over time.

In [4]:
CLIENT_ID = "Add your client ID here"
SECRET_KEY = "Add your secret key here"

# create authentication object
auth = requests.auth.HTTPBasicAuth(CLIENT_ID, SECRET_KEY)
reddit_headers = {'User-Agent': 'research test for u/Iskiicyblithely'}

data = {
    'grant_type': 'password',
    'username': 'Add your user name here', # enter own reddit username
    'password': 'Add your password here' # enter own reddit password associated with the username
}

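# exchange the client credentials and account login for an OAuth access token
token = requests.request('POST', 'https://www.reddit.com/api/v1/access_token', auth=auth, data=data, headers=reddit_headers).json()['access_token']
# send the token as a bearer credential on every subsequent request
reddit_headers['Authorization'] = f'bearer {token}'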

def reddit_oauth(r):
    r.headers = reddit_headers
    return r
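
To make the two credentials in the cell above concrete, the following minimal sketch builds the header values involved using only the standard library: the HTTP Basic value that HTTPBasicAuth derives from the client ID and secret key for the token request, and the bearer value sent on later requests. The helper names and the sample credentials are hypothetical and are not part of the notebook.

```python
import base64

def basic_auth_value(client_id, secret_key):
    """Return the HTTP Basic Authorization value for a client ID and secret."""
    raw = f"{client_id}:{secret_key}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")

def bearer_value(token):
    """Return the Authorization value used once a token has been issued."""
    return f"bearer {token}"

# Hypothetical placeholder credentials, for illustration only
example_basic = basic_auth_value("id", "secret")
example_bearer = bearer_value("EXAMPLE_TOKEN")
print(example_basic)
print(example_bearer)
```

The Basic value is only an encoding of the credentials, not encryption, which is one reason the token request goes over HTTPS.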

API connection for Reddit GET requests

In [7]:
def connect_to_endpoint_reddit(url, auth, headers, rate_limit):
    begin_time = time.time()
    # for calculating wait times
    rate_limit_window = timedelta(minutes=rate_limit)
    response = requests.request('GET', url, auth=auth,headers=headers)
    if response.status_code == 429:
        # wait out the remainder of the rate-limit window, plus a 10-second buffer
        sleepy_time = rate_limit_window - timedelta(seconds=ceil(time.time()-begin_time))
        time.sleep(max(sleepy_time.total_seconds(), 0)+10)

        # try again
        return connect_to_endpoint_reddit(url,auth,headers,rate_limit)
    elif response.status_code != 200:
        raise Exception(f'Error reddit: {response.status_code} {response.text}')
    
    return response.json()
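
The wait computed on a 429 response can be checked in isolation. The sketch below reproduces the same arithmetic (window length minus elapsed time, plus a 10-second buffer) as a pure function. The name seconds_to_wait is hypothetical, and the clamp at zero is a defensive assumption for the case where more time has elapsed than the window allows.

```python
from math import ceil

def seconds_to_wait(rate_limit_minutes, elapsed_seconds, buffer_seconds=10):
    """Seconds to sleep before retrying after a rate-limit response."""
    # remainder of the rate-limit window, in whole seconds
    remaining = rate_limit_minutes * 60 - ceil(elapsed_seconds)
    # never sleep for a negative duration; always add the safety buffer
    return max(remaining, 0) + buffer_seconds

print(seconds_to_wait(5, 1.2))   # request took about a second
print(seconds_to_wait(5, 400))   # window already elapsed
```

With a 5-minute window, a request that fails almost immediately leads to a wait of roughly the full window, so a smaller window may be preferable for interactive use.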

*** Remove the break below (see the comment above the break statement) to page further back through the Reddit comment listing.

The code below pages through the comment listing for r/wallstreetbets on Reddit and writes each item to an ndjson file. The out directory must already exist before the cell is run.
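
For readers unfamiliar with ndjson (newline-delimited JSON), the format stores one complete JSON object per line, so a file can be written incrementally and read back record by record. A minimal sketch using a temporary file and made-up records (the field values here are illustrative, not real Reddit output):

```python
import json
import os
import tempfile

# Made-up records shaped loosely like listing children
records = [
    {"kind": "t1", "data": {"body": "first example comment"}},
    {"kind": "t1", "data": {"body": "second example comment"}},
]

path = os.path.join(tempfile.mkdtemp(), "example.ndjson")

# write: one JSON document per line
with open(path, "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# read: parse each non-empty line independently
with open(path) as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded))
```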

In [14]:
response = connect_to_endpoint_reddit('https://oauth.reddit.com/r/wallstreetbets/comments', reddit_oauth, reddit_headers,5)

with open('out/reddit_wallstreetbets.json','w') as f:
    for i in response['data']['children']:
        f.write(json.dumps(i)+'\n')

    while True:
        if 'data' in response:
            if response['data'].get('after'):
                # pass the cursor as the 'after' query parameter to fetch the next (older) page
                after = response['data']['after']
                response = connect_to_endpoint_reddit(f'https://oauth.reddit.com/r/wallstreetbets/comments?after={after}', reddit_oauth, reddit_headers,5)
                for i in response['data']['children']:
                    f.write(json.dumps(i)+'\n')
                # REMOVE THE break DIRECTLY BELOW TO PARSE POSTS BEFORE THIS SET
                break
            else:
                break
        else:
            break
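
The paging pattern in the cell above can be exercised without network access. In this sketch a hypothetical fetch_page function stands in for connect_to_endpoint_reddit and serves three made-up pages, each shaped like a Reddit listing with children and an after cursor that is None on the last page. It assumes the cursor from one page is passed forward to request the next page; the max_pages argument is an illustrative substitute for the break statement.

```python
# Made-up pages keyed by the cursor that requests them
FAKE_PAGES = {
    None: {"data": {"children": ["c1", "c2"], "after": "t1_c2"}},
    "t1_c2": {"data": {"children": ["c3", "c4"], "after": "t1_c4"}},
    "t1_c4": {"data": {"children": ["c5"], "after": None}},
}

def fetch_page(after=None):
    """Hypothetical stand-in for a network call; returns one listing page."""
    return FAKE_PAGES[after]

def collect_all(max_pages=None):
    """Follow the 'after' cursor until it is empty or max_pages is reached."""
    items = []
    after = None
    pages = 0
    while True:
        page = fetch_page(after)
        items.extend(page["data"]["children"])
        pages += 1
        after = page["data"].get("after")
        if not after or (max_pages is not None and pages >= max_pages):
            break
    return items

print(collect_all())             # every page
print(collect_all(max_pages=2))  # first page plus one more, like the cell above
```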
