# Reddit API scraper

For the final dataset used in this project, we scraped new comments directly from the Reddit API. Because we wanted to compare data from 2019 with data from 2024, we used the API to obtain the most recent comments.

We followed a tutorial published on [Medium](https://towardsdatascience.com/how-to-use-the-reddit-api-in-python-5e05ddfd1e5c) to build this scraper. We highly recommend it, as it is clear and very helpful.

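To connect to the API, you need a Reddit account and the credentials of a registered app (client ID and secret). Together with your username and password, they are used to request an access token.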

In [1]:
import requests
import json

# note that CLIENT_ID refers to 'personal use script' and SECRET_TOKEN to 'secret'
auth = requests.auth.HTTPBasicAuth('<CLIENT_ID>', '<SECRET_TOKEN>')

# here we pass our login method (password), username, and password
data = {'grant_type': 'password',
        'username': '<USERNAME>',
        'password': '<PASSWORD>'}

# setup our header info, which gives reddit a brief description of our app
headers = {'User-Agent': 'MyBot/0.0.1'}

# send our request for an OAuth token
res = requests.post('https://www.reddit.com/api/v1/access_token',
                    auth=auth, data=data, headers=headers)

# convert response to JSON and pull access_token value
TOKEN = res.json()['access_token']

# add authorization to our headers dictionary
headers = {**headers, **{'Authorization': f"bearer {TOKEN}"}}

# while the token is valid (~2 hours) we just add headers=headers to our requests
requests.get('https://oauth.reddit.com/api/v1/me', headers=headers)




<Response [200]>
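
If the credentials are wrong, the token endpoint may answer with a JSON error object instead of a token, in which case `res.json()['access_token']` fails with a `KeyError`. A minimal sketch of a more defensive lookup follows. `extract_token` is a hypothetical helper name, and the sample payloads are made up for illustration rather than captured from Reddit.

```python
def extract_token(payload):
    """Return the access token from a parsed token response, or raise a clear error."""
    # Hypothetical helper: 'payload' is the dict produced by res.json()
    if 'access_token' not in payload:
        # Surface whatever the server sent back (often an 'error' field)
        raise RuntimeError(f"Token request failed: {payload.get('error', payload)}")
    return payload['access_token']


# Made-up payloads standing in for res.json()
ok_payload = {'access_token': 'abc123', 'token_type': 'bearer'}
bad_payload = {'error': 'invalid_grant'}

token = extract_token(ok_payload)
example_headers = {'User-Agent': 'MyBot/0.0.1', 'Authorization': f"bearer {token}"}
print(example_headers['Authorization'])
```

Calling `extract_token(bad_payload)` raises a `RuntimeError` that names the error, which is easier to diagnose than a bare `KeyError`.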

## So the fun begins!

Once connected, we examined what the API returns and located the information we needed: the comment text, stored under the key `body`. Reddit attaches a large amount of metadata to each post and comment, and the response arrives as deeply nested JSON, so we had to dig through it to find the relevant fields. From each comment we extracted the text, the user, the date/time, and the title of the post it was associated with.
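
To make the nesting concrete, here is a small hand-written sample shaped like a comment listing, with the explicit path down to one comment's fields. The values are invented for illustration, and real responses carry many more keys.

```python
# Hand-written sample mimicking the nesting of a comment listing (values are invented)
listing = {
    'kind': 'Listing',
    'data': {
        'children': [
            {
                'kind': 't1',
                'data': {
                    'author': 'example_user',
                    'body': 'This is a comment.',
                    'created_utc': 1720915200.0,
                    'permalink': '/r/news/comments/abc123/example_post/def456/',
                },
            }
        ]
    },
}

# Walking down the nesting by hand to reach the first comment
first_comment = listing['data']['children'][0]['data']
print(first_comment['body'])
print(first_comment['author'])
```

Hard-coding a path like this only works when the depth is known in advance. Replies can be nested to any depth, which is why a recursive search is the more practical approach.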

Scraped data usually needs cleaning, especially text, where encoded characters are common. The `clean_comment` function decodes escaped and HTML-encoded characters, repairs a few common mis-encoded sequences, and collapses whitespace.

In [2]:
import re
import html

def clean_comment(comment):
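    # Encoding the comment to UTF-8 bytes, then decoding with unicode_escape to resolve
    # literal escape sequences; non-ASCII characters come out as Latin-1 sequences,
    # which the replacements below map back for the most common cases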
    comment_bytes = comment.encode('utf-8', 'backslashreplace').decode('unicode_escape')
    
    # Decoding HTML entities if any
    comment = html.unescape(comment_bytes)
    
    # Replacing specific sequences with their corresponding characters
    replacements = {
        '\u00e2\u0080\u0099': "'",  # Apostrophe
        '\u00e2\u0080\u009c': '"',  # Left double quotation mark
        '\u00e2\u0080\u009d': '"',  # Right double quotation mark
        '\u00c2': ''                  # Stray 'Â' left over from a mis-decoded non-breaking space
    }
    
    for target, replacement in replacements.items():
        comment = comment.replace(target, replacement)
    
    # Replacing newlines with spaces
    comment = comment.replace('\n', ' ')
    
    # Removing extra spaces
    comment = re.sub(r'\s+', ' ', comment).strip()
    
    return comment
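
As an alternative worth considering, the `unicode_escape` round trip can be skipped altogether, since `response.json()` already returns decoded Unicode strings. The sketch below only decodes HTML entities and normalizes whitespace. `clean_comment_simple` is a hypothetical name, and this is not the function used to build the dataset.

```python
import html
import re


def clean_comment_simple(comment):
    """Decode HTML entities and collapse whitespace without re-encoding the text."""
    # Turn entities such as &amp; or &#39; back into plain characters
    comment = html.unescape(comment)
    # Collapse runs of whitespace (including newlines) into single spaces
    return re.sub(r'\s+', ' ', comment).strip()


print(clean_comment_simple("It&#39;s  a\ntest &amp; it keeps “curly quotes” and café"))
```

Because the text is never re-encoded, non-ASCII characters pass through unchanged and no replacement table is needed. The trade-off is that literal escape sequences typed by users, such as a visible `\n`, are left as they are.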

## Extracting the relevant information 

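The `find_comments` function extracts the relevant information. As mentioned above, we needed the comment, the author, the date/time, the title of the post, and the subreddit. Note that the `title` field holds the comment's permalink, which includes the post's title slug, and that the subreddit is passed in as an argument rather than read from the response.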

In [3]:
def find_comments(data, results, subreddit):
    # Check if the current 'data' is a dictionary
    if isinstance(data, dict):
        # Attempt to retrieve specific keys ('body', 'author', etc.) from the dictionary
        body = data.get('body')  # The text of the comment
        author = data.get('author')  # The username of the comment's author
        title = data.get('permalink')  # The permalink (URL) to the comment or post
        created_utc = data.get('created_utc')  # The timestamp of when the comment was created
        
        # If a 'body' is found (i.e., it's not None), store the relevant data in the 'results' list
        if body is not None:
            results.append({
                'body': clean_comment(body),  # Clean the comment text using the 'clean_comment' function
                'author': author,  # Store the author's username
                'title': title,  # Store the permalink
                'created_utc': created_utc,  # Store the creation timestamp
                'subreddit': subreddit  # Store the name of the subreddit
            })
        
        # Recursively process each value in the dictionary, allowing for nested dictionaries
        for value in data.values():
            find_comments(value, results, subreddit)
    
    # If 'data' is a list, iterate over each item in the list and process it recursively
    elif isinstance(data, list):
        for item in data:
            find_comments(item, results, subreddit)
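
A quick way to check the traversal is to run it on a tiny nested structure. The sketch below uses `collect_comments`, a simplified stand-in for `find_comments` that skips the cleaning step, on invented data with one comment and one reply. It also converts `created_utc`, a Unix timestamp in seconds, into an ISO-formatted UTC date.

```python
from datetime import datetime, timezone


def collect_comments(data, results):
    """Simplified stand-in for find_comments: gather body and date from nested data."""
    if isinstance(data, dict):
        body = data.get('body')
        if body is not None:
            created = datetime.fromtimestamp(data['created_utc'], tz=timezone.utc)
            results.append({'body': body, 'created': created.isoformat()})
        # Replies live deeper in the same dictionary, so keep descending
        for value in data.values():
            collect_comments(value, results)
    elif isinstance(data, list):
        for item in data:
            collect_comments(item, results)


# Invented sample: one top-level comment with a single reply nested under 'replies'
sample = [{'data': {'children': [{'data': {
    'body': 'top-level comment',
    'created_utc': 1720915200,
    'replies': {'data': {'children': [{'data': {
        'body': 'a reply',
        'created_utc': 1720918800,
        'replies': '',
    }}]}},
}}]}}]

found = []
collect_comments(sample, found)
for row in found:
    print(row)
```

The reply is picked up even though it sits several levels below the top-level comment, because the function keeps descending after it has recorded a `body`.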


## Time to Dig

Once everything was ready, it was time to dig through all this information and create a JSONL file with all the comments from the subreddits we wanted.
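
JSONL stores one JSON object per line, so records can be appended as they are scraped and read back one at a time. A minimal sketch of that round trip, writing two invented records to a temporary file rather than to the project's `comment_data.jsonl`:

```python
import json
import os
import tempfile

# Invented records standing in for scraped comments
records = [
    {'body': 'first comment', 'author': 'user_a', 'subreddit': 'news'},
    {'body': 'second comment', 'author': 'user_b', 'subreddit': 'politics'},
]

path = os.path.join(tempfile.mkdtemp(), 'example.jsonl')

# Mode 'a' appends to the file and creates it if it does not exist yet
with open(path, 'a', encoding='utf-8') as f:
    for record in records:
        f.write(json.dumps(record) + '\n')

# Reading the file back: each non-empty line is parsed as its own JSON object
with open(path, encoding='utf-8') as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded))
```

Opening the file once in append mode avoids having to check whether it already exists.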

In [5]:
import pandas as pd 
import json
import os

# List of subreddits to scrape data from
subreddits = ['news', 'politics', 'relationship_advice', 'worldnews', 'AskReddit',
              'AmItheAsshole', 'todayilearned', 'Showerthoughts']

# Looping through each subreddit in the list
for subreddit in subreddits:

    # Sending a GET request to the Reddit API to retrieve data from the current subreddit
    response = requests.get(f"https://oauth.reddit.com/r/{subreddit}",
                            headers=headers)

    # Checking if the request was successful (HTTP status code 200 means OK)
    if response.status_code == 200:
        # Parsing the response into a JSON object
        data = response.json()

        # Extracting the list of posts (children) from the JSON response
        children = data["data"]["children"]

        # Looping through each post in the subreddit
        for dictionary in children:
            
            # Extracting the post's unique name (ID), removing the first 3 characters (e.g., 't3_')
            post_name = dictionary["data"]["name"]
            post_name = post_name[3:]
            
            # Printing the post ID (for debugging or tracking purposes)
            print(post_name)
            
            # Sending a GET request to retrieve comments for the current post
            comment_response = requests.get(f"https://oauth.reddit.com/r/{subreddit}/comments/{post_name}",
                                            headers=headers)
            
            # Skipping this post if its comments could not be retrieved
            if comment_response.status_code != 200:
                print(f"Failed to retrieve comments for {post_name}: {comment_response.status_code}")
                continue

            # Parsing the comments response into a JSON object
            try:
                comment_data = comment_response.json()
            except ValueError as e:
                # Handling JSON decoding errors (e.g., malformed JSON in the response)
                print(f"Error decoding JSON: {e}")
                continue

            # Initialize an empty list to store results
            results = []

            # Recursively finding every comment in the response; a single call walks the
            # whole structure, so looping over its elements would store each comment twice
            find_comments(comment_data, results, subreddit)

            # Appending the extracted comments to a JSONL file (one JSON object per line);
            # mode 'a' creates the file if it does not exist yet
            with open('comment_data.jsonl', 'a') as comment_jsonl:
                for element in results:
                    json.dump(element, comment_jsonl)
                    comment_jsonl.write('\n')
        
    else:
        # If the initial request to the subreddit fails, print the status code for debugging
        print(f"Failed to retrieve data: {response.status_code}")


1e2ut0z
1e2sz1e
1e2mn9z


  comment_bytes = comment.encode('utf-8', 'backslashreplace').decode('unicode_escape')

