<img src="../../Img/backdrop-wh.png" alt="Drawing" style="width: 300px;"/>

DIGHUM160 - Critical Digital Humanities<br>
Digital Hermeneutics<br>
OPTIONAL: The Reddit API <br>
Created by Tom van Nuenen (tom.van_nuenen@kcl.ac.uk)

# The Reddit API

In this notebook, we'll access the Reddit API to get your own data.

The Reddit API allows you to do lots of things, such as posting automatically as a user. It also allows you to retrieve data from Reddit, such as subreddit posts and comments.

There are restrictions in place: Reddit's API only allows you to retrieve 1000 posts (and associated comments) per task. While we could write a script that keeps track of post timestamps so as to scrape the entirety of a subreddit across multiple tasks, for now we will just download up to 1000 posts from our subreddit (or fewer, if your subreddit has fewer than 1000 posts).

If you want to get more data, check out Timesearch: https://github.com/voussoir/timesearch

### 1. Sign up

Go to http://www.reddit.com and **sign up** for an account.

### 2. Create an app
Go to https://ssl.reddit.com/prefs/apps/ and click on `create app`. Give your app a name and description, select `script` as the app type, and under redirect uri, enter http://localhost:8080.

### 3. Note details 
Note the client ID, client secret, and your username/password for Reddit, as you'll need them here.

client_id:
The client ID is a string of at least 14 characters, listed just under “personal use script” for your application.

client_secret:
The client secret is a string of at least 27 characters, listed next to “secret” for your application.

password:
The password for the Reddit account used to register the application.

username:
The username of the Reddit account used to register the application.

<img src="../../Img/reddit-API.png" alt="Drawing" style="width: 700px;"/>



In [1]:
import praw

reddit = praw.Reddit(client_id='YOUR_CLIENT_ID_HERE',
                     client_secret='YOUR_CLIENT_SECRET_HERE',
                     password='YOUR_REDDIT_PSW_HERE',
                     user_agent='Get Reddit data 1.0 by /u/YOUR_REDDIT_NAME_HERE',
                     username='YOUR_REDDIT_USERNAME_HERE')
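
If you would rather not type your password into a notebook that you might share, one option is to read the details from environment variables instead. The following is a minimal sketch, not part of the original notebook: the variable names (`REDDIT_CLIENT_ID` and so on) and the helper `load_reddit_credentials` are made up for illustration, and you would need to set the variables yourself before starting Jupyter.

```python
import os

# Hypothetical environment variable names; pick whatever fits your setup.
ENV_KEYS = {
    'client_id': 'REDDIT_CLIENT_ID',
    'client_secret': 'REDDIT_CLIENT_SECRET',
    'password': 'REDDIT_PASSWORD',
    'username': 'REDDIT_USERNAME',
}

def load_reddit_credentials(env=os.environ):
    """Collect the keyword arguments for praw.Reddit from environment variables."""
    missing = [name for name in ENV_KEYS.values() if not env.get(name)]
    if missing:
        raise KeyError('Missing environment variables: ' + ', '.join(missing))
    creds = {key: env[name] for key, name in ENV_KEYS.items()}
    # Same user agent format as in the cell above
    creds['user_agent'] = 'Get Reddit data 1.0 by /u/' + creds['username']
    return creds

# Usage, once the variables are set:
# reddit = praw.Reddit(**load_reddit_credentials())
```

The function only builds a dictionary, so you can check that your details are picked up correctly before handing them to PRAW.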

## Getting data with the Reddit API

With the details we just created, we can access the Reddit API using PRAW (the Python Reddit API Wrapper).

For the purpose of this exercise, we'll save posts and comments together in one data file, but it's common practice to store them in two separate, related tables.
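
To give an idea of what the two-table layout could look like, here is a small sketch. It is an illustration only: the function `write_two_tables`, the column names, and the example data are made up, and it writes to in-memory text buffers rather than real files so it runs without a Reddit connection.

```python
import csv
import io

def write_two_tables(posts, posts_file, comments_file):
    """Write posts and comments to separate CSV tables linked by post_id."""
    post_writer = csv.writer(posts_file)
    comment_writer = csv.writer(comments_file)
    post_writer.writerow(['post_id', 'title'])
    comment_writer.writerow(['post_id', 'body'])
    for post in posts:
        post_writer.writerow([post['id'], post['title']])
        for body in post['comments']:
            # Each comment row repeats the post ID, so the tables can be joined later
            comment_writer.writerow([post['id'], body])

# Made-up example data standing in for what PRAW would return
example_posts = [
    {'id': 'abc123', 'title': 'First post', 'comments': ['Nice.', 'I disagree.']},
    {'id': 'def456', 'title': 'Second post', 'comments': []},
]

posts_out, comments_out = io.StringIO(), io.StringIO()
write_two_tables(example_posts, posts_out, comments_out)
print(posts_out.getvalue())
print(comments_out.getvalue())
```

Keeping one row per comment makes it easier to analyse comments individually, at the cost of having to join the two tables on `post_id` when you need post metadata.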

We have already entered the details of the app in the cell above. Next, we define a function that retrieves each post and its associated metadata, as well as its comments, and saves the information in a CSV file.

**Note:** you might want to add other metadata elements to your function, or organize it differently. For example, Reddit submissions also have a `spoiler` attribute that indicates whether a post has been marked as a spoiler (relevant if you're gathering data from a movie or game-related subreddit!). For a list of all the attributes you can use, check:

* https://praw.readthedocs.io/en/latest/code_overview/models/submission.html for submissions/posts
* https://praw.readthedocs.io/en/latest/code_overview/models/comment.html for comments
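
One way to make the list of collected attributes easy to change is to read them by name. The sketch below assumes a hypothetical helper, `extract_fields`, and uses a stand-in object with made-up values instead of a real PRAW submission, so it runs without a Reddit connection.

```python
from types import SimpleNamespace

def extract_fields(item, fields):
    """Return the requested attributes of a submission or comment as a list.

    Attributes that are missing are recorded as None instead of raising an error.
    """
    return [getattr(item, field, None) for field in fields]

# Stand-in object with made-up values; in the notebook this would be a PRAW submission
fake_submission = SimpleNamespace(id='abc123', title='Example', spoiler=True)

row = extract_fields(fake_submission, ['id', 'title', 'spoiler', 'not_a_real_attribute'])
print(row)
```

With this approach, adding a column such as `spoiler` would mean adding one name to the list of fields and one name to the CSV header row.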

In [7]:
import csv
from datetime import datetime

def get_reddit_data(subreddit_name, max_count):
    """Save up to max_count posts from a subreddit, with their comments, to a CSV file."""
    filename = subreddit_name + '_' + str(max_count) + '_' + datetime.now().strftime('%Y%m%d') + '.csv'
    item_count = 0
    comment_count = 0
    # The with-block closes the file when we are done, so all rows get written to disk;
    # newline='' lets the csv module handle line endings itself
    with open(filename, 'wt', encoding='utf-8', newline='') as f:
        # Set up a csv writer and write the header row
        writer = csv.writer(f)
        writer.writerow(['idstr', 'created', 'created_datetime', 'nsfw', 'flair_text', 'flair_css_class',
                         'author', 'title', 'selftext', 'score', 'upvote_ratio',
                         'distinguished', 'textlen', 'num_comments', 'top_comments'])
        for submission in reddit.subreddit(subreddit_name).hot(limit=None):
            try:
                idstr = submission.id
                created = submission.created
                created_datetime = datetime.fromtimestamp(created).strftime('%Y-%m-%d')
                nsfw = submission.over_18
                flair_text = submission.link_flair_text
                flair_css_class = submission.link_flair_css_class
                author = submission.author
                title = submission.title
                selftext = submission.selftext
                score = submission.score
                upvote_ratio = submission.upvote_ratio
                distinguished = submission.distinguished
                textlen = len(submission.selftext)
                num_comments = submission.num_comments
                comment_list = []
                # Load the full comment tree (this can be slow for large threads)
                submission.comments.replace_more(limit=None)
                for comment in submission.comments.list():
                    # Skip comments whose author is missing (e.g. deleted accounts)
                    if comment.author is not None:
                        comment_list.append(comment.body)
                # Despite the column name 'top_comments', this holds all retrieved comments
                comments = ' '.join(comment_list)
                writer.writerow((idstr, created, created_datetime, nsfw, flair_text, flair_css_class,
                                 author, title, selftext, score, upvote_ratio,
                                 distinguished, textlen, num_comments, comments))
                # Only count posts and comments that were actually written to the file
                item_count += 1
                comment_count += len(comment_list)
                print('.', end='', flush=True)
            except Exception as e:
                # Report the problem and carry on with the next post
                print('Error found--resuming...', e)
            if item_count == max_count:
                break

    if item_count > 0:
        print('Done!' + '\n' + 'Found ' + str(item_count) + ' posts' +
              '\n' + 'Found ' + str(comment_count) + ' comments')


Now that we're set up, let's get our data. Change "amitheasshole" in the function call below to your preferred subreddit name (you can find it in Reddit's URL, after "/r/").

In the `for` loop statement above, instead of using `.hot` (currently popular posts), you can also try `.top` (top scoring posts), `.new` (the latest posts), or `.controversial` (posts with a lot of up- and downvotes).
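
If you expect to switch between these orderings often, you could pick the listing by name rather than editing the loop each time. This is a sketch under assumptions: `get_listing` and `ALLOWED_SORTS` are hypothetical names, not part of PRAW, and the function simply looks up the method with the matching name on the subreddit object.

```python
ALLOWED_SORTS = ('hot', 'top', 'new', 'controversial')

def get_listing(subreddit, sort='hot', limit=None):
    """Return the posts of a subreddit in the chosen sort order."""
    if sort not in ALLOWED_SORTS:
        raise ValueError('sort must be one of: ' + ', '.join(ALLOWED_SORTS))
    # getattr looks up the method by name, e.g. subreddit.top
    return getattr(subreddit, sort)(limit=limit)

# In the notebook (needs the authenticated `reddit` object from above):
# for submission in get_listing(reddit.subreddit('amitheasshole'), sort='new', limit=3):
#     print(submission.title)
```

Checking the name against a fixed list means a typo such as `'hto'` fails with a clear message before any request is made.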

In [8]:
get_reddit_data('amitheasshole', 3)

...Done!
Found 3 posts
Found 2573 comments


In [10]:
import os
# We include two ../ because we want to go two levels up in the file structure
os.chdir('../../Data')
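
As an alternative to changing the working directory, you could build the path to the file directly. The sketch below uses `pathlib` from the standard library; the helper name `data_path` is made up for illustration, and it assumes the same folder layout as above (a `Data` folder two levels up).

```python
from pathlib import Path

def data_path(filename, start=None):
    """Build the path to a file in the Data folder two levels above `start`."""
    start = Path(start) if start is not None else Path.cwd()
    # Each '..' moves one level up; resolve() collapses them into a plain absolute path
    return (start / '..' / '..' / 'Data' / filename).resolve()

print(data_path('amitheasshole_3_20220320.csv'))
```

The result can be passed straight to `pd.read_csv`, and the notebook's working directory stays where it was.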

In [11]:
import pandas as pd
df = pd.read_csv('amitheasshole_3_20220320.csv')
df

| | idstr | created | created_datetime | nsfw | flair_text | flair_css_class | author | title | selftext | score | upvote_ratio | distinguished | textlen | num_comments | top_comments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | t4eh7s | 1646159000.0 | 2022-03-01 | False | Open Forum | | AITAMod | AITA Monthly Open Forum March 2022 | #Keep things civil. Rules still apply.\n\n~~I ... | 507 | 0.95 | | 3595 | 799 | The validation posts are so tedious.\n\n"Perso... |
| 1 | tio99u | 1647792000.0 | 2022-03-20 | False | META | | techiesgoboom | So we decided to fuck with the sub... again. | Greetings assholes and asshole enthusiasts!\n\... | 185 | 0.92 | | 2107 | 21 | Does this still mean only the top comment vote... |
| 2 | tikg9a | 1647781000.0 | 2022-03-20 | False | | | throwawaySarah7 | AITA for getting mad at my husband because he ... | \n\nFor context_ I'm a sahm with 2 kids (3yrs ... | 14984 | 0.96 | | 1989 | 1884 | #[Be Civil](https://www.reddit.com/r/AmItheAss... |
