# Tutorial: Retrieving Posts and Comments from Reddit using the Reddit API

The Reddit Data API launched in 2008.

There are constraints to how much data we can retrieve through the API because maintaining the API is expensive and those using the API cause server traffic. Some of the limitations are:

- You can make a max of 60 requests per minute
- You can request up to 100 items (e.g. posts) per request
- It is easier to collect current data and plan to collect into the future than retrieving historical data with the reddit API.

For more information, check out the API documentation here: https://www.reddit.com/dev/api/

## Step 1: Create a Reddit Developer Account 

After creating an account in reddit and logging in, go here: https://www.reddit.com/prefs/apps

Go to the end of the webpage where it says `Developed Applications` and click `Create another app`.  You will see a form which you can fill in as follows:

-  `name`, choose any name for your app; mine is called MyBot for lack of originality.
- Three Options: The default is `web app` but for pulling reddit data you need to change it `script` which allows us to use the API for personal reasons. The other options available allow you to develop apps, bots, etc.
- `description`, briefly explain what you are going to use the API for.
- `about url`, leave empty.
- `redirect url`, if you do not have one, enter http://localhost:8080

## Step 2: Get Credentials to Access API 

Now you are registered! We need to get the credentials to access the API. If you go again here https://www.reddit.com/prefs/apps, while logged into your account, you will find the application you created in the previous step under `developed applications`. 

You will see the following information:

[SCREEN SHOT]

The API requires OAuth 2.0 which stands for *Open Authorization*. It is an authorization framework that, in this case, allows reddit to grant access to third-parties (developers or users retrieving data for personal use) to their protected resources (e.g. reddit posts). To go through the authentication process, you will need the following:

- Reddit username
- Reddit password
- Client secret
- Client ID

To make the information secure, I saved the information in a JSON file that looks like this:

In [None]:
{
    "USERNAME_REDDIT": "YourUsername",
    "PASSWORD_REDDIT": "YourPassword",
    "CLIENT_SECRET": "YourSecret",
    "CLIENT_ID": "YourClient"
}

 After creating the file with the credentials, you should include it in a `.gitignore` file. 

We are now ready to move to Python! Below are all of the libraries I will use.

In [1]:
# all libraries needed for all steps
import json
import requests
import pandas as pd
import datetime

Let's now load the credentials from the JSON file:

In [2]:
# read json file which is saved in the same folder
with open(".json") as json_file:
    credentials = json.load(json_file)

# read username & password & secret & ID saved in hidden file
USERNAME = credentials['USERNAME_REDDIT']
PASSWORD = credentials['PASSWORD_REDDIT']
CLIENT_SECRET = credentials['CLIENT_SECRET']
CLIENT_ID = credentials['CLIENT_ID']

# uncomment below if you want to check that the information is correct
# print(USERNAME, PASSWORD, CLIENT_SECRET, CLIENT_ID)

 ## Step 3: Request Authorization to API and get Oauth Token

We are now ready to request an **access token** to access the reddit API. The token will provide access for 2 hours, after which you will need to request a new token.

To request the authorization token, we use `requests.post` with inputs `auth` (which includes the client ID and secret),`data` (which includes the reddit username and password), and `headers` (which provides information about the client -me- requesting the resource). We also include the reddit link that provides the access token. Reddit API provides the token in JSON format along with other information. I save the token in an object called `TOKEN` for later use.

In [3]:
# information for log-in authorization
auth = requests.auth.HTTPBasicAuth(CLIENT_ID, CLIENT_SECRET)

data = {'grant_type': 'password', 'username': USERNAME, 'password': PASSWORD}

headers = {'User-Agent': 'myBot/1.0'}

# submit our request for an OAuth token
res = requests.post('https://www.reddit.com/api/v1/access_token',
                    auth=auth, data=data, headers=headers)

# Uncomment below to see the access token and other information
# # print(res.json())

# retrieve access token
TOKEN = res.json()['access_token']

## Step 4: Request Reddit Posts and Save Them in a DataFrame

Now that we have the authorization token, we can get some data from the reddit API! 

To make requests we use `requests.get`. The data will be provided in a JSON format.

We can filter the data modifying the link:

https://oauth.reddit.com/r/{subreddit}/{type}.json?t={time}

where,

- Subreddit is the name of the subreddit you want to pull data from.

- Type is what I call the way reddit allows users to sort/look for posts:
    - 'hot'
    - 'top'
    - 'new'
    - 'controversial'
    - 'rising'
    - 'random'
    - 'best'

- Time refers to day, week, month, or year. 

we can specify the subreddit, the number of posts to pull (max allowed is 100), whether/how to sort it, search posts by keywords, among others

Using the filters we could get, for instance, the top 100 posts in a particular subreddit in the last year. Let's try it out! 

In the example below, I specify `my_headers` and `my_params` and later, pull 100 posts from today (default) that are 'hot' from the data science subreddit. After, I pull the top 100 hot posts from the same subreddit for the past year. 

In [4]:
# add authorization to headers
my_headers = {**headers, **{'Authorization': f"bearer {TOKEN}"}}

# add sorting by date and get 100
my_params = {'sort': 'date', 'limit': 100}

reddit_hot = requests.get("https://oauth.reddit.com/r/datascience/hot.json",
                   headers=my_headers, params=my_params)


reddit_hot_year = requests.get("https://oauth.reddit.com/r/datascience/top.json?t=year",
                   headers=headers, params=my_params)

For each post, we can get the following information:

In [5]:
# variables for each post
print(reddit_hot.json()['data']['children'][0]['data'].keys())

dict_keys(['approved_at_utc', 'subreddit', 'selftext', 'author_fullname', 'saved', 'mod_reason_title', 'gilded', 'clicked', 'title', 'link_flair_richtext', 'subreddit_name_prefixed', 'hidden', 'pwls', 'link_flair_css_class', 'downs', 'thumbnail_height', 'top_awarded_type', 'hide_score', 'name', 'quarantine', 'link_flair_text_color', 'upvote_ratio', 'author_flair_background_color', 'subreddit_type', 'ups', 'total_awards_received', 'media_embed', 'thumbnail_width', 'author_flair_template_id', 'is_original_content', 'user_reports', 'secure_media', 'is_reddit_media_domain', 'is_meta', 'category', 'secure_media_embed', 'link_flair_text', 'can_mod_post', 'score', 'approved_by', 'is_created_from_ads_ui', 'author_premium', 'thumbnail', 'edited', 'author_flair_css_class', 'author_flair_richtext', 'gildings', 'content_categories', 'is_self', 'mod_note', 'created', 'link_flair_type', 'wls', 'removed_by_category', 'banned_by', 'author_flair_type', 'domain', 'allow_live_comments', 'selftext_html', 

We can also filter posts with key words:

https://oauth.reddit.com/r/{subreddit}/search.json?q={key_words}&restrict_sr=on

where:
- subreddit is the subreddit we want to search (note: like above, you have to add `restrict_sr=on` so that the search is restricted to the subreddit)
- key_words are the key words you want to use for your search; white spaces are replaced by `_`.

In the example below I pull the last 100 posts with key words ChatGPT.

In [6]:
reddit_gpt = requests.get("https://oauth.reddit.com/r/datascience/search.json?q=ChatGPT&restrict_sr=on",
                            headers=headers, params=my_params)

On a side note, the date in which a post is created is called `created` and are unix timestamps -- number of seconds since January 1, 1970. It is trivial to convert in Python using the `datetime` library. For instance, the first post, the unix timestamp is:

In [7]:
created_post0 = reddit_hot.json()['data']['children'][0]['data'].get('created')
print(created_post0)

1695009704.0


Converted to UTC below, it yields 24 of April 2023 at 04:01:26.

In [8]:
datetime.datetime.utcfromtimestamp(created_post0)

datetime.datetime(2023, 9, 18, 4, 1, 44)

To create a data frame with the posts, I select a number of attributes and do the following:

In [9]:
df = pd.DataFrame()  # initialize dataframe

# loop through each post retrieved from GET request
for post in reddit_hot.json()['data']['children']:
    # append relevant data to dataframe
    df = df.append({
        'subreddit': post['data']['subreddit'],
        'id': post['data']['id'],
        'title': post['data']['title'],
        'selftext': post['data']['selftext'],
        'upvote_ratio': post['data']['upvote_ratio'],
        'ups': post['data']['ups'],
        'downs': post['data']['downs'],
        'score': post['data']['score'],
        'num_comments': post['data']['num_comments'],
        'view_count': post['data']['view_count'],
        'total_awards': post['data']['total_awards_received'],
        'created': post['data']['created']
    }, ignore_index=True)

In [10]:
df.head()

Unnamed: 0,subreddit,id,title,selftext,upvote_ratio,ups,downs,score,num_comments,view_count,total_awards,created
0,datascience,16ll6ro,Weekly Entering &amp; Transitioning - Thread 1...,\n\nWelcome to this week's entering &amp; tra...,0.9,7.0,0.0,7.0,106.0,,0.0,1695010000.0
1,datascience,16r1881,Poor statistical/Linear Algebra foundation,Often you hear people saying that understandin...,0.84,26.0,0.0,26.0,22.0,,0.0,1695571000.0
2,datascience,16qtywk,"For people in the industry, how do you explain...",I'll be having a job interview in a few days a...,0.96,70.0,0.0,70.0,27.0,,0.0,1695550000.0
3,datascience,16r5v0j,What do data scientists do anyway?,I have been working in a data science Consulti...,0.8,12.0,0.0,12.0,21.0,,0.0,1695582000.0
4,datascience,16qxfo0,"Did academics prepare you for your role, or fo...",I browse through post often and general have l...,0.84,11.0,0.0,11.0,11.0,,0.0,1695561000.0


In [11]:
df.shape

(101, 12)

## Step 5: Retrieving Comments from a Post

Moving to the comments, in order to retrieve them, we need an ID for a specific post (in the case below, post 12x2df1, which is [this](https://www.reddit.com/r/datascience/comments/12x2df1/weekly_entering_transitioning_thread_24_apr_2023/) thread.)

The comments appear as a tree. For simplicity, I collect only the *parent* comments only and save them in a DataFrame below.

In [12]:
reddit_comments = requests.get("https://oauth.reddit.com/r/datascience/comments/12x2df1.json?threded=false",
                   headers=my_headers, params=my_params)
print(reddit_comments.json()[0]['data']['children'][0])
print(reddit_comments.json()[1]['data']['children'][1]['data'])

{'kind': 't3', 'data': {'approved_at_utc': None, 'subreddit': 'datascience', 'selftext': " \n\nWelcome to this week's entering &amp; transitioning thread! This thread is for any questions about getting started, studying, or transitioning into the data science field. Topics include:\n\n* Learning resources (e.g. books, tutorials, videos)\n* Traditional education (e.g. schools, degrees, electives)\n* Alternative education (e.g. online courses, bootcamps)\n* Job search questions (e.g. resumes, applying, career prospects)\n* Elementary questions (e.g. where to start, what next)\n\nWhile you wait for answers from the community, check out the [FAQ](https://www.reddit.com/r/datascience/wiki/frequently-asked-questions) and Resources pages on our wiki. You can also search for answers in [past weekly threads](https://www.reddit.com/r/datascience/search?q=weekly%20thread&amp;restrict_sr=1&amp;sort=new).", 'user_reports': [], 'saved': False, 'mod_reason_title': None, 'gilded': 0, 'clicked': False,

In [47]:
df_comments = pd.DataFrame()  # initialize dataframe

# loop through each post retrieved from GET request
for i in range(0, len(reddit_comments.json()[1]['data']['children'])):
    # append relevant data to dataframe
    df_comments = df_comments.append({
        'subreddit': reddit_comments.json()[1]['data']['children'][i]['data']['subreddit'],
        'id': reddit_comments.json()[1]['data']['children'][i]['data']['id'],
        'body': reddit_comments.json()[1]['data']['children'][i]['data']['body'],
        'ups': reddit_comments.json()[1]['data']['children'][i]['data']['ups'],
        'downs': reddit_comments.json()[1]['data']['children'][i]['data']['downs'],
        'score': reddit_comments.json()[1]['data']['children'][i]['data']['score'],
        'total_awards': reddit_comments.json()[1]['data']['children'][i]['data']['total_awards_received'],
        'created': reddit_comments.json()[1]['data']['children'][i]['data']['created_utc'],
        'author': reddit_comments.json()[1]['data']['children'][i]['data']['author']
        
    }, ignore_index=True)

In [48]:
df_comments.shape

(36, 10)

In [49]:
df_comments.head()

Unnamed: 0,subreddit,id,body,ups,downs,score,total_awards,created,author,num_replies
0,datascience,jhjhfkf,"Hi all,\n\nI have got question about moving fr...",5.0,0.0,5.0,0.0,1682356000.0,Quest_to_peace,2.0
1,datascience,jhkplu9,I'm having a final interview next week for an ...,4.0,0.0,4.0,0.0,1682373000.0,junejiehuang,2.0
2,datascience,jhytaqq,[deleted],5.0,0.0,5.0,0.0,1682628000.0,[deleted],2.0
3,datascience,jhhrzhw,Hey is it possible to go into data science wit...,3.0,0.0,3.0,0.0,1682322000.0,Far-Pizza9567bNana,2.0
4,datascience,jhihp20,"This is a cry for help, if you can give me som...",3.0,0.0,3.0,0.0,1682341000.0,moon3dot14,2.0


# References