# Pulling posts through the Reddit API

In [None]:
import sys, time
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import praw
from datetime import datetime, timedelta

Similar to Twitter, Reddit provides an API that allows us to access a lot of data directly from Python. This time, we use the library `praw` as a wrapper for our requests.

Note that Reddit is a bit more liberal with the allowable number of requests than Twitter. You can make up to 60 requests per minute (with a single request returning up to 100 posts). More information can be found <a href='https://github.com/reddit-archive/reddit/wiki/API'>here</a>.
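If you ever loop over many requests yourself, a small throttle is one way to stay under such a limit. The following is only a sketch: the `Throttle` class and its arguments are hypothetical names, not part of `praw`, and it simply assumes the 60 requests/minute figure quoted above.

```python
import time


class Throttle:
    """Space out calls so that at most `max_per_minute` happen per minute."""

    def __init__(self, max_per_minute=60, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 60.0 / max_per_minute
        self._clock = clock    # injectable, so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous call, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(max_per_minute=60)
print(throttle.min_interval)
```

Calling `throttle.wait()` before each request would then pause just long enough to keep the average rate at or below the limit.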

To access the Reddit API, we again need to authenticate ourselves:

1. Go to the <a href=https://www.reddit.com/prefs/apps>Reddit Apps Site</a> and "create another app" (you will need a Reddit account for this).
2. Create an application (e.g., "dtvc_bot"). The best option for our purposes is "script".
3. Read the <a href=https://docs.google.com/a/reddit.com/forms/d/e/1FAIpQLSezNdDNK1-P8mspSbmtC2r86Ee9ZRbC66u929cG2GX0T9UMyw/viewform>terms and conditions</a> and register.
4. Once the application is created, copy its personal use script (the ID shown under the application name) and its secret token. You will also need your Reddit username and password to make this work.

As before, I'm transferring all data from a csv-file, but feel free to input your data directly as a string.

In [None]:
api_access = pd.read_csv('API_access.csv',delimiter=';')
PERSONAL_USE_SCRIPT = api_access[api_access['api'] == 'reddit_personal_use_script']['key'].tolist()[0]
SECRET_TOKEN = api_access[api_access['api'] == 'reddit_secret_token']['key'].tolist()[0]
USERNAME = api_access[api_access['api'] == 'reddit_username']['key'].tolist()[0]
PASSWORD = api_access[api_access['api'] == 'reddit_password']['key'].tolist()[0]
USER_AGENT = api_access[api_access['api'] == 'reddit_user_agent']['key'].tolist()[0] # This should be descriptive, such as 'testscript by u/<Username>'
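The five lookups above all follow the same pattern. As an alternative sketch, the file can be read once into a dictionary. This assumes the same layout as above (columns `api` and `key`, separated by `;`); the helper name `load_credentials` and the sample values are made up for illustration.

```python
import csv
import io


def load_credentials(csv_file):
    """Read a ';'-separated file with 'api' and 'key' columns into a dict."""
    reader = csv.DictReader(csv_file, delimiter=';')
    return {row['api']: row['key'] for row in reader}


# Placeholder content standing in for API_access.csv (all values are made up)
sample = io.StringIO(
    "api;key\n"
    "reddit_personal_use_script;abc123\n"
    "reddit_secret_token;s3cr3t\n"
    "reddit_user_agent;testscript by u/example\n"
)
credentials = load_credentials(sample)
print(credentials['reddit_user_agent'])
```

With the real file, the call would look like `with open('API_access.csv', newline='') as f: credentials = load_credentials(f)`.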

We now create a link to Reddit using `praw`:

In [None]:
reddit = praw.Reddit(
            client_id = PERSONAL_USE_SCRIPT,
            client_secret = SECRET_TOKEN,
            user_agent = USER_AGENT,
            username = USERNAME,
            password = PASSWORD)

## Pulling Reddit data

We start by pulling "hot" posts within any subreddit:

In [None]:
for post in reddit.subreddit("all").hot(limit=10):
    print(post.title)

We can also sort posts in other ways:
* hot
* controversial
* gilded
* new
* rising
* top

A single request returns at most 100 posts. Let's pull 100, keep some of the key fields, and put them into a dataframe:

In [None]:
# Collect one dict per post and build the dataframe in one go
# (DataFrame.append was removed in pandas 2.0)
rows = []
for post in reddit.subreddit("all").new(limit=100):
    rows.append({
        'subreddit': post.subreddit.display_name,  # store the name, not the Subreddit object
        'title': post.title,
        'upvote_ratio': post.upvote_ratio,
        'score': post.score,
        'created_utc': post.created_utc,
        'fullname': post.fullname
    })
post_df = pd.DataFrame(rows)
post_df.head()

To get more data, we need to make multiple requests. The tricky part is avoiding overlapping posts: each request has to continue where the previous one stopped. We do this by passing the fullname of the last post we have seen as the `after` parameter, so that Reddit returns only posts that come after it in the listing:

In [None]:
for i in range(3):
    # Continue the listing after the last post we already have
    params = {'after': post_df['fullname'].iloc[-1]}
    posts = reddit.subreddit("all").new(limit=100, params=params)
    rows = []
    for post in posts:
        rows.append({
            'subreddit': post.subreddit.display_name,
            'title': post.title,
            'upvote_ratio': post.upvote_ratio,
            'score': post.score,
            'created_utc': post.created_utc,
            'fullname': post.fullname
        })
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    post_df = pd.concat([post_df, pd.DataFrame(rows)], ignore_index=True)
    print("Added " + str(len(rows)) + " posts")
print(post_df.shape)
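The paging logic is easier to see without the network in the way. The sketch below imitates it on a made-up listing: `LISTING`, `fetch_page`, and `fetch_all` are hypothetical stand-ins, not Reddit or `praw` functions, and the fullnames are invented.

```python
# Toy listing, newest first; the names only mimic Reddit fullnames
LISTING = ["t3_%03d" % n for n in range(250, 0, -1)]


def fetch_page(after=None, limit=100):
    """Return up to `limit` items that come after `after` in the listing."""
    start = 0 if after is None else LISTING.index(after) + 1
    return LISTING[start:start + limit]


def fetch_all(max_requests=5, limit=100):
    """Page through the listing, always continuing after the last item seen."""
    collected = []
    after = None
    for _ in range(max_requests):
        page = fetch_page(after=after, limit=limit)
        if not page:
            break  # listing exhausted
        collected.extend(page)
        after = page[-1]
    return collected


items = fetch_all()
print(len(items), len(set(items)))
```

Because every request starts right after the last item of the previous page, the collected items contain no duplicates, and the loop stops early once a page comes back empty.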

If we want to search only hot posts, we have to work with time stamps, so things get a little bit more complicated. For example code, see <a href='https://github.com/praw-dev/praw/blob/50610754ffce4c2ce2ba416aba51f0f38744b2d4/praw/models/reddit/subreddit.py#L161'>here</a>.

## Back to our Marketing campaign

Let's start by pulling a bit of data about Red Bull from the corresponding subreddit. In the setup below, we request the most recent 100 posts, then the 100 before that, and so on. If you want to crank up the search, remember the 60 requests/minute restriction!

In [None]:
post_df = pd.DataFrame()
params = {}
for i in range(5):
    posts = reddit.subreddit("redbull").new(limit=100, params=params)
    rows = []
    for post in posts:
        rows.append({
            'subreddit': post.subreddit.display_name,
            'title': post.title,
            'text': post.selftext,
            'upvote_ratio': post.upvote_ratio,
            'score': post.score,
            'img_link': post.url,
            'created_utc': post.created_utc,
            'fullname': post.fullname
        })
    print("Added " + str(len(rows)) + " posts")
    if not rows:
        break  # nothing older left to fetch
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    post_df = pd.concat([post_df, pd.DataFrame(rows)], ignore_index=True)
    params['after'] = post_df['fullname'].iloc[-1]
post_df

Let's see when users have been most active. Note that the creation time is given as a Unix timestamp (seconds since January 1, 1970, UTC). We can easily convert that to a "normal" date and time using `pandas`:

In [None]:
post_df['created_utc'] = pd.to_datetime(post_df['created_utc'], unit='s')
sns.histplot(post_df['created_utc'])
plt.show()
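For a single value, the same conversion can be done with the standard library alone. This is a small sketch; `epoch_to_utc` is a helper name chosen here for illustration.

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


print(epoch_to_utc(0).isoformat())              # 1970-01-01T00:00:00+00:00
print(epoch_to_utc(1_600_000_000).isoformat())  # 2020-09-13T12:26:40+00:00
```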

What about posts in the last week? Can we triangulate what we saw on Twitter?

In [None]:
post_df[post_df['created_utc'] > (datetime.utcnow() - timedelta(days=7))].shape
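The result of this filter depends on when the cell is run. To make the logic reproducible, the reference time can be passed in explicitly, as in the following sketch on made-up timestamps (`count_recent` is a hypothetical helper, not part of `pandas` or `praw`):

```python
from datetime import datetime, timedelta, timezone


def count_recent(created_utc, now, days=7):
    """Count epoch timestamps that fall within the last `days` days before `now`."""
    cutoff = now - timedelta(days=days)
    return sum(
        1 for ts in created_utc
        if datetime.fromtimestamp(ts, tz=timezone.utc) > cutoff
    )


# Fixed reference time and invented post ages (in days)
now = datetime(2021, 3, 15, tzinfo=timezone.utc)
stamps = [(now - timedelta(days=d)).timestamp() for d in (1, 3, 6, 8, 30)]
print(count_recent(stamps, now))  # 3
```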

Can you give two reasons why we don't see as much activity as on Twitter?

Let's try something new: across all subreddits, let's look at posts that match the search term "red bull" (this may take a while). To make it comparable to our Twitter example, we will restrict the search to results from the last week.

In [None]:
post_week_df = pd.DataFrame()
params = {}
for i in range(3):
    posts = reddit.subreddit("all").search("red bull", sort="new", time_filter="week", params=params)
    rows = []
    for post in posts:
        rows.append({
            # display_name comes with the post; subreddit.title needs an extra request
            'subreddit': post.subreddit.display_name,
            'title': post.title,
            'text': post.selftext,
            'upvote_ratio': post.upvote_ratio,
            'score': post.score,
            'img_link': post.url,
            'created_utc': post.created_utc,
            'fullname': post.fullname
        })
    print("Added " + str(len(rows)) + " posts")
    if not rows:
        break  # no further search results
    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0
    post_week_df = pd.concat([post_week_df, pd.DataFrame(rows)], ignore_index=True)
    params['after'] = post_week_df['fullname'].iloc[-1]
post_week_df

We will likely have some of the same posts in different subreddits. Let's filter those out, using the `groupby` function of `pandas`:

In [None]:
post_week_grouped_df = post_week_df.groupby('title').agg({'subreddit':'unique'}).reset_index()
post_week_grouped_df
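To see what this grouping produces, here is the same idea in plain Python on a few made-up rows. The function name `subreddits_by_title` and the sample data are illustrative assumptions only.

```python
from collections import defaultdict


def subreddits_by_title(posts):
    """Map each title to the distinct subreddits it appeared in, in first-seen order."""
    grouped = defaultdict(list)
    for post in posts:
        subs = grouped[post['title']]
        if post['subreddit'] not in subs:
            subs.append(post['subreddit'])
    return dict(grouped)


# Invented rows: the first title was cross-posted to two subreddits
sample_posts = [
    {'title': 'Red Bull gives you wings', 'subreddit': 'energydrinks'},
    {'title': 'Red Bull gives you wings', 'subreddit': 'formula1'},
    {'title': 'New can design', 'subreddit': 'redbull'},
    {'title': 'Red Bull gives you wings', 'subreddit': 'energydrinks'},
]
grouped = subreddits_by_title(sample_posts)
print(grouped)
```

Each title ends up as a single entry with the list of subreddits it appeared in, which is the shape `groupby('title')` with `'unique'` gives us in the dataframe.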

As you know, many Reddit posts contain relatively little text, but there are a lot of comments on pretty much everything. So we don't have to stop at the posts; we can also make use of the comments.

In [None]:
for post in reddit.subreddit("redbull").new(limit=10):
    post.comments.replace_more(limit=None)
    for comment in post.comments:
        print(comment.body)
        print('-----------')

When scrolling through Reddit, we often come across a "More Comments" button. In `praw`, these show up as `MoreComments` objects among the comments; `replace_more()` replaces them with the comments they stand for. By default, it replaces at most 32 of them; passing `limit=None`, as above, replaces all of them, at the cost of additional requests.

Another issue is that we may not only want to look at the top-level comments, but also at the replies to those comments, the replies to those replies, and so on. You can think of the comment structure as a tree, which we traverse level by level (in what is called a "breadth-first search"):

In [None]:
for post in reddit.subreddit("redbull").new(limit=10):
    post.comments.replace_more(limit=None)
    comment_queue = post.comments[:]
    while len(comment_queue) > 0:
        comment = comment_queue.pop(0)
        print(comment.body)
        comment_queue.extend(comment.replies)
    print('-----------')
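The traversal order is easiest to check on a small, made-up tree. The sketch below uses `collections.deque` for the queue and also tracks how deep each comment sits. The `comment` helper and the toy tree are stand-ins for `praw` comment objects and only assume that each comment has a `body` and a list of `replies`.

```python
from collections import deque
from types import SimpleNamespace


def comment(body, replies=()):
    """Build a stand-in comment with a body and a list of replies."""
    return SimpleNamespace(body=body, replies=list(replies))


def walk_breadth_first(top_level):
    """Yield (depth, body) for every comment, level by level."""
    queue = deque((c, 0) for c in top_level)
    while queue:
        node, depth = queue.popleft()
        yield depth, node.body
        queue.extend((reply, depth + 1) for reply in node.replies)


tree = [
    comment("A", [comment("A1", [comment("A1a")]), comment("A2")]),
    comment("B", [comment("B1")]),
]
order = [body for _, body in walk_breadth_first(tree)]
print(order)  # ['A', 'B', 'A1', 'A2', 'B1', 'A1a']
```

All top-level comments come first, then all direct replies, then the replies to those. If the order does not matter, `post.comments.list()` in `praw` returns all comments of a post as one flat list.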

Of course, we can do a lot of extra text analysis on this, also incorporating the structure of the comment tree.