# Reddit Post API

The purpose of this notebook is to query a specific subreddit for its most recent posts and save those posts to a CSV file and a JSON file. It is meant to be used in conjunction with the **Topic Modeling** notebook. The notebook has the following sections:

1. Creating API Object
2. Creating the dataframe
3. Exporting the dataframe

# Creating API Object

Before anything else could be done, I first had to set up a Reddit development app. Once I had the app, I could find the access keys and ID needed to query the API.

In [42]:
import myconfig

# The development app credentials are stored in a config file for convenience and safety. The praw documentation explains where to find them on the Reddit development app page.
cid = myconfig.cid
csec = myconfig.csec
ua = myconfig.ua
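
The contents of `myconfig` are not shown in this notebook. A minimal sketch of what such a file could look like follows, assuming the credentials are kept in environment variables; the variable names `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, and `REDDIT_USER_AGENT` and the fallback user agent string are hypothetical, and only the attribute names `cid`, `csec`, and `ua` come from the cell above.

```python
# myconfig.py (hypothetical contents; keep this file out of version control)
import os

# The environment variable names below are assumptions, not something praw requires.
cid = os.environ.get("REDDIT_CLIENT_ID", "")        # client ID of the development app
csec = os.environ.get("REDDIT_CLIENT_SECRET", "")   # client secret of the development app
ua = os.environ.get("REDDIT_USER_AGENT", "reddit-post-api-notebook")  # descriptive user agent
```

Keeping the secrets outside the notebook means the notebook itself can be shared without exposing them.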

In [43]:
# praw is a Python wrapper around the Reddit API; it is the module used to make the requests.
import praw

# create a reddit connection
reddit = praw.Reddit(client_id=cid,
                     client_secret=csec,
                     user_agent=ua)

### Testing the API Object
Now that the API object is created, I wanted to test that it works and see how the returned data is structured before building the dataframe.

There are different ways to query the API object. The first test pulls the newest post in r/news and prints every attribute attached to that post, which gives me an idea of how I want the dataframe to be arranged. The second test pulls the top 5 posts in r/datascience and prints their titles. It is also possible to query multiple subreddits at once, although that is not necessary for this project.

In [44]:
new_post = reddit.subreddit('news').new(limit=1)
for post in new_post:
    print(vars(post))

{'comment_limit': 2048, 'comment_sort': 'confidence', '_reddit': <praw.reddit.Reddit object at 0x7f9db173edc0>, 'approved_at_utc': None, 'subreddit': Subreddit(display_name='news'), 'selftext': '', 'author_fullname': 't2_9zgtt86k', 'saved': False, 'mod_reason_title': None, 'gilded': 0, 'clicked': False, 'title': 'As the FBI comes under threat, its leader tries to stay out of fray', 'link_flair_richtext': [], 'subreddit_name_prefixed': 'r/news', 'hidden': False, 'pwls': 6, 'link_flair_css_class': None, 'downs': 0, 'top_awarded_type': None, 'hide_score': True, 'name': 't3_wtqo1r', 'quarantine': False, 'link_flair_text_color': 'dark', 'upvote_ratio': 0.89, 'author_flair_background_color': None, 'subreddit_type': 'public', 'ups': 209, 'total_awards_received': 0, 'media_embed': {}, 'author_flair_template_id': None, 'is_original_content': False, 'user_reports': [], 'secure_media': None, 'is_reddit_media_domain': False, 'is_meta': False, 'category': None, 'secure_media_embed': {}, 'link_flair

In [45]:
top_post = reddit.subreddit('datascience').top(limit=5)
for post in top_post:
    print(post.title)

data siens
The pain and excitement
Shout Out to All the Mediocre Data Scientists Out There
It’s never too early
Guys, we’ve been doing it wrong this whole time
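
Although it is not needed for this project, querying several subreddits at once can be sketched as follows. praw accepts subreddit names joined with `+` (for example `'news+datascience'`); the helper name `combine_subreddits` is hypothetical, and the commented-out lines assume the `reddit` object created earlier.

```python
def combine_subreddits(names):
    """Join subreddit names with '+', the form used for combined listings."""
    return "+".join(name.strip() for name in names)

combined = combine_subreddits(["news", "datascience"])

# With the `reddit` object from earlier, the combined name is used like a single subreddit:
# for post in reddit.subreddit(combined).new(limit=5):
#     print(post.subreddit, post.title)
```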


# Creating the dataframe

In [46]:
import pandas as pd

# List for df conversion
posts = []

# Requesting up to 5000 new posts in r/news (Reddit listings are generally capped at about 1,000 items, so fewer may come back)
news_posts = reddit.subreddit('news').new(limit=5000)

# Collect the attributes of interest for each post
for post in news_posts:
    posts.append([post.title, # Title of the post
                  post.url, # URL the post links to (the article)
                  post.score, # Reddit score of the post
                  post.pinned, # Whether the post is pinned
                  post.upvote_ratio, # Share of all votes that are upvotes
                  post.total_awards_received, # Total number of awards the post received
                  post.created_utc]) # When the post was created (Unix time, UTC)

# Creating the dataframe
posts = pd.DataFrame(posts, columns=['title', 'article_url', 'score', 'pinned', 'upvote_ratio', 'total_awards', 'created_utc'])

# Checking results
posts.head(10)

| | title | article_url | score | pinned | upvote_ratio | total_awards | created_utc |
|---|---|---|---|---|---|---|---|
| 0 | As the FBI comes under threat, its leader trie... | https://www.washingtonpost.com/national-securi... | 212 | False | 0.89 | 0 | 1661057000.0 |
| 1 | Daughter of Russian who was inspirational forc... | https://www.cnn.com/2022/08/20/europe/darya-du... | 3891 | False | 0.97 | 5 | 1661051000.0 |
| 2 | Caught in act, suspect in catalytic theft free... | https://www.ktvu.com/news/caught-in-act-suspec... | 2088 | False | 0.95 | 0 | 1661045000.0 |
| 3 | Climate change forces indigenous islanders in ... | https://www.bbc.com/news/av/world-latin-americ... | 1002 | False | 0.93 | 0 | 1661038000.0 |
| 4 | Albania arrests two Russians, one Ukrainian tr... | https://www.reuters.com/world/europe/albania-a... | 797 | False | 0.96 | 1 | 1661031000.0 |
| 5 | Russia accuses Ukraine of ‘chemical terrorism’... | https://www.aljazeera.com/news/2022/8/20/russi... | 250 | False | 0.82 | 0 | 1661029000.0 |
| 6 | Gary Busey charged with sex offenses at Monste... | https://www.nbcnews.com/news/rcna44079 | 6444 | False | 0.96 | 1 | 1661029000.0 |
| 7 | China sentences tycoon Xiao Jianhua to 13 year... | https://www.nbcnews.com/news/world/china-tycoo... | 2303 | False | 0.96 | 0 | 1661022000.0 |
| 8 | Public schools receive 'In God We Trust' poste... | https://www.cnn.com/2022/08/19/us/texas-school... | 21342 | False | 0.88 | 2 | 1661019000.0 |
| 9 | UN: US buying big Ukraine grain shipment for h... | https://apnews.com/article/russia-ukraine-drou... | 767 | False | 0.95 | 0 | 1661014000.0 |
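
The `created_utc` values are Unix timestamps (seconds since 1970-01-01 UTC), which are hard to read at a glance. A minimal sketch of converting one to a readable datetime with the standard library follows; the helper name `utc_from_epoch` is hypothetical, and the input is the display-rounded value from the first row, so the result is only approximate for that post.

```python
from datetime import datetime, timezone

def utc_from_epoch(seconds):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

created = utc_from_epoch(1661057000.0)
print(created.isoformat())
```

For a whole column, pandas offers `pd.to_datetime(posts['created_utc'], unit='s', utc=True)` as an equivalent conversion.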


# Exporting the dataframe

NOTE: Encoding and decoding can misbehave with some characters. The one that gives me the most trouble is the possessive apostrophe.

Luckily, I know this character will be removed anyway when the data is preprocessed for topic modeling and sentiment analysis. Therefore, I go ahead and replace it with a space in all post titles before exporting to the CSV and JSON files.

Checking the dataset before analysis is recommended.
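
One way to do such a check is sketched below: it lists the titles that still contain an apostrophe. The helper name `titles_with_apostrophes` is hypothetical, and treating both the straight `'` and the curly `’` as apostrophes is an assumption; the sample titles are taken from the r/datascience output earlier in this notebook.

```python
# Characters treated as apostrophes in this sketch: straight (') and curly (’)
APOSTROPHES = ("'", "\u2019")

def titles_with_apostrophes(titles):
    """Return the titles that contain at least one apostrophe character."""
    return [t for t in titles if any(a in t for a in APOSTROPHES)]

sample = ["It\u2019s never too early", "The pain and excitement"]
flagged = titles_with_apostrophes(sample)
print(len(flagged), "of", len(sample), "titles contain an apostrophe")
```

On the real data, the same function could be called with `posts['title'].tolist()`.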

In [47]:
import re

# Function to replace apostrophes (straight ' and curly ’) with a space
def removeApostrophe(text):
    text = re.sub("['’]", " ", text)
    return text

# Applying function to post titles
posts['title'] = posts['title'].apply(removeApostrophe)

# Output a CSV
posts.to_csv('../data/reddit_data.csv', encoding="utf-8")

# Output a JSON
posts.to_json('../data/reddit_data.json')
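
To see whether a given encoding preserves tricky characters, a small round trip can be run before trusting the export. This is a sketch using the standard library `csv` module rather than pandas, writing to a temporary file instead of `../data/`; the helper name `roundtrip_titles` is hypothetical.

```python
import csv
import os
import tempfile

def roundtrip_titles(titles, encoding="utf-8"):
    """Write titles to a temporary CSV, read them back, and return what was read."""
    fd, path = tempfile.mkstemp(suffix=".csv")
    os.close(fd)
    try:
        with open(path, "w", newline="", encoding=encoding) as f:
            writer = csv.writer(f)
            writer.writerow(["title"])
            writer.writerows([t] for t in titles)
        with open(path, newline="", encoding=encoding) as f:
            return [row["title"] for row in csv.DictReader(f)]
    finally:
        os.remove(path)  # clean up the temporary file

sample = ["It\u2019s never too early", "Guys, we\u2019ve been doing it wrong this whole time"]
restored = roundtrip_titles(sample)
print(restored == sample)
```

If the titles come back unchanged, the character survives writing and reading with the same encoding, and any garbling is more likely to come from reading the file with a different one.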