## Reddit API

This notebook collects posts through the Reddit API. The goal is to gather data over a fixed window, June 13th to June 18th, 2023, from two subreddits: `PersonalFinance` and `investing`. Because of restrictions Reddit recently placed on its API, the collection has to be run once a day.

In [1]:
# !pip install praw
import praw
import pandas as pd
import datetime
import time

I use PRAW (the Python Reddit API Wrapper), a third-party Python package, to access Reddit's API. Before running the notebook, I created a Reddit application and obtained the necessary API credentials [here](https://www.reddit.com/prefs/apps).

In [2]:
from credentials import API_KEY, API_SECRET, Reddit_password, Reddit_username
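
The `credentials` module itself is not shown in this notebook. A minimal sketch of what it might look like, assuming the secrets are kept in environment variables so nothing sensitive is committed alongside the notebook. The environment variable names (`REDDIT_API_KEY` and so on) and the `load_credentials` helper are hypothetical; only the four exported names come from the import above.

```python
# credentials.py (hypothetical layout; the real module is not shown in this notebook)
import os


def load_credentials(env=os.environ):
    """Read the four values the notebook imports from a mapping of environment variables."""
    # The environment variable names below are assumptions made for illustration.
    return {
        "API_KEY": env.get("REDDIT_API_KEY", ""),
        "API_SECRET": env.get("REDDIT_API_SECRET", ""),
        "Reddit_username": env.get("REDDIT_USERNAME", ""),
        "Reddit_password": env.get("REDDIT_PASSWORD", ""),
    }


# Module-level names matching `from credentials import ...` in the notebook.
_creds = load_credentials()
API_KEY = _creds["API_KEY"]
API_SECRET = _creds["API_SECRET"]
Reddit_username = _creds["Reddit_username"]
Reddit_password = _creds["Reddit_password"]
```

With this layout, a missing variable yields an empty string rather than an import error, so a failed login would surface later, when PRAW first contacts Reddit.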

In [3]:
reddit = praw.Reddit(
    client_id=API_KEY,
    client_secret=API_SECRET,
    user_agent='praw',
    username=Reddit_username,
    password=Reddit_password,
)

In [4]:
def combine_data(posts, label):
    """
    Combine relevant information from Reddit posts into a list of data rows.

    Args:
        posts (iterable): Reddit post objects, e.g. a PRAW listing generator.
        label (str): Label or category associated with the posts.

    Returns:
        list: List of data rows, each a tuple of the post's creation time (UTC epoch seconds), title, selftext, and subreddit name.

    """
    data = []  # List to store the combined data rows

    for p in posts:
        if p.stickied:
            continue  # Skip stickied posts and move to the next iteration
        # str() stores the subreddit's name rather than the PRAW Subreddit object
        row = (p.created_utc, p.title, p.selftext, str(p.subreddit))
        data.append(row)  # Append the row to the data list

    print(f"{label.upper()} Posts:: N= {len(data)}")  # Print the number of posts kept
    return data  # Return the combined data
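
To see what `combine_data` keeps and what it skips without calling Reddit, the same logic can be run on stand-in objects. This is only a sketch: `combine_data_sketch` and the `fake_posts` list are hypothetical, and the stand-ins carry just the attributes the function reads rather than being real PRAW submissions.

```python
from types import SimpleNamespace


def combine_data_sketch(posts, label):
    """Same row layout as combine_data, written for any iterable of post-like objects."""
    data = [
        (p.created_utc, p.title, p.selftext, str(p.subreddit))
        for p in posts
        if not p.stickied  # stickied (pinned) posts are skipped
    ]
    print(f"{label.upper()} Posts:: N= {len(data)}")
    return data


# Hypothetical stand-ins for submissions; only the attributes used above are set.
fake_posts = [
    SimpleNamespace(created_utc=1686650000.0, title="Pinned rules", selftext="",
                    subreddit="investing", stickied=True),
    SimpleNamespace(created_utc=1686653600.0, title="First post", selftext="body",
                    subreddit="investing", stickied=False),
]

rows = combine_data_sketch(fake_posts, "new")
empty = combine_data_sketch([], "hot")  # an empty listing yields an empty list
```

Only the non-stickied post ends up in `rows`, and an empty listing returns an empty list instead of raising.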

In [5]:
today = datetime.date.today().strftime("%Y%m%d")
today

'20230618'

## Personal Finance

In [6]:
subreddit = reddit.subreddit('PersonalFinance') # Get the 'PersonalFinance' subreddit instance

posts_new = subreddit.new(limit=1000) # Lazy listing generator for up to 1000 of the newest posts
posts_hot = subreddit.hot(limit=1000)

posts_top_all = subreddit.top(limit=1000)
posts_top_year = subreddit.top(limit=1000, time_filter="year")
posts_top_month = subreddit.top(limit=1000, time_filter="month")
posts_top_week = subreddit.top(limit=1000, time_filter="week")

posts_con_all = subreddit.controversial(limit=1000)
posts_con_year = subreddit.controversial(limit=1000, time_filter="year")
posts_con_month = subreddit.controversial(limit=1000, time_filter="month")
posts_con_week = subreddit.controversial(limit=1000, time_filter="week")

In [7]:
data_new = combine_data(posts_new, 'new')
data_hot = combine_data(posts_hot, 'hot')

print('sleeping for 60 seconds')
# This is where I found how to add a wait time in my code --> https://realpython.com/python-sleep/
time.sleep(60)

data_top_all = combine_data(posts_top_all, 'top_all')
data_top_year = combine_data(posts_top_year, 'top_year')
data_top_month = combine_data(posts_top_month, 'top_month')
data_top_week = combine_data(posts_top_week, 'top_week')

print('sleeping for another 60 seconds (last time)')
time.sleep(60)

data_con_all = combine_data(posts_con_all, 'controversial_all')
data_con_year = combine_data(posts_con_year, 'controversial_year')
data_con_month = combine_data(posts_con_month, 'controversial_month')
data_con_week = combine_data(posts_con_week, 'controversial_week')

NEW Posts:: N= 991
HOT Posts:: N= 892
sleeping for 60 seconds
TOP_ALL Posts:: N= 990
TOP_YEAR Posts:: N= 1000
TOP_MONTH Posts:: N= 995
TOP_WEEK Posts:: N= 996
sleeping for another 60 seconds (last time)
CONTROVERSIAL_ALL Posts:: N= 991
CONTROVERSIAL_YEAR Posts:: N= 999
CONTROVERSIAL_MONTH Posts:: N= 993
CONTROVERSIAL_WEEK Posts:: N= 996
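
The ten listing requests follow one pattern: `new`, `hot`, then `top` and `controversial` for each of four time filters. A minimal sketch of building them in a loop, assuming the subreddit object exposes the same four methods used above. `build_listings` and `FakeSubreddit` are hypothetical names; the fake class exists only so the sketch runs without network access.

```python
def build_listings(subreddit, limit=1000):
    """Return {label: listing} for the ten listings requested in the notebook."""
    listings = {
        "new": subreddit.new(limit=limit),
        "hot": subreddit.hot(limit=limit),
    }
    for kind in ("top", "controversial"):
        for time_filter in ("all", "year", "month", "week"):
            method = getattr(subreddit, kind)  # subreddit.top or subreddit.controversial
            listings[f"{kind}_{time_filter}"] = method(limit=limit, time_filter=time_filter)
    return listings


class FakeSubreddit:
    """Stand-in that records which listing was requested (not a PRAW object)."""

    def new(self, limit):
        return ("new", None, limit)

    def hot(self, limit):
        return ("hot", None, limit)

    def top(self, limit, time_filter):
        return ("top", time_filter, limit)

    def controversial(self, limit, time_filter):
        return ("controversial", time_filter, limit)


listings = build_listings(FakeSubreddit())
labels = list(listings)
```

With a real subreddit, each value could then be passed to `combine_data(listing, label)`, and the same helper would serve both subreddits.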


In [8]:
personal_finance = pd.DataFrame(
    data_new +
    data_hot +
    data_top_all +
    data_top_year +
    data_top_month +
    data_top_week +
    data_con_all +
    data_con_year +
    data_con_month +
    data_con_week,
    columns=['time', 'title', 'text', 'subreddit'])

personal_finance = personal_finance.drop_duplicates()
personal_finance.shape

(6161, 4)

In [9]:
personal_finance.to_csv(f"../data/{today}-personalfiannce-praw.csv", index=False)
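
The `time` column holds `created_utc` values, which are epoch seconds in UTC, and the `top` and `controversial` listings reach well outside the June 13th to June 18th window. A minimal sketch of restricting a frame to that window, assuming the year is 2023 (taken from the `today` output above) and a half-open interval ending at midnight on June 19th. `filter_window` and the `sample` frame are hypothetical.

```python
import pandas as pd


def filter_window(df, start, end):
    """Keep rows whose epoch-second `time` falls in [start, end), compared in UTC."""
    created = pd.to_datetime(df["time"], unit="s", utc=True)
    mask = (created >= pd.Timestamp(start, tz="UTC")) & (created < pd.Timestamp(end, tz="UTC"))
    return df.loc[mask]


# Hypothetical rows: one before the window, one inside it, one exactly at the end bound.
sample = pd.DataFrame({
    "time": [1686528000.0, 1686700800.0, 1687132800.0],
    "title": ["before", "inside", "after"],
})

window = filter_window(sample, "2023-06-13", "2023-06-19")
```

The end bound is exclusive so that a full June 18th is included without also admitting posts stamped at midnight on the 19th.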

## Investing

In [10]:
subreddit = reddit.subreddit('investing')

posts_new = subreddit.new(limit=1000)
posts_hot = subreddit.hot(limit=1000)

posts_top_all = subreddit.top(limit=1000)
posts_top_year = subreddit.top(limit=1000, time_filter="year")
posts_top_month = subreddit.top(limit=1000, time_filter="month")
posts_top_week = subreddit.top(limit=1000, time_filter="week")

posts_con_all = subreddit.controversial(limit=1000)
posts_con_year = subreddit.controversial(limit=1000, time_filter="year")
posts_con_month = subreddit.controversial(limit=1000, time_filter="month")
posts_con_week = subreddit.controversial(limit=1000, time_filter="week")

In [11]:
data_new = combine_data(posts_new, 'new')
data_hot = combine_data(posts_hot, 'hot')

print('sleeping for 60 seconds')
time.sleep(60)

data_top_all = combine_data(posts_top_all, 'top_all')
data_top_year = combine_data(posts_top_year, 'top_year')
data_top_month = combine_data(posts_top_month, 'top_month')
data_top_week = combine_data(posts_top_week, 'top_week')

print('sleeping for another 60 seconds (last time)')
time.sleep(60)

data_con_all = combine_data(posts_con_all, 'controversial_all')
data_con_year = combine_data(posts_con_year, 'controversial_year')
data_con_month = combine_data(posts_con_month, 'controversial_month')
data_con_week = combine_data(posts_con_week, 'controversial_week')

NEW Posts:: N= 845
HOT Posts:: N= 376
sleeping for 60 seconds
TOP_ALL Posts:: N= 990
TOP_YEAR Posts:: N= 1000
TOP_MONTH Posts:: N= 567
TOP_WEEK Posts:: N= 161
sleeping for another 60 seconds (last time)
CONTROVERSIAL_ALL Posts:: N= 993
CONTROVERSIAL_YEAR Posts:: N= 1000
CONTROVERSIAL_MONTH Posts:: N= 567
CONTROVERSIAL_WEEK Posts:: N= 161


In [12]:
investing = pd.DataFrame(
    data_new +
    data_hot +
    data_top_all +
    data_top_year +
    data_top_month +
    data_top_week +
    data_con_all +
    data_con_year +
    data_con_month +
    data_con_week, 
    columns=['time', 'title', 'text', 'subreddit'])

investing = investing.drop_duplicates()
investing.shape


(4296, 4)

In [13]:
investing.to_csv(
    f"../data/{today}-investing-praw.csv",
    index=False,
    escapechar='\\',  # Jeff Alexander helped me figure out the need for the escapechar
)
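
Because the notebook is run once a day and each run writes a date-stamped file, the same post can appear in several daily files. A minimal sketch of stitching the daily CSVs back together, assuming they all share the four columns written above. `merge_daily_files` is a hypothetical helper, and extra keyword arguments are passed straight to `pd.read_csv` (a file written with an `escapechar` may need the same `escapechar` when it is read back).

```python
import glob

import pandas as pd


def merge_daily_files(pattern, **read_csv_kwargs):
    """Concatenate every CSV matching `pattern` and drop rows repeated across days."""
    paths = sorted(glob.glob(pattern))  # date-stamped names sort chronologically
    if not paths:
        # No files yet: return an empty frame with the expected columns
        return pd.DataFrame(columns=["time", "title", "text", "subreddit"])
    frames = [pd.read_csv(path, **read_csv_kwargs) for path in paths]
    merged = pd.concat(frames, ignore_index=True)
    return merged.drop_duplicates(ignore_index=True)


# Example call (paths follow the naming used in this notebook):
# all_investing = merge_daily_files("../data/*-investing-praw.csv", escapechar="\\")
```

Rows are compared on all four columns, so a post whose text was edited between two daily runs would be kept twice.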