# Reddit API Scraper Tool
---
Code used to log into Reddit's API and scrape posts from subreddits.  In this project, the subreddits used were 'fountainpens' and 'pens'.

## Contents
---
- [Imports and Setup](#Imports-and-Setup)
- [Functions](#Functions)
- [Steps to Run](#Steps-to-Run)

## Imports and Setup

Note: I never got around to implementing datetime to automate this further.  Consider that a stretch goal!
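
One possible reading of that stretch goal is stamping each export with the time it was taken, so repeated scrapes do not overwrite each other. The sketch below is only an illustration under that assumption: `timestamped_name` is a hypothetical helper that does not appear elsewhere in this notebook, and the filename pattern is arbitrary.

```python
from datetime import datetime


def timestamped_name(stem, when=None, suffix='.csv'):
    """Build a filename such as corpus_20240115_093000.csv (hypothetical helper)."""
    # Fall back to the current time when no timestamp is supplied
    when = when or datetime.now()
    return f"{stem}_{when.strftime('%Y%m%d_%H%M%S')}{suffix}"


# A fixed datetime keeps the example reproducible
print(timestamped_name('corpus', datetime(2024, 1, 15, 9, 30, 0)))
# corpus_20240115_093000.csv
```

A name built this way could be passed to `to_csv` in place of a fixed path.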

In [2]:
#Imports
import pandas as pd
import requests
import getpass

import os
from dotenv import load_dotenv
load_dotenv()

True

In [3]:
username = os.getenv('redditname')
password = os.getenv('password')
client_id = os.getenv('client_id')
client_secret = os.getenv('client_secret')
user_agent = os.getenv('user_agent')
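
`os.getenv` returns `None` when a variable is missing, so a typo in the `.env` file only shows up later as a failed login. A minimal sketch of an early check follows, assuming the five variable names used in the cell above; `missing_env_vars` is a hypothetical helper, not part of the original notebook.

```python
import os

# Names taken from the os.getenv calls in this notebook
REQUIRED_VARS = ['redditname', 'password', 'client_id', 'client_secret', 'user_agent']


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the environment (hypothetical helper)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print('Missing .env entries:', ', '.join(missing))
```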

In [4]:
#Authentication
auth = requests.auth.HTTPBasicAuth(client_id, client_secret)

data = {
    'grant_type': 'password',
    'username': username,
    'password': password
}

In [5]:
#create an informative header for your application - check if getting a response.
headers = {'User-Agent': 'namehere/0.0.1'}

res = requests.post(
    'https://www.reddit.com/api/v1/access_token',
    auth=auth,
    data=data,
    headers=headers)

print(res)

<Response [200]>


In [6]:
#retrieve access token
token = res.json()['access_token']
headers

{'User-Agent': 'namehere/0.0.1'}

In [14]:
#Set up headers and check if working
headers['Authorization'] = f'bearer {token}'

requests.get('https://oauth.reddit.com/api/v1/me', headers=headers).status_code == 200

True
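
The token cell indexes `res.json()['access_token']` directly, which raises a bare `KeyError` if the response body has no token. A small sketch of a more explicit check is below. It assumes a failed request returns a JSON body without an `access_token` key, possibly with an `error` field; `extract_token` and `auth_headers` are hypothetical helpers, and the payloads in the usage lines are made-up stand-ins for `res.json()`.

```python
def extract_token(payload):
    """Return the access token from a parsed token response, or raise with the reason."""
    if 'access_token' not in payload:
        # Assumption: failures carry an 'error' field in the JSON body
        reason = payload.get('error', 'unknown error')
        raise ValueError(f'No access token in response: {reason}')
    return payload['access_token']


def auth_headers(token, user_agent):
    """Build the headers used for authenticated requests."""
    return {'User-Agent': user_agent, 'Authorization': f'bearer {token}'}


# Made-up payload standing in for res.json()
example_headers = auth_headers(extract_token({'access_token': 'abc123'}), 'namehere/0.0.1')
print(example_headers['Authorization'])
# bearer abc123
```

In the notebook itself, this would be called as `extract_token(res.json())`.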

## Functions
---
|Function|Purpose|
|--------|-------|
|scraper|Takes the name of the subreddit to scrape and a classifier string.  Scrapes up to 1000 posts in batches of 100, extracts each post's title and body (selftext), builds a dataframe from them, adds a 'subreddit' column holding the classifier string, drops duplicate rows, and returns the dataframe.|
|csv_add|Takes two subreddit dataframes to add to the corpus.  Reads corpus.csv, concatenates the dataframes onto it, drops duplicates, resets the index, and returns the result.  When export is True, the result is also written to disk (the code as shown writes to corpus_AFTER.csv).|


In [15]:
# This has been the most challenging piece of code for me in the course, I think partly because I'm not at all familiar with how APIs work.
# I spoke with Alanna, Hank, Hank AND Alanna, searched the internet, and asked ChatGPT

def scraper(subreddit, classifier):
    
    """
    This function accepts the following inputs:
    subreddit: (str) Name of subreddit the user wants to scrape.
    classifier: (str) classifier of subreddit:
        fountainpens = 'fp'
        pens = 'pens'
    The function requests posts from the subreddit 100 at a time.  Each response is parsed
    as JSON and the post data is collected. The process loops until it has collected
    1000 posts or the listing runs out.

    A dataframe is generated using the title and selftext as columns.  A column is added 
    to the dataframe titled 'subreddit' that adds the classifier value to all cells.
    Finally, any duplicates in the dataframe are dropped and the dataframe is returned!
    """

    #Necessary variables for function to run
    base_url = 'https://oauth.reddit.com/r/'
    posts = []
    limit = 100
    after =  None
    total_posts = 1000

    #pulls data from Reddit API until total_posts are collected
    while len(posts) < total_posts:
        scrapes = requests.get(base_url+subreddit,
                               headers=headers,
                               params= {'limit':limit,'after':after})
        data = scrapes.json()
        new_posts = data['data']['children']
    
        #appends data to posts list
        for p in new_posts:
            posts.append(p['data'])

        #cycles to the next after string; stops early if the listing has no more pages
        after = data['data']['after']
        if after is None:
            break

    # Get the selftext and title from each child
    titles = [post['title'] for post in posts]
    selftexts = [post['selftext'] for post in posts]

    # Creating a DataFrame
    df = pd.DataFrame({'title': titles, 'selftext': selftexts})
    df['subreddit'] = classifier
    df.drop_duplicates(inplace = True)
    return df
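
The paging logic in `scraper` can be tried without touching the network. The sketch below separates the loop from the HTTP call: `collect_posts` follows the `after` cursor until it has enough posts or the listing ends, and `make_fake_fetcher` is a made-up stand-in for the Reddit request. Both names are hypothetical and exist only for this illustration.

```python
def collect_posts(fetch_page, total_posts=1000):
    """Follow the 'after' cursor until total_posts are collected or the listing ends."""
    posts = []
    after = None
    while len(posts) < total_posts:
        children, after = fetch_page(after)
        posts.extend(children)
        # Stop when there is no next page, otherwise the loop would restart from page one
        if after is None or not children:
            break
    return posts[:total_posts]


def make_fake_fetcher(n_items, page_size):
    """Return a fetch_page function that serves n_items fake posts in pages."""
    def fetch_page(after):
        start = 0 if after is None else int(after)
        end = min(start + page_size, n_items)
        children = [{'title': f'post {i}'} for i in range(start, end)]
        next_after = str(end) if end < n_items else None
        return children, next_after
    return fetch_page


short_listing = collect_posts(make_fake_fetcher(250, 100), total_posts=1000)
long_listing = collect_posts(make_fake_fetcher(5000, 100), total_posts=1000)
print(len(short_listing), len(long_listing))
# 250 1000
```

The short listing shows why the `after is None` check matters: a subreddit with fewer posts than the target would otherwise never satisfy the loop condition on new data alone.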
        

In [16]:
# Hank helped with showing me how to make this not write to csvs now that I have the data I need.
def csv_add(subreddit1, subreddit2, export = True):
    
    """
    subreddit1: (DataFrame) First subreddit dataframe, as returned by scraper.
    subreddit2: (DataFrame) Second subreddit dataframe, as returned by scraper.
    export: (bool) If True, write the combined dataframe to ./data/corpus_AFTER.csv.
    This function reads in corpus.csv and concatenates subreddit1 and subreddit2 to it.
    It then drops any duplicates, resets the index, and returns the combined dataframe.
    """

    #read in corpus.csv
    corpus_df = pd.read_csv('./data/corpus.csv', index_col = 0)

    #concatenates subreddit1 and subreddit2 to a temp dataframe
    temp_df = pd.concat([corpus_df, subreddit1, subreddit2])

    #drops any duplicates from the temp dataframe
    temp_df.drop_duplicates(inplace = True)

    #resets the index of the temp dataframe
    temp_df.reset_index(drop=True, inplace=True)

    #writes the updated dataframe to corpus_AFTER.csv (skipped when export is False)
    if export:
        temp_df.to_csv('./data/corpus_AFTER.csv')
    return temp_df
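
The concatenate, de-duplicate, and re-index steps in `csv_add` can be seen on a tiny in-memory example. The dataframes below are made-up stand-ins for corpus.csv and the two scraper outputs; no file is read or written.

```python
import pandas as pd

# Made-up stand-ins for the existing corpus and two fresh scrapes
corpus = pd.DataFrame({'title': ['a', 'b'], 'selftext': ['', 'x'], 'subreddit': ['fp', 'pens']})
new_fp = pd.DataFrame({'title': ['a', 'c'], 'selftext': ['', 'y'], 'subreddit': ['fp', 'fp']})
new_pens = pd.DataFrame({'title': ['b', 'd'], 'selftext': ['x', 'z'], 'subreddit': ['pens', 'pens']})

# Same three steps as csv_add: concat, drop duplicate rows, rebuild the index
combined = pd.concat([corpus, new_fp, new_pens]).drop_duplicates().reset_index(drop=True)

print(len(combined))
# 4
print(combined['subreddit'].value_counts().to_dict())
```

Rows 'a' and 'b' appear in both the corpus and the new scrapes, so only one copy of each survives.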

## Steps to Run
---

|Variable|Purpose|
|--------|-------|
|fountain_pens_df|Dataframe holding the 'fountainpens' subreddit data collected by the scraper function.|
|pens_df|Dataframe holding the 'pens' subreddit data collected by the scraper function.|
|temp|Combined dataframe returned from the csv_add function.  Used to check the post count for the fountainpens and pens subreddits.|

In [17]:
#Process the fountainpens subreddit
fountain_pens_df = scraper('fountainpens', 'fp')

In [18]:
#Process the pens subreddit
pens_df = scraper('pens', 'pens')

In [20]:
#Combine newly acquired dataframes with corpus.csv (export = False, so nothing is written to disk)
temp = csv_add(fountain_pens_df, pens_df, export = False)

In [21]:
#Check how many posts from each subreddit were collected
temp['subreddit'].value_counts()

subreddit
fp      1772
pens    1278
Name: count, dtype: int64