# Project 3: API

## Introduction
For Project 3, the objective is to collect and analyze posts from two distinct subreddits. Subreddits are specialized forums on the popular website Reddit that cater to specific topics and interests.

The goal is to build a classification model that predicts which of the following two subreddits a post came from:

**Subreddit 1: [r/AmItheAsshole](https://www.reddit.com/r/AmItheAsshole/)**

Description: Serving as a platform for moral introspection, r/AmItheAsshole provides a space for individuals to seek feedback on contentious issues. Users present both sides of a dilemma and solicit opinions to ascertain whether their actions align with societal norms.


**Subreddit 2: [r/AskLawyers](https://www.reddit.com/r/AskLawyers/)**

Description: Offering legal guidance, r/AskLawyers encourages users to pose questions with the understanding that professional legal advice requires consultation with an attorney. Public posts and comments in this subreddit are not construed as forming an attorney-client relationship.


The selection of these two subreddits stems from their shared purpose of providing a venue for individuals to seek input on personal matters. While one focuses on ethical considerations and societal expectations, the other centers on legal principles and potential courses of action. This juxtaposition between moral inquiry and legal inquiry presents an intriguing avenue for exploration, highlighting the intersection of public opinion and formal legal frameworks.



In [1]:
# Import necessary libraries
import pandas as pd
import requests
import time
from datetime import datetime

# this setting widens how many characters pandas will display in a column:
pd.options.display.max_colwidth = 400

## Authorizing a Connection

In [2]:
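# set credentials
# Credentials are read from environment variables so they are not stored in the notebook.
# The variable names below are placeholders; use whatever names are set on your machine.
import os

client_id = os.environ['REDDIT_CLIENT_ID'] #alphanumeric string provided under "personal use script"
client_secret = os.environ['REDDIT_CLIENT_SECRET']  #alphanumeric string provided as "secret"
user_agent = 'Project 3' #the name of your application
username = os.environ['REDDIT_USERNAME'] #your reddit username
password = os.environ['REDDIT_PASSWORD'] #your reddit password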

In [3]:
# Requesting authorization 
auth = requests.auth.HTTPBasicAuth(client_id, client_secret)

data = {
    'grant_type': 'password',
    'username': username,
    'password': password
}
   
#create an informative header for my application, using the name set in `user_agent`
headers = {'User-Agent': user_agent}

res = requests.post(
    'https://www.reddit.com/api/v1/access_token',
    auth=auth,
    data=data,
    headers=headers)

#retrieve access token
token = res.json()['access_token']

headers['Authorization'] = f'bearer {token}'

requests.get('https://oauth.reddit.com/api/v1/me', headers=headers).status_code == 200

True
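
The token request above assumes the response always contains `access_token`. Below is a minimal sketch of a more defensive version. The helper name `build_auth_headers` is hypothetical, and the assumption that a failed request returns a JSON body with an `error` field instead of a token should be checked against Reddit's OAuth documentation.

```python
def build_auth_headers(token_response, user_agent):
    """Return request headers for oauth.reddit.com from a parsed token response.

    `token_response` is the dict obtained from `res.json()`.
    """
    token = token_response.get('access_token')
    if not token:
        # Assumption: a failed request carries an 'error' field instead of a token
        reason = token_response.get('error', 'unknown error')
        raise ValueError(f'No access token returned: {reason}')
    return {'User-Agent': user_agent, 'Authorization': f'bearer {token}'}


# Example with a hand-written payload (not a real token)
example_headers = build_auth_headers({'access_token': 'abc123'}, 'Project 3')
print(example_headers['Authorization'])
```

In the notebook this could be called as `headers = build_auth_headers(res.json(), user_agent)`, so that a bad credential fails with a readable message rather than a `KeyError`.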

## Collecting Data from Subreddit 1

In the code below, I set the subreddit I want to collect from in the `subreddit` variable.
Then I make 10 requests of up to 100 posts each. For every request, the posts are collected into a list of dictionaries, converted to a data frame, and saved to their own CSV file.

Note: To page through the listing, each request after the first passes the `after` parameter, which takes the ID returned by the previous request. The boolean `label` is `False` on the first iteration, so the first request is sent without `after`.

Citations:
* Rowan Schaefer helped me make the request loop.
* [datetime function](https://www.toppr.com/guides/python-guide/tutorials/python-date-and-time/datetime/current-datetime/how-to-get-current-date-and-time-in-python/#:~:text=The%20datetime%20module's%20now(),dd%20hh%3Amm%3Ass.)
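
The `after` mechanism described in the note can be sketched without touching the network. In the example below, `collect_pages` and `fetch_page` are hypothetical names, and the two fake pages stand in for real API responses. The sketch assumes that a listing returns `after` as `None` once there are no further pages, and it stops at that point rather than sending another request.

```python
def collect_pages(fetch_page, max_requests=10):
    """Collect posts page by page using the 'after' cursor.

    `fetch_page(after)` is a hypothetical callable that returns the parsed
    JSON of one listing request; `after` is None for the first request.
    """
    posts = []
    after = None
    for _ in range(max_requests):
        listing = fetch_page(after)['data']
        posts.extend(child['data'] for child in listing['children'])
        after = listing['after']
        if after is None:
            # No further page: stop instead of requesting the first page again
            break
    return posts


# Stand-in for the API: two fake pages keyed by the cursor that requests them
fake_pages = {
    None: {'data': {'after': 't3_b',
                    'children': [{'data': {'title': 'a'}}, {'data': {'title': 'b'}}]}},
    't3_b': {'data': {'after': None,
                      'children': [{'data': {'title': 'c'}}]}},
}
titles = [post['title'] for post in collect_pages(lambda after: fake_pages[after])]
print(titles)
```

With the real API, `fetch_page` would wrap the `requests.get(base_url + subreddit, headers=headers, params=params)` call and the `time.sleep(2)` pause used in the notebook.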

In [4]:
# Defining the site we are connecting to
base_url = 'https://oauth.reddit.com/r/'
subreddit = 'AmItheAsshole'

# Getting the posts
label = False  # False until a first request has returned an 'after' ID
for request in range(10): 
    
    time.sleep(2) # wait 2 seconds between runs

    # set parameters; 'after' is only added once a previous request has run
    params = {'limit': 100}
    if label:
        params['after'] = after_label

    # make the request
    res = requests.get(base_url + subreddit, 
                   headers = headers,
                   params = params)
    
    # sets the 'after' parameter
    after_label = res.json()['data']['after'] 
    label = True

    
    # Making a Data Frame using the information from the 100 posts
    
    posts = [] # list of dictionaries to store post data

    # parse the response once, then loop through posts to get pertinent data
    children = res.json()['data']['children']
    
    for child in children:
        post_title = child['data']['title']
        post_text = child['data']['selftext']
        post_source = child['data']['subreddit']

        posts.append({'title': post_title, 'post': post_text, 'source' : post_source})

    #creating a Pandas DataFrame using 'posts'
    df = pd.DataFrame(posts) 
    
    # Storing the current date in a variable to use in csv file name
    date = datetime.now().strftime('%m-%d')

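    # saving data frame to csv; the subreddit is part of the file name so that
    # files from the two collection loops cannot overwrite each other
    df.to_csv(f'../project-3/Data/df_{subreddit}_{date}_{request}.csv', index = False)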
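
The row-building step of the loop can also be isolated in a small function, which makes it possible to check the extraction logic without making a request. The sketch below uses only the field names that appear in the loop (`title`, `selftext`, `subreddit`). The function name `listing_to_rows` and the sample payload are made up for illustration.

```python
def listing_to_rows(listing):
    """Turn one parsed listing response into a list of row dictionaries."""
    rows = []
    for child in listing['data']['children']:
        post = child['data']
        rows.append({
            'title': post['title'],
            'post': post['selftext'],
            'source': post['subreddit'],
        })
    return rows


# Hand-written sample in the shape the loop expects (not real Reddit data)
sample_listing = {
    'data': {
        'after': None,
        'children': [
            {'data': {'title': 'Example title',
                      'selftext': 'Example body',
                      'subreddit': 'AmItheAsshole'}},
        ],
    }
}
rows = listing_to_rows(sample_listing)
print(rows)
```

Passing the result to `pd.DataFrame(rows)` should give the same `title`, `post`, and `source` columns as the data frames built in the loop.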



## Collecting Data from Subreddit 2

This section repeats the collection procedure used for Subreddit 1; the only change is the value of the `subreddit` variable. As before, I make 10 requests of up to 100 posts each, build a data frame from each response, and save each data frame to a CSV.

Note: Each request after the first passes the `after` parameter, which takes the ID returned by the previous request. The boolean `label` is `False` on the first iteration, so the first request is sent without `after`.

Citations:
* Rowan Schaefer helped me make the request loop.
* [datetime function](https://www.toppr.com/guides/python-guide/tutorials/python-date-and-time/datetime/current-datetime/how-to-get-current-date-and-time-in-python/#:~:text=The%20datetime%20module's%20now(),dd%20hh%3Amm%3Ass.)

In [6]:
# Defining the site we are connecting to
base_url = 'https://oauth.reddit.com/r/'
subreddit = 'AskLawyers'


label = False  # False until a first request has returned an 'after' ID

for request in range(10): 
    
    time.sleep(2) # wait 2 seconds between runs

    # set parameters; 'after' is only added once a previous request has run
    params = {'limit': 100}
    if label:
        params['after'] = after_label

    # make the request
    res = requests.get(base_url + subreddit, 
                   headers = headers,
                   params = params)
    
    # sets the 'after' parameter
    after_label = res.json()['data']['after'] 
    label = True

    
    # Making a Data Frame using the information from the 100 posts
    
    posts = [] # list of dictionaries to store post data

    # parse the response once, then loop through posts to get pertinent data
    children = res.json()['data']['children']
    
    for child in children:
        post_title = child['data']['title']
        post_text = child['data']['selftext']
        post_source = child['data']['subreddit']

        posts.append({'title': post_title, 'post': post_text, 'source' : post_source})

    #creating a Pandas DataFrame using 'posts'
    df = pd.DataFrame(posts) 
    
    # Storing the current date in a variable to use in csv file name
    date = datetime.now().strftime('%m-%d')

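    # saving data frame to csv; the subreddit is part of the file name so that
    # files from the two collection loops cannot overwrite each other
    df.to_csv(f'../project-3/Data/df_{subreddit}_{date}_{request}.csv', index = False)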
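
Both collection loops write one CSV per request into the same folder, so each file name has to be unique across subreddits, dates, and request numbers. A minimal sketch of a helper that builds such a name is shown below. The function name `build_csv_path`, the `Data` folder argument, and the fixed example date are illustrative assumptions, not part of the original notebook.

```python
from datetime import datetime
from pathlib import Path


def build_csv_path(data_dir, subreddit, request_index, when=None):
    """Build a CSV path that is unique per subreddit, date, and request."""
    when = when or datetime.now()
    date = when.strftime('%m-%d')  # same month-day format as the notebook
    return Path(data_dir) / f'df_{subreddit}_{date}_{request_index}.csv'


# Fixed date so the example is reproducible
example_path = build_csv_path('Data', 'AskLawyers', 0, when=datetime(2024, 1, 15))
print(example_path.name)
```

The returned path can be passed directly to `df.to_csv(..., index=False)`. Note that the `%m-%d` format omits the year, so runs on the same calendar day in different years would still collide.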
