# Problem Statement

Given the common ground shared by the 2020 Democratic presidential candidates, is it possible to use classification modeling and Natural Language Processing techniques to take a random headline and body text from a candidate's subreddit and determine which subreddit it came from?


# Executive Summary

After a painstaking attempt to use classification modeling techniques (excluding neural networks), it proved very difficult to classify the 4,000+ documents collected from the subreddits of Bernie Sanders, Elizabeth Warren, Kamala Harris, and Pete Buttigieg.  However, three benchmarks are worth noting up front.

First, the baseline accuracy was only 26%, since the models were trained on four different classes.  One class was also under-represented, as I was not able to collect quite as many posts from the Elizabeth Warren subreddit.

Second, accuracy was actually quite high (close to 90%) when the candidate names and nicknames were left in the text.  However, I found it more interesting to build a model that excluded them, in order to see whether there are any strong cultural or policy differentiators between the candidates.
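One way to exclude candidate names before vectorizing is a simple regular-expression filter. The sketch below is only an illustration: the term list `CANDIDATE_TERMS` and the helper `strip_candidate_names` are hypothetical and are not the filter actually used in this project.

```python
import re

# Hypothetical term list (an assumption); the real project may have used more nicknames
CANDIDATE_TERMS = ['bernie', 'sanders', 'elizabeth', 'warren',
                   'kamala', 'harris', 'pete', 'buttigieg']

# Match any listed term as a whole word, ignoring case
NAME_PATTERN = re.compile(
    r'\b(?:' + '|'.join(re.escape(term) for term in CANDIDATE_TERMS) + r')\b',
    flags=re.IGNORECASE,
)

def strip_candidate_names(text):
    # Replace each matched name with a space, then collapse repeated whitespace
    without_names = NAME_PATTERN.sub(' ', text)
    return ' '.join(without_names.split())

cleaned = strip_candidate_names('Bernie Sanders unveils new climate plan')
print(cleaned)
```

A filter like this would be applied to the title and body text before fitting a vectorizer, so the model cannot rely on the names themselves.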

And third, while the model that took these considerations into account reached an accuracy of only 52%, that is still double the baseline score, and it provided a handful of insights about how the candidates are discussed within the Reddit bubble.
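The baseline figure can be sanity-checked with a few lines of arithmetic. This sketch assumes the baseline is the share of the largest class (always predicting the most common subreddit) and uses the unique post counts printed by the iterator cells in this notebook; the final modeling set may have differed slightly.

```python
# Unique post counts per subreddit, as printed by the iterator cells in this notebook
post_counts = {
    'SandersForPresident': 990,
    'Pete_Buttigieg': 988,
    'Kamala': 997,
    'ElizabethWarren': 797,
}

total_posts = sum(post_counts.values())

# Assumed baseline: always predict the largest class
majority_class = max(post_counts, key=post_counts.get)
baseline_accuracy = post_counts[majority_class] / total_posts

# Accuracy reported above for the model without candidate names
model_accuracy = 0.52
lift = model_accuracy / baseline_accuracy

print(majority_class, round(baseline_accuracy, 3), round(lift, 2))
```

Under these assumptions the baseline comes out at roughly 26%, and a 52% model is close to twice as accurate as always guessing the largest class.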



# Import Libraries

In [217]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import requests
import time

%matplotlib inline

#### Maximize DataFrame Display Columns

In [2]:
#Remove the max column and max row display limits in pandas

pd.options.display.max_columns = None
pd.options.display.max_rows = None

# Data Collection

#### Create a 'User-Agent header' for Access to Reddit API

In [3]:
headers = {'User-Agent' : 'Calliope'}

## Functions to Scrape Reddit API and Convert to DataFrame

### 1.  Subreddit Scraper Instantiator

In [4]:
#Create a function to scrape the first page of a subreddit, given its URL slug, and return the posts as a list.

def subredditscraper_instantiate(slug):

    output_list = []

    params = {'after' : None}
    url = 'https://www.reddit.com/r/'+slug+'.json'
    res = requests.get(url, params=params, headers=headers)
    if res.status_code == 200:
        slug_json = res.json()
        output_list.extend(slug_json['data']['children'])
        after = slug_json['data']['after']
        #print (after)
    else:
        print(res.status_code)
    time.sleep(1)
    return (output_list)

    #Confirm unique posts    
    #len(set([p['data']['name'] for p in output_list]))  

### 2.  Subreddit Scraper Iterator

In [218]:
def subredditscraper_iterator(slug, redd_list=False, iterations=1):
    #Start a new list from the first page if no existing list was passed in
    if redd_list is False:
        redd_list = subredditscraper_instantiate(slug)
    url = 'https://www.reddit.com/r/'+slug+'.json'
    for i in range(iterations):
        params = {"default_comment_sort" : 'new'}
        #Resume from the last post already collected, if there is one
        if redd_list:
            params['after'] = redd_list[-1]['data']['name']
        res = requests.get(url, params=params, headers=headers)
        if res.status_code == 200:
            slug_json = res.json()
            redd_list.extend(slug_json['data']['children'])
        else:
            print(res.status_code)
            break
        time.sleep(1)
    #Print the number of unique posts collected so far
    print(len(set([p['data']['name'] for p in redd_list])))
    return redd_list
        
        
    
#Confirm unique posts    
#len(set([p['data']['name'] for p in output_list]))  
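The paging logic can be tried without touching the network. The sketch below is a simplified stand-in, not the scraper above: `fake_fetch` and `collect_posts` are hypothetical names, the two pages are made-up data shaped like a Reddit listing, and the cursor comes from the listing's `after` field rather than from the last post's name.

```python
def fake_fetch(after=None):
    # Stand-in for requests.get(...).json(); returns one page of a made-up listing
    pages = {
        None: {'data': {'children': [{'data': {'name': 't3_a'}},
                                     {'data': {'name': 't3_b'}}],
                        'after': 't3_b'}},
        't3_b': {'data': {'children': [{'data': {'name': 't3_c'}}],
                          'after': None}},
    }
    return pages[after]

def collect_posts(fetch, max_pages):
    # Follow the 'after' cursor until it runs out or max_pages is reached
    posts = []
    after = None
    for _ in range(max_pages):
        page = fetch(after)
        posts.extend(page['data']['children'])
        after = page['data']['after']
        if after is None:
            break
    return posts

posts = collect_posts(fake_fetch, max_pages=5)
names = [p['data']['name'] for p in posts]
print(names)
```

Stopping when the cursor is `None` avoids re-requesting the first page once a listing is exhausted, which is one source of the duplicates that get dropped later.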

### 3.  List to Pandas Converter
#### Drop Duplicates and Convert to Pandas DataFrame, with optional save to .csv file

In [1]:
def reddit_to_df(redd_list, output_csv = False):
    #Convert to DataFrame and drop duplicates

    output_df = pd.DataFrame(redd_list)
    output_df = pd.DataFrame([posts for posts in output_df['data']])
    output_df.drop_duplicates('name', keep = 'first', inplace = True)
    if output_csv:    
        output_df.to_csv('./datasets/'+output_csv)
    return(output_df)
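To see what the converter does, here is a small sketch of the same flatten-and-deduplicate steps applied to a toy list. The three posts in `toy_list` are made-up records in the shape of scraped listing children, not real data.

```python
import pandas as pd

# Made-up records shaped like scraped listing children; 't3_a' appears twice
toy_list = [
    {'kind': 't3', 'data': {'name': 't3_a', 'title': 'First post', 'selftext': ''}},
    {'kind': 't3', 'data': {'name': 't3_b', 'title': 'Second post', 'selftext': 'body'}},
    {'kind': 't3', 'data': {'name': 't3_a', 'title': 'First post', 'selftext': ''}},
]

# Keep only the inner 'data' dict of each post, one row per post
toy_df = pd.DataFrame([post['data'] for post in toy_list])

# Drop repeated posts, identified by their unique 'name' field
toy_df = toy_df.drop_duplicates('name', keep='first')

print(toy_df[['name', 'title']])
```

Because overlapping pages can return the same post more than once, deduplicating on `name` keeps the row counts in line with the unique-post counts printed by the iterator.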

# Reddit URL Slugs for Project

In [9]:
#Democratic Primary Competitors

#Bernie Sanders
bernie_slug = 'SandersForPresident'

#Pete Buttigieg
butti_slug = 'Pete_Buttigieg'

#Kamala Harris
kamala_slug = 'Kamala'

#Elizabeth Warren
warren_slug = 'ElizabethWarren'

In [10]:
#Comparison Slugs

climate_slug = 'ClimateOffensive'
funny_slug = 'funny'

# Instantiate Reddit Lists 

#### ----Run Only Once----

In [11]:
#Test subreddit
if 0==1:
    funny_list = subredditscraper_instantiate('funny')

In [12]:
#Comparison Subreddit for Democratic Primary Competitors
if 0==1:
    climate_list = subredditscraper_instantiate('ClimateOffensive')

In [13]:
#Democratic Primary Competitors
if 0==1:
    bernie_list = subredditscraper_instantiate(bernie_slug) 
    butti_list = subredditscraper_instantiate(butti_slug)
    kamala_list = subredditscraper_instantiate(kamala_slug)
    warren_list = subredditscraper_instantiate(warren_slug)


## Iterate Reddit Lists
#### Come back to this section to generate more data

In [219]:
#Comparison Subreddits

funny_list = subredditscraper_iterator('funny', funny_list, 40)
climate_list = subredditscraper_iterator('ClimateOffensive', climate_list, 40)

760
972


In [220]:
#Democratic Primary Competitors

bernie_list = subredditscraper_iterator('SandersForPresident', bernie_list, 40) 
butti_list = subredditscraper_iterator('Pete_Buttigieg', butti_list, 40)
kamala_list = subredditscraper_iterator('Kamala', kamala_list, 40)
warren_list = subredditscraper_iterator('ElizabethWarren', warren_list , 40)

990
988
997
797


## Convert to DataFrames and Save to .CSV

In [221]:
#Comparison Subreddits

funny_df = reddit_to_df(funny_list, 'raw_funny.csv')
climate_df = reddit_to_df(climate_list, 'raw_climate.csv')

In [222]:
#Democratic Candidates

bernie_df = reddit_to_df(bernie_list, 'raw_bernie.csv')
butti_df = reddit_to_df(butti_list, 'raw_butti.csv')
kamala_df = reddit_to_df(kamala_list, 'raw_kamala.csv')
warren_df = reddit_to_df(warren_list, 'raw_warren.csv')