# Project 3: Web APIs & NLP - 01

Kelly Slatery | US-DSI-10 | 01.31.2020

## Problem Statement

The 1960s bore witness to a music revolution like no one had ever seen before: multitrack recording and music video clips started to spread and rock & roll went mainstream, all alongside a rise in national U.S. pop culture. The Beatles broke the charts after the formation of their legendary group in 1957, while Queen dared to challenge norms after their formation in 1970. The popularity and quality of these two groups are often compared due to their record-breaking album sales, the dedication of their fanbases, their iconic, ground-breaking tracks, and their shared membership in the rock genre. Normally, deciding which of these groups could be considered "the greatest band" is a matter of criteria and subjective opinion. However, in this notebook, we will answer the question of who is really the better band with one criterion: the love of their fans to this day. Which fanbase loves their band more? And how can we use classification models to determine which band receives more love from their fanbase to end this discussion once and for all?

## Executive Summary

In this project, we explore data collected through the Pushshift API for Reddit to classify posts as belonging to either the Beatles or the Queen subreddit. The data analyzed consists of the most recent 20,000 posts (submissions) from each subreddit as of Tuesday morning, January 28th. As the request limit for the Pushshift API is 200 requests per minute (with 500 posts/submissions per request), all of the data for this notebook was collected without issue in under a minute.

In order to classify posts, some initial data cleaning and NLP parsing were required before modeling with both Count Vectorizers and TF-IDF (Term Frequency-Inverse Document Frequency) Vectorizers. Seven different classification models were tested using various parameter search methods, and all but one were tested with both types of vectorizer, to determine which combination of vectorizer, model type, and hyperparameters yielded the most accurate model.
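
To make the difference between the two vectorizers concrete, here is a minimal plain-Python sketch rather than the scikit-learn classes used in the project. The toy titles are hypothetical, whitespace splitting stands in for real tokenization, and the weighting uses the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1` without the length normalization a library implementation would normally add.

```python
import math
from collections import Counter

# Hypothetical toy titles, not taken from the collected data
docs = ["love the white album", "love queen live", "queen live aid"]

def count_vectors(docs):
    """Bag-of-words counts per document (the idea behind a count vectorizer)."""
    return [Counter(doc.split()) for doc in docs]

def tfidf_vectors(docs):
    """Counts reweighted by a smoothed inverse document frequency."""
    counts = count_vectors(docs)
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for c in counts for term in c)
    return [
        {term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
         for term, tf in c.items()}
        for c in counts
    ]

counts = count_vectors(docs)
tfidf = tfidf_vectors(docs)
print(counts[0]["love"], counts[0]["album"])  # both terms occur once in the first title
print(tfidf[0]["love"] < tfidf[0]["album"])   # the rarer term gets the larger weight
```

Under raw counts, "love" and "album" look equally important in the first title; under TF-IDF, "album" is weighted higher because it appears in fewer documents.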

In the end, the Support Vector Machine (SVC), Random Forest, and Logistic Regression models yielded the most accurate predictions. However, SVC was inefficient and Random Forest was more overfit, whereas Logistic Regression provided fairly accurate predictions (accuracy ~85.53%; baseline accuracy ~50.6%) and a more interpretable model. Thus, the Logistic Regression model using the TF-IDF Vectorizer was the transformer/model combination of choice to classify posts/submissions and answer our problem statement.
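
The baseline accuracy quoted above is the score a model would get by always predicting the more common class. As a rough illustration of that calculation, here is a small sketch; the function name and the label list are hypothetical and do not reproduce the project's class balance.

```python
from collections import Counter

def baseline_accuracy(labels):
    """Accuracy of always predicting the most common class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical labels (1 = Beatles, 0 = Queen), not the project's data
labels = [1] * 6 + [0] * 4
print(baseline_accuracy(labels))  # 0.6
```

Any model worth keeping has to beat this number, which is why it is reported next to the model accuracy.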

After running the Logistic Regression model, we found that among the top 20 predictors for both the Beatles and Queen, almost all were band member names, album names, or important years. Thus, we looked at the next 20 predictors for each band and found a few more interesting words. After looking at all predictors of the Beatles subreddit vs. predictors of the Queen subreddit, a trend appeared: Beatles subreddit predictors tended to include more words associated with success/popularity (e.g. album(s), cover(s), official, Spotify, ranking), while Queen subreddit predictors tended to include more positive words (e.g. amazing, special, quality, love, enjoy).
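
One way to read predictors off a logistic regression is to sort its coefficients: with the subreddit binarized as 1 = Beatles and 0 = Queen, large positive coefficients push a post toward Beatles and large negative ones toward Queen. The sketch below assumes a plain word-to-coefficient mapping; the helper name and the coefficient values are made up for illustration and are not the fitted model's results.

```python
# Hypothetical coefficients for illustration only; real values come from the fitted model
coefficients = {"album": 1.2, "official": 0.8, "spotify": 0.5, "love": -0.7, "amazing": -1.1}

def top_predictors(coefs, n=2):
    """Return the n strongest predictors for each class, strongest first."""
    ranked = sorted(coefs.items(), key=lambda item: item[1])  # most negative first
    queen = [word for word, coef in ranked[:n] if coef < 0]
    beatles = [word for word, coef in reversed(ranked[-n:]) if coef > 0]
    return beatles, queen

beatles_words, queen_words = top_predictors(coefficients, n=2)
print(beatles_words)  # ['album', 'official']
print(queen_words)    # ['amazing', 'love']
```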

## Data Dictionary

As described above, the data here comes from two subreddits: [Beatles](https://www.reddit.com/r/beatles/) and [Queen](https://www.reddit.com/r/queen/). Within the API's rate limits (200 requests/min at 500 submissions/comments per request), 20,000 submissions and 20,000 comments were pulled from each subreddit. However, only the submissions data was used for analysis in this project. Data types and manipulations are described below:

|  	| Column Name 	| Data Type 	| Modifications 	| Description 	|
|----	|--------------------------------	|-----------	|-----------------------------------------------------------------------------------	|---------------------------------------------------------------------	|
| 0 	| author 	| object 	| None 	| Author of post/submission 	|
| 1 	| author_flair_text 	| object 	| None 	| Flair text displayed next to the author's username 	|
| 2 	| created_utc 	| int64 	| None 	| Time post was created in UTC 	|
| 3 	| score 	| int64 	| None 	| Aggregate sum of upvotes and downvotes (no negatives) 	|
| 4 	| selftext 	| object 	| Nulls replaced with '-' 	| Body of post/submission 	|
| 5 	| subreddit 	| int64 	| Binarized 	| Subreddit to which the post/submission belongs (1=Beatles, 0=Queen) 	|
| 6 	| title 	| object 	| Nulls replaced with '-' 	| Title of post/submission 	|
| 7 	| author_full 	| object 	| None 	| Combined: 'author' and 'author_flair_text' 	|
| 8 	| all_text 	| object 	| None 	| Combined: 'selftext' and 'title' 	|
| 9 	| tokenized_selftext 	| object 	| Text data: tokenized with regular expression 	| Tokenized 'selftext' 	|
| 10 	| tokenized_title 	| object 	| Text data: tokenized with regular expression 	| Tokenized 'title' 	|
| 11 	| tokenized_all_text 	| object 	| Text data: tokenized with regular expression 	| Tokenized 'all_text' 	|
| 12 	| lemmatized_tokenized_selftext 	| object 	| Text data: tokenized with regular expression, lemmatized with WordNetLemmatizer() 	| Lemmatized 'tokenized_selftext' 	|
| 13 	| lemmatized_tokenized_title 	| object 	| Text data: tokenized with regular expression, lemmatized with WordNetLemmatizer() 	| Lemmatized 'tokenized_title' 	|
| 14 	| lemmatized_tokenized_all_text 	| object 	| Text data: tokenized with regular expression, lemmatized with WordNetLemmatizer() 	| Lemmatized 'tokenized_all_text' 	|
| 15 	| stemmatized_tokenized_selftext 	| object 	| Text data: tokenized with regular expression, stemmatized with PorterStemmer() 	| Stemmatized 'tokenized_selftext' 	|
| 16 	| stemmatized_tokenized_title 	| object 	| Text data: tokenized with regular expression, stemmatized with PorterStemmer() 	| Stemmatized 'tokenized_title' 	|
| 17 	| stemmatized_tokenized_all_text 	| object 	| Text data: tokenized with regular expression, stemmatized with PorterStemmer() 	| Stemmatized 'tokenized_all_text' 	|
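
The first few modifications in the table can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the rows are hypothetical stand-ins for API results, the space used to join `selftext` and `title` is a guess, and the regular expression `\w+` is an assumed tokenization pattern, since the table does not state the exact one. Lemmatizing and stemming are left out.

```python
import re

# Hypothetical raw rows standing in for API results
rows = [
    {"subreddit": "beatles", "title": "Hyde Park, May 1967", "selftext": None},
    {"subreddit": "queen", "title": "We Will Rock You", "selftext": "Great show last night"},
]

def clean_row(row):
    """Apply the null replacement, binarization, text combination, and tokenization steps."""
    title = row["title"] or "-"        # nulls replaced with '-'
    selftext = row["selftext"] or "-"  # nulls replaced with '-'
    all_text = f"{selftext} {title}"   # assumed separator: a single space
    return {
        "subreddit": 1 if row["subreddit"] == "beatles" else 0,  # 1=Beatles, 0=Queen
        "title": title,
        "selftext": selftext,
        "all_text": all_text,
        # Assumed pattern: lowercase runs of word characters
        "tokenized_all_text": re.findall(r"\w+", all_text.lower()),
    }

cleaned = [clean_row(row) for row in rows]
print(cleaned[0]["tokenized_all_text"])  # ['hyde', 'park', 'may', '1967']
```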

In [1]:
# Imports
import numpy as np
import pandas as pd
import requests

## Collect data

### Collect initial 500 posts per subreddit

In [2]:
# How many requests per group do we need to make to get 20,000 posts?
20000/500

40.0
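
The same arithmetic can be tied to the rate limit quoted in the Executive Summary (200 requests per minute, 500 items per request). The helper names below are hypothetical; the sketch only estimates a lower bound on collection time if requests were spaced evenly at the limit.

```python
import math

def requests_needed(total_items, page_size=500):
    """Number of requests required to page through total_items."""
    return math.ceil(total_items / page_size)

def min_seconds(n_requests, limit_per_minute=200):
    """Lower bound on wall time if requests are spaced evenly at the rate limit."""
    return n_requests * 60 / limit_per_minute

n = requests_needed(20000)
print(n, min_seconds(n))  # 40 12.0
```

At 40 requests per subreddit, the pull stays well under the per-minute limit.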

In [3]:
# Define the base url for submissions from the Pushshift API
baseurl = 'https://api.pushshift.io/reddit/search/submission'


### Define functions to collect the data

In [4]:
# Define a function to get new parameters for the preceding (older) 500 posts
def get_params(base_df, subreddit):
    params = {
        'subreddit': subreddit,
        'size': 500,
        # Page backwards in time: only request posts created before the oldest one collected so far
        'before': int(base_df['created_utc'].iloc[-1])
    }
    return params

In [5]:
# Define a function that returns a list of dictionaries for the content of each post
def get_posts(params, baseurl='https://api.pushshift.io/reddit/search/submission'):
    res = requests.get(baseurl, params=params)
    # Fail loudly: returning an error string here would be passed on to pd.DataFrame downstream
    if res.status_code != 200:
        raise RuntimeError(f'Error! Status code: {res.status_code}')
    # The posts are stored under the 'data' key of the JSON response
    return res.json()['data']
  

In [6]:
# Define a function to turn the list of posts into a DataFrame
def create_new_df(posts):
    return pd.DataFrame(posts)

In [7]:
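# Define a function to update the base DataFrame with the 500 succeeding posts
# (baseurl defaults to the submissions endpoint; pass the comments url to page through comments)
def update_df(base_df, subreddit, baseurl='https://api.pushshift.io/reddit/search/submission'):
    params = get_params(base_df, subreddit)
    posts = get_posts(params, baseurl=baseurl)
    df2 = create_new_df(posts)
    # Stack the new batch under the existing rows; sort=True orders the union of columns alphabetically
    updated = pd.concat([base_df, df2], axis=0, ignore_index=True, sort=True)
    return updated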
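
Taken together, the functions above implement cursor-style paging: each request asks for posts created before the oldest post collected so far. The sketch below shows that loop in isolation, without pandas or network access. `collect`, `fetch_page`, and the fake in-memory "API" are hypothetical names for illustration, and the sketch assumes results arrive newest first and that `before` is exclusive.

```python
def collect(fetch_page, subreddit, target, page_size=500):
    """Page backwards in time until `target` posts are collected or no posts remain.

    `fetch_page` is any callable that takes a params dict and returns a list of
    post dicts (a function like get_posts would fit this role).
    """
    posts = []
    params = {"subreddit": subreddit, "size": page_size}
    while len(posts) < target:
        page = fetch_page(params)
        if not page:
            break  # nothing older is available
        posts.extend(page)
        # Move the cursor to the oldest post seen so far
        params = {**params, "before": page[-1]["created_utc"]}
    return posts[:target]

# Fake in-memory "API" for demonstration: 12 posts, newest first
FAKE_POSTS = [{"id": i, "created_utc": 1000 - i} for i in range(12)]

def fake_fetch(params):
    before = params.get("before", float("inf"))
    older = [p for p in FAKE_POSTS if p["created_utc"] < before]
    return older[: params["size"]]

result = collect(fake_fetch, "beatles", target=10, page_size=5)
print([p["id"] for p in result])  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Stopping on an empty page guards against looping forever when a subreddit has fewer posts than the target.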
    

### Beatles: 500 submissions

In [8]:
# Set up url parameters for the first pull from the Beatles subreddit (first 500 posts)
params_beatles = {
    'subreddit': 'beatles', 
    'size': 500
}

In [9]:
# Get a list of posts
posts_beatles = get_posts(params_beatles)

In [10]:
# Create a dataframe from the posts
df_beatles = create_new_df(posts_beatles)

In [11]:
# Look at the shape (rows, columns)
df_beatles.shape

(500, 74)

In [12]:
# Look at dataframe of posts
df_beatles.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,removed_by_category,media,media_embed,secure_media,secure_media_embed,crosspost_parent,crosspost_parent_list,media_metadata,steward_reports,author_cakeday
0,[],False,Lanovart,,[],,text,t2_5dxdcxr1,False,False,...,,,,,,,,,,
1,[],False,jackjoy1992,,[],,text,t2_hxiw11,False,False,...,,,,,,,,,,
2,[],False,jackjoy1992,,[],,text,t2_hxiw11,False,False,...,,,,,,,,,,
3,[],False,jackjoy1992,,[],,text,t2_hxiw11,False,False,...,,,,,,,,,,
4,[],False,jackjoy1992,,[],,text,t2_hxiw11,False,False,...,,,,,,,,,,


In [13]:
# Look at columns: subreddit, selftext (post body), title, created_utc
df_beatles[['subreddit', 'selftext', 'title', 'created_utc']].head()

Unnamed: 0,subreddit,selftext,title,created_utc
0,beatles,,Fan art. Magnet from gypsum,1580154959
1,beatles,,"EMI Studios, 1963.",1580154755
2,beatles,,"Klein, Lennon and Ono, 1969.",1580154655
3,beatles,,"Hyde Park, May 1967",1580154607
4,beatles,,"Hyde Park, May 1967",1580154599
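
The `created_utc` values are Unix timestamps (seconds since 1970-01-01 UTC). As a quick sanity check on when the newest post was created, a value can be converted with the standard library; the helper name here is hypothetical.

```python
from datetime import datetime, timezone

def utc_from_epoch(seconds):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# created_utc of the first Beatles submission shown above
first_post = utc_from_epoch(1580154959)
print(first_post.isoformat())  # 2020-01-27T19:55:59+00:00
```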


### Queen: 500 submissions

In [14]:
# Set up url parameters for the first pull from the Queen subreddit (first 500 posts)
params_queen = {
    'subreddit': 'queen', 
    'size': 500
}

In [15]:
# Get a list of posts
posts_queen = get_posts(params_queen)

In [16]:
# Create a dataframe from the posts
df_queen = create_new_df(posts_queen)

In [17]:
# Look at the shape (rows, columns)
df_queen.shape

(500, 71)

In [18]:
# Look at the dataframe of posts
df_queen.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,crosspost_parent_list,media,media_embed,secure_media,secure_media_embed,removed_by_category,author_flair_background_color,author_flair_text_color,author_flair_template_id,steward_reports
0,[],False,Chevalenz,,[],,text,t2_3xkkann2,False,False,...,,,,,,,,,,
1,[],False,Brooktrout12,,[],,text,t2_tdkqfco,False,False,...,,,,,,,,,,
2,[],False,Palafranco,,[],,text,t2_45sac6gi,False,False,...,,,,,,,,,,
3,[],False,itsyeboiweeaboo,,[],,text,t2_5cuaqbpf,False,False,...,,,,,,,,,,
4,[],False,Taylor200212,,[],,text,t2_3s18h91g,False,False,...,,,,,,,,,,


In [19]:
# Look at columns: subreddit, selftext (post body), title, created_utc
df_queen[['subreddit', 'selftext', 'title', 'created_utc']].head()

Unnamed: 0,subreddit,selftext,title,created_utc
0,queen,"With 15 votes, 'Hang On In There' is out! Reme...",The Miracle - Queen Survivor (Round 6),1580155230
1,queen,,Freddie Mercury Canvas Artwork in New Orleans,1580153714
2,queen,,Rare pic of young Freddie (in the middle) with...,1580146789
3,queen,Okay so I recently finished being in the We Wi...,We Will Rock You,1580146178
4,queen,"If we are being completely honest, chances are...",Remaining members of Queen,1580143852


### Collect 19,500 more posts per subreddit

In [20]:
# How many requests per group do we need to make to get remaining 19,500 posts?
19500/500

39.0

In [21]:
# Update the Beatles dataframe with the 19,500 succeeding posts
for i in range(39):
    df_beatles = update_df(df_beatles, 'beatles')
    if i in [10, 20, 30]:
        print(df_beatles.shape)

df_beatles.shape

(6000, 81)
(11000, 81)
(16000, 87)


(20000, 87)

In [22]:
# Update the Queen dataframe with the 19,500 succeeding posts
for i in range(39):
    df_queen = update_df(df_queen, 'queen')
    if i in [10, 20, 30]:
        print(df_queen.shape)

df_queen.shape

(6000, 78)
(11000, 78)
(16000, 82)


(20000, 87)

### Export 20,000 Beatles subreddit posts

In [48]:
# Export Beatles submissions dataframe to a csv
df_beatles.to_csv('./data/beatles_subs.csv', index=False)

### Export 20,000 Queen subreddit posts

In [49]:
# Export Queen submissions dataframe to a csv
df_queen.to_csv('./data/queen_subs.csv', index=False)

## Collect 20,000 comments per subreddit

In [25]:
baseurl_comments = 'https://api.pushshift.io/reddit/search/comment'

### Beatles: 500 comments

In [26]:
# Set up url parameters for the first pull from the Beatles subreddit (first 500 comments)
params_beatles_com = {
    'subreddit': 'beatles', 
    'size': 500
}

In [27]:
# Get a list of comments
comments_beatles = get_posts(params_beatles_com, baseurl = baseurl_comments)

In [28]:
# Create a dataframe from the comments
df_beatles_comments = create_new_df(comments_beatles)

In [29]:
# Look at the shape (rows, columns)
df_beatles_comments.shape

(500, 34)

In [30]:
# Look at dataframe of comments
df_beatles_comments.head()

Unnamed: 0,all_awardings,associated_award,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,...,permalink,retrieved_on,score,send_replies,stickied,subreddit,subreddit_id,total_awards_received,author_cakeday,edited
0,[],,356BC,,,[],,,,text,...,/r/beatles/comments/euqx9g/concert_abby_road_1...,1580156504,1,True,False,beatles,t5_2qt7l,0,,
1,[],,EveningsAndWeekends,,,[],,,,text,...,/r/beatles/comments/euqx9g/concert_abby_road_1...,1580156293,1,True,False,beatles,t5_2qt7l,0,,
2,[],,EveningsAndWeekends,,,[],,,,text,...,/r/beatles/comments/euqx9g/concert_abby_road_1...,1580156255,1,True,False,beatles,t5_2qt7l,0,,
3,[],,Bowiequeen,,,[],,,,text,...,/r/beatles/comments/eugkn7/did_my_first_cross_...,1580156133,1,True,False,beatles,t5_2qt7l,0,,
4,[],,Scrutchpipe,,,[],,,,text,...,/r/beatles/comments/eulc65/the_sikh_man_in_hey...,1580156000,1,True,False,beatles,t5_2qt7l,0,,


In [31]:
# Look at columns: subreddit, body, author_flair_text
df_beatles_comments[['subreddit', 'body', 'author_flair_text', 'created_utc']].head()

Unnamed: 0,subreddit,body,author_flair_text,created_utc
0,beatles,"Sorry, I wasn't trying to sound like a dick. I...",,1580156460
1,beatles,"Oh man to be one of those standing there, watc...",,1580156243
2,beatles,Ded from those sick beats,,1580156204
3,beatles,Living is easy with eyes closed,,1580156077
4,beatles,It would be good to do a ‘where are they now’ ...,,1580155942


### Queen: 500 comments

In [32]:
# Set up url parameters for the first pull from the Queen subreddit (first 500 comments)
params_queen_com = {
    'subreddit': 'queen', 
    'size': 500
}

In [33]:
# Get a list of comments
comments_queen = get_posts(params_queen_com, baseurl = baseurl_comments)

In [34]:
# Create a dataframe from the comments
df_queen_comments = create_new_df(comments_queen)

In [35]:
# Look at the shape (rows, columns)
df_queen_comments.shape

(500, 35)

In [36]:
# Look at dataframe of comments
df_queen_comments.head()

Unnamed: 0,all_awardings,associated_award,author,author_flair_background_color,author_flair_css_class,author_flair_richtext,author_flair_template_id,author_flair_text,author_flair_text_color,author_flair_type,...,retrieved_on,score,send_replies,stickied,subreddit,subreddit_id,total_awards_received,author_cakeday,distinguished,edited
0,[],,Chevalenz,,,[],,,,text,...,1580155989,1,True,False,queen,t5_2s4ze,0,,,
1,[],,EFF198783,,,[],,,,text,...,1580155709,1,True,False,queen,t5_2s4ze,0,,,
2,[],,Dilanep37,,,[],,,,text,...,1580154022,1,True,False,queen,t5_2s4ze,0,,,
3,[],,Dilanep37,,,[],,,,text,...,1580153638,1,True,False,queen,t5_2s4ze,0,,,
4,[],,Jakeybaby125,,,[],,,,text,...,1580153376,1,True,False,queen,t5_2s4ze,0,,,


In [37]:
# Look at columns: subreddit, body, author_flair_text
df_queen_comments[['subreddit', 'body', 'author_flair_text', 'created_utc']].head()

Unnamed: 0,subreddit,body,author_flair_text,created_utc
0,queen,Gotta be honest: 'Hang On In There' and 'Khash...,,1580155931
1,queen,Freddie should be on 1st place!!!,,1580155699
2,queen,"I can get why you don't like hot space, as it ...",,1580154020
3,queen,"nah, I like a lot of other songs on the album....",,1580153636
4,queen,That's you opinion but I honestly think it's q...,,1580153373


### Collect 19,500 more comments per subreddit

In [38]:
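# Update the Beatles dataframe with the 19,500 succeeding comments
# (query the comments endpoint explicitly; update_df() defaults to submissions)
for i in range(39):
    params = get_params(df_beatles_comments, 'beatles')
    new_comments = create_new_df(get_posts(params, baseurl=baseurl_comments))
    df_beatles_comments = pd.concat([df_beatles_comments, new_comments], axis=0, ignore_index=True, sort=True)
    if i in [10, 20, 30]:
        print(df_beatles_comments.shape)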
        
df_beatles_comments.shape

(6000, 87)
(11000, 87)
(16000, 93)


(20000, 93)

In [39]:
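# Update the Queen dataframe with the 19,500 succeeding comments
# (query the comments endpoint explicitly; update_df() defaults to submissions)
for i in range(39):
    params = get_params(df_queen_comments, 'queen')
    new_comments = create_new_df(get_posts(params, baseurl=baseurl_comments))
    df_queen_comments = pd.concat([df_queen_comments, new_comments], axis=0, ignore_index=True, sort=True)
    if i in [10, 20, 30]:
        print(df_queen_comments.shape)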
            
df_queen_comments.shape

(6000, 85)
(11000, 85)
(16000, 87)


(20000, 95)

### Export 20,000 Beatles subreddit comments

In [46]:
# Export Beatles comments dataframe to a csv
df_beatles_comments.to_csv('./data/beatles_coms.csv', index=False)

### Export 20,000 Queen subreddit comments

In [47]:
# Export Queen comments dataframe to a csv
df_queen_comments.to_csv('./data/queen_coms.csv', index=False)