# Project 3 - Web APIs & NLP
*By: Despina Matos*

# Part One: Data Collection and Data Cleaning

## Problem Statement: 

Reddit is an American social news aggregation, web content rating, and discussion website, according to the definition on [Google](https://www.google.com/search?q=reddit&rlz=1C5CHFA_enUS877US877&oq=reddit+&aqs=chrome.0.69i59j35i39j0l2j69i60l4.4152j0j7&sourceid=chrome&ie=UTF-8). According to [Reddit's website](https://www.redditinc.com/), it has over 130,000 communities in the form of "subreddits". Each subreddit is a platform where users can post, comment, and vote. A "post" is where the community shares content through stories, links, images, and videos. A "comment" provides discussion on a post. Both comments and posts can be scored by being upvoted or downvoted. Yet there is a dilemma: what if we wanted to gather data from multiple subreddits and model it? It is difficult to compare such information without a classifier. Thus, **can we use supervised machine learning to classify similar content from two different web sources?**

*How do we investigate this problem?* We scraped about 1,000 posts in total from two chosen subreddits, about 500 posts from each, using the Pushshift Reddit API. Then, we used natural language processing to train a classifier that predicts which subreddit each post came from. The classification models we decided to use were Logistic Regression, Bernoulli Naive Bayes, Bagged Decision Tree, and Random Forest, which we evaluated on accuracy scores and results from confusion matrices.

## Executive Summary:

We begin by pulling the data from the two subreddits, r/Books and r/Movies, using the Pushshift Reddit API. The imported data was in JSON format, so we created dataframes in Pandas to make the data easier to clean and manipulate.

Once we had looked through the dataframes, we picked out particular subfields using the Pushshift Reddit API data dictionary. We focused on the title, author, selftext, and subreddit features. We chose these subfields because we wanted the best features for our modeling.

Next, we did some data cleaning. We checked for duplicate posts and missing values in each of the dataframes. 

Then, we did some exploratory data analysis. We checked the summary statistics and, as a result, dropped the selftext feature. Thinking back on our problem statement, we want to determine similar content in both of the datasets, so we looked at the frequently occurring words in each dataframe. We did this by using two NLP tools, stemming and CountVectorizer. We then chose to display bar graphs of the top 20 most frequently occurring single words and bigrams in each subreddit. Finally, we determined the outliers in each of our dataframes.

Next, we were able to preprocess. We merged our datasets into one and dropped the author feature because we do not need it for modeling. We mapped our target variable, subreddit, into a binary classification. We did some more NLP processing, using lemmatization, stemming, and stopwords to analyze our dataset further. Then, we created our X feature and target variable and did a train-test split. We decided to use the stemmed version of our X feature for our modeling. Lastly, we determined the baseline score to compare to our models' results.

Finally, we were able to model. We built four different classification models: Logistic Regression, Bernoulli Naive Bayes, Bagged Decision Tree, and Random Forest. We also created a confusion matrix for each model to gain further insight into it. We wanted to see how well our models were able to correctly classify where each post came from. In the end, we focused on the accuracy score and the bias-variance tradeoff of each model to determine which model was the best to answer our problem statement.

## Contents:
- [Information on the Two Subreddits](#Information-on-the-Two-Subreddits)
- [Data Collection](#Data-Collection)
    - [Libraries](#Libraries)
    - [Getting the URLs](#Getting-the-URLs) 
    - [Check the Status Codes](#Check-the-Status-Codes)
    - [Creating JSON Objects](#Creating-JSON-Objects)
    - [Creating Dataframes](#Creating-Dataframes)
    - [Creating the Correct Subfields](#Creating-the-Correct-Subfields)
- [Data Cleaning](#Data-Cleaning)
    - [Checking for Duplicate Posts](#Checking-for-Duplicate-Posts)
    - [Checking for Missing Values](#Checking-for-Missing-Values)
    - [Saved Clean Datasets](#Saved-Clean-Datasets)

## Information on the Two Subreddits

We chose r/Books and r/Movies as our two subreddits to answer our problem statement. Each subreddit covers anything relating to books or movies, respectively.

The [r/Books subreddit](https://www.reddit.com/r/books/) has about 17.5 million subscribers. It was created January 25, 2008. It filters the posts by weekly thread. Also, this subreddit has eleven rules one must follow before they can post. All posts must be directly book related, informative, and discussion focused. One must use a civil tone while posting or interacting with other users. One cannot make personal requests for book recommendations. One cannot ask, "What's that book called?". One must follow the promotional rules on Reddit. One must not solicit pirated books. One must mark posts that have spoilers. One cannot ask for homework help. One can only post text, images, and videos. One cannot post low-quality book lists. And one must look at the full rules and guidelines before posting.

The [r/Movies subreddit](https://www.reddit.com/r/movies/) has about 22.2 million subscribers. It was created January 25, 2008. It filters the posts by discussion, poster, media, article, trailers, news, resource, and spoilers. Also, this subreddit has eleven rules one must follow before they can post. One must watch out for self-promotion when posting. One must not include hate speech in their posts. One must not make ambiguous or click-bait posts. One cannot spam their posts on the site. One cannot name-call other users on the site. One cannot take part in brigading. One cannot post television clips or advertisements. One cannot spam posts about comic book movies. Or, in general, one cannot spam their posts on the site. One cannot give a lot of attention to negative things. And one cannot have misleading titles on their posts.


We believe that these subreddits are similar in content because books are often made into movies, and book fans often want to see movies made from the books they like.

We should also consider, in our data science process, how the pre-filtering rules in both of the subreddits will affect our overall results.

## Data Collection

We began by pulling in the necessary data by using the requests library. This library submits HTTP requests from Python. Thus, we will be able to request our two subreddits by their specific URLs and check their status codes to see if each request was completed successfully.

### Libraries

In [1]:
#Pulling in the data 
import requests

#Creating a time.sleep request
import time

#Creating a workable dataset
import pandas as pd

### Getting the URLs

Our two chosen sources are the r/Books and r/Movies subreddits. However, we need to use a specific type of URL to pull the data in: the [Pushshift Reddit API](https://github.com/pushshift/api). An Application Programming Interface (API) is the messenger that delivers our requests to the server holding the Reddit data and then delivers the response back to us. In other words, the API acts like a key that asks whether we can use the Reddit data for this data science process.

Lastly, we need to add a size parameter at the end of each URL because we want to pull in 500 posts from each subreddit. We chose this amount because we need to gather enough data to generate a meaningful result.

*Subreddit 1: Books*

In [2]:
#getting the url
subreddit_1_url = "https://api.pushshift.io/reddit/search/submission/?subreddit=books&size=500"

*Subreddit 2: Movies* 

In [3]:
#getting the url
subreddit_2_url = "https://api.pushshift.io/reddit/search/submission/?subreddit=movies&size=500"
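
The two URLs above differ only in the subreddit name. As a minimal sketch, they could also be built from their parts with the standard library; `BASE_URL` and `build_url` are hypothetical names introduced here for illustration and are not part of the original notebook.

```python
from urllib.parse import urlencode

# Base endpoint copied from the two URLs above
BASE_URL = "https://api.pushshift.io/reddit/search/submission/"

def build_url(subreddit, size=500):
    """Return the search URL for one subreddit (hypothetical helper)."""
    # urlencode turns the dictionary into "subreddit=...&size=..."
    query = urlencode({"subreddit": subreddit, "size": size})
    return f"{BASE_URL}?{query}"

print(build_url("books"))
print(build_url("movies"))
```

Building the query string this way keeps the size parameter in one place if we later want to change how many posts we ask for.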

Next, we need to check the status code for each URL. 

### Check the Status Codes

We check the response's status code because we need to know whether our HTTP request was completed successfully. We also need to think about how many requests per second we are sending to the server. We manage this by using the sleep function in the time library to pause after each request. Our goal is for each status code to be 200.

For other information on other types of status codes look [here](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes).

*Subreddit 1: Books*

In [4]:
#make request
res_1 = requests.get(subreddit_1_url)
#pause for 2 seconds so we do not send requests too quickly
time.sleep(2)

In [5]:
#check the status code
res_1.status_code

200

Our subreddit 1 data is successfully loaded in. 

*Subreddit 2: Movies*

In [6]:
#make request
res_2 = requests.get(subreddit_2_url)
#pause for 2 seconds so we do not send requests too quickly
time.sleep(2)

In [7]:
#check the status code
res_2.status_code

200

Our subreddit 2 data is successfully loaded in.
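
The request, status check, and pause are the same for both subreddits, so they could be wrapped in one function. This is only a sketch: `fetch_response` and its `get` parameter are hypothetical names introduced here, and raising an error on a non-200 status code is our assumption about how a failed request should be handled.

```python
import time
import requests

def fetch_response(url, pause_seconds=2, get=requests.get):
    """Request a URL, confirm a 200 status code, then pause (hypothetical helper)."""
    res = get(url)
    # Stop early if the request was not successful
    if res.status_code != 200:
        raise RuntimeError(f"Request failed with status code {res.status_code}")
    # Pause so that back-to-back calls do not hit the server too quickly
    time.sleep(pause_seconds)
    return res
```

With this in place, `res_1 = fetch_response(subreddit_1_url)` would replace the separate request and status-check cells above.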

Next, we need to check how our data is loaded in. We discovered that our data is in JSON format. *Why?* JavaScript Object Notation (JSON) is an open-standard data format, and it is the format that many web APIs, including this one, use to return data.

### Creating JSON Objects

We will create JSON objects for each of the subreddits to successfully complete the scraping of our datasets. This also lets us check that everything loaded in successfully.

*Subreddit 1: Books*

In [8]:
#creating the json object
results_1 = res_1.json()['data']

Our subreddit 1 data has been successfully loaded in.

*Subreddit 2: Movies*

In [10]:
#creating the json object
results_2 = res_2.json()['data']

Our subreddit 2 data has been successfully loaded in.
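
To make the `['data']` step more concrete, here is a small sketch using the standard json library. The response text below is made up for illustration; it only mimics the shape we rely on, a top-level "data" key that holds a list of posts.

```python
import json

# Made-up response body: a "data" key holding a list with one dictionary per post
raw_text = (
    '{"data": ['
    '{"title": "A post", "subreddit": "books"}, '
    '{"title": "Another post", "subreddit": "books"}'
    ']}'
)

parsed = json.loads(raw_text)  # roughly what calling .json() on a response does
posts = parsed["data"]         # keep only the list of posts

print(type(posts), len(posts))
print(posts[0]["title"])
```

Each item in the list is a Python dictionary whose keys are the features of one post.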

We successfully loaded in the data for both subreddits. **BUT!** There is an issue: the data is not in a format where we can clean and manipulate it.

In our next section, we will create dataframes to resolve this issue. 

### Creating Dataframes 

Using the Pandas library, we will be able to create a dataframe for each subreddit dataset. To emphasize, we use this to have easier access to the data.

*Subreddit 1: Books*

In [12]:
#creating the dataframe of subreddit 1
subreddit_1_df = pd.DataFrame(results_1)

In [13]:
#check to see if it worked
subreddit_1_df.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,post_hint,preview,secure_media,secure_media_embed,thumbnail_height,thumbnail_width,suggested_sort,author_cakeday,crosspost_parent,crosspost_parent_list
0,[],False,spiritofthesquirrels,,[],,text,t2_ehnka,False,False,...,,,,,,,,,,
1,[],False,ladydadas-nightmare,reading,"[{'e': 'text', 't': 'Les Mis'}]",Les Mis,richtext,t2_ttg7a,False,False,...,,,,,,,,,,
2,[],False,allahu_adamsmith,,[],,text,t2_my89u,False,False,...,,,,,,,,,,
3,[],False,shabuluba,,[],,text,t2_7xj3xr,False,True,...,,,,,,,,,,
4,[],False,pearloz,points-1,"[{'a': ':redstar:', 'e': 'emoji', 'u': 'https:...",:redstar:11,richtext,t2_437dr,False,False,...,,,,,,,,,,


Our subreddit 1 data has been successfully loaded in a dataframe.

We should check the shape of this dataframe.

In [14]:
#check the shape
subreddit_1_df.shape

(500, 71)

In subreddit 1, we have 500 rows and 71 columns. Each column is a different feature that a post consists of. For instance, one of the features is the author, and that feature tells us who created the post.

And each of the rows is a different post created in subreddit 1.

*Subreddit 2: Movies*

In [15]:
#creating the dataframe of subreddit 2
subreddit_2_df = pd.DataFrame(results_2)

In [16]:
#check to see if it worked
subreddit_2_df.head()

Unnamed: 0,all_awardings,allow_live_comments,author,author_flair_css_class,author_flair_richtext,author_flair_text,author_flair_type,author_fullname,author_patreon_flair,author_premium,...,post_hint,preview,secure_media,secure_media_embed,thumbnail_height,thumbnail_width,author_flair_background_color,author_flair_text_color,crosspost_parent,crosspost_parent_list
0,[],False,topstenclub,,[],,text,t2_534vl7cn,False,False,...,,,,,,,,,,
1,[],False,harunamika,,[],,text,t2_crr1rvj,False,False,...,,,,,,,,,,
2,[],False,funnypilgo,,[],,text,t2_9gwzy0x,False,False,...,,,,,,,,,,
3,[],False,westoffensive,,[],,text,t2_31tx3ksf,False,False,...,,,,,,,,,,
4,[],False,GoldenJoel,,[],,text,t2_69bnj,False,True,...,rich:video,"{'enabled': False, 'images': [{'id': 'dAeA21Yn...","{'oembed': {'author_name': 'Emory University',...","{'content': '&lt;iframe width=""600"" height=""33...",105.0,140.0,,,,


Our subreddit 2 data has been successfully loaded in a dataframe.

We should also check the shape of this dataframe.

In [17]:
#check the shape 
subreddit_2_df.shape

(500, 71)

Our subreddit 2 dataframe has the same shape as subreddit 1: 500 rows and 71 columns. However, the column names are not identical, as the last few columns in the two previews above show.
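
One likely reason the two dataframes do not share exactly the same columns is that Pandas builds the columns from every key it finds in the list of dictionaries, and not every post carries every key. The sketch below uses made-up posts to show this behavior; it is an illustration, not the actual data.

```python
import pandas as pd

# Made-up posts: the second one has no "post_hint" key
posts = [
    {"title": "First post", "author": "user_a", "post_hint": "image"},
    {"title": "Second post", "author": "user_b"},
]

df = pd.DataFrame(posts)

print(df.shape)           # rows = posts, columns = all keys seen across the posts
print(df.isnull().sum())  # the missing key shows up as a missing value
```

A key that appears in only some posts still becomes a column, with missing values for the posts that lack it.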

*So what is to come?* We cannot use all these features to make the best possible model for our problem statement. *How do we narrow them down?* We need to revisit what a classification model is. A classification model is a form of supervised machine learning that learns categorical class labels from a training set and uses them to classify new data. To clarify, a classification model has a discrete target variable whose classes are not ordered.

So in this case, we need to identify the best features for building the best possible model for our problem statement. To do this, we need to look again at the [Pushshift Reddit API](https://github.com/pushshift/api) documentation. In the next section, we will explore this documentation to get the correct subfields from both of our subreddits' dataframes.

### Creating the Correct Subfields

In this [data dictionary](https://github.com/pushshift/api), we can see which are the best features to put in a subfield. We decided on four features. 

We will start our subfield list with the title feature because we want to know what the post is called. Next, we want the author feature because we want to know who created the post. Then, we want the selftext feature because we would like to know the content of the post. Finally, we want the subreddit feature to determine which subreddit each post came from.

In [18]:
#creating a list of the 4 features
subfield = ['title', 'author', 'selftext', 'subreddit'] 

Now we will implement this list into our two dataframes. 

*Subreddit 1: Books*

In [19]:
#implementing the subfield in subreddit 1
subreddit_1_df = subreddit_1_df[subfield]

In [20]:
#check to see if it worked
subreddit_1_df.head()

| | title | author | selftext | subreddit |
|---|---|---|---|---|
| 0 | Book Suggestions for my husband and I to read ... | spiritofthesquirrels | My husband and I have decided to start and fin... | books |
| 1 | The one where Friends spoils Little Women | ladydadas-nightmare | I just started reading 'Little Women', and was... | books |
| 2 | The must-read works of the late, great Jim Har... | allahu_adamsmith | | books |
| 3 | Margaret Atwood to publish first collection of... | shabuluba | | books |
| 4 | One Man’s Impossible Quest to Read—and Review—... | pearloz | | books |


Our subreddit 1 data has been successfully changed in our dataframe.

We should check if our shape changed.

In [21]:
#Check the shape
subreddit_1_df.shape

(500, 4)

Our subreddit 1 has 500 rows and 4 columns.

*Subreddit 2: Movies*

In [22]:
#implementing the subfield in subreddit 2
subreddit_2_df = subreddit_2_df[subfield]

In [23]:
#check to see if it worked
subreddit_2_df.head()

| | title | author | selftext | subreddit |
|---|---|---|---|---|
| 0 | Gold Jewellery Manufacturer in Jaipur | topstenclub | | movies |
| 1 | looking for a movie similar to Stay alive, wit... | harunamika | Hello i am looking a movie similar to the movi... | movies |
| 2 | 2019 movies that hit you most emotionally? | funnypilgo | What movies made you cry or nearly made you bu... | movies |
| 3 | Where can I watch Salò, or the 120 Days of Sod... | westoffensive | [removed] | movies |
| 4 | Contagion: From Simple Cough, to Global Pandemic | GoldenJoel | | movies |


Our subreddit 2 data has been successfully changed in our dataframe.

We should also check if our shape changed.

In [24]:
#Check the shape
subreddit_2_df.shape

(500, 4)

Our subreddit 2 has 500 rows and 4 columns.
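
Selecting the subfields is the same operation for both dataframes. As a small sketch on made-up data, the selection could first confirm that every requested column exists; the extra `score` column and the check itself are assumptions added here for illustration.

```python
import pandas as pd

# Made-up dataframe with one column we do not want to keep
df = pd.DataFrame({
    "title": ["Post one", "Post two"],
    "author": ["user_a", "user_b"],
    "selftext": ["Some text", ""],
    "subreddit": ["books", "books"],
    "score": [1, 2],
})

subfield = ["title", "author", "selftext", "subreddit"]

# Confirm every requested column is present before selecting
missing = [col for col in subfield if col not in df.columns]
if missing:
    raise KeyError(f"Columns not found: {missing}")

# .copy() gives an independent dataframe rather than a view of the original
subset = df[subfield].copy()
print(subset.shape)
```

The check gives a clearer error message than the default one if a feature name is misspelled.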

Finally, we can start our data cleaning on both dataframes.

## Data Cleaning

If we want to generate a great model, we need to clean up our datasets before we use them. We need to check for duplicate posts and for missing values in our dataframes so we do not skew our model results. Finally, we can save our clean dataframes to use for the rest of our data science process in a different notebook.

### Checking for Duplicate Posts

Let's check for duplicate posts in both of our dataframes and, if we do have them, let's drop them.

*Subreddit 1: Books*

In [25]:
#checking and dropping the duplicates in subreddit 1 
subreddit_1_df = subreddit_1_df.drop_duplicates() 

In [26]:
#check to see if it worked, let's look at the shape
subreddit_1_df.shape

(492, 4)

Our subreddit 1 now has 492 rows and 4 columns. So there must have been eight duplicate posts.

*Subreddit 2: Movies*

In [27]:
#checking and dropping the duplicates in subreddit 2
subreddit_2_df = subreddit_2_df.drop_duplicates() 

In [28]:
#check to see if it worked, let's look at the shape
subreddit_2_df.shape

(475, 4)

Our subreddit 2 now has 475 rows and 4 columns. So there must have been 25 duplicate posts.
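
Instead of inferring the number of duplicates from the change in shape, we could count them directly. The sketch below uses made-up rows; note that, by default, a row only counts as a duplicate when every column matches an earlier row. The `subset` variant at the end is an alternative we did not use in this notebook.

```python
import pandas as pd

# Made-up posts: rows 0 and 1 are identical, row 2 repeats only the title
df = pd.DataFrame({
    "title": ["A", "A", "A", "B"],
    "author": ["x", "x", "y", "z"],
})

n_duplicates = df.duplicated().sum()  # rows that fully match an earlier row
deduped = df.drop_duplicates()        # keeps the first copy of each row

# Alternative: treat posts with the same title as duplicates
deduped_by_title = df.drop_duplicates(subset=["title"])

print(n_duplicates, len(deduped), len(deduped_by_title))
```

Which definition of "duplicate" is right depends on whether the same title from a different author should count as the same post.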

Next, we should check for missing data in each dataframe.

### Checking for Missing Values

Let's check for missing values in our dataframes because ideally we want none. However, if we do find any, we will need to fix them.

*Subreddit 1: Books*

In [29]:
#using isnull with sum to check for missing values
subreddit_1_df.isnull().sum()

title        0
author       0
selftext     0
subreddit    0
dtype: int64

Our subreddit 1 does not have any missing values.

*Subreddit 2: Movies*

In [30]:
#using isnull with sum to check for missing values
subreddit_2_df.isnull().sum()

title        0
author       0
selftext     0
subreddit    0
dtype: int64

Our subreddit 2 does not have any missing values.

Thus, our datasets do not have to be fixed. 
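
One caveat: the previews earlier show some blank selftext values and one `[removed]` value. Assuming these are stored as text rather than as true missing values, `isnull()` will not count them. The sketch below, on made-up rows, shows one way such placeholders could be counted separately.

```python
import pandas as pd

# Made-up posts: one real selftext, one empty string, one "[removed]" placeholder
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "selftext": ["Some text", "", "[removed]"],
})

null_count = df["selftext"].isnull().sum()                     # true missing values only
placeholder_count = df["selftext"].isin(["", "[removed]"]).sum()  # empty or removed text

print(null_count, placeholder_count)
```

So a result of zero from `isnull().sum()` tells us there are no true missing values, but not that every post has usable text.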

Let's save our dataframes so we can work with them later.

### Saved Clean Datasets

*Subreddit 1: Books*

In [31]:
#Data is clean so far in subreddit 1, so we save it to a CSV file
#index = False so that no index column is written
#to_csv returns None when given a file path, so there is nothing to assign
subreddit_1_df.to_csv('../datasets/subreddit1.csv', index = False)

*Subreddit 2: Movies*

In [32]:
#Data is clean so far in subreddit 2, so we save it to a CSV file
#index = False so that no index column is written
#to_csv returns None when given a file path, so there is nothing to assign
subreddit_2_df.to_csv('../datasets/subreddit2.csv', index = False)
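
One thing to keep in mind when these files are read back in a later notebook: by default, `pd.read_csv` turns empty fields into missing values. The sketch below shows this on made-up rows, using an in-memory buffer in place of a real file; passing `keep_default_na=False` is one option for keeping empty strings, not necessarily what the later notebooks do.

```python
import io
import pandas as pd

# Made-up posts: the second selftext is an empty string, not a missing value
df = pd.DataFrame({"title": ["A", "B"], "selftext": ["Some text", ""]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Default read: the empty field comes back as a missing value
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded["selftext"].isnull().sum())

# keep_default_na=False keeps the empty field as an empty string
buffer.seek(0)
kept = pd.read_csv(buffer, keep_default_na=False)
print(kept["selftext"].isnull().sum())
```

This means a missing-value check on the reloaded data can give a different answer than the check we ran before saving.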