<a id="section__top"></a>

# Project 3 - Subreddit Classifier
General Assembly DSI CC7 Project 3
<br>Anne Kerr - SF<br>
Due April 5, 2019

This notebook introduces the project, outlines the problem statement, provides an overview of the data, and concludes with an Executive Summary.
## Introduction
The challenge for this project is to build a model that compares the text of posts from two or more subreddits and correctly classifies each post. Data is not provided, so another component of this project is collecting the data from reddit.
## Problem Statement
**Business context:** Let's assume the folks at reddit want to improve their ability to perform analysis of text posted to their site. Social media and tech companies are increasingly being called upon to be responsible for the content on their site. E.g., they may be asked to prevent hate speech that spreads violence, identify threats, etc. They are considering forming a team of data scientists to work on this problem, and as a proof-of-concept, have asked the students of DSI 7 to demonstrate the value of machine learning applied to natural language processing. 
<br><br>In response to that challenge, this project specifically attempts to answer the question:<br>
***Given that a post is from one of these four subreddits, correctly classify whether or not it came from r/travel.***
-  r/travel     <=== Target
-  r/Fitness
-  r/wine
-  r/gardening


## Overview of the Data 

A `requests.get` call against the reddit API returns a set of data elements for 25 posts at a time. Calling it in a loop gathers as many posts as desired for a particular URL. (The URL indicates the subreddit of interest.) Many data elements are returned per post. For this project I need only the text of the post itself and the name of the subreddit. Because gathering data is time consuming, I also saved a few other key fields that might be of interest. I did not end up using them for the project, but they remain in the saved dataset for potential future use. The fields I saved are defined in the table below.


|Data Element|Description|
|-------|----------|
| subreddit | Name of the subreddit, e.g., 'travel', 'fitness', 'gardening', 'wine' |
| id | Unique identifier of the post |
| selftext | The text of the post. Not all posts have text; some are only images or videos. For this project only posts with text were collected |
| title | Post Title |
| author | Reddit ID of author |
| created | Date the post was created |
| ups | Number of up votes the post has received |
| downs | Number of down votes the post has received |


Refer to the API documentation for a complete description and information about the other data available. __[reddit API](https://www.reddit.com/dev/api/)__
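The paging loop described above can be sketched roughly as follows. This is a minimal illustration, not the project's `RedditPostReaderClass`: the function names, the `.json` listing URL, the User-agent string, and the `pause` parameter are all assumptions made for the example. It relies on the listing response carrying its posts under `data.children` and a paging token under `data.after`.

```python
import time
from typing import Callable, Dict, List, Optional

# Fields kept from each post (the data elements listed in the table above).
FIELDS = ["subreddit", "id", "selftext", "title", "author", "created", "ups", "downs"]


def fetch_page(subreddit: str, after: Optional[str] = None) -> dict:
    """Fetch one page of a subreddit listing as parsed JSON (needs network access)."""
    import requests  # third-party; imported here so the paging logic runs without it

    response = requests.get(
        f"https://www.reddit.com/r/{subreddit}.json",  # assumed listing URL
        params={"after": after},  # requests omits parameters whose value is None
        headers={"User-agent": "subreddit-classifier-demo"},  # hypothetical agent name
        timeout=30,
    )
    response.raise_for_status()
    return response.json()


def collect_posts(
    subreddit: str,
    max_posts: int,
    get_page: Callable[[str, Optional[str]], dict] = fetch_page,
    pause: float = 0.0,
) -> List[Dict]:
    """Page through a subreddit, keeping only posts that have text."""
    posts: List[Dict] = []
    after = None
    while len(posts) < max_posts:
        listing = get_page(subreddit, after)["data"]
        for child in listing["children"]:
            post = child["data"]
            if post.get("selftext"):  # skip image-only or video-only posts
                posts.append({field: post.get(field) for field in FIELDS})
        after = listing.get("after")  # token that identifies the next page
        if after is None:  # no more pages available
            break
        time.sleep(pause)  # optional delay between requests
    return posts[:max_posts]
```

A call such as `collect_posts("travel", 1000, pause=1.0)` would then gather up to 1000 text posts, one page at a time. Passing a different `get_page` function makes it possible to exercise the loop without network access.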


## This project contains the following notebooks

1. Overview: This notebook. Introduces the project and gives an overview of the approach, and concludes with an executive summary.
2. Get-Data: Code to instantiate the RedditPostReaderClass (contained in reddit_posts.py, in the code folder) and use it to gather data for four separate subreddits, drop duplicates and posts with empty text, and combine the data into a single dataframe for saving to csv.
3. EDA: Here we read the combined data saved in step 2 and examine the `selftext` data we plan to feed to the models. Some cleanup is necessary to eliminate unwanted punctuation and formatting characters. Once that is done we split the dataframe into several sets to examine the word frequency in each. We do this to get an overall sense of the nature of the language in each thread. Which words do the sets have in common? Which words stand out in each dataset? The words that stand out are likely to have predictive value. Once we have a sense of that, we save the cleaned data into a new file for use in the modeling process.
4. Modeling and Conclusion: Here is where the model building and evaluation happens. We use GridSearch techniques to score a variety of combinations of tokenizers, estimators, and parameters to determine which combination provides the best score. Once we select a model we fine-tune the parameters a bit more to get the best model possible, then evaluate it and use it to make predictions. Finally we summarize what we learned and outline some things to try next.
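As a rough illustration of the combine step in the Get-Data notebook, the sketch below concatenates per-subreddit dataframes, drops duplicates, and removes posts with empty text using pandas. The function name, the choice to deduplicate on `id`, the toy rows, and the output file name are assumptions for the example; the notebook itself may do this differently.

```python
import pandas as pd


def combine_posts(frames):
    """Stack per-subreddit frames, drop duplicate posts, and drop empty text."""
    combined = pd.concat(frames, ignore_index=True)
    # Assumption: a repeated post is recognized by its unique id.
    combined = combined.drop_duplicates(subset="id")
    # Treat missing or whitespace-only selftext as empty.
    has_text = combined["selftext"].fillna("").str.strip() != ""
    return combined[has_text].reset_index(drop=True)


# Toy rows standing in for the data gathered from reddit.
travel = pd.DataFrame({
    "subreddit": ["travel", "travel", "travel"],
    "id": ["a", "a", "b"],
    "selftext": ["Two weeks in Peru", "Two weeks in Peru", ""],
})
wine = pd.DataFrame({
    "subreddit": ["wine", "wine"],
    "id": ["c", "d"],
    "selftext": ["Best Pinot under $30?", None],
})

combined = combine_posts([travel, wine])
print(combined)
# combined.to_csv("combined_posts.csv", index=False)  # hypothetical file name
```

Only posts `a` and `c` survive here: the second copy of `a` is a duplicate, and `b` and `d` have no text.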



## Project Directory Structure
| Folder  | Contents                                                              |
|---------|-----------------------------------------------------------------------|
| root    | README project overview for GitHub                                    |
| \code   | Jupyter notebooks and python code                                     |
| \data   | Dataset saved from reddit and interim files used for downstream steps |
| \images | Graphs and charts in jpg format, PDF of final presentation            |



## Overview of approach

I began by reading data from four different subreddits related to subjects that interest me. I wasn't sure at the start what I would find, which of the four I would use for the model, or which would be the target. What I found was that even though I attempted to capture 1000 unique, non-empty posts for each, after dropping duplicates I was left with far fewer. Travel had the most, with just over 800 posts. The other three combined had just over 1000, so I decided to make r/travel my target and the other three collectively the alternative, i.e., 'not travel.'

Before modeling I looked at the most common words in each set to get a sense of the language within. Did they look likely to be predictive? If not I might have revised my approach, but they did seem predictive, and I concluded I had enough data to move forward. Some cleaning and pre-processing was necessary prior to modeling. The data had a lot of punctuation and formatting characters, so I used a regular expression tokenizer to create a new data element containing only the word tokens. It was this element that I passed to the modeling process.
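The cleaning and word-frequency steps described above can be sketched with the standard library alone. This is a simplified stand-in: the pattern `[a-z]+`, the function names, and the sample posts are assumptions for illustration, and the project may have used a tokenizer class from an NLP library instead.

```python
import re
from collections import Counter

# Assumed pattern: keep runs of letters, dropping punctuation, digits, and markup.
TOKEN_PATTERN = re.compile(r"[a-z]+")


def tokenize(text):
    """Lowercase the text and return only its word tokens."""
    return TOKEN_PATTERN.findall(text.lower())


def clean_text(text):
    """Rebuild a post as a space-separated string of word tokens."""
    return " ".join(tokenize(text))


def top_words(texts, n=10):
    """Return the n most frequent tokens across a collection of posts."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    return counts.most_common(n)


# Made-up posts, used only to show the shape of the output.
sample_posts = [
    "Flight to **Rome**, then Rome -> Paris!",
    "Is a rail pass worth it for Rome?",
]
print(clean_text(sample_posts[0]))
print(top_words(sample_posts, n=3))
```

Running `top_words` separately on the posts from each subreddit gives the per-set word lists that can be compared for words in common and words that stand out.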

I started the modeling process by defining a series of tests using pipelines and gridsearch to try a variety of machine learning techniques, looking for the best model. Once that was found, I used that model to make predictions and analyze the results. 
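A minimal sketch of this kind of search, using scikit-learn's `Pipeline` and `GridSearchCV`, is shown below. The specific vectorizers, estimators, parameter values, and toy posts are assumptions chosen for illustration; they are not the combinations or data actually evaluated in the Modeling notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy data: 1 = r/travel, 0 = not travel (made-up posts for illustration).
texts = [
    "cheap flights to europe in june",
    "hostel or hotel for a week in tokyo",
    "visa advice for a trip to vietnam",
    "best backpack for a flight and train trip",
    "two week itinerary for a trip to peru",
    "airport layover tips for a long flight",
    "how deep should i squat with a barbell",
    "which red wine pairs with lamb",
    "my tomato seedlings are turning yellow",
    "beginner workout routine for the gym",
    "is this bottle of wine worth aging",
    "when to plant seeds in raised garden beds",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# The steps here are placeholders; the grid below swaps in the alternatives.
pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])

# Each dictionary is one family of combinations to score.
param_grid = [
    {
        "vectorizer": [CountVectorizer(), TfidfVectorizer()],
        "vectorizer__ngram_range": [(1, 1), (1, 2)],
        "classifier": [MultinomialNB()],
        "classifier__alpha": [0.5, 1.0],
    },
    {
        "vectorizer": [CountVectorizer(), TfidfVectorizer()],
        "classifier": [LogisticRegression(max_iter=1000)],
        "classifier__C": [0.1, 1.0],
    },
]

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_)  # the winning combination
print(search.best_score_)   # its mean cross-validated accuracy
```

After fitting, `search.best_estimator_` holds the winning pipeline refit on all of the training data, so `search.predict(new_texts)` can be used directly to make predictions.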

Through the process I was able to produce a model that performed remarkably well. This is perhaps not too surprising, given that the four chosen threads use very distinct words that make classification relatively easy. As a proof-of-concept, it illustrates the value of using machine learning for analyzing and classifying text. The final recommendation is that reddit create a data science team to continue this work.


## Executive Summary

**Reddit, take a lesson from Facebook. Be responsible for the content on your site!!**

This project is a proof-of-concept to illustrate the power of natural language processing and machine learning in identifying and classifying posts based on analysis of the words they contain. Reddit doesn't want to be caught unprepared and is considering building out a team of data scientists to develop machine learning models that will help them understand and manage content on their site. Our task was to show that we could build a model that would identify, with 90% accuracy or better, whether or not a post came from a particular subreddit. We chose to use data from four different subreddits and build a model to identify the posts in the dataset that were from r/travel.

As you can see by following the notebooks contained within this project, we were able to meet the objective by training a model that predicts with 96% accuracy which posts belong to r/travel. We also examined the small number of misclassified posts to better understand why they were misclassified. This knowledge can be incorporated into future models if Reddit chooses to go forward with hiring the team. It was easy, as a human, to understand why the model had trouble with certain posts. A good example is a post in r/wine that talked about traveling to Sonoma for wine tasting. Clearly the language was of a travel nature. This demonstrates the value of identifying key words that can classify the nature of posts.
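The evaluation described above, scoring accuracy and pulling out the misclassified posts for review, can be sketched as follows. The function name, the posts, and the predicted labels are made up for illustration; the predictions are hard-coded rather than produced by the project's model.

```python
def evaluate(texts, y_true, y_pred):
    """Return the accuracy and the posts whose predicted label was wrong."""
    misclassified = [
        (text, actual, predicted)
        for text, actual, predicted in zip(texts, y_true, y_pred)
        if actual != predicted
    ]
    accuracy = (len(y_true) - len(misclassified)) / len(y_true)
    return accuracy, misclassified


# 1 = r/travel, 0 = not travel. Texts and predictions are illustrative only.
texts = [
    "Two week itinerary for Japan, feedback welcome",
    "Driving up to Sonoma for a weekend of wine tasting, any hotel tips?",
    "Why are my tomato leaves curling?",
    "Stalled on bench press for a month",
]
y_true = [1, 0, 0, 0]
y_pred = [1, 1, 0, 0]  # the r/wine post reads like travel and is misclassified

accuracy, misclassified = evaluate(texts, y_true, y_pred)
print(f"accuracy: {accuracy:.2f}")
for text, actual, predicted in misclassified:
    print(f"actual={actual} predicted={predicted}: {text}")
```

Reading the misclassified posts one by one, as in the Sonoma case, is what shows which words are pulling a post toward the wrong class.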

This can be extrapolated to more pressing topics, for instance being able to identify potentially threatening or violent behavior. The lines of privacy and responsibility are blurring and it remains to be seen what legal and/or moral obligation social media providers will have in the future. 

Here we only begin to explore what can be done with natural language processing and machine learning, but we feel it is a compelling start and hope that it is enough to convince Reddit to take the plunge and invest in a team of data scientists. We strongly recommend they proceed.

