## Project 3:

### Scenario

You're fresh out of your Data Science bootcamp and looking to break through in the world of freelance data journalism. Nate Silver and co. at FiveThirtyEight have agreed to hear your pitch for a story in two weeks!

Your piece is going to be on how to create a Reddit post that will get the most engagement from Reddit users. Because this is FiveThirtyEight, you're going to have to get data and analyze it in order to make a compelling narrative.

### Problem Statement: What characteristics of a post on Reddit are most predictive of the overall interaction on a thread (as measured by number of comments)?

Once you've got the data, you will build a classification model that, using the text and any other relevant features, predicts whether or not a given Reddit post will have above or below the _median_ number of comments.
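
One way to frame that target, as a minimal sketch: it assumes a DataFrame with a `comments` column, and both the toy values and the `above_median` column name are illustrative assumptions.

```python
import pandas as pd

# Toy data standing in for scraped posts (values are illustrative only).
posts = pd.DataFrame({"comments": [4, 11, 13, 86, 112, 871]})

# The median splits the posts into two roughly equal halves.
median_comments = posts["comments"].median()

# 1 = strictly above the median, 0 = at or below it.
posts["above_median"] = (posts["comments"] > median_comments).astype(int)

print(median_comments)
print(posts)
```

Posts that sit exactly on the median fall into the 0 class here. With many ties the two classes may not be perfectly balanced, so it is worth checking the class counts.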

### Requirements

- **Create and compare two or more models**. One of these must be a random forest; the other can be a classifier of your choosing, such as logistic regression, KNN, or SVM.
- Jupyter Notebook(s) with your analysis for a peer audience of data scientists
- An executive summary of your results
- An 8- to 10-minute presentation outlining your process and findings for a semi-technical audience. We say 'semi-technical' because FiveThirtyEight wants to see how you plan to explain your findings in your article, and their readers are likely familiar with and interested in data and statistics but are not experts. If you'd like to talk about how your model works, you can, but explain what it does at a high level.
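
As a rough sketch of the model-comparison requirement (the first item above), the snippet below scores a random forest and a logistic regression with 5-fold cross-validation. The synthetic data from `make_classification` stands in for your real features and target, and the hyperparameters are arbitrary assumptions rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real feature matrix and above/below-median target.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)

models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

# Mean 5-fold cross-validated accuracy for each candidate model.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
```

Scoring every candidate on the same folds with the same metric keeps the comparison fair.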

---

### Necessary Deliverables / Submission

- Code and executive summary must be in a clearly commented Jupyter Notebook
- You must submit your slide deck
- Materials must be submitted by 9 a.m. Wednesday, September 7th

---

### The Data Science Process

**Problem Statement**
- [] Is it clear what the goal of the project is?
- [] What type of model will be developed?
- [] How will success be evaluated?
- [] Is the scope of the project appropriate?
- [] Is it clear who cares about this or why this is important to investigate?
- [] Does the student consider the audience and the primary and secondary stakeholders?

**Data Collection**
- [] Was enough data gathered to generate a significant result (at least 10,000 posts)?
- [] Was data collected that was useful and relevant to the project?
- [] Was data collection and storage optimized through custom functions, pipelines, and/or automation?
- [] Was thought given to the server receiving the requests such as considering number of requests per second?
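
The last two points (automation and consideration for the server) could be sketched as a small collection loop that pauses between requests. Everything here is an assumption for illustration: `collect_posts`, the `fetch_page` callable and its `(posts, next_token)` return shape, and the delay value. A real `fetch_page` would wrap an HTTP request; the fake one below only lets the sketch run offline.

```python
import time


def collect_posts(fetch_page, max_pages, delay_seconds=2.0):
    """Collect posts page by page, pausing between requests.

    `fetch_page` is a hypothetical callable: it takes a pagination token
    (None for the first page) and returns (list_of_posts, next_token).
    """
    posts, token = [], None
    for page in range(max_pages):
        batch, token = fetch_page(token)
        posts.extend(batch)
        if token is None:  # no more pages to request
            break
        if page < max_pages - 1:
            time.sleep(delay_seconds)  # be polite to the server
    return posts


# Fake fetcher so the sketch runs without network access.
def fake_fetch(token):
    start = 0 if token is None else token
    batch = [{"title": f"post {i}", "comments": i} for i in range(start, start + 3)]
    next_token = start + 3 if start + 3 < 9 else None
    return batch, next_token


collected = collect_posts(fake_fetch, max_pages=5, delay_seconds=0.0)
print(len(collected))
```

Passing the fetcher in as an argument keeps the pacing logic separate from the request code, which makes the loop easy to test.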

**Data Cleaning and EDA**
- [] Are missing values imputed/handled appropriately?
- [] Are distributions examined and described?
- [] Are outliers identified and addressed?
- [] Are appropriate summary statistics provided?
- [] Are steps taken during data cleaning and EDA framed appropriately?
- [] Does the student address whether or not they are likely to be able to answer their problem statement with the provided data given what they've discovered during EDA?
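
A few of these checks can be sketched in pandas. The toy frame below borrows the `title` and `comments` column names from the project dataset, but the missing value is inserted for illustration, and the 1.5 × IQR rule is only one common (assumed) way to flag outliers.

```python
import pandas as pd

# Toy frame; the None title is an illustrative missing value.
eda_df = pd.DataFrame({
    "title": ["He did it!", None, "Loving the water.", "I think there will be a problem"],
    "comments": [871, 4, 112, 11],
})

# Count missing values per column.
missing_counts = eda_df.isna().sum()

# Summary statistics for the numeric column.
summary = eda_df["comments"].describe()

# Flag outliers with the 1.5 * IQR rule.
q1, q3 = eda_df["comments"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (eda_df["comments"] < q1 - 1.5 * iqr) | (eda_df["comments"] > q3 + 1.5 * iqr)

print(missing_counts)
print(summary)
print(eda_df[outlier_mask])
```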

**Preprocessing and Modeling**
- [] Is text data successfully converted to a matrix representation?
- [] Are methods such as stop words, stemming, and lemmatization explored?
- [] Does the student properly split and/or sample the data for validation/training purposes?
- [] Does the student test and evaluate a variety of models to identify a production algorithm (**AT MINIMUM:** two models)?
- [] Does the student defend their choice of production model relevant to the data at hand and the problem?
- [] Does the student explain how the model works and evaluate its performance successes/downfalls?
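
The first three items can be illustrated with a small sketch using scikit-learn's `CountVectorizer`. The titles are taken from the sample data in this project; the labels, the 60/40 split, and the choice of the built-in English stop-word list are assumptions. Stemming and lemmatization are not shown.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

titles = [
    "Did they really think he'd pay out?",
    "He did it!",
    "Loving the water.",
    "guess they're gone forever now",
    "I think there will be a problem",
]
labels = [1, 1, 0, 0, 0]  # illustrative above/below-median labels

# Split first so the vocabulary is learned from the training titles only.
titles_train, titles_test, y_train, y_test = train_test_split(
    titles, labels, test_size=0.4, random_state=42
)

# Bag-of-words counts with common English stop words removed.
vectorizer = CountVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(titles_train)
X_test = vectorizer.transform(titles_test)

print(X_train.shape, X_test.shape)
```

Fitting the vectorizer on the training split alone avoids leaking test-set vocabulary into the model.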

**Evaluation and Conceptual Understanding**
- [] Does the student accurately identify and explain the baseline score?
- [] Does the student select and use metrics relevant to the problem objective?
- [] Does the student interpret the results of their model for purposes of inference?
- [] Is domain knowledge demonstrated when interpreting results?
- [] Does the student provide appropriate interpretation with regards to descriptive and inferential statistics?
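
For the baseline item, a minimal sketch: the baseline accuracy of a classifier is the share of the most common class, which is what you would score by always predicting that class. The target values below are made up for illustration.

```python
import pandas as pd

# Illustrative binary target: 1 = above the median number of comments.
target = pd.Series([1, 0, 0, 1, 0, 1, 0, 0])

# Share of each class in the data.
class_shares = target.value_counts(normalize=True)

# Baseline accuracy: always predict the most common class.
baseline_accuracy = class_shares.max()
majority_class = class_shares.idxmax()

print(majority_class, baseline_accuracy)
```

With a median split the baseline should sit near 0.5, so any model worth reporting needs to beat that clearly.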

**Conclusion and Recommendations**
- [] Does the student provide appropriate context to connect individual steps back to the overall project?
- [] Is it clear how the final recommendations were reached?
- [] Are the conclusions/recommendations clearly stated?
- [] Does the conclusion answer the original problem statement?
- [] Does the student address how findings of this research can be applied for the benefit of stakeholders?
- [] Are future steps to move the project forward identified?


### Organization and Professionalism

**Project Organization**
- [] Are modules imported correctly (using appropriate aliases)?
- [] Are data imported/saved using relative paths?
- [] Does the README provide a good executive summary of the project?
- [] Is markdown formatting used appropriately to structure notebooks?
- [] Is there an appropriate amount of comments to support the code?
- [] Are files & directories organized correctly?
- [] Are there unnecessary files included?
- [] Do files and directories have well-structured, appropriate, consistent names?

**Visualizations**
- [] Are sufficient visualizations provided?
- [] Do plots accurately demonstrate valid relationships?
- [] Are plots labeled properly?
- [] Are plots interpreted appropriately?
- [] Are plots formatted and scaled appropriately for inclusion in a notebook-based technical report?

**Python Syntax and Control Flow**
- [] Is care taken to write human readable code?
- [] Is the code syntactically correct (no runtime errors)?
- [] Does the code generate desired results (logically correct)?
- [] Does the code follow general best practices and style guidelines?
- [] Are Pandas functions used appropriately?
- [] Are `sklearn` and `NLTK` methods used appropriately?

**Presentation**
- [] Is the problem statement clearly presented?
- [] Does a strong narrative run through the presentation building toward a final conclusion?
- [] Are the conclusions/recommendations clearly stated?
- [] Is the level of technicality appropriate for the intended audience?
- [] Is the student substantially over or under time?
- [] Does the student appropriately pace their presentation?
- [] Does the student deliver their message with clarity and volume?
- [] Are appropriate visualizations generated for the intended audience?
- [] Are visualizations necessary and useful for supporting conclusions/explaining findings?

**Part 1** of the project focuses on **Data wrangling/gathering/acquisition**. This is a very important skill, as not all the data you need will be in clean CSVs or a single SQL table. There is a good chance that wherever you land you will have to gather data from unstructured or semi-structured sources: by requesting it from an API when possible, but often by scraping it because the source has no API (or the API is terribly documented).

**Part 2** of the project focuses on **Natural Language Processing** and converting standard text data (like Titles and Comments) into a format that allows us to analyze it and use it in modeling.

**Part 3** of the project focuses on **Classification Modeling**. Given that Project 2 was a regression-focused problem, we needed to give you a classification-focused problem to practice the various models, means of assessment, and preprocessing associated with classification.

## Preparing Data For Modeling

In [1]:
import pandas as pd
import numpy as np

In [7]:
df = pd.read_csv('data/reddit_complete.csv')

In [8]:
df

| Unnamed: 0 | sub | title | score | comments | created |
|---|---|---|---|---|---|
| 0 | LeopardsAteMyFace | Did they really think he'd pay out? | 20509 | 1031 | 2010-09-10 06:51:25 |
| 1 | news | China drought causes Yangtze to dry up, sparki... | 19660 | 1546 | 2010-09-10 06:51:25 |
| 2 | MadeMeSmile | He did it! | 86623 | 871 | 2010-09-10 06:51:25 |
| 3 | aww | Loving the water. | 13717 | 112 | 2010-09-10 06:51:25 |
| 4 | politics | Biden vows to crack down on colleges 'jacking ... | 44057 | 2259 | 2010-09-10 06:51:25 |
| ... | ... | ... | ... | ... | ... |
| 16790 | Superstonk | Guess it's time to finally feed the bot. +1171... | 1800 | 13 | 2010-09-10 06:51:25 |
| 16791 | memes | guess they're gone forever now | 235 | 4 | 2010-09-10 06:51:25 |
| 16792 | Superstonk | JAN ‘21 apes attempting to purchase GameStop O... | 3396 | 86 | 2010-09-10 06:51:25 |
| 16793 | memes | I think there will be a problem | 100 | 11 | 2010-09-10 06:51:25 |
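
An `Unnamed: 0` column usually appears when a CSV was saved with its index and then read back without `index_col`. As a hedged sketch of one possible cleanup step, the snippet below reads an in-memory stand-in for the file (four rows copied from the output above) with `index_col=0` and parses `created` as a datetime. The blank first header field is an assumption about how the file was written, and `clean_df` is an illustrative name.

```python
import io

import pandas as pd

# In-memory stand-in for 'data/reddit_complete.csv', using rows shown above.
csv_text = """,sub,title,score,comments,created
2,MadeMeSmile,He did it!,86623,871,2010-09-10 06:51:25
3,aww,Loving the water.,13717,112,2010-09-10 06:51:25
16791,memes,guess they're gone forever now,235,4,2010-09-10 06:51:25
16793,memes,I think there will be a problem,100,11,2010-09-10 06:51:25
"""

# index_col=0 reuses the saved index instead of creating an 'Unnamed: 0' column;
# parse_dates converts 'created' from strings to datetimes.
clean_df = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=["created"])

print(clean_df.dtypes)
```

With the real file, the same arguments would be passed to `pd.read_csv('data/reddit_complete.csv', ...)`.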
