## Rubric
Your local instructor will evaluate your project primarily on the following criteria. Consider and follow as many of the recommendations below as you can **while** working through your project.

For Project 3 the evaluation categories are as follows:<br>
**The Data Science Process**
- Problem Statement
- Data Collection
- Data Cleaning & EDA
- Preprocessing & Modeling
- Evaluation and Conceptual Understanding
- Conclusion and Recommendations

**Organization and Professionalism**
- Organization
- Visualizations
- Python Syntax and Control Flow
- Presentation

**Scores will be out of 30 points based on the 10 categories in the rubric.** <br>
*3 points per section*<br>

| Score | Interpretation |
| --- | --- |
| **0** | *Project fails to meet the minimum requirements for this item.* |
| **1** | *Project meets the minimum requirements for this item, but falls significantly short of portfolio-ready expectations.* |
| **2** | *Project exceeds the minimum requirements for this item, but falls short of portfolio-ready expectations.* |
| **3** | *Project meets or exceeds portfolio-ready expectations; demonstrates a thorough understanding of every outlined consideration.* |


### The Data Science Process

**Problem Statement**
- Is it clear what the goal of the project is?
- What type of model will be developed?
- How will success be evaluated?
- Is the scope of the project appropriate?
- Is it clear who cares about this or why this is important to investigate?
- Does the student consider the audience and the primary and secondary stakeholders?

**Data Collection**
- Was enough data gathered to generate a significant result?
- Was data collected that was useful and relevant to the project?
- Was data collection and storage optimized through custom functions, pipelines, and/or automation?
- Was thought given to the server receiving the requests such as considering number of requests per second?

**Data Cleaning and EDA**
- Are missing values imputed/handled appropriately?
- Are distributions examined and described?
- Are outliers identified and addressed?
- Are appropriate summary statistics provided?
- Are steps taken during data cleaning and EDA framed appropriately?
- Does the student address whether or not they are likely to be able to answer their problem statement with the provided data given what they've discovered during EDA?

**Preprocessing and Modeling**
- Is text data successfully converted to a matrix representation?
- Are methods such as stop words, stemming, and lemmatization explored?
- Does the student properly split and/or sample the data for validation/training purposes?
- Does the student test and evaluate a variety of models to identify a production algorithm (**AT MINIMUM:** Bayes and one other model; Mahdi suggests a logistic regression model)?
- Does the student defend their choice of production model relevant to the data at hand and the problem?
- Does the student explain how the model works and evaluate its performance successes/downfalls?

**Evaluation and Conceptual Understanding**
- Does the student accurately identify and explain the baseline score?
- Does the student select and use metrics relevant to the problem objective?
- Does the student interpret the results of their model for purposes of inference?
- Is domain knowledge demonstrated when interpreting results?
- Does the student provide appropriate interpretation with regards to descriptive and inferential statistics?

**Conclusion and Recommendations**
- Does the student provide appropriate context to connect individual steps back to the overall project?
- Is it clear how the final recommendations were reached?
- Are the conclusions/recommendations clearly stated?
- Does the conclusion answer the original problem statement?
- Does the student address how findings of this research can be applied for the benefit of stakeholders?
- Are future steps to move the project forward identified?


### Organization and Professionalism

**Project Organization**
- Are modules imported correctly (using appropriate aliases)?
- Are data imported/saved using relative paths?
- Does the README provide a good executive summary of the project?
- Is markdown formatting used appropriately to structure notebooks?
- Is there an appropriate amount of comments to support the code?
- Are files & directories organized correctly?
- Are there unnecessary files included?
- Do files and directories have well-structured, appropriate, consistent names?

**Visualizations**
- Are sufficient visualizations provided?
- Do plots accurately demonstrate valid relationships?
- Are plots labeled properly?
- Are plots interpreted appropriately?
- Are plots formatted and scaled appropriately for inclusion in a notebook-based technical report?

**Python Syntax and Control Flow**
- Is care taken to write human readable code?
- Is the code syntactically correct (no runtime errors)?
- Does the code generate desired results (logically correct)?
- Does the code follow general best practices and style guidelines?
- Are Pandas functions used appropriately?
- Are `sklearn` and `NLTK` methods used appropriately?

**Presentation**
- Is the problem statement clearly presented?
- Does a strong narrative run through the presentation building toward a final conclusion?
- Are the conclusions/recommendations clearly stated?
- Is the level of technicality appropriate for the intended audience?
- Is the student substantially over or under time?
- Does the student appropriately pace their presentation?
- Does the student deliver their message with clarity and volume?
- Are appropriate visualizations generated for the intended audience?
- Are visualizations necessary and useful for supporting conclusions/explaining findings?


---

### Why did we choose this project for you?
This project covers three of the biggest concepts we cover in the class: Classification Modeling, Natural Language Processing and Data Wrangling/Acquisition.

Part 1 of the project focuses on **Data wrangling/gathering/acquisition**. This is a very important skill, as not all the data you need will be in clean CSVs or a single SQL table. There is a good chance that wherever you land, you will have to gather data from unstructured or semi-structured sources: requesting it from an API when possible, but often scraping it because the source has no API (or the API is terribly documented).

Part 2 of the project focuses on **Natural Language Processing** and converting standard text data (like Titles and Comments) into a format that allows us to analyze it and use it in modeling.

Part 3 of the project focuses on **Classification Modeling**. Given that Project 2 was a regression-focused problem, we needed to give you a classification-focused problem to practice the various models, means of assessment, and preprocessing associated with classification.


Notes: Even if I end up removing it from the project, list the appropriate steps in markdown. Always show the baseline model! The next model needs to be a classification model. We need the train score, test score, and cross-validation score. The model selection section showcases the scores of the models, but we do not need cross-validation in that section.
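The scores this note asks for (baseline, train, test, and cross-validation) could be gathered along the following lines. This is a minimal sketch on a made-up toy corpus: the `texts` and `labels` values, the choice of `MultinomialNB`, and all variable names are illustrative assumptions, not the project's data or final model.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for post titles (not the scraped data).
texts = pd.Series([
    "janet is not a robot", "michael and the bad place", "eleanor loves shrimp",
    "chidi cannot decide", "the good place finale theory", "jason and the jaguars",
    "leslie knope for president", "ron swanson eats bacon", "treat yo self day",
    "pawnee harvest festival", "april and andy wedding", "lil sebastian tribute",
])
labels = pd.Series([1] * 6 + [0] * 6)  # 1 = TheGoodPlace, 0 = PandR

# Baseline: the accuracy obtained by always predicting the majority class.
baseline = labels.value_counts(normalize=True).max()

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# Vectorize and classify in one pipeline so each CV fold refits the vocabulary.
pipe = Pipeline([("cvec", CountVectorizer()), ("nb", MultinomialNB())])
pipe.fit(X_train, y_train)

train_score = pipe.score(X_train, y_train)
test_score = pipe.score(X_test, y_test)
cv_score = cross_val_score(pipe, X_train, y_train, cv=3).mean()

print(f"baseline={baseline:.3f} train={train_score:.3f} "
      f"test={test_score:.3f} cv={cv_score:.3f}")
```

Any model worth keeping should beat the baseline on the test and cross-validation scores, and a large gap between train and test scores would point to overfitting.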

Notes: Show the confusion matrix, and show it in a graph. Mahdi was able to show the misclassified posts, which is actually what I would like to find.
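One way to get both the confusion matrix and the misclassified posts is sketched below. The `results` table is hypothetical: its titles and predictions are invented for illustration, and in the project the `predicted` column would come from a fitted model.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical results table: actual subreddit vs. a model's predictions.
results = pd.DataFrame({
    "title": ["Janet theory", "Ron's best quotes", "Season finale thoughts",
              "Favorite character?", "Leslie and Ann", "Chidi's dilemma"],
    "actual": ["TheGoodPlace", "PandR", "TheGoodPlace",
               "PandR", "PandR", "TheGoodPlace"],
    "predicted": ["TheGoodPlace", "PandR", "PandR",
                  "TheGoodPlace", "PandR", "TheGoodPlace"],
})

class_names = ["PandR", "TheGoodPlace"]

# Rows are actual classes, columns are predicted classes.
cm = pd.DataFrame(
    confusion_matrix(results["actual"], results["predicted"], labels=class_names),
    index=[f"actual {name}" for name in class_names],
    columns=[f"predicted {name}" for name in class_names],
)
print(cm)

# The misclassified posts are the rows where the two columns disagree.
misclassified = results[results["actual"] != results["predicted"]]
print(misclassified[["title", "actual", "predicted"]])
```

For the graph, a heatmap such as `sns.heatmap(cm, annot=True, fmt="d")` is one option.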

# Problem Statement

I am a comedy writer (true story). I've run out of original ideas, so to find a good idea to pitch to Mike Schur, I want to find the word choices and themes that fans of both The Good Place and Parks & Recreation have in common. While I want my KNN and Bayes models to have high accuracy, I also want to look at the posts my models predict incorrectly; that overlap will help me work out what type of show I can write that Mike Schur will want to develop further.

# Executive Summary

# Data Collection

In [306]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import datetime as dt
import time
import requests

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

import regex as re


In [236]:
# Mahdi helped us build this code through his intro lesson
def query_pushshift(subreddit, kind = 'submission', day_window = 365, n = 14):
    SUBFIELDS = ['title', 'selftext', 'subreddit', 'created_utc',
                 'author', 'num_comments', 'score', 'is_self', 'is_original_content',
                'over_18','is_crosspostable']
    
    # establish base url and stem
    BASE_URL = f"https://api.pushshift.io/reddit/search/{kind}" # also known as the "API endpoint" 
    stem = f"{BASE_URL}?subreddit={subreddit}&size=500" # always pulling max of 500
    
    # instantiate empty list for temp storage
    posts = []
    
    # query one time window per iteration, pausing 2 seconds between requests to go easy on the server
    for i in range(1, n + 1):
        URL = "{}&after={}d".format(stem, day_window * i)
        print("Querying from: " + URL)
        response = requests.get(URL)
        assert response.status_code == 200
        mine = response.json()['data']
        df = pd.DataFrame.from_dict(mine)
        posts.append(df)
        time.sleep(2)
    
    # pd.concat storage list
    full = pd.concat(posts, sort=False)
    
    # if submission
    if kind == "submission":
        # select desired columns
        full = full[SUBFIELDS]
        # drop duplicates
        full.drop_duplicates(inplace = True)
        # select `is_self` == True
        full = full.loc[full['is_self'] == True]

    # create `timestamp` column
    full['timestamp'] = full["created_utc"].map(dt.date.fromtimestamp)
    
    print("Query Complete!")    
    return full 

In [237]:
good_place_df = query_pushshift('TheGoodPlace')

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TheGoodPlace&size=500&after=365d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TheGoodPlace&size=500&after=730d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TheGoodPlace&size=500&after=1095d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TheGoodPlace&size=500&after=1460d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TheGoodPlace&size=500&after=1825d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TheGoodPlace&size=500&after=2190d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TheGoodPlace&size=500&after=2555d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TheGoodPlace&size=500&after=2920d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=TheGoodPlace&size=500&after=3285d
Querying from: https:

In [238]:
good_place_df.shape

(1026, 12)

In [239]:
parks_and_rec_df = query_pushshift('PandR')

Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=PandR&size=500&after=365d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=PandR&size=500&after=730d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=PandR&size=500&after=1095d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=PandR&size=500&after=1460d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=PandR&size=500&after=1825d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=PandR&size=500&after=2190d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=PandR&size=500&after=2555d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=PandR&size=500&after=2920d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=PandR&size=500&after=3285d
Querying from: https://api.pushshift.io/reddit/search/submission?subreddit=PandR&siz

In [240]:
parks_and_rec_df.shape

(1117, 12)

In [284]:
df = pd.concat([parks_and_rec_df, good_place_df], ignore_index = True)

In [253]:
df.head()

| | title | selftext | subreddit | created_utc | author | num_comments | score | is_self | is_original_content | over_18 | is_crosspostable | timestamp |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Hulu vs Netflix | So many differences! I'm watching Season 2 Ep ... | PandR | 1556148216 | DevoidSauce | 0 | 2 | True | False | False | True | 2019-04-24 |
| 1 | CMV: Onceb one has been faithfully all the way... | | PandR | 1556155364 | DariusMDeV | 0 | 1 | True | False | False | True | 2019-04-24 |
| 2 | CMV: Once one has been faithfully through the ... | | PandR | 1556155446 | DariusMDeV | 5 | 2 | True | False | False | True | 2019-04-24 |
| 3 | Small Thing I Noticed With “Sister City” | At the end of the episode when that Venezuela... | PandR | 1556159655 | Hejwiowknnmz | 2 | 21 | True | False | False | True | 2019-04-24 |
| 4 | I just realized that in the episode Article Tw... | | PandR | 1556191863 | UnoriginalName373 | 2 | 5 | True | False | False | True | 2019-04-25 |


In [254]:
df.tail()

| | title | selftext | subreddit | created_utc | author | num_comments | score | is_self | is_original_content | over_18 | is_crosspostable | timestamp |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2138 | Janet Cosplay | My wife is trying to put together a cosplay of... | TheGoodPlace | 1491420552 | arinlome | 3 | 13 | True | | False | | 2017-04-05 |
| 2139 | Knowing michael was evil didnt ruin his charac... | [removed] | TheGoodPlace | 1491589139 | ardyraindropsrd | 7 | 24 | True | | True | | 2017-04-07 |
| 2140 | [SPOILER] Janet Theory | Please do not read this if you haven't watched... | TheGoodPlace | 1491858194 | BrianFoxShow | 10 | 42 | True | | False | | 2017-04-10 |
| 2141 | Is Janet Allegorical? [spoiler] | Since my Janet is god theory didn't resonate -... | TheGoodPlace | 1491921533 | kdubstep | 5 | 3 | True | | False | | 2017-04-11 |
| 2142 | Having just finished the (amazing) first seaso... | That pizza is DEFINITELY gluten-free. | TheGoodPlace | 1492246895 | weblowinherseys | 4 | 40 | True | | False | | 2017-04-15 |


I checked the head and tail to confirm that the posts from both subreddits ended up in one DataFrame.
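A programmatic check can back up the visual inspection. The sketch below uses two small stand-in frames (`parks_demo` and `good_demo`, both hypothetical) in place of `parks_and_rec_df` and `good_place_df`.

```python
import pandas as pd

# Stand-in frames; in the notebook these would be parks_and_rec_df and good_place_df.
parks_demo = pd.DataFrame({"title": ["a", "b", "c"], "subreddit": ["PandR"] * 3})
good_demo = pd.DataFrame({"title": ["d", "e"], "subreddit": ["TheGoodPlace"] * 2})

combined = pd.concat([parks_demo, good_demo], ignore_index=True)

# The row count should equal the sum of the parts.
assert len(combined) == len(parks_demo) + len(good_demo)

# Each subreddit should contribute exactly its original number of rows.
counts = combined["subreddit"].value_counts()
print(counts)
```

With the shapes reported earlier, the combined frame should have 1117 + 1026 = 2143 rows, which matches the count of 2143 that `df.describe()` reports later in the notebook.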

# Cleaning & EDA

In [244]:
df.isnull().sum()

title                     0
selftext                  0
subreddit                 0
created_utc               0
author                    0
num_comments              0
score                     0
is_self                   0
is_original_content    1509
over_18                   0
is_crosspostable       1174
timestamp                 0
dtype: int64

We have a lot of nulls in `is_original_content` and `is_crosspostable`. I suspect these are Boolean columns that can be dummied, so I'm going to look into them further.
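When looking into columns like these, it can help to tally the missing values alongside the real ones. A minimal sketch, using a hypothetical `demo` frame with made-up values rather than the scraped data:

```python
import numpy as np
import pandas as pd

# Hypothetical columns with the same kind of mix: booleans plus missing values.
demo = pd.DataFrame({
    "is_original_content": [False, np.nan, False, np.nan, np.nan],
    "is_crosspostable": [True, np.nan, False, True, np.nan],
})

for col in demo.columns:
    # dropna=False keeps NaN as its own row in the tally.
    print(demo[col].value_counts(dropna=False))
    print()

# Fraction of missing values per column.
null_share = demo.isnull().mean()
print(null_share)
```

By default `value_counts()` drops nulls, which is why a column can show only `False` even when most of its rows are missing.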

In [285]:
df['is_original_content'].value_counts()

False    634
Name: is_original_content, dtype: int64

No need to keep the original content column, since there are no True values.

In [286]:
df.drop(columns = ['is_original_content'], inplace = True)

In [287]:
df.dtypes

title               object
selftext            object
subreddit           object
created_utc          int64
author              object
num_comments         int64
score                int64
is_self               bool
over_18               bool
is_crosspostable    object
timestamp           object
dtype: object

In [288]:
df.describe()

| | created_utc | num_comments | score |
| --- | --- | --- | --- |
| count | 2143.0 | 2143.0 | 2143.0 |
| mean | 1475388000.0 | 10.414372 | 25.695287 |
| std | 68761950.0 | 33.966947 | 75.988653 |
| min | 1295516000.0 | 0.0 | 0.0 |
| 25% | 1434691000.0 | 2.0 | 2.0 |
| 50% | 1494897000.0 | 5.0 | 8.0 |
| 75% | 1525468000.0 | 11.0 | 20.0 |
| max | 1560892000.0 | 887.0 | 1310.0 |


In [291]:
df['is_crosspostable']

0        True
1        True
2        True
3        True
4        True
        ...  
2138    False
2139    False
2140    False
2141    False
2142    False
Name: is_crosspostable, Length: 2143, dtype: object

Since I will dummy `is_crosspostable` as True versus not True, I am going to assume the null values are False.

In [292]:
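# fill only the column that still has nulls; 'False' is converted to 0 in the next cell
df['is_crosspostable'] = df['is_crosspostable'].fillna('False')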

In [293]:
# convert the True/False flag columns to 1/0 integers
bool_to_int = {'False' : 0, 'True' : 1}
for col in ['over_18', 'is_self', 'is_crosspostable']:
    df[col] = df[col].astype(str).map(bool_to_int).astype(int)

In [294]:
df.describe()

| | created_utc | num_comments | score | is_self | over_18 | is_crosspostable |
| --- | --- | --- | --- | --- | --- | --- |
| count | 2143.0 | 2143.0 | 2143.0 | 2143.0 | 2143.0 | 2143.0 |
| mean | 1475388000.0 | 10.414372 | 25.695287 | 1.0 | 0.007 | 0.426972 |
| std | 68761950.0 | 33.966947 | 75.988653 | 0.0 | 0.083389 | 0.494754 |
| min | 1295516000.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 25% | 1434691000.0 | 2.0 | 2.0 | 1.0 | 0.0 | 0.0 |
| 50% | 1494897000.0 | 5.0 | 8.0 | 1.0 | 0.0 | 0.0 |
| 75% | 1525468000.0 | 11.0 | 20.0 | 1.0 | 0.0 | 1.0 |
| max | 1560892000.0 | 887.0 | 1310.0 | 1.0 | 1.0 | 1.0 |


All values of `is_self` are 1, so I will drop that column. `over_18` is very imbalanced, with only 0.7% of rows having a True value, so I will drop that column too.
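The same decision can be made programmatically by measuring how dominant the most common value is in each column. This is only a sketch: the `demo` frame is made up, and the 0.9 cut-off is an arbitrary threshold chosen for illustration.

```python
import pandas as pd

# Hypothetical flags mirroring the situation above: one constant column, one rare class.
demo = pd.DataFrame({
    "is_self": [1] * 20,
    "over_18": [0] * 19 + [1],
    "is_crosspostable": [1, 0] * 10,
})

# Share of rows taken by the most common value in each column.
top_share = demo.apply(lambda col: col.value_counts(normalize=True).max())
print(top_share)

# Flag columns where a single value dominates (0.9 is an arbitrary cut-off).
to_drop = top_share[top_share >= 0.9].index.tolist()
print(to_drop)
```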

In [295]:
df.drop(columns = ['is_self', 'over_18'], inplace = True)

In [267]:
df['is_crosspostable'].value_counts()

False    1228
True      915
Name: is_crosspostable, dtype: int64

# EDA

Plan: fit a CountVectorizer, sum the counts across all rows, sort, and take the top 10-15 words (showing the distribution by subreddit). From this list, we can create a custom list of subreddits.

In [298]:
cvec = CountVectorizer()

In [316]:
cv_title = cvec.fit_transform(df['title'])

In [317]:
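# the feature-name accessor is a method and must be called; scikit-learn >= 1.0 provides get_feature_names_out(), older releases use get_feature_names()
cvec_title = pd.DataFrame(cv_title.toarray(), columns = cvec.get_feature_names_out())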

In [300]:
df.columns

Index(['title', 'selftext', 'subreddit', 'created_utc', 'author',
       'num_comments', 'score', 'is_crosspostable', 'timestamp'],
      dtype='object')