# Project 3 - Classification with Natural Language Processing

*Author: Grace Campbell*

## Problem Statement

[Reddit](https://reddit.com) is a content aggregation website where members can submit links, text posts, images, and videos, which other members can then comment on and discuss. The posts "are organized by subject into user-created boards called 'subreddits', which cover a variety of topics including news, science, movies, video games, music, books, fitness, food, and image-sharing." ([Wikipedia](https://en.wikipedia.org/wiki/Reddit))

Given post titles from two different subreddits, can I create a model which accurately predicts which subreddit a post came from? And more generally, can I create a model that can detect satire through text?

## Data Gathering

In [157]:
import requests
import time
import pandas as pd

In [158]:
headers = {'User-agent': 'Grace'}

### Getting /r/TheOnion post titles

In [159]:
onion_posts = []
after = None  # pagination token returned with the previous page
url = 'https://www.reddit.com/r/theonion.json'
for _ in range(40):
    # The first request has no token; later requests ask for the page after it
    params = {} if after is None else {'after': after}
    res = requests.get(url, params=params, headers=headers)
    if res.status_code == 200:
        the_json = res.json()
        onion_posts.extend(the_json['data']['children'])
        after = the_json['data']['after']
    else:
        print(res.status_code)
        break
    time.sleep(2)  # pause between requests to avoid hammering the server

In [160]:
titles = [post['data']['title'] for post in onion_posts]

In [161]:
onion_titles = list(set(titles))  # drop duplicate titles
len(onion_titles)

949

### Getting /r/worldnews post titles

In [162]:
news_posts = []
after = None  # pagination token returned with the previous page
url = 'https://www.reddit.com/r/worldnews.json'
for _ in range(40):
    # The first request has no token; later requests ask for the page after it
    params = {} if after is None else {'after': after}
    res = requests.get(url, params=params, headers=headers)
    if res.status_code == 200:
        the_json = res.json()
        news_posts.extend(the_json['data']['children'])
        after = the_json['data']['after']
    else:
        print(res.status_code)
        break
    time.sleep(2)  # pause between requests to avoid hammering the server

In [163]:
titles = [post['data']['title'] for post in news_posts]

In [164]:
news_titles = list(set(titles))  # drop duplicate titles
len(news_titles)

634
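
The two scraping loops above differ only in the subreddit URL. As a minimal sketch, the shared pagination logic could be factored into a helper. The names `collect_titles`, `fetch_page`, and `FAKE_PAGES` are hypothetical and not part of the original notebook: `fetch_page` stands in for whatever performs the HTTP request (for example, a small wrapper around `requests.get`), and the fake pages exist only so the sketch runs without network access.

```python
import time


def collect_titles(fetch_page, max_pages=40, pause=2.0):
    """Page through a Reddit-style listing and return the unique titles.

    fetch_page(after) is a hypothetical callable that returns the parsed
    JSON for one page, or None if the request failed.
    """
    titles = []
    after = None
    for _ in range(max_pages):
        listing = fetch_page(after)
        if listing is None:  # request failed
            break
        data = listing['data']
        titles.extend(child['data']['title'] for child in data['children'])
        after = data['after']
        if after is None:  # no further pages
            break
        time.sleep(pause)
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(titles))


# Stand-in for the network call: two small pages with one repeated title
FAKE_PAGES = {
    None: {'data': {'children': [{'data': {'title': 'A'}},
                                 {'data': {'title': 'B'}}],
                    'after': 't3_x'}},
    't3_x': {'data': {'children': [{'data': {'title': 'B'}},
                                   {'data': {'title': 'C'}}],
                      'after': None}},
}

unique_titles = collect_titles(FAKE_PAGES.get, pause=0)
print(unique_titles)
```

Unlike the loops above, this sketch stops as soon as `after` comes back as `None` rather than requesting the first page again, and it preserves the order in which titles were first seen.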

In [262]:
# Label each title with its source: 1 = /r/TheOnion, 0 = /r/worldnews
onion = pd.DataFrame({'title': onion_titles, 'is_onion': 1})
news = pd.DataFrame({'title': news_titles, 'is_onion': 0})

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
titles = pd.concat([news, onion], ignore_index=True)

titles.to_csv('titles.csv', index=False)

## Exploratory Data Analysis

In [301]:
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB

from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem.wordnet import WordNetLemmatizer

df = pd.read_csv('titles.csv')

In [264]:
# Looking at a few of the titles from each subreddit -- /r/TheOnion = 1, /r/worldnews = 0
pd.options.display.max_colwidth = 300
pd.concat([df[:5], df[-11:-6]])

|  | title | is_onion |
| --- | --- | --- |
| 0 | Trump threatens to close U.S.-Mexico border next week | 0 |
| 1 | German Government Refuses to Recognise Guaido’s Envoy as Ambassador | 0 |
| 2 | Netanyahu Takes Shot At Obama In Campaign Ad | 0 |
| 3 | MPs asked to vote on withdrawal agreement only | 0 |
| 4 | The shock of the nude: Brazil's stark new form of political protest | 0 |
| 1572 | Report: It Pretty Incredible That Americans Entrusted With Driving Cars | 1 |
| 1573 | Self-Conscious Puppet Has No Idea What To Do With Hands | 1 |
| 1574 | 5 Tokyo Buildings Godzilla Spared Because He Was Considering Having His Bar Mitzvah There | 1 |
| 1575 | Nation’s Flag Nerds Anxiously Watching D.C. Statehood Push | 1 |
| 1576 | ‘This One Means The Least Of All,’ Says Tom Brady Accepting Super Bowl Trophy | 1 |

In [265]:
# Are there any null values?
df.isnull().sum()

title       0
is_onion    0
dtype: int64

In [266]:
# Are classes balanced?
df['is_onion'].value_counts(normalize=True)

1    0.599495
0    0.400505
Name: is_onion, dtype: float64

There are no null values in this dataset. The classes are moderately imbalanced, with ~60% of the titles from /r/TheOnion and ~40% from /r/worldnews.
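
To make the class balance concrete, the short sketch below recomputes the proportions from the unique title counts reported earlier (949 and 634) and derives the accuracy that a trivial classifier would reach by always predicting the larger class. The variable names are illustrative and do not appear in the original notebook.

```python
# Unique title counts reported earlier in the notebook
n_onion = 949
n_news = 634

total = n_onion + n_news
share_onion = n_onion / total

# Accuracy of a classifier that always predicts the majority class
baseline_accuracy = max(share_onion, 1 - share_onion)

print(f"total titles: {total}")
print(f"share from /r/TheOnion: {share_onion:.6f}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any model trained later should be judged against this majority-class baseline rather than against 50%.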

### Preprocessing the Data

First, I need to process the titles into tokens and vectorize them so that I can perform exploratory analysis and begin modeling.

This will take two steps:
1. Create a tokenizing function to break down each title into tokens.
    - Lowercase the title and split it into tokens made up only of letters (i.e. no numbers or special characters).
    - Remove English stop words from the token list.
    - Define a helper function that translates Penn Treebank part-of-speech tags into WordNet part-of-speech tags.
    - Use that helper to lemmatize each token according to its part of speech.
    - Return the final list of lemmatized tokens.
2. Use CountVectorizer to turn each title into a vector of token counts, with one column per unique token in the corpus.
    - For now, do not specify any parameters in the CountVectorizer except the tokenizing function.

In [216]:
# To convert TreeBank POS tags to WordNet POS tags
def get_wordnet_pos(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return None
    
# Build the stop word set once; calling stopwords.words() for every token is slow
stop_words = set(stopwords.words('english'))

# To return a list of lemmatized tokens
def token_func(document):
    tokenizer = RegexpTokenizer('[a-zA-Z]+')
    lemmatizer = WordNetLemmatizer()
    tokens = tokenizer.tokenize(document.lower())
    words = [w for w in tokens if w not in stop_words]
    lemmas = []
    for word, tag in pos_tag(words):
        wntag = get_wordnet_pos(tag)
        if wntag is None:
            # No WordNet equivalent: use the lemmatizer's default (noun)
            lemmas.append(lemmatizer.lemmatize(word))
        else:
            lemmas.append(lemmatizer.lemmatize(word, pos=wntag))
    return lemmas

In [184]:
cvec = CountVectorizer(tokenizer=token_func)

features = cvec.fit_transform(df['title'])

In [267]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
features_df = pd.DataFrame(features.toarray(), columns=cvec.get_feature_names_out()).join(df['is_onion'])

In [189]:
# How many unique words do we have in this entire dataset?
len(cvec.get_feature_names_out())

5055

In [273]:
# In the titles from /r/TheOnion, what are the 10 most frequently seen words?
# The slice starts at 1 to skip the 'is_onion' column, whose sum tops the sorted list
features_df.loc[features_df['is_onion'] == 1, :].sum().sort_values(ascending=False)[1:11]

new       74
man       66
say       60
get       53
trump     42
report    39
year      38
woman     37
time      34
make      33
dtype: int64

In [275]:
# In the titles from /r/worldnews, what are the 10 most frequently seen words?
features_df.loc[features_df['is_onion'] == 0, :].sum().sort_values(ascending=False).head(10)

u         85
say       75
brexit    58
trump     46
may       43
year      41
deal      34
new       32
report    32
eu        31
dtype: int64

In this first look, 'new' is the most frequent word in titles from /r/TheOnion, while 'u' is the most frequent in titles from /r/worldnews ('u' corresponds to 'U.S.', due to the way the text was tokenized).

The words 'say', 'trump', 'report', 'year', and 'new' are unlikely to discriminate well between the two classes, since they appear frequently in both groups of titles.
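
The 'u' token can be reproduced without NLTK. The sketch below applies the same `[a-zA-Z]+` pattern used in `token_func` through Python's `re` module to the first title in the sample shown earlier. The helper name `alphabetic_tokens` is illustrative, and the sketch covers only the lowercasing and tokenizing steps, not stop word removal or lemmatization.

```python
import re

# Same pattern that token_func passes to RegexpTokenizer
TOKEN_PATTERN = re.compile(r'[a-zA-Z]+')


def alphabetic_tokens(text):
    """Lowercase the text and keep only runs of letters."""
    return TOKEN_PATTERN.findall(text.lower())


title = 'Trump threatens to close U.S.-Mexico border next week'
tokens = alphabetic_tokens(title)
print(tokens)
# ['trump', 'threatens', 'to', 'close', 'u', 's', 'mexico', 'border', 'next', 'week']
```

Because periods are not letters, 'U.S.' splits into 'u' and 's'. The stray 's' is presumably dropped along with the other stop words, which would leave 'u' as the surviving token.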

In [277]:
# How many words only appear once in /r/TheOnion titles?
(features_df.loc[features_df['is_onion'] == 1, :].sum() == 1).sum()

2357

In [278]:
# How many words only appear once in /r/worldnews titles?
(features_df.loc[features_df['is_onion'] == 0, :].sum() == 1).sum()

1416

In [295]:
# How many words only appear once in the entire dataset?
(features_df.sum() == 1).sum()

2906

Titles from /r/TheOnion contain more words that occur only once within their own subset: 2,357 words appear exactly once in the /r/TheOnion titles, compared with 1,416 such words in the /r/worldnews titles.

Out of the 5,055 unique words in the entire dataset, 2,906 appear exactly once, so only 2,149 appear more than once.

## Modeling

- **Create and compare two models**. One of these must be a Bayes classifier; the other can be a classifier of your choosing: logistic regression, KNN, SVM, etc.

In [337]:
X = df['title']
y = df['is_onion']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [343]:
X_train.shape

(1187,)

### Naive Bayes

In [356]:
pipe = Pipeline([
    ('tvec', TfidfVectorizer(tokenizer=token_func)),
    ('mnb', MultinomialNB())
])

params = {
    'tvec__ngram_range': [(1, 1), (1, 2)],
    'tvec__max_features': [None, 900, 1000],
    'tvec__min_df': [1, 2, 3],
    'tvec__max_df': [0.9, 0.95]
}

grid = GridSearchCV(pipe, params)
grid.fit(X_train, y_train);

In [357]:
grid.best_params_

{'tvec__max_df': 0.9,
 'tvec__max_features': None,
 'tvec__min_df': 1,
 'tvec__ngram_range': (1, 1)}

In [358]:
grid.best_score_

0.8348778433024431
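
The same parameter grid is reused for every model in this section, so it helps to know how much work one search involves. The sketch below enumerates the grid with `itertools.product` and multiplies by the number of cross-validation folds. The notebook leaves the `cv` argument of `GridSearchCV` at its default, so the fold counts of 3 and 5 are assumptions for illustration, and `total_fits` is a hypothetical helper.

```python
from itertools import product

# Parameter grid copied from the pipeline search above
params = {
    'tvec__ngram_range': [(1, 1), (1, 2)],
    'tvec__max_features': [None, 900, 1000],
    'tvec__min_df': [1, 2, 3],
    'tvec__max_df': [0.9, 0.95],
}

# One dict per combination: a single value chosen for each parameter
combinations = [dict(zip(params, values)) for values in product(*params.values())]
n_combinations = len(combinations)


def total_fits(n_folds):
    """Number of pipeline fits for one search, excluding the final refit."""
    return n_combinations * n_folds


print(n_combinations)  # 2 * 3 * 3 * 2 = 36
for folds in (3, 5):   # assumed fold counts
    print(folds, total_fits(folds))
```

Each of those fits re-tokenizes the training titles, because `token_func` runs inside the `TfidfVectorizer` step of the pipeline. Slower estimators therefore make the whole search noticeably longer.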

### Logistic Regression

In [362]:
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('tvec', TfidfVectorizer(tokenizer=token_func)),
    ('logreg', LogisticRegression())
])

params = {
    'tvec__ngram_range': [(1, 1), (1, 2)],
    'tvec__max_features': [None, 900, 1000],
    'tvec__min_df': [1, 2, 3],
    'tvec__max_df': [0.9, 0.95]
}

grid = GridSearchCV(pipe, params)
grid.fit(X_train, y_train);

In [363]:
grid.best_params_

{'tvec__max_df': 0.9,
 'tvec__max_features': 1000,
 'tvec__min_df': 2,
 'tvec__ngram_range': (1, 1)}

In [364]:
grid.best_score_

0.7902274641954508

### Random Forest

In [366]:
from sklearn.ensemble import RandomForestClassifier

pipe = Pipeline([
    ('tvec', TfidfVectorizer(tokenizer=token_func)),
    ('rf', RandomForestClassifier())
])

params = {
    'tvec__ngram_range': [(1, 1), (1, 2)],
    'tvec__max_features': [None, 900, 1000],
    'tvec__min_df': [1, 2, 3],
    'tvec__max_df': [0.9, 0.95]
}

grid = GridSearchCV(pipe, params)
grid.fit(X_train, y_train);

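KeyboardInterrupt: 

This search was interrupted before it finished. Judging by the execution counts, the parameters and score shown below were produced by earlier cells (In [360] and In [361]) rather than by this run.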

In [360]:
grid.best_params_

{'tvec__max_df': 0.9,
 'tvec__max_features': None,
 'tvec__min_df': 2,
 'tvec__ngram_range': (1, 2)}

In [361]:
grid.best_score_

0.8011794439764112

# Deliverables
- A Jupyter Notebook with your analysis for a peer audience of data scientists.
- An executive summary of the results you found.
- A short presentation outlining your process and findings for a semi-technical audience.

**Pro Tip 1:** You can find a good example executive summary [here](https://www.proposify.biz/blog/executive-summary).

- Materials must be submitted by **10:00 AM on Monday, April 8th**.

---

## Rubric
Your local instructor will evaluate your project (for the most part) using the following criteria.  You should make sure that you consider and/or follow most if not all of the considerations/recommendations outlined below **while** working through your project.

For Project 3 the evaluation categories are as follows:<br>
**The Data Science Process**
- Problem Statement
- Data Collection
- Data Cleaning & EDA
- Preprocessing & Modeling
- Evaluation and Conceptual Understanding
- Conclusion and Recommendations

**Organization and Professionalism**
- Organization
- Visualizations
- Python Syntax and Control Flow
- Presentation

**Scores will be out of 30 points based on the 10 categories in the rubric.** <br>
*3 points per section*<br>

| Score | Interpretation |
| --- | --- |
| **0** | *Project fails to meet the outlined expectations; many major issues exist.* |
| **1** | *Project close to meeting expectations; many minor issues or a few major issues.* |
| **2** | *Project meets expectations; few (and relatively minor) mistakes.* |
| **3** | *Project demonstrates a thorough understanding of all of the considerations outlined.* |


### The Data Science Process

**Problem Statement** 
- Is it clear what the goal of the project is?
- What type of model will be developed?
- How will success be evaluated?
- Is the scope of the project appropriate?
- Is it clear who cares about this or why this is important to investigate?
- Does the student consider the audience and the primary and secondary stakeholders?

**Data Collection** 
- Was enough data gathered to generate a significant result?
- Was data collected that was useful and relevant to the project?
- Was data collection and storage optimized through custom functions, pipelines, and/or automation?
- Was thought given to the server receiving the requests such as considering number of requests per second?

**Data Cleaning and EDA** 
- Are missing values imputed/handled appropriately?
- Are distributions examined and described?
- Are outliers identified and addressed?
- Are appropriate summary statistics provided?
- Are steps taken during data cleaning and EDA framed appropriately?
- Does the student address whether or not they are likely to be able to answer their problem statement with the provided data given what they've discovered during EDA?

**Preprocessing and Modeling** 
- Is text data successfully converted to a matrix representation?
- Are methods such as stop words, stemming, and lemmatization explored?
- Does the student properly split and/or sample the data for validation/training purposes?
- Does the student test and evaluate a variety of models to identify a production algorithm (**AT MINIMUM:** Bayes and one other model)?
- Does the student defend their choice of production model relevant to the data at hand and the problem?
- Does the student explain how the model works and evaluate its performance successes/downfalls?

**Evaluation and Conceptual Understanding** 
- Does the student accurately identify and explain the baseline score?
- Does the student select and use metrics relevant to the problem objective?
- Does the student interpret the results of their model for purposes of inference?
- Is domain knowledge demonstrated when interpreting results?
- Does the student provide appropriate interpretation with regards to descriptive and inferential statistics?

**Conclusion and Recommendations** 
- Does the student provide appropriate context to connect individual steps back to the overall project?
- Is it clear how the final recommendations were reached?
- Are the conclusions/recommendations clearly stated?
- Does the conclusion answer the original problem statement?
- Does the student address how findings of this research can be applied for the benefit of stakeholders?
- Are future steps to move the project forward identified?


### Organization and Professionalism

**Project Organization**
- Are modules imported correctly (using appropriate aliases)?
- Are data imported/saved using relative paths?
- Does the README provide a good executive summary of the project?
- Is markdown formatting used appropriately to structure notebooks?
- Is there an appropriate number of comments to support the code?
- Are files & directories organized correctly?
- Are there unnecessary files included?
- Do files and directories have well-structured, appropriate, consistent names?

**Visualizations**
- Are sufficient visualizations provided?
- Do plots accurately demonstrate valid relationships?
- Are plots labeled properly?
- Are plots interpreted appropriately?
- Are plots formatted and scaled appropriately for inclusion in a notebook-based technical report?

**Python Syntax and Control Flow**
- Is care taken to write human readable code?
- Is the code syntactically correct (no runtime errors)?
- Does the code generate desired results (logically correct)?
- Does the code follow general best practices and style guidelines?
- Are Pandas functions used appropriately?
- Are `sklearn` and `NLTK` methods used appropriately?

**Presentation**
- Is the problem statement clearly presented?
- Does a strong narrative run through the presentation building toward a final conclusion?
- Are the conclusions/recommendations clearly stated?
- Is the level of technicality appropriate for the intended audience?
- Is the student substantially over or under time?
- Does the student appropriately pace their presentation?
- Does the student deliver their message with clarity and volume?
- Are appropriate visualizations generated for the intended audience?
- Are visualizations necessary and useful for supporting conclusions/explaining findings?