# Subreddit Classification with Natural Language Processing

## Modeling with Naive Bayes

*Author: Grace Campbell*

#### Project Directory
1. Data Preparation 
    - [Data Gathering](https://github.com/GraceCampbell/Fake-News-Classification-NLP/blob/master/data-gathering.ipynb)
    - [Exploratory Data Analysis](https://github.com/GraceCampbell/Fake-News-Classification-NLP/blob/master/exploratory-data-analysis.ipynb)
2. Modeling
    - *Naive Bayes*
    - [$k$-Nearest Neighbors](https://github.com/GraceCampbell/Fake-News-Classification-NLP/blob/master/modeling-knn.ipynb)
    - [Support-Vector Machine](https://github.com/GraceCampbell/Fake-News-Classification-NLP/blob/master/modeling-svm.ipynb)
    - [Final Testing on New Data](https://github.com/GraceCampbell/Fake-News-Classification-NLP/blob/master/final-models-testing.ipynb)

### Model Introduction

In this notebook, I will be modeling with `MultinomialNB`, which is a Naive Bayes classifier. Naive Bayes (as in Bayes' theorem) is a conditional probability model that assumes the feature variables are independent of one another given the class. While this is not a realistic assumption for natural language, where word occurrences are clearly correlated, the Naive Bayes classifier often makes predictions with surprising accuracy.
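
As a sketch of what the independence assumption buys (the symbols here are notation introduced for illustration, not taken from the notebook): for a class $y$ and a title with word counts $x_1, \dots, x_n$ over a vocabulary of $n$ words $w_1, \dots, w_n$, the multinomial Naive Bayes posterior factorizes as

$$P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(w_i \mid y)^{x_i}$$

The predicted class is the $y$ that maximizes the right-hand side, so the model only needs one prior per class and one probability per word per class.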


### Modeling Strategy
Before I can begin modeling, I need to turn my text data into numeric data using `CountVectorizer`. This transformer creates a matrix in which each column represents a distinct word that appears in the corpus and each row represents a document in the corpus. The values are raw counts of how many times a word appears in a document.

Both `CountVectorizer` and `MultinomialNB` have hyperparameters that can be tuned to optimize model performance, so I will perform a grid search using a pipeline that chains the two, which finds the best parameters for each in the context of the other.

The grid search will test 3 different `CountVectorizer` hyperparameters:
1. `max_features`: the maximum number of features to keep (those with the highest total frequency across the corpus)
2. `min_df`: the minimum number of documents in which a feature must appear
3. `max_df`: the maximum proportion of documents in which a feature can appear

and 2 different `MultinomialNB` hyperparameters:
1. `alpha`: the additive (Laplace/Lidstone) smoothing parameter applied to each feature's count
2. `fit_prior`: whether the model will learn the prior probabilities of the classes from the training data (if `False`, a uniform prior is used)
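
The effect of the three `CountVectorizer` settings can be sketched on a toy corpus. The titles and the `vocab` helper below are hypothetical and exist only for illustration; the default tokenizer is used rather than `token_func`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "senate" appears in all 4 titles,
# "passes" and "budget" in 2 each, every other word in only 1
toy_titles = [
    "senate passes budget bill",
    "senate rejects budget amendment",
    "senate delays vote",
    "area man passes senate building",
]

def vocab(**kwargs):
    """Return the sorted vocabulary CountVectorizer keeps under the given settings."""
    return sorted(CountVectorizer(**kwargs).fit(toy_titles).vocabulary_)

print(len(vocab()))                 # no limits: every distinct word is kept
print(vocab(min_df=2))              # drop words found in fewer than 2 titles
print("senate" in vocab(max_df=0.9))  # words in more than 90% of titles are dropped
print(vocab(max_features=1))        # keep only the most frequent word overall
```

With only four titles the thresholds bite hard, but the same logic applies to the full corpus: `min_df` trims rare words, `max_df` trims near-universal ones, and `max_features` caps whatever remains.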

### Grid Searching for Best Hyperparameters

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

from tokenizer import token_func

df = pd.read_csv('./materials/titles.csv')

In [2]:
# Creating X and y
X = df['title']
y = df['is_onion']

# Train-test splitting (with stratification)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

In [3]:
# Instantiating a pipeline
pipe = Pipeline([
    ('cvec', CountVectorizer(tokenizer=token_func)),
    ('mnb', MultinomialNB())
])

# Hyperparameters to search over
params = {
    'cvec__max_features': [None, 1000],
    'cvec__min_df': [1, 2],
    'cvec__max_df': [0.9, 1.0],
    'mnb__alpha': [1, 5],
    'mnb__fit_prior': [True, False]
}

# Fitting the grid search
grid = GridSearchCV(pipe, params, cv=3)
grid.fit(X_train, y_train);
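
For a sense of how much work this search does, the number of candidate settings can be counted with `ParameterGrid`. This is a small sketch that repeats the `params` dictionary so it runs on its own; `n_candidates` and `n_fits` are names introduced here.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above, repeated so this snippet is self-contained
params = {
    'cvec__max_features': [None, 1000],
    'cvec__min_df': [1, 2],
    'cvec__max_df': [0.9, 1.0],
    'mnb__alpha': [1, 5],
    'mnb__fit_prior': [True, False],
}

n_candidates = len(ParameterGrid(params))  # 2 options for each of 5 hyperparameters
n_fits = n_candidates * 3                  # each candidate is fit once per fold (cv=3)

print(n_candidates, n_fits)
```

Because `GridSearchCV` refits the best candidate on the full training set by default, there is one additional fit on top of the cross-validation fits.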

In [4]:
grid.best_score_

0.8407750631844987

In [5]:
# Which parameters did the grid search choose?
grid.best_params_

{'cvec__max_df': 0.9,
 'cvec__max_features': None,
 'cvec__min_df': 1,
 'mnb__alpha': 1,
 'mnb__fit_prior': True}

For `CountVectorizer`, the grid search decided:
- `max_features` should be None
    - I will have to investigate how many features are kept in the model when there is no maximum. I do not want more features than I have rows in `X_train` (to prevent collinearity), so I may have to set the `max_features` anyway, regardless of this grid search result.


- `min_df` should be 1, effectively meaning there is no minimum document frequency
    - Again, I need to see how many features the model keeps, and may need to change `min_df` anyway.


- `max_df` should be 0.9, meaning a feature will not be included in the model if it appears in more than 90% of the documents
    - Since I eliminated stopwords from the tokens, there most likely will not be many (if any) words that show up in more than 90% of the titles.
    
For `MultinomialNB`, the grid search decided:
- `alpha` should be 1
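    - This adds 1 to every feature count within each class when the word likelihoods are estimated, so a word that never appears in a class does not receive a probability of zero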
    
    
- `fit_prior` should be True
    - The model will learn the class prior probabilities 
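
To illustrate what `alpha` does, here is a minimal sketch of the additive-smoothing estimate of $P(\text{word} \mid \text{class})$. The function name and the class word total are hypothetical; only the feature count of 1187 comes from this notebook.

```python
def smoothed_likelihood(count, class_total, n_features, alpha=1.0):
    """Estimate P(word | class) with additive (Laplace/Lidstone) smoothing."""
    return (count + alpha) / (class_total + alpha * n_features)

n_features = 1187     # vocabulary size used for the final model
class_total = 5000    # hypothetical total word count observed in one class

# A word that never appears in the class
no_smoothing = smoothed_likelihood(0, class_total, n_features, alpha=0)
laplace = smoothed_likelihood(0, class_total, n_features, alpha=1)
heavier = smoothed_likelihood(0, class_total, n_features, alpha=5)

print(no_smoothing, laplace, heavier)
```

Without smoothing, a single unseen word would zero out the whole product of likelihoods for that class. Larger values of `alpha` pull every word's probability closer to uniform.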

### Transforming `X` Using Best Parameters

As I suspected, when `max_features` is None there are almost 3 times as many features in `X_train` as there are rows. For the final transformation, I will set `max_features` equal to 1187 (the number of rows) so the number of features does not exceed the number of rows.

In [6]:
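# Vectorizer with the grid-searched settings, but capped at 1187 features
cvec = CountVectorizer(tokenizer=token_func, max_features=1187, min_df=1, max_df=0.9)

# Learn the vocabulary from the training titles only
cvec.fit(X_train)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs 1.0+
feature_names = cvec.get_feature_names_out()
X_train_c = pd.DataFrame(cvec.transform(X_train).toarray(), columns=feature_names)
X_test_c  = pd.DataFrame(cvec.transform(X_test).toarray(), columns=feature_names)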

### Running the Optimized Model

In [7]:
# Instantiating and fitting the model
mnb = MultinomialNB(alpha=1, fit_prior=True)
mnb.fit(X_train_c, y_train)

# Storing predictions
y_pred = mnb.predict(X_test_c)

In [8]:
# Accuracy score
mnb.score(X_test_c, y_test)

0.8661616161616161

In [9]:
# Confusion Matrix + other metrics
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

print(f'True Positives: {tp}')
print(f'False Negatives: {fn}')
print(f'True Negatives: {tn}')
print(f'False Positives: {fp}\n')
print(f'Sensitivity: {tp/(tp+fn)}')
print(f'Specificity: {tn/(tn+fp)}')

True Positives: 201
False Negatives: 36
True Negatives: 142
False Positives: 17

Sensitivity: 0.8481012658227848
Specificity: 0.8930817610062893


### Interpretation of Results

The grid search found that `max_features` should be None to optimize the model's performance. However, without a feature limit, the model had ~4000 features after vectorization. I had to reduce `max_features` to 1187, the number of rows in `X_train`, to reduce collinearity. To get the highest possible accuracy score from this model, I would need to gather more data so that the model could use more features to make predictions.

The baseline accuracy score for the data is the score I would get if I predicted the majority class for every data point. The majority class here, /r/TheOnion, holds around 60% of the data. If I were to predict that every document in the data belonged to /r/TheOnion, I would get an accuracy score of 60%. That is to say, if a model does not predict subreddit membership with greater than 60% accuracy, then it is not a very good model.
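
The baseline can be checked against the test set. This sketch rebuilds the test labels from the confusion-matrix counts reported above, assuming `1` marks /r/TheOnion and `0` marks /r/News; the variable names are introduced here.

```python
from collections import Counter

# Test-set labels rebuilt from the confusion matrix:
# actual positives = TP + FN, actual negatives = TN + FP
y_true = [1] * (201 + 36) + [0] * (142 + 17)

# Always predicting the majority class gets exactly that class's share right
majority_class, majority_count = Counter(y_true).most_common(1)[0]
baseline_accuracy = majority_count / len(y_true)

print(majority_class, round(baseline_accuracy, 3))
```

On the 396 test titles this gives a baseline just under 60%, in line with the class balance described above.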

The model's accuracy score is 86.6%, which is well above the baseline score of 60%. This means that the model correctly predicted the class 86.6% of the time. The model has relatively high sensitivity at 84.8%, meaning that 84.8% of the posts that were actually from /r/TheOnion were correctly predicted to be from /r/TheOnion. The model has higher specificity, 89.3%, which means that 89.3% of posts that belong to /r/News were correctly predicted to be from /r/News.
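
These three figures follow directly from the confusion-matrix counts reported above ($TP = 201$, $FN = 36$, $TN = 142$, $FP = 17$):

$$\text{accuracy} = \frac{TP + TN}{TP + FN + TN + FP} = \frac{201 + 142}{396} \approx 0.866$$

$$\text{sensitivity} = \frac{TP}{TP + FN} = \frac{201}{237} \approx 0.848$$

$$\text{specificity} = \frac{TN}{TN + FP} = \frac{142}{159} \approx 0.893$$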

In a real-world application, it is equally important to me that this model correctly predict when a post is satirical **and** when it is real. The positive class in this case does not hold more weight than the negative class, so I would rather the model be very accurate than very sensitive or very specific. This model, however, scores well on all three measures, which means that for my purposes it is a great model.