## Vectorization and Modeling Our Data

In this notebook, I'll create `Pipeline`s to vectorize and model my data, and I'll use `GridSearchCV` to iterate over the parameters in each pipeline. The models I explore are Logistic Regression, Multinomial Naive Bayes, and Support Vector Classification.

In [89]:
import numpy as np
import pandas as pd
import regex as re
from my_functions import tokenize_and_stem
from bs4 import BeautifulSoup
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords 
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.svm import SVC

In [24]:
posts = pd.read_csv('./data/reddit_clean.csv')

In [90]:
# Creating X and y values to be passed through train_test_split
X = posts['selftext']
y = posts['subreddit']

In [91]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=182, stratify=y) # our data is pretty close to even, but I still want to stratify just to be safe.

In [92]:
# Creating a stop word list based on my EDA of words that aren't helpful
stop_words = [
    'https',
    'com',
    'www',
    'amp',
    'like',
    'just',
    'spotify',
    'because',
    'song',
    'music',
    'album',
    'want',
    'would',
    'make',
    'know',
    'becau',
]

In [93]:
# Creating a list of custom stop words
custom_sw = stopwords.words('english') + stop_words

In [94]:
# Processing my stop words in the same way I'll process my data
processed_sw = tokenize_and_stem(' '.join(custom_sw))

# Storing this for later use across Jupyter Notebooks
%store processed_sw

Stored 'processed_sw' (list)
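
The stop words get the same treatment as the data because `CountVectorizer` filters stop words after the tokenizer has run, so an unstemmed stop word would never match its stemmed token. Here is a minimal sketch of that mismatch. Note that `toy_stem` and `toy_tokenize_and_stem` are crude, hypothetical stand-ins for the real `tokenize_and_stem`, which lives in `my_functions` and isn't shown in this notebook.

```python
import re


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "es", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_tokenize_and_stem(text):
    """Lowercase, split on non-letters, then stem each token."""
    return [toy_stem(tok) for tok in re.findall(r"[a-z]+", text.lower())]


raw_stop_words = ["because", "make"]
# Run the stop words through the same tokenizer as the posts
stemmed_stop_words = toy_tokenize_and_stem(" ".join(raw_stop_words))

tokens = toy_tokenize_and_stem("I make playlists because I love bands")

# Filtering with the raw list misses the stemmed tokens entirely
kept_with_raw = [t for t in tokens if t not in raw_stop_words]
# Filtering with the stemmed list removes them as intended
kept_with_stemmed = [t for t in tokens if t not in stemmed_stop_words]

print(kept_with_raw)
print(kept_with_stemmed)
```

With the raw list, the stemmed forms of "because" and "make" slip through; with the processed list they are dropped.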


### What's our baseline score?

In [95]:
# Baseline score
y.value_counts(normalize=True)

poppunkers    0.500784
punk          0.499216
Name: subreddit, dtype: float64

Our baseline score is about 50%: always guessing `poppunkers` would be right 50.1% of the time.
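
The baseline is the accuracy of always guessing the most common class, which is the largest value in the normalized counts above. A minimal sketch with made-up labels (the toy list and the helper name `baseline_accuracy` are for illustration only, not the real data):

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy of a model that always predicts the most frequent class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Made-up labels: 6 posts from one subreddit, 4 from the other
toy_labels = ["poppunkers"] * 6 + ["punk"] * 4
baseline = baseline_accuracy(toy_labels)
print(baseline)
```

Any model worth keeping needs to beat this number on held-out data.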

### Setting up and Running Pipes

In [96]:
# Setting variables for my eventual parameter grid for easy tuning
max_df = [0.80]
min_df = [0, 0.002]
ngram_range = [(1, 2), (1, 1)]
max_features = [4000]
stop_words = [processed_sw, None]
tokenizer = [None]

In [97]:
# Count Vectorizer and Naive Bayes
pipe_cvec_nb = Pipeline([
    ('cvec', CountVectorizer(tokenizer=tokenize_and_stem)),
    ('nb', MultinomialNB())
])

# Setting my parameters
params_cvec_nb = {
    'cvec__max_df' : max_df,
    'cvec__min_df' : min_df,
    'cvec__max_features' : max_features,
    'cvec__ngram_range' : ngram_range,
    'cvec__stop_words' : stop_words
}
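
The keys in the parameter dictionary follow scikit-learn's `<step name>__<parameter>` convention, so `cvec__max_df` sets `max_df` on the pipeline step named `cvec`. A small sketch that checks the keys against what the pipeline accepts (built without the custom tokenizer so it stands alone):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("cvec", CountVectorizer()),
    ("nb", MultinomialNB()),
])

# Every tunable parameter shows up in get_params() under "<step>__<param>"
valid_keys = set(pipe.get_params().keys())
grid_keys = [
    "cvec__max_df",
    "cvec__min_df",
    "cvec__max_features",
    "cvec__ngram_range",
    "cvec__stop_words",
]
unknown = [key for key in grid_keys if key not in valid_keys]
print(unknown)

# The same naming works for setting a value directly on a step
pipe.set_params(cvec__max_df=0.8)
print(pipe.named_steps["cvec"].max_df)
```

A misspelled key would land in `unknown` here, which is a quick check to run before starting a long search.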

In [98]:
# Count Vectorizer and Logistic Regression
pipe_cvec_lr = Pipeline([
    ('cvec', CountVectorizer(tokenizer=tokenize_and_stem)),
    ('lr', LogisticRegression(max_iter=2000))
])

params_cvec_lr = {
    'cvec__max_df' : max_df,
    'cvec__min_df' : min_df,
    'cvec__max_features' : max_features,
    'cvec__ngram_range' : ngram_range,
    'cvec__stop_words' : stop_words
}

In [99]:
# Tfidf Vectorizer and SVC
pipe_tvec_svc = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('svc', SVC(degree=2, kernel='poly'))
])

params_tvec_svc = {
    'tvec__max_df' : max_df,
    'tvec__min_df' : min_df,
    'tvec__max_features' : max_features,
    'tvec__ngram_range' : ngram_range,
    'svc__C' : [0.789495]
}

In [100]:
gs_1 = GridSearchCV(pipe_cvec_nb,
                         param_grid=params_cvec_nb,
                         cv=5,
                         n_jobs = 12,
                         verbose=2)

gs_3 = GridSearchCV(pipe_cvec_lr,
                         param_grid=params_cvec_lr,
                         cv=5,
                         n_jobs = 12,
                         verbose=2)

gs_4 = GridSearchCV(pipe_tvec_svc,
                         param_grid=params_tvec_svc,
                         cv=5,
                         n_jobs = 12,
                         verbose=2)
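
Each search tries every combination in its grid and fits once per fold, so the cost grows multiplicatively. A rough count for the grids defined above, where a string stands in for the actual stop word list and the helper name `count_fits` is just for illustration:

```python
from itertools import product


def count_fits(param_grid, cv):
    """Return (number of candidates, number of cross-validation fits)."""
    candidates = list(product(*param_grid.values()))
    return len(candidates), len(candidates) * cv


cvec_grid = {
    "cvec__max_df": [0.80],
    "cvec__min_df": [0, 0.002],
    "cvec__max_features": [4000],
    "cvec__ngram_range": [(1, 2), (1, 1)],
    "cvec__stop_words": ["processed_sw", None],  # placeholder for the real list
}
svc_grid = {
    "tvec__max_df": [0.80],
    "tvec__min_df": [0, 0.002],
    "tvec__max_features": [4000],
    "tvec__ngram_range": [(1, 2), (1, 1)],
    "svc__C": [0.789495],
}

cvec_candidates, cvec_fits = count_fits(cvec_grid, cv=5)
svc_candidates, svc_fits = count_fits(svc_grid, cv=5)
print(cvec_candidates, cvec_fits)
print(svc_candidates, svc_fits)
```

With the default `refit=True`, `GridSearchCV` also fits the best candidate once more on the full training set after the cross-validation fits.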

In [101]:
# # Commented out so the .csv doesn't get overwritten
# model_params = {}
# count = 0

In [102]:
# # Uncomment if you really want to run this GridSearch again, it will take a while
# gs_1.fit(X_train, y_train)
# gs_3.fit(X_train, y_train)
# gs_4.fit(X_train, y_train)

# # Create a new dictionary entry with the vectorizer used in the GridSearch Pipeline
# gs_1.best_params_['vectorizer'] = gs_1.estimator[0]
# gs_3.best_params_['vectorizer'] = gs_3.estimator[0]
# gs_4.best_params_['vectorizer'] = gs_4.estimator[0]

# # Create a new dictionary entry with the model used in the GridSearch Pipeline
# gs_1.best_params_['model'] = gs_1.estimator[1]
# gs_3.best_params_['model'] = gs_3.estimator[1]
# gs_4.best_params_['model'] = gs_4.estimator[1]

# # Create a new dictionary entry with the train score from the GridSearch
# gs_1.best_params_['train_score'] = gs_1.best_score_
# gs_3.best_params_['train_score'] = gs_3.best_score_
# gs_4.best_params_['train_score'] = gs_4.best_score_

# # Create a new dictionary entry with the test score from the GridSearch
# gs_1.best_params_['test_score'] = gs_1.score(X_test, y_test)
# gs_3.best_params_['test_score'] = gs_3.score(X_test, y_test)
# gs_4.best_params_['test_score'] = gs_4.score(X_test, y_test)

# # Add each of these entries to the dictionary
# count += 1
# model_params[f'model_{count}'] = gs_1.best_params_
# count += 1
# model_params[f'model_{count}'] = gs_3.best_params_
# count += 1
# model_params[f'model_{count}'] = gs_4.best_params_

# # Create a DataFrame from the dictionary we created above
# model_df = pd.DataFrame.from_dict(model_params, orient='index')

In [103]:
# Code to store model is commented out so it doesn't get overwritten
# model_df.to_csv('./data/gs_results.csv', index=False)

In [104]:
model_df = pd.read_csv('./data/gs_results.csv')

In [105]:
model_df.index.name = 'Test Number'

In [106]:
model_df

| Test Number | cvec__max_df | cvec__max_features | cvec__min_df | cvec__ngram_range | cvec__stop_words | vectorizer | model | train_score | test_score | svc__C | tvec__max_df | tvec__max_features | tvec__min_df | tvec__ngram_range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.95 | 4000.0 | 0.001 | (1, 2) | `['i', 'me', 'my', 'myself', 'we', 'our', 'our'...` | `CountVectorizer(tokenizer=<function tokenize_a...` | `MultinomialNB()` | 0.811783 | 0.834901 | NaN | NaN | NaN | NaN | NaN |
| 1 | 0.95 | 4000.0 | 0.001 | (1, 2) | `['i', 'me', 'my', 'myself', 'we', 'our', 'our'...` | `CountVectorizer(tokenizer=<function tokenize_a...` | `LogisticRegression(max_iter=2000)` | 0.788779 | 0.830721 | NaN | NaN | NaN | NaN | NaN |
| 2 | 0.99 | 4000.0 | 0.0 | (1, 2) | `['i', 'me', 'my', 'myself', 'we', 'our', 'our'...` | `CountVectorizer(tokenizer=<function tokenize_a...` | `MultinomialNB()` | 0.812131 | 0.832811 | NaN | NaN | NaN | NaN | NaN |
| 3 | 0.99 | 4000.0 | 0.0 | (1, 2) | `['i', 'me', 'my', 'myself', 'we', 'our', 'our'...` | `CountVectorizer(tokenizer=<function tokenize_a...` | `LogisticRegression(max_iter=2000)` | 0.787733 | 0.831766 | NaN | NaN | NaN | NaN | NaN |
| 4 | 0.8 | 4000.0 | 0.0 | (1, 2) | `['i', 'me', 'my', 'myself', 'we', 'our', 'our'...` | `CountVectorizer(tokenizer=<function tokenize_a...` | `MultinomialNB()` | 0.812131 | 0.832811 | NaN | NaN | NaN | NaN | NaN |
| 5 | 0.8 | 4000.0 | 0.0 | (1, 2) | `['i', 'me', 'my', 'myself', 'we', 'our', 'our'...` | `CountVectorizer(tokenizer=<function tokenize_a...` | `LogisticRegression(max_iter=2000)` | 0.787733 | 0.831766 | NaN | NaN | NaN | NaN | NaN |
| 6 | NaN | NaN | NaN | NaN | NaN | `TfidfVectorizer()` | `SVC(degree=2, kernel='poly')` | 0.793301 | 0.807732 | 0.789495 | 0.95 | 4000.0 | 0.002 | (1, 1) |
| 7 | NaN | NaN | NaN | NaN | NaN | `TfidfVectorizer()` | `SVC(degree=2, kernel='poly')` | 0.793301 | 0.807732 | 0.789495 | 0.99 | 4000.0 | 0.002 | (1, 1) |
| 8 | NaN | NaN | NaN | NaN | NaN | `TfidfVectorizer()` | `SVC(degree=2, kernel='poly')` | 0.793301 | 0.807732 | 0.789495 | 0.8 | 4000.0 | 0.002 | (1, 1) |


In [108]:
# Model names wouldn't sort properly until turned into strings
model_df['model'] = model_df['model'].astype(str) 

# Sorting df by model name, verbose
model_df.sort_values(by=['model', 'test_score']).to_csv('./data/verbose_sorted_gs_results.csv')

# Sorting df by model name and selecting condensed features
model_df.sort_values(by=['model', 'test_score'])[['model', 'cvec__ngram_range', 'tvec__ngram_range', 'train_score', 'test_score']].to_csv('./data/condensed_sorted_gs_results.csv')

## Insights from running GridSearchCV

The tokenizer I built in the third notebook ended up only being helpful when running Pipelines with `CountVectorizer()`. Stop words were also only helpful for `CountVectorizer()` pipelines.

I also originally had a `CountVectorizer()` --> `SVC` pipeline, but I removed it because it performed so poorly.

Ultimately, I'm going with the **Naive Bayes model** as my production model because of its interpretability and performance. The SVC model has less variance (its train and test scores are closer together), but it's uninterpretable and would be difficult to explain to a stakeholder who isn't familiar with statistical models.

In [109]:
model_df.loc[4]

cvec__max_df                                                        0.8
cvec__max_features                                                 4000
cvec__min_df                                                          0
cvec__ngram_range                                                (1, 2)
cvec__stop_words      ['i', 'me', 'my', 'myself', 'we', 'our', 'our'...
vectorizer            CountVectorizer(tokenizer=<function tokenize_a...
model                                                   MultinomialNB()
train_score                                                    0.812131
test_score                                                     0.832811
svc__C                                                              NaN
tvec__max_df                                                        NaN
tvec__max_features                                                  NaN
tvec__min_df                                                        NaN
tvec__ngram_range                                                   NaN

## Building my production model and prepping dataframes for conclusions

Below, I'll instantiate and fit my production model and create some dataframes of results that I'll investigate further in my conclusions notebook.

In [110]:
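# Rebuilding my best scoring model
# Note: max_df=0.99 here rather than the 0.8 in row 4 above; the two settings
# scored identically in the grid search results (rows 2 and 4)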
cvec = CountVectorizer(
    tokenizer=tokenize_and_stem,
    max_df=0.99,
    min_df=0,
    max_features=4000,
    ngram_range=(1, 2),
    stop_words=processed_sw
)

nb = MultinomialNB()

# Count Vectorizing my training and testing data
X_train_cvec = cvec.fit_transform(X_train)
X_test_cvec = cvec.transform(X_test)

# Fitting my best scoring model
nb.fit(X_train_cvec, y_train)

MultinomialNB()

In [111]:
# Final cross val score of my production model
cross_val_score(nb, X_train_cvec, y_train)

array([0.81533101, 0.81010453, 0.82055749, 0.81010453, 0.82024433])

In [112]:
# Final test score of my production model
nb.score(X_test_cvec, y_test)

0.832810867293626

### Creating Probabilities Dataframe

Now I want to create a dataframe that has the probabilities of an observation (post) being classified as belonging to the `poppunkers` or `punk` subreddits.

In [113]:
# Storing the probabilities that a post belongs to one class or the other
probabilities = nb.predict_proba(X_test_cvec)

In [114]:
# Creating a dataframe of my results
proba_df = pd.DataFrame(probabilities,
                       columns=nb.classes_, # Getting class names
                       index=X_test.index # Setting original index of X_test
                       )

proba_df['orig_post'] = X_test

In [115]:
# Sorting by probability so I can easily pull the posts the model was most confident about for each subreddit.
sorted_probas = proba_df.sort_values(by='poppunkers', ascending=True)

In [27]:
# Saving this for later!
sorted_probas.to_csv('./data/prod_model_sorted_probas.csv')
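
As a rough illustration of how the sorted frame can be read later: with an ascending sort on `poppunkers`, the top rows are the posts the model is most sure belong to `punk`, and the bottom rows are the ones it is most sure belong to `poppunkers`. The probabilities and posts below are made up for the sketch.

```python
import pandas as pd

# Made-up probabilities; the index mimics the original row labels from X_test
toy_probas = pd.DataFrame(
    {
        "poppunkers": [0.90, 0.05, 0.55, 0.30],
        "punk": [0.10, 0.95, 0.45, 0.70],
        "orig_post": ["post a", "post b", "post c", "post d"],
    },
    index=[12, 7, 33, 21],
)

sorted_toy = toy_probas.sort_values(by="poppunkers", ascending=True)

most_punk = sorted_toy.head(1)        # lowest poppunkers probability
most_poppunkers = sorted_toy.tail(1)  # highest poppunkers probability

# The post closest to a 50/50 split is the one the model is least sure about
least_certain = (toy_probas["poppunkers"] - 0.5).abs().idxmin()

print(most_punk.index[0], most_poppunkers.index[0], least_certain)
```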

## Creating a word importance dataframe
This dataframe will show how important each word was to our Naive Bayes classifier.

In [116]:
# Summing the columns in the X_test_cvec array, thanks John Vinyard from
# Stack Overflow: https://stackoverflow.com/questions/13567345/how-to-calculate-the-sum-of-all-columns-of-a-2d-numpy-array-efficiently
word_freq = X_test_cvec.toarray().sum(axis=0)

In [117]:
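# Creating a DataFrame for word importance. feature_log_prob_[1] is log P(word | 'punk'),
# which is what nb.coef_ held before newer scikit-learn releases removed it;
# get_feature_names_out() likewise replaces the removed get_feature_names().
word_importance = pd.DataFrame(np.exp(nb.feature_log_prob_[1]), index=cvec.get_feature_names_out())
word_importance.columns = ['coefficient']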
word_importance['testing_word_freq'] = word_freq

# Let's sort this by the Coefficient
word_importance_sorted = word_importance.sort_values(by='coefficient', ascending=False)

In [118]:
# Saving this for later
word_importance.to_csv('./data/word_importance.csv')
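
For context on what the `coefficient` column holds: the values are exponentiated log probabilities from the fitted `MultinomialNB`, meaning smoothed estimates of P(word | class) for the second class in `nb.classes_` (here `punk`). A hand-rolled sketch of that estimate with made-up counts, assuming the default smoothing of `alpha=1.0` (the helper name `smoothed_word_probs` is for illustration):

```python
import math


def smoothed_word_probs(word_counts, alpha=1.0):
    """Estimate P(word | class) with additive (Laplace) smoothing."""
    total = sum(word_counts.values())
    n_features = len(word_counts)
    return {
        word: (count + alpha) / (total + alpha * n_features)
        for word, count in word_counts.items()
    }


# Made-up word counts across all posts in one class
punk_counts = {"mosh": 6, "riff": 3, "chorus": 0}

probs = smoothed_word_probs(punk_counts)
# The model stores logs of these values; np.exp() in the notebook undoes the log
log_probs = {word: math.log(p) for word, p in probs.items()}
recovered = {word: math.exp(lp) for word, lp in log_probs.items()}

print(probs)
```

Because these are probabilities within a single class, a high value means the word is common in that class's posts, not necessarily that it separates the two subreddits.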

## Creating a dataframe of the predicted and true classes

Now I want to create a dataframe of the original posts and their predicted and actual subreddits. I'll take a look at which posts we didn't classify correctly and see which words were used most in those posts.

In [119]:
# Getting my y_preds
y_preds = nb.predict(X_test_cvec)

In [120]:
predictions = pd.DataFrame(data={'predicted_subreddit' : y_preds,
                                'actual_subreddit' : y_test.tolist(),
                                'orig_post' : X_test
                                },
                           index=X_test.index)

In [121]:
predictions.to_csv('./data/prod_model_predictions.csv')
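
A sketch of how the misclassified posts could be pulled out of this dataframe later, using made-up rows with the same column names. The simple whitespace split is an assumption for illustration; the real analysis may tokenize differently.

```python
from collections import Counter

import pandas as pd

# Made-up rows with the same columns as the predictions dataframe
toy_predictions = pd.DataFrame({
    "predicted_subreddit": ["punk", "poppunkers", "punk", "poppunkers"],
    "actual_subreddit": ["punk", "punk", "poppunkers", "poppunkers"],
    "orig_post": [
        "saw a hardcore show last night",
        "new record drops friday",
        "that breakdown was heavy",
        "warped tour nostalgia thread",
    ],
})

# Keep only the rows where the prediction disagrees with the true label
wrong = toy_predictions["predicted_subreddit"] != toy_predictions["actual_subreddit"]
misclassified = toy_predictions[wrong]

# How many errors came from each true subreddit
error_counts = misclassified.groupby("actual_subreddit").size()

# Most-used words in the misclassified posts (naive whitespace split)
word_counts = Counter(" ".join(misclassified["orig_post"]).split())

print(error_counts)
print(word_counts.most_common(3))
```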

## What's Next?

Onward to the conclusions!