### Comparing Models and Vectorization Strategies for Text Classification

This try-it weighs the strengths and weaknesses of different estimators and vectorization strategies for a text classification problem. To compare these components, use the `Pipeline` and `GridSearchCV` objects in scikit-learn to try different combinations of vectorizers and estimators. For each combination, use the grid search's `.cv_results_` attribute to examine how long the estimator takes to fit the data.
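
As a minimal sketch of that workflow, the example below runs a small grid search over a `Pipeline` and then reads fit times out of `.cv_results_`. The toy corpus, its labels, and the tiny parameter grid are assumptions made up for illustration; they are not the ColBert data or the grid used later in this notebook.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: 1 = humorous, 0 = not humorous.
texts = [
    "why did the chicken cross the road",
    "what do you call a fake noodle an impasta",
    "knock knock who is there",
    "a horse walks into a bar",
    "senate passes budget bill after long debate",
    "stocks fall as interest rates rise",
    "city council approves new transit plan",
    "scientists publish climate report findings",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([
    ("cvect", CountVectorizer()),
    ("lgr", LogisticRegression()),
])
params = {
    "cvect__max_features": [5, 20],
    "cvect__stop_words": ["english", None],
}

# cv=2 only because the toy corpus is tiny.
grid = GridSearchCV(pipe, param_grid=params, cv=2)
grid.fit(texts, labels)

# cv_results_ holds one entry per parameter combination,
# including the mean time (in seconds) spent fitting and scoring.
timing = pd.DataFrame(grid.cv_results_)[
    ["params", "mean_fit_time", "mean_score_time", "mean_test_score"]
]
print(timing.sort_values("mean_test_score", ascending=False))
```

Sorting by `mean_fit_time` instead of `mean_test_score` shows which configurations are cheapest to train.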

### The Data

The data below comes from [kaggle]() and is the "ColBert Dataset" created for this [paper](https://arxiv.org/pdf/2004.12765.pdf). Use the `text` column to classify whether or not the text is humorous. The data is loaded and displayed below.

**Note:** The original dataset contains 200K rows of data. It is best to use the full dataset. If the original dataset is too large for your computer, please use 'dataset-minimal.csv', which has been reduced to 100K rows.

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import time 

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk import word_tokenize
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')

from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error

def stemmer(text):
    stem = PorterStemmer()
    return ' '.join([stem.stem(w) for w in word_tokenize(text)])

def lemmatizer(text):
    lemm = WordNetLemmatizer()
    return ' '.join([lemm.lemmatize(w) for w in word_tokenize(text)])

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\sspillane\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\sspillane\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sspillane\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [31]:
df = pd.read_csv('data/dataset-minimal.csv')

In [32]:
df.head()

Unnamed: 0,text,humor
0,"Joe biden rules out 2020 bid: 'guys, i'm not r...",False
1,Watch: darvish gave hitter whiplash with slow ...,False
2,What do you call a turtle without its shell? d...,True
3,5 reasons the 2016 election feels so personal,False
4,"Pasco police shot mexican migrant from behind,...",False


In [33]:
X = df.drop('humor', axis = 1)
y = df['humor']

# Stemmed data
X_stemmed = X['text'].apply(stemmer)
X_train_stemmed, X_test_stemmed, y_train_stemmed, y_test_stemmed = train_test_split(X_stemmed, y, random_state = 42)

# Lemmatized data
X_lemmed = X['text'].apply(lemmatizer)
X_train_lemmed, X_test_lemmed, y_train_lemmed, y_test_lemmed = train_test_split(X_lemmed, y, random_state = 42)
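
Because both calls to `train_test_split` above use `random_state = 42` on inputs of the same length, the stemmed and lemmatized splits should hold out the same rows, which keeps their test accuracies comparable. The sketch below illustrates this on made-up series; the names `X_a`, `X_b`, and the toy labels are assumptions for illustration only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Two hypothetical text columns derived from the same 20 rows.
y = pd.Series([True, False] * 10)
X_a = pd.Series([f"stem {i}" for i in range(20)])
X_b = pd.Series([f"lemma {i}" for i in range(20)])

Xa_train, Xa_test, ya_train, ya_test = train_test_split(X_a, y, random_state=42)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(X_b, y, random_state=42)

# The same seed and the same number of rows give the same partition.
print(Xa_test.index.equals(Xb_test.index))  # True
print(ya_train.equals(yb_train))            # True

# The default test_size of 0.25 holds out a quarter of the rows.
print(len(Xa_test))  # 5
```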


#### Task


**Text preprocessing:** As a pre-processing step, perform both `stemming` and `lemmatizing` to normalize your text before classifying. For each technique, use both the `CountVectorizer` and the `TfidfVectorizer`, and use the stop words and max features options to prepare the text data for your estimator.
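
To make the two options concrete, here is a small sketch of how `stop_words` and `max_features` shape the output of each vectorizer. The three documents are invented for illustration; with real data the retained vocabulary will of course differ.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini-corpus.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a dog and a cat became friends",
]

# stop_words="english" drops common words such as "the" and "on";
# max_features=2 keeps only the two most frequent remaining terms.
cvect = CountVectorizer(stop_words="english", max_features=2)
counts = cvect.fit_transform(docs)

tfidf = TfidfVectorizer(stop_words="english", max_features=2)
weights = tfidf.fit_transform(docs)

print(sorted(cvect.vocabulary_))     # ['cat', 'dog']
print(counts.toarray())              # raw term counts per document
print(weights.toarray().round(2))    # TF-IDF weights, each row scaled to unit length
```

`CountVectorizer` returns integer counts, while `TfidfVectorizer` down-weights terms that appear in many documents and L2-normalizes each row by default.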

**Classification:** Once you have prepared the text data with the stemming and lemmatizing techniques, consider `LogisticRegression`, `DecisionTreeClassifier`, and `MultinomialNB` as classification algorithms for the data. Compare their performance in terms of accuracy and speed.

Share your results in a table that lists, for the best version of each estimator, a dictionary of the best parameters and the best score.

| model | best_params | best_score |
|---|---|---|
| Logistic | | |
| Decision Tree | | |
| Bayes | | |
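
One possible way to assemble such a table is to loop over the estimators and collect each grid search's attributes into rows. The sketch below assumes a made-up toy corpus, a hypothetical step name `clf`, and an extra `mean_fit_time` column to cover the speed comparison. Note that `best_score_` is the mean cross-validated score, which is not the same quantity as the accuracy returned by calling `.score` on held-out test data.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy corpus: 1 = humorous, 0 = not humorous.
texts = [
    "why did the chicken cross the road",
    "what do you call a fake noodle an impasta",
    "knock knock who is there",
    "a horse walks into a bar",
    "senate passes budget bill after long debate",
    "stocks fall as interest rates rise",
    "city council approves new transit plan",
    "scientists publish climate report findings",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

estimators = {
    "Logistic": LogisticRegression(),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Bayes": MultinomialNB(),
}
params = {"cvect__max_features": [10, 50]}

rows = []
for name, estimator in estimators.items():
    pipe = Pipeline([("cvect", CountVectorizer()), ("clf", estimator)])
    grid = GridSearchCV(pipe, param_grid=params, cv=2).fit(texts, labels)
    rows.append({
        "model": name,
        "best_params": grid.best_params_,
        "best_score": grid.best_score_,  # mean cross-validated accuracy
        # mean seconds per fit for the winning parameter combination
        "mean_fit_time": grid.cv_results_["mean_fit_time"][grid.best_index_],
    })

results = pd.DataFrame(rows).set_index("model")
print(results)
```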


## Count Vectorizer Pipelines

In [39]:
from sklearn.base import clone

# Pipeline.fit returns the pipeline itself, so each variant is fit on its own
# clone; otherwise the stemmed and lemmatized names would point to one object.

# Log Reg
log_cvect_pipe = Pipeline([
                          ('cvect', CountVectorizer()),
                          ('lgr', LogisticRegression())
                          ])
log_cvect_pipe_stemmed = clone(log_cvect_pipe).fit(X_train_stemmed, y_train_stemmed)
log_cvect_test_acc_stemmed = log_cvect_pipe_stemmed.score(X_test_stemmed, y_test_stemmed)

log_cvect_pipe_lemmed = clone(log_cvect_pipe).fit(X_train_lemmed, y_train_lemmed)
log_cvect_test_acc_lemmed = log_cvect_pipe_lemmed.score(X_test_lemmed, y_test_lemmed)


# Decision Tree Classifier
dt_cvect_pipe = Pipeline([
                          ('cvect', CountVectorizer()),
                          ('dt', DecisionTreeClassifier())
                          ])
dt_cvect_pipe_stemmed = clone(dt_cvect_pipe).fit(X_train_stemmed, y_train_stemmed)
dt_cvect_test_acc_stemmed = dt_cvect_pipe_stemmed.score(X_test_stemmed, y_test_stemmed)

dt_cvect_pipe_lemmed = clone(dt_cvect_pipe).fit(X_train_lemmed, y_train_lemmed)
dt_cvect_test_acc_lemmed = dt_cvect_pipe_lemmed.score(X_test_lemmed, y_test_lemmed)


# Naive Bayes Classifier
bayes_cvect_pipe = Pipeline([
                          ('cvect', CountVectorizer()),
                          ('bayes', MultinomialNB())
                          ])
bayes_cvect_pipe_stemmed = clone(bayes_cvect_pipe).fit(X_train_stemmed, y_train_stemmed)
bayes_cvect_test_acc_stemmed = bayes_cvect_pipe_stemmed.score(X_test_stemmed, y_test_stemmed)

bayes_cvect_pipe_lemmed = clone(bayes_cvect_pipe).fit(X_train_lemmed, y_train_lemmed)
bayes_cvect_test_acc_lemmed = bayes_cvect_pipe_lemmed.score(X_test_lemmed, y_test_lemmed)
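
One detail worth keeping in mind when fitting a single pipeline on two datasets: `Pipeline.fit` returns the pipeline object itself, so two names bound to the results of two `fit` calls on the same pipeline refer to one object holding only the most recent fit. The sketch below illustrates this on invented documents and shows `sklearn.base.clone` as one way to obtain independent fitted copies.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Two hypothetical versions of a tiny corpus with disjoint vocabularies.
docs_a = ["funny joke here", "serious news today", "another funny joke", "more serious news"]
docs_b = ["silly pun there", "market report today", "another silly pun", "more market report"]
y = [1, 0, 1, 0]

pipe = Pipeline([("cvect", CountVectorizer()), ("bayes", MultinomialNB())])

# Fitting the same pipeline twice: both names point to the same object.
first = pipe.fit(docs_a, y)
second = pipe.fit(docs_b, y)
print(first is second)                                   # True
print("joke" in first.named_steps["cvect"].vocabulary_)  # False, the second fit replaced it

# Cloning before fitting gives two independent fitted pipelines.
fit_a = clone(pipe).fit(docs_a, y)
fit_b = clone(pipe).fit(docs_b, y)
print(fit_a is fit_b)                                    # False
print("joke" in fit_a.named_steps["cvect"].vocabulary_)  # True
```

`GridSearchCV` clones the estimator it is given before fitting, so the grid searches are not affected by this aliasing either way.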

## Grid Searches

In [40]:
cvect_params = {
                'cvect__max_features': [100, 500, 1000, 2000],
                'cvect__stop_words': ['english', None]
                }
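
To get a feel for how much work this grid implies, the combinations can be enumerated with `ParameterGrid`. The sketch below repeats the grid defined above; the fit count assumes `GridSearchCV`'s default of 5-fold cross-validation.

```python
from sklearn.model_selection import ParameterGrid

cvect_params = {
    "cvect__max_features": [100, 500, 1000, 2000],
    "cvect__stop_words": ["english", None],
}

combos = list(ParameterGrid(cvect_params))
n_folds = 5  # GridSearchCV's default cv

print(len(combos))            # 8 parameter combinations
print(len(combos) * n_folds)  # 40 cross-validation fits, plus one final refit
```

Each grid search therefore fits its pipeline 41 times, which is worth remembering when comparing run times across estimators.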

In [47]:
# Log CVect Stem
log_cvect_stem_grid = GridSearchCV(log_cvect_pipe_stemmed, param_grid=cvect_params)
log_cvect_stem_grid.fit(X_train_stemmed, y_train_stemmed)
log_cvect_stem_test_acc = log_cvect_stem_grid.score(X_test_stemmed, y_test_stemmed)
print(log_cvect_stem_test_acc)
print(log_cvect_stem_grid.best_params_)

0.90892
{'cvect__max_features': 2000, 'cvect__stop_words': None}


In [48]:
# Log CVect Lemm
log_cvect_lemm_grid = GridSearchCV(log_cvect_pipe_lemmed, param_grid=cvect_params)
log_cvect_lemm_grid.fit(X_train_lemmed, y_train_lemmed)
log_cvect_lemm_test_acc = log_cvect_lemm_grid.score(X_test_lemmed, y_test_lemmed)
print(log_cvect_lemm_test_acc)
print(log_cvect_lemm_grid.best_params_)

0.90824
{'cvect__max_features': 2000, 'cvect__stop_words': None}


In [49]:
# DT CVect Stem
dt_cvect_stem_grid = GridSearchCV(dt_cvect_pipe_stemmed, param_grid=cvect_params)
dt_cvect_stem_grid.fit(X_train_stemmed, y_train_stemmed)
dt_cvect_stem_test_acc = dt_cvect_stem_grid.score(X_test_stemmed, y_test_stemmed)
print(dt_cvect_stem_test_acc)
print(dt_cvect_stem_grid.best_params_)

0.85384
{'cvect__max_features': 2000, 'cvect__stop_words': None}


In [50]:
# DT CVect Lemm
dt_cvect_lemm_grid = GridSearchCV(dt_cvect_pipe_lemmed, param_grid=cvect_params)
dt_cvect_lemm_grid.fit(X_train_lemmed, y_train_lemmed)
dt_cvect_lemm_test_acc = dt_cvect_lemm_grid.score(X_test_lemmed, y_test_lemmed)
print(dt_cvect_lemm_test_acc)
print(dt_cvect_lemm_grid.best_params_)

0.85368
{'cvect__max_features': 2000, 'cvect__stop_words': None}


In [51]:
# Bayes CVect Stem
bayes_cvect_stem_grid = GridSearchCV(bayes_cvect_pipe_stemmed, param_grid=cvect_params)
bayes_cvect_stem_grid.fit(X_train_stemmed, y_train_stemmed)
bayes_cvect_stem_test_acc = bayes_cvect_stem_grid.score(X_test_stemmed, y_test_stemmed)
print(bayes_cvect_stem_test_acc)
print(bayes_cvect_stem_grid.best_params_)

0.89056
{'cvect__max_features': 2000, 'cvect__stop_words': None}


In [52]:
# Bayes CVect Lemm
bayes_cvect_lemm_grid = GridSearchCV(bayes_cvect_pipe_lemmed, param_grid=cvect_params)
bayes_cvect_lemm_grid.fit(X_train_lemmed, y_train_lemmed)
bayes_cvect_lemm_test_acc = bayes_cvect_lemm_grid.score(X_test_lemmed, y_test_lemmed)
print(bayes_cvect_lemm_test_acc)
print(bayes_cvect_lemm_grid.best_params_)

0.88928
{'cvect__max_features': 2000, 'cvect__stop_words': None}


## TF-IDF Pipelines

In [53]:
from sklearn.base import clone

# Pipeline.fit returns the pipeline itself, so each variant is fit on its own
# clone; otherwise the stemmed and lemmatized names would point to one object.

# Log Reg
log_tfidf_pipe = Pipeline([
                          ('tfidf', TfidfVectorizer()),
                          ('lgr', LogisticRegression())
                          ])
log_tfidf_pipe_stemmed = clone(log_tfidf_pipe).fit(X_train_stemmed, y_train_stemmed)
log_tfidf_test_acc_stemmed = log_tfidf_pipe_stemmed.score(X_test_stemmed, y_test_stemmed)

log_tfidf_pipe_lemmed = clone(log_tfidf_pipe).fit(X_train_lemmed, y_train_lemmed)
log_tfidf_test_acc_lemmed = log_tfidf_pipe_lemmed.score(X_test_lemmed, y_test_lemmed)


# Decision Tree Classifier
dt_tfidf_pipe = Pipeline([
                          ('tfidf', TfidfVectorizer()),
                          ('dt', DecisionTreeClassifier())
                          ])
dt_tfidf_pipe_stemmed = clone(dt_tfidf_pipe).fit(X_train_stemmed, y_train_stemmed)
dt_tfidf_test_acc_stemmed = dt_tfidf_pipe_stemmed.score(X_test_stemmed, y_test_stemmed)

dt_tfidf_pipe_lemmed = clone(dt_tfidf_pipe).fit(X_train_lemmed, y_train_lemmed)
dt_tfidf_test_acc_lemmed = dt_tfidf_pipe_lemmed.score(X_test_lemmed, y_test_lemmed)


# Naive Bayes Classifier
bayes_tfidf_pipe = Pipeline([
                          ('tfidf', TfidfVectorizer()),
                          ('bayes', MultinomialNB())
                          ])
bayes_tfidf_pipe_stemmed = clone(bayes_tfidf_pipe).fit(X_train_stemmed, y_train_stemmed)
bayes_tfidf_test_acc_stemmed = bayes_tfidf_pipe_stemmed.score(X_test_stemmed, y_test_stemmed)

bayes_tfidf_pipe_lemmed = clone(bayes_tfidf_pipe).fit(X_train_lemmed, y_train_lemmed)
bayes_tfidf_test_acc_lemmed = bayes_tfidf_pipe_lemmed.score(X_test_lemmed, y_test_lemmed)


In [55]:
tfidf_params = {
                'tfidf__max_features': [100, 500, 1000, 2000],
                'tfidf__stop_words': ['english', None]
                }

In [56]:
# Log TF IDF Stem
log_tfidf_stem_grid = GridSearchCV(log_tfidf_pipe_stemmed, param_grid=tfidf_params)
log_tfidf_stem_grid.fit(X_train_stemmed, y_train_stemmed)
log_tfidf_stem_test_acc = log_tfidf_stem_grid.score(X_test_stemmed, y_test_stemmed)
print(log_tfidf_stem_test_acc)
print(log_tfidf_stem_grid.best_params_)

0.90488
{'tfidf__max_features': 2000, 'tfidf__stop_words': None}


In [57]:
# Log TF IDF Lemm
log_tfidf_lemm_grid = GridSearchCV(log_tfidf_pipe_lemmed, param_grid=tfidf_params)
log_tfidf_lemm_grid.fit(X_train_lemmed, y_train_lemmed)
log_tfidf_lemm_test_acc = log_tfidf_lemm_grid.score(X_test_lemmed, y_test_lemmed)
print(log_tfidf_lemm_test_acc)
print(log_tfidf_lemm_grid.best_params_)

0.90332
{'tfidf__max_features': 2000, 'tfidf__stop_words': None}


In [58]:
# DT TF IDF Stem
dt_tfidf_stem_grid = GridSearchCV(dt_tfidf_pipe_stemmed, param_grid=tfidf_params)
dt_tfidf_stem_grid.fit(X_train_stemmed, y_train_stemmed)
dt_tfidf_stem_test_acc = dt_tfidf_stem_grid.score(X_test_stemmed, y_test_stemmed)
print(dt_tfidf_stem_test_acc)
print(dt_tfidf_stem_grid.best_params_)

0.8454
{'tfidf__max_features': 2000, 'tfidf__stop_words': None}


In [59]:
# DT TF IDF Lemm
dt_tfidf_lemm_grid = GridSearchCV(dt_tfidf_pipe_lemmed, param_grid=tfidf_params)
dt_tfidf_lemm_grid.fit(X_train_lemmed, y_train_lemmed)
dt_tfidf_lemm_test_acc = dt_tfidf_lemm_grid.score(X_test_lemmed, y_test_lemmed)
print(dt_tfidf_lemm_test_acc)
print(dt_tfidf_lemm_grid.best_params_)

0.85072
{'tfidf__max_features': 2000, 'tfidf__stop_words': None}


In [60]:
# Bayes TF IDF Stem
bayes_tfidf_stem_grid = GridSearchCV(bayes_tfidf_pipe_stemmed, param_grid=tfidf_params)
bayes_tfidf_stem_grid.fit(X_train_stemmed, y_train_stemmed)
bayes_tfidf_stem_test_acc = bayes_tfidf_stem_grid.score(X_test_stemmed, y_test_stemmed)
print(bayes_tfidf_stem_test_acc)
print(bayes_tfidf_stem_grid.best_params_)

0.8858
{'tfidf__max_features': 2000, 'tfidf__stop_words': None}


In [61]:
# Bayes TF IDF Lemm
bayes_tfidf_lemm_grid = GridSearchCV(bayes_tfidf_pipe_lemmed, param_grid=tfidf_params)
bayes_tfidf_lemm_grid.fit(X_train_lemmed, y_train_lemmed)
bayes_tfidf_lemm_test_acc = bayes_tfidf_lemm_grid.score(X_test_lemmed, y_test_lemmed)
print(bayes_tfidf_lemm_test_acc)
print(bayes_tfidf_lemm_grid.best_params_)

0.884
{'tfidf__max_features': 2000, 'tfidf__stop_words': None}


## Results Dataframe

In [62]:
pd.DataFrame({'model': ['Logistic CountVectorized Stemmed', 'Logistic CountVectorized Lemmatized', 'Decision Tree CountVectorized Stemmed', 'Decision Tree CountVectorized Lemmatized', 'Bayes CountVectorized Stemmed', 'Bayes CountVectorized Lemmatized',
                        'Logistic TFIDFVectorized Stemmed', 'Logistic TFIDFVectorized Lemmatized', 'Decision Tree TFIDFVectorized Stemmed', 'Decision Tree TFIDFVectorized Lemmatized', 'Bayes TFIDFVectorized Stemmed', 'Bayes TFIDFVectorized Lemmatized'], 
             'best_params': [log_cvect_stem_grid.best_params_, log_cvect_lemm_grid.best_params_, dt_cvect_stem_grid.best_params_, dt_cvect_lemm_grid.best_params_, bayes_cvect_stem_grid.best_params_, bayes_cvect_lemm_grid.best_params_,
                             log_tfidf_stem_grid.best_params_, log_tfidf_lemm_grid.best_params_, dt_tfidf_stem_grid.best_params_, dt_tfidf_lemm_grid.best_params_, bayes_tfidf_stem_grid.best_params_, bayes_tfidf_lemm_grid.best_params_],
             'best_score': [log_cvect_stem_test_acc, log_cvect_lemm_test_acc, dt_cvect_stem_test_acc, dt_cvect_lemm_test_acc, bayes_cvect_stem_test_acc, bayes_cvect_lemm_test_acc, 
                            log_tfidf_stem_test_acc, log_tfidf_lemm_test_acc, dt_tfidf_stem_test_acc, dt_tfidf_lemm_test_acc, bayes_tfidf_stem_test_acc, bayes_tfidf_lemm_test_acc]}).set_index('model')

model,best_params,best_score
Logistic CountVectorized Stemmed,"{'cvect__max_features': 2000, 'cvect__stop_wor...",0.90892
Logistic CountVectorized Lemmatized,"{'cvect__max_features': 2000, 'cvect__stop_wor...",0.90824
Decision Tree CountVectorized Stemmed,"{'cvect__max_features': 2000, 'cvect__stop_wor...",0.85384
Decision Tree CountVectorized Lemmatized,"{'cvect__max_features': 2000, 'cvect__stop_wor...",0.85368
Bayes CountVectorized Stemmed,"{'cvect__max_features': 2000, 'cvect__stop_wor...",0.89056
Bayes CountVectorized Lemmatized,"{'cvect__max_features': 2000, 'cvect__stop_wor...",0.88928
Logistic TFIDFVectorized Stemmed,"{'tfidf__max_features': 2000, 'tfidf__stop_wor...",0.90488
Logistic TFIDFVectorized Lemmatized,"{'tfidf__max_features': 2000, 'tfidf__stop_wor...",0.90332
Decision Tree TFIDFVectorized Stemmed,"{'tfidf__max_features': 2000, 'tfidf__stop_wor...",0.8454
Decision Tree TFIDFVectorized Lemmatized,"{'tfidf__max_features': 2000, 'tfidf__stop_wor...",0.85072
