### Comparing Models and Vectorization Strategies for Text Classification

This try-it focuses on weighing the positives and negatives of different estimators and vectorization strategies for a text classification problem.  In order to consider each of these components, you should make use of the `Pipeline` and `GridSearchCV` objects in scikitlearn to try different combinations of vectorizers with different estimators.  For each of these, you also want to use the `.cv_results_` to examine the time for the estimator to fit the data.

### The Data

The dataset below is from [kaggle]() and contains a dataset named the "ColBert Dataset" created for this [paper](https://arxiv.org/pdf/2004.12765.pdf).  You are to use the text column to classify whether or not the text was humorous.  It is loaded and displayed below.

**Note:** The original dataset contains 200K rows of data. It is best to try to use the full dtaset. If the original dataset is too large for your computer, please use the 'dataset-minimal.csv', which has been reduced to 100K.

In [1]:
import pandas as pd
import nltk
from nltk import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ryans\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ryans\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ryans\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
df = pd.read_csv('dataset.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'dataset.csv'

In [3]:
df.head()

Unnamed: 0,text,humor
0,"Joe biden rules out 2020 bid: 'guys, i'm not r...",False
1,Watch: darvish gave hitter whiplash with slow ...,False
2,What do you call a turtle without its shell? d...,True
3,5 reasons the 2016 election feels so personal,False
4,"Pasco police shot mexican migrant from behind,...",False


### Data Split

In [4]:
df.shape

(200000, 2)

In [5]:
X = df.drop('humor', axis = 1)
y = df['humor']

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X['text'], y,test_size=0.25, random_state=24)

len(X_train), len(X_test), len(y_train), len(y_test)

(150000, 50000, 150000, 50000)

#### Task


**Text preprocessing:** As a pre-processing step, perform both `stemming` and `lemmatizing` to normalize your text before classifying. For each technique use both the `CountVectorize`r and `TfidifVectorizer` and use options for stop words and max features to prepare the text data for your estimator.

**Classification:** Once you have prepared the text data with stemming lemmatizing techniques, consider `LogisticRegression`, `DecisionTreeClassifier`, and `MultinomialNB` as classification algorithms for the data. Compare their performance in terms of accuracy and speed.

Share the results of your best classifier in the form of a table with the best version of each estimator, a dictionary of the best parameters and the best score.

### Preprocessing

In [7]:
def stemmer(text):
    stem = PorterStemmer()
    return ' '.join([stem.stem(word) for word in word_tokenize(text)])

In [8]:
def lemmatize(text):
    lemma = WordNetLemmatizer()
    return ' '.join([lemma.lemmatize(word) for word in word_tokenize(text)])

In [9]:
#df['text_stem'] = df['text'].apply(stemmer)
#df['text_lemma'] = df['text'].apply(lemmatize)

In [10]:
X_train_pp_stem = X_train.apply(stemmer)
X_train_pp_lemma = X_train.apply(lemmatize)
X_test_pp_stem = X_test.apply(stemmer)
X_test_pp_lemma = X_test.apply(lemmatize)

In [11]:
X_train.head(5)

Unnamed: 0,text
198244,I knew i was in trouble when the lady doing my...
111046,"How do gingers make friends? no seriously, im ..."
31297,What's the difference between 9/11 and a cow? ...
106857,Did at&t commit perjury? does at&t have 100 pe...
192741,Donald trump to meet with editors of new yorke...


In [12]:
X_train_pp_stem.head(5)

Unnamed: 0,text
198244,i knew i wa in troubl when the ladi do my nail...
111046,"how do ginger make friend ? no serious , im ge..."
31297,what 's the differ between 9/11 and a cow ? yo...
106857,did at & t commit perjuri ? doe at & t have 10...
192741,donald trump to meet with editor of new yorker...


In [13]:
X_train_pp_lemma.head(5)

Unnamed: 0,text
198244,I knew i wa in trouble when the lady doing my ...
111046,"How do ginger make friend ? no seriously , im ..."
31297,What 's the difference between 9/11 and a cow ...
106857,Did at & t commit perjury ? doe at & t have 10...
192741,Donald trump to meet with editor of new yorker...


### Classification

#### Logistic Regression

In [16]:
params = {'cvect__max_features': [100, 500, 1000, 2000],'cvect__stop_words': ['english', None]}

cv_lr_stem_pipe = Pipeline([('cvect', CountVectorizer()),('lgr', LogisticRegression(max_iter=1000))])
cv_lr_stem_grid = GridSearchCV(cv_lr_stem_pipe, param_grid=params)
cv_lr_stem_grid.fit(X_train_pp_stem, y_train)
cv_lr_stem_acc = cv_lr_stem_grid.score(X_test_pp_stem, y_test)

In [29]:
print(f"Logistic Regression With Bag Of Words Model (Stemming) : Best_Score {cv_lr_stem_acc:.3f}, Best_Params { cv_lr_stem_grid.best_params_}")

Logistic Regression With Bag Of Words Model (Stemming) : Best_Score 0.913, Best_Params {'cvect__max_features': 2000, 'cvect__stop_words': None}


In [18]:
params = {'cvect__max_features': [100, 500, 1000, 2000],'cvect__stop_words': ['english', None]}

cv_lr_lemma_pipe = Pipeline([('cvect', CountVectorizer()),('lgr', LogisticRegression(max_iter=1000))])
cv_lr_lemma_grid = GridSearchCV(cv_lr_lemma_pipe, param_grid=params)
cv_lr_lemma_grid.fit(X_train_pp_lemma, y_train)
cv_lr_lemma_acc = cv_lr_stem_grid.score(X_test_pp_lemma, y_test)

In [28]:
print(f"Logistic Regression With Bag Of Words Model (Lemmatizing) : Best_Score {cv_lr_lemma_acc:.3f}, Best_Params { cv_lr_lemma_grid.best_params_}")

Logistic Regression With Bag Of Words Model (Lemmatizing) : Best_Score 0.891, Best_Params {'cvect__max_features': 2000, 'cvect__stop_words': None}


In [21]:
params = {'tfidf__max_features': [100, 500, 1000, 2000],'tfidf__stop_words': ['english', None]}

tf_lr_stem_pipe = Pipeline([('tfidf', TfidfVectorizer()),('lgr', LogisticRegression(max_iter=1000))])
tf_lr_stem_grid = GridSearchCV(tf_lr_stem_pipe, param_grid=params)
tf_lr_stem_grid.fit(X_train_pp_stem, y_train)
tf_lr_stem_acc = tf_lr_stem_grid.score(X_test_pp_stem, y_test)

In [26]:
print(f"Logistic Regression With TF-IDF Model (Stemming) : Best_Score {tf_lr_stem_acc:.3f}, Best_Params { tf_lr_stem_grid.best_params_}")

Logistic Regression With TF-IDF Model (Stemming) : Best_Score 0.910, Best_Params {'tfidf__max_features': 2000, 'tfidf__stop_words': None}


In [23]:
params = {'tfidf__max_features': [100, 500, 1000, 2000],'tfidf__stop_words': ['english', None]}

tf_lr_lemma_pipe = Pipeline([('tfidf', TfidfVectorizer()),('lgr', LogisticRegression(max_iter=1000))])
tf_lr_lemma_grid = GridSearchCV(tf_lr_lemma_pipe, param_grid=params)
tf_lr_lemma_grid.fit(X_train_pp_lemma, y_train)
tf_lr_lemma_acc = tf_lr_lemma_grid.score(X_test_pp_lemma, y_test)

In [27]:
print(f"Logistic Regression With TF-IDF Model (Lemmatizing) : Best_Score {tf_lr_lemma_acc:.3f}, Best_Params { tf_lr_lemma_grid.best_params_}")

Logistic Regression With TF-IDF Model (Lemmatizing) : Best_Score 0.910, Best_Params {'tfidf__max_features': 2000, 'tfidf__stop_words': None}


#### Decision Tree

In [33]:
params = {'cvect__max_features': [100, 500, 1000, 2000],'cvect__stop_words': ['english', None]}

cv_dt_stem_pipe = Pipeline([('cvect', CountVectorizer()),('dt', DecisionTreeClassifier())])
cv_dt_stem_grid = GridSearchCV(cv_dt_stem_pipe, param_grid=params)
cv_dt_stem_grid.fit(X_train_pp_stem, y_train)
cv_dt_stem_acc = cv_lr_stem_grid.score(X_test_pp_stem, y_test)

In [34]:
print(f"Decision Tree With Bag Of Words Model (Stemming) : Best_Score {cv_dt_stem_acc:.3f}, Best_Params { cv_dt_stem_grid.best_params_}")

Decision Tree With Bag Of Words Model (Stemming) : Best_Score 0.913, Best_Params {'cvect__max_features': 2000, 'cvect__stop_words': None}


In [35]:
params = {'cvect__max_features': [100, 2000],'cvect__stop_words': ['english', None]}

cv_dt_lemma_pipe = Pipeline([('cvect', CountVectorizer()),('dt', DecisionTreeClassifier())])
cv_dt_lemma_grid = GridSearchCV(cv_dt_lemma_pipe, param_grid=params)
cv_dt_lemma_grid.fit(X_train_pp_lemma, y_train)
cv_dt_lemma_acc = cv_lr_stem_grid.score(X_test_pp_lemma, y_test)

In [37]:
print(f"Decision Tree With Bag Of Words Model (Lemmatizing) : Best_Score {cv_dt_lemma_acc:.3f}, Best_Params { cv_dt_lemma_grid.best_params_}")

Decision Tree With Bag Of Words Model (Lemmatizing) : Best_Score 0.891, Best_Params {'cvect__max_features': 2000, 'cvect__stop_words': None}


In [38]:
params = {'tfidf__max_features': [2000],'tfidf__stop_words': ['english', None]}

tf_dt_stem_pipe = Pipeline([('tfidf', TfidfVectorizer()),('dt', DecisionTreeClassifier())])
tf_dt_stem_grid = GridSearchCV(tf_dt_stem_pipe, param_grid=params)
tf_dt_stem_grid.fit(X_train_pp_stem, y_train)
tf_lr_stem_acc = tf_dt_stem_grid.score(X_test_pp_stem, y_test)

In [39]:
print(f"Decision Tree With TF-IDF Model (Stemming) : Best_Score {tf_lr_stem_acc:.3f}, Best_Params { tf_dt_stem_grid.best_params_}")

Decision Tree With TF-IDF Model (Stemming) : Best_Score 0.859, Best_Params {'tfidf__max_features': 2000, 'tfidf__stop_words': None}


In [None]:
params = {'tfidf__max_features': [2000],'tfidf__stop_words': ['english', None]}

tf_dt_lemma_pipe = Pipeline([('tfidf', TfidfVectorizer()),('dt', DecisionTreeClassifier())])
tf_dt_lemma_grid = GridSearchCV(tf_dt_lemma_pipe, param_grid=params)
tf_dt_lemma_grid.fit(X_train_pp_lemma, y_train)
tf_lr_lemma_acc = tf_dt_lemma_grid.score(X_test_pp_lemma, y_test)

In [None]:
print(f"Decision Tree With TF-IDF Model (Lemmatizing) : Best_Score {tf_lr_lemma_acc:.3f}, Best_Params { tf_dt_lemma_grid.best_params_}")

Logistic Regression With TF-IDF Model (Lemmatizing) : Best_Score 0.910, Best_Params {'tfidf__max_features': 2000, 'tfidf__stop_words': None}


#### Naive Bayes

In [44]:
params = {'cvect__max_features': [100, 500, 1000, 2000],'cvect__stop_words': ['english', None]}

cv_nv_stem_pipe = Pipeline([('cvect', CountVectorizer()),('bayes', MultinomialNB())])
cv_nv_stem_grid = GridSearchCV(cv_nv_stem_pipe, param_grid=params)
cv_nv_stem_grid.fit(X_train_pp_stem, y_train)
cv_nv_stem_acc = cv_nv_stem_grid.score(X_test_pp_stem, y_test)

In [45]:
print(f"Naive Bayes With Bag Of Words Model (Stemming) : Best_Score {cv_nv_stem_acc:.3f}, Best_Params { cv_nv_stem_grid.best_params_}")

Naive Bayes With Bag Of Words Model (Stemming) : Best_Score 0.892, Best_Params {'cvect__max_features': 2000, 'cvect__stop_words': None}


In [46]:
params = {'cvect__max_features': [100, 500, 1000, 2000],'cvect__stop_words': ['english', None]}

cv_nv_lemma_pipe = Pipeline([('cvect', CountVectorizer()),('bayes', MultinomialNB())])
cv_nv_lemma_grid = GridSearchCV(cv_nv_lemma_pipe, param_grid=params)
cv_nv_lemma_grid.fit(X_train_pp_lemma, y_train)
cv_nv_lemma_acc = cv_nv_lemma_grid.score(X_test_pp_lemma, y_test)

In [49]:
print(f"Naive Bayes With Bag Of Words Model (Lemmatizing) : Best_Score {cv_nv_lemma_acc:.3f}, Best_Params { cv_nv_lemma_grid.best_params_}")

Naive Bayes With Bag Of Words Model (Lemmatizing) : Best_Score 0.891, Best_Params {'cvect__max_features': 2000, 'cvect__stop_words': None}


In [50]:
params = {'tfidf__max_features': [100, 500, 1000, 2000],'tfidf__stop_words': ['english', None]}

cv_nv_stem_pipe = Pipeline([('tfidf', TfidfVectorizer()),('bayes', MultinomialNB())])
cv_nv_stem_grid = GridSearchCV(cv_nv_stem_pipe, param_grid=params)
cv_nv_stem_grid.fit(X_train_pp_stem, y_train)
cv_nv_stem_acc = cv_nv_stem_grid.score(X_test_pp_stem, y_test)

In [51]:
print(f"Naive Bayes With TF-IDF Model (Stemming) : Best_Score {cv_nv_stem_acc:.3f}, Best_Params { cv_nv_stem_grid.best_params_}")

Naive Bayes With TF-IDF Model (Stemming) : Best_Score 0.888, Best_Params {'tfidf__max_features': 2000, 'tfidf__stop_words': None}


In [52]:
params = {'tfidf__max_features': [100, 500, 1000, 2000],'tfidf__stop_words': ['english', None]}

tf_nv_lemma_pipe = Pipeline([('tfidf', TfidfVectorizer()),('bayes', MultinomialNB())])
tf_nv_lemma_grid = GridSearchCV(tf_nv_lemma_pipe, param_grid=params)
tf_nv_lemma_grid.fit(X_train_pp_lemma, y_train)
tf_nv_lemma_acc = tf_nv_lemma_grid.score(X_test_pp_lemma, y_test)

In [53]:
print(f"Naive Bayes With TF-IDF Model (Lemmatizing) : Best_Score {tf_nv_lemma_acc:.3f}, Best_Params { tf_nv_lemma_grid.best_params_}")

Naive Bayes With TF-IDF Model (Lemmatizing) : Best_Score 0.885, Best_Params {'tfidf__max_features': 2000, 'tfidf__stop_words': None}


In [56]:
pd.DataFrame({'model': ['Logistic', 'Decision Tree', 'Bayes'],
             'best_params': ["{'max_features': 2000, 'stop_words': None}", "{'max_features': 2000, 'stop_words': None}", "{'max_features': 2000, 'stop_words': None}"],
             'best_score': ['0.913', '0.891', '0.892'],
             'best_version': ['Bag Of Words (CountVectorizer)', 'Bag Of Words (CountVectorizer)', 'Bag Of Words (CountVectorizer)'],
              'best time': ['2 minutes','18 minutes','1 minute']
              }).set_index('model')

Unnamed: 0_level_0,best_params,best_score,best_version,best time
model,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Logistic,"{'max_features': 2000, 'stop_words': None}",0.913,Bag Of Words (CountVectorizer),2 minutes
Decision Tree,"{'max_features': 2000, 'stop_words': None}",0.891,Bag Of Words (CountVectorizer),18 minutes
Bayes,"{'max_features': 2000, 'stop_words': None}",0.892,Bag Of Words (CountVectorizer),1 minute
