### Comparing Models and Vectorization Strategies for Text Classification

This try-it focuses on weighing the pros and cons of different estimators and vectorization strategies for a text classification problem. To compare these components, use the `Pipeline` and `GridSearchCV` objects in scikit-learn to try different combinations of vectorizers and estimators. For each combination, also use the `cv_results_` attribute of the fitted search to examine how long the estimator took to fit the data.
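As a minimal sketch of that workflow, the cell below runs a grid search over a tiny made-up corpus and reads the per-candidate fit times from `cv_results_`. The names `toy_texts`, `toy_labels`, `toy_pipe`, and `toy_grid` are illustrative assumptions, and the sentences are not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus for illustration only (1 = humorous, 0 = not humorous).
toy_texts = [
    "why did the chicken cross the road",
    "what do you call a fish with no eyes",
    "knock knock who is there",
    "a horse walks into a bar",
    "senate votes on new budget bill",
    "stocks fall after earnings report",
    "city council approves new park",
    "storm expected to hit coast friday",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Step names become prefixes in the parameter grid: <step>__<parameter>.
toy_pipe = Pipeline([
    ("cvect", CountVectorizer()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

toy_grid = GridSearchCV(
    toy_pipe,
    param_grid={"cvect__stop_words": ["english", None]},
    scoring="accuracy",
    cv=2,
)
toy_grid.fit(toy_texts, toy_labels)

# mean_fit_time holds one entry per parameter candidate, in seconds.
for params, fit_time in zip(toy_grid.cv_results_["params"],
                            toy_grid.cv_results_["mean_fit_time"]):
    print(params, f"{fit_time:.4f}s")

print(toy_grid.best_params_, toy_grid.best_score_)
```

With only eight sentences the scores themselves mean little; the point is where the timing and best-parameter information lives on the fitted search object.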

### The Data

The data below comes from [kaggle]() and is the "ColBert Dataset" created for this [paper](https://arxiv.org/pdf/2004.12765.pdf). Use the text column to classify whether or not the text is humorous. The data is loaded and displayed below.

**Note:** The original dataset contains 200K rows of data. It is best to use the full dataset. If the original dataset is too large for your computer, please use 'dataset-minimal.csv', which has been reduced to 100K rows.

In [27]:
import string
import numpy as np
import pandas as pd
from nltk import word_tokenize
from sklearn.pipeline import Pipeline
from plotly.figure_factory import create_table
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

import warnings

warnings.filterwarnings("ignore")

In [28]:
df = pd.read_csv('data/dataset.csv')
df.rename(columns={'text': 'content'}, inplace=True)

In [29]:
df.head()

Unnamed: 0,content,humor
0,"Joe biden rules out 2020 bid: 'guys, i'm not r...",False
1,Watch: darvish gave hitter whiplash with slow ...,False
2,What do you call a turtle without its shell? d...,True
3,5 reasons the 2016 election feels so personal,False
4,"Pasco police shot mexican migrant from behind,...",False


#### Task


**Text preprocessing:** As a pre-processing step, perform both `stemming` and `lemmatizing` to normalize your text before classifying. For each technique, use both the `CountVectorizer` and the `TfidfVectorizer`, and use the options for stop words and max features to prepare the text data for your estimator.
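To see what the stop word and max feature options do to the vocabulary, here is a small sketch on three made-up sentences. The variable names (`sample_docs`, `full`, `no_stop`, `capped`) are assumptions for illustration, and stemming and lemmatizing are left out to keep the example free of NLTK downloads.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up sentences for illustration only.
sample_docs = [
    "what do you call a turtle without its shell",
    "the turtle and the hare ran a race",
    "the election feels so personal",
]

# Default settings: every token of two or more characters enters the vocabulary.
full = CountVectorizer().fit(sample_docs)

# stop_words='english' drops common words such as "the" and "what".
no_stop = CountVectorizer(stop_words="english").fit(sample_docs)

# max_features keeps only the most frequent terms across the corpus.
capped = TfidfVectorizer(stop_words="english", max_features=3).fit(sample_docs)

print(len(full.vocabulary_), sorted(full.vocabulary_))
print(len(no_stop.vocabulary_), sorted(no_stop.vocabulary_))
print(len(capped.vocabulary_), sorted(capped.vocabulary_))
```

Both vectorizers accept the same two options, so the same parameter values can be reused in a grid for either one.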

**Classification:** Once you have prepared the text data with the stemming and lemmatizing techniques, consider `LogisticRegression`, `DecisionTreeClassifier`, and `MultinomialNB` as classification algorithms for the data. Compare their performance in terms of accuracy and speed.

Share your results in the form of a table that lists, for the best version of each estimator, a dictionary of the best parameters and the best score.
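One way to assemble such a table is to collect `best_params_` and `best_score_` from one fitted `GridSearchCV` per estimator. The sketch below does this on a tiny made-up corpus; the names `demo_texts`, `demo_labels`, `demo_estimators`, and `summary` are hypothetical, and the resulting numbers say nothing about the real dataset.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Made-up corpus for illustration only (1 = humorous, 0 = not humorous).
demo_texts = [
    "why did the chicken cross the road",
    "what do you call a fish with no eyes",
    "knock knock who is there",
    "a horse walks into a bar",
    "senate votes on new budget bill",
    "stocks fall after earnings report",
    "city council approves new park",
    "storm expected to hit coast friday",
]
demo_labels = [1, 1, 1, 1, 0, 0, 0, 0]

demo_estimators = {
    "Logistic": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Bayes": MultinomialNB(),
}

rows = []
for name, estimator in demo_estimators.items():
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("classifier", estimator)])
    search = GridSearchCV(pipe, {"tfidf__stop_words": ["english", None]},
                          scoring="accuracy", cv=2)
    search.fit(demo_texts, demo_labels)
    rows.append({
        "model": name,
        "best_params": search.best_params_,
        "best_score": search.best_score_,
        # Fit time (seconds) of the winning candidate only.
        "mean_fit_time": search.cv_results_["mean_fit_time"][search.best_index_],
    })

summary = pd.DataFrame(rows).set_index("model")
print(summary)
```

Reporting the fit time of the winning candidate, rather than an average over all candidates, is a design choice; either is reasonable as long as the table says which one it shows.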

In [30]:
pd.DataFrame({'model': ['Logistic', 'Decision Tree', 'Bayes'], 
             'best_params': ['', '', ''],
             'best_score': ['', '', '']}).set_index('model')

model,best_params,best_score
Logistic,,
Decision Tree,,
Bayes,,


### Pre-processing

In [31]:
def preprocess_text(text):
    """Tokenize, drop punctuation tokens, then stem and lemmatize each word."""
    # NLTK data needed: the tokenizer models used by word_tokenize and the
    # 'wordnet' corpus used by WordNetLemmatizer (install with nltk.download)
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    # Stop words are not removed here; that is left to the vectorizers'
    # stop_words option, which is tuned in the grid search
    tokens = word_tokenize(text)
    stemmed = [stemmer.stem(word) for word in tokens if word not in string.punctuation]
    lemmatized = [lemmatizer.lemmatize(word) for word in stemmed]

    return ' '.join(lemmatized)

In [32]:
df['content'] = df['content'].apply(preprocess_text)

In [33]:
df.head()

Unnamed: 0,content,humor
0,joe biden rule out 2020 bid 'guy i 'm not run,False
1,watch darvish gave hitter whiplash with slow p...,False
2,what do you call a turtl without it shell dead,True
3,5 reason the 2016 elect feel so person,False
4,pasco polic shot mexican migrant from behind n...,False


### Split the Dataset into Train and Test Subsets

In [34]:
X = df.drop('humor', axis=1)
y = df['humor']

In [35]:
X_train, X_test, y_train, y_test = train_test_split(X['content'], y, random_state = 42)

In [36]:
X_train.shape

(150000,)

##### Count and TF-IDF Vectorizer Pipelines

In [37]:
pipelines = {
    'lr_cv': Pipeline([('cvect', CountVectorizer()), ('classifier', LogisticRegression(solver='lbfgs', max_iter=100))]),
#    'knn_cv': Pipeline([('cvect', CountVectorizer()), ('classifier', KNeighborsClassifier())]),
#    'dt_cv': Pipeline([('cvect', CountVectorizer()), ('classifier', DecisionTreeClassifier())]),
#    'svm_cv': Pipeline([('cvect', CountVectorizer()), ('classifier', SVC())]),
#    'gnb_cv': Pipeline([('cvect', CountVectorizer()), ('classifier', GaussianNB())]),
    'mnb_cv': Pipeline([('cvect', CountVectorizer()), ('classifier', MultinomialNB())]),
    'lr_tf': Pipeline([('tfidf', TfidfVectorizer()), ('classifier', LogisticRegression(solver='lbfgs', max_iter=100))]),
    #    'knn_tf': Pipeline([('tfidf', TfidfVectorizer()), ('classifier', KNeighborsClassifier())]),
    #    'dt_tf': Pipeline([('tfidf', TfidfVectorizer()), ('classifier', DecisionTreeClassifier())]),
    #    'svm_tf': Pipeline([('tfidf', TfidfVectorizer()), ('classifier', SVC())]),
    #    'gnb_tf': Pipeline([('tfidf', TfidfVectorizer()), ('classifier', GaussianNB())]),
    'mnb_tf': Pipeline([('tfidf', TfidfVectorizer()), ('classifier', MultinomialNB())]),
}

In [38]:
params_cv = {'cvect__max_features': [100, 500, 1000, 2000],
          'cvect__stop_words': ['english', None]}

params_tf = {'tfidf__max_features': [100, 500, 1000, 2000],
          'tfidf__stop_words': ['english', None]}
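Before launching the search, it can help to count how many model fits each grid implies. The sketch below uses `ParameterGrid` from scikit-learn; `grid_cv`, `n_candidates`, and `n_folds` are illustrative names, with the values copied from the grid above and 5 folds assumed.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the count-vectorizer grid above.
grid_cv = {"cvect__max_features": [100, 500, 1000, 2000],
           "cvect__stop_words": ["english", None]}

n_candidates = len(ParameterGrid(grid_cv))  # 4 * 2 = 8 parameter combinations
n_folds = 5                                 # assumed number of CV folds

# Each candidate is fit once per fold; the final refit on all training data adds one more.
print(n_candidates, n_candidates * n_folds)  # 8 40
```

With four pipelines, that is roughly 160 cross-validation fits on the training set, which is worth knowing before starting a long-running cell.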

In [43]:
report_data = {'Model': [], 'best_parameters': [], 'best_score': [], 'fit_time': []}
for pipeline in pipelines.values():
    # The name of the first step tells us which vectorizer and parameter grid apply
    vect_name = pipeline.steps[0][0]
    params = params_cv if vect_name == 'cvect' else params_tf
    grid = GridSearchCV(pipeline, param_grid=params, scoring='accuracy', cv=5)
    grid.fit(X_train, y_train)
    best_score = grid.best_score_
    best_parameters = grid.best_params_
    # Average fit time over all parameter candidates, in seconds
    fit_time = np.mean(grid.cv_results_['mean_fit_time'])
    # Build the names once; nesting the same quote type inside an f-string
    # is a syntax error before Python 3.12
    model_name = pipeline.named_steps['classifier'].__class__.__name__
    vect_label = 'CountVector' if vect_name == 'cvect' else 'TfidfVector'
    report_data['Model'].append(f'{model_name}_{vect_label}')
    report_data['best_parameters'].append(best_parameters)
    report_data['best_score'].append(best_score)
    report_data['fit_time'].append(fit_time)
    print(f'Best hyperparameters for {model_name}: {best_parameters}')
    print(f'The {model_name} model accuracy score: {best_score}')
    print(f'The {model_name} model fit time: {fit_time}')

KeyboardInterrupt: 

In [None]:
df_scores = pd.DataFrame.from_dict(report_data)
df_scores.set_index('Model', inplace=True)
df_scores.head()

In [ ]:
create_table(df_scores, index_title='Model',index=True)