### Comparing Models and Vectorization Strategies for Text Classification

This try-it focuses on weighing the positives and negatives of different estimators and vectorization strategies for a text classification problem.  In order to consider each of these components, you should make use of the `Pipeline` and `GridSearchCV` objects in scikitlearn to try different combinations of vectorizers with different estimators.  For each of these, you also want to use the `.cv_results_` to examine the time for the estimator to fit the data.

### The Data

The dataset below is from [kaggle]() and contains a dataset named the "ColBert Dataset" created for this [paper](https://arxiv.org/pdf/2004.12765.pdf).  You are to use the text column to classify whether or not the text was humorous.  It is loaded and displayed below.

**Note:** The original dataset contains 200K rows of data. It is best to try to use the full dtaset. If the original dataset is too large for your computer, please use the 'dataset-minimal.csv', which has been reduced to 100K.

In [221]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.ensemble import RandomForestClassifier

from nltk.stem import PorterStemmer
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [138]:
df = pd.read_csv('text_data/dataset.csv')

In [139]:
df.head()

Unnamed: 0,text,humor
0,"Joe biden rules out 2020 bid: 'guys, i'm not r...",False
1,Watch: darvish gave hitter whiplash with slow ...,False
2,What do you call a turtle without its shell? d...,True
3,5 reasons the 2016 election feels so personal,False
4,"Pasco police shot mexican migrant from behind,...",False


#### Task


**Text preprocessing:** As a pre-processing step, perform both `stemming` and `lemmatizing` to normalize your text before classifying. For each technique use both the `CountVectorize`r and `TfidifVectorizer` and use options for stop words and max features to prepare the text data for your estimator.

**Classification:** Once you have prepared the text data with stemming lemmatizing techniques, consider `LogisticRegression`, `DecisionTreeClassifier`, and `MultinomialNB` as classification algorithms for the data. Compare their performance in terms of accuracy and speed.

Share the results of your best classifier in the form of a table with the best version of each estimator, a dictionary of the best parameters and the best score.

encoding humor column using pandas factorize function 

In [140]:
df['humor'] = pd.factorize(df['humor'])[0]
df['humor']

0         0
1         0
2         1
3         0
4         0
         ..
199995    0
199996    1
199997    1
199998    0
199999    1
Name: humor, Length: 200000, dtype: int64

function to stem data 

In [141]:
def stemmer(text):
    stem = PorterStemmer()
    return ' '.join([stem.stem(w) for w in word_tokenize(text)])

In [142]:
stemmed_text = df['text'].apply(stemmer)

In [143]:
stemmed_text

0         joe biden rule out 2020 bid : 'guy , i 'm not ...
1         watch : darvish gave hitter whiplash with slow...
2         what do you call a turtl without it shell ? de...
3                    5 reason the 2016 elect feel so person
4         pasco polic shot mexican migrant from behind ,...
                                ...                        
199995    conor maynard seamlessli fit old-school r & b ...
199996    how to you make holi water ? you boil the hell...
199997    how mani optometrist doe it take to screw in a...
199998    mcdonald 's will offici kick off all-day break...
199999    an irish man walk on the street and ignor a ba...
Name: text, Length: 200000, dtype: object

funciton to Lemmatize text 

In [144]:
def lemma(text):
    lemmatizer = WordNetLemmatizer()
    return ' '.join([lemmatizer.lemmatize    (w) for w in word_tokenize(text)])

In [145]:
lemma_text = df['text'].apply(lemma)

In [146]:
lemma_text

0         Joe biden rule out 2020 bid : 'guys , i 'm not...
1         Watch : darvish gave hitter whiplash with slow...
2         What do you call a turtle without it shell ? d...
3               5 reason the 2016 election feel so personal
4         Pasco police shot mexican migrant from behind ...
                                ...                        
199995    Conor maynard seamlessly fit old-school r & b ...
199996    How to you make holy water ? you boil the hell...
199997    How many optometrist doe it take to screw in a...
199998    Mcdonald 's will officially kick off all-day b...
199999    An irish man walk on the street and ignores a ...
Name: text, Length: 200000, dtype: object

# splittng the data into text train sets for stem

In [133]:
X_stem = stemmed_text
y = df['humor']

X_stem_train, X_stem_test, y_train, y_test = train_test_split(X_stem, y, test_size= 0.2, random_state= 42)

# splitting the data into train and test for lemmantizer

In [147]:
X_lemma = lemma_text
y = df['humor']

X_lemma_train, X_lemma_test, y_train, y_test = train_test_split(X_lemma, y, test_size= 0.2, random_state= 42)

# CountVectorizer + logistic regression for stemming with pipeline and GridsearchCV 

In [235]:
vect_pipe_logi = Pipeline([('vect', CountVectorizer(max_features=1000, stop_words = 'english')),
                       ('logi', LogisticRegression(max_iter=1000, solver= 'liblinear', random_state= 42))])

params_vect_logi = {'logi__penalty' : ['l1','l2'], 
                  'logi__C'       : np.logspace(-3,3,7)}


In [236]:
grid_vect_pipe_logi_stem = GridSearchCV(vect_pipe_logi, param_grid = params_vect_logi)
grid_vect_pipe_logi_stem.fit(X_stem_train, y_train)
test_score_vect_logi_stem = grid_vect_pipe_logi_stem.score(X_stem_test, y_test)

print('Test score with Logistic Regression on stemmed text and CountVectorizer:', test_score_vect_logi_stem)

Test score with Logistic Regression on stemmed text and CountVectorizer: 0.841025


In [237]:
print('best_score_LR_step: ', grid_vect_pipe_logi_stem.best_score_)
print(" ")
print('cv_results_LR_step: ', grid_vect_pipe_logi_stem.cv_results_)
print(" ")
print('get_params_LR_step: ',grid_vect_pipe_logi_stem.get_params)
print(" ")
print('predict_proba_LR_step:', grid_vect_pipe_logi_stem.predict_proba)
print(" ")
print('best_params_LR_step :', grid_vect_pipe_logi_stem.best_params_ )


best_score_LR_step:  0.837275
 
cv_results_LR_step:  {'mean_fit_time': array([1.06186023, 1.06010809, 1.04157467, 1.07797341, 1.0865943 ,
       1.15008526, 1.07936502, 1.27405901, 1.07435842, 1.43321724,
       1.07599201, 1.65512667, 1.07207575, 1.67797041]), 'std_fit_time': array([0.0275226 , 0.01263166, 0.01143579, 0.00506952, 0.01956631,
       0.00803853, 0.02010524, 0.03306385, 0.00998632, 0.03460773,
       0.00366258, 0.04636879, 0.00532559, 0.04597968]), 'mean_score_time': array([0.23044219, 0.22313571, 0.22480512, 0.22287779, 0.22704682,
       0.2224    , 0.22540259, 0.2240026 , 0.22406545, 0.22535434,
       0.22527575, 0.2267952 , 0.22439933, 0.22699618]), 'std_score_time': array([0.01225924, 0.00169614, 0.00265604, 0.00184226, 0.00184092,
       0.00198335, 0.00188803, 0.00249301, 0.00157104, 0.00201366,
       0.0018713 , 0.0041223 , 0.00179024, 0.00387541]), 'param_logi__C': masked_array(data=[0.001, 0.001, 0.01, 0.01, 0.1, 0.1, 1.0, 1.0, 10.0,
                   10.0,

# Tfidf + logitic regression for stem with pipeline and GridSearchCV

In [238]:
tfidf_pipe = Pipeline([('tfidf', TfidfVectorizer(max_features= 1000, stop_words= 'english')),
                            ('logi', LogisticRegression(max_iter= 1000, solver= 'liblinear' ,random_state= 42))])
tfidf_params = {'logi__penalty': ['l1', 'l2'], 
                'logi__C': np.logspace(-3,3,7) }

params_vect_logi = {'logi__penalty' : ['l1','l2'], 
                  'logi__C'       : np.logspace(-3,3,7)}

In [239]:
grid_tfidf_pipe_logi_stem = GridSearchCV(tfidf_pipe, param_grid= tfidf_params)
grid_tfidf_pipe_logi_stem.fit(X_stem_train, y_train)
grid_tfidf_pipe_logi_stem.score(X_stem_test, y_test)
print('Test score with Logistic Regression on stemmed text and Tfidf:', grid_tfidf_pipe_logi_stem)

Test score with Logistic Regression on stemmed text and Tfidf: GridSearchCV(estimator=Pipeline(steps=[('tfidf',
                                        TfidfVectorizer(max_features=1000,
                                                        stop_words='english')),
                                       ('logi',
                                        LogisticRegression(max_iter=1000,
                                                           random_state=42,
                                                           solver='liblinear'))]),
             param_grid={'logi__C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03]),
                         'logi__penalty': ['l1', 'l2']})


In [240]:
print('best_score_tiidf_logi_step: ', grid_tfidf_pipe_logi_stem.best_score_)
print(" ")
print('cv_results_tfidf_logi_step: ', grid_tfidf_pipe_logi_stem.cv_results_)
print(" ")
print('get_params_tfidf_logi_step: ',grid_tfidf_pipe_logi_stem.get_params)
print(" ")
print('predict_proba_tfidf_logi_step:', grid_tfidf_pipe_logi_stem.predict_proba)
print(" ")
print('best_params_tfidf_logi_step :', grid_tfidf_pipe_logi_stem.best_params_ )

best_score_tiidf_logi_step:  0.8403937499999999
 
cv_results_tfidf_logi_step:  {'mean_fit_time': array([1.071351  , 1.04827867, 1.03607264, 1.06717935, 1.11909685,
       1.0793622 , 1.08110299, 1.22582564, 1.14199638, 1.31118197,
       1.094771  , 1.42250419, 1.07952132, 1.44474163]), 'std_fit_time': array([0.05820728, 0.03760584, 0.04642379, 0.05343631, 0.02502679,
       0.02790069, 0.0122472 , 0.05735116, 0.03301738, 0.01948987,
       0.03871906, 0.01677982, 0.03343963, 0.0439909 ]), 'mean_score_time': array([0.23536215, 0.22997642, 0.22752261, 0.228864  , 0.23482704,
       0.22761226, 0.22490554, 0.23599763, 0.23421283, 0.23959236,
       0.2287858 , 0.22676735, 0.22731938, 0.2247602 ]), 'std_score_time': array([0.00675636, 0.00690087, 0.00799636, 0.00919379, 0.01192795,
       0.00222562, 0.00225433, 0.01054513, 0.00313514, 0.01082954,
       0.00256676, 0.00197121, 0.00232754, 0.00129493]), 'param_logi__C': masked_array(data=[0.001, 0.001, 0.01, 0.01, 0.1, 0.1, 1.0, 1.0, 10.0

# CountVectorizer + logistic Rgression for Lemma with pipeline and gridsearchCV

In [241]:
grid_vect_pipe_LR_lemma = GridSearchCV(vect_pipe_logi, param_grid= params_vect_logi)
grid_vect_pipe_LR_lemma.fit(X_lemma_train, y_train)
vect_lemma_test_score = grid_vect_pipe_LR_lemma.score(X_lemma_test, y_test)

print('Test score: Logistic Regression on lemmantized text + CountVectorizer:', vect_lemma_test_score)

Test score: Logistic Regression on lemmantized text + CountVectorizer: 0.83335


In [242]:
print('best_score_LR_lemma: ', grid_vect_pipe_LR_lemma.best_score_)
print(" ")
print('cv_results_LR_lemma: ', grid_vect_pipe_LR_lemma.cv_results_)
print(" ")
print('get_params_LR_lemma: ',grid_vect_pipe_LR_lemma.get_params)
print(" ")
print('predict_proba_LR_lemma:', grid_vect_pipe_LR_lemma.predict_proba)
print(" ")
print('best_params_LR_lemma :', grid_vect_pipe_LR_lemma.best_params_ )

best_score_LR_lemma:  0.82938125
 
cv_results_LR_lemma:  {'mean_fit_time': array([1.12016058, 1.16496506, 1.12641087, 1.10572138, 1.15072269,
       1.22848978, 1.11352658, 1.2592237 , 1.18623304, 1.50994353,
       1.14869924, 1.70105386, 1.15777249, 1.62456017]), 'std_fit_time': array([0.04290032, 0.05950973, 0.02105538, 0.03552165, 0.04720622,
       0.02813765, 0.01997664, 0.03351569, 0.01753018, 0.02283815,
       0.04588849, 0.01898968, 0.02760195, 0.04053697]), 'mean_score_time': array([0.22686148, 0.23654995, 0.22926378, 0.21909175, 0.23059297,
       0.22798457, 0.22231565, 0.22562394, 0.23807497, 0.24063506,
       0.23033671, 0.23483181, 0.23336053, 0.22638073]), 'std_score_time': array([0.00529059, 0.00801755, 0.00414777, 0.0049074 , 0.00957485,
       0.00215392, 0.0028929 , 0.00626409, 0.00521045, 0.01185345,
       0.00890224, 0.00819725, 0.00378743, 0.00605415]), 'param_logi__C': masked_array(data=[0.001, 0.001, 0.01, 0.01, 0.1, 0.1, 1.0, 1.0, 10.0,
                   1

# Tfidf + logitic regression for lemma with pipeline and GridSearchCV

In [243]:
grid_tfidf_pipe_logi_lemma = GridSearchCV(tfidf_pipe, param_grid= tfidf_params)
grid_tfidf_pipe_logi_lemma.fit(X_lemma_train, y_train)
grid_tfidf_pipe_logi_lemma.score(X_lemma_test, y_test)
print('Test score with Logistic Regression on lemmantized text and Tfidf:', grid_tfidf_pipe_logi_lemma)

Test score with Logistic Regression on lemmantized text and Tfidf: GridSearchCV(estimator=Pipeline(steps=[('tfidf',
                                        TfidfVectorizer(max_features=1000,
                                                        stop_words='english')),
                                       ('logi',
                                        LogisticRegression(max_iter=1000,
                                                           random_state=42,
                                                           solver='liblinear'))]),
             param_grid={'logi__C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03]),
                         'logi__penalty': ['l1', 'l2']})


In [244]:
print('best_score_tiidf_logi_lemma: ', grid_tfidf_pipe_logi_lemma.best_score_)
print(" ")
print('cv_results_tfidf_logi_lemma: ', grid_tfidf_pipe_logi_lemma.cv_results_)
print(" ")
print('get_params_tfidf_logi_lemma: ',grid_tfidf_pipe_logi_lemma.get_params)
print(" ")
print('predict_proba_tfidf_logi_lemma:', grid_tfidf_pipe_logi_lemma.predict_proba)
print(" ")
print('best_params_tfidf_logi_lemma :', grid_tfidf_pipe_logi_lemma.best_params_)

best_score_tiidf_logi_lemma:  0.8300749999999999
 
cv_results_tfidf_logi_lemma:  {'mean_fit_time': array([1.03434153, 1.10246758, 1.10015993, 1.11090407, 1.11286507,
       1.13320503, 1.11573553, 1.16650319, 1.08843122, 1.26637416,
       1.16651363, 1.48363905, 1.19852486, 1.65222397]), 'std_fit_time': array([0.03448356, 0.0261893 , 0.03727306, 0.03504684, 0.01392108,
       0.0290086 , 0.02616761, 0.01344687, 0.0337909 , 0.01467346,
       0.04579799, 0.04478316, 0.02473891, 0.04716765]), 'mean_score_time': array([0.22808285, 0.240417  , 0.23341742, 0.22666073, 0.22501378,
       0.22847419, 0.22395988, 0.22656274, 0.22141452, 0.22190099,
       0.22923484, 0.23471313, 0.2309515 , 0.23189378]), 'std_score_time': array([0.01010736, 0.00680846, 0.01076328, 0.00539039, 0.00598951,
       0.00746732, 0.00441027, 0.00614214, 0.00067419, 0.00218063,
       0.00600055, 0.00973421, 0.00262156, 0.00753103]), 'param_logi__C': masked_array(data=[0.001, 0.001, 0.01, 0.01, 0.1, 0.1, 1.0, 1.0, 10

# CountVectorizer + Decision Tree on Stemmed text with Pipeline and GridsearchCV

In [245]:
vect_pipe_DT = Pipeline([('vect', CountVectorizer(max_features = 1000, stop_words= 'english')),
                          ('dtc', DecisionTreeClassifier(random_state= 42))
                          ])

params_vect_DT = {'dtc__criterion': ['gini', 'entropy', 'log_loss' ] }



In [175]:
grid_vect_pipe_DT_stem = GridSearchCV(vect_pipe_DT, param_grid= params_vect_DT)
grid_vect_pipe_DT_stem.fit(X_stem_train, y_train)
test_score_vect_DT_stem = grid_vect_pipe_DT_stem.score(X_stem_test, y_test)

print('Test score: Decision Tree Classifier on Stemmed text + CountVectorizer:', test_score_vect_DT_stem)

5 fits failed out of a total of 15.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/pipeline.py", line 394, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/tree/_classes.py", line 937, in fit
    super().fit(
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/tree/_classes.py", line 352, in fit
    criterion = CRITERIA_CLF[self.criterion](
KeyError: 'log

Test score: Decision Tree Classifier on Stemmed text + CountVectorizer: 0.795125
 Best Params for DTC Stemmed text:  {'dtc__criterion': 'gini'}


In [217]:
print('best_score_DT_stem: ', grid_vect_pipe_DT_stem.best_score_)
print(" ")
print('cv_results_DT_stem: ', grid_vect_pipe_DT_stem.cv_results_)
print(" ")
print('get_params_DT_stem: ',grid_vect_pipe_DT_stem.get_params)
print(" ")
print('predict_proba_DT_stem: ', grid_vect_pipe_DT_stem.predict_proba)
print(" ")
print('best_params_DT_stem :', grid_vect_pipe_DT_stem.best_params_ )

best_score_DT_stem:  0.7906000000000001
 
cv_results_DT_stem:  {'mean_fit_time': array([17.94860601, 20.33583522,  1.2809526 ]), 'std_fit_time': array([1.8315248 , 0.1944396 , 0.02366985]), 'mean_score_time': array([0.28974557, 0.33905592, 0.        ]), 'std_score_time': array([0.036345  , 0.01968207, 0.        ]), 'param_dtc__criterion': masked_array(data=['gini', 'entropy', 'log_loss'],
             mask=[False, False, False],
       fill_value='?',
            dtype=object), 'params': [{'dtc__criterion': 'gini'}, {'dtc__criterion': 'entropy'}, {'dtc__criterion': 'log_loss'}], 'split0_test_score': array([0.78665625, 0.79725   ,        nan]), 'split1_test_score': array([0.7926875 , 0.79909375,        nan]), 'split2_test_score': array([0.792     , 0.79878125,        nan]), 'split3_test_score': array([0.79071875, 0.79746875,        nan]), 'split4_test_score': array([0.7909375, 0.7955   ,       nan]), 'mean_test_score': array([0.7906    , 0.79761875,        nan]), 'std_test_score': array

# Tfidf + Decision Tree on Stemmed text with Pipeline and GridsearchCV

In [247]:
tfidf_pipe_DT = Pipeline([('tfidf', TfidfVectorizer(max_features = 1000, stop_words= 'english')),
                          ('dtc', DecisionTreeClassifier(random_state= 42))
                          ])

params_tfidf_DT = {'dtc__criterion': ['gini', 'entropy', 'log_loss' ] }


In [248]:
grid_tfidf_pipe_DT_stem = GridSearchCV(tfidf_pipe_DT, param_grid= params_tfidf_DT)
grid_tfidf_pipe_DT_stem.fit(X_stem_train, y_train)
test_score_tfidf_DT_stem = grid_vect_pipe_DT_stem.score(X_stem_test, y_test)

print('Test score: Decision Tree Classifier on Stemmed text + Tfidf:', test_score_tfidf_DT_stem)

5 fits failed out of a total of 15.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/pipeline.py", line 394, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/tree/_classes.py", line 937, in fit
    super().fit(
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/tree/_classes.py", line 352, in fit
    criterion = CRITERIA_CLF[self.criterion](
KeyError: 'log

Test score: Decision Tree Classifier on Stemmed text + Tfidf: 0.795125


In [249]:
print('best_score_tfidf_DT_stem: ', grid_tfidf_pipe_DT_stem.best_score_)
print(" ")
print('cv_results_tfidf_DT_stem: ', grid_tfidf_pipe_DT_stem.cv_results_)
print(" ")
print('get_params_tfidf_DT_stem: ',grid_tfidf_pipe_DT_stem.get_params)
print(" ")
print('predict_proba_tfidf_DT_stem: ', grid_tfidf_pipe_DT_stem.predict_proba)
print(" ")
print('best_params_tfidf_DT_stem: ', grid_tfidf_pipe_DT_stem.best_params_ )

best_score_tfidf_DT_stem:  0.80124375
 
cv_results_tfidf_DT_stem:  {'mean_fit_time': array([19.55327625, 19.16711926,  0.96753936]), 'std_fit_time': array([0.36389745, 0.11108896, 0.00573259]), 'mean_score_time': array([0.25955434, 0.24992895, 0.        ]), 'std_score_time': array([0.01153308, 0.00253141, 0.        ]), 'param_dtc__criterion': masked_array(data=['gini', 'entropy', 'log_loss'],
             mask=[False, False, False],
       fill_value='?',
            dtype=object), 'params': [{'dtc__criterion': 'gini'}, {'dtc__criterion': 'entropy'}, {'dtc__criterion': 'log_loss'}], 'split0_test_score': array([0.7990625 , 0.79915625,        nan]), 'split1_test_score': array([0.80134375, 0.80075   ,        nan]), 'split2_test_score': array([0.80165625, 0.802375  ,        nan]), 'split3_test_score': array([0.80109375, 0.8029375 ,        nan]), 'split4_test_score': array([0.8030625, 0.80175  ,       nan]), 'mean_test_score': array([0.80124375, 0.80139375,        nan]), 'std_test_score': a

# Countvectorize + Decision Tree on Lemmatized text with Pipeline and GridsearchCV

In [170]:
grid_vect_pipe_DT_lemma = GridSearchCV(vect_pipe_DT, param_grid= params_vect_DT)
grid_vect_pipe_DT_lemma.fit(X_lemma_train, y_train)
test_score_vect_DT_lemma = grid_vect_pipe_DT_stem.score(X_lemma_test, y_test)

print('Test score: Decision Tree Classifier on Lemmantized text + CountVectorizer:', test_score_vect_DT_lemma)


15 fits failed out of a total of 45.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
15 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/pipeline.py", line 394, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/tree/_classes.py", line 937, in fit
    super().fit(
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/tree/_classes.py", line 352, in fit
    criterion = CRITERIA_CLF[self.criterion](
KeyError: 'l

In [216]:
print('best_score_DT_Lemma: ', grid_vect_pipe_DT_lemma.best_score_)
print(" ")
print('cv_results_DT_Lemma: ', grid_vect_pipe_DT_lemma.cv_results_)
print(" ")
print('get_params_DT_Lemma: ',grid_vect_pipe_DT_lemma.get_params)
print(" ")
print('predict_proba_DT_Lemma :', grid_vect_pipe_DT_lemma.predict_proba)
print(" ")
print('best_params_DT_lemma :', grid_vect_pipe_DT_lemma.best_params_ )

best_score_DT_Lemma:  0.602275
 
cv_results_DT_Lemma:  {'mean_fit_time': array([1.10141706, 1.16836252, 1.27634659, 1.07619867, 1.21876235,
       1.40477114, 1.0813302 , 1.06762977, 1.03658552]), 'std_fit_time': array([0.01928752, 0.01724247, 0.02429805, 0.04743374, 0.02181738,
       0.03056986, 0.02118325, 0.01355709, 0.01598051]), 'mean_score_time': array([0.2349504 , 0.22671204, 0.22993083, 0.22712793, 0.23590784,
       0.23761978, 0.        , 0.        , 0.        ]), 'std_score_time': array([0.00409131, 0.00618552, 0.00639956, 0.00988274, 0.00722432,
       0.00523684, 0.        , 0.        , 0.        ]), 'param_dtc__criterion': masked_array(data=['gini', 'gini', 'gini', 'entropy', 'entropy',
                   'entropy', 'log_loss', 'log_loss', 'log_loss'],
             mask=[False, False, False, False, False, False, False, False,
                   False],
       fill_value='?',
            dtype=object), 'param_dtc__max_depth': masked_array(data=[5, 10, 15, 5, 10, 15, 5, 10

# # Tfidf + Decision Tree on Lemmatized text with Pipeline and GridsearchCV

In [251]:
grid_tfidf_pipe_DT_lemma = GridSearchCV(tfidf_pipe_DT, param_grid= params_tfidf_DT)
grid_tfidf_pipe_DT_lemma.fit(X_lemma_train, y_train)
test_score_tfidf_DT_lemma = grid_vect_pipe_DT_lemma.score(X_lemma_test, y_test)

print('Test score: Decision Tree Classifier on lemmantized text + Tfidf:', test_score_tfidf_DT_lemma)

5 fits failed out of a total of 15.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/pipeline.py", line 394, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/tree/_classes.py", line 937, in fit
    super().fit(
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/tree/_classes.py", line 352, in fit
    criterion = CRITERIA_CLF[self.criterion](
KeyError: 'log

Test score: Decision Tree Classifier on lemmantized text + Tfidf: 0.60085


In [252]:
print('best_score_tfidf_DT_Lemma: ', grid_tfidf_pipe_DT_lemma.best_score_)
print(" ")
print('cv_results_tfidf_DT_Lemma: ', grid_tfidf_pipe_DT_lemma.cv_results_)
print(" ")
print('get_params_tfidf_DT_Lemma: ',grid_tfidf_pipe_DT_lemma.get_params)
print(" ")
print('predict_proba_tfidf_DT_Lemma :', grid_tfidf_pipe_DT_lemma.predict_proba)
print(" ")
print('best_params_tfidf_DT_lemma :', grid_tfidf_pipe_DT_lemma.best_params_ )

best_score_tfidf_DT_Lemma:  0.7960375000000001
 
cv_results_tfidf_DT_Lemma:  {'mean_fit_time': array([17.39331403, 17.16291919,  0.96584253]), 'std_fit_time': array([0.33448142, 0.11615594, 0.00370483]), 'mean_score_time': array([0.25666113, 0.24981961, 0.        ]), 'std_score_time': array([0.01399089, 0.00713088, 0.        ]), 'param_dtc__criterion': masked_array(data=['gini', 'entropy', 'log_loss'],
             mask=[False, False, False],
       fill_value='?',
            dtype=object), 'params': [{'dtc__criterion': 'gini'}, {'dtc__criterion': 'entropy'}, {'dtc__criterion': 'log_loss'}], 'split0_test_score': array([0.7925625, 0.7958125,       nan]), 'split1_test_score': array([0.79628125, 0.798     ,        nan]), 'split2_test_score': array([0.7985625 , 0.79859375,        nan]), 'split3_test_score': array([0.7964375, 0.7950625,       nan]), 'split4_test_score': array([0.79634375, 0.79540625,        nan]), 'mean_test_score': array([0.7960375, 0.796575 ,       nan]), 'std_test_score

# Countvectorize + Random Forest Classifier on Stem text with Pipeline and GridsearchCV


In [205]:
vect_pipe_RFC = Pipeline([('vect', CountVectorizer( max_features = 1000, stop_words= 'english')),
                          ('rfc', RandomForestClassifier())
                           ])


param_vect_RFC = { 'rfc__criterion': ['gini', 'entropy', 'log_loss'], 
                  'rfc__n_jobs': [1,2,5,10]}

In [206]:
grid_vect_pipe_RFC_stem = GridSearchCV(vect_pipe_RFC, param_grid= param_vect_RFC)
grid_vect_pipe_RFC_stem.fit(X_stem_train, y_train)

20 fits failed out of a total of 60.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/pipeline.py", line 394, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/ensemble/_forest.py", line 450, in fit
    trees = Parallel(
  File "/opt/anaconda3/lib/python3.9/site-packages/joblib/parallel.py", line 1043, in __call__
    if self.dispatch_one_batch(iterator):
  File "/

GridSearchCV(estimator=Pipeline(steps=[('vect',
                                        CountVectorizer(max_features=1000,
                                                        stop_words='english')),
                                       ('rfc', RandomForestClassifier())]),
             param_grid={'rfc__criterion': ['gini', 'entropy', 'log_loss'],
                         'rfc__n_jobs': [1, 2, 5, 10]})

In [207]:
test_score_vect_RFC_stem = grid_vect_pipe_RFC_stem.score(X_stem_test, y_test)
print('Test score: Random Forest Classifier on Stem text + CountVectorizer:', test_score_vect_RFC_stem)


Test score: Random Forest Classifier on Stem text + CountVectorizer: 0.83625


In [215]:
print('best_score_RFC_stem: ', grid_vect_pipe_RFC_stem.best_score_)
print(" ")
print('cv_results_RFC_stem: ', grid_vect_pipe_RFC_stem.cv_results_)
print(" ")
print('get_params_RFC_stem: ',grid_vect_pipe_RFC_stem.get_params)
print(" ")
print('predict_proba_RFC_stem :', grid_vect_pipe_RFC_stem.predict_proba)
print(" ")
print('best_params_RFC_stem :', grid_vect_pipe_RFC_stem.best_params_ )

best_score_RFC_stem:  0.83065625
 
cv_results_RFC_stem:  {'mean_fit_time': array([319.71623297, 107.38558207,  37.58774652,  23.7258646 ,
       170.35893054,  87.45952563,  36.68449726,  23.252424  ,
         1.28851886,   3.06733699,   3.13025565,   3.59499063]), 'std_fit_time': array([1.78138375e+02, 3.59607784e+01, 8.81114694e-01, 9.49034278e-01,
       1.76861816e+00, 6.07508289e-01, 9.52376734e-01, 8.19896806e-01,
       1.39184788e-02, 2.20401073e-01, 7.16591894e-02, 8.87789190e-02]), 'mean_score_time': array([3.37415047, 1.88157434, 0.97681351, 0.76567464, 3.20777636,
       1.81988239, 0.94949322, 0.72363472, 0.        , 0.        ,
       0.        , 0.        ]), 'std_score_time': array([0.0548496 , 0.05340991, 0.02810889, 0.05024164, 0.0570834 ,
       0.07468387, 0.01435573, 0.00835839, 0.        , 0.        ,
       0.        , 0.        ]), 'param_rfc__criterion': masked_array(data=['gini', 'gini', 'gini', 'gini', 'entropy', 'entropy',
                   'entropy', 'entr

# Tfidf + Random Forest Classifier on Stem text with Pipeline and GridsearchCV

In [255]:
tfidf_pipe_RFC = Pipeline([('tfidf', TfidfVectorizer(max_features = 1000, stop_words= 'english')),
                          ('rfc', RandomForestClassifier())])


param_tfidf_RFC = { 'rfc__criterion': ['gini', 'entropy', 'log_loss'], 
                  'rfc__n_jobs': [1,2,5,10]}


In [257]:
grid_tfidf_pipe_RFC_stem = GridSearchCV(tfidf_pipe_RFC, param_grid= param_tfidf_RFC)
grid_tfidf_pipe_RFC_stem.fit(X_stem_train, y_train)
test_score_tfidf_RFC_stem = grid_tfidf_pipe_RFC_stem.fit(X_stem_train, y_train).score(X_lemma_test, y_test)

print('Test score: Decision Tree Classifier on lemmantized text + Tfidf:', test_score_tfidf_RFC_stem)

20 fits failed out of a total of 60.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/pipeline.py", line 394, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/ensemble/_forest.py", line 450, in fit
    trees = Parallel(
  File "/opt/anaconda3/lib/python3.9/site-packages/joblib/parallel.py", line 1043, in __call__
    if self.dispatch_one_batch(iterator):
  File "/

Test score: Decision Tree Classifier on lemmantized text + Tfidf: 0.79065


In [258]:
print('best_score_tfidf_RFC_stem: ', grid_tfidf_pipe_RFC_stem.best_score_)
print(" ")
print('cv_results_tfidf_RFC_stem: ', grid_tfidf_pipe_RFC_stem.cv_results_)
print(" ")
print('get_params_tfidf_RFC_stem: ',grid_tfidf_pipe_RFC_stem.get_params)
print(" ")
print('predict_proba_tfidf_RFC_stem :', grid_tfidf_pipe_RFC_stem.predict_proba)
print(" ")
print('best_params_tfidf__RFC_stem :', grid_tfidf_pipe_RFC_stem.best_params_ )

best_score_tfidf_RFC_stem:  0.8379125000000001
 
cv_results_tfidf_RFC_stem:  {'mean_fit_time': array([140.45250025,  73.90172224,  31.50893736,  19.8700695 ,
       144.11713786,  75.67553549, 222.45520182, 205.08293481,
         1.06569953,   2.51379042,   2.47373896,   2.85783   ]), 'std_fit_time': array([7.33671683e-01, 5.98563505e-01, 7.12743681e-01, 7.38251578e-01,
       1.29985209e+00, 1.22535588e+00, 3.57408853e+02, 3.64686994e+02,
       4.39632353e-02, 2.59924574e-01, 5.65009607e-03, 9.37935870e-02]), 'mean_score_time': array([2.47154565, 1.44298205, 0.75289965, 0.60675235, 2.58078265,
       1.45500422, 0.76734905, 0.59035411, 0.        , 0.        ,
       0.        , 0.        ]), 'std_score_time': array([0.00941668, 0.01254161, 0.02504234, 0.02048082, 0.05987365,
       0.03380705, 0.02047809, 0.01426659, 0.        , 0.        ,
       0.        , 0.        ]), 'param_rfc__criterion': masked_array(data=['gini', 'gini', 'gini', 'gini', 'entropy', 'entropy',
               

# Countvectorize + Random Forest Classifier on Lemmatized text with Pipeline and GridsearchCV

In [250]:
grid_vect_pipe_RFC_lemma = GridSearchCV(vect_pipe_RFC, param_grid= param_vect_RFC)
grid_vect_pipe_RFC_lemma.fit(X_lemma_train, y_train)

20 fits failed out of a total of 60.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/pipeline.py", line 394, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/ensemble/_forest.py", line 450, in fit
    trees = Parallel(
  File "/opt/anaconda3/lib/python3.9/site-packages/joblib/parallel.py", line 1043, in __call__
    if self.dispatch_one_batch(iterator):
  File "/

GridSearchCV(estimator=Pipeline(steps=[('vect',
                                        CountVectorizer(max_features=1000,
                                                        stop_words='english')),
                                       ('rfc', RandomForestClassifier())]),
             param_grid={'rfc__criterion': ['gini', 'entropy', 'log_loss'],
                         'rfc__n_jobs': [1, 2, 5, 10]})

In [212]:
grid_vect_pipe_RFC_lemma.score(X_lemma_test, y_test)
print('Test score: Random Forest Classifier on lemmatized text + CountVectorizer:', test_score_vect_RFC_stem)


Test score: Random Forest Classifier on lemmatized text + CountVectorizer: 0.83625


In [214]:
print('best_score_RFC_lemma: ', grid_vect_pipe_RFC_lemma.best_score_)
print(" ")
print('cv_results_RFC_lemma: ', grid_vect_pipe_RFC_lemma.cv_results_)
print(" ")
print('get_params_RFC_lemma: ',grid_vect_pipe_RFC_lemma.get_params)
print(" ")
print('predict_proba_RFC_lemma :', grid_vect_pipe_RFC_lemma.predict_proba)
print(" ")
print('best_params_RFC_lemma :', grid_vect_pipe_RFC_lemma.best_params_ )

best_score_RFC_lemma:  0.8201124999999999
 
cv_results_RFC_lemma:  {'mean_fit_time': array([162.83813338,  84.36173062,  35.42500658,  22.09844961,
       217.64355721,  82.61667848,  34.65750914,  21.92023978,
         1.36544604,   3.27338338,   3.30593739,   3.9070004 ]), 'std_fit_time': array([5.41588404e-01, 5.71412229e-01, 8.96745648e-01, 1.24160422e+00,
       7.02675278e+01, 9.16624685e-01, 8.35294365e-01, 4.43151889e-01,
       6.66687922e-02, 2.63099918e-01, 9.01955743e-02, 2.42789820e-01]), 'mean_score_time': array([3.66975856, 2.01393967, 1.08475819, 0.80479031, 3.64413738,
       1.93968697, 0.97585382, 0.75836082, 0.        , 0.        ,
       0.        , 0.        ]), 'std_score_time': array([0.05698716, 0.02804092, 0.0700268 , 0.04098161, 0.04566177,
       0.03301177, 0.03069398, 0.06721933, 0.        , 0.        ,
       0.        , 0.        ]), 'param_rfc__criterion': masked_array(data=['gini', 'gini', 'gini', 'gini', 'entropy', 'entropy',
                   'entro

# # Tfidf + Random Forest Classifier on Lemmantized text with Pipeline and GridsearchCV

In [259]:
grid_tfidf_pipe_RFC_lemma = GridSearchCV(tfidf_pipe_RFC, param_grid= param_tfidf_RFC)
grid_tfidf_pipe_RFC_lemma.fit(X_lemma_train, y_train)
test_score_tfidf_RFC_lemma = grid_tfidf_pipe_RFC_lemma.fit(X_lemma_train, y_train).score(X_lemma_test, y_test)

print('Test score: Decision Tree Classifier on lemmantized text + Tfidf:', test_score_tfidf_RFC_lemma)

20 fits failed out of a total of 60.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/pipeline.py", line 394, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "/opt/anaconda3/lib/python3.9/site-packages/sklearn/ensemble/_forest.py", line 450, in fit
    trees = Parallel(
  File "/opt/anaconda3/lib/python3.9/site-packages/joblib/parallel.py", line 1043, in __call__
    if self.dispatch_one_batch(iterator):
  File "/

Test score: Decision Tree Classifier on lemmantized text + Tfidf: 0.834325


In [260]:
print('best_score_tfidf_RFC_lemma: ', grid_tfidf_pipe_RFC_lemma.best_score_)
print(" ")
print('cv_results_tfidf_RFC_lemma: ', grid_tfidf_pipe_RFC_lemma.cv_results_)
print(" ")
print('get_params_tfidf_RFC_lemma: ',grid_tfidf_pipe_RFC_lemma.get_params)
print(" ")
print('predict_proba_tfidf_RFC_lemma :', grid_tfidf_pipe_RFC_lemma.predict_proba)
print(" ")
print('best_params_tfidf__RFC_lemma :', grid_tfidf_pipe_RFC_lemma.best_params_)

best_score_tfidf_RFC_lemma:  0.82870625
 
cv_results_tfidf_RFC_lemma:  {'mean_fit_time': array([708.03427825, 626.61444545, 424.41285996, 114.26163349,
       873.27211375,  74.7644455 , 281.33600974, 403.80933499,
         1.01472926,   2.46250696,   2.54475303,   2.83853884]), 'std_fit_time': array([4.04737149e+02, 4.53186336e+02, 4.82295889e+02, 1.18729421e+02,
       4.27222336e+02, 9.49557483e-01, 3.64418178e+02, 4.69748393e+02,
       5.16259644e-02, 1.05164244e-01, 4.22225737e-02, 3.62872316e-02]), 'mean_score_time': array([2.80669322, 1.71330171, 0.80729403, 0.64736476, 2.88351564,
       1.63491554, 0.83855925, 0.74726415, 0.        , 0.        ,
       0.        , 0.        ]), 'std_score_time': array([0.03096762, 0.27353339, 0.02372289, 0.04223033, 0.06189575,
       0.03547313, 0.01222378, 0.14049374, 0.        , 0.        ,
       0.        , 0.        ]), 'param_rfc__criterion': masked_array(data=['gini', 'gini', 'gini', 'gini', 'entropy', 'entropy',
                   'e

# Summary of Results CountVectorize

In [268]:
Summary_CV = pd. DataFrame({'Stemmed Text':  ' ',
                            'Model': [ 'Logistic Regression - Stem', 'Logistic Regression - Lemma', 'Decision Tree - Stem', 'Decision Tree - Lemma', 'RandomForestClassifier - Stem', 'RandomForestClassifier - Lemma'], 
                         ' Best Scores': [grid_vect_pipe_logi_stem.best_score_, grid_vect_pipe_LR_lemma.best_score_, grid_vect_pipe_DT_stem.best_score_, grid_vect_pipe_DT_lemma.best_score_, grid_vect_pipe_RFC_stem.best_score_, grid_vect_pipe_RFC_lemma.best_score_],
                         'Cross Validation Results': [grid_vect_pipe_logi_stem.cv_results_, grid_vect_pipe_LR_lemma.cv_results_, grid_vect_pipe_DT_stem.cv_results_, grid_vect_pipe_DT_lemma.cv_results_, grid_vect_pipe_RFC_stem.cv_results_, grid_vect_pipe_RFC_lemma.cv_results_],
                         'Predict Proba': [grid_vect_pipe_logi_stem.predict_proba, grid_vect_pipe_LR_lemma.predict_proba, grid_vect_pipe_DT_stem.predict_proba, grid_vect_pipe_DT_lemma.predict_proba, grid_vect_pipe_RFC_stem.predict_proba, grid_vect_pipe_RFC_lemma.predict_proba],
                         'Best Parameters':[grid_vect_pipe_logi_stem.best_params_, grid_vect_pipe_LR_lemma.best_params_, grid_vect_pipe_DT_stem.best_params_, grid_vect_pipe_DT_lemma.best_params_, grid_vect_pipe_RFC_stem.best_params_, grid_vect_pipe_RFC_lemma.best_params_]
                         })
Summary_CV

Unnamed: 0,Stemmed Text,Model,Best Scores,Cross Validation Results,Predict Proba,Best Parameters
0,,Logistic Regression - Stem,0.837275,"{'mean_fit_time': [1.0618602275848388, 1.06010...",<function BaseSearchCV.predict_proba at 0x7fa3...,"{'logi__C': 1.0, 'logi__penalty': 'l2'}"
1,,Logistic Regression - Lemma,0.829381,"{'mean_fit_time': [1.1201605796813965, 1.16496...",<function BaseSearchCV.predict_proba at 0x7fa3...,"{'logi__C': 100.0, 'logi__penalty': 'l2'}"
2,,Decision Tree - Stem,0.7906,"{'mean_fit_time': [17.94860601425171, 20.33583...",<function BaseSearchCV.predict_proba at 0x7fa3...,{'dtc__criterion': 'gini'}
3,,Decision Tree - Lemma,0.602275,"{'mean_fit_time': [1.101417064666748, 1.168362...",<function BaseSearchCV.predict_proba at 0x7fa3...,"{'dtc__criterion': 'gini', 'dtc__max_depth': 5}"
4,,RandomForestClassifier - Stem,0.830656,"{'mean_fit_time': [319.7162329673767, 107.3855...",<function BaseSearchCV.predict_proba at 0x7fa3...,"{'rfc__criterion': 'gini', 'rfc__n_jobs': 1}"
5,,RandomForestClassifier - Lemma,0.820044,"{'mean_fit_time': [519.6760025024414, 678.1045...",<function BaseSearchCV.predict_proba at 0x7fa3...,"{'rfc__criterion': 'gini', 'rfc__n_jobs': 1}"


# Summary of Results - Tfidf

In [269]:
Summary_Tfidf = pd. DataFrame({'Lemmanzited': ' ', 
                               'Model': [ 'Logistic Regression - Stem', 'Logistic Regression - Lemma', 'Decision Tree - Stem', 'Decision Tree - Lemma', 'RandomForestClassifier - Stem', 'RandomForestClassifier - Lemma'], 
                               ' Best Scores': [grid_tfidf_pipe_logi_stem.best_score_, grid_tfidf_pipe_logi_lemma.best_score_, grid_tfidf_pipe_DT_stem.best_score_, grid_tfidf_pipe_DT_lemma.best_score_, grid_tfidf_pipe_RFC_stem.best_score_, grid_tfidf_pipe_RFC_lemma.best_score_],
                               'Cross Validation Results': [grid_tfidf_pipe_logi_stem.cv_results_, grid_tfidf_pipe_logi_lemma.cv_results_, grid_tfidf_pipe_DT_stem.cv_results_, grid_tfidf_pipe_DT_lemma.cv_results_, grid_tfidf_pipe_RFC_stem.cv_results_, grid_tfidf_pipe_RFC_lemma.cv_results_],
                               'Predict Proba': [grid_tfidf_pipe_logi_stem.predict_proba, grid_tfidf_pipe_logi_lemma.predict_proba, grid_tfidf_pipe_DT_stem.predict_proba, grid_tfidf_pipe_DT_lemma.predict_proba, grid_tfidf_pipe_RFC_stem.predict_proba, grid_tfidf_pipe_RFC_lemma.predict_proba],
                               'Best Parameters':[grid_tfidf_pipe_logi_stem.best_params_, grid_tfidf_pipe_logi_lemma.best_params_, grid_tfidf_pipe_DT_stem.best_params_, grid_tfidf_pipe_DT_lemma.best_params_, grid_tfidf_pipe_RFC_stem.best_params_, grid_tfidf_pipe_RFC_lemma.best_params_]
                         })

Summary_Tfidf

Unnamed: 0,Lemmanzited,Model,Best Scores,Cross Validation Results,Predict Proba,Best Parameters
0,,Logistic Regression - Stem,0.840394,"{'mean_fit_time': [1.0713510036468505, 1.04827...",<function BaseSearchCV.predict_proba at 0x7fa3...,"{'logi__C': 10.0, 'logi__penalty': 'l2'}"
1,,Logistic Regression - Lemma,0.830075,"{'mean_fit_time': [1.034341526031494, 1.102467...",<function BaseSearchCV.predict_proba at 0x7fa3...,"{'logi__C': 10.0, 'logi__penalty': 'l1'}"
2,,Decision Tree - Stem,0.801244,"{'mean_fit_time': [19.55327625274658, 19.16711...",<function BaseSearchCV.predict_proba at 0x7fa3...,{'dtc__criterion': 'gini'}
3,,Decision Tree - Lemma,0.796038,"{'mean_fit_time': [17.393314027786253, 17.1629...",<function BaseSearchCV.predict_proba at 0x7fa3...,{'dtc__criterion': 'gini'}
4,,RandomForestClassifier - Stem,0.837913,"{'mean_fit_time': [140.45250024795533, 73.9017...",<function BaseSearchCV.predict_proba at 0x7fa3...,"{'rfc__criterion': 'gini', 'rfc__n_jobs': 1}"
5,,RandomForestClassifier - Lemma,0.828706,"{'mean_fit_time': [708.0342782497406, 626.6144...",<function BaseSearchCV.predict_proba at 0x7fa3...,"{'rfc__criterion': 'gini', 'rfc__n_jobs': 1}"
