# Table of Contents

 - [Logistic Regression](#Logistic-Regression)
 - [BOW and LogisticRegression](#BOW-and-LogisticRegression)
 - [Bigram and LogisticRegression](#Bigram-and-LogisticRigression)
 - [TF-IDF and LogisticRegression](#TF-IDF-and-LogisticRegression)
 - [TF-IDF, Ngrams and LogisticRegression](#TF-IDF,-Ngarms-and-LogisticRegression)
 - [GridSearch with BOW and Logistic Regression](#GridSearch-with-BOW-and-Logistic-Regression)
 - [GridSearch with TF-IDF and Logistic Regression](#GridSearch-with-TF-IDF-and-Logistic-Regression)

# Logistic Regression

In the preceding notebook, the initial Logistic Regression models are built and examined. Also, two techniques, CountVectorizer and TF-IDF, are employed to vectorize the text. The main goal is to tune the parameters of these two vectorization techniques.

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

In [2]:
df = pd.read_csv('Library/cleaned_text_train_df.csv')
df.head()

Unnamed: 0,clean_text,toxic_type
0,explanation edit make username hardcore metall...,0
1,aww match background colour seemingly stuck th...,0
2,hey man really not try edit war guy constantly...,0
3,make real suggestion improvement wonder sectio...,0
4,sir hero chance remember page,0


In [3]:
df.isna().sum()

clean_text    54
toxic_type     0
dtype: int64

In [4]:
df.dropna(inplace=True)

In [22]:
X = df['clean_text']
y = df['toxic_type']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
y_train, y_test = y_train.values, y_test.values

I start with bag-of-words text vectorization in conjunction with logistic regression.

## BOW and LogisticRegression

In [23]:
n_features = 5000
vectorizer = CountVectorizer(max_features=n_features)
train_matrix = vectorizer.fit_transform(X_train)
count_array = train_matrix.toarray()
bow_train = pd.DataFrame(data=count_array, columns=vectorizer.get_feature_names())
bow_train.index = X_train.index
bow_train.head()

Unnamed: 0,aa,aaron,ab,abandon,abbreviation,abc,abide,ability,able,abortion,...,young,yourselfgo,youth,youtube,ytmnd,yugoslavia,zealand,zero,zionist,zone
18679,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4329,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
153619,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
45013,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
93942,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [24]:
X_train = bow_train.values
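
One practical note on the `toarray()` step: it materializes the full count matrix in memory. The sketch below is a back-of-the-envelope estimate, assuming the training shape of (127613, 5000) that this notebook prints further down and 8-byte integer counts (`np.int64` is the default `dtype` of `CountVectorizer`).

```python
# Rough memory estimate for the dense bag-of-words training matrix.
n_rows = 127_613        # training rows (shape printed by the notebook)
n_features = 5_000      # max_features passed to CountVectorizer
bytes_per_value = 8     # assumption: int64 counts

dense_bytes = n_rows * n_features * bytes_per_value
dense_gb = dense_bytes / 1e9
print(f"dense matrix: about {dense_gb:.1f} GB")
```

Since `LogisticRegression.fit` also accepts SciPy sparse matrices, passing `train_matrix` directly would avoid this cost; the dense DataFrame is mainly useful for inspecting the features.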

In [25]:
n_features = 5000
vectorizer = CountVectorizer(max_features=n_features)
test_matrix = vectorizer.fit_transform(X_test)
count_array = test_matrix.toarray()
bow_test = pd.DataFrame(data=count_array, columns=vectorizer.get_feature_names())
bow_test.index = X_test.index
bow_test.head()

Unnamed: 0,ab,abandon,abc,abide,ability,able,abortion,absence,absolute,absolutely,...,youth,youtube,yugoslav,yugoslavia,zealand,zero,zinc,zionism,zionist,zone
18491,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9389,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
48545,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
135944,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
114289,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [26]:
X_test = bow_test.values
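
Note that the test cell above fits a second `CountVectorizer` on `X_test`, so column j of `bow_test` need not be the same word as column j of `bow_train` (the two table headers indeed start with different words). A minimal sketch of the usual alternative, fitting once on the training text and only calling `transform` on the test text, is shown below. The tiny corpora are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpora standing in for X_train and X_test.
toy_train = ["good edit thanks", "bad edit stop", "thanks good page"]
toy_test = ["stop bad vandal", "good page"]

vec = CountVectorizer(max_features=5000)
train_counts = vec.fit_transform(toy_train)   # learns the vocabulary
test_counts = vec.transform(toy_test)         # reuses the same vocabulary

# Both matrices share one column order; words unseen in training
# (here "vandal") are simply ignored at transform time.
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(vocab)
print(train_counts.shape, test_counts.shape)
```

With this pattern the classifier sees test columns that mean the same thing as the columns it was trained on.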

In [27]:
print("X: ", type(X), X.shape)
print("y: ", type(y), y.shape)

X:  <class 'pandas.core.series.Series'> (159517,)
y:  <class 'pandas.core.series.Series'> (159517,)


In [29]:
print("X train: ", type(X_train), X_train.shape)
print("y train: ", type(y_train), y_train.shape)

X train:  <class 'numpy.ndarray'> (127613, 5000)
y train:  <class 'numpy.ndarray'> (127613,)


In [30]:
print("X test: ", type(X_test), X_test.shape)
print("y test: ", type(y_test), y_test.shape)

X test:  <class 'numpy.ndarray'> (31904, 5000)
y test:  <class 'numpy.ndarray'> (31904,)


In [33]:
clf = LogisticRegression(solver='lbfgs', max_iter=20000, random_state=1)
clf_model = clf.fit(X_train, y_train)

y_pred_train = clf_model.predict(X_train)

y_pred_test = clf_model.predict(X_test)

In [34]:
print("Test Classification Report")
print(classification_report(y_test, y_pred_test))

Test Classification Report
              precision    recall  f1-score   support

           0       0.90      0.90      0.90     28566
           1       0.11      0.10      0.11      3338

    accuracy                           0.82     31904
   macro avg       0.50      0.50      0.50     31904
weighted avg       0.81      0.82      0.82     31904



In [35]:
print("Training Classification Report")
print(classification_report(y_train, y_pred_train))

Training Classification Report
              precision    recall  f1-score   support

           0       0.97      0.99      0.98    114726
           1       0.92      0.71      0.80     12887

    accuracy                           0.96    127613
   macro avg       0.95      0.85      0.89    127613
weighted avg       0.96      0.96      0.96    127613



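This model predicts the non-toxic comments with high precision, but it does not detect the toxic comments well. Recall for the toxic class is 0.71 on the training set and only 0.10 on the test set. The near-chance test result is most likely an artifact of the test set being vectorized with a second `CountVectorizer` fitted on `X_test`, so its 5,000 columns do not line up with the features the model was trained on.
<br> How do n-grams perform? Let's see!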
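
For reference, `ngram_range=(2,2)` replaces single words with pairs of adjacent tokens. The helper below is a hypothetical, simplified illustration of that idea on whitespace-split text (using the first words of one of the sample rows shown earlier); it is not `CountVectorizer`'s actual implementation.

```python
def ngrams(text, n):
    """Return the n-grams of a whitespace-tokenized text as space-joined strings."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

comment = "hey man really not try edit war"
unigrams = ngrams(comment, 1)
bigrams = ngrams(comment, 2)
print(unigrams)
print(bigrams)
```

Any specific word pair occurs in fewer comments than each of its component words, so a bigram-only matrix with the same `max_features` budget tends to be sparser.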

## Bigram and LogisticRigression

In [36]:
n_features = 5000
vectorizer = CountVectorizer(max_features=n_features, ngram_range = (2,2))
text_matrix = vectorizer.fit_transform(df.clean_text)
count_array = text_matrix.toarray()
bow_bi_df = pd.DataFrame(data=count_array, columns=vectorizer.get_feature_names())
bow_bi_df.index = df.index
bow_bi_df.head()

Unnamed: 0,able edit,able find,able get,absolutely no,absolutely nothing,abuse power,abusing power,accept appropriate,accept copyright,accept notable,...,yet bitch,yet not,yet still,york city,york times,youbollocks youbollocks,yourselfgo fuck,youtube com,youtube video,ytmnd name
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [37]:
X = bow_bi_df.values
y = df['toxic_type'].values

In [38]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [40]:
clf = LogisticRegression(solver='lbfgs', max_iter=20000, random_state=1)
clf_model = clf.fit(X_train, y_train)

y_pred_train = clf_model.predict(X_train)

y_pred_test = clf_model.predict(X_test)

In [41]:
print("Test Classification Report")
print(classification_report(y_test, y_pred_test))

Test Classification Report
              precision    recall  f1-score   support

           0       0.91      1.00      0.95     28566
           1       0.79      0.16      0.26      3338

    accuracy                           0.91     31904
   macro avg       0.85      0.58      0.61     31904
weighted avg       0.90      0.91      0.88     31904



In [42]:
print("Training Classification Report")
print(classification_report(y_train, y_pred_train))

Training Classification Report
              precision    recall  f1-score   support

           0       0.92      1.00      0.95    114726
           1       0.87      0.18      0.29     12887

    accuracy                           0.91    127613
   macro avg       0.89      0.59      0.62    127613
weighted avg       0.91      0.91      0.89    127613



On the training set, both precision (0.87 vs. 0.92) and recall (0.18 vs. 0.71) for the toxic category are lower than for the first model, and test recall is only 0.16. It seems that bigrams alone do not perform well.
<br> Now, I repeat the same process with TF-IDF vectorization in conjunction with logistic regression.
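
As a reminder of what TF-IDF does to the raw counts, the sketch below reproduces the weighting by hand for a toy corpus. It assumes scikit-learn's documented defaults for `TfidfVectorizer` (smoothed IDF, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each row). The corpus and the helper name `tfidf_row` are made up for illustration.

```python
import math
from collections import Counter

docs = [["edit", "page", "edit"], ["page", "thanks"], ["thanks", "thanks", "page"]]
n_docs = len(docs)
vocab = sorted({t for d in docs for t in d})

# Document frequency: number of documents that contain each term.
doc_freq = {t: sum(t in d for d in docs) for t in vocab}
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(doc):
    """Term counts times IDF, scaled to unit L2 norm."""
    counts = Counter(doc)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

for d in docs:
    print([round(v, 3) for v in tfidf_row(d)])
```

A term that appears in every document ("page" here) receives the smallest IDF, so rarer and potentially more informative terms weigh more in each row.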

## TF-IDF and LogisticRegression

In [43]:
n_features = 5000
tfidf = TfidfVectorizer(min_df=10, max_df=0.95, use_idf=True, max_features=n_features)
train_tfidf = tfidf.fit_transform(df['clean_text'])
tfidf_array = train_tfidf.toarray()
tfidf_df = pd.DataFrame(tfidf_array, columns=tfidf.get_feature_names())
tfidf_df.head()

Unnamed: 0,aa,ab,abandon,abbreviation,abc,abide,ability,able,abortion,abraham,...,york,young,youth,youtube,yugoslav,yugoslavia,zealand,zero,zionist,zone
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.2608,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [44]:
X = tfidf_df.values
y = df['toxic_type'].values

In [45]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [46]:
clf = LogisticRegression(solver='lbfgs', max_iter=20000, random_state=0)
clf_model = clf.fit(X_train, y_train)

y_pred_train = clf_model.predict(X_train)

y_pred_test = clf_model.predict(X_test)

In [47]:
print("Test Classification Report")
print(classification_report(y_test, y_pred_test))

Test Classification Report
              precision    recall  f1-score   support

           0       0.96      0.99      0.98     28566
           1       0.90      0.64      0.75      3338

    accuracy                           0.95     31904
   macro avg       0.93      0.81      0.86     31904
weighted avg       0.95      0.95      0.95     31904



In [48]:
print("Training Classification Report")
print(classification_report(y_train, y_pred_train))

Training Classification Report
              precision    recall  f1-score   support

           0       0.96      0.99      0.98    114726
           1       0.93      0.66      0.77     12887

    accuracy                           0.96    127613
   macro avg       0.95      0.83      0.88    127613
weighted avg       0.96      0.96      0.96    127613



## TF-IDF, Ngarms and LogisticRegression

In [49]:
n_features = 10000
tfidf = TfidfVectorizer(min_df=10, max_df=0.95, use_idf=True, max_features=n_features, ngram_range=(2,2))
train_tfidf = tfidf.fit_transform(df['clean_text'])
tfidf_array = train_tfidf.toarray()
tfidf_df = pd.DataFrame(tfidf_array, columns=tfidf.get_feature_names())
tfidf_df.head()

Unnamed: 0,ability create,able edit,able find,able get,able help,able make,able see,able use,absolutely no,absolutely not,...,yet another,yet no,yet not,yet see,yet still,york city,york times,young man,youtube com,youtube video
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [50]:
X = tfidf_df.values
y = df['toxic_type'].values

In [51]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [52]:
clf = LogisticRegression(solver='lbfgs', max_iter=20000, random_state=0)
clf_model = clf.fit(X_train, y_train)

y_pred_train = clf_model.predict(X_train)

y_pred_test = clf_model.predict(X_test)

In [53]:
print("Test Classification Report")
print(classification_report(y_test, y_pred_test))

Test Classification Report
              precision    recall  f1-score   support

           0       0.91      1.00      0.95     28566
           1       0.89      0.17      0.28      3338

    accuracy                           0.91     31904
   macro avg       0.90      0.58      0.62     31904
weighted avg       0.91      0.91      0.88     31904



In [54]:
print("Training Classification Report")
print(classification_report(y_train, y_pred_train))

Training Classification Report
              precision    recall  f1-score   support

           0       0.91      1.00      0.95    114726
           1       0.92      0.17      0.29     12887

    accuracy                           0.92    127613
   macro avg       0.92      0.59      0.62    127613
weighted avg       0.92      0.92      0.89    127613



# GridSearch with BOW and Logistic Regression

In [55]:
pipeline = Pipeline([('countvectorizer', CountVectorizer()), 
                     ('clf', LogisticRegression(solver='lbfgs', max_iter=20000))])

parameters = {
    'countvectorizer__max_features': (2500, 5000, 1000, 15000, 20000),
    'countvectorizer__ngram_range': ((1,1), (2,2))
}

In [56]:
grid_search = GridSearchCV(pipeline, parameters, cv=5)
X = df.clean_text
y = df.toxic_type
gs = grid_search.fit(X, y)

In [57]:
print(gs.best_estimator_.steps)

[('countvectorizer', CountVectorizer(max_features=20000)), ('clf', LogisticRegression(max_iter=20000))]


In [58]:
pd.DataFrame(gs.cv_results_).sort_values('mean_test_score', ascending=False)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_countvectorizer__max_features,param_countvectorizer__ngram_range,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
8,12.174575,1.135326,0.782658,0.019074,20000,"(1, 1)","{'countvectorizer__max_features': 20000, 'coun...",0.956651,0.957059,0.956587,0.956869,0.956995,0.956832,0.000185,1
6,13.943474,1.724176,1.173358,0.276607,15000,"(1, 1)","{'countvectorizer__max_features': 15000, 'coun...",0.9564,0.956808,0.956117,0.956556,0.956242,0.956425,0.000242,2
2,10.917253,0.646965,0.863884,0.084232,5000,"(1, 1)","{'countvectorizer__max_features': 5000, 'count...",0.955241,0.954645,0.954957,0.954518,0.955396,0.954952,0.000335,3
0,10.529709,1.066831,0.818591,0.0642,2500,"(1, 1)","{'countvectorizer__max_features': 2500, 'count...",0.953423,0.952733,0.952669,0.952575,0.953484,0.952977,0.000393,4
4,8.349462,0.681444,0.72897,0.021935,1000,"(1, 1)","{'countvectorizer__max_features': 1000, 'count...",0.946433,0.948878,0.946839,0.94756,0.94872,0.947686,0.000979,5
9,15.6416,0.854656,1.029621,0.022725,20000,"(2, 2)","{'countvectorizer__max_features': 20000, 'coun...",0.914211,0.915308,0.913958,0.915682,0.914898,0.914812,0.000648,6
7,17.786582,2.410246,1.149909,0.115855,15000,"(2, 2)","{'countvectorizer__max_features': 15000, 'coun...",0.913177,0.913428,0.912046,0.914334,0.913049,0.913207,0.000734,7
3,17.729048,1.523099,1.216547,0.095294,5000,"(2, 2)","{'countvectorizer__max_features': 5000, 'count...",0.90929,0.908131,0.908096,0.909664,0.909382,0.908913,0.000664,8
1,15.834648,0.35511,1.132169,0.035889,2500,"(2, 2)","{'countvectorizer__max_features': 2500, 'count...",0.908538,0.908225,0.906623,0.908692,0.90841,0.908098,0.000753,9
5,15.975378,1.569348,1.271804,0.252898,1000,"(2, 2)","{'countvectorizer__max_features': 1000, 'count...",0.906563,0.907034,0.905902,0.907689,0.906968,0.906831,0.000588,10
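
When reading these scores, it helps to keep the class imbalance in mind. `GridSearchCV` was called without a `scoring` argument, so it falls back to the estimator's default score, which is accuracy for `LogisticRegression`. The sketch below computes the accuracy of always predicting the majority class, using the class supports from the classification reports above (train plus test).

```python
# Class supports taken from the classification reports (train + test).
non_toxic = 114_726 + 28_566
toxic = 12_887 + 3_338
total = non_toxic + toxic

# Accuracy of a trivial model that labels every comment as non-toxic.
baseline_accuracy = non_toxic / total
print(total, round(baseline_accuracy, 4))
```

The bigram-only configurations (mean accuracy of roughly 0.907 to 0.915) are therefore only slightly above this baseline of about 0.898, which is consistent with their low recall on the toxic class.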


# GridSearch with BOW and Logistic Regression (Train/Test Split)

In [59]:
X = df['clean_text']
y = df['toxic_type']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
y_train = y_train.values
y_test = y_test.values

In [64]:
pipeline = Pipeline([('countvectorizer', CountVectorizer()), 
                     ('clf', LogisticRegression(solver='lbfgs', max_iter=20000))])

parameters = {
    'countvectorizer__max_features': (2500, 5000, 1000, 15000, 20000),
    'countvectorizer__ngram_range': ((1,1), (2,2))
}
gs_clf = GridSearchCV(pipeline, parameters, cv=5)
gs_clf_mdoel = gs_clf.fit(X_train, y_train)

In [65]:
y_pred = gs_clf_mdoel.predict(X_test)

In [66]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.96      0.99      0.98     28566
           1       0.86      0.68      0.76      3338

    accuracy                           0.96     31904
   macro avg       0.91      0.84      0.87     31904
weighted avg       0.95      0.96      0.95     31904



In [67]:
y_pred_train = gs_clf_mdoel.predict(X_train)

In [68]:
print(classification_report(y_train, y_pred_train))

              precision    recall  f1-score   support

           0       0.98      1.00      0.99    114726
           1       0.96      0.78      0.86     12887

    accuracy                           0.97    127613
   macro avg       0.97      0.89      0.92    127613
weighted avg       0.97      0.97      0.97    127613



In [71]:
print(gs_clf_mdoel.best_estimator_.steps)

[('countvectorizer', CountVectorizer(max_features=20000)), ('clf', LogisticRegression(max_iter=20000))]


In [72]:
pd.DataFrame(gs_clf_mdoel.cv_results_).sort_values('mean_test_score', ascending=False)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_countvectorizer__max_features,param_countvectorizer__ngram_range,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
8,9.443439,0.660097,0.695917,0.067717,20000,"(1, 1)","{'countvectorizer__max_features': 20000, 'coun...",0.956079,0.958782,0.956588,0.956508,0.956861,0.956964,0.000943,1
6,8.603516,0.255752,0.621717,0.005848,15000,"(1, 1)","{'countvectorizer__max_features': 15000, 'coun...",0.955099,0.959057,0.956392,0.956273,0.95596,0.956556,0.001329,2
2,7.755386,0.256373,0.606523,0.009012,5000,"(1, 1)","{'countvectorizer__max_features': 5000, 'count...",0.954864,0.957999,0.954394,0.953883,0.954079,0.955044,0.001514,3
0,8.05414,0.51426,0.59012,0.018794,2500,"(1, 1)","{'countvectorizer__max_features': 2500, 'count...",0.95365,0.955452,0.952396,0.951493,0.952041,0.953006,0.001413,4
4,6.178097,0.46584,0.551079,0.004517,1000,"(1, 1)","{'countvectorizer__max_features': 1000, 'count...",0.948517,0.950515,0.947302,0.947927,0.947771,0.948407,0.001124,5
9,14.315039,2.002166,0.917382,0.096741,20000,"(2, 2)","{'countvectorizer__max_features': 20000, 'coun...",0.915018,0.916076,0.915096,0.91572,0.91666,0.915714,0.000615,6
7,13.376594,1.447049,0.948526,0.177171,15000,"(2, 2)","{'countvectorizer__max_features': 15000, 'coun...",0.912118,0.914234,0.912824,0.914544,0.914074,0.913559,0.000928,7
3,12.27789,0.445875,0.88012,0.098344,5000,"(2, 2)","{'countvectorizer__max_features': 5000, 'count...",0.909219,0.909807,0.908749,0.908785,0.909568,0.909226,0.000419,8
1,11.862666,0.514666,0.828159,0.037932,2500,"(2, 2)","{'countvectorizer__max_features': 2500, 'count...",0.908122,0.90918,0.907691,0.908236,0.908941,0.908434,0.000548,9
5,11.315789,0.338314,0.804489,0.013253,1000,"(2, 2)","{'countvectorizer__max_features': 1000, 'count...",0.906633,0.907652,0.907103,0.907178,0.90757,0.907227,0.000366,10


# GridSearch with TF-IDF and Logistic Regression

In [74]:
X = df['clean_text']
y = df['toxic_type']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
y_train = y_train.values
y_test = y_test.values

In [76]:
pipeline = Pipeline([('tfidf', TfidfVectorizer()), 
                     ('clf', LogisticRegression(solver='lbfgs', max_iter=20000))])

parameters = {
    'tfidf__max_features': (2500, 5000, 1000, 15000, 20000),
    'tfidf__ngram_range': ((1,1), (2,2))
}
tf_clf = GridSearchCV(pipeline, parameters, cv=5)
# Fit the TF-IDF grid search defined above (not the earlier BOW search, gs_clf)
tf_clf_model = tf_clf.fit(X_train, y_train)

In [79]:
y_pred_test = tf_clf_model.predict(X_test)

In [81]:
print(classification_report(y_test, y_pred_test))

              precision    recall  f1-score   support

           0       0.96      0.99      0.98     28566
           1       0.90      0.64      0.75      3338

    accuracy                           0.95     31904
   macro avg       0.93      0.81      0.86     31904
weighted avg       0.95      0.95      0.95     31904



In [82]:
y_pred_train = tf_clf_model.predict(X_train)

In [84]:
print(classification_report(y_train, y_pred_train))

              precision    recall  f1-score   support

           0       0.96      0.99      0.98    114726
           1       0.93      0.66      0.77     12887

    accuracy                           0.96    127613
   macro avg       0.95      0.83      0.87    127613
weighted avg       0.96      0.96      0.96    127613



In [85]:
print(tf_clf_model.best_estimator_.steps)

[('tfidf', TfidfVectorizer(max_features=5000)), ('clf', LogisticRegression(max_iter=20000))]


In [87]:
pd.DataFrame(tf_clf_model.cv_results_).sort_values('mean_test_score', ascending=False)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_tfidf__max_features,param_tfidf__ngram_range,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
2,3.904038,0.446561,0.636236,0.072595,5000,"(1, 1)","{'tfidf__max_features': 5000, 'tfidf__ngram_ra...",0.956196,0.958665,0.956314,0.957213,0.956155,0.956909,0.00096,1
6,3.674541,0.111841,0.604927,0.034133,15000,"(1, 1)","{'tfidf__max_features': 15000, 'tfidf__ngram_r...",0.956627,0.957842,0.955883,0.956782,0.955724,0.956572,0.000755,2
8,3.661916,0.274466,0.586216,0.005234,20000,"(1, 1)","{'tfidf__max_features': 20000, 'tfidf__ngram_r...",0.956353,0.957372,0.956196,0.956547,0.955803,0.956454,0.00052,3
0,3.640266,0.070743,0.604778,0.03285,2500,"(1, 1)","{'tfidf__max_features': 2500, 'tfidf__ngram_ra...",0.954903,0.957646,0.954786,0.954157,0.954118,0.955122,0.001302,4
4,3.220113,0.055676,0.544155,0.013458,1000,"(1, 1)","{'tfidf__max_features': 1000, 'tfidf__ngram_ra...",0.950398,0.950672,0.949261,0.950866,0.950357,0.950311,0.000557,5
9,9.993143,0.048513,0.860431,0.025296,20000,"(2, 2)","{'tfidf__max_features': 20000, 'tfidf__ngram_r...",0.914156,0.915527,0.915057,0.91525,0.915602,0.915118,0.000519,6
7,9.821631,0.140152,0.846451,0.036268,15000,"(2, 2)","{'tfidf__max_features': 15000, 'tfidf__ngram_r...",0.912824,0.915096,0.913842,0.91427,0.913721,0.913951,0.000741,7
3,10.072934,0.470357,0.870955,0.073766,5000,"(2, 2)","{'tfidf__max_features': 5000, 'tfidf__ngram_ra...",0.909297,0.909846,0.908592,0.909137,0.908941,0.909163,0.000415,8
1,9.655236,0.3499,0.785197,0.038589,2500,"(2, 2)","{'tfidf__max_features': 2500, 'tfidf__ngram_ra...",0.9082,0.908553,0.908122,0.90804,0.90804,0.908191,0.000191,9
5,9.409683,0.167591,0.749158,0.01088,1000,"(2, 2)","{'tfidf__max_features': 1000, 'tfidf__ngram_ra...",0.907143,0.907417,0.906986,0.906708,0.906982,0.907047,0.000232,10


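TF-IDF performs about as well as CountVectorizer (the best mean cross-validation accuracy is about 0.957 in both grid searches), but it does so with fewer features: 5,000 for TF-IDF versus 20,000 for BOW.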
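
Because both grid searches rank candidates by accuracy, one possible variation is to rank them by a metric of the toxic class instead. The sketch below is only an illustration on a made-up toy corpus: it passes `scoring="f1"` to `GridSearchCV` and uses the same `step__parameter` naming as the pipelines above. The data and the grid values, including `ngram_range=(1, 2)`, are assumptions and not results from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 1 = toxic, 0 = non-toxic.
texts = ["you idiot stop", "thanks for edit", "stupid idiot page", "nice page thanks",
         "stop stupid edit", "good edit thanks", "idiot stupid stop", "nice good page"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("clf", LogisticRegression(solver="lbfgs", max_iter=1000))])
param_grid = {"tfidf__ngram_range": [(1, 1), (1, 2)]}

# scoring="f1" ranks candidates by the F1 score of the positive (toxic) class.
search = GridSearchCV(pipe, param_grid, cv=2, scoring="f1")
search.fit(texts, labels)
print(search.best_params_)
```

On the real data the same change would only require adding the `scoring` argument to the existing `GridSearchCV` calls.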