# Modeling with preprocessing the data and TFIDF instead of Bag of Words

## Albina Jetybayeva. DSE511

In this part Bag of words and TFIDF will be compared. Bag of Words just creates a set of vectors containing the count of word occurrences in the document (reviews), while the TF-IDF model contains information on the more important words and the less important ones as well. TFIDF stands for “Term Frequency — Inverse Document Frequency”. Term Frequency (tf): gives us the frequency of the word in each document in the corpus. It is the ratio of number of times the word appears in a document compared to the total number of words in that document. It increases as the number of occurrences of that word within the document increases. Inverse Data Frequency (idf): used to calculate the weight of rare words across all documents in the corpus. The words that occur rarely in the corpus have a high IDF score. Combining these two we come up with the TF-IDF score (w) for a word in a document in the corpus.

In [1]:
#Import the basic important libraries
import numpy as np
import pandas as pd

In [2]:
#Extract data from the file:
data = pd.read_csv('train.csv')
data.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [3]:
print('There are {} rows and {} columns in train'.format(data.shape[0],data.shape[1]))

There are 7613 rows and 5 columns in train


In [4]:
# drop id, keyword, and location columns for train datasets as only tweet text will be used

cols_to_drop = ["id", "keyword", "location"]
data_train = data.drop(cols_to_drop, axis=1)

data_train.head()

Unnamed: 0,text,target
0,Our Deeds are the Reason of this #earthquake M...,1
1,Forest fire near La Ronge Sask. Canada,1
2,All residents asked to 'shelter in place' are ...,1
3,"13,000 people receive #wildfires evacuation or...",1
4,Just got sent this photo from Ruby #Alaska as ...,1


In [5]:
#First, basic preprocessing will be donw with the removal of hyperlinks, punctuation, non-ascii, numbers

import re
import string
def preprocess(text):

    text=text.lower()
    # remove hyperlinks
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text)
    text = re.sub(r'http?:\/\/.*[\r\n]*', '', text)
    #Replace &amp, &lt, &gt with &,<,> respectively
    text=text.replace(r'&amp;?',r'and')
    text=text.replace(r'&lt;',r'<')
    text=text.replace(r'&gt;',r'>')
    #remove mentions
    text = re.sub(r"(?:\@)\w+", '', text)
    #remove non ascii chars
    text=text.encode("ascii",errors="ignore").decode()
    #remove some puncts (except . ! ?)
    text=re.sub(r'[:"#$%&\*+,-/:;<=>@\\^_`{|}~]+','',text)
    text=re.sub(r"'","",text)
    text=re.sub(r"\(","",text)
    text=re.sub(r"\)","",text)
    text = re.sub('<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    # remove numbers
    text=re.sub(r'\d+', '', text)
    text=" ".join(text.split())
    return text
data1=data_train
data1['text'] = data1['text'].apply(preprocess)
data_processed = data1[data1["text"]!=''] #removes empty rows of text
data_processed.head()
data_processed


Unnamed: 0,text,target
0,our deeds are the reason of this earthquake ma...,1
1,forest fire near la ronge sask canada,1
2,all residents asked to shelter in place are be...,1
3,people receive wildfires evacuation orders in ...,1
4,just got sent this photo from ruby alaska as s...,1
...,...,...
7608,two giant cranes holding a bridge collapse int...,1
7609,the out of control wild fires in california ev...,1
7610,m utckm s of volcano hawaii,1
7611,police investigating after an ebike collided w...,1


In [6]:
#Split the data first and do all feature transformations after the test_train splitting on the train set only to avoid data leakage

from sklearn.model_selection import train_test_split
X = data_processed[["text"]] # Features
y = data_processed[["target"]] #Labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)

print("Training Data", len(y_train))
print("Testing Data", len(y_test))

Training Data 6804
Testing Data 757


In [7]:
#Import needed libraries

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
#Apply stopwprds removal on X_train

stop = stopwords.words('english')
X_train['tweet_without_stopwords'] = X_train['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
X_train

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train['tweet_without_stopwords'] = X_train['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))


Unnamed: 0,text,tweet_without_stopwords
5319,breaking th death confirmed in legionnaires ou...,breaking th death confirmed legionnaires outbr...
5793,people arent rioting because justice has been ...,people arent rioting justice served murderer b...
4503,hurricane stm quem lembra,hurricane stm quem lembra
7394,my fifty online dates and why im still single ...,fifty online dates im still single michael win...
226,day of tryouts went good minus the fact i stop...,day tryouts went good minus fact stopped quick...
...,...,...
916,ive just watched episode se of bloody monday,ive watched episode se bloody monday
5228,the removal of all traces of something obliter...,removal traces something obliteration
4009,etp bengal floods cm mamata banerjee blames dv...,etp bengal floods cm mamata banerjee blames dv...
243,hell is just a fraction of his belief of total...,hell fraction belief total annihilation destru...


In [9]:
#Apply stopwprds removal on X_test

stop = stopwords.words('english')
X_test['tweet_without_stopwords'] = X_test['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
X_test

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_test['tweet_without_stopwords'] = X_test['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))


Unnamed: 0,text,tweet_without_stopwords
1763,tells of the tragic midmorning collision which...,tells tragic midmorning collision claimed life...
834,a blizzard would be clutch asf,blizzard would clutch asf
6756,im a tornado looking for a soul to take,im tornado looking soul take
5338,world class tgirl ass scene pandemonium,world class tgirl ass scene pandemonium
5731,video were picking up bodies from water rescue...,video picking bodies water rescuers searching ...
...,...,...
561,do your own thing the battle of internal vs ex...,thing battle internal vs external motivation
6773,hey lets challenge then to a tornado tag tlc m...,hey lets challenge tornado tag tlc match winne...
7112,dramatic video shows plane landing during viol...,dramatic video shows plane landing violent storm
2357,dont think they will paint the lab building ca...,dont think paint lab building cause planning d...


In [10]:
#Apply lemmatizing function to X_train 

nltk.download('wordnet')
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

X_train['text_lemmatized'] = X_train.tweet_without_stopwords.apply(lemmatize_text)
X_train['text_lemmatized'] = X_train['text_lemmatized'].apply(lambda x: " ".join(x)) #convert list to one string

X_train


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train['text_lemmatized'] = X_train.tweet_without_stopwords.apply(lemmatize_text)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train['text_lemmatized'] = X_train['text_lemmatized'].apply(lambda x: " ".join(x)) #convert list to one string


Unnamed: 0,text,tweet_without_stopwords,text_lemmatized
5319,breaking th death confirmed in legionnaires ou...,breaking th death confirmed legionnaires outbr...,breaking th death confirmed legionnaire outbre...
5793,people arent rioting because justice has been ...,people arent rioting justice served murderer b...,people arent rioting justice served murderer b...
4503,hurricane stm quem lembra,hurricane stm quem lembra,hurricane stm quem lembra
7394,my fifty online dates and why im still single ...,fifty online dates im still single michael win...,fifty online date im still single michael wind...
226,day of tryouts went good minus the fact i stop...,day tryouts went good minus fact stopped quick...,day tryout went good minus fact stopped quickl...
...,...,...,...
916,ive just watched episode se of bloody monday,ive watched episode se bloody monday,ive watched episode se bloody monday
5228,the removal of all traces of something obliter...,removal traces something obliteration,removal trace something obliteration
4009,etp bengal floods cm mamata banerjee blames dv...,etp bengal floods cm mamata banerjee blames dv...,etp bengal flood cm mamata banerjee blame dvc ...
243,hell is just a fraction of his belief of total...,hell fraction belief total annihilation destru...,hell fraction belief total annihilation destru...


In [11]:
#Apply lemmatizing function to X_test 

X_test['text_lemmatized'] = X_test.tweet_without_stopwords.apply(lemmatize_text)
X_test['text_lemmatized'] = X_test['text_lemmatized'].apply(lambda x: " ".join(x)) #convert list to one string

X_test

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_test['text_lemmatized'] = X_test.tweet_without_stopwords.apply(lemmatize_text)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_test['text_lemmatized'] = X_test['text_lemmatized'].apply(lambda x: " ".join(x)) #convert list to one string


Unnamed: 0,text,tweet_without_stopwords,text_lemmatized
1763,tells of the tragic midmorning collision which...,tells tragic midmorning collision claimed life...,tell tragic midmorning collision claimed life ...
834,a blizzard would be clutch asf,blizzard would clutch asf,blizzard would clutch asf
6756,im a tornado looking for a soul to take,im tornado looking soul take,im tornado looking soul take
5338,world class tgirl ass scene pandemonium,world class tgirl ass scene pandemonium,world class tgirl as scene pandemonium
5731,video were picking up bodies from water rescue...,video picking bodies water rescuers searching ...,video picking body water rescuer searching hun...
...,...,...,...
561,do your own thing the battle of internal vs ex...,thing battle internal vs external motivation,thing battle internal v external motivation
6773,hey lets challenge then to a tornado tag tlc m...,hey lets challenge tornado tag tlc match winne...,hey let challenge tornado tag tlc match winner...
7112,dramatic video shows plane landing during viol...,dramatic video shows plane landing violent storm,dramatic video show plane landing violent storm
2357,dont think they will paint the lab building ca...,dont think paint lab building cause planning d...,dont think paint lab building cause planning d...


## TFIDF

In [12]:
#Import the needed libraries

import nltk
from nltk import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# create tfidf representation of the text data

vectorizer3 = TfidfVectorizer(
                                 strip_accents='unicode',
                                 tokenizer=word_tokenize,
                                 ngram_range=(1,1), # keep the unigram
                                 analyzer='word', # feature should be made of word n-gram
                                 min_df=10, # ignore terms appeared less than 10 times
                                 max_df=0.75) # ignore terms appeared more than 75% of the tweets available)

In [13]:
X_train = vectorizer3.fit_transform(X_train['text_lemmatized'])
X_test = vectorizer3.transform(X_test['text_lemmatized'])

In [14]:
bow_features = pd.DataFrame(X_train.toarray(), columns=sorted(vectorizer3.vocabulary_))
bow_features

Unnamed: 0,aba,abc,ablaze,absolutely,accident,according,account,across,act,action,...,yet,york,youll,young,youre,youth,youtube,youve,yr,zone
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6799,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6800,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6801,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6802,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Models

In [15]:
#Import important libraries first

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score
from sklearn.metrics import confusion_matrix

from sklearn.model_selection import GridSearchCV

## Logistic Regression

In [16]:
%%time
#Model training
Model = LogisticRegression(max_iter=200)
Model.fit(X_train, y_train.values.ravel())

Wall time: 62.9 ms


LogisticRegression(max_iter=200)

In [17]:
%%time
#Model predicting
y_pred = Model.predict(X_test)
print("Acuracy", accuracy_score(y_test, y_pred))
print("Macro precision_recall_fscore_support")
print(precision_recall_fscore_support(y_test, y_pred, average='macro'))
print("Micro precision_recall_fscore_support")
print(precision_recall_fscore_support(y_test, y_pred, average='micro'))
print("Weighted precision_recall_fscore_support")
print(precision_recall_fscore_support(y_test, y_pred, average='weighted'))

Acuracy 0.7833553500660502
Macro precision_recall_fscore_support
(0.7834703175612266, 0.768071599318647, 0.7725493932109984, None)
Micro precision_recall_fscore_support
(0.7833553500660502, 0.7833553500660502, 0.7833553500660502, None)
Weighted precision_recall_fscore_support
(0.7834054780734258, 0.7833553500660502, 0.780342780276157, None)
Wall time: 19 ms


In [18]:
# Model Evaluation metrics 
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred)))
print('Precision Score : ' + str(precision_score(y_test,y_pred)))
print('Recall Score : ' + str(recall_score(y_test,y_pred)))
print('F1 Score : ' + str(f1_score(y_test,y_pred)))

#Logistic Regression Classifier Confusion matrix
print('Confusion Matrix : \n' + str(confusion_matrix(y_test,y_pred)))

Accuracy Score : 0.7833553500660502
Precision Score : 0.7838827838827839
Recall Score : 0.670846394984326
F1 Score : 0.7229729729729728
Confusion Matrix : 
[[379  59]
 [105 214]]


In [19]:
%%time
#Grid Search
clf = LogisticRegression(max_iter=200)
grid_values = {'solver': ['lbfgs', 'newton-cg', 'sag', 'saga'], 'penalty': ['none', 'l2'],'C':[1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]} #'solver': ['newton-cg', 'lbfgs', 'liblinear'], , 'elasticnet', 'none'
grid_clf_acc = GridSearchCV(clf, param_grid = grid_values,scoring = 'recall')
grid_clf_acc.fit(X_train, y_train.values.ravel())

#Predict values based on new parameters
y_pred_acc = grid_clf_acc.predict(X_test)

# New Model Evaluation metrics 
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred_acc)))
print('Precision Score : ' + str(precision_score(y_test,y_pred_acc)))
print('Recall Score : ' + str(recall_score(y_test,y_pred_acc)))
print('F1 Score : ' + str(f1_score(y_test,y_pred_acc)))

#Confusion matrix
confusion_matrix(y_test,y_pred_acc)

result = grid_clf_acc.fit(X_train, y_train.values.ravel())
print('Best Score: %s' % result.best_score_)
print('Best Hyperparameters: %s' % result.best_params_)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

Accuracy Score : 0.750330250990753
Precision Score : 0.7044025157232704
Recall Score : 0.7021943573667712
F1 Score : 0.7032967032967032


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

Best Score: 0.703215775502465
Best Hyperparameters: {'C': 1e-05, 'penalty': 'none', 'solver': 'newton-cg'}
Wall time: 3min 3s


## K Neighbor Classifier

In [20]:
%%time
#Model training
Model2 = KNeighborsClassifier()
Model2.fit(X_train, y_train.values.ravel())

Wall time: 2.16 ms


KNeighborsClassifier()

In [21]:
%%time
#Model predicting
y_pred2 = Model2.predict(X_test)
print("Acuracy", accuracy_score(y_test, y_pred2))
print("Macro precision_recall_fscore_support")
print(precision_recall_fscore_support(y_test, y_pred2, average='macro'))
print("Micro precision_recall_fscore_support")
print(precision_recall_fscore_support(y_test, y_pred2, average='micro'))
print("Weighted precision_recall_fscore_support")
print(precision_recall_fscore_support(y_test, y_pred2, average='weighted'))

Acuracy 0.7173051519154557
Macro precision_recall_fscore_support
(0.7422498465316145, 0.6807589356006928, 0.6804773175542407, None)
Micro precision_recall_fscore_support
(0.7173051519154557, 0.7173051519154557, 0.7173051519154559, None)
Weighted precision_recall_fscore_support
(0.7347348625839616, 0.7173051519154557, 0.6975298945541807, None)
Wall time: 171 ms


In [22]:
# Model Evaluation metrics 
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred2)))
print('Precision Score : ' + str(precision_score(y_test,y_pred2)))
print('Recall Score : ' + str(recall_score(y_test,y_pred2)))
print('F1 Score : ' + str(f1_score(y_test,y_pred2)))

#Logistic Regression Classifier Confusion matrix
print('Confusion Matrix : \n' + str(confusion_matrix(y_test,y_pred2)))

Accuracy Score : 0.7173051519154557
Precision Score : 0.7900552486187845
Recall Score : 0.4482758620689655
F1 Score : 0.572
Confusion Matrix : 
[[400  38]
 [176 143]]


In [23]:
%%time

#Grid Search
from sklearn.model_selection import GridSearchCV
clf2 = KNeighborsClassifier()


k_range = list(range(1, 31))
param_grid2 = dict(n_neighbors=k_range)


grid_clf_acc2 = GridSearchCV(clf2, param_grid = param_grid2,scoring = 'recall')
grid_clf_acc2.fit(X_train, y_train.values.ravel())

#Predict values based on new parameters
y_pred_acc2 = grid_clf_acc2.predict(X_test)

# New Model Evaluation metrics 
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred_acc2)))
print('Precision Score : ' + str(precision_score(y_test,y_pred_acc2)))
print('Recall Score : ' + str(recall_score(y_test,y_pred_acc2)))
print('F1 Score : ' + str(f1_score(y_test,y_pred_acc2)))

#Confusion matrix
confusion_matrix(y_test,y_pred_acc2)

result2 = grid_clf_acc2.fit(X_train, y_train.values.ravel())
print('Best Score: %s' % result2.best_score_)
print('Best Hyperparameters: %s' % result2.best_params_)

Accuracy Score : 0.7093791281373845
Precision Score : 0.7004048582995951
Recall Score : 0.542319749216301
F1 Score : 0.6113074204946997
Best Score: 0.5477722353490272
Best Hyperparameters: {'n_neighbors': 1}
Wall time: 1min 39s


## Random Forest Classifier

In [24]:
%%time
#Model training
Model3 = RandomForestClassifier(random_state=0)
Model3.fit(X_train, y_train.values.ravel())


Wall time: 9.48 s


RandomForestClassifier(random_state=0)

In [25]:
%%time
#Model predicting
y_pred3 = Model3.predict(X_test)

print("Acuracy", accuracy_score(y_test, y_pred3))
print("Macro precision_recall_fscore_support")
print(precision_recall_fscore_support(y_test, y_pred3, average='macro'))
print("Micro precision_recall_fscore_support")
print(precision_recall_fscore_support(y_test, y_pred3, average='micro'))
print("Weighted precision_recall_fscore_support")
print(precision_recall_fscore_support(y_test, y_pred3, average='weighted'))

Acuracy 0.7859973579920739
Macro precision_recall_fscore_support
(0.7819259248440946, 0.7763165428493723, 0.7785302531206657, None)
Micro precision_recall_fscore_support
(0.7859973579920739, 0.7859973579920739, 0.7859973579920739, None)
Weighted precision_recall_fscore_support
(0.7849730980806326, 0.7859973579920739, 0.784922954413454, None)
Wall time: 139 ms


In [26]:
# Model Evaluation metrics 
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred3)))
print('Precision Score : ' + str(precision_score(y_test,y_pred3)))
print('Recall Score : ' + str(recall_score(y_test,y_pred3)))
print('F1 Score : ' + str(f1_score(y_test,y_pred3)))

#Logistic Regression Classifier Confusion matrix
print('Confusion Matrix : \n' + str(confusion_matrix(y_test,y_pred3)))

Accuracy Score : 0.7859973579920739
Precision Score : 0.7625418060200669
Recall Score : 0.7147335423197492
F1 Score : 0.7378640776699029
Confusion Matrix : 
[[367  71]
 [ 91 228]]


In [27]:
%%time
#Grid Search
from sklearn.model_selection import GridSearchCV
clf3 = RandomForestClassifier()


# Create the parameter grid based on the results of random search 
param_grid3 = {
    
    'n_estimators': [10, 50, 100, 200, 300]
}


grid_clf_acc3 = GridSearchCV(clf3, param_grid = param_grid3,scoring = 'recall')
grid_clf_acc3.fit(X_train, y_train.values.ravel())

#Predict values based on new parameters
y_pred_acc3 = grid_clf_acc3.predict(X_test)

# New Model Evaluation metrics 
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred_acc3)))
print('Precision Score : ' + str(precision_score(y_test,y_pred_acc3)))
print('Recall Score : ' + str(recall_score(y_test,y_pred_acc3)))
print('F1 Score : ' + str(f1_score(y_test,y_pred_acc3)))

#Confusion matrix
confusion_matrix(y_test,y_pred_acc3)

result3 = grid_clf_acc3.fit(X_train, y_train.values.ravel())
print('Best Score: %s' % result3.best_score_)
print('Best Hyperparameters: %s' % result3.best_params_)

Accuracy Score : 0.7899603698811096
Precision Score : 0.7721088435374149
Recall Score : 0.7115987460815048
F1 Score : 0.7406199021207177
Best Score: 0.6950147311922056
Best Hyperparameters: {'n_estimators': 100}
Wall time: 4min 49s


# Summary

* It can be seen that for Logistic Regression there was no large difference in results between tfidf and bag of words, and Recall was even lower in tfidf one. So for this model bag of words is more preferable.
* For KNN although precision was higher with tfidf, the recall is still lower, which is not favorable as the focus is to improve the recall firstly.
* For Random forest the results are similar and recall is higher for bag of words case, so again bag of words is more preferable then over tfidf in this case.
* Such effect of tfidf can be explained by more processing of data that is done by tfidf and as the result the removal possible important keywords for models.