# Modeling with preprocessing the data with only stopwords

## Albina Jetybayeva. DSE511

Now to see the effect of preprocessing the data, removal of only stopwords will be applied. Stopwords are the English words which does not add much meaning to a sentence. They can safely be ignored without sacrificing the meaning of the sentence. For example, the words like the, he, have etc. So removing of such words will be tested in this section. 

In [14]:
#Import the basic important libraries
import numpy as np
import pandas as pd

In [15]:
#Extract data from the file:
data = pd.read_csv('train.csv')
data.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [16]:
print('There are {} rows and {} columns in train'.format(data.shape[0],data.shape[1]))

There are 7613 rows and 5 columns in train


In [17]:
# drop id, keyword, and location columns for train datasets as only tweet text will be used

cols_to_drop = ["id", "keyword", "location"]
data_train = data.drop(cols_to_drop, axis=1)

data_train.head()

Unnamed: 0,text,target
0,Our Deeds are the Reason of this #earthquake M...,1
1,Forest fire near La Ronge Sask. Canada,1
2,All residents asked to 'shelter in place' are ...,1
3,"13,000 people receive #wildfires evacuation or...",1
4,Just got sent this photo from Ruby #Alaska as ...,1


In [18]:
#Small step preprocessing - lowercasing the words, so that stopwords such as "the" and "The" both can be removed

import re
import string
def preprocess(text):

    text=text.lower()
    
    return text
data1=data_train
data1['text'] = data1['text'].apply(preprocess)
data_processed = data1[data1["text"]!=''] #removes empty rows of text
data_processed.head()
data_processed


Unnamed: 0,text,target
0,our deeds are the reason of this #earthquake m...,1
1,forest fire near la ronge sask. canada,1
2,all residents asked to 'shelter in place' are ...,1
3,"13,000 people receive #wildfires evacuation or...",1
4,just got sent this photo from ruby #alaska as ...,1
...,...,...
7608,two giant cranes holding a bridge collapse int...,1
7609,@aria_ahrary @thetawniest the out of control w...,1
7610,m1.94 [01:04 utc]?5km s of volcano hawaii. htt...,1
7611,police investigating after an e-bike collided ...,1


In [19]:
#Split the data first and do all feature transformations after the test_train splitting on the train set only to avoid data leakage

from sklearn.model_selection import train_test_split
X = data_processed[["text"]] # Features
y = data_processed[["target"]] #Labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)

print("Training Data", len(y_train))
print("Testing Data", len(y_test))


Training Data 6851
Testing Data 762


In [20]:
#Import libraries for stopwords

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [21]:
#Apply stopwords removal to X_train

stop = stopwords.words('english')
X_train['tweet_without_stopwords'] = X_train['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
X_train

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train['tweet_without_stopwords'] = X_train['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))


Unnamed: 0,text,tweet_without_stopwords
3172,our builder is having a dental emergency. whic...,builder dental emergency. ruined plan emotiona...
2188,experts leave lab as malaysia confirms debris ...,experts leave lab malaysia confirms debris #mh...
1554,please stand up for bees against profit-hungry...,please stand bees profit-hungry chemical compa...
4485,#hot c-130 specially modified to land in a st...,#hot c-130 specially modified land stadium res...
2262,do you feel deluged by low self-image? take th...,feel deluged low self-image? take quiz: http:/...
...,...,...
905,all i need in this life of sin\nis just me and...,need life sin girlfriend ride till bloody end ...
5192,drunk meals 101: what to cook when you're tota...,drunk meals 101: cook totally obliterated http...
3980,monsoon flooding kills dozens in myanmar promp...,monsoon flooding kills dozens myanmar promptin...
235,just completely annihilated cech with paul kee...,completely annihilated cech paul keegan time a...


In [22]:
#Apply stopwords removal to X_test

stop = stopwords.words('english')
X_test['tweet_without_stopwords'] = X_test['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
X_test

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_test['tweet_without_stopwords'] = X_test['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))


Unnamed: 0,text,tweet_without_stopwords
3228,goulburn man henry van bilsen missing: emergen...,goulburn man henry van bilsen missing: emergen...
3706,the things we fear most in organizations--fluc...,things fear organizations--fluctuations distur...
6957,@tsunami_esh ?? hey esh,@tsunami_esh ?? hey esh
2887,@potus you until you drown by water entering t...,@potus drown water entering lungs. alive cause...
7464,crawling in my skin\nthese wounds they will no...,crawling skin wounds hea
...,...,...
7587,sirens everywhere!,sirens everywhere!
2584,@alanhahn @hdumpty39 daughtery would get destr...,@alanhahn @hdumpty39 daughtery would get destr...
7180,please allow me to reiterate it's not the weap...,please allow reiterate weapon mindset individu...
3658,#shipping #logistics enca | fatalities as migr...,#shipping #logistics enca | fatalities migrant...


## Bag of words

To convert data into a numerical vector use Bag of Words first.

In [23]:
#Import the needed libraries
import nltk
from nltk import word_tokenize
nltk.download('punkt')
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
# create bag of word representation of the text data
bow_vectorizer = CountVectorizer(dtype=np.float32,
                                 strip_accents='unicode',
                                 tokenizer=word_tokenize,
                                 ngram_range=(1,1), # keep the unigram
                                 analyzer='word', # feature should be made of word n-gram
                                 min_df=10, # ignore terms appeared less than 10 times
                                 max_df=0.75) # ignore terms appeared more than 75% of the tweets available

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [24]:
X_train = bow_vectorizer.fit_transform(X_train['tweet_without_stopwords'])
X_test = bow_vectorizer.transform(X_test['tweet_without_stopwords'])

In [25]:
bow_features = pd.DataFrame(X_train.toarray(), columns=sorted(bow_vectorizer.vocabulary_))
bow_features

Unnamed: 0,!,#,$,%,&,','conclusively,'d,'food,'it,...,young,youth,youtube,zone,|,~,u_,ui,uiwhen,uo
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6846,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6847,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6848,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6849,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Models

In [26]:
#Import important libraries first

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score
from sklearn.metrics import confusion_matrix

from sklearn.model_selection import GridSearchCV

## Logistic Regression Model

In [27]:
%%time
#Model training
Model = LogisticRegression(max_iter=200)
Model.fit(X_train, y_train.values.ravel())

Wall time: 186 ms


LogisticRegression(max_iter=200)

In [28]:
%%time
#Model predictng
y_pred = Model.predict(X_test)
print("Acuracy", accuracy_score(y_test, y_pred))
print("Macro precision_recall_fscore_support")
print(precision_recall_fscore_support(y_test, y_pred, average='macro'))
print("Micro precision_recall_fscore_support")
print(precision_recall_fscore_support(y_test, y_pred, average='micro'))
print("Weighted precision_recall_fscore_support")
print(precision_recall_fscore_support(y_test, y_pred, average='weighted'))


Acuracy 0.800524934383202
Macro precision_recall_fscore_support
(0.8005597343832638, 0.7883265685777101, 0.7922588513828606, None)
Micro precision_recall_fscore_support
(0.800524934383202, 0.800524934383202, 0.800524934383202, None)
Weighted precision_recall_fscore_support
(0.8005388543832268, 0.800524934383202, 0.7984584136331168, None)
Wall time: 11 ms


In [29]:
# Model Evaluation metrics 
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred)))
print('Precision Score : ' + str(precision_score(y_test,y_pred)))
print('Recall Score : ' + str(recall_score(y_test,y_pred)))
print('F1 Score : ' + str(f1_score(y_test,y_pred)))

#Logistic Regression Classifier Confusion matrix
print('Confusion Matrix : \n' + str(confusion_matrix(y_test,y_pred)))

Accuracy Score : 0.800524934383202
Precision Score : 0.8006993006993007
Recall Score : 0.7067901234567902
F1 Score : 0.7508196721311476
Confusion Matrix : 
[[381  57]
 [ 95 229]]


In [30]:
%%time
#Grid Search
from sklearn.model_selection import GridSearchCV
clf = LogisticRegression(max_iter=500)
grid_values = {'solver': ['lbfgs', 'newton-cg', 'sag', 'saga'], 'penalty': ['none', 'l2'],'C':[1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]} #'solver': ['newton-cg', 'lbfgs', 'liblinear'], , 'elasticnet', 'none'
grid_clf_acc = GridSearchCV(clf, param_grid = grid_values,scoring = 'recall')
grid_clf_acc.fit(X_train, y_train.values.ravel())

#Predict values based on new parameters
y_pred_acc = grid_clf_acc.predict(X_test)

# New Model Evaluation metrics 
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred_acc)))
print('Precision Score : ' + str(precision_score(y_test,y_pred_acc)))
print('Recall Score : ' + str(recall_score(y_test,y_pred_acc)))
print('F1 Score : ' + str(f1_score(y_test,y_pred_acc)))

#Logistic Regression (Grid Search) Confusion matrix
confusion_matrix(y_test,y_pred_acc)

result = grid_clf_acc.fit(X_train, y_train.values.ravel())
print('Best Score: %s' % result.best_score_)
print('Best Hyperparameters: %s' % result.best_params_)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt



Accuracy Score : 0.7979002624671916
Precision Score : 0.7891156462585034
Recall Score : 0.7160493827160493
F1 Score : 0.750809061488673


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt



Best Score: 0.7146240395959829
Best Hyperparameters: {'C': 1e-05, 'penalty': 'none', 'solver': 'sag'}
Wall time: 13min 54s




## K Neighbors Classifier

In [31]:
%%time

#Model training
Model2 = KNeighborsClassifier()
Model2.fit(X_train, y_train.values.ravel())

Wall time: 996 µs


KNeighborsClassifier()

In [32]:
%%time
#Model predicting
y_pred2 = Model2.predict(X_test)
print("Acuracy", accuracy_score(y_test, y_pred2))
print("Macro precision_recall_fscore_support")
print(precision_recall_fscore_support(y_test, y_pred2, average='macro'))
print("Micro precision_recall_fscore_support")
print(precision_recall_fscore_support(y_test, y_pred2, average='micro'))
print("Weighted precision_recall_fscore_support")
print(precision_recall_fscore_support(y_test, y_pred2, average='weighted'))

Acuracy 0.6758530183727034
Macro precision_recall_fscore_support
(0.7085946573751452, 0.6328851682732961, 0.620507981442164, None)
Micro precision_recall_fscore_support
(0.6758530183727034, 0.6758530183727034, 0.6758530183727034, None)
Weighted precision_recall_fscore_support
(0.7006191299258324, 0.6758530183727034, 0.6421895423015506, None)
Wall time: 182 ms


In [33]:
# Model Evaluation metrics 
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred2)))
print('Precision Score : ' + str(precision_score(y_test,y_pred2)))
print('Recall Score : ' + str(recall_score(y_test,y_pred2)))
print('F1 Score : ' + str(f1_score(y_test,y_pred2)))

#Logistic Regression Classifier Confusion matrix
print('Confusion Matrix : \n' + str(confusion_matrix(y_test,y_pred2)))

Accuracy Score : 0.6758530183727034
Precision Score : 0.7619047619047619
Recall Score : 0.345679012345679
F1 Score : 0.4755838641188959
Confusion Matrix : 
[[403  35]
 [212 112]]


In [34]:
%%time

#Grid Search
clf2 = KNeighborsClassifier()


k_range = list(range(1, 31))
param_grid2 = dict(n_neighbors=k_range)


grid_clf_acc2 = GridSearchCV(clf2, param_grid = param_grid2,scoring = 'recall')
grid_clf_acc2.fit(X_train, y_train.values.ravel())

#Predict values based on new parameters
y_pred_acc2 = grid_clf_acc2.predict(X_test)

# New Model Evaluation metrics 
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred_acc2)))
print('Precision Score : ' + str(precision_score(y_test,y_pred_acc2)))
print('Recall Score : ' + str(recall_score(y_test,y_pred_acc2)))
print('F1 Score : ' + str(f1_score(y_test,y_pred_acc2)))

#Confusion matrix
confusion_matrix(y_test,y_pred_acc2)

result2 = grid_clf_acc2.fit(X_train, y_train.values.ravel())
print('Best Score: %s' % result2.best_score_)
print('Best Hyperparameters: %s' % result2.best_params_)

Accuracy Score : 0.678477690288714
Precision Score : 0.6436363636363637
Recall Score : 0.5462962962962963
F1 Score : 0.5909849749582639
Best Score: 0.5096670599407211
Best Hyperparameters: {'n_neighbors': 1}
Wall time: 1min 22s


## Random Forest Classifier

In [35]:
%%time
#Model training
Model3 = RandomForestClassifier(random_state=0)
Model3.fit(X_train, y_train.values.ravel())

Wall time: 3.86 s


RandomForestClassifier(random_state=0)

In [36]:
%%time
#Model predicting
y_pred3 = Model3.predict(X_test)

print("Acuracy", accuracy_score(y_test, y_pred3))
print("Macro precision_recall_fscore_support")
print(precision_recall_fscore_support(y_test, y_pred3, average='macro'))
print("Micro precision_recall_fscore_support")
print(precision_recall_fscore_support(y_test, y_pred3, average='micro'))
print("Weighted precision_recall_fscore_support")
print(precision_recall_fscore_support(y_test, y_pred3, average='weighted'))

Acuracy 0.7965879265091863
Macro precision_recall_fscore_support
(0.8052326255946165, 0.7780737358362929, 0.7841379499410588, None)
Micro precision_recall_fscore_support
(0.7965879265091863, 0.7965879265091863, 0.7965879265091863, None)
Weighted precision_recall_fscore_support
(0.8013219283893028, 0.7965879265091863, 0.7918936730490727, None)
Wall time: 51.9 ms


In [37]:
# Model Evaluation metrics 
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred3)))
print('Precision Score : ' + str(precision_score(y_test,y_pred3)))
print('Recall Score : ' + str(recall_score(y_test,y_pred3)))
print('F1 Score : ' + str(f1_score(y_test,y_pred3)))

#Logistic Regression Classifier Confusion matrix
print('Confusion Matrix : \n' + str(confusion_matrix(y_test,y_pred3)))

Accuracy Score : 0.7965879265091863
Precision Score : 0.8313725490196079
Recall Score : 0.654320987654321
F1 Score : 0.7322970639032816
Confusion Matrix : 
[[395  43]
 [112 212]]


In [38]:
%%time
#Grid Search
clf3 = RandomForestClassifier()

# Create the parameter grid based on the results of random search 
param_grid3 = {
    
    'n_estimators': [10, 50, 100, 200, 300]
}


grid_clf_acc3 = GridSearchCV(clf3, param_grid = param_grid3,scoring = 'recall')
grid_clf_acc3.fit(X_train, y_train.values.ravel())

#Predict values based on new parameters
y_pred_acc3 = grid_clf_acc3.predict(X_test)

# New Model Evaluation metrics 
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred_acc3)))
print('Precision Score : ' + str(precision_score(y_test,y_pred_acc3)))
print('Recall Score : ' + str(recall_score(y_test,y_pred_acc3)))
print('F1 Score : ' + str(f1_score(y_test,y_pred_acc3)))

#Confusion matrix
confusion_matrix(y_test,y_pred_acc3)

result3 = grid_clf_acc3.fit(X_train, y_train.values.ravel())
print('Best Score: %s' % result3.best_score_)
print('Best Hyperparameters: %s' % result3.best_params_)

Accuracy Score : 0.7782152230971129
Precision Score : 0.8112449799196787
Recall Score : 0.6234567901234568
F1 Score : 0.7050610820244327
Best Score: 0.629789070818106
Best Hyperparameters: {'n_estimators': 200}
Wall time: 3min 8s


# Summary

* For Logistic Regression model the removal of stopwords did not improve much the performance and recall was even slightly lower compared to unprocessed data.
* For KNN and Random Forest the performance was almost same compared to unprocessed data, a slightly better results were for Random Forest.
* Thus, it can be concluded that application of stopwords removal might be useful, however, its alone application is not enough to improve the overall performance of the models. So this technique needs to be applied in a combination with other data preprocessing.