# Modeling with preprocessing the data (all techniques)

## Albina Jetybayeva. DSE511

Now to see the effect of preprocessing the data, all preprocessing techniques will be combined: removing the things like punctuation, numbers, links, ascii, the removal of stopwords and lemmatizing.

In [12]:
#Import the basic important libraries
import numpy as np
import pandas as pd

In [13]:
#Extract data from the file:
data = pd.read_csv('train.csv')
data.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [14]:
print('There are {} rows and {} columns in train'.format(data.shape[0],data.shape[1]))

There are 7613 rows and 5 columns in train


In [15]:
# drop id, keyword, and location columns for train datasets as only tweet text will be used

cols_to_drop = ["id", "keyword", "location"]
data_train = data.drop(cols_to_drop, axis=1)

data_train.head()

Unnamed: 0,text,target
0,Our Deeds are the Reason of this #earthquake M...,1
1,Forest fire near La Ronge Sask. Canada,1
2,All residents asked to 'shelter in place' are ...,1
3,"13,000 people receive #wildfires evacuation or...",1
4,Just got sent this photo from Ruby #Alaska as ...,1


In [16]:
#First, basic preprocessing will be donw with the removal of hyperlinks, punctuation, non-ascii, numbers

import re
import string
def preprocess(text):

    text=text.lower()
    # remove hyperlinks
    text = re.sub(r'https?:\/\/.*[\r\n]*', '', text)
    text = re.sub(r'http?:\/\/.*[\r\n]*', '', text)
    #Replace &amp, &lt, &gt with &,<,> respectively
    text=text.replace(r'&amp;?',r'and')
    text=text.replace(r'&lt;',r'<')
    text=text.replace(r'&gt;',r'>')
    #remove mentions
    text = re.sub(r"(?:\@)\w+", '', text)
    #remove non ascii chars
    text=text.encode("ascii",errors="ignore").decode()
    #remove some puncts (except . ! ?)
    text=re.sub(r'[:"#$%&\*+,-/:;<=>@\\^_`{|}~]+','',text)
    text=re.sub(r"'","",text)
    text=re.sub(r"\(","",text)
    text=re.sub(r"\)","",text)
    text = re.sub('<.*?>+', '', text)
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    # remove numbers
    text=re.sub(r'\d+', '', text)
    text=" ".join(text.split())
    return text
data1=data_train
data1['text'] = data1['text'].apply(preprocess)
data_processed = data1[data1["text"]!=''] #removes empty rows of text
data_processed.head()
data_processed


Unnamed: 0,text,target
0,our deeds are the reason of this earthquake ma...,1
1,forest fire near la ronge sask canada,1
2,all residents asked to shelter in place are be...,1
3,people receive wildfires evacuation orders in ...,1
4,just got sent this photo from ruby alaska as s...,1
...,...,...
7608,two giant cranes holding a bridge collapse int...,1
7609,the out of control wild fires in california ev...,1
7610,m utckm s of volcano hawaii,1
7611,police investigating after an ebike collided w...,1


In [17]:
#Split the data first and do all feature transformations after the test_train splitting on the train set only to avoid data leakage

from sklearn.model_selection import train_test_split
X = data_processed[["text"]] # Features
y = data_processed[["target"]] #Labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)

print("Training Data", len(y_train))
print("Testing Data", len(y_test))

Training Data 6804
Testing Data 757


In [18]:
#Import libraries for stopwords

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [19]:
#Apply stopwords removal to X_train

stop = stopwords.words('english')
X_train['tweet_without_stopwords'] = X_train['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
X_train

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train['tweet_without_stopwords'] = X_train['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))


Unnamed: 0,text,tweet_without_stopwords
5319,breaking th death confirmed in legionnaires ou...,breaking th death confirmed legionnaires outbr...
5793,people arent rioting because justice has been ...,people arent rioting justice served murderer b...
4503,hurricane stm quem lembra,hurricane stm quem lembra
7394,my fifty online dates and why im still single ...,fifty online dates im still single michael win...
226,day of tryouts went good minus the fact i stop...,day tryouts went good minus fact stopped quick...
...,...,...
916,ive just watched episode se of bloody monday,ive watched episode se bloody monday
5228,the removal of all traces of something obliter...,removal traces something obliteration
4009,etp bengal floods cm mamata banerjee blames dv...,etp bengal floods cm mamata banerjee blames dv...
243,hell is just a fraction of his belief of total...,hell fraction belief total annihilation destru...


In [20]:
#Apply stopwords removal to X_test

stop = stopwords.words('english')
X_test['tweet_without_stopwords'] = X_test['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
X_test

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_test['tweet_without_stopwords'] = X_test['text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))


Unnamed: 0,text,tweet_without_stopwords
1763,tells of the tragic midmorning collision which...,tells tragic midmorning collision claimed life...
834,a blizzard would be clutch asf,blizzard would clutch asf
6756,im a tornado looking for a soul to take,im tornado looking soul take
5338,world class tgirl ass scene pandemonium,world class tgirl ass scene pandemonium
5731,video were picking up bodies from water rescue...,video picking bodies water rescuers searching ...
...,...,...
561,do your own thing the battle of internal vs ex...,thing battle internal vs external motivation
6773,hey lets challenge then to a tornado tag tlc m...,hey lets challenge tornado tag tlc match winne...
7112,dramatic video shows plane landing during viol...,dramatic video shows plane landing violent storm
2357,dont think they will paint the lab building ca...,dont think paint lab building cause planning d...


In [21]:
#Import lemmatizing libraries
nltk.download('wordnet')
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

#Apply lemmatizing function to X_train 

def lemmatize_text(text):
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text)]

X_train['text_lemmatized'] = X_train.tweet_without_stopwords.apply(lemmatize_text)
X_train['text_lemmatized'] = X_train['text_lemmatized'].apply(lambda x: " ".join(x)) #convert list to one string

X_train


[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train['text_lemmatized'] = X_train.tweet_without_stopwords.apply(lemmatize_text)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_train['text_lemmatized'] = X_train['text_lemmatized'].apply(lambda x: " ".join(x)) #convert list to one string


Unnamed: 0,text,tweet_without_stopwords,text_lemmatized
5319,breaking th death confirmed in legionnaires ou...,breaking th death confirmed legionnaires outbr...,breaking th death confirmed legionnaire outbre...
5793,people arent rioting because justice has been ...,people arent rioting justice served murderer b...,people arent rioting justice served murderer b...
4503,hurricane stm quem lembra,hurricane stm quem lembra,hurricane stm quem lembra
7394,my fifty online dates and why im still single ...,fifty online dates im still single michael win...,fifty online date im still single michael wind...
226,day of tryouts went good minus the fact i stop...,day tryouts went good minus fact stopped quick...,day tryout went good minus fact stopped quickl...
...,...,...,...
916,ive just watched episode se of bloody monday,ive watched episode se bloody monday,ive watched episode se bloody monday
5228,the removal of all traces of something obliter...,removal traces something obliteration,removal trace something obliteration
4009,etp bengal floods cm mamata banerjee blames dv...,etp bengal floods cm mamata banerjee blames dv...,etp bengal flood cm mamata banerjee blame dvc ...
243,hell is just a fraction of his belief of total...,hell fraction belief total annihilation destru...,hell fraction belief total annihilation destru...


In [22]:
#Apply lemmatizing function to X_test

X_test['text_lemmatized'] = X_test.tweet_without_stopwords.apply(lemmatize_text)
X_test['text_lemmatized'] = X_test['text_lemmatized'].apply(lambda x: " ".join(x)) #convert list to one string

X_test

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_test['text_lemmatized'] = X_test.tweet_without_stopwords.apply(lemmatize_text)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  X_test['text_lemmatized'] = X_test['text_lemmatized'].apply(lambda x: " ".join(x)) #convert list to one string


Unnamed: 0,text,tweet_without_stopwords,text_lemmatized
1763,tells of the tragic midmorning collision which...,tells tragic midmorning collision claimed life...,tell tragic midmorning collision claimed life ...
834,a blizzard would be clutch asf,blizzard would clutch asf,blizzard would clutch asf
6756,im a tornado looking for a soul to take,im tornado looking soul take,im tornado looking soul take
5338,world class tgirl ass scene pandemonium,world class tgirl ass scene pandemonium,world class tgirl as scene pandemonium
5731,video were picking up bodies from water rescue...,video picking bodies water rescuers searching ...,video picking body water rescuer searching hun...
...,...,...,...
561,do your own thing the battle of internal vs ex...,thing battle internal vs external motivation,thing battle internal v external motivation
6773,hey lets challenge then to a tornado tag tlc m...,hey lets challenge tornado tag tlc match winne...,hey let challenge tornado tag tlc match winner...
7112,dramatic video shows plane landing during viol...,dramatic video shows plane landing violent storm,dramatic video show plane landing violent storm
2357,dont think they will paint the lab building ca...,dont think paint lab building cause planning d...,dont think paint lab building cause planning d...


## Bag of words

To convert data into a numerical vector use Bag of Words first.

In [23]:
#Import the needed libraries
import nltk
from nltk import word_tokenize
nltk.download('punkt')
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
# create bag of word representation of the text data
bow_vectorizer = CountVectorizer(dtype=np.float32,
                                 strip_accents='unicode',
                                 tokenizer=word_tokenize,
                                 ngram_range=(1,1), # keep the unigram
                                 analyzer='word', # feature should be made of word n-gram
                                 min_df=10, # ignore terms appeared less than 10 times
                                 max_df=0.75) # ignore terms appeared more than 75% of the tweets available

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\PC\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [24]:
X_train = bow_vectorizer.fit_transform(X_train['text_lemmatized'])
X_test = bow_vectorizer.transform(X_test['text_lemmatized'])

In [25]:
bow_features = pd.DataFrame(X_train.toarray(), columns=sorted(bow_vectorizer.vocabulary_))
bow_features

Unnamed: 0,aba,abc,ablaze,absolutely,accident,according,account,across,act,action,...,yet,york,youll,young,youre,youth,youtube,youve,yr,zone
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6799,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6800,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6801,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6802,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Models

In [26]:
#Import important libraries first

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score
from sklearn.metrics import confusion_matrix

from sklearn.model_selection import GridSearchCV

## Logistic Regression

In [27]:
%%time
#Model training
Model = LogisticRegression(max_iter=200)
Model.fit(X_train, y_train.values.ravel())

Wall time: 93 ms


LogisticRegression(max_iter=200)

In [28]:
%%time
#Model predicting
y_pred = Model.predict(X_test)
print("Acuracy", accuracy_score(y_test, y_pred))
print("Macro precision_recall_fscore_support")
print(precision_recall_fscore_support(y_test, y_pred, average='macro'))
print("Micro precision_recall_fscore_support")
print(precision_recall_fscore_support(y_test, y_pred, average='micro'))
print("Weighted precision_recall_fscore_support")
print(precision_recall_fscore_support(y_test, y_pred, average='weighted'))

Acuracy 0.7886393659180978
Macro precision_recall_fscore_support
(0.7857395690456175, 0.7773221110490831, 0.7803568837951547, None)
Micro precision_recall_fscore_support
(0.7886393659180978, 0.7886393659180978, 0.7886393659180978, None)
Weighted precision_recall_fscore_support
(0.787711430918904, 0.7886393659180978, 0.7870617502756325, None)
Wall time: 16.7 ms


In [29]:
# Model Evaluation metrics 
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred)))
print('Precision Score : ' + str(precision_score(y_test,y_pred)))
print('Recall Score : ' + str(recall_score(y_test,y_pred)))
print('F1 Score : ' + str(f1_score(y_test,y_pred)))

#Logistic Regression Classifier Confusion matrix
print('Confusion Matrix : \n' + str(confusion_matrix(y_test,y_pred)))

Accuracy Score : 0.7886393659180978
Precision Score : 0.7731958762886598
Recall Score : 0.7053291536050157
F1 Score : 0.737704918032787
Confusion Matrix : 
[[372  66]
 [ 94 225]]


In [30]:
%%time
#Grid Search
clf = LogisticRegression(max_iter=500)
grid_values = {'solver': ['lbfgs', 'newton-cg', 'sag', 'saga'], 'penalty': ['none', 'l2'],'C':[1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]} #'solver': ['newton-cg', 'lbfgs', 'liblinear'], , 'elasticnet', 'none'
grid_clf_acc = GridSearchCV(clf, param_grid = grid_values,scoring = 'recall')
grid_clf_acc.fit(X_train, y_train.values.ravel())

#Predict values based on new parameters
y_pred_acc = grid_clf_acc.predict(X_test)

# New Model Evaluation metrics 
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred_acc)))
print('Precision Score : ' + str(precision_score(y_test,y_pred_acc)))
print('Recall Score : ' + str(recall_score(y_test,y_pred_acc)))
print('F1 Score : ' + str(f1_score(y_test,y_pred_acc)))

# Confusion matrix
confusion_matrix(y_test,y_pred_acc)

result = grid_clf_acc.fit(X_train, y_train.values.ravel())
print('Best Score: %s' % result.best_score_)
print('Best Hyperparameters: %s' % result.best_params_)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(




Accuracy Score : 0.7569352708058125
Precision Score : 0.7129337539432177
Recall Score : 0.7084639498432602
F1 Score : 0.7106918238993711


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(




Best Score: 0.7056071876549692
Best Hyperparameters: {'C': 0.0001, 'penalty': 'none', 'solver': 'sag'}
Wall time: 8min 41s




## K Neighbors Classifier

In [31]:
%%time
#Model training
Model2 = KNeighborsClassifier()
Model2.fit(X_train, y_train.values.ravel())

Wall time: 1.96 ms


KNeighborsClassifier()

In [32]:
%%time
#Model predicting 
y_pred2 = Model2.predict(X_test)
print("Acuracy", accuracy_score(y_test, y_pred2))
print("Macro precision_recall_fscore_support")
print(precision_recall_fscore_support(y_test, y_pred2, average='macro'))
print("Micro precision_recall_fscore_support")
print(precision_recall_fscore_support(y_test, y_pred2, average='micro'))
print("Weighted precision_recall_fscore_support")
print(precision_recall_fscore_support(y_test, y_pred2, average='weighted'))

Acuracy 0.7278731836195509
Macro precision_recall_fscore_support
(0.7459024513247239, 0.6954273485922045, 0.6977446465398273, None)
Micro precision_recall_fscore_support
(0.7278731836195509, 0.7278731836195509, 0.7278731836195509, None)
Weighted precision_recall_fscore_support
(0.7399261759572711, 0.7278731836195509, 0.7127458846673883, None)
Wall time: 232 ms


In [33]:
# Model Evaluation metrics 
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred2)))
print('Precision Score : ' + str(precision_score(y_test,y_pred2)))
print('Recall Score : ' + str(recall_score(y_test,y_pred2)))
print('F1 Score : ' + str(f1_score(y_test,y_pred2)))

#Logistic Regression Classifier Confusion matrix
print('Confusion Matrix : \n' + str(confusion_matrix(y_test,y_pred2)))

Accuracy Score : 0.7278731836195509
Precision Score : 0.7839195979899497
Recall Score : 0.4890282131661442
F1 Score : 0.6023166023166023
Confusion Matrix : 
[[395  43]
 [163 156]]


In [34]:
%%time

#Grid Search
from sklearn.model_selection import GridSearchCV
clf2 = KNeighborsClassifier()


k_range = list(range(1, 31))
param_grid2 = dict(n_neighbors=k_range)


grid_clf_acc2 = GridSearchCV(clf2, param_grid = param_grid2,scoring = 'recall')
grid_clf_acc2.fit(X_train, y_train.values.ravel())

#Predict values based on new parameters
y_pred_acc2 = grid_clf_acc2.predict(X_test)

# New Model Evaluation metrics 
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred_acc2)))
print('Precision Score : ' + str(precision_score(y_test,y_pred_acc2)))
print('Recall Score : ' + str(recall_score(y_test,y_pred_acc2)))
print('F1 Score : ' + str(f1_score(y_test,y_pred_acc2)))

#Confusion matrix
confusion_matrix(y_test,y_pred_acc2)


result2 = grid_clf_acc2.fit(X_train, y_train.values.ravel())
print('Best Score: %s' % result2.best_score_)
print('Best Hyperparameters: %s' % result2.best_params_)

Accuracy Score : 0.7014531043593131
Precision Score : 0.6703296703296703
Recall Score : 0.5736677115987461
F1 Score : 0.6182432432432433
Best Score: 0.5669198681485371
Best Hyperparameters: {'n_neighbors': 1}
Wall time: 1min 34s


## Random Forest Classifier

In [35]:
%%time
#Model training

Model3 = RandomForestClassifier(random_state=0)
Model3.fit(X_train, y_train.values.ravel())

Wall time: 5.87 s


RandomForestClassifier(random_state=0)

In [36]:
%%time
#Model predicting 

y_pred3 = Model3.predict(X_test)

print("Acuracy", accuracy_score(y_test, y_pred3))
print("Macro precision_recall_fscore_support")
print(precision_recall_fscore_support(y_test, y_pred3, average='macro'))
print("Micro precision_recall_fscore_support")
print(precision_recall_fscore_support(y_test, y_pred3, average='micro'))
print("Weighted precision_recall_fscore_support")
print(precision_recall_fscore_support(y_test, y_pred3, average='weighted'))

Acuracy 0.7939233817701453
Macro precision_recall_fscore_support
(0.7892430940822932, 0.7865726227795193, 0.7877600609624592, None)
Micro precision_recall_fscore_support
(0.7939233817701453, 0.7939233817701453, 0.7939233817701453, None)
Weighted precision_recall_fscore_support
(0.7932499590812457, 0.7939233817701453, 0.793445604963348, None)
Wall time: 113 ms


In [37]:
# Model Evaluation metrics 
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred3)))
print('Precision Score : ' + str(precision_score(y_test,y_pred3)))
print('Recall Score : ' + str(recall_score(y_test,y_pred3)))
print('F1 Score : ' + str(f1_score(y_test,y_pred3)))

#Logistic Regression Classifier Confusion matrix
print('Confusion Matrix : \n' + str(confusion_matrix(y_test,y_pred3)))

Accuracy Score : 0.7939233817701453
Precision Score : 0.7637540453074434
Recall Score : 0.7398119122257053
F1 Score : 0.751592356687898
Confusion Matrix : 
[[365  73]
 [ 83 236]]


In [38]:
%%time
#Grid Search
from sklearn.model_selection import GridSearchCV
clf3 = RandomForestClassifier()


# Create the parameter grid based on the results of random search 
param_grid3 = {
    
    'n_estimators': [10, 50, 100, 200, 300]
}


grid_clf_acc3 = GridSearchCV(clf3, param_grid = param_grid3,scoring = 'recall')
grid_clf_acc3.fit(X_train, y_train.values.ravel())

#Predict values based on new parameters
y_pred_acc3 = grid_clf_acc3.predict(X_test)

# New Model Evaluation metrics 
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred_acc3)))
print('Precision Score : ' + str(precision_score(y_test,y_pred_acc3)))
print('Recall Score : ' + str(recall_score(y_test,y_pred_acc3)))
print('F1 Score : ' + str(f1_score(y_test,y_pred_acc3)))

# Confusion matrix
confusion_matrix(y_test,y_pred_acc3)

result3 = grid_clf_acc3.fit(X_train, y_train.values.ravel())
print('Best Score: %s' % result3.best_score_)
print('Best Hyperparameters: %s' % result3.best_params_)

Accuracy Score : 0.7965653896961691
Precision Score : 0.7704918032786885
Recall Score : 0.7366771159874608
F1 Score : 0.7532051282051281
Best Score: 0.7134465155625567
Best Hyperparameters: {'n_estimators': 100}
Wall time: 4min 48s


# Summary

* Interestingly, for Logistic Regression model application of the combined data preprocessing techniques decreased the evaluation parameters of model. This can be explained by the important feature removal, which for this specific model made the performance worse.
* For KNN though the results improved significantly.
* While for Random Forest with the expect of only precision, the other parameters such as accuracy, f1score and recall are much higher for processed data than unprocessed one. As it was explained before recall is very important parameter and Random Forest has the highest recall (0.736) among all models and observed cases of data processing so far. Thus, this is the best model for this task for now. 