# Risk Analysis of Covid19 Virus
As the outbreak of coronavirus disease 2019 (COVID-19) progresses, epidemiological data are needed to guide situational awareness and intervention strategies. The enormous impact of the COVID-19 pandemic is obvious. What many still haven’t realized, however, is that the impact on ongoing data science production setups has been dramatic, too. Artificial Intelligence is actively used in identifying high risk patients at an earlier stage and therefore helps to control the spread of the infection in real-time. This becomes particularly important at this time of crisis because real time monitoring is the best option for people to self-isolate and mitigate the spread of the virus.

Objective: This is an open-research project that a healthcare institute will use to determine the risk factor of second-level contacts traced for a COVID19 Positive patient. This study will also lead to deciding which all factors should be considered for a healthcare institute to open a dedicated testing-quarantine labs and/or predict the possibility of a zone turning into a hotspot.

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn import metrics
from sklearn import tree
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

  from numpy.core.umath_tests import inner1d


# 1. Data Collection:Get the data from APIs

Data has been collected by Tanmoy Mukherjee (https://www.kaggle.com/tanmoyx).Dataset has been taken from kaggle with URL as: https://www.kaggle.com/tanmoyx/covid19-patient-precondition-dataset?select=covid.csv .
Data is being collected by the Mexican goverment, it contains a large number of anonymised patient-related  information such as pre-existing conditions, gender, age, symptom date and entry date. The data contains patient information around the world.

Feature Set:
he data set contains 993,197 records, the data dictionary is provided as below:

1. id: The identification number of the patient.

2. sex: Identify gender of the patient, 1 as female and 2 as male.

3. patient_type: Type of patient, 1 for not hospitalized and 2 for hosptalized.

4. entry_date: The date that the patient went to the hospital.

5. date_symptoms: The date that the patient started to show symptoms.

6. date_died: The date that the patient died, "9999-99-99" stands for recovered.

7. intubed: Intubation is a procedure that's used when you can't breathe on your own. Your doctor puts a tube down your throat and into your windpipe to make it easier to get air into and out of your lungs. A machine called a ventilator pumps in air with extra oxygen. Then it helps you breathe out air that’s full of carbon dioxide (CO2). "1" denotes that the patient used ventilator and "2" denotes that the patient did not, "97" "98" "99" means not specified.

8. pneumonia: Indicates whether the patient already have air sacs inflammation or not "1" for yes, "2" for no, "97" "98" "99" means not specified.

9. age: Specifies the age of the patient.

10. pregnancy: Indicates whether the patient is pregnant or not, "1" for yes, "2" for no, "97" "98" "99" means not specified.

11. diabetes: Indicates whether the patient has diabetes or not, "1" for yes, "2" for no, "97" "98" "99" means not specified.

12. copd: Indicates whether the patient has Chronic obstructive pulmonary disease (COPD) or not, "1" for yes, "2" for no, "97" "98" "99" means not specified.

13. asthma: Indiactes whether the patient has asthma or not, "1" for yes, "2" for no, "97" "98" "99" means not specified.

14. inmsupr: Indicates whether the patient is immunosuppressed or not, "1" for yes, "2" for no, "97" "98" "99" means not specified.

15. hypertension: Indicates whether the patient has hypertension or not, "1" for yes, "2" for no, "97" "98" "99" means not specified.

16. other_disease: Indicates whether the patient has other disease or not, "1" for yes, "2" for no, "97" "98" "99" means not specified.

17. cardiovascular: Indicates whether if the patient has heart or blood vessels realted disease, "1" for yes, "2" for no, "97" "98" "99" means not specified.

18. obesity: Indicates whether the patient is obese or not,  "1" for yes, "2" for no, "97" "98" "99" means not specified.

19. renal_chronic: Indicates whether the patient has chronic renal disease or not,  "1" for yes, "2" for no, "97" "98" "99" means not specified.

20. tobacco: Indicates whether if the patient is a tobacco user, "1" for yes, "2" for no, "97" "98" "99" means not specified.

21. contact_other_covid: Indicates whether if the patient has contacted another covid19 patient.

22. covid_res: RESULT Identifies the result of the analysis of the sample reported by the laboratory of the National Network of Epidemiological Surveillance Laboratories (INDRE, LESP and LAVE). ( "1" for positive,"2" for negative, "3" for result is pending.

22. icu: Indicates whether the if the patient had been admitted to an Intensive Care Unit (ICU), "1" for yes, "2" for no, "97" "98" "99" means not specified.


In [2]:
df=pd.read_csv(r'C:\Users\Sony\Desktop\python\covid19\another dataset\covid.csv')
df.head()

Unnamed: 0,id,sex,patient_type,entry_date,date_symptoms,date_died,intubed,pneumonia,age,pregnancy,...,inmsupr,hypertension,other_disease,cardiovascular,obesity,renal_chronic,tobacco,contact_other_covid,covid_res,icu
0,16169f,2,1,4/5/2020,2/5/2020,9999-99-99,97,2,27,97,...,2,2,2,2,2,2,2,2,1,97
1,1009bf,2,1,19-03-2020,17-03-2020,9999-99-99,97,2,24,97,...,2,2,2,2,2,2,2,99,1,97
2,167386,1,2,6/4/2020,1/4/2020,9999-99-99,2,2,54,2,...,2,2,2,2,1,2,2,99,1,2
3,0b5948,2,2,17-04-2020,10/4/2020,9999-99-99,2,1,30,97,...,2,2,2,2,2,2,2,99,1,2
4,0d01b5,1,2,13-04-2020,13-04-2020,22-04-2020,2,2,60,2,...,2,1,2,1,2,2,2,99,1,2


In [3]:
df=df.drop(['id','entry_date','date_symptoms','date_died'],axis=1)

In [4]:
df.rename(columns={'covid_res':'Test_result'},inplace=True)

In [5]:
df['Test_result'].unique()

array([1, 2, 3], dtype=int64)

In [6]:
#converting numerical features to proper categories
df.sex.replace({1: 'Female', 2: 'Male'}, inplace=True)
df.patient_type.replace({1: 'Outpatient', 2: 'Inpatient'}, inplace=True)
df.intubed.replace({1: 'Yes', 2: 'No',97:'Not Specified', 98:'Not Specified',99:'Not Specified'}, inplace=True)
df.pneumonia.replace({1: 'Yes', 2: 'No', 98:'Not Specified',99:'Not Specified', 97:'Not Specified'}, inplace=True)
df.pregnancy.replace({1: 'Yes', 2: 'No', 99:'Not Specified',98:'Not Specified', 97:'Not Specified'}, inplace=True)
df.diabetes.replace({1:'Yes', 2: 'No', 99:'Not Specified',98:'Not Specified', 97:'Not Specified'}, inplace=True)
df.copd.replace({1:'Yes', 2: 'No', 99:'Not Specified',98:'Not Specified', 97:'Not Specified'}, inplace=True)
df.asthma.replace({1:'Yes', 2: 'No', 99:'Not Specified',98:'Not Specified', 97:'Not Specified'}, inplace=True)
df.inmsupr.replace({1:'Yes', 2: 'No', 99:'Not Specified',98:'Not Specified', 97:'Not Specified'}, inplace=True)
df.hypertension.replace({1:'Yes', 2: 'No', 99:'Not Specified',98:'Not Specified', 97:'Not Specified'}, inplace=True)
df.other_disease.replace({1:'Yes', 2: 'No', 99:'Not Specified',98:'Not Specified', 97:'Not Specified'}, inplace=True)
df.cardiovascular.replace({1:'Yes', 2: 'No', 99:'Not Specified',98:'Not Specified', 97:'Not Specified'}, inplace=True)
df.obesity.replace({1:'Yes', 2: 'No', 99:'Not Specified',98:'Not Specified', 97:'Not Specified'}, inplace=True)
df.renal_chronic.replace({1:'Yes', 2: 'No', 99:'Not Specified',98:'Not Specified', 97:'Not Specified'}, inplace=True)
df.tobacco.replace({1:'Yes', 2: 'No', 99:'Not Specified',98:'Not Specified', 97:'Not Specified'}, inplace=True)
df.contact_other_covid.replace({1: 'Yes', 2: 'No', 97:'Not Specified',99:'Not Specified',98:'Not Specified'}, inplace=True)
df.Test_result.replace({1: 'Positive', 2: 'Negative', 3:'Awaiting Results'}, inplace=True)
df.icu.replace({1: 'Yes', 2: 'No', 97:'Not Specified',98:'Not Specified', 99:'Not Specified'}, inplace=True)


In [46]:
df['Test_result'].unique()

array(['Positive', 'Negative', 'Awaiting Results'], dtype=object)

In [7]:
# Get names of indexes for which column Stock has value No
indexNames = df[ df['Test_result'] == 'Awaiting Results' ].index
# Delete these row indexes from dataFrame
df.drop(indexNames , inplace=True)
df.head()

Unnamed: 0,sex,patient_type,intubed,pneumonia,age,pregnancy,diabetes,copd,asthma,inmsupr,hypertension,other_disease,cardiovascular,obesity,renal_chronic,tobacco,contact_other_covid,Test_result,icu
0,Male,Outpatient,Not Specified,No,27,Not Specified,No,No,No,No,No,No,No,No,No,No,No,Positive,Not Specified
1,Male,Outpatient,Not Specified,No,24,Not Specified,No,No,No,No,No,No,No,No,No,No,Not Specified,Positive,Not Specified
2,Female,Inpatient,No,No,54,No,No,No,No,No,No,No,No,Yes,No,No,Not Specified,Positive,No
3,Male,Inpatient,No,Yes,30,Not Specified,No,No,No,No,No,No,No,No,No,No,Not Specified,Positive,No
4,Female,Inpatient,No,No,60,No,Yes,No,No,No,Yes,No,Yes,No,No,No,Not Specified,Positive,No


In [49]:
df['Test_result'].unique()

array(['Positive', 'Negative'], dtype=object)

# Find Categorical Features and Numerical Features

In [8]:
cat_features=df.select_dtypes(include='object').columns.tolist()
print("Categorical Features:",cat_features)
num_features=df.select_dtypes(exclude='object').columns.tolist()
print("Numerical Features:",num_features)

Categorical Features: ['sex', 'patient_type', 'intubed', 'pneumonia', 'pregnancy', 'diabetes', 'copd', 'asthma', 'inmsupr', 'hypertension', 'other_disease', 'cardiovascular', 'obesity', 'renal_chronic', 'tobacco', 'contact_other_covid', 'Test_result', 'icu']
Numerical Features: ['age']


In [9]:

#No null values to handle
df.isna().sum() 
df.isnull().sum()

sex                    0
patient_type           0
intubed                0
pneumonia              0
age                    0
pregnancy              0
diabetes               0
copd                   0
asthma                 0
inmsupr                0
hypertension           0
other_disease          0
cardiovascular         0
obesity                0
renal_chronic          0
tobacco                0
contact_other_covid    0
Test_result            0
icu                    0
dtype: int64

# Label Encoding

In [10]:
from sklearn.preprocessing import StandardScaler,LabelEncoder
cat_features=df.select_dtypes(include='object').columns.tolist()
le=LabelEncoder()
for col in cat_features:
    df[col]=le.fit_transform(df[col])
    

# Splitting data in X and Y

In [11]:
from sklearn.model_selection import train_test_split
x = df.drop('Test_result', axis = 1)
y = df.Test_result
from sklearn.preprocessing import RobustScaler
x= RobustScaler().fit_transform(x)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size= 0.3)

# CLASSIFICATION

# Logistic regression without CV

In [28]:
from sklearn.linear_model import LogisticRegression
log = LogisticRegression(class_weight='balanced')
log.fit(x_train, y_train)
log_pred = log.predict(x_test)
log_accuracy = metrics.accuracy_score(y_test, log_pred)
print("log_accuracy",log_accuracy)

log_precision=metrics.precision_score(y_test, log_pred,pos_label=0)
print("log_precision",log_precision)

log_recall=metrics.recall_score(y_test, log_pred,pos_label=0)
print("log_recall",log_recall)

log_f1_score= metrics.f1_score(y_test, log_pred,pos_label=0)
print("log_f1_score",log_f1_score)

print(confusion_matrix(y_test,log_pred))
print(classification_report(y_test,log_pred))

log_accuracy 0.6198001440883742
log_precision 0.6427699136658561
log_recall 0.7215798006771256
log_f1_score 0.6798986818531562
[[60529 23355]
 [33640 32384]]
             precision    recall  f1-score   support

          0       0.64      0.72      0.68     83884
          1       0.58      0.49      0.53     66024

avg / total       0.62      0.62      0.61    149908



# Logistic regression with CV

In [29]:
log_cross_val = cross_val_score(log, x, y, cv=10, scoring='accuracy')
log_cv_accuracy = log_cross_val.mean()
print("log_cv_accuracy",log_cv_accuracy)

log_cross_val_pre = cross_val_score(log, x, y, cv=10, scoring='precision_macro')
log_cv_precision = log_cross_val_pre.mean()
print("log_cv_precision",log_cv_precision)

log_cross_val_re = cross_val_score(log, x, y, cv=10, scoring='recall_macro')
log_cv_recall = log_cross_val_re.mean()
print("log_cv_recall",log_cv_recall)

log_cross_val_f1 = cross_val_score(log, x, y, cv=10, scoring='f1_macro')
log_cv_f1_score = log_cross_val_f1.mean()
print("log_cv_f1_score",log_cv_f1_score)

log_cv_accuracy 0.619969910050355
log_cv_precision 0.6123884330573607
log_cv_recall 0.6061262551909335
log_cv_f1_score 0.60583146561793


In [15]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
from sklearn.metrics import classification_report,accuracy_score,confusion_matrix
from sklearn import metrics


In [16]:
from sklearn import metrics
from sklearn import tree

dtree = tree.DecisionTreeClassifier(class_weight='balanced')
dtree = dtree.fit(x_train, y_train)

dtree_pred = dtree.predict(x_test)

dtree_accuracy = metrics.accuracy_score(y_test, dtree_pred)
print("dtree_accuracy",dtree_accuracy)

dtree_precision=metrics.precision_score(y_test, dtree_pred,pos_label=0)
print("dtree_precision",dtree_precision)

dtree_recall=metrics.recall_score(y_test, dtree_pred,pos_label=0)
print("dtree_recall",dtree_recall)

dtree_f1_score= metrics.f1_score(y_test, dtree_pred,pos_label=0)
print("dtree_f1_score",dtree_f1_score)

print(confusion_matrix(y_test,dtree_pred))
print(classification_report(y_test,dtree_pred))

dtree_accuracy 0.6162779838300825
dtree_precision 0.6379333800768128
dtree_recall 0.7267059272328453
dtree_f1_score 0.6794322367797772
[[60959 22925]
 [34598 31426]]
             precision    recall  f1-score   support

          0       0.64      0.73      0.68     83884
          1       0.58      0.48      0.52     66024

avg / total       0.61      0.62      0.61    149908



In [17]:
from sklearn.model_selection import cross_val_score

dtree_cross_val_acc = cross_val_score(dtree, x, y, cv=10, scoring='accuracy')

dtree_cv_accuracy = dtree_cross_val_acc.mean()
print("dtree_cv_accuracy",dtree_cv_accuracy)

dtree_cross_val_pre = cross_val_score(dtree, x, y, cv=10, scoring='precision_macro')
dtree_cv_precision = dtree_cross_val_pre.mean()
print("dtree_cv_precision",dtree_cv_precision)

dtree_cross_val_re = cross_val_score(dtree, x, y, cv=10, scoring='recall_macro')
dtree_cv_recall = dtree_cross_val_re.mean()
print("dtree_cv_recall",dtree_cv_recall)

dtree_cross_val_f1 = cross_val_score(dtree, x, y, cv=10, scoring='f1_macro')
dtree_cv_f1_score = dtree_cross_val_f1.mean()
print("dtree_cv_f1_score",dtree_cv_f1_score)

dtree_cv_accuracy 0.6168079537189033
dtree_cv_precision 0.6092311370390523
dtree_cv_recall 0.601929606710069
dtree_cv_f1_score 0.6011316461859828


# Random Forest

In [25]:
from sklearn.ensemble import RandomForestClassifier
rnd_for = RandomForestClassifier(class_weight='balanced')
rnd_for.fit(x_train, y_train)
rnd_for_pred = rnd_for.predict(x_test)
rnd_for_accuracy = metrics.accuracy_score(y_test, rnd_for_pred)
print("rnd_for_accuracy",rnd_for_accuracy)

rnd_for_precision=metrics.precision_score(y_test, rnd_for_pred,pos_label=0)
print("rnd_for_precision",rnd_for_precision)

rnd_for_recall=metrics.recall_score(y_test, rnd_for_pred,pos_label=0)
print("rnd_for_recall",rnd_for_recall)

rnd_for_f1_score= metrics.f1_score(y_test, rnd_for_pred,pos_label=0)
print("rnd_for_f1_score",rnd_for_f1_score)

print(confusion_matrix(y_test,rnd_for_pred))
print(classification_report(y_test,rnd_for_pred))

rnd_for_accuracy 0.6178989780398645
rnd_for_precision 0.6361849379581473
rnd_for_recall 0.7407848934242526
rnd_for_f1_score 0.684512007050011
[[62140 21744]
 [35536 30488]]
             precision    recall  f1-score   support

          0       0.64      0.74      0.68     83884
          1       0.58      0.46      0.52     66024

avg / total       0.61      0.62      0.61    149908



# Random Forest with CV

In [26]:
rnd_for_cross_val = cross_val_score(rnd_for, x, y, cv=10, scoring='accuracy')
rnd_for_cv_accuracy = rnd_for_cross_val.mean()
print("rnd_for_cv_accuracy",rnd_for_cv_accuracy)

rnd_for_cross_val_pre = cross_val_score(rnd_for, x, y, cv=10, scoring='precision_macro')
rnd_for_cv_precision = rnd_for_cross_val_pre.mean()
print("rnd_for_cv_precision",rnd_for_cv_precision)

rnd_for_cross_val_re = cross_val_score(rnd_for, x, y, cv=10, scoring='recall_macro')
rnd_for_cv_recall = rnd_for_cross_val_re.mean()
print("rnd_for_cv_recall",rnd_for_cv_recall)

rnd_for_cross_val_f1 = cross_val_score(rnd_for, x, y, cv=10, scoring='f1_macro')
rnd_for_cv_f1_score = rnd_for_cross_val_f1.mean()
print("rnd_for_cv_f1_score",rnd_for_cv_f1_score)

rnd_for_cv_accuracy 0.6180447170303747
rnd_for_cv_precision 0.610446508631751
rnd_for_cv_recall 0.6018465028192616
rnd_for_cv_f1_score 0.601048303719441


# Adaboost without CV

In [31]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics


Ada_model=AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),n_estimators=500,learning_rate=0.1)
Ada_model.fit(x_train,y_train)
ada_pred=Ada_model.predict(x_test)
ada_accuracy = metrics.accuracy_score(y_test, ada_pred)
print(ada_accuracy)

ada_precision=metrics.precision_score(y_test, ada_pred,pos_label=0)
print("ada_precision",ada_precision)

ada_recall=metrics.recall_score(y_test, ada_pred,pos_label=0)
print("ada_recall",ada_recall)

ada_f1_score= metrics.f1_score(y_test, ada_pred,pos_label=0)
print("ada_f1_score",ada_f1_score)

print(confusion_matrix(y_test,ada_pred))
print(classification_report(y_test,ada_pred))

0.638278143928276
ada_precision 0.6327012733666812
ada_recall 0.8428901816794622
ada_f1_score 0.7228256702532777
[[70705 13179]
 [41046 24978]]
             precision    recall  f1-score   support

          0       0.63      0.84      0.72     83884
          1       0.65      0.38      0.48     66024

avg / total       0.64      0.64      0.62    149908



# Adaboost with CV

In [32]:
ada_cross_val = cross_val_score(Ada_model, x, y, cv=10, scoring='accuracy')
ada_cv_accuracy = log_cross_val.mean()
print(ada_cv_accuracy)

ada_cross_val_pre = cross_val_score(Ada_model, x, y, cv=10, scoring='precision_macro')
ada_cv_precision = ada_cross_val_pre.mean()
print("ada_cv_precision",ada_cv_precision)

ada_cross_val_re = cross_val_score(Ada_model, x, y, cv=10, scoring='recall_macro')
ada_cv_recall = ada_cross_val_re.mean()
print("ada_cv_recall",ada_cv_recall)

ada_cross_val_f1 = cross_val_score(Ada_model, x, y, cv=10, scoring='f1_macro')
ada_cv_f1_score = ada_cross_val_f1.mean()
print("ada_cv_f1_score",ada_cv_f1_score)


0.619969910050355
ada_cv_precision 0.6441011377031093
ada_cv_recall 0.6099584968514766
ada_cv_f1_score 0.5997822347540784


# Gradient Boosting Classifier

In [18]:
from sklearn.ensemble import GradientBoostingClassifier

Grad_boost = GradientBoostingClassifier()
Grad_boost.fit(x_train, y_train)
Grad_boost_pred = Grad_boost.predict(x_test)
Grad_boost_accuracy = metrics.accuracy_score(y_test, Grad_boost_pred)
print("Grad_boost_accuracy",Grad_boost_accuracy)

Grad_boost_precision=metrics.precision_score(y_test, Grad_boost_pred,pos_label=0)
print("Grad_boost_precision",Grad_boost_precision)

Grad_boost_recall=metrics.recall_score(y_test, Grad_boost_pred,pos_label=0)
print("Grad_boost_recall",Grad_boost_recall)

Grad_boost_f1_score= metrics.f1_score(y_test, Grad_boost_pred,pos_label=0)
print("Grad_boost_f1_score",Grad_boost_f1_score)

print(confusion_matrix(y_test,Grad_boost_pred))
print(classification_report(y_test,Grad_boost_pred))

Grad_boost_accuracy 0.6409531179123196
Grad_boost_precision 0.6351667326162812
Grad_boost_recall 0.8419722473892518
Grad_boost_f1_score 0.7240926799261842
[[70628 13256]
 [40568 25456]]
             precision    recall  f1-score   support

          0       0.64      0.84      0.72     83884
          1       0.66      0.39      0.49     66024

avg / total       0.65      0.64      0.62    149908



# Gradient Boosting Classifier with Cross Validation

In [19]:
Grad_boost_cross_val = cross_val_score(Grad_boost, x, y, cv=10, scoring='accuracy')
Grad_boost_cv_accuracy = Grad_boost_cross_val.mean()
print(Grad_boost_cv_accuracy)

Grad_boost_cross_val_pre = cross_val_score(Grad_boost, x, y, cv=10, scoring='precision_macro')
Grad_boost_cv_precision = Grad_boost_cross_val_pre.mean()
print(Grad_boost_cv_precision)

Grad_boost_cross_val_re = cross_val_score(Grad_boost, x, y, cv=10, scoring='recall_macro')
Grad_boost_cv_recall = Grad_boost_cross_val_re.mean()
print(Grad_boost_cv_recall)

Grad_boost_cross_val_f1 = cross_val_score(Grad_boost, x, y, cv=10, scoring='f1_macro')
Grad_boost_cv_f1_score = Grad_boost_cross_val_f1.mean()
print(Grad_boost_cv_f1_score)

0.6399762318052351
0.6467705007359406
0.6128142671574783
0.603300823261115


# Multilayer Perceptron

In [21]:
from sklearn.neural_network import MLPClassifier
MLP = MLPClassifier(max_iter=200,activation='logistic')
MLP.fit(x_train, y_train)
MLP_pred = MLP.predict(x_test)
MLP_accuracy = metrics.accuracy_score(y_test, MLP_pred)
print("MLP_accuracy",MLP_accuracy)

MLP_precision=metrics.precision_score(y_test, MLP_pred,pos_label=0)
print("MLP_precision",MLP_precision)

MLP_recall=metrics.recall_score(y_test, MLP_pred,pos_label=0)
print("MLP_recall",MLP_recall)

MLP_f1_score= metrics.f1_score(y_test, MLP_pred,pos_label=0)
print("MLP_f1_score",MLP_f1_score)

print(confusion_matrix(y_test,MLP_pred))
print(classification_report(y_test,MLP_pred))

MLP_accuracy 0.6411265576220082
MLP_precision 0.6351669482083169
MLP_recall 0.8426994420866911
MLP_f1_score 0.7243616018362914
[[70689 13195]
 [40603 25421]]
             precision    recall  f1-score   support

          0       0.64      0.84      0.72     83884
          1       0.66      0.39      0.49     66024

avg / total       0.65      0.64      0.62    149908



# Multilayer Perceptron with Cross Validation

In [22]:
MLP_cross_val = cross_val_score(MLP, x, y, cv=10, scoring='accuracy')
MLP_cv_accuracy = MLP_cross_val.mean()
print("MLP_cv_accuracy",MLP_cv_accuracy)

MLP_cross_val_pre = cross_val_score(MLP, x, y, cv=10, scoring='precision_macro')
MLP_cv_precision = MLP_cross_val_pre.mean()
print("MLP_cv_precision",MLP_cv_precision)

MLP_cross_val_re = cross_val_score(MLP, x, y, cv=10, scoring='recall_macro')
MLP_cv_recall = MLP_cross_val_re.mean()
print("MLP_cv_recall",MLP_cv_recall)

MLP_cross_val_f1 = cross_val_score(MLP, x, y, cv=10, scoring='f1_macro')
MLP_cv_f1_score = MLP_cross_val_f1.mean()
print("MLP_cv_f1_score",MLP_cv_f1_score)

MLP_cv_accuracy 0.639712071908395
MLP_cv_precision 0.6461451129944953
MLP_cv_recall 0.6153165229291576
MLP_cv_f1_score 0.6056863019545992


# Accuracy comparision

In [35]:
no_cv_acc = [dtree_accuracy, rnd_for_accuracy, log_accuracy,ada_accuracy,Grad_boost_accuracy,MLP_accuracy]
with_cv_acc = [dtree_cv_accuracy, rnd_for_cv_accuracy, log_cv_accuracy,ada_cv_accuracy,Grad_boost_cv_accuracy,MLP_cv_accuracy]

no_cv_pre = [dtree_precision,rnd_for_precision, log_precision,ada_precision,Grad_boost_precision,MLP_precision]
with_cv_pre = [dtree_cv_precision, rnd_for_cv_precision, log_cv_precision,ada_cv_precision,Grad_boost_cv_precision,MLP_cv_precision]

no_cv_re = [dtree_recall, rnd_for_recall, log_recall,ada_recall,Grad_boost_recall,MLP_recall]
with_cv_re = [dtree_cv_recall, rnd_for_cv_recall, log_cv_recall,ada_cv_recall,Grad_boost_cv_recall,MLP_cv_recall]

no_cv_f1 = [dtree_f1_score, rnd_for_f1_score, log_f1_score,ada_f1_score,Grad_boost_f1_score,MLP_f1_score]
with_cv_f1 = [dtree_cv_f1_score,rnd_for_cv_f1_score, log_cv_f1_score,ada_cv_f1_score,Grad_boost_cv_f1_score,MLP_cv_f1_score]

In [36]:
accuracy = {'Without CV Acc': no_cv_acc,'With CV acc': with_cv_acc,'Without CV Pre': no_cv_pre,'With CV Pre': with_cv_pre,'Without CV Recall': no_cv_re,'With CV Recall': with_cv_re,'Without CV F1': no_cv_f1,'With CV F1': with_cv_f1}

In [38]:
accuracy_chart = pd.DataFrame(accuracy, index = ['DTREE','RND_FOR', 'LOG_REG','ADABOOST','Gradient Boosting','Multilayer Perceptron'])

In [39]:
print(accuracy_chart)

                       Without CV Acc  With CV acc  Without CV Pre  \
DTREE                        0.616278     0.616808        0.637933   
RND_FOR                      0.617899     0.618045        0.636185   
LOG_REG                      0.619800     0.619970        0.642770   
ADABOOST                     0.638278     0.619970        0.632701   
Gradient Boosting            0.640953     0.639976        0.635167   
Multilayer Perceptron        0.641127     0.639712        0.635167   

                       With CV Pre  Without CV Recall  With CV Recall  \
DTREE                     0.609231           0.726706        0.601930   
RND_FOR                   0.610447           0.740785        0.601847   
LOG_REG                   0.612388           0.721580        0.606126   
ADABOOST                  0.644101           0.842890        0.609958   
Gradient Boosting         0.646771           0.841972        0.612814   
Multilayer Perceptron     0.646145           0.842699        0.615317  