# Machine Learning for predicting the survival of a person infected with the COVID-19 virus

In this project I tried to implement a model that can predict whether a person infected with the COVID-19 virus will die or not.

In [19]:
import sklearn
import graphviz
import pandas as pd
import numpy as np

np.random.seed(63)

In [20]:
DATA_FILE = 'Patient-Medical-Data-for-Novel-Coronavirus-COVID-19.csv'
dataFile = pd.read_csv(DATA_FILE, sep=',')
dataFile.describe()

Unnamed: 0,Age,Sex,City,AdministrativeDivision,Country,GeoPosition,DateOfOnsetSymptoms,DateOfAdmissionHospital,DateOfConfirmation,Symptoms,...,TravelHistoryLocation,ReportedMarketExposure,ReportedMarketExposureComment,ChronicDiseaseQ,ChronicDiseases,SequenceAvailable,DischargedQ,DeathQ,DateOfDeath,DateOfDischarge
count,213684,213684,213684,213684,213684,213684,213684,213684,213684,213684,...,213684,213684,213684,213684,213684,213684,213684,213684,213684,213684
unique,216,3,1005,421,135,2390,134,96,124,322,...,475,3,8,2,66,2,2,2,40,66
top,"Interval[{35, 59}]","Entity[""Gender"", ""Female""]","Missing[""NotAvailable""]","Entity[""AdministrativeDivision"", {""Bavaria"", ""...","Entity[""Country"", ""Germany""]","GeoPosition[{48.13641000000007, 11.57754000000...","Missing[""NotAvailable""]","Missing[""NotAvailable""]","DateObject[{2020, 4, 2}, ""Day"", ""Gregorian"", -5.]","Missing[""NotAvailable""]",...,"Missing[""NotAvailable""]","Missing[""NotAvailable""]","Missing[""NotAvailable""]",False,"Missing[""NotAvailable""]","Missing[""NotAvailable""]","Missing[""NotAvailable""]","Missing[""NotAvailable""]","Missing[""NotAvailable""]","Missing[""NotAvailable""]"
freq,66683,111711,82354,42594,160356,6981,48913,211620,8384,211984,...,208168,213608,213669,213028,213539,213679,213567,213532,213536,213340


Looking at this data, I can see that a lot of values are missing. I can also use a few columns to infer the value of DeathQ, the field I have to predict. DischargedQ means that the patient recovered and was discharged from hospital; in that case the patient did not die, so we can set DeathQ to False.

Although more data usually means better accuracy, some of the columns in the table are irrelevant to a patient's survival rate.
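
To put a number on how much data is missing, one option is to compute the share of placeholder cells per column. The sketch below uses a small made-up frame (`toy`, with illustrative values) instead of the real `dataFile`; only the placeholder string is taken from the dataset.

```python
import pandas as pd

MISSING = 'Missing["NotAvailable"]'  # placeholder string used by the dataset

# Toy frame standing in for dataFile (values are illustrative only).
toy = pd.DataFrame({
    "Age": ["56", MISSING, "34", MISSING],
    "DeathQ": [MISSING, "True", MISSING, MISSING],
})

# Fraction of placeholder cells per column, highest first.
missing_share = (toy == MISSING).mean().sort_values(ascending=False)
print(missing_share)
```

The same expression applied to `dataFile` should list the sparsest columns first.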

In [21]:
# A recorded date of death means the patient died.
dataFile.loc[dataFile.DateOfDeath != 'Missing["NotAvailable"]', 'DeathQ'] = 'True'
# A discharge flag or a discharge date means the patient survived.
dataFile.loc[dataFile.DischargedQ == 'True', 'DeathQ'] = 'False'
dataFile.loc[dataFile.DateOfDischarge != 'Missing["NotAvailable"]', 'DeathQ'] = 'False'

# Drop rows whose outcome is still unknown, then the columns not used for prediction.
dataFile = dataFile.drop(dataFile[dataFile.DeathQ == 'Missing["NotAvailable"]'].index)
dataFile = dataFile.drop(columns=['ReportedMarketExposureComment', 'DischargedQ', 'SequenceAvailable', 'GeoPosition', 'LivesInWuhan',
                      'ReportedMarketExposure', 'LivesInWuhanComment', 'DateOfOnsetSymptoms'])

In [22]:
dataFile.describe()

Unnamed: 0,Age,Sex,City,AdministrativeDivision,Country,DateOfAdmissionHospital,DateOfConfirmation,Symptoms,TravelHistoryDates,TravelHistoryLocation,ChronicDiseaseQ,ChronicDiseases,DeathQ,DateOfDeath,DateOfDischarge
count,530,530,530,530,530,530,530,530,530,530,530,530,530,530,530
unique,92,3,71,70,30,63,72,96,37,37,2,55,2,40,66
top,"Missing[""NotAvailable""]","Entity[""Gender"", ""Male""]","Missing[""NotAvailable""]","Missing[""NotAvailable""]","Entity[""Country"", ""Philippines""]","Missing[""NotAvailable""]","Missing[""NotAvailable""]","Missing[""NotAvailable""]","Missing[""NotAvailable""]","Missing[""NotAvailable""]",False,"Missing[""NotAvailable""]",False,"Missing[""NotAvailable""]","Missing[""NotAvailable""]"
freq,39,308,241,266,140,280,43,329,414,359,385,409,378,382,186


In [24]:
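from sklearn.impute import SimpleImputer

# Replace every remaining placeholder with the most frequent value of its column.
i = SimpleImputer(missing_values='Missing["NotAvailable"]', strategy='most_frequent')
i.fit(dataFile)
trans = i.transform(dataFile)
# transform returns a NumPy array, so rebuild the DataFrame with the original column names.
dataFile = pd.DataFrame(trans, columns=dataFile.columns)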
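
A minimal sketch of what the `most_frequent` strategy does, on a made-up frame (`toy`, illustrative values only): each placeholder is replaced by the most common non-missing value of its own column.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

MISSING = 'Missing["NotAvailable"]'

# Toy data; dtype=object keeps the columns as plain Python strings.
toy = pd.DataFrame({
    "Sex": ["Male", "Male", MISSING, "Female"],
    "Country": ["China", MISSING, "China", "Japan"],
}, dtype=object)

imputer = SimpleImputer(missing_values=MISSING, strategy="most_frequent")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
```

Note that this fills a gap with the column-wide mode regardless of the patient's outcome, which is why a column such as ChronicDiseases ends up dominated by a single value after imputation.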

Next we look at our data with regard to symptoms and the other diseases that can have a negative effect on the patient's survival.

In [25]:
dataFile['ChronicDiseaseQ'].value_counts()

False    385
True     145
Name: ChronicDiseaseQ, dtype: int64

In [26]:
dataFile['ChronicDiseases'].value_counts()

{"hypertension"}                                                                                                   432
{"hypertension", "diabetes"}                                                                                        13
{"diabetes", "hypertension"}                                                                                        10
{"diabetes"}                                                                                                         7
{"chronic obstructive pulmonary disease"}                                                                            4
{"asthma"}                                                                                                           3
{"hypertension", "chronic kidney disease"}                                                                           3
{"chronic kidney disease"}                                                                                           3
{"hepatitis B", "diabetes"}                     

In [27]:
dataFile['Symptoms'].value_counts()

{"acute respiratory distress syndrome", "pneumonia"}    346
{"fever"}                                                16
{"cough", "fever"}                                       14
{"acute respiratory failure", "pneumonia"}               11
{"pneumonia"}                                             8
                                                       ... 
{"cough", "sore throat", "headache"}                      1
{"dizziness", "fever"}                                    1
{"cough", "fatigue", "fever", "headache"}                 1
{"shortness of breath"}                                   1
{"cough", "fever", "chills"}                              1
Name: Symptoms, Length: 95, dtype: int64

In [28]:
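# Features are every column except the target.
# Note: X still includes DateOfDeath and DateOfDischarge, the columns DeathQ was derived from.
X, y = dataFile.loc[:, dataFile.columns != 'DeathQ'], dataFile['DeathQ']

from sklearn.preprocessing import OneHotEncoder

# One-hot encode every column; fit_transform returns a sparse matrix.
ohe = OneHotEncoder()
X = ohe.fit_transform(X)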
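
One-hot encoding treats each distinct string as its own category, so `{"hypertension", "diabetes"}` and `{"diabetes", "hypertension"}` in the ChronicDiseases counts become two unrelated columns. The sketch below shows an alternative that the notebook does not use: parsing each cell into its items and building one indicator column per item with `MultiLabelBinarizer`. The helper `parse_set` is hypothetical and assumes cells consist only of double-quoted items.

```python
import re
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

def parse_set(cell):
    """Return the sorted quoted items of a '{"a", "b"}' style string."""
    return sorted(re.findall(r'"([^"]*)"', cell))

# Two orderings of the same pair, as seen in the ChronicDiseases counts.
cells = pd.Series(['{"hypertension", "diabetes"}',
                   '{"diabetes", "hypertension"}',
                   '{"asthma"}'])

mlb = MultiLabelBinarizer()
multi_hot = pd.DataFrame(mlb.fit_transform(cells.map(parse_set)),
                         columns=mlb.classes_)
print(multi_hot)
```

With this encoding the first two rows are identical, and a patient with several conditions shares features with patients who have only one of them.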

In [29]:
dataFile.describe()

Unnamed: 0,Age,Sex,City,AdministrativeDivision,Country,DateOfAdmissionHospital,DateOfConfirmation,Symptoms,TravelHistoryDates,TravelHistoryLocation,ChronicDiseaseQ,ChronicDiseases,DeathQ,DateOfDeath,DateOfDischarge
count,530,530,530,530,530,530,530,530,530,530,530,530,530,530,530
unique,91,2,70,69,29,62,71,95,36,36,2,54,2,39,65
top,56,"Entity[""Gender"", ""Male""]","Entity[""City"", {""Wuhan"", ""Hubei"", ""China""}]","Entity[""AdministrativeDivision"", {""Hubei"", ""Ch...","Entity[""Country"", ""Philippines""]","DateObject[{2020, 1, 23}, ""Day"", ""Gregorian"", ...","DateObject[{2020, 2, 3}, ""Day"", ""Gregorian"", -5.]","{""acute respiratory distress syndrome"", ""pneum...","DateObject[{2020, 1, 17}, ""Day"", ""Gregorian"", ...","{Entity[""City"", {""Wuhan"", ""Hubei"", ""China""}]}",False,"{""hypertension""}",False,"DateObject[{2020, 1, 23}, ""Day"", ""Gregorian"", ...","DateObject[{2020, 2, 18}, ""Day"", ""Gregorian"", ..."
freq,53,344,282,308,142,293,65,346,424,457,385,432,378,394,208


In [30]:
dataFile['DeathQ'].value_counts()

False    378
True     152
Name: DeathQ, dtype: int64

In [31]:
from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=63)
for train_index, test_index in sss.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

In [32]:
train = dataFile.iloc[train_index]
test = dataFile.iloc[test_index]

train.to_csv("train.csv")
test.to_csv("test.csv")

The DeathQ values are imbalanced, so there is a good chance that splitting them into a training set and a test set would leave the two sets even more imbalanced than the original. To rule this out, I chose StratifiedShuffleSplit from sklearn, which ensures that both sets contain the same proportion of True and False.
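
The effect of stratification can be checked on synthetic labels that have the same class counts as DeathQ (378 False, 152 True). This is a sketch: the feature matrix is a dummy array, since only the labels matter for the split.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic labels with the same class counts as DeathQ.
y = np.array([False] * 378 + [True] * 152)
X = np.zeros((len(y), 1))  # features are irrelevant for the split itself

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.1, random_state=63)
train_index, test_index = next(sss.split(X, y))

print("overall True share:", y.mean())
print("train True share:  ", y[train_index].mean())
print("test True share:   ", y[test_index].mean())
```

All three shares should come out close to each other, which a plain random split does not guarantee on a dataset this small.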

# Decision tree

Clearly some variables matter more than others: it would be odd to claim that a patient's pre-existing conditions have the same impact on their survival rate as the country they come from, for example.

In [33]:
from sklearn import tree

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)

In [34]:
from sklearn.model_selection import cross_val_score

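# Note: cross_val_score clones clf and refits it on 5 folds of the test set;
# the tree fitted on X_train above is not the model being scored here.
scores = cross_val_score(clf, X_test, y_test)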
print("Accuracy: %f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.903636 (+/- 0.13)


We get an accuracy of 0.90, which seems quite good given how little data we actually process once the unusable records are removed. Next we try to view the decision tree and the hierarchy of importance it follows when making a decision.
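
For comparison, a common alternative evaluation is to cross-validate on the training rows and then score the fitted model once on the held-out test rows. The sketch below is not the notebook's procedure and uses synthetic data from `make_classification` as a stand-in for the encoded patient data, so its numbers say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded patient data (an assumption, not the real dataset).
X, y = make_classification(n_samples=530, n_features=20, random_state=63)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=63)

clf = DecisionTreeClassifier(random_state=63)

# Cross-validate on the training portion only.
cv_scores = cross_val_score(clf, X_train, y_train, cv=5)

# Fit once on all training rows, then score the untouched test rows.
test_accuracy = clf.fit(X_train, y_train).score(X_test, y_test)

print("CV accuracy: %f (+/- %0.2f)" % (cv_scores.mean(), cv_scores.std() * 2))
print("Held-out accuracy: %f" % test_accuracy)
```

This way the test rows are used exactly once, and only by a model that never saw them during fitting.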

In [35]:
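import graphviz
columns = list(dataFile.columns)
columns.remove('DeathQ')

# get_feature_names was removed in scikit-learn 1.2; newer versions use get_feature_names_out.
feature_names = ohe.get_feature_names(columns)
# Export the fitted tree to DOT format and wrap it in a graphviz object for rendering.
dot_data = tree.export_graphviz(clf, out_file=None,
                               feature_names=feature_names,
                               class_names=np.unique(y),
                               filled=True, rounded=True,
                               special_characters=True)
graph = graphviz.Source(dot_data)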
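
Besides drawing the tree, the importance hierarchy can be read from the fitted classifier's `feature_importances_` attribute. A minimal sketch on toy data, where the feature names are hypothetical and only the first feature determines the label:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(63)
# Toy binary features; only the first one determines the label.
X = rng.integers(0, 2, size=(200, 3))
y = X[:, 0]
feature_names = ["informative", "noise_a", "noise_b"]  # hypothetical names

clf = DecisionTreeClassifier(random_state=63).fit(X, y)

# Rank features by impurity-based importance, highest first.
ranking = sorted(zip(feature_names, clf.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

In the notebook, the same ranking could be built from `clf.feature_importances_` and the one-hot feature names, assuming both have the same length and order.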


# MLP

Next we use a Multi-Layer Perceptron to process the same data and draw some conclusions about which method is better. From here on we refer to the Multi-Layer Perceptron by its abbreviation, MLP.

An MLP belongs to the "feedforward" class of ANNs (artificial neural networks), in which the connections between nodes do not form a cycle, which we will see is convenient for this kind of analysis of our data. For this method I used sklearn's default implementation with max_iter set to 1000; the hidden layer size is left at its default of (100,).
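
A quick way to check which settings the classifier actually uses is to inspect its parameters. In this sketch the only argument passed is `max_iter=1000`, as in the notebook; everything else is scikit-learn's default.

```python
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(max_iter=1000)
params = mlp.get_params()

# hidden_layer_sizes is a tuple with one entry per hidden layer.
print("hidden_layer_sizes:", params["hidden_layer_sizes"])
print("activation:", params["activation"])
print("max_iter:", params["max_iter"])
```

A wider network would have to be requested explicitly, for example with `hidden_layer_sizes=(1000,)`.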

In [36]:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(max_iter=1000)

In [37]:
mlp.fit(X_train, y_train)

MLPClassifier(max_iter=1000)

In [38]:
# cross_val_score refits a clone of mlp on 5 folds of the test set.
scores = cross_val_score(mlp, X_test, y_test)
print("Accuracy: %f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.940000 (+/- 0.16)


# Final remarks

As we can see, MLP gives better results, with a larger standard deviation, on the same data. Its accuracy is about 4 percentage points higher (0.94 versus 0.90), which is significant given the volume of data we work with. We can never be too sure that the same would be observed on a large volume of data, but for the amount we have, MLP > DecisionTree.
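
One way to read the two printed results side by side is to compare the gap between the means with the printed +/- ranges. The arithmetic below only reuses the numbers from the two `Accuracy:` lines.

```python
# Means and +/- values copied from the two printed results.
tree_mean, tree_spread = 0.903636, 0.13
mlp_mean, mlp_spread = 0.940000, 0.16

tree_range = (tree_mean - tree_spread, tree_mean + tree_spread)
mlp_range = (mlp_mean - mlp_spread, mlp_mean + mlp_spread)

gap = mlp_mean - tree_mean
overlap = tree_range[0] <= mlp_range[1] and mlp_range[0] <= tree_range[1]

print("gap in accuracy: %.4f" % gap)
print("tree range: %.3f to %.3f" % tree_range)
print("MLP range:  %.3f to %.3f" % mlp_range)
print("ranges overlap:", overlap)
```

The gap is smaller than either range, which fits the caution above about whether the same ordering would hold on more data.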