To import the data, we mount Google Drive and grant the notebook permission to access its files.

In [2]:
from google.colab import drive
drive.mount('drive')

Go to this URL in a browser: https://accounts.google.com/o/oauth2/auth?client_id=947318989803-6bn6qk8qdgf4n4g3pfee6491hc0brc4i.apps.googleusercontent.com&redirect_uri=urn%3aietf%3awg%3aoauth%3a2.0%3aoob&response_type=code&scope=email%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdocs.test%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive%20https%3a%2f%2fwww.googleapis.com%2fauth%2fdrive.photos.readonly%20https%3a%2f%2fwww.googleapis.com%2fauth%2fpeopleapi.readonly

Enter your authorization code:
··········
Mounted at drive
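
Once the drive is mounted, it can be worth confirming that the data file is actually where we expect it before trying to read it. The following is a minimal sketch, not part of the original notebook; the helper name `require_file` is hypothetical, and it uses only Python's standard `pathlib` module.

```python
from pathlib import Path


def require_file(path_str):
    """Return a Path for path_str, or raise FileNotFoundError with a readable hint.

    Hypothetical helper for illustration; it is not part of google.colab.
    """
    path = Path(path_str)
    if not path.is_file():
        raise FileNotFoundError(
            f"Expected a file at {path}; is Google Drive mounted and the path correct?"
        )
    return path
```

In this notebook it could be called with the path used further down, for example `require_file('/content/drive/My Drive/Heart Failure AI/CHF Data.csv')`, so that a wrong path fails with a clear message.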


Let's also import the libraries that we will need.

In [14]:
import csv
import numpy as np
import pandas as pd

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

Next, we read our CSV file from the appropriate directory into a pandas DataFrame, keeping every value as a string for now. A DataFrame behaves like a labeled 2D table, so rows and columns can be looked up efficiently by label or position, and it can be passed directly to the machine learning libraries we use below.

The CSV should follow a certain format: the first two rows are headers and labels, and the remaining rows contain data. Passing header=1 tells pandas to skip the first row and use the second row as the column labels.

In [46]:
#find the path to our CSV file within Google Drive and pass it as the first argument to pd.read_csv.
#the path will begin with /content/drive/My Drive/ in most cases.
CHFdata = pd.read_csv('/content/drive/My Drive/Heart Failure AI/CHF Data.csv', header=1, dtype=str)
CHFdata.dtypes

monty, year, case #                                            object
Admission Date                                                 object
Age                                                            object
Gender                                                         object
BMI                                                            object
Zip Code                                                       object
Echocardiogram LVEF (%)                                        object
Troponin (highest)                                             object
Hemoglobin A1C                                                 object
Creat (Chem 7 within 24 hours of admission)                    object
GFR                                                            object
BNP (Initial, B-type naturetic peptide)                        object
Urine Tox negative (0) or per history                          object
Urine Tox Pos Stimulant (1)                                    object
Urine Tox Pos Benzo 

In [None]:
CHFdata

Unnamed: 0,"monty, year, case #",Admission Date,Age,Gender,BMI,Zip Code,Echocardiogram LVEF (%),Troponin (highest),Hemoglobin A1C,Creat (Chem 7 within 24 hours of admission),GFR,"BNP (Initial, B-type naturetic peptide)",Urine Tox negative (0) or per history,Urine Tox Pos Stimulant (1),Urine Tox Pos Benzo (2),Urine Tox Positive Opiate (3),Urine Tox Positive THC (4),Smoking Currently (Yes/ No),Former Smoker,Smoking (Pack Year History),Marijuana (THC),Alcohol (low/high),30 day readmission,60 day readmission,30 day death,60 day death,90 day death,DM,Hypertension,Coronary Artery Disease,Prior Stroke / TIA / Cerebral Vascular Ischemia,Atrial Fibrillation,Peripheral vascular disease,Obstructive Sleep Apnea,"Aortic Stenosis 0 = No, 1= yes / mild, 2=moderate, 3=severe",Dialysis
0,1,3/31/2018,78,M,26.22,95828,65,0.11,6.5,1.19,58,454,0,0,0,0,0,0,1,20,0,0,No,No,No,No,No,Yes,Yes,No,No,Yes,No,No,,
1,2,4/3/2018,66,M,28.18,95833,15,0.03,5.5,1.78,45,2103,0,0,0,0,0,0,1,38,0,1,Yes,Yes,No,No,No,No,Yes,Yes,No,No,No,No,,
2,3,4/6/2018,86,M,19.77,95691,31,0.11,5.5,1.32,52,136,0,0,0,0,0,0,1,31,0,0,No,No,No,No,No,Yes,Yes,Yes,No,No,No,No,,
3,4,4/6/2018,79,M,23.4,95829,35,0.06,5.4,2.37,29,1362,0,0,0,0,0,0,1,0,0,1,Yes,No,Yes,Yes,Yes,No,Yes,No,No,Yes,No,No,,
4,5,4/7/2018,58,M,31.13,94565,49,0.44,5.5,2.17,31,180,1,0,1,1,0,0,1,34,1,0,Yes,Yes,No,No,No,No,Yes,Yes,No,Yes,No,No,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
115,101815,10/20/18,78,M,23.1,95831,33,0.03,5.5,0.87,83,1745,0,0,0,0,0,0,1,50,0,0,No,No,No,No,No,No,Yes,No,No,Yes,No,No,0,No
116,101816,10/14/18,81,F,16.59,95824,65,0.02,6.6,2.29,19,1372,0,0,0,0,0,0,0,0,0,0,No,Yes,No,No,No,Yes,Yes,Yes,No,No,No,No,0,No
117,101817,10/2/18,68,F,38.74,95822,42,0.05,11.4,1.13,58,278,0,0,0,0,0,0,0,0,0,0,Yes,No,No,No,No,Yes,Yes,Yes,No,No,Yes,Yes,0,No
118,101818,10/10/18,78,M,29.95,95822,19,0.05,6.6,1.03,69,1798,0,0,0,0,0,0,1,20,0,1,No,Yes,No,No,No,Yes,No,Yes,Yes,Yes,Yes,No,0,No
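
The effect of `header=1` and `dtype=str` in the `read_csv` call above can be seen on a small in-memory file. This is only a sketch: the three-column file below is made up for illustration and is not the CHF data.

```python
import io

import pandas as pd

# Hypothetical miniature file: row 0 is a group header, row 1 holds the column labels.
raw = io.StringIO(
    "Demographics,,Labs\n"
    "Age,BMI,GFR\n"
    "78,26.22,58\n"
    "66,28.18,45\n"
)

# header=1 makes pandas skip row 0 and use row 1 as the column names;
# dtype=str keeps every cell as text, as in the notebook.
demo = pd.read_csv(raw, header=1, dtype=str)

print(demo.columns.tolist())
print(repr(demo.loc[0, "Age"]))  # a string, not a number
```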


We see that every column has the dtype "object", which is how pandas stores Python strings (str). However, we want some values stored as integers, floats, or booleans so that they represent actual numbers and truth values rather than characters, so we cast those columns.

In [47]:
#convert yes and no to True and False respectively as a bool
yesNoList = ['30 day readmission', '60 day readmission', '30 day death', '60 day death', '90 day death', 'DM', 'Hypertension',
             'Coronary Artery Disease', 'Prior Stroke / TIA / Cerebral Vascular Ischemia', 'Atrial Fibrillation', 
             'Peripheral vascular disease', 'Obstructive Sleep Apnea', 'Dialysis']
#convert 1 and 0 to True and False respectively as a bool
intBoolList = ['Urine Tox negative (0) or per history', 'Urine Tox Pos Stimulant (1)', 'Urine Tox Pos Benzo (2)',
               'Urine Tox Positive Opiate (3)', 'Urine Tox Positive THC (4)', 'Smoking Currently (Yes/ No)', 'Former Smoker',
               'Marijuana (THC)', 'Alcohol (low/high)']
#convert strings to appropriate types
convert_dict = {'Age': int, 'BMI': float, 'GFR': int} 
for i in yesNoList:
    CHFdata[i] = CHFdata[i].map({'Yes':True, 'No':False})
for j in intBoolList:
    CHFdata[j] = CHFdata[j].map({'1':True, '0':False})
CHFdata = CHFdata.astype(convert_dict) 

CHFdata

Unnamed: 0,"monty, year, case #",Admission Date,Age,Gender,BMI,Zip Code,Echocardiogram LVEF (%),Troponin (highest),Hemoglobin A1C,Creat (Chem 7 within 24 hours of admission),GFR,"BNP (Initial, B-type naturetic peptide)",Urine Tox negative (0) or per history,Urine Tox Pos Stimulant (1),Urine Tox Pos Benzo (2),Urine Tox Positive Opiate (3),Urine Tox Positive THC (4),Smoking Currently (Yes/ No),Former Smoker,Smoking (Pack Year History),Marijuana (THC),Alcohol (low/high),30 day readmission,60 day readmission,30 day death,60 day death,90 day death,DM,Hypertension,Coronary Artery Disease,Prior Stroke / TIA / Cerebral Vascular Ischemia,Atrial Fibrillation,Peripheral vascular disease,Obstructive Sleep Apnea,"Aortic Stenosis 0 = No, 1= yes / mild, 2=moderate, 3=severe",Dialysis
0,1,3/31/2018,78,M,26.22,95828,65,0.11,6.5,1.19,58,454,False,False,False,False,False,False,True,20,False,False,False,False,False,False,False,True,True,False,False,True,False,False,,
1,2,4/3/2018,66,M,28.18,95833,15,0.03,5.5,1.78,45,2103,False,False,False,False,False,False,True,38,False,True,True,True,False,False,False,False,True,True,False,False,False,False,,
2,3,4/6/2018,86,M,19.77,95691,31,0.11,5.5,1.32,52,136,False,False,False,False,False,False,True,31,False,False,False,False,False,False,False,True,True,True,False,False,False,False,,
3,4,4/6/2018,79,M,23.40,95829,35,0.06,5.4,2.37,29,1362,False,False,False,False,False,False,True,0,False,True,True,False,True,True,True,False,True,False,False,True,False,False,,
4,5,4/7/2018,58,M,31.13,94565,49,0.44,5.5,2.17,31,180,True,False,True,True,False,False,True,34,True,False,True,True,False,False,False,False,True,True,False,True,False,False,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
115,101815,10/20/18,78,M,23.10,95831,33,0.03,5.5,0.87,83,1745,False,False,False,False,False,False,True,50,False,False,False,False,False,False,False,False,True,False,False,True,False,False,0,False
116,101816,10/14/18,81,F,16.59,95824,65,0.02,6.6,2.29,19,1372,False,False,False,False,False,False,False,0,False,False,False,True,False,False,False,True,True,True,False,False,False,False,0,False
117,101817,10/2/18,68,F,38.74,95822,42,0.05,11.4,1.13,58,278,False,False,False,False,False,False,False,0,False,False,True,False,False,False,False,True,True,True,False,False,True,True,0,False
118,101818,10/10/18,78,M,29.95,95822,19,0.05,6.6,1.03,69,1798,False,False,False,False,False,False,True,20,False,True,False,True,False,False,False,True,False,True,True,True,True,False,0,False
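
One thing to watch in the casting step above is what happens to empty cells: `map` leaves them as NaN, and `astype(int)` raises an error if a column contains values that cannot be parsed. The sketch below shows one possible way to handle this on a toy frame with made-up values; the nullable `boolean` dtype and `pd.to_numeric(..., errors="coerce")` are alternatives that the original notebook does not use.

```python
import pandas as pd

# Toy frame with made-up values; None and "unknown" stand in for missing entries.
toy = pd.DataFrame({
    "Dialysis": ["Yes", "No", None],
    "GFR": ["58", "45", "unknown"],
})

# map() turns missing or unmapped values into NaN; the nullable 'boolean'
# dtype then stores them as <NA> instead of leaving the column as object.
toy["Dialysis"] = toy["Dialysis"].map({"Yes": True, "No": False}).astype("boolean")

# errors="coerce" converts unparseable strings to NaN rather than raising.
toy["GFR"] = pd.to_numeric(toy["GFR"], errors="coerce")

print(toy.dtypes)
print(toy)
```

Keeping missing entries as NaN rather than guessing a value fits the later choice of XGBoost, which can handle missing data.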


This application will use XGBoost as the machine learning algorithm since we are working with a relatively small dataset. Additionally, we can take advantage of XGBoost features such as its ability to handle missing data, which is common in our dataset, and its ability to reasonably dampen overfitting.

First, let's load and prepare our data for use with XGBoost. We can specify which factors we want to include in a list. Let's try using just 3 factors as input and the 30-day readmission as our output.

In [48]:
import numpy as np

inputColumns = ['Age', 'BMI', 'GFR']
outputColumn = '30 day readmission'

inputData = CHFdata[inputColumns]
outputData = CHFdata[outputColumn]

print(inputData)
print(outputData)

     Age    BMI  GFR
0     78  26.22   58
1     66  28.18   45
2     86  19.77   52
3     79  23.40   29
4     58  31.13   31
..   ...    ...  ...
115   78  23.10   83
116   81  16.59   19
117   68  38.74   58
118   78  29.95   69
119   70  26.80   38

[120 rows x 3 columns]
0      False
1       True
2      False
3       True
4       True
       ...  
115    False
116    False
117     True
118    False
119    False
Name: 30 day readmission, Length: 120, dtype: bool
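
Before training, it can help to check how many values are missing in the chosen input columns and how balanced the output classes are. The snippet below is a sketch on a made-up four-row frame (not the CHF data); the column name `readmit` is a placeholder.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "Age": [78, 66, 86, 79],
    "GFR": [58, 45, None, 29],
    "readmit": [False, True, False, True],
})

# Number of missing entries in each input column.
missing_per_column = toy[["Age", "GFR"]].isna().sum()

# Share of each class in the output column (fractions that sum to 1).
class_share = toy["readmit"].value_counts(normalize=True)

print(missing_per_column)
print(class_share)
```

On the real data, the same two calls could be applied to `inputData` and `outputData`.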


Let's split our data into a training set (70% of the rows) and a testing set (the remaining 30%). We pass a fixed seed (random_state) to the random number generator, which ensures that the split is the same on every run.

In [49]:
inputTrain, inputTest, outputTrain, outputTest = train_test_split(inputData, outputData, test_size = .3, random_state = 2)
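
The role of `random_state` can be checked directly: two calls with the same seed return identical splits. A small sketch with made-up arrays (10 samples, 2 features) rather than the CHF data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each, alternating labels.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Same seed twice: the shuffling, and therefore the split, is reproduced exactly.
first = train_test_split(X, y, test_size=0.3, random_state=2)
second = train_test_split(X, y, test_size=0.3, random_state=2)
same = all(np.array_equal(a, b) for a, b in zip(first, second))

X_train, X_test, y_train, y_test = first
print(same, X_train.shape, X_test.shape)
```

With such a small dataset, the particular seed can noticeably change which rows end up in the test set, so results from a single split should be read with some caution.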

Now we can train the model, using XGBoost's default parameters.

In [51]:
model = XGBClassifier()
model.fit(inputTrain, outputTrain)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, verbosity=1)


In [53]:
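from sklearn.metrics import accuracy_score

# predict() already returns class labels; the rounding below only matters
# if the model is changed to output probabilities instead
outputPred = model.predict(inputTest)
predictions = [round(value) for value in outputPred]
# evaluate predictions: fraction of test rows whose label was predicted correctly
accuracy = accuracy_score(outputTest, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))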

Accuracy: 66.67%
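
An accuracy figure is easier to interpret next to a simple baseline, such as always predicting the most common class. The sketch below uses made-up labels, not the model's actual predictions, and the function names `accuracy` and `majority_baseline` are hypothetical helpers written in plain Python.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Made-up labels for illustration only.
y_true = [False, False, True, False, True, False]
y_pred = [False, True, True, False, False, False]

print("model accuracy:    %.2f" % accuracy(y_true, y_pred))
print("majority baseline: %.2f" % majority_baseline(y_true))
```

If the model's accuracy is not clearly above the baseline, the chosen inputs may carry little predictive signal for the output, or the dataset may be too small to tell.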
