# Prognosis of cardiovascular events: from logistic regression to deep learning
                            Laura I. Queipo and Efrain Nava
                              Applied Computing Institute
                            University of Zulia, Venezuela
                       lqueipom@gmail.com, enava@ica.luz.edu.ve

                                   Progress report
                                 February 14th, 2018

## Introduction

Relevance. According to a statistical report of the American Heart Association (2016), heart disease is the leading cause of death for both men and women and is responsible for 1 in every 4 deaths. Even modest improvements in prognostic models of heart events and complications could therefore save hundreds of lives and help to significantly reduce the cost of health care services, medications, and lost productivity.


## Methods 

Deep neural networks (DNN) are a family of modern machine learning (ML) models that have gained widespread recognition because they were behind the first machine learning application in healthcare approved by the FDA (US Food and Drug Administration); to be approved, it had to pass tests showing that it can produce results at least as accurate as those humans currently achieve. Recently, such ML models were also used to detect, with cardiologist-level accuracy, 14 types of arrhythmia (sometimes life-threatening irregular heartbeats) from electrocardiogram (ECG) signals generated by wearable monitors.

## Original contribution

Studies exploring the potential of this technology for the prognosis of cardiovascular events and complications from risk factors have been limited. Examples of such events and complications are coronary artery disease, stroke, and congestive heart failure. The risk factors include those established by the American College of Cardiology/American Heart Association (ACC/AHA), such as age, high blood pressure, high LDL cholesterol, and smoking, as well as others such as systolic blood pressure variability, kidney disease, and ethnicity.

Most previous studies have used either logistic regression or classical machine learning algorithms such as random forests, gradient boosting, and (non-deep) neural networks. In addition, studies comparing these algorithms with deep learning models in the specific prognosis context under consideration are not readily available.

## Research objectives

Establish the relative performance of deep learning models (such as deep belief networks and convolutional neural networks) and ensembles with respect to classical machine learning algorithms (including logistic regression), using case studies built from well-known heart disease data sets such as the Cleveland set available from the UCI repository. Research questions of interest include the following. What is the sample-size threshold in heart disease studies above which the more complex, but potentially more effective, deep learning models would be recommended? Would ensembles of machine learning models provide more robust predictions, as has been the case in other knowledge domains? Should the ACC/AHA list of eight risk factors be updated with other genetic or lifestyle factors? The deep learning models will be implemented in TensorFlow (originally from Google, now open source) and healthcare.ai, an open-source package that facilitates the development of machine learning in healthcare, with the provision that they can handle so-called big data by using the Hadoop/Spark platform.
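The kind of comparison described above can be sketched with cross-validation. The snippet below is only an illustration under stated assumptions: it uses synthetic data from `make_classification` (not the Cleveland set), two classical models chosen as examples, and arbitrary hyperparameters; it does not reproduce any result of this study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a small clinical data set (assumption: ~300 rows, 13 features)
X_syn, y_syn = make_classification(
    n_samples=300, n_features=13, n_informative=6, random_state=0
)

# Candidate models to compare (illustrative choices and settings)
candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# 5-fold cross-validated accuracy for each candidate
mean_accuracy = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_syn, y_syn, cv=5, scoring="accuracy")
    mean_accuracy[name] = scores.mean()

for name, acc in mean_accuracy.items():
    print(f"{name}: {acc:.3f}")
```

With only a few hundred samples, cross-validation gives a less noisy estimate than a single train/test split, which is why it is used in this sketch.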

In [None]:
import pandas as pd

In [None]:
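# Column names, used only by the commented-out UCI download below
nombres = ['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','class']
# Read the data; "?" marks missing values
dataset = pd.read_csv("../input/Heart_Disease_Data.csv", na_values="?")
#dataset = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/cleve.mod', skiprows=20, header=None, sep=r'\s+', names=nombres, index_col=False, na_values="?")
# Collapse disease grades 1-4 into a single "sick" class (1); 0 stays "healthy".
# Assign the result back instead of calling replace(inplace=True) on a column selection.
dataset["pred_attribute"] = dataset["pred_attribute"].replace([1, 2, 3, 4], 1)
dataset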
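
The recoding step above turns the multi-grade diagnosis into a binary target. A minimal sketch of the same idea on a small made-up Series (the values are illustrative and not taken from the data set):

```python
import pandas as pd

# Hypothetical target column with the original coding
# (0 = no disease, 1-4 = disease grades)
target = pd.Series([0, 2, 1, 0, 4, 3], name="pred_attribute")

# Any value greater than 0 becomes 1 (sick); 0 stays 0 (healthy)
binary_target = (target > 0).astype(int)

print(binary_target.tolist())
```

The comparison `target > 0` is equivalent to replacing 1, 2, 3 and 4 with 1, provided the column contains no other values.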

In [None]:
%matplotlib inline

In [None]:
import numpy as np
import pandas as pd
from scipy import stats, integrate
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression 
from sklearn.model_selection import train_test_split 
from sklearn import metrics

### 1. Preliminary description of the data. 

Box plots were used for the continuous variables and histograms (bar charts of value counts) for the categorical variables.
Basic statistics are also reported.

#### Continuous variables 
Box plots + basic statistics

In [None]:
#Boxplots
continuas=["age", "trestbps", "chol", "thalach", "oldpeak", "ca"]
dataset[continuas].boxplot(return_type='axes', figsize=(12,8))
plt.show()

In [None]:
continuas=["age", "trestbps", "chol", "thalach", "oldpeak", "ca"]
dataset[continuas].describe()

#### Categorical variables 
Histograms + basic statistics

In [None]:
#Sex: sex (1 = male; 0 = female) 
tempo5 = dataset['sex']
tempo5.value_counts().plot(kind="bar")

In [None]:
#Fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
tempo6 = dataset['fbs']
tempo6.value_counts().plot(kind="bar")

In [None]:
#Slope: the slope of the peak exercise ST segment  
#Value 1: upsloping 
#Value 2: flat 
#Value 3: downsloping
tempo7 = dataset['slop']
tempo7.value_counts().plot(kind="bar")

In [None]:
#Cp: chest pain type
#Value 1: typical angina 
#Value 2: atypical angina 
#Value 3: non-anginal pain 
#Value 4: asymptomatic 
tempo8 = dataset['cp']
tempo8.value_counts().plot(kind="bar")

In [None]:
#Exang: exercise induced angina (1 = yes; 0 = no) 
tempo9 = dataset['exang']
tempo9.value_counts().plot(kind="bar")

In [None]:
#Thal: 3 = normal; 6 = fixed defect; 7 = reversible defect 
tempo10 = dataset['thal']
tempo10.value_counts().plot(kind="bar")

In [None]:
#Restecg: resting electrocardiographic results 
#Value 0: normal 
#Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) 
#Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria 
tempo11 = dataset['restecg']
tempo11.value_counts().plot(kind="bar")

In [None]:
#Class: diagnosis of heart disease (angiographic disease status) 
#Value 0: < 50% diameter narrowing (Healthy)
#Value 1: > 50% diameter narrowing (Sick)
tempo12 = dataset['pred_attribute']
tempo12.value_counts().plot(kind="bar")

### 2. Define training and test samples. 

The Cleveland data set available from the UCI repository has 303 samples; the training and test data sets were randomly selected with 30% of the original data set corresponding to the test data set.  The relative proportions of the classes of interest (disease/no disease) in both sets were checked to be similar.
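
One way to obtain similar class proportions by construction, rather than checking them after the fact, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic arrays (`X_demo`, `y_demo` are hypothetical names, not the notebook's variables) and is an alternative to, not a description of, the plain random split used in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 300 samples, 5 features, 160 healthy (0) and 140 sick (1)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(300, 5))
y_demo = np.array([0] * 160 + [1] * 140)

# stratify=y_demo keeps the class ratio (almost) identical in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Fraction of sick cases in each subset
print(y_tr.mean(), y_te.mean())
```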

In [None]:
dataset.dropna(inplace=True, axis=0, how="any")
X=dataset.loc[:, "age":"thal" ]
Y=dataset["pred_attribute"]

In [None]:
# evaluate the model by splitting into train and test sets

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=0)


In [None]:
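# sort_index() orders the counts by class label (0, 1) so they line up with the Healthy/Sick row labels
freqs = pd.DataFrame({ "Training dataset": Y_train.value_counts().sort_index().tolist(), "Test dataset": Y_test.value_counts().sort_index().tolist(), "Total": Y.value_counts().sort_index().tolist()}, index=["Healthy", "Sick"])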
freqs[["Training dataset", "Test dataset", "Total"]]

In [None]:
# instantiate a logistic regression model and fit it on the training data (X_train, Y_train)
model = LogisticRegression()
model.fit(X_train, Y_train)

# check the accuracy on the training set
model.score(X_train, Y_train)




In [None]:
# check the accuracy on the test set
model.score(X_test, Y_test)



In [None]:
# predict class labels for the training set
predicted1 = model.predict(X_train)

# predict class labels for the test set
predicted2 = model.predict(X_test)

* Confusion matrices for training and test data sets (0=Healthy, 1=Sick)

In [None]:
# rows are the true labels, columns are the model predictions
pd.crosstab(Y_train, predicted1, rownames=['Reality'], colnames=['Predicted'], margins=True)

In [None]:
# rows are the true labels, columns are the model predictions
pd.crosstab(Y_test, predicted2, rownames=['Reality'], colnames=['Predicted'], margins=True)
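
In a diagnostic setting, a confusion matrix is often summarized by sensitivity (fraction of sick patients detected) and specificity (fraction of healthy patients correctly cleared). A minimal sketch with made-up labels (the arrays below are illustrative, not the model's output):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (0 = Healthy, 1 = Sick)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# With labels=[0, 1], ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")
```

The same quantities could be computed from `Y_test` and `predicted2` once the model has been fitted.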