<br><br>
# Case 1: Heart disease classification

### Juha Nuutinen

### 20.01.2019

### Helsinki Metropolia University of Applied Sciences
<br><br>

# 1. Objectives
This notebook documents the process of using a neural network to predict the presence of heart disease in an individual from a set of their medical attributes.

The goals of this assignment are to learn to use Python for machine learning with neural networks (from the `keras` library), read data from external sources using the `pandas` library, visualize data with `matplotlib`, and document the results clearly.
Learning to use neural networks includes testing different model architectures (number of layers, number of units, activation functions), optimizers, and training settings (epochs, batch sizes, validation splits).

# 2. Required Libraries

In [61]:
# Required libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from keras.models import Sequential
from keras.layers import Dense, Activation

from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle

# 3. Data

### Origin

The data is provided by the <a href="https://archive.ics.uci.edu/ml/index.php">UC Irvine Machine Learning Repository</a>, and the data folder can be found <a href="https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/">here</a>. For this assignment, we are going to use the preprocessed data from the Cleveland Clinic Foundation (`processed.cleveland.data`). The principal investigator responsible for the collection of the data is Robert Detrano, M.D., Ph.D., at the Cleveland Clinic Foundation. The data is from the year 1988.

### Description

All information and numbers in this section are taken from the <a href="https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/heart-disease.names">heart-disease.names</a> file. The data is in CSV (Comma Separated Values) format and contains a total of 303 instances. Each instance has 14 attributes. The unprocessed data originally contained 76 attributes. Missing values are encoded with a question mark (?).

A list of the attributes in each instance:
1. age
    * Age in years, numeric
2. sex
    * Nominal
    * 1 = male
    * 0 = female
3. cp
    * Chest pain type, nominal
    * 1 = typical angina
    * 2 = atypical angina
    * 3 = non-anginal pain
    * 4 = asymptomatic
4. trestbps
    * Resting blood pressure in mm Hg, numeric
5. chol
    * Serum cholesterol in mg/dl, numeric
6. fbs
    * Fasting blood sugar > 120 mg/dl, nominal
    * 1 = true
    * 0 = false
7. restecg
    * Resting electrocardiographic results, nominal
    * 0 = normal
    * 1 = having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    * 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria
8. thalach
    * Maximum heart rate achieved, numeric
9. exang
    * Exercise induced angina, nominal
    * 1 = yes
    * 0 = no
10. oldpeak
    * ST depression induced by exercise relative to rest, numeric
11. slope
    * The slope of the peak exercise ST segment, nominal
    * 1 = upsloping
    * 2 = flat
    * 3 = downsloping
12. ca
    * Number of major vessels (0-3) colored by fluoroscopy, numeric
13. thal
    * Nominal
    * 3 = normal
    * 6 = fixed defect
    * 7 = reversible defect
14. num (the predicted value)
    * Diagnosis of heart disease (angiographic disease status), nominal
    * 0 = absence of disease
    * 1, 2, 3, 4 = presence of disease

In this data set, there are 164 disease-free instances (num = 0) and 139 instances with disease (num = 1, 2, 3 or 4), for a total of 303 instances.

### Preprocessing

#### Read in the data
The data is read into a pandas DataFrame in the cell below, and the column names are set.

In [62]:
url = r"http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
dataframe = pd.read_csv(url, 
                        sep = ',', 
                        header = None, 
                        index_col = None,
                        na_values = '?')

# The CSV data does not contain a header row, so the column names
# must be set manually.
names = ["age", "sex", "cp","trestbps", "chol", "fbs","restecg",
         "thalac","exang","oldpeak","slope","ca","thal","num"]

dataframe.columns = names
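
As a quick sanity check of the `na_values` handling, the following sketch parses a few rows in the same comma-separated layout from an in-memory string instead of the URL. The sample rows are made up for illustration and are not taken from the Cleveland file, and the names `sample_csv` and `sample` are hypothetical.

```python
import io

import pandas as pd

# Made-up rows in the same 14-column layout; '?' marks a missing value.
sample_csv = """54,1,4,130,250,0,0,140,1,1.2,2,0,7,1
61,0,3,120,210,0,2,155,0,0.0,1,?,3,0
47,1,2,135,230,1,0,165,0,0.5,1,1,?,0
"""

names = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
         "thalac", "exang", "oldpeak", "slope", "ca", "thal", "num"]

sample = pd.read_csv(io.StringIO(sample_csv), header=None,
                     names=names, na_values="?")

# Each '?' has become NaN, so it shows up in the per-column count.
missing_per_column = sample.isna().sum()
print(sample.shape)
print(missing_per_column[missing_per_column > 0])
```

Running the same two `print` lines on the real `dataframe` should show which columns contain missing values before they are filled.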

#### Set column (attribute) types
Attributes 2 (sex), 3 (cp), 6 (fbs), 7 (restecg), 9 (exang), 11 (slope), 13 (thal) and 14 (num) are nominal; all others are numeric. Only the nominal columns need to be converted, to the categorical type, in the cell below, because all columns are interpreted as numeric by default when the file is read.

In [63]:
dataframe = dataframe.astype({"sex": "category",
                              "cp": "category",
                              "fbs": "category",
                              "restecg": "category",
                              "exang": "category",
                              "slope": "category",
                              "thal": "category",
                              "num": "category"})

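To illustrate what the conversion does, here is a small sketch on a toy frame. The values and the name `toy` are made up; only the `astype` call mirrors the cell above.

```python
import pandas as pd

# Toy frame with one nominal and one numeric column (made-up values).
toy = pd.DataFrame({"cp": [1.0, 4.0, 3.0, 4.0],
                    "age": [63.0, 67.0, 41.0, 56.0]})
print(toy.dtypes)   # both columns start out as float64

toy = toy.astype({"cp": "category"})
print(toy.dtypes)   # 'cp' is now category, 'age' is unchanged

# A categorical column stores the distinct values it can take.
print(list(toy["cp"].cat.categories))
```
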
#### Fill missing values
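Next, the missing values are replaced: numeric columns with the column median, and categorical columns with their most frequent value, because a median is not defined for nominal data.

In [64]:
# Missing values are '?' in the original data, but the read_csv call
# in an earlier cell converted them to NaNs, which are the default
# representation for missing values in pandas.

# Numeric columns: fill with the column median.
dataframe = dataframe.fillna(dataframe.median(numeric_only=True))

# Categorical columns have no median: fill with the most frequent value.
for col in dataframe.select_dtypes(include="category").columns:
    dataframe[col] = dataframe[col].fillna(dataframe[col].mode()[0])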
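
The following sketch shows median imputation on toy data. It also fills a categorical column with its most frequent value, which is an assumption on our part, since a median is not defined for nominal values. All values and the name `toy` are made up.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column (made-up values).
toy = pd.DataFrame({
    "ca": [0.0, 1.0, np.nan, 3.0, 0.0],
    "thal": pd.Categorical([3.0, 7.0, 3.0, np.nan, 6.0]),
})
print(toy.isna().sum())   # number of missing values per column

# Numeric column: the median of [0, 0, 1, 3] is 0.5.
toy["ca"] = toy["ca"].fillna(toy["ca"].median())

# Categorical column: the most frequent value is 3.0.
toy["thal"] = toy["thal"].fillna(toy["thal"].mode()[0])

print(toy)
```
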

#### Normalize the numeric values
The next step is to standardize the numeric columns to zero mean and unit variance, so that they are all on a comparable scale.

In [65]:
numeric_columns = ["age", "trestbps", "chol", "thalac", "oldpeak", "ca"]
scaler = StandardScaler()
dataframe[numeric_columns] = scaler.fit_transform(dataframe[numeric_columns])
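
To make the transformation concrete, the sketch below compares `StandardScaler` with the formula z = (x - mean) / std computed by hand. The ages are made-up values; `ages`, `scaled` and `manual` are hypothetical names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up ages as a single-column 2-D array, the shape scikit-learn expects.
ages = np.array([[29.0], [45.0], [54.0], [61.0], [77.0]])

scaled = StandardScaler().fit_transform(ages)

# The same result by hand: subtract the mean, then divide by the
# population standard deviation (ddof=0, the NumPy default).
manual = (ages - ages.mean(axis=0)) / ages.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(), scaled.std())
```

After scaling, the column has a mean of approximately 0 and a standard deviation of approximately 1, up to floating-point rounding.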

#### Modify the labels to binary
Because we are only trying to predict the presence of heart disease, and not the type, the labels need to be converted to a binary form. If the label is 0 (= no disease), it is left as-is; otherwise it is set to 1.

In [66]:
dataframe["num"] = dataframe["num"].mask(dataframe["num"] != 0, 1)
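
A small sketch of the same conversion on made-up labels, together with an equivalent formulation. The names `labels`, `binary` and `binary_alt` are hypothetical.

```python
import pandas as pd

# Made-up labels covering all five original values.
labels = pd.Series([0, 2, 0, 1, 4, 3, 0])

# Same idea as in the cell above: keep 0, replace everything else with 1.
binary = labels.mask(labels != 0, 1)

# An equivalent alternative: the comparison gives booleans, cast to int.
binary_alt = (labels != 0).astype(int)

print(binary.tolist())
print(binary_alt.tolist())
```
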

#### Shuffle the data
The data is shuffled randomly to remove any ordering that may be present in the file.

In [67]:
dataframe = shuffle(dataframe)

#### Divide the data into a training set and a validation set

In [85]:
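# Shuffle the rows, then use the first 70 % for training and the
# remaining 30 % for validation.
shuffled = dataframe.sample(frac=1)
n_train = int(0.7 * len(shuffled))
df_train, df_validate = shuffled.iloc[:n_train], shuffled.iloc[n_train:]
print("Size of training set: {0}".format(len(df_train)))
print("Size of validation set: {0}".format(len(df_validate)))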

Size of training set: 212
Size of validation set: 91
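
Because the shuffle is random, every run produces a different split and therefore slightly different results. The sketch below shows one possible way to make the split repeatable by fixing the random seed. The helper `split`, its parameters and the toy data are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Stand-in for the real data: 10 made-up rows.
toy = pd.DataFrame({"x": range(10), "num": [0, 1] * 5})


def split(frame, train_fraction=0.7, seed=42):
    """Shuffle with a fixed seed, then cut at the given fraction."""
    shuffled = frame.sample(frac=1, random_state=seed)
    n_train = int(train_fraction * len(shuffled))
    return shuffled.iloc[:n_train], shuffled.iloc[n_train:]


train_a, validate_a = split(toy)
train_b, validate_b = split(toy)

print(len(train_a), len(validate_a))
# The same seed gives the same rows in the same order.
print(train_a.index.equals(train_b.index))
```
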


# 4. Modeling and compilation

In [86]:
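# A small fully connected network: 13 input features -> 9 hidden
# units (ReLU) -> 1 output unit (sigmoid), which gives the predicted
# probability of heart disease.
model = Sequential([
    Dense(9, input_dim=13),
    Activation("relu"),
    Dense(1),
    Activation("sigmoid")
])

# Binary cross-entropy is the loss that matches a single sigmoid output.
model.compile(optimizer="rmsprop",
              loss="binary_crossentropy",
              metrics=["accuracy"])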
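
To show what this architecture computes, here is a plain NumPy sketch of the forward pass with the same layer sizes. The weights are random placeholders, not the trained Keras weights, and all names (`W1`, `b1`, `W2`, `b2`, `forward`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_hidden, n_outputs = 13, 9, 1

# Random weights and zero biases (placeholders, not trained values).
W1 = rng.normal(size=(n_inputs, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_hidden, n_outputs))
b2 = np.zeros(n_outputs)


def forward(x):
    """Forward pass: Dense(9) + ReLU, then Dense(1) + sigmoid."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU
    logits = hidden @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logits))    # sigmoid


# Trainable parameters: (13*9 + 9) + (9*1 + 1) = 126 + 10 = 136
n_params = W1.size + b1.size + W2.size + b2.size

batch = rng.normal(size=(4, n_inputs))      # four made-up instances
probabilities = forward(batch)
print(n_params, probabilities.shape)
```

The parameter count of 136 should match the total that `model.summary()` reports for the Keras model.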

# 5. Training and validation

In [87]:
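# Features are all columns from 'age' to 'thal'; the label is 'num'.
# Categorical columns are cast to float so that Keras receives
# purely numeric data.
training_data = df_train.loc[:, 'age':'thal'].astype("float32")
training_labels = df_train.loc[:, 'num'].astype("float32")
model.fit(training_data, training_labels, epochs=20, batch_size=12)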

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x7fd992a71d68>
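
The `History` object returned by `model.fit` has a `history` attribute: a dictionary that maps each metric name to a list with one value per epoch. The sketch below plots such a dictionary with `matplotlib`. The helper `plot_history` is hypothetical, and the numbers in `fake_history` are made up so that the example runs without training a model. The key for accuracy is `"acc"` in older Keras versions and `"accuracy"` in newer ones, so the helper simply plots whatever keys it finds.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt


def plot_history(history_dict):
    """Plot each metric of a Keras-style history dict against the epoch."""
    n_metrics = len(history_dict)
    fig, axes = plt.subplots(1, n_metrics, figsize=(5 * n_metrics, 3),
                             squeeze=False)
    for ax, (name, values) in zip(axes[0], history_dict.items()):
        ax.plot(range(1, len(values) + 1), values)
        ax.set_xlabel("epoch")
        ax.set_ylabel(name)
    fig.tight_layout()
    return fig


# Made-up values for illustration only.
fake_history = {"loss": [0.70, 0.62, 0.55, 0.51],
                "accuracy": [0.55, 0.66, 0.72, 0.75]}
fig = plot_history(fake_history)
```

With a real model, the call would be `plot_history(model.fit(...).history)`.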

# 6. Evaluation

In [91]:
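# Same column selection and float cast as for the training set.
validation_data = df_validate.loc[:, 'age':'thal'].astype("float32")
validation_labels = df_validate.loc[:, 'num'].astype("float32")
# evaluate() returns [loss, accuracy] for the metrics set in compile().
score = model.evaluate(validation_data, validation_labels, batch_size=12)
print("Validation set accuracy: {0}".format(score[1]))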

Validation set accuracy: 0.7142857214906714
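
To put this number in context, it can be compared with the accuracy of always predicting the majority class. The sketch below uses the class counts from the data description (164 disease-free and 139 with disease). Note that this baseline is computed on the full data set, so the ratio in a random validation split may differ somewhat. The variable names are hypothetical.

```python
# Class counts from the data description.
n_healthy, n_disease = 164, 139
n_total = n_healthy + n_disease

# Accuracy of always predicting the majority class (no disease),
# measured on the full data set.
baseline_accuracy = max(n_healthy, n_disease) / n_total

# Validation accuracy reported above.
model_accuracy = 0.7142857214906714

print("Majority-class baseline: {0:.3f}".format(baseline_accuracy))
print("Model validation accuracy: {0:.3f}".format(model_accuracy))
print("Difference: {0:.3f}".format(model_accuracy - baseline_accuracy))
```
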


# 7. Results

# 8. Conclusion