<br><br>
# <p style="text-align: center;">Case 1: Heart disease classification</p>

### <p style="text-align: center;">Juha Nuutinen</p>

### <p style="text-align: center;">20.01.2019</p>

### <p style="text-align: center;">Helsinki Metropolia University of Applied Sciences</p>
<br><br>

# 1. Objectives
This notebook documents the process of using neural networks to predict the presence of heart disease in an individual from a set of their biological attributes.

The goals of this assignment are to learn to use Python for machine learning with neural networks (from the `keras` library), read data from external sources using the `pandas` library, visualize data with `matplotlib`, and document the results clearly.
Learning to use neural networks includes testing different model architectures (number of layers, number of units, activation functions), optimizers, and training settings (epochs, batch sizes, validation splits).

# 2. Required Libraries

In [52]:
# Required libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from keras.models import Sequential
from keras.layers import Dense, Activation

# 3. Data

### Origin

The data is provided by the <a href="https://archive.ics.uci.edu/ml/index.php">UC Irvine Machine Learning Repository</a>, and the data folder can be found <a href="https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/">here</a>. For this assignment, we use the preprocessed data from the Cleveland Clinic Foundation (`processed.cleveland.data`). The principal investigator responsible for collecting the data is Robert Detrano, M.D., Ph.D., at the Cleveland Clinic Foundation. The data is from 1988.

### Description

All information and numbers in this section are taken from the <a href="https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/heart-disease.names">`heart-disease.names`</a> file. The data is in CSV (Comma Separated Values) format and contains 303 instances, each with 14 attributes. The original, unprocessed data contained 76 attributes. Missing values are encoded with a question mark (?).

A list of the attributes in each instance:
1. age
    * Age in years, numeric
2. sex
    * Nominal
    * 1 = male
    * 0 = female
3. cp
    * Chest pain type, nominal
    * 1 = typical angina
    * 2 = atypical angina
    * 3 = non-anginal pain
    * 4 = asymptomatic
4. trestbps
    * Resting blood pressure in mm Hg, numeric
5. chol
    * Serum cholesterol in mg/dl, numeric
6. fbs
    * Fasting blood sugar > 120 mg/dl, nominal
    * 1 = true
    * 0 = false
7. restecg
    * Resting electrocardiographic results, nominal
    * 0 = normal
    * 1 = having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    * 2 = showing probable or definite left ventricular hypertrophy by Estes' criteria
8. thalach
    * Maximum heart rate achieved, numeric
9. exang
    * Exercise induced angina, nominal
    * 1 = yes
    * 0 = no
10. oldpeak
    * ST depression induced by exercise relative to rest, numeric
11. slope
    * The slope of the peak exercise ST segment, nominal
    * 1 = upsloping
    * 2 = flat
    * 3 = downsloping
12. ca
    * Number of major vessels (0-3) colored by fluoroscopy, numeric
13. thal
    * Nominal
    * 3 = normal
    * 6 = fixed defect
    * 7 = reversible defect
14. num (the predicted value)
    * Diagnosis of heart disease (angiographic disease status), nominal
    * 0 = absence of disease
    * 1, 2, 3, 4 = presence of disease

In this data set, there are a total of 164 disease-free instances (num = 0), and 139 instances with disease (num = 1, 2, 3 or 4), totaling to 303 instances.
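
The split into "absence" and "presence" amounts to collapsing the five values of `num` into a binary label. A minimal sketch of that mapping follows, using a small made-up Series rather than the real data; the variable names `num` and `has_disease` are illustrative assumptions.

```python
import pandas as pd

# Made-up example values of the "num" attribute (0 = no disease, 1-4 = disease).
num = pd.Series([0, 2, 1, 0, 4, 0, 3], name="num")

# Any value above 0 counts as presence of disease.
has_disease = (num > 0).astype(int)

# Count instances per binary class, ordered by class label.
counts = has_disease.value_counts().sort_index()
print(counts)
```

Applied to the `num` column of the real data, the same counting should reproduce the class sizes quoted above.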

### Preprocessing

#### Read in the data
The data is read into a pandas DataFrame in the cell below, and the column names are set.

In [53]:
url = r'http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data'
dataframe = pd.read_csv(url, 
                        sep = ',', 
                        header = None, 
                        index_col = None,
                        na_values = '?')

# The CSV data does not contain a header row, so the column names
# must be set manually.
names = ['age', 'sex', 'cp','trestbps', 'chol', 'fbs','restecg',
         'thalac','exang','oldpeak','slope','ca','thal','num']

dataframe.columns = names
print(dataframe.tail())

      age  sex   cp  trestbps   chol  fbs  restecg  thalac  exang  oldpeak  \
298  45.0  1.0  1.0     110.0  264.0  0.0      0.0   132.0    0.0      1.2   
299  68.0  1.0  4.0     144.0  193.0  1.0      0.0   141.0    0.0      3.4   
300  57.0  1.0  4.0     130.0  131.0  0.0      0.0   115.0    1.0      1.2   
301  57.0  0.0  2.0     130.0  236.0  0.0      2.0   174.0    0.0      0.0   
302  38.0  1.0  3.0     138.0  175.0  0.0      0.0   173.0    0.0      0.0   

     slope   ca  thal  num  
298    2.0  0.0   7.0    1  
299    2.0  2.0   7.0    2  
300    2.0  1.0   7.0    3  
301    2.0  1.0   3.0    1  
302    1.0  NaN   3.0    0  


Note the missing (NaN) value in the excerpt above.
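
The effect of `na_values = '?'` can be seen in isolation on a tiny in-memory CSV. This is only a sketch: the two rows and the reduced column list are made up for illustration and are not taken from the data file.

```python
import io

import pandas as pd

# Two made-up rows; the second one has a missing "ca" value encoded as "?".
sample_csv = "57.0,0.0,1.0,1\n38.0,1.0,?,0\n"

toy = pd.read_csv(io.StringIO(sample_csv),
                  header=None,
                  names=["age", "sex", "ca", "num"],
                  na_values="?")

# The "?" is parsed as NaN, so the column stays numeric.
print(toy)
print(toy.isnull().sum())
```
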
#### Set column (attribute) types
The attributes 2 (sex), 3 (cp), 6 (fbs), 7 (restecg), 9 (exang), 11 (slope), 13 (thal) and 14 (num) are nominal; all others are numeric. Column types are converted accordingly in the cell below. Only the nominal columns need to be changed to categorical, as all columns are interpreted as numeric by default.

In [54]:
dataframe = dataframe.astype({"sex": 'category',
                              "cp": 'category',
                              "fbs": 'category',
                              "restecg": 'category',
                              "exang": 'category',
                              "slope": 'category',
                              "thal": 'category',
                              "num": 'category'})
print(dataframe.dtypes)

age          float64
sex         category
cp          category
trestbps     float64
chol         float64
fbs         category
restecg     category
thalac       float64
exang       category
oldpeak      float64
slope       category
ca           float64
thal        category
num         category
dtype: object
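
Marking a column as `category` changes only its pandas dtype; the stored values are still numbers. One common option, which this notebook does not use, is to one-hot encode nominal columns before passing them to a network. The sketch below assumes `pd.get_dummies` is an acceptable encoder for this purpose and uses made-up toy values.

```python
import pandas as pd

# Made-up rows with one numeric and one nominal attribute.
toy = pd.DataFrame({"age": [63.0, 41.0, 57.0],
                    "cp": [1.0, 4.0, 3.0]})
toy = toy.astype({"cp": "category"})

# Replace the nominal column with one indicator column per category.
encoded = pd.get_dummies(toy, columns=["cp"], dtype="float32")
print(encoded)
```

Each row then has exactly one indicator set per nominal attribute, so the network does not treat the category codes as ordered quantities.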


#### Fill missing values
Next, the missing values are replaced with the median of the corresponding column.

In [55]:
# Check how many missing values there are.
# Missing values are '?' in the original data, but in the read_csv
# line in an earlier cell, they were converted to NaNs, which are the default representation
# for missing values in Pandas.
print("Total number of missing values: " + str(dataframe.isnull().sum().sum()) + "\n")

dataframe = dataframe.fillna(dataframe.median())
print(dataframe.tail())

print("\nTotal number of missing values after filling them (should be 0): "
      + str(dataframe.isnull().sum().sum()))

Total number of missing values: 6

      age  sex   cp  trestbps   chol  fbs restecg  thalac exang  oldpeak  \
298  45.0  1.0  1.0     110.0  264.0  0.0     0.0   132.0   0.0      1.2   
299  68.0  1.0  4.0     144.0  193.0  1.0     0.0   141.0   0.0      3.4   
300  57.0  1.0  4.0     130.0  131.0  0.0     0.0   115.0   1.0      1.2   
301  57.0  0.0  2.0     130.0  236.0  0.0     2.0   174.0   0.0      0.0   
302  38.0  1.0  3.0     138.0  175.0  0.0     0.0   173.0   0.0      0.0   

    slope   ca thal num  
298   2.0  0.0  7.0   1  
299   2.0  2.0  7.0   2  
300   2.0  1.0  7.0   3  
301   2.0  1.0  3.0   1  
302   1.0  0.0  3.0   0  

Total number of missing values after filling them (should be 0): 0


Comparing the excerpt above with the one shown earlier, we see that the "ca" value that was NaN has now been replaced with a value (0.0).
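
A median is only defined for numeric columns, so nominal columns such as "thal" may need a different fill value, for example the most frequent category. The sketch below shows one way to handle both kinds of column on made-up toy data; it is an assumption about a reasonable approach, not a description of what the cell above does.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in a numeric and in a nominal column.
toy = pd.DataFrame({
    "ca": [0.0, 2.0, np.nan, 1.0, 0.0],
    "thal": pd.Series([3.0, 7.0, 7.0, np.nan, 7.0]).astype("category"),
})

numeric_cols = toy.select_dtypes(include="number").columns
categorical_cols = toy.select_dtypes(include="category").columns

filled = toy.copy()
# Numeric columns: fill with the column median.
filled[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())
# Nominal columns: fill with the most frequent category (the mode).
for col in categorical_cols:
    filled[col] = toy[col].fillna(toy[col].mode().iloc[0])

print(filled)
```
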

# 4. Modeling and compilation

In [92]:
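model = Sequential([
    # Hidden layer: 32 units fed by the 13 input attributes
    Dense(32, input_dim=13),
    Activation('relu'),
    # Single output unit; a sigmoid maps it to a probability for the
    # binary label (a softmax over one unit would always output 1)
    Dense(1),
    Activation('sigmoid')
])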


model.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['accuracy'])
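
A note on the output activation: with a single output unit, a softmax always evaluates to 1, whereas a sigmoid maps the unit to a probability between 0 and 1, which is what `binary_crossentropy` expects. The NumPy sketch below illustrates the difference; the `softmax` and `sigmoid` helpers are written here for illustration and are not the keras implementations.

```python
import numpy as np

def softmax(z):
    """Softmax over the last axis, shifted for numerical stability."""
    shifted = z - z.max(axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

# Three samples, each with a single output unit (shape: 3 x 1).
logits = np.array([[-2.0], [0.0], [3.0]])

softmax_out = softmax(logits)   # normalises over one value per row
sigmoid_out = sigmoid(logits)   # varies with the logit

print(softmax_out.ravel())
print(sigmoid_out.ravel())
```
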

# 5. Training and validation
The data must be split into two parts for training: one containing all attributes except the label attribute ("num" in this case), and one containing only the correct labels.

In [93]:
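# Cast to float32 so that keras receives plain numeric input.
training_data = dataframe.loc[:, 'age':'thal'].astype('float32')
# Collapse num (0-4) to a binary label (0 = no disease, 1 = disease),
# as expected by the binary_crossentropy loss.
training_labels = (dataframe.loc[:, 'num'].astype(int) > 0).astype('float32')

model.fit(training_data, training_labels, epochs = 5, batch_size = 32)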

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fdd25b745f8>
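
To evaluate on data the model has not seen during training, part of the rows can be held out before fitting. The sketch below shows a simple shuffled hold-out split on made-up toy data; the `split_holdout` helper, the 80/20 ratio and the seed are assumptions for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

def split_holdout(features, labels, test_fraction=0.2, seed=0):
    """Shuffle row positions and split features/labels into train and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(features))
    n_test = int(round(len(features) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return (features.iloc[train_idx], labels.iloc[train_idx],
            features.iloc[test_idx], labels.iloc[test_idx])

# Made-up features and binary labels, ten rows each.
toy_features = pd.DataFrame({"age": np.arange(10, dtype=float),
                             "chol": np.arange(10, dtype=float) * 10})
toy_labels = pd.Series([0, 1] * 5, name="num")

x_train, y_train, x_test, y_test = split_holdout(toy_features, toy_labels)
print(len(x_train), len(x_test))
```

The model would then be fitted on the training part only, keeping the test part aside for evaluation.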

# 6. Evaluation

In [50]:
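# No separate test set was held out above, so x_test and y_test do not exist.
# As a stopgap, evaluate on the training data; this overestimates real performance.
score = model.evaluate(training_data, training_labels, batch_size=128)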
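
With the compile settings used in Section 4, `model.evaluate` reports a loss and an accuracy. The NumPy sketch below shows roughly what these two numbers measure for a binary label; the helper functions, the 0.5 threshold and the toy probabilities are illustrative assumptions and only approximate what keras computes internally.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels under y_prob."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    predictions = (y_prob > threshold).astype(int)
    return float(np.mean(predictions == y_true))

# Made-up labels and predicted probabilities for four samples.
y_true = np.array([0, 1, 1, 0])
y_prob = np.array([0.1, 0.8, 0.4, 0.3])

loss = binary_crossentropy(y_true, y_prob)
acc = accuracy(y_true, y_prob)
print(loss, acc)
```
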

# 7. Results

# 8. Conclusion