# Case 1
#### Otto Åström and Vili Niemi 
#### 3.2.2019
#### Helsinki Metropolia University of Applied Sciences

The purpose of this document is to study making a neural network that predicts the likelyhood of heart diseases based on data taken from the Cleveland Clinic Foundation.  

First we'll import the necessary tools.

In [12]:
#Necessary imports
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense

Next we'll import the processed data from Cleveland using the read_csv function of pandas. The data, however, is still unusable. We add column names, and change the different types of heart disease into just 1 type, in order to predict if a patient is sick or not instead of predicting what illness they might have. It also has six unknown values that are marked with the value '?'. These need to be removed, and the best way to do that is to simply remove the rows they occupy with the drop function.

In [13]:
#Importing the data
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
cl_data = pd.read_csv(url)

#Adding column names
cl_data.columns=['age','sex','cp','trestbps','chol','fbs','restecg','thalach','exang','oldpeak','slope','ca','thal','num']

#Replacing different types of illness with just one type of ill.
cl_data.num.replace([2, 3, 4], [1, 1, 1], inplace=True)

#Cleaning up the data of filthy '?' marks
cl_data.drop(301, inplace=True)
cl_data.drop(286, inplace=True)
cl_data.drop(265, inplace=True)
cl_data.drop(191, inplace=True)
cl_data.drop(165, inplace=True)
cl_data.drop(86, inplace=True)

After we clean up the data it's time to split it into a training set and a testing set. First we'll split the data into y half, that only contains the value of wether or not someone is sick "num", and the X half that contains all the other values but the "num". Then we split those two yet again, and create a test and train set for both of them using the train_test_split function. In this function we can control the size of the split using the parameter test_size. Here we have set it to split the data 33%-66%.

In [14]:
#Creating test and train splits
X = cl_data.drop('num', axis=1) 
y = cl_data['num']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

Then finally we'll actually start building the neural network. We start by creating a Sequential model, and add to it three Dense type layers.

First the input layer where we have chosen to use 12 hidden units, and we are using relu as our activation argument, which means the layer will output an array the size of ('*', 12) and we also inform the model how many values we'll be feeding it using the input_shape parameter.

Then we'll add one hidden layer, that like the input layer, uses relu, and outputs an array the size of ('*', 8).

Finally the output layer differs from the first two, as it's set to output a single probability of how likely a sample is to have a num value of "1" (how likely a patient is to be sick) with the activation argument of sigmoid.

In [15]:
#Initialize the constructor
model = Sequential()

#Add an input layer 
model.add(Dense(12, activation='relu', input_shape=(13,)))

#Add one hidden layer 
model.add(Dense(8, activation='relu'))

#Add an output layer 
model.add(Dense(1, activation='sigmoid'))

Next we'll actually use the model we just created. First we'll have to train it with the training data we created prior. 

We'll start with compiling the model. We configure the compile function with the adam optimizer and add the binary_crossentropy parameter because we are trying to achieve a binary outcome of wether or not the value num is 0 or 1. We also wish to monitor the accuracy of the model during training so we added the accuracy value to the metrics parameter.

To fit the model we feed the fit function the training data, and the number of epochs (iterations over the data) and the batch size of 1 sample. The verbose parameter is just there to make the loading look cool. 


In [16]:
#Training the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10, batch_size=1, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0xb16a0f0588>

Now that we have trained the model we are ready to make predictions with it and evaluate its accuracy. 

To make a prediction all we need to do is to put the X_test data through the predict function of our model and store that in y_pred. But just doing this doesn't really tell us anything about anything. We'll need to evaluate our model. We can choose to do this in multiple ways, we have chosen to go with the option of just printing the accuracy straight up through the  score value we get from the evaluate function. We could also use the commented out method of printing a confusion matrix.

In [17]:
#Making the prediction
y_pred = model.predict(X_test)

#Confusion matrix
#print(confusion_matrix(y_test, y_pred))

#Model evaluation
score = model.evaluate(X_test, y_test,verbose=1)
print("\n Score: \n Loss/Accuracy \n", score)


 Score: 
 Loss/Accuracy 
 [0.5788441616661695, 0.7142857142857143]


And there we have it. A neural network that makes predictions on heart diseases and evaluates its own accuracy. The accuracy on these settings ranges from 53-82% which is below acceptable limits, however, with more time to play around with them, we believe we could get it to predict with over 70% accuracy on the regular. 