# A Logistic Regression Model to Predict Coronary Heart Disease 

In [0]:
try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass

In [0]:
import tensorflow as tf
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# DNN with overfitting

# Loading and Preprocessing
1. I read the data with pandas from the source URL.

2. I then encode the categorical 'famhist' column of the dataframe numerically, assigning 1 to present and 0 to absent.

3. I split the dataframe into a training set and a testing set (80% training, 20% testing) using the train_test_split function from the sklearn.model_selection library.

4. I then extract the feature list and the labels from the dataframe to train the model.

5. Finally, I standardize each feature to zero mean and unit variance using StandardScaler from the sklearn.preprocessing library, because inputs on a common scale usually help the network converge faster. The scaler is fit on the training set only and then applied to both sets.
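
As a rough illustration of the scaling step, here is a minimal, dependency-free sketch of what `StandardScaler` computes: a per-feature mean and population standard deviation learned from the training data and reused on the test data. The helper names and the toy values are hypothetical and are not part of the notebook or of heart.csv.

```python
import math

def fit_standardizer(column):
    """Return (mean, std) of a training column, using the population std."""
    mean = sum(column) / len(column)
    variance = sum((x - mean) ** 2 for x in column) / len(column)
    return mean, math.sqrt(variance)

def standardize(column, mean, std):
    """Apply z = (x - mean) / std; a zero std is treated as 1 to avoid dividing by zero."""
    scale = std if std > 0 else 1.0
    return [(x - mean) / scale for x in column]

# Toy values for illustration only (not taken from heart.csv)
train_ages = [20.0, 40.0, 60.0]
test_ages = [50.0]

# Statistics come from the training column only, then are applied to both
mean, std = fit_standardizer(train_ages)
train_scaled = standardize(train_ages, mean, std)
test_scaled = standardize(test_ages, mean, std)
```

The scaled training column has mean 0 and variance 1, while the test value is simply expressed in training-set standard deviations, so the result is not confined to the range 0 to 1.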

# Model

1. I used a Sequential DNN model for this problem because it is the standard approach and most neural networks for this dataset use it.

2. The model consists of two hidden dense layers with 128 and 64 neurons, followed by an output layer with 1 neuron.

3. I used 'relu' as the activation function for the hidden layers because it is computationally cheap, tends to converge faster since it is much less prone to the vanishing gradient problem, and outputs zero for negative inputs, so the activations are sparse.
The output layer uses the sigmoid function as its activation because it is the standard choice for a logistic regression output layer and yields a probability between 0 and 1.
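
The two activation functions mentioned above can be sketched in plain Python. This is only an illustration of the math, not how Keras implements them; the function names and inputs are chosen for the example.

```python
import math

def relu(x):
    """ReLU passes positive inputs through and outputs 0 otherwise."""
    return max(0.0, x)

def sigmoid(x):
    """Logistic sigmoid, written to stay numerically stable for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

# Negative and zero inputs are silenced, positive inputs pass through unchanged
hidden = [relu(v) for v in [-2.0, 0.0, 3.5]]

# A pre-activation of 0 maps to a probability of exactly 0.5
probability = sigmoid(0.0)
```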

# Model Compilation and Training 

1. I used 'Adam' as the optimizer because it gave the best results in my experiments compared with 'sgd' and other optimizers.

2. I used binary_crossentropy as the loss function because it is well suited to binary classification networks and gave the best results in my experiments.

3. I used 'accuracy' as the metric because it is standard practice for binary classifiers like this.

4. I trained the model for 150 epochs because training plateaus by then.
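
To make the loss function concrete, here is a small sketch of binary cross-entropy computed by hand. The clipping constant `eps` and the example probabilities are assumptions for illustration, and the function is a simplified stand-in for what the framework computes, not its actual code.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Made-up labels and predicted probabilities
labels = [1, 0, 1]
confident = binary_crossentropy(labels, [0.9, 0.1, 0.8])
uncertain = binary_crossentropy(labels, [0.5, 0.5, 0.5])
```

Confident, correct predictions give a lower loss than predicting 0.5 everywhere, which costs log(2) per sample.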

# Testing And Evaluation 

1. After training, I evaluate the model on the held-out test set and measure its accuracy. The model correctly predicts between 56% and 62% of the test data.

2. During training, I observed that the training accuracy reached 100%, which, next to the much lower test accuracy, clearly shows that the model is overfitting.
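
The comparison between training and test accuracy can be sketched as follows. The labels, probabilities, and the 0.5 threshold are made up for the example; they are not results from this notebook.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded probability matches the label."""
    predictions = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(pred == y) for pred, y in zip(predictions, y_true))
    return correct / len(y_true)

# Made-up labels and probabilities for illustration only
train_acc = accuracy([1, 0, 1, 0], [0.99, 0.01, 0.98, 0.02])
test_acc = accuracy([1, 0, 1, 0], [0.90, 0.80, 0.30, 0.20])

# A large positive gap suggests the model memorized the training data
generalization_gap = train_acc - test_acc
```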

In [155]:
# loading the dataset
heart_df = pd.read_csv("http://pages.cpsc.ucalgary.ca/~hudsonj/CPSC501F19/heart.csv")

# data cleaning
heart_df = heart_df.drop(columns=['row.names'])  # drop the row index column

# encode the categorical data
heart_df['famhist'] = pd.Categorical(heart_df['famhist'])
heart_df['famhist'] = heart_df.famhist.cat.codes

# split the dataset into training (80%) and testing (20%) sets
train, test = train_test_split(heart_df, test_size=0.2)

# split the train and test data into features and labels
label_train = train.pop('chd').values.astype(int)  # convert the pd Series to an np array
label_test = test.pop('chd').values.astype(int)  # convert the pd Series to an np array
feature_train = train
feature_test = test

# standardize the features using statistics from the training set only
scaler = StandardScaler().fit(feature_train)
feature_train = scaler.transform(feature_train)
feature_test = scaler.transform(feature_test)

# simple DNN that overfits
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
  ])

# compile the model
model.compile(optimizer='Adam',
                loss='binary_crossentropy',
                metrics=['accuracy'])
# train
model.fit(feature_train, label_train, epochs=150, verbose=1)

# evaluate on the held-out test set
print("--Evaluate model--")
model_loss, model_acc = model.evaluate(feature_test, label_test, verbose=2)
print(f"Model Loss:     {model_loss:.2f}")
print(f"Model Accuracy: {model_acc*100:.1f}%")


Train on 369 samples
Epoch 1/150
...
Epoch 76/150


# DNN with minimized overfitting

# Loading and Preprocessing
1. This part is the same as for the DNN with overfitting: the 'row.names' column is dropped because it is only a row index and adds no meaning to the dataset.

# Model

1. I used a Sequential DNN model for this problem because it is the standard approach and most neural networks for this dataset use it.

2. The model consists of two hidden dense layers with 256 neurons each, followed by an output layer with 1 neuron.

3. I used 'relu' as the activation function for the hidden layers because it is computationally cheap, tends to converge faster since it is much less prone to the vanishing gradient problem, and outputs zero for negative inputs, so the activations are sparse.
The output layer uses the sigmoid function as its activation because it is the standard choice for a logistic regression output layer and yields a probability between 0 and 1.

4. At each layer I add weight regularization because it constrains the complexity of the model, and simpler models tend to generalize better. I use L2 regularization because it penalizes large weights without making them sparse, since the penalty goes to zero for small weights.

5. I also add Dropout after each hidden layer (rates 0.5 and 0.4) so that individual nodes cannot rely on the outputs of the others, and each node must learn features that are useful on their own. Dropout does this by randomly dropping the outputs of some nodes during training, according to the dropout rate.
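
The two regularization techniques described above can be sketched in plain Python. This is a simplified illustration under stated assumptions: the weights are made up, the dropout shown is the common "inverted" variant that rescales surviving units, and the helper names are hypothetical.

```python
import random

def l2_penalty(weights, factor=0.001):
    """L2 regularization term added to the loss: factor * sum of squared weights."""
    return factor * sum(w * w for w in weights)

def dropout(activations, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest by 1/(1-rate)."""
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Made-up weights: the penalty is 0.001 * (0.25 + 1.0 + 4.0)
penalty = l2_penalty([0.5, -1.0, 2.0])

# Each of the ten unit activations is either dropped (0.0) or scaled up (2.0)
dropped = dropout([1.0] * 10, rate=0.5, rng=rng)
```

Because the penalty grows with the square of each weight, large weights are discouraged far more strongly than small ones. Dropout is applied only during training; at evaluation time all units are kept.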

# Model Compilation and Training 

1. I used 'Adam' as the optimizer because it gave the best results in my experiments compared with 'sgd' and other optimizers.

2. I used binary_crossentropy as the loss function because it is well suited to binary classification networks and gave the best results in my experiments.

3. I used 'accuracy' as the metric because it is standard practice for binary classifiers like this.

4. I trained the model for 150 epochs because training plateaus by then.
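
One way to check where training plateaus, rather than judging it by eye, is a patience-based rule of the kind early-stopping callbacks use. The sketch below is a hypothetical helper with assumed `patience` and `min_delta` values and a made-up loss history; it is not what the notebook ran.

```python
def plateau_epoch(losses, patience=3, min_delta=1e-3):
    """Return the 1-based epoch at which the loss has failed to improve by
    more than min_delta for `patience` consecutive epochs, else None."""
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(losses, start=1):
        if best - loss > min_delta:
            best = loss   # meaningful improvement: reset the counter
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None

# Made-up per-epoch losses that flatten out after the third epoch
history = [0.70, 0.60, 0.55, 0.5499, 0.5498, 0.5497, 0.5496]
stop_at = plateau_epoch(history)
```

With this history the rule fires at epoch 6, three epochs after the last meaningful improvement.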

# Testing And Evaluation 

1. After training, I evaluate the model on the held-out test set and measure its accuracy. The model correctly predicts between 65% and 77% of the test data.

2. The training accuracy is now about 82%, much closer to the test accuracy, which shows that overfitting has been reduced.

3. Also, the loss of this model is much lower than the loss of the previous model.

In [162]:
# loading the dataset
heart_df = pd.read_csv("http://pages.cpsc.ucalgary.ca/~hudsonj/CPSC501F19/heart.csv")

# data cleaning
heart_df = heart_df.drop(columns=['row.names'])  # drop the row index column

# encode the categorical data
heart_df['famhist'] = pd.Categorical(heart_df['famhist'])
heart_df['famhist'] = heart_df.famhist.cat.codes

# split the dataset into training (80%) and testing (20%) sets
train, test = train_test_split(heart_df, test_size=0.2)

# split the train and test data into features and labels
label_train = train.pop('chd').values.astype(int)  # convert the pd Series to an np array
label_test = test.pop('chd').values.astype(int)  # convert the pd Series to an np array
feature_train = train
feature_test = test

# standardize the features using statistics from the training set only
scaler = StandardScaler().fit(feature_train)
feature_train = scaler.transform(feature_train)
feature_test = scaler.transform(feature_test)

# DNN to deal with overfitting
model = tf.keras.Sequential([
    tf.keras.layers.Dense(256, kernel_regularizer=tf.keras.regularizers.l2(0.001),activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(256, kernel_regularizer=tf.keras.regularizers.l2(0.001),activation='relu'),
    tf.keras.layers.Dropout(0.4),
    tf.keras.layers.Dense(1, kernel_regularizer=tf.keras.regularizers.l2(0.001),activation='sigmoid')
  ])

# compile the model
model.compile(optimizer='Adam', loss='binary_crossentropy', metrics=['accuracy'])
# train
model.fit(feature_train, label_train, epochs=150, verbose=1)

# evaluate on the held-out test set
print("--Evaluate model--")
model_loss, model_acc = model.evaluate(feature_test, label_test, verbose=2)
print(f"Model Loss:     {model_loss:.2f}")
print(f"Model Accuracy: {model_acc*100:.1f}%")





Train on 369 samples
Epoch 1/150
...
Epoch 76/150
