# CS 5324 Lab 5: Wide and Deep Networks

For this assignment, we will explore the [Heart Failure Prediction](https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction?resource=download) dataset, which combines several previously separate heart disease datasets into one. Each observation records a patient's health traits related to the likelihood of heart failure.

This dataset was sourced from [Kaggle](https://www.kaggle.com/datasets) and consists of 918 observations.

## Team

The team consists of three members:
1. Melodie Zhu
2. Samina Faheem
3. Giancarlos Dominguez

## Dataset Preparation

Let us import our dataset.

In [1]:
# import libraries
import os
import pandas as pd
import numpy as np
import copy

In [2]:
# get dataset from csv file
# build the path with os.path.join so it works on any operating system
data_directory = os.path.join(os.getcwd(), 'data', 'heart.csv')
df = pd.read_csv(data_directory)

In [3]:
df.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0


In [4]:
print("Shape of the dataset", df.shape)

print(f"\nNumber of observations in the dataset: {df.shape[0]}")
print(f"Number of features in the dataset: {df.shape[1]}")

Shape of the dataset (918, 12)

Number of observations in the dataset: 918
Number of features in the dataset: 12


Next, we will check for any duplicate observations.

In [5]:
# check for duplicates
print(f"\nNumber of duplicate rows: {df.duplicated().sum()}")


Number of duplicate rows: 0


Luckily, we don't have to worry about duplicate rows. Now, let us check for rows with missing values.

In [6]:
# check for missing values
print(f"\nNumber of missing values: {df.isnull().sum().sum()}")


Number of missing values: 0


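In [10]:
# df_imputed is the deep copy of df created in cell In [8]
# treat any ' ?' placeholder strings as missing, then drop rows with missing values
df_imputed.replace(to_replace=' ?', value=np.nan, inplace=True)
df_imputed.dropna(inplace=True)
# reset_index() returns a new frame (displayed below); df_imputed itself is unchanged
df_imputed.reset_index()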

Unnamed: 0,index,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease
0,0,40,M,ATA,140,289,0,Normal,172,N,0.0,Up,0
1,1,49,F,NAP,160,180,0,Normal,156,N,1.0,Flat,1
2,2,37,M,ATA,130,283,0,ST,98,N,0.0,Up,0
3,3,48,F,ASY,138,214,0,Normal,108,Y,1.5,Flat,1
4,4,54,M,NAP,150,195,0,Normal,122,N,0.0,Up,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
913,913,45,M,TA,110,264,0,Normal,132,N,1.2,Flat,1
914,914,68,M,ASY,144,193,1,Normal,141,N,3.4,Flat,1
915,915,57,M,ASY,130,131,0,Normal,115,Y,1.2,Flat,1
916,916,57,F,ATA,130,236,0,LVH,174,N,0.0,Flat,1


None of our observations have missing values. Therefore, we don't have to worry about holes in our data.
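The duplicate and missing-value checks above report a single total for the whole table. As a minimal sketch, the same pandas calls can report missing values per column, which makes it easier to see where a problem sits. The `toy` frame below is hypothetical and deliberately contains one missing value and one duplicated row; it is not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing Cholesterol value and one duplicated row
toy = pd.DataFrame({
    "Age": [40, 49, 49, 37],
    "Cholesterol": [289.0, 180.0, 180.0, np.nan],
})

# Series with the number of NaN entries in each column
missing_per_column = toy.isnull().sum()

# Number of rows that are exact copies of an earlier row
n_duplicates = int(toy.duplicated().sum())

print(missing_per_column)
print(f"Duplicate rows: {n_duplicates}")
```

On the real data, replacing `toy` with `df` would be expected to show zero for every column, in line with the totals printed above.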

In [7]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB
None


### Defining Class Variables

The dataset holds 12 columns, 11 predictive `features` and the `HeartDisease` target:

1. `Age`: How old the patient is (yrs)

2. `Sex`: The biological sex of the patient
    - M: Male
    - F: Female<br><br>

3. `ChestPainType`: The specific chest pain condition
    - TA: Typical Angina
    - ATA: Atypical Angina
    - NAP: Non-Anginal Pain
    - ASY: Asymptomatic<br><br>

4. `RestingBP`: Resting blood pressure (mmHg)

5. `Cholesterol`: Serum cholesterol (mg/dl)

6. `FastingBS`: Fasting blood sugar
    - 1: if FastingBS > 120 (mg/dl)
    - 0: otherwise<br><br>

7. `RestingECG`: Resting electrocardiogram results 
    - Normal: Normal
    - ST: Having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    - LVH: Showing probable or definite left ventricular hypertrophy by Estes' criteria<br><br>

8. `MaxHR`: The maximum heart rate achieved

9. `ExerciseAngina`: Exercise-induced angina 
    - Y: Yes 
    - N: No<br><br>

10. `Oldpeak`: ST depression induced by exercise relative to rest (numeric value)

11. `ST_Slope`: Slope of the peak exercise ST segment
    - Up: Upsloping
    - Flat: Flat
    - Down: Downsloping <br><br>

12. `HeartDisease`: Output class 
    - 1: Likely to have heart failure
    - 0: Not likely to have heart failure

We will divide our features between categorical and numerical:<br><br>
**Categorical** Features: 
- Sex
- ChestPainType
- FastingBS
- RestingECG
- ExerciseAngina
- ST_Slope
- HeartDisease (target)

**Numeric** Features: 
- Age
- RestingBP 
- Cholesterol
- MaxHR
- Oldpeak
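The split above can also be derived partly from the column dtypes. The sketch below is one possible approach, not the method used in this notebook: string-valued columns are picked up automatically, while integer-coded flags such as `FastingBS` have to be listed by hand because their dtype looks numeric. The `toy` frame and the `binary_flags` list are illustrative assumptions.

```python
import pandas as pd

# Toy rows modeled on the df.head() output above (subset of columns)
toy = pd.DataFrame({
    "Age": [40, 49, 37],
    "Sex": ["M", "F", "M"],
    "ChestPainType": ["ATA", "NAP", "ATA"],
    "FastingBS": [0, 0, 0],
    "Oldpeak": [0.0, 1.0, 0.0],
    "HeartDisease": [0, 1, 0],
})

target = "HeartDisease"

# Non-numeric columns are categorical by construction
string_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Assumption: FastingBS is the only 0/1 flag stored as an integer feature
binary_flags = ["FastingBS"]

categorical = string_cols + binary_flags
numeric = [c for c in toy.columns if c not in categorical + [target]]

print("categorical:", categorical)
print("numeric:", numeric)
```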

Next, we will integer-encode our categorical features with `LabelEncoder` and standardize the continuous numeric features with `StandardScaler`, so that the model receives purely numeric inputs.

In [8]:
df_imputed = copy.deepcopy(df)

In [11]:
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler

# define objects that can encode each variable as integer    
encoders = dict()
categorical_headers = ['Sex','ChestPainType','RestingECG','ExerciseAngina','ST_Slope']

# train all encoders (the target 'HeartDisease' is already numeric and needs no encoding)
for col in categorical_headers:
    df_imputed[col] = df_imputed[col].str.strip()

    # integer encode strings that are features
    encoders[col] = LabelEncoder() # save the encoder
    df_imputed[col+'_int'] = encoders[col].fit_transform(df_imputed[col])

# scale numeric, continuous variables
numeric_headers = ["Age", "RestingBP", "Cholesterol","MaxHR"]

for col in numeric_headers:
    df_imputed[col] = df_imputed[col].astype(float)
    
    ss = StandardScaler()
    df_imputed[col] = ss.fit_transform(df_imputed[col].values.reshape(-1, 1))
    
include_header =["FastingBS","Oldpeak"]
df_imputed.head()

Unnamed: 0,Age,Sex,ChestPainType,RestingBP,Cholesterol,FastingBS,RestingECG,MaxHR,ExerciseAngina,Oldpeak,ST_Slope,HeartDisease,Sex_int,ChestPainType_int,RestingECG_int,ExerciseAngina_int,ST_Slope_int
0,-1.43314,M,ATA,0.410909,0.82507,0,Normal,1.382928,N,0.0,Up,0,1,1,1,0,2
1,-0.478484,F,NAP,1.491752,-0.171961,0,Normal,0.754157,N,1.0,Flat,1,0,2,1,0,1
2,-1.751359,M,ATA,-0.129513,0.770188,0,ST,-1.525138,N,0.0,Up,0,1,1,2,0,2
3,-0.584556,F,ASY,0.302825,0.13904,0,Normal,-1.132156,Y,1.5,Flat,1,0,0,1,1,1
4,0.051881,M,NAP,0.951331,-0.034755,0,Normal,-0.581981,N,0.0,Up,0,1,2,1,0,2
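To make the scaling step concrete, the sketch below checks that `StandardScaler` subtracts the column mean and divides by the population standard deviation. It uses only the first five `Age` values from the `df.head()` output, so the scaled numbers differ from the table above, which was scaled using all 918 rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five Age values from the df.head() output (illustration only)
age = np.array([40, 49, 37, 48, 54], dtype=float).reshape(-1, 1)

scaled = StandardScaler().fit_transform(age)

# Manual z-score; np.std defaults to the population standard deviation (ddof=0),
# which is what StandardScaler uses
manual = (age - age.mean()) / age.std()

print(np.allclose(scaled, manual))
print(scaled.ravel())
```

After scaling, the column has zero mean and unit variance, which keeps features with large raw ranges (such as `Cholesterol`) from dominating the gradient updates.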


In [12]:
# start with minimal preprocessing: integer-encoded categoricals and scaled numerics
categorical_headers_ints = [x+'_int' for x in categorical_headers]

# we will forego one-hot encoding right now and instead just use all inputs as-is 
# this is just to get an example running in Keras (it's not a good idea)
feature_columns = categorical_headers_ints+numeric_headers+include_header

print(feature_columns)

['Sex_int', 'ChestPainType_int', 'RestingECG_int', 'ExerciseAngina_int', 'ST_Slope_int', 'Age', 'RestingBP', 'Cholesterol', 'MaxHR', 'FastingBS', 'Oldpeak']
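The difference between the integer encoding used here and the one-hot encoding that is skipped for now can be sketched on a few `ChestPainType` values. `LabelEncoder` numbers the categories in sorted order, which implies an ordering (ASY < ATA < NAP < TA) that has no clinical meaning; one-hot encoding avoids this by giving each category its own 0/1 column. The short series below is an illustrative sample, not the full column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A few ChestPainType values like those in the table above, plus TA for completeness
cp = pd.Series(["ATA", "NAP", "ATA", "ASY", "TA"], name="ChestPainType")

# Integer encoding: classes are sorted, then numbered from 0
enc = LabelEncoder()
cp_int = enc.fit_transform(cp)
print(list(enc.classes_))   # ['ASY', 'ATA', 'NAP', 'TA']
print(cp_int.tolist())      # [1, 2, 1, 0, 3]

# One-hot encoding: one indicator column per category, no implied order
cp_onehot = pd.get_dummies(cp, prefix="ChestPainType", dtype=int)
print(cp_onehot)
```

The integer codes match the `ChestPainType_int` column shown earlier (ATA is 1, NAP is 2, ASY is 0).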


In [13]:
# import libraries
import tensorflow as tf 
from tensorflow import keras
from tensorflow.keras.layers import Dense, Activation, Input
from tensorflow.keras.models import Model
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit, ShuffleSplit, cross_val_score

In [None]:
# we want to predict the X and y data as follows:
X = df_imputed[feature_columns].to_numpy()
y = df_imputed['HeartDisease'].values # get the labels we want

num_cv_iterations = 10
num_instances = len(y)
cv_object = StratifiedKFold(n_splits=num_cv_iterations)
                         
print(cv_object)

In [None]:
# set up the train/test splits used for evaluation
from sklearn import metrics as mt

# use the cv_object that we set up before to iterate through the
#    different training and testing sets

# the indices are the rows used for training and testing in each iteration
for train_indices, test_indices in cv_object.split(X,y): 
    # create new variables here so that it is more obvious what 
    # the code is doing (you can compact this syntax and avoid duplicating memory,
    # but it makes this code less readable)
    X_train = X[train_indices]
    y_train = y[train_indices]
    
    X_test = X[test_indices]
    y_test = y[test_indices]

# note: after the loop, X_train/X_test hold the split from the last fold only


In [None]:
# get some of the specifics of the dataset
n_samples, n_features = X.shape
n_classes = len(np.unique(y))

print("n_samples: {}".format(n_samples))
print("n_features: {}".format(n_features))
print("n_classes: {}".format(n_classes))

In [None]:
from matplotlib import pyplot as plt

get_ipython().run_line_magic('matplotlib', 'inline')
plt.style.use('ggplot')

print('Number of instances in each class:'+str(np.bincount(y)))
plt.hist(y)
plt.show()

In [None]:
from sklearn.metrics import make_scorer, accuracy_score
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.linear_model import LogisticRegression

# select model
clf = LogisticRegression()
clf.fit(X,y) # not required by cross_val_score, which fits a fresh clone on each split

# select cross validation
cv = StratifiedShuffleSplit(n_splits=10) # stratified, randomized train/test splits

# select evaluation criteria
my_scorer = make_scorer(recall_score)

# run model training and cross validation
per_fold_eval_criteria = cross_val_score(estimator=clf,
                                    X=X,
                                    y=y,
                                    cv=cv,
                                    scoring=my_scorer
                                   )

plt.bar(range(len(per_fold_eval_criteria)),per_fold_eval_criteria)
plt.ylim([min(per_fold_eval_criteria)-0.01,max(per_fold_eval_criteria)])
#print(per_fold_eval_criteria.mean()*100)


There are 508 observations with `HeartDisease` = 1. In the bar chart above, the highest-scoring folds reach a recall of about 86% for this positive class, and the lowest-scoring fold reaches 0.80, i.e. 80% recall.

We do not have severe class imbalance in our dataset. Even so, we stratify every split so that the class proportions in each training and testing set match those of the full dataset. `StratifiedKFold`, a variation of KFold, builds folds that preserve the percentage of samples for each class; `StratifiedShuffleSplit`, used in the cell above, applies the same stratification to randomized train/test splits.

- Reference: [StratifiedShuffleSplit](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html)
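The effect of stratification described above can be checked on synthetic labels. The sketch below builds a label vector with the same class counts as this dataset (508 ones, and therefore 410 zeros out of 918) and confirms that every test split keeps roughly the overall positive rate. The placeholder features, `n_splits=3`, and `random_state=0` are illustrative assumptions; `test_size=0.1` is the scikit-learn default made explicit.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic labels with the same class counts as the dataset
y_demo = np.array([0] * 410 + [1] * 508)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features; only y matters here

sss = StratifiedShuffleSplit(n_splits=3, test_size=0.1, random_state=0)

overall = y_demo.mean()
test_ratios = [y_demo[test_idx].mean() for _, test_idx in sss.split(X_demo, y_demo)]

print(f"overall positive rate: {overall:.3f}")
for i, ratio in enumerate(test_ratios):
    print(f"split {i}: test positive rate = {ratio:.3f}")
```

Without stratification, a small test split could by chance contain noticeably more or fewer positive cases, which would make per-split recall harder to compare.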


In [None]:
import pprint
pp = pprint.PrettyPrinter(indent=4)

print(f"We will use the following {len(feature_columns)} features:")
pp.pprint(feature_columns)


Next, we identify the groups of features that should be combined into cross-product features.

For this dataset, the categorical features are `Sex`, `ChestPainType`, `RestingECG`, `ExerciseAngina`, and `ST_Slope`.

A cross-product of categorical features can be interpreted as a logical conjunction: the crossed feature is active only when all of its component values occur together. Crossed features capture specific combinations of values, which can be more useful for prediction and classification than the independent features alone.
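As a minimal sketch of a cross-product feature, the code below crosses `Sex` with `ExerciseAngina` by joining their string values and then integer-encodes the result. The choice of this particular pair is a hypothetical example and not a selection made in this notebook; the rows mirror the first five rows shown earlier.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Sex and ExerciseAngina values from the first five rows shown earlier
toy = pd.DataFrame({
    "Sex": ["M", "F", "M", "F", "M"],
    "ExerciseAngina": ["N", "N", "N", "Y", "N"],
})

# Each distinct (Sex, ExerciseAngina) pair becomes one category of the crossed feature
crossed = toy["Sex"] + "_" + toy["ExerciseAngina"]

# Integer-encode the crossed categories, as done for the single features above
cross_encoder = LabelEncoder()
crossed_int = cross_encoder.fit_transform(crossed)

print(crossed.tolist())              # ['M_N', 'F_N', 'M_N', 'F_Y', 'M_N']
print(list(cross_encoder.classes_))  # ['F_N', 'F_Y', 'M_N']
print(crossed_int.tolist())          # [2, 0, 2, 1, 2]
```

The value `F_Y` is active only for rows where the patient is female and has exercise-induced angina, which is the logical conjunction described above.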


## Modeling

In [None]:
from tensorflow.keras.layers import Dense, Activation, Input
from tensorflow.keras.models import Model
from tensorflow.keras.utils import plot_model
from tensorflow.keras.metrics import Recall

In [None]:
os.environ['KMP_DUPLICATE_LIB_OK']='True'

print(tf.__version__)
print(keras.__version__)

In [None]:
# First, let's set up the input size (X_train comes from the last fold of the split loop above)
num_features = X_train.shape[1]
input_tensor = Input(shape=(num_features,))

# a layer instance is callable on a tensor, and returns a tensor
x = Dense(units=10, activation='relu')(input_tensor)
x = Dense(units=5, activation='tanh')(x)
predictions = Dense(1, activation='sigmoid')(x)

# This creates a model that includes
# the Input layer and three Dense layers
model = Model(inputs=input_tensor, outputs=predictions)
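As a sanity check on the network above, the number of trainable parameters can be worked out by hand: each Dense layer has one weight per input-unit pair plus one bias per unit. The sketch below assumes 11 input features, matching the length of `feature_columns`; `dense_params` is a helper introduced here for illustration.

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Trainable parameters of a fully connected layer: weights plus one bias per unit."""
    return n_inputs * n_units + n_units

layer_units = [10, 5, 1]   # units of the three Dense layers defined above
n_inputs = 11              # assumption: len(feature_columns)

counts = []
for units in layer_units:
    counts.append(dense_params(n_inputs, units))
    n_inputs = units       # the output of this layer feeds the next one

total = sum(counts)
print(counts)  # [120, 55, 6]
print(total)   # 181
```

If the feature set matches, these per-layer numbers should agree with the parameter counts that `model.summary()` reports.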


In [None]:
model.compile(optimizer='sgd',
              loss='mean_squared_error',
              metrics=[Recall()])

model.summary()
model.fit(X_train, y_train, epochs=12, verbose=1)

In [None]:
import graphviz
from tensorflow.keras.utils import plot_model

In [None]:
plot_model( model, to_file='model.png', show_shapes=True, show_layer_names=True, rankdir='LR')

In [None]:
yhat_proba = model.predict(X_test)
yhat = np.round(yhat_proba)
print(mt.confusion_matrix(y_test,yhat))
print(mt.classification_report(y_test,yhat))
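To show how the recall in the classification report relates to the confusion matrix, the sketch below thresholds predicted probabilities at 0.5 and counts the four outcomes by hand. The labels and probabilities are hypothetical values chosen for illustration, not model output.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_proba = np.array([0.9, 0.4, 0.7, 0.2, 0.6, 0.8, 0.1, 0.3])

# Threshold the sigmoid output at 0.5 to obtain class predictions
y_pred = (y_proba >= 0.5).astype(int)

tp = int(((y_true == 1) & (y_pred == 1)).sum())  # diseased, predicted diseased
fn = int(((y_true == 1) & (y_pred == 0)).sum())  # diseased, missed by the model
fp = int(((y_true == 0) & (y_pred == 1)).sum())  # healthy, flagged as diseased
tn = int(((y_true == 0) & (y_pred == 0)).sum())  # healthy, predicted healthy

# Same layout as sklearn's confusion_matrix: rows are true classes, columns are predictions
cm = np.array([[tn, fp], [fn, tp]])

recall = tp / (tp + fn)      # share of actual positives that were found
precision = tp / (tp + fp)   # share of predicted positives that were correct

print(cm)
print(f"recall = {recall:.2f}, precision = {precision:.2f}")
```

Recall is the natural headline metric here because a false negative means a patient with heart disease goes undetected.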

## Graduate Analysis