#### Problem Statement
The Heart Disease dataset is provided by the Cleveland Clinic Foundation.
https://archive.ics.uci.edu/ml/datasets/heart+Disease

It is a CSV file with 303 rows, each containing information about one patient. We use these features to predict whether a patient has heart disease (binary classification).

##### Import Modules

In [20]:
import tensorflow as tf
import numpy as np
import pandas as pd
from tensorflow import keras
from tensorflow.keras import layers

from tensorflow.keras.layers import IntegerLookup
from tensorflow.keras.layers import Normalization
from tensorflow.keras.layers import StringLookup

In [2]:
print(tf.__version__)

2.8.0


##### Preparing the data

In [3]:
file_url = "http://storage.googleapis.com/download.tensorflow.org/data/heart.csv"
dataframe = pd.read_csv(file_url)

In [4]:
dataframe.shape

(303, 14)

In [5]:
dataframe.head()

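|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 0 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0 | fixed | 0 |
| 1 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3 | normal | 1 |
| 2 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2 | reversible | 0 |
| 3 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0 | normal | 0 |
| 4 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0 | normal | 0 |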


In [6]:
# Splitting the data into training and validation sets

In [7]:
val_dataframe = dataframe.sample(frac=0.2, random_state=1337)
train_dataframe = dataframe.drop(val_dataframe.index)

In [8]:
print(
    "Using %d samples for training and %d for validation"
    % (len(train_dataframe), len(val_dataframe))
)

Using 242 samples for training and 61 for validation
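
The sizes reported above follow from the split fraction. The short check below reproduces the arithmetic. It assumes that the sample size is `frac * len(dataframe)` rounded to the nearest integer, which is an assumption about how pandas sizes a fractional sample rather than something shown in this notebook; the variable names are illustrative.

```python
total_rows = 303
validation_fraction = 0.2

# Assumption: the sample size is the fraction of rows rounded to the nearest integer
val_rows = round(total_rows * validation_fraction)

# The training set is everything that was not sampled for validation
train_rows = total_rows - val_rows
```

Under that assumption, `val_rows` and `train_rows` match the 61 and 242 samples printed by the cell above.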


In [9]:
# Let's generate tf.data.Dataset objects for each dataframe

In [13]:
def dataframe_to_dataset(dataframe):
    dataframe = dataframe.copy()
    labels = dataframe.pop("target")
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    ds = ds.shuffle(buffer_size=len(dataframe))
    return ds

In [14]:
train_ds = dataframe_to_dataset(train_dataframe)
val_ds = dataframe_to_dataset(val_dataframe)

In [15]:
# Note: Each dataset yields a tuple (input, target) where input is a dictionary of features
# and target is the value 0 or 1

In [16]:
for x, y in train_ds.take(1):
    print("Input: ", x)
    print("Target: ", y)

Input:  {'age': <tf.Tensor: shape=(), dtype=int64, numpy=55>, 'sex': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'cp': <tf.Tensor: shape=(), dtype=int64, numpy=2>, 'trestbps': <tf.Tensor: shape=(), dtype=int64, numpy=130>, 'chol': <tf.Tensor: shape=(), dtype=int64, numpy=262>, 'fbs': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'restecg': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'thalach': <tf.Tensor: shape=(), dtype=int64, numpy=155>, 'exang': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'oldpeak': <tf.Tensor: shape=(), dtype=float64, numpy=0.0>, 'slope': <tf.Tensor: shape=(), dtype=int64, numpy=1>, 'ca': <tf.Tensor: shape=(), dtype=int64, numpy=0>, 'thal': <tf.Tensor: shape=(), dtype=string, numpy=b'normal'>}
Target:  tf.Tensor(0, shape=(), dtype=int64)


In [17]:
# Let's batch the datasets

In [18]:
train_ds = train_ds.batch(32)
val_ds = val_ds.batch(32)

In [19]:
train_ds

<BatchDataset element_spec=({'age': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'sex': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'cp': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'trestbps': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'chol': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'fbs': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'restecg': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'thalach': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'exang': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'oldpeak': TensorSpec(shape=(None,), dtype=tf.float64, name=None), 'slope': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'ca': TensorSpec(shape=(None,), dtype=tf.int64, name=None), 'thal': TensorSpec(shape=(None,), dtype=tf.string, name=None)}, TensorSpec(shape=(None,), dtype=tf.int64, name=None))>
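
The `None` in each shape is the batch dimension. It is left unspecified because `batch(32)` keeps the final, smaller batch by default, so not every batch holds exactly 32 elements. A small arithmetic sketch, using the 242 training samples reported earlier (the variable names are illustrative):

```python
import math

train_rows = 242   # training samples reported earlier
batch_size = 32

# Number of batches when the incomplete final batch is kept
num_batches = math.ceil(train_rows / batch_size)

# Size of that final batch
last_batch_size = train_rows - (num_batches - 1) * batch_size
```

Here the training set is split into 8 batches, the last of which holds 18 samples instead of 32.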

##### Feature pre-processing with Keras layers
The following features are categorical features encoded as integers:
* sex
* cp
* fbs
* restecg
* exang
* ca

We will encode these features using one-hot encoding. We have two options here:
1. Use CategoryEncoding(), which requires knowing the range of input values and raises an error on inputs outside that range
2. Use IntegerLookup(), which builds a lookup table for the inputs and reserves an output index for unknown input values

Here we want a solution which will handle out of range inputs at inference, so we will use IntegerLookup()
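
To make the difference concrete, here is a minimal plain-Python sketch of the idea behind a lookup table with a reserved slot for unknown values. It is not the Keras implementation: the helpers `build_vocabulary` and `one_hot_with_oov` are hypothetical, the training values are made up, and ordering the vocabulary by sorted value is an assumption made for readability.

```python
def build_vocabulary(values):
    """Map each distinct value to an index, keeping index 0 for unknown inputs."""
    return {value: i + 1 for i, value in enumerate(sorted(set(values)))}


def one_hot_with_oov(value, vocabulary):
    """One-hot encode a value; anything unseen falls into slot 0."""
    encoded = [0] * (len(vocabulary) + 1)
    encoded[vocabulary.get(value, 0)] = 1
    return encoded


# Hypothetical training values for a feature such as "cp"
cp_vocabulary = build_vocabulary([1, 4, 4, 3, 2])

seen = one_hot_with_oov(3, cp_vocabulary)    # value present at training time
unseen = one_hot_with_oov(7, cp_vocabulary)  # value never seen before
```

An encoder that only accepts a fixed range would have to reject the value 7, whereas the lookup approach maps it to the reserved slot and still produces a valid vector.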

We also have a categorical feature encoded as a string: "thal". We will build an index of all its possible values and encode it using the StringLookup() layer.

Finally, the following features are continuous numerical features:
* age
* trestbps
* chol
* thalach
* oldpeak
* slope

For each of these features, we will use a Normalization() layer to make sure the mean of each feature is 0 and its standard deviation is 1.
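
As a rough illustration of what this normalization does, the plain-Python sketch below learns a mean and standard deviation from some values and then rescales them. It is a sketch of the idea, not the Keras layer: the helpers `fit_normalizer` and `normalize` are hypothetical, and the use of the population standard deviation is an assumption. The ages are the first five `age` values from the table shown earlier.

```python
import statistics


def fit_normalizer(values):
    """Return the mean and population standard deviation of the given values."""
    return statistics.fmean(values), statistics.pstdev(values)


def normalize(values, mean, std):
    """Shift and scale values using statistics learned beforehand."""
    return [(v - mean) / std for v in values]


# First five "age" values from the table shown earlier
ages = [63, 67, 67, 37, 41]

# Learn the statistics once, then apply them to the data
mean, std = fit_normalizer(ages)
scaled = normalize(ages, mean, std)
```

After scaling, the values have a mean of 0 and a standard deviation of 1 (up to floating-point error). The statistics are learned once and then reused, which mirrors the adapt-then-apply pattern used with the layers in this notebook.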

In [21]:
# To apply featurewise normalization to numerical features

def encode_numerical_features(feature, name, dataset):
    # create a Normalization layer for our feature
    normalizer = Normalization()
    
    # Prepare a Dataset that only yields our feature
    feature_ds = dataset.map(lambda x, y: x[name])
    feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))
    
    # Learn the statistics of the data -
    normalizer.adapt(feature_ds)
    
    # Normalize the input feature-
    encoded_feature = normalizer(feature)
    return encoded_feature

In [22]:
def encode_categorical_feature(feature, name, dataset, is_string):
    # Pick the lookup layer that matches the feature's type
    lookup_class = StringLookup if is_string else IntegerLookup

    # Create a lookup layer that maps each value to an index;
    # output_mode="binary" returns that index as a 0/1 vector
    lookup = lookup_class(output_mode="binary")

    # Prepare a Dataset that only yields our feature
    feature_ds = dataset.map(lambda x, y: x[name])
    feature_ds = feature_ds.map(lambda x: tf.expand_dims(x, -1))

    # Learn the set of possible values and assign each a fixed integer index
    lookup.adapt(feature_ds)

    # Encode the input feature as a 0/1 vector
    encoded_feature = lookup(feature)

    return encoded_feature

##### Build a Model
Let's build our end-to-end model

In [24]:
# Categorical features encoded as integers

In [None]:
sex = keras.Input(shape = (1, ), name = "sex", dtype = "int64")
cp = keras.Input(shape = (1, ), name = "cp", dtype = "int64")
fbs = keras.Input(shape = (1, ), name = "fbs", dtype = "int64")
restecg = keras.Input(shape = (1, ), name = "restecg", dtype = "int64")
exang = keras.Input(shape = (1, ), name = "exang", dtype = "int64")
ca = keras.Input(shape = (1, ), name = "ca", dtype = "int64")