# TensorFlow Tutorials
# Load and Preprocess Data 03 - `pandas`

Load a `pandas` DataFrame into a `tf.data.Dataset`. This notebook uses a small dataset provided by the Cleveland Clinic Foundation. The dataset is a CSV: each row describes a patient and each column describes an attribute.

The loaded data is then used for binary classification: predicting whether a patient has heart disease.

## Preparing Workspace

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

import pandas as pd
import tensorflow as tf

# Need to enable eager execution for reading from iterator later on
tf.enable_eager_execution()

In [0]:
# `get_file` downloads the CSV (if it is not already cached) and returns the local file path
csv_file = tf.keras.utils.get_file('heart.csv', 
                         'https://storage.googleapis.com/applied-dl/heart.csv')

## Data Exploration

In [0]:
# Read the dataset into a `pandas` dataframe using the `read_csv` function
df = pd.read_csv(csv_file)

In [4]:
# Examine the first five rows of the dataset
df.head()

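| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0 | fixed | 0 |
| 1 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3 | normal | 1 |
| 2 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2 | reversible | 0 |
| 3 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0 | normal | 0 |
| 4 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0 | normal | 0 |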


In [5]:
# Display the data types of each column (feature) in the dataset
df.dtypes

age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal         object
target        int64
dtype: object

In [0]:
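# Convert the `thal` column from strings to a pandas categorical dtype
# (this is not one-hot encoding; the integer codes are extracted in the next cell)
df['thal'] = pd.Categorical(df['thal'])  # reassigns the column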

In [0]:
# Change the categorical data to numerical codes
df['thal'] = df.thal.cat.codes

In [8]:
# Check the dataset again: each category of `thal` is now an integer code
df.head()

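| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0 | 2 | 0 |
| 1 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3 | 3 | 1 |
| 2 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2 | 4 | 0 |
| 3 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0 | 3 | 0 |
| 4 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0 | 3 | 0 |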
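
To make the difference between integer category codes and one-hot encoding concrete, here is a minimal sketch on a toy `thal`-like column. The sample values are made up for illustration and are not read from the dataset. `pd.Categorical` sorts the distinct values and `codes` gives each value's position in that sorted list; `pd.get_dummies` is shown only for contrast and is not used in this notebook.

```python
import pandas as pd

# Toy column standing in for `thal` (illustrative values only)
toy = pd.Series(['fixed', 'normal', 'reversible', 'normal'])

# Categorical dtype: categories are the sorted unique values
cat = pd.Categorical(toy)
categories = list(cat.categories)   # ['fixed', 'normal', 'reversible']
codes = cat.codes.tolist()          # [0, 1, 2, 1]

# One-hot encoding, for contrast: one indicator column per category
one_hot = pd.get_dummies(toy)

print(categories, codes, list(one_hot.columns))
```

In the real column, `fixed`, `normal` and `reversible` map to 2, 3 and 4 rather than 0, 1 and 2, which implies that the column contains two other distinct values that sort before `fixed`.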


## Using `tf.data.Dataset`

We can load the dataframe into a `tf.data.Dataset` to build a more efficient input pipeline.

In [0]:
# Extract the column with labels/target - what we're trying to predict
target = df.pop('target')

In [0]:
# Again, use `from_tensor_slices` to read in the features and labels as a tuple of NumPy arrays
# `.values` on a DataFrame or Series returns a NumPy array
dataset = tf.data.Dataset.from_tensor_slices((df.values, target.values))

In [11]:
for feat, targ in dataset.take(5):
  print('Features: {}, Target: {}'.format(feat, targ))

Features: [ 63.    1.    1.  145.  233.    1.    2.  150.    0.    2.3   3.    0.
   2. ], Target: 0
Features: [ 67.    1.    4.  160.  286.    0.    2.  108.    1.    1.5   2.    3.
   3. ], Target: 1
Features: [ 67.    1.    4.  120.  229.    0.    2.  129.    1.    2.6   2.    2.
   4. ], Target: 0
Features: [ 37.    1.    3.  130.  250.    0.    0.  187.    0.    3.5   3.    0.
   3. ], Target: 0
Features: [ 41.    0.    2.  130.  204.    0.    2.  172.    0.    1.4   1.    0.
   3. ], Target: 0
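
The features above print as floats even though most columns are `int64`. A small sketch of the likely reason, using a hypothetical two-column frame rather than the real data: `.values` returns one NumPy array for the whole frame, so mixed integer and float columns are upcast to a common dtype. The pairing loop at the end imitates, in plain Python, how slicing along the first axis matches row `i` of the features with label `i`.

```python
import pandas as pd

# Hypothetical mini-frame: one integer column and one float column
mini = pd.DataFrame({'age': [63, 67], 'oldpeak': [2.3, 1.5]})

arr = mini.values                 # a single 2-D NumPy array for the whole frame
print(arr.dtype, arr.shape)       # float64 (2, 2)

# Plain-Python imitation of slicing (features, labels) along the first axis
labels = [0, 1]
pairs = [(row.tolist(), lab) for row, lab in zip(arr, labels)]
print(pairs)
```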


Why did enabling eager execution bypass the infinite loop of the iterator? 

In [12]:
# A pd.Series (a single column) can be used nearly anywhere a np.array or tf.Tensor is expected,
# because it implements the __array__ protocol
tf.constant(df['thal'])

<tf.Tensor: id=31, shape=(303,), dtype=int32, numpy=
array([2, 3, 4, 3, 3, 3, 3, 3, 4, 4, 2, 3, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3,
       3, 4, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 4, 2, 4, 3, 4, 3, 4, 4,
       2, 3, 3, 4, 3, 3, 4, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 4,
       4, 2, 3, 3, 4, 3, 4, 3, 3, 4, 4, 3, 3, 4, 4, 3, 3, 3, 3, 4, 4, 4,
       3, 3, 4, 3, 4, 4, 3, 4, 3, 3, 3, 4, 3, 4, 4, 3, 3, 4, 4, 4, 4, 4,
       3, 3, 3, 3, 4, 3, 4, 3, 4, 4, 3, 3, 2, 4, 4, 2, 3, 3, 4, 4, 3, 4,
       3, 3, 4, 2, 4, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4,
       4, 3, 3, 3, 4, 3, 4, 3, 4, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3,
       3, 3, 3, 3, 3, 3, 3, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 4, 3, 2,
       4, 4, 3, 3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 2, 2, 4, 3, 4, 2, 4, 3,
       3, 4, 3, 3, 3, 3, 4, 3, 4, 3, 4, 2, 2, 4, 3, 4, 3, 2, 4, 3, 3, 2,
       4, 4, 4, 4, 3, 0, 3, 3, 3, 3, 1, 4, 3, 3, 3, 4, 3, 4, 3, 3, 3, 4,
       3, 3, 4, 4, 4, 4, 3, 3, 4, 3, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 3, 3
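
The `__array__` protocol can be checked without TensorFlow. The sketch below uses a short made-up Series and only NumPy to show that the column converts to an array directly.

```python
import numpy as np
import pandas as pd

col = pd.Series([2, 3, 4, 3], name='thal')   # illustrative values

# A Series exposes __array__, so array-consuming functions can read it directly
has_protocol = hasattr(col, '__array__')
as_array = np.asarray(col)

print(has_protocol, type(as_array).__name__, as_array.tolist())
```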

In [0]:
# Shuffle the dataset and divide it into batches
train_dataset = dataset.shuffle(len(df)).batch(1)
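
As a rough illustration of what a shuffle buffer and batching do, here is a simplified pure-Python imitation. It is a sketch of the general buffer-based idea, not TensorFlow's implementation, and the helper names `buffered_shuffle` and `batch` are hypothetical. When the buffer is as large as the dataset (as with `len(df)` above), every element is buffered before any is emitted, so the result is a full shuffle.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)          # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]                # emit a random buffered element
        buffer[idx] = item               # replace it with the incoming one
    rng.shuffle(buffer)                  # drain what is left in random order
    yield from buffer

def batch(items, batch_size):
    """Group consecutive items into lists of at most batch_size."""
    chunk = []
    for item in items:
        chunk.append(item)
        if len(chunk) == batch_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk                      # final partial batch is kept

rows = list(range(10))
shuffled = list(buffered_shuffle(rows, buffer_size=len(rows)))
batches = list(batch(shuffled, batch_size=4))
print(batches)
```

With a buffer smaller than the dataset, the first emitted element can only come from the first `buffer_size` items, which is why a small buffer gives a weaker shuffle.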

## Create and Train a Model

In [0]:
def get_compiled_model():
  # Instantiate
  model = tf.keras.Sequential([
      tf.keras.layers.Dense(10, activation='relu'),
      tf.keras.layers.Dense(10, activation='relu'),
      tf.keras.layers.Dense(1, activation='sigmoid')
  ])
  
  # Compile
  model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
  
  # Return
  return model

In [16]:
# Reference to instantiated model
model = get_compiled_model()

# Train
model.fit(train_dataset, epochs=15)

Epoch 1/15
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x7f2231ac05f8>
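
For intuition about the `sigmoid` output and the `binary_crossentropy` loss used above, the following sketch computes both by hand for a single example. It is a plain-Python illustration of the standard formulas, not the Keras code path; the clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import math

def sigmoid(z):
    # Squash a real-valued logit into a probability in (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, p, eps=1e-7):
    # Clip the probability so the logarithm stays finite (eps is an assumption)
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

p = sigmoid(0.0)                        # 0.5: the model is undecided
loss_pos = binary_crossentropy(1, p)    # ln(2), about 0.693

confident_right = binary_crossentropy(1, sigmoid(4.0))
confident_wrong = binary_crossentropy(1, sigmoid(-4.0))
print(p, loss_pos, confident_right, confident_wrong)
```

A confident correct prediction gives a loss near zero, while a confident wrong one is penalised heavily.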

## Alternative to Feature Columns
Instead of using feature columns, we can pass the columns of a `pandas` dataframe to the neural network as a dictionary keyed by column name, with a matching dictionary of Keras `Input` layers.

In [17]:
# df.keys() returns an Index of the column names
df.keys()

Index(['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach',
       'exang', 'oldpeak', 'slope', 'ca', 'thal'],
      dtype='object')

In [0]:
# Create a dictionary of inputs: each key is a column name and each value is a
# Keras `Input` with shape=(), i.e. one scalar per example (the batch dimension is left unspecified)
inputs = {key: tf.keras.layers.Input(shape=(), name=key) for key in df.keys()}

In [21]:
# Check
inputs

{'age': <tf.Tensor 'age_1:0' shape=(?,) dtype=float32>,
 'ca': <tf.Tensor 'ca_1:0' shape=(?,) dtype=float32>,
 'chol': <tf.Tensor 'chol_1:0' shape=(?,) dtype=float32>,
 'cp': <tf.Tensor 'cp_1:0' shape=(?,) dtype=float32>,
 'exang': <tf.Tensor 'exang_1:0' shape=(?,) dtype=float32>,
 'fbs': <tf.Tensor 'fbs_1:0' shape=(?,) dtype=float32>,
 'oldpeak': <tf.Tensor 'oldpeak_1:0' shape=(?,) dtype=float32>,
 'restecg': <tf.Tensor 'restecg_1:0' shape=(?,) dtype=float32>,
 'sex': <tf.Tensor 'sex_1:0' shape=(?,) dtype=float32>,
 'slope': <tf.Tensor 'slope_1:0' shape=(?,) dtype=float32>,
 'thal': <tf.Tensor 'thal_1:0' shape=(?,) dtype=float32>,
 'thalach': <tf.Tensor 'thalach_1:0' shape=(?,) dtype=float32>,
 'trestbps': <tf.Tensor 'trestbps_1:0' shape=(?,) dtype=float32>}

In [0]:
# Stack the per-column inputs along the last axis into a single (batch, n_columns) tensor
x = tf.stack(list(inputs.values()), axis=-1)
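
The effect of stacking along `axis=-1` is easiest to see on concrete arrays. The sketch below uses `np.stack`, the NumPy counterpart of `tf.stack`, on three per-column batches whose values are copied from the first four rows shown earlier. Only three columns are used to keep the example short.

```python
import numpy as np

# Three per-column batches, each of shape (batch,) = (4,)
age = np.array([63., 67., 67., 37.])
sex = np.array([1., 1., 1., 1.])
cp = np.array([1., 4., 4., 3.])

# Stacking along the last axis gives one row per example, one column per feature
stacked = np.stack([age, sex, cp], axis=-1)
print(stacked.shape)   # (4, 3)
print(stacked[0])      # first example: age, sex, cp
```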

In [0]:
# In the functional API, we make layers and pass individual tensors to them
x = tf.keras.layers.Dense(10, activation='relu')(x)

In [0]:
# Output is the result of passing the activations of the previous layer to a new layer
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)

In [0]:
# Build the model from its inputs and outputs (functional API)
model_func = tf.keras.Model(inputs=inputs, outputs=outputs)

In [0]:
# Compile: set the optimizer, loss, and metrics
model_func.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

In [0]:
# Preserve the column structure of the DataFrame: convert df to a dict of lists, then slice it
dict_slices = tf.data.Dataset.from_tensor_slices((df.to_dict('list'), target.values)).batch(16)

In [33]:
for dict_slice in dict_slices.take(1):
  print(dict_slice)

({'age': <tf.Tensor: id=51247, shape=(16,), dtype=int32, numpy=
array([63, 67, 67, 37, 41, 56, 62, 57, 63, 53, 57, 56, 56, 44, 52, 57],
      dtype=int32)>, 'sex': <tf.Tensor: id=51255, shape=(16,), dtype=int32, numpy=array([1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1], dtype=int32)>, 'cp': <tf.Tensor: id=51250, shape=(16,), dtype=int32, numpy=array([1, 4, 4, 3, 2, 2, 4, 4, 4, 4, 4, 2, 3, 2, 3, 3], dtype=int32)>, 'trestbps': <tf.Tensor: id=51259, shape=(16,), dtype=int32, numpy=
array([145, 160, 120, 130, 130, 120, 140, 120, 130, 140, 140, 140, 130,
       120, 172, 150], dtype=int32)>, 'chol': <tf.Tensor: id=51249, shape=(16,), dtype=int32, numpy=
array([233, 286, 229, 250, 204, 236, 268, 354, 254, 203, 192, 294, 256,
       263, 199, 168], dtype=int32)>, 'fbs': <tf.Tensor: id=51252, shape=(16,), dtype=int32, numpy=array([1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0], dtype=int32)>, 'restecg': <tf.Tensor: id=51254, shape=(16,), dtype=int32, numpy=array([2, 2, 2, 0, 2, 0, 2, 0, 2,

Converting the dataframe to a dictionary and slicing it with `from_tensor_slices` preserves the column structure: each element of the dataset is a dictionary keyed by column name. Those keys match the names of the `Input` layers created above, which is how Keras routes each column to the right input.
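
A small sketch of the dictionary form, using a hypothetical two-column frame with values copied from the first rows of the dataset. The list comprehension imitates slicing the dictionary along the first axis in plain Python; it is not the `tf.data` implementation.

```python
import pandas as pd

mini = pd.DataFrame({'age': [63, 67, 67], 'sex': [1, 1, 1]})

# One list per column, keyed by column name
as_dict = mini.to_dict('list')
print(as_dict)

# Imitate slicing along the first axis: one dict per example
examples = [{k: v[i] for k, v in as_dict.items()} for i in range(len(mini))]
print(examples[1])
```

Each example keeps its column names, so a value can be matched to a model input by key rather than by position.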

In [34]:
model_func.fit(dict_slices, epochs=15)

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Epoch 8/15
Epoch 9/15
Epoch 10/15
Epoch 11/15
Epoch 12/15
Epoch 13/15
Epoch 14/15
Epoch 15/15


<tensorflow.python.keras.callbacks.History at 0x7f222b488fd0>

Training still works with the dictionary-based dataset.