# TF Pipeline

TFDS also helps us work with non-standard datasets that are not available in the TFDS library. These datasets can be of any type, and we can build a reusable pipeline for them so that we don't have to repeat the same operations by hand each time and can simply deploy the pipeline.

The basic idea is to take a data source and convert it into a `tf.data.Dataset`, which makes the data easy to transform.

Once the data is in a `tf.data.Dataset` object, its values can be stored in many forms, the most useful being int and float. A single dataset can hold several types of data at once, and these are described to the model as `Feature Columns`.

There are two main types of feature columns:
- Categorical column
- Dense Column

In Categorical Column we have:
- Categorical with Identity column
- Categorical with Vocabulary column
- Categorical with Hashed column
- Crossed Column

In Dense Column we have:
- Numeric column
- Indicator column
- Embedding column

We also have a hybrid type called 'Bucketized Column'.

## Feature Column

To create feature columns we can use the `tf.feature_column` module, which contains all the column types mentioned above.

## Example

This example uses a Pandas DataFrame as the source for the pipeline.

### Imports

In [9]:
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split

### Read Raw data into Dataframe

In [2]:
heart_disease = pd.read_csv('dataset/heart.csv')

In [28]:
heart_disease.head(32)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,1,145,233,1,2,150,0,2.3,3,0,fixed,0
1,67,1,4,160,286,0,2,108,1,1.5,2,3,normal,1
2,67,1,4,120,229,0,2,129,1,2.6,2,2,reversible,0
3,37,1,3,130,250,0,0,187,0,3.5,3,0,normal,0
4,41,0,2,130,204,0,2,172,0,1.4,1,0,normal,0
5,56,1,2,120,236,0,0,178,0,0.8,1,0,normal,0
6,62,0,4,140,268,0,2,160,0,3.6,3,2,normal,1
7,57,0,4,120,354,0,0,163,1,0.6,1,0,normal,0
8,63,1,4,130,254,0,2,147,0,1.4,2,1,reversible,1
9,53,1,4,140,203,1,2,155,1,3.1,3,0,reversible,0


### Create Test, Train and Validation splits

In [4]:
train, test = train_test_split(heart_disease, test_size=0.2)
train, validation = train_test_split(train, test_size=0.2)

In [5]:
print('Train size', len(train))
print('Validation size', len(validation))
print('Test size', len(test))

Train size 193
Validation size 49
Test size 61
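
The sizes above can be reproduced by hand. The sketch below assumes that `train_test_split` rounds the test share up and gives the remainder to the training set, which is consistent with the printed sizes. `split_sizes` is a hypothetical helper, and the total of 303 rows is inferred from 193 + 49 + 61 rather than read from the file.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), rounding the test share up (assumed behaviour)."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

total_rows = 193 + 49 + 61                        # 303, inferred from the printed sizes
train_val, n_test = split_sizes(total_rows, 0.2)  # first split: train+validation vs test
n_train, n_val = split_sizes(train_val, 0.2)      # second split: train vs validation
print(n_train, n_val, n_test)                     # 193 49 61
```

Because the second split takes 20% of the remaining 80%, the validation set ends up at roughly 16% of the full data, not 20%.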


### Create input Pipeline

This is where we take the data and transform it into a `tf.data.Dataset` object for easy use.

**Note: If the data is too large, i.e. the CSV file will not fit into memory, use `tf.data` to read it directly from the source rather than loading it into a DataFrame first and then converting it into a `tf.data.Dataset` object.**

In [6]:
def df_to_ds(df, shuffle=False, batch_size=32):
    df = df.copy()
    labels = df.pop('target')
    ds = tf.data.Dataset.from_tensor_slices((dict(df), labels))
    if shuffle:
        ds = ds.shuffle(buffer_size=len(df))
    ds = ds.batch(batch_size)
    return ds

In [10]:
train_ds = df_to_ds(train, True)
validation_ds = df_to_ds(validation)
test_ds = df_to_ds(test)
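
To make the shuffle-then-batch behaviour of `df_to_ds` concrete, here is a plain-Python sketch of the same idea. It is not how `tf.data` is implemented; `to_batches` and the toy data are hypothetical and only mimic the row count of the training split.

```python
import random

def to_batches(columns, labels, shuffle=False, batch_size=32, seed=0):
    """Yield (feature_dict, label_list) batches from column-oriented data."""
    order = list(range(len(labels)))
    if shuffle:
        # A full shuffle, comparable to using buffer_size=len(df).
        random.Random(seed).shuffle(order)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        features = {name: [values[i] for i in idx] for name, values in columns.items()}
        yield features, [labels[i] for i in idx]

# Toy data with the same number of rows as the training split (193).
columns = {'age': list(range(193)), 'thal': ['normal'] * 193}
labels = [0] * 193
batch_sizes = [len(y) for _, y in to_batches(columns, labels, shuffle=True)]
print(batch_sizes)   # [32, 32, 32, 32, 32, 32, 1]
```

Each batch is a dictionary keyed by column name plus a list of labels, and the last batch is smaller because 193 is not a multiple of 32.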

### Checking the input pipeline

In [21]:
for feat_batch, label_batch in train_ds.take(1):
    keys = list(feat_batch.keys())
    len_batch = len(feat_batch[keys[0]])
    print('Length of the batch: ', len_batch)
    print('Keys in the feature batch', keys)

Length of the batch:  32
Keys in the feature batch ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']


### Checking the various Feature Columns

The output of these feature columns is what will be given as input to the model we define.

In [29]:
example_batch = next(iter(train_ds))[0]
print(type(example_batch))

<class 'dict'>


#### Numeric Column

This is used to give real-valued input to the model. The data from the dataset passes through unchanged in this case.

In [30]:
age = tf.feature_column.numeric_column('age')
feat_col = tf.keras.layers.DenseFeatures(age)
feat_col(example_batch)



To change all layers to have dtype float64 by default, call `tf.keras.backend.set_floatx('float64')`. To change just this layer, pass dtype='float64' to the layer constructor. If you are the author of this layer, you can disable autocasting by passing autocast=False to the base Layer constructor.



<tf.Tensor: shape=(32, 1), dtype=float32, numpy=
array([[57.],
       [52.],
       [63.],
       [48.],
       [56.],
       [54.],
       [45.],
       [55.],
       [64.],
       [42.],
       [39.],
       [55.],
       [42.],
       [60.],
       [63.],
       [40.],
       [57.],
       [51.],
       [43.],
       [40.],
       [64.],
       [58.],
       [49.],
       [60.],
       [68.],
       [43.],
       [62.],
       [66.],
       [58.],
       [57.],
       [43.],
       [53.]], dtype=float32)>

#### Bucketized Column

The values are split at the given boundaries, and each value is represented by the bucket it belongs to.

**Note: the input to the Bucketized column must be a Numeric Column object rather than the name of the column to be picked from the dataset**

In [32]:
age_bucket = tf.feature_column.bucketized_column(age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70])
feat_col = tf.keras.layers.DenseFeatures(age_bucket)
feat_col(example_batch)






<tf.Tensor: shape=(32, 12), dtype=float32, numpy=
array([[0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0.
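
The bucket assignment can be checked by hand. The following is a plain-Python sketch, not TensorFlow's implementation, and it assumes each bucket includes its lower boundary. `bucket_one_hot` is a hypothetical helper.

```python
from bisect import bisect_right

# Same boundaries as in the cell above: 11 boundaries give 12 buckets.
boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70]

def bucket_one_hot(value, boundaries):
    """One-hot encode the bucket that `value` falls into."""
    index = bisect_right(boundaries, value)   # number of boundaries <= value
    row = [0.0] * (len(boundaries) + 1)
    row[index] = 1.0
    return row

# First four ages from the numeric column output shown earlier.
for age_value in (57, 52, 63, 48):
    print(age_value, bucket_one_hot(age_value, boundaries).index(1.0))
# 57 8
# 52 7
# 63 9
# 48 6
```

These indices match the positions of the 1s in the first four rows of the output above.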

#### Categorical Columns

Here the column we want to convert holds strings rather than numeric data. Since we can't pass string data to the model, we convert each string into a numeric, one-hot encoded value by supplying a vocabulary that lists the possible values.

In [33]:
thal = tf.feature_column.categorical_column_with_vocabulary_list('thal', ['fixed', 'normal', 'reversible'])
thal_one_hot = tf.feature_column.indicator_column(thal)
feat_col = tf.keras.layers.DenseFeatures(thal_one_hot)
feat_col(example_batch)






<tf.Tensor: shape=(32, 3), dtype=float32, numpy=
array([[0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.]], dtype=float32)>
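
As a rough illustration of what the vocabulary lookup followed by the indicator column produces, here is a plain-Python sketch. `one_hot` is a hypothetical helper, and the all-zero row for an unknown value is an assumption about the default out-of-vocabulary handling.

```python
vocabulary = ['fixed', 'normal', 'reversible']

def one_hot(value, vocabulary):
    """Return a one-hot row for `value`; unknown values give all zeros (assumed)."""
    row = [0.0] * len(vocabulary)
    if value in vocabulary:
        row[vocabulary.index(value)] = 1.0
    return row

print(one_hot('fixed', vocabulary))       # [1.0, 0.0, 0.0]
print(one_hot('normal', vocabulary))      # [0.0, 1.0, 0.0]
print(one_hot('reversible', vocabulary))  # [0.0, 0.0, 1.0]
```

The position of the 1 follows the order of the vocabulary list, so reordering the list changes the encoding.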

#### Embedding Columns

If, instead of the few categories in the case above, we have thousands of categories, converting them to a one-hot encoding is no longer feasible. Instead of encoding them as one-hot vectors, we encode each category as a dense, lower-dimensional vector that contains few or no 0s.

**Note: here we need to pass the categorical_column_with_vocabulary_list object and not the name of the column**

In [34]:
thal_embedding = tf.feature_column.embedding_column(thal, dimension=8)
feat_col = tf.keras.layers.DenseFeatures(thal_embedding)
feat_col(example_batch)






<tf.Tensor: shape=(32, 8), dtype=float32, numpy=
array([[ 0.0417821 ,  0.4522052 , -0.03355381,  0.5166072 , -0.02963242,
         0.07643393, -0.1067495 , -0.51274467],
       [ 0.0417821 ,  0.4522052 , -0.03355381,  0.5166072 , -0.02963242,
         0.07643393, -0.1067495 , -0.51274467],
       [-0.44705936,  0.5012992 , -0.3451139 , -0.6306948 , -0.01072552,
         0.06884802, -0.5132365 , -0.16622813],
       [ 0.0417821 ,  0.4522052 , -0.03355381,  0.5166072 , -0.02963242,
         0.07643393, -0.1067495 , -0.51274467],
       [ 0.0417821 ,  0.4522052 , -0.03355381,  0.5166072 , -0.02963242,
         0.07643393, -0.1067495 , -0.51274467],
       [ 0.0417821 ,  0.4522052 , -0.03355381,  0.5166072 , -0.02963242,
         0.07643393, -0.1067495 , -0.51274467],
       [ 0.0417821 ,  0.4522052 , -0.03355381,  0.5166072 , -0.02963242,
         0.07643393, -0.1067495 , -0.51274467],
       [-0.44705936,  0.5012992 , -0.3451139 , -0.6306948 , -0.01072552,
         0.06884802, -0.5132365
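
Conceptually, an embedding is a lookup table with one dense row per category. The sketch below is a plain-Python illustration under that assumption. The table is filled with fixed random numbers, whereas in the real column the values are trainable weights, and `embed` is a hypothetical helper.

```python
import random

vocabulary = ['fixed', 'normal', 'reversible']
dimension = 8

# One row of `dimension` values per category (random here, learned in practice).
rng = random.Random(0)
table = [[rng.uniform(-1.0, 1.0) for _ in range(dimension)] for _ in vocabulary]

def embed(value):
    """Look up the dense vector for a category."""
    return table[vocabulary.index(value)]

print(len(embed('normal')))                 # 8
print(embed('normal') == embed('normal'))   # True
```

This is why rows with the same `thal` value show identical vectors in the output above.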

#### Hashed Feature Column

Here we use the `categorical_column_with_hash_bucket` method. We don't need to provide a vocabulary; the method takes care of mapping values to buckets.

The only issue is that if `hash_bucket_size` is not large enough relative to the number of possible values, collisions become likely and distinct values can end up represented by the same bucket.

In [37]:
thal_hashed = tf.feature_column.categorical_column_with_hash_bucket('thal', hash_bucket_size=1000)
thal_hashed_col = tf.feature_column.indicator_column(thal_hashed)
feat_col = tf.keras.layers.DenseFeatures(thal_hashed_col)
feat_col(example_batch)






<tf.Tensor: shape=(32, 1000), dtype=float32, numpy=
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)>
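
The collision issue is easy to see in a small sketch. The code below uses MD5 from the standard library purely for illustration; it is an assumption, not the hash function TensorFlow uses, so the bucket ids will differ from the real column. `hash_bucket` is a hypothetical helper.

```python
import hashlib

def hash_bucket(value, hash_bucket_size):
    """Map a string to a bucket id in [0, hash_bucket_size)."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % hash_bucket_size

values = ['fixed', 'normal', 'reversible']

# Plenty of buckets: collisions are possible but unlikely.
large = [hash_bucket(v, 1000) for v in values]
# Fewer buckets than values: at least two values must share a bucket.
small = [hash_bucket(v, 2) for v in values]

print(large)
print(small)
```

With only two buckets, three distinct values cannot all receive different ids, which is the misrepresentation described above.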

#### Crossed Feature Columns

Here we create a new feature column by combining two or more columns, so that the model can learn separate weights for each combination of their values.

In [38]:
crossed_feat = tf.feature_column.crossed_column([age_bucket, thal], hash_bucket_size=1000)
crossed_feat_col = tf.feature_column.indicator_column(crossed_feat)
feat_col = tf.keras.layers.DenseFeatures(crossed_feat_col)
feat_col(example_batch)






<tf.Tensor: shape=(32, 1000), dtype=float32, numpy=
array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)>
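
A crossed column can be thought of as hashing the combination of the input features into a fixed number of buckets. The following plain-Python sketch illustrates the idea under that assumption. The key format, the MD5 hash, and `crossed_id` are hypothetical and do not reproduce TensorFlow's actual ids.

```python
import hashlib
from bisect import bisect_right

boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70]

def crossed_id(age_value, thal_value, hash_bucket_size=1000):
    """Hash the (age bucket, thal) pair into [0, hash_bucket_size)."""
    age_bucket_index = bisect_right(boundaries, age_value)
    key = f'{age_bucket_index}_X_{thal_value}'
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()
    return int(digest, 16) % hash_bucket_size

# 57 and 59 fall into the same age bucket, so with the same `thal`
# they are treated as the same crossed feature.
print(crossed_id(57, 'normal') == crossed_id(59, 'normal'))   # True
```

The model then gets one input per combination, such as "age 55 to 60 and thal normal", instead of treating the two features independently.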

### Creating the Feature Columns

Now that we have seen the kinds of feature columns we can create, let's create some real feature columns for our dataset.

In [41]:
feature_columns = []

# numeric feat col
for col in ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'slope', 'ca']:
    feature_columns.append(tf.feature_column.numeric_column(col))
    
# bucketized feat col
age_bucket = tf.feature_column.bucketized_column(age, boundaries=[18, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70])
feature_columns.append(age_bucket)

# indicator feat cols
thal = tf.feature_column.categorical_column_with_vocabulary_list('thal', ['fixed', 'normal', 'reversible'])
thal_one_hot = tf.feature_column.indicator_column(thal)
feature_columns.append(thal_one_hot)

# embedding feat cols
thal_embedding = tf.feature_column.embedding_column(thal, dimension=8)
feature_columns.append(thal_embedding)

# crossed feat cols
crossed_feat = tf.feature_column.crossed_column([age_bucket, thal], hash_bucket_size=1000)
crossed_feat_col = tf.feature_column.indicator_column(crossed_feat)
feature_columns.append(crossed_feat_col)

### Create an input layer for the model

In [42]:
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)
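
It can help to estimate how wide the input produced by this layer is. The arithmetic below assumes that `DenseFeatures` concatenates the output of every column along the last axis. The dictionary keys are just labels for this sketch.

```python
# Width contributed by each group of feature columns defined above.
widths = {
    'numeric': 7 * 1,        # 7 numeric columns, one value each
    'age_bucket': 12 + 1,    # 12 boundaries give 13 buckets
    'thal_one_hot': 3,       # vocabulary of 3 values
    'thal_embedding': 8,     # dimension=8
    'crossed': 1000,         # hash_bucket_size=1000
}
total_width = sum(widths.values())
print(total_width)   # 1031
```

Almost all of that width comes from the crossed column, so `hash_bucket_size` is the main knob controlling the input size here.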

### Model Section

In [43]:
model = tf.keras.models.Sequential([
    feature_layer,
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

In [45]:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
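
For reference, the `binary_crossentropy` loss used here can be written out by hand. This is a minimal plain-Python sketch of the standard formula, not the Keras implementation. The clipping value `eps` is an assumption added to avoid `log(0)`.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)   # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

print(binary_crossentropy([1, 0], [0.9, 0.1]))   # about 0.105 (confident and right)
print(binary_crossentropy([1, 0], [0.1, 0.9]))   # about 2.303 (confident and wrong)
```

The loss is small when the sigmoid output is close to the true label and grows quickly when the model is confidently wrong.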

In [49]:
model.fit(train_ds, validation_data=validation_ds, epochs=50)

Epoch 1/50
...
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x7f6d60719ca0>

In [50]:
loss, accuracy = model.evaluate(test_ds)
print('Accuracy: ', accuracy)

Accuracy:  0.8524590134620667
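
The reported accuracy can be related back to the size of the test split. The sketch below assumes the metric is a float32 mean over the 61 test rows. The count of 52 correct predictions is inferred from the printed value, not taken from the notebook, and `to_float32` is a hypothetical helper.

```python
import struct

def to_float32(x):
    """Round a Python float to the nearest float32 value."""
    return struct.unpack('f', struct.pack('f', x))[0]

test_rows = 61
correct = 52                      # assumed: the count that best matches the output
accuracy32 = to_float32(correct / test_rows)
print(correct / test_rows)        # exact ratio in double precision
print(accuracy32)                 # the same ratio after rounding to float32
```

With a test set this small, a single prediction changes the accuracy by about 1.6 percentage points, so small differences between runs should not be over-interpreted.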
