# 13. Loading and Preprocessing Data with TensorFlow

Introducing the **Data API**, a useful tool for ingesting and preprocessing large datasets.

### The Data API

The whole API revolves around the concept of a **dataset**. Here is a simple example:

In [1]:
import tensorflow as tf

X = tf.range(10) # any data tensor

In [2]:
dataset = tf.data.Dataset.from_tensor_slices(X)

In [3]:
dataset

<TensorSliceDataset shapes: (), types: tf.int32>

#### Chaining Transformations

We can apply all sorts of transformations to a dataset, and each one returns a new dataset. We can also chain them:

In [4]:
dataset = dataset.repeat(3).batch(7)

In [5]:
for item in dataset:
    print(item)

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int32)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int32)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int32)
tf.Tensor([8 9], shape=(2,), dtype=int32)
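
To see why the batches wrap around from 9 back to 0, it can help to mimic `repeat(3).batch(7)` in plain Python. This is only a sketch of the semantics, not how TensorFlow implements them; the helper names `repeat` and `batch` are hypothetical stand-ins.

```python
from itertools import chain, islice

def repeat(items, count):
    """Yield every item of `items`, `count` times in a row."""
    return chain.from_iterable(items for _ in range(count))

def batch(iterable, batch_size):
    """Group consecutive items into lists of at most `batch_size` items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield chunk

# Same pipeline as above: 0..9 repeated three times, then grouped by 7.
batches = list(batch(repeat(range(10), 3), 7))
for b in batches:
    print(b)
```

The last batch holds only two items because 30 is not a multiple of 7. In the real API, `batch()` accepts `drop_remainder=True` to discard such a short final batch.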


We can transform the items using `map()`. Since the dataset is already batched, the function is applied to each batch:

In [7]:
dataset = dataset.map(lambda x: x * 2) # first batch: [0,2,4,6,8,10,12]

#### Shuffling the Data

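The `shuffle()` method fills a buffer with the first `buffer_size` items, then repeatedly draws one item at random from the buffer and replaces it with the next item from the source. Example: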

In [9]:
dataset = tf.data.Dataset.range(10).repeat(3) # 0 to 9, three times

In [10]:
dataset = dataset.shuffle(buffer_size=5, seed=42).batch(7)

In [11]:
for item in dataset:
    print(item)

tf.Tensor([0 2 3 6 7 9 4], shape=(7,), dtype=int64)
tf.Tensor([5 0 1 1 8 6 5], shape=(7,), dtype=int64)
tf.Tensor([4 8 7 1 2 3 0], shape=(7,), dtype=int64)
tf.Tensor([5 4 2 7 8 9 9], shape=(7,), dtype=int64)
tf.Tensor([3 6], shape=(2,), dtype=int64)
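
The buffer mechanism can be sketched in a few lines of plain Python. This is a simplified illustration under the assumption that one random buffer slot is emitted and refilled per step; the function name `buffered_shuffle` is hypothetical, and the order it produces will not match TensorFlow's output for the same seed.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield `items` in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # Source exhausted: emit whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer

source = list(range(10)) * 3  # 0 to 9, three times
shuffled = list(buffered_shuffle(source, buffer_size=5, seed=42))
print(shuffled)
```

With a small buffer, an item can only move a few positions ahead of where it started, so the result stays close to the original order. This is why the buffer should be reasonably large relative to the dataset.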


Another method to shuffle the data is **interleaving lines from multiple files**. The idea is simple: take the dataset and split it into a training set, a validation set, and a test set. Then split each set into many files.

Suppose `train_filepaths` contains the list of training file paths. We first create a dataset of these file paths:

In [12]:
filepath_dataset = tf.data.Dataset.list_files(train_filepaths, seed=42)


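And finally, we call `interleave()` to read from `n_readers` files at a time, taking one line from each in turn (sequentially, since `num_parallel_calls` is not set) and skipping each file's first line (typically the header row):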

In [13]:
n_readers = 5
dataset = filepath_dataset.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
    cycle_length=n_readers)

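
The round-robin behaviour of interleaving can be illustrated without any files. The sketch below is a simplification, not TensorFlow's implementation: the function `interleave_lines` is hypothetical, and each "file" is just an in-memory list of lines whose first entry plays the role of the header.

```python
from collections import deque

def interleave_lines(files, cycle_length, skip=1):
    """Round-robin over the lines of up to `cycle_length` files at a time.

    `files` is a list of lists of lines (a stand-in for files on disk);
    the first `skip` lines of each file are dropped.
    """
    pending = deque(iter(lines[skip:]) for lines in files)
    active = deque()
    while pending and len(active) < cycle_length:
        active.append(pending.popleft())
    while active:
        reader = active.popleft()
        try:
            line = next(reader)
        except StopIteration:
            # This file is exhausted: open the next pending file, if any.
            if pending:
                active.append(pending.popleft())
            continue
        yield line
        active.append(reader)  # go to the back of the rotation

files = [
    ["header", "a1", "a2", "a3"],
    ["header", "b1", "b2"],
    ["header", "c1", "c2", "c3"],
]
result = list(interleave_lines(files, cycle_length=2))
print(result)
```

Only two files are open at any time here; the third one is opened once one of the first two runs out of lines.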

### TFRecord Format

The TFRecord format is useful when the bottleneck is loading and parsing the data. We can easily create a TFRecord file:

In [14]:
with tf.io.TFRecordWriter("my_data.tfrecord") as f:
    f.write(b"This is the first record")
    f.write(b"And this is the second record")

Then we use `tf.data.TFRecordDataset` to read it:

In [15]:
filepaths = ["my_data.tfrecord"]
dataset = tf.data.TFRecordDataset(filepaths)
for item in dataset:
    print(item)

tf.Tensor(b'This is the first record', shape=(), dtype=string)
tf.Tensor(b'And this is the second record', shape=(), dtype=string)


### Preprocessing the Input Features

That is, converting all features into numerical features, typically normalizing them, and so on. We have several options, some already covered:

1. Preparing data files in advance using another package (e.g. pandas)
2. Preprocessing using Data API
3. Preprocessing layer directly embedded in our model

Let's cover option 3.

Let's implement a standardization layer using a `Lambda` layer. For each feature, it subtracts the mean and divides by its standard deviation (plus a tiny smoothing term to avoid division by zero):

In [18]:
import numpy as np
from tensorflow import keras

# X_train is assumed to be a NumPy array of training features, defined elsewhere
means = np.mean(X_train, axis=0, keepdims=True)
stds = np.std(X_train, axis=0, keepdims=True)
eps = keras.backend.epsilon()
model = keras.models.Sequential([
    keras.layers.Lambda(lambda inputs: (inputs - means) / (stds + eps)),
    [...] # other layers
])

However, we may prefer to use a nice self-contained custom layer instead of having global variables like `means` and `stds`: 

In [21]:
from tensorflow import keras

class Standardization(keras.layers.Layer):
    def adapt(self, data_sample):
        # compute per-feature statistics once, from a sample of the data
        self.means_ = np.mean(data_sample, axis=0, keepdims=True)
        self.stds_ = np.std(data_sample, axis=0, keepdims=True)
    def call(self, inputs):
        # apply the stored statistics to every batch
        return (inputs - self.means_) / (self.stds_ + keras.backend.epsilon())
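
The arithmetic behind both versions can be checked on a tiny example. The following is a plain-Python sketch, not part of the model code: the helper names `fit_standardizer` and `standardize` are hypothetical, and `EPSILON` is set to `1e-7`, which is the default value returned by `keras.backend.epsilon()`. The two steps mirror `adapt()` (compute statistics) and `call()` (apply them).

```python
from statistics import fmean, pstdev

EPSILON = 1e-7  # smoothing term; matches the default of keras.backend.epsilon()

def fit_standardizer(rows):
    """Compute the per-feature mean and (population) standard deviation."""
    columns = list(zip(*rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]  # population std, like np.std
    return means, stds

def standardize(rows, means, stds):
    """Subtract the mean and divide by std + EPSILON, feature by feature."""
    return [
        [(x - m) / (s + EPSILON) for x, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Two features: the second one is constant, so its standard deviation is 0.
sample = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
means, stds = fit_standardizer(sample)
scaled = standardize(sample, means, stds)
print(scaled)
```

The constant second feature shows why the smoothing term matters: its standard deviation is exactly 0, and without `EPSILON` the division would fail.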

#### Encoding Categorical Features Using One-Hot Vectors

Encoding categorical features is an essential step in any preprocessing pipeline. Let's see how this can be done:

In [22]:
# list of all possible categories
vocab = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]
# tensor with indices (0-4)
indices = tf.range(len(vocab), dtype=tf.int64)
# initializer for the lookup table
table_init = tf.lookup.KeyValueTensorInitializer(vocab, indices)
# number of out-of-vocabulary (oov) buckets
num_oov_buckets = 2
# lookup table that maps unknown categories to one of the oov buckets
table = tf.lookup.StaticVocabularyTable(table_init, num_oov_buckets)

Now let’s use the lookup table to encode a small batch of categorical features to one-hot vectors:

In [23]:
categories = tf.constant(["NEAR BAY", "DESERT", "INLAND", "INLAND"])
cat_indices = table.lookup(categories)

In [24]:
cat_one_hot = tf.one_hot(cat_indices, depth=len(vocab) + num_oov_buckets)

In [25]:
cat_one_hot

<tf.Tensor: shape=(4, 7), dtype=float32, numpy=
array([[0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0.]], dtype=float32)>
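
The role of the out-of-vocabulary buckets can be sketched in plain Python. This is an illustration only: the functions `lookup` and `one_hot` are hypothetical, and `zlib.crc32` is an assumed stand-in for the hash function, so an unknown category may land in a different bucket than it does with `StaticVocabularyTable`.

```python
import zlib

vocab = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]
num_oov_buckets = 2

def lookup(category, vocab, num_oov_buckets):
    """Return the vocabulary index, or an oov bucket index for unknown categories."""
    if category in vocab:
        return vocab.index(category)
    # Unknown category: hash it into one of the buckets placed after the vocabulary.
    # crc32 is only a stand-in hash for this sketch.
    bucket = zlib.crc32(category.encode("utf-8")) % num_oov_buckets
    return len(vocab) + bucket

def one_hot(index, depth):
    """Return a list of `depth` floats with a single 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(depth)]

depth = len(vocab) + num_oov_buckets
for category in ["NEAR BAY", "DESERT", "INLAND", "INLAND"]:
    index = lookup(category, vocab, num_oov_buckets)
    print(category, index, one_hot(index, depth))
```

Known categories always get their vocabulary index (0 to 4), while any unknown category is mapped deterministically to index 5 or 6. Two different unknown categories can collide in the same bucket, which is the price of a fixed-size encoding.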

**Tip**: This works well with relatively few (<10) categories. For larger (>50) vocabs, it may be better to switch to **embeddings**.

#### Encoding Categorical Features Using Embeddings

An embedding is a trainable dense vector that represents a category. In some cases (e.g. word embeddings) they can be reused. 

Beyond their role in **representation learning**, embeddings can also capture relationships between concepts (e.g. King - Man + Woman $\approx$ Queen).

Here is how it works in a simplified 2D embedding example:

In [26]:
embedding_dim = 2 # typically 10-300
embed_init = tf.random.uniform([len(vocab) + num_oov_buckets, embedding_dim])
embedding_matrix = tf.Variable(embed_init)

In [27]:
embedding_matrix

<tf.Variable 'Variable:0' shape=(7, 2) dtype=float32, numpy=
array([[0.80938625, 0.66542244],
       [0.9468827 , 0.268731  ],
       [0.872504  , 0.5759852 ],
       [0.41557658, 0.1011914 ],
       [0.2997185 , 0.8080206 ],
       [0.43381143, 0.44566298],
       [0.51396513, 0.6908008 ]], dtype=float32)>

Encoding:

In [29]:
categories = tf.constant(["NEAR BAY", "DESERT", "INLAND", "INLAND"])

In [30]:
cat_indices = table.lookup(categories)

In [31]:
cat_indices

<tf.Tensor: shape=(4,), dtype=int64, numpy=array([3, 5, 1, 1], dtype=int64)>

In [32]:
tf.nn.embedding_lookup(embedding_matrix, cat_indices)

<tf.Tensor: shape=(4, 2), dtype=float32, numpy=
array([[0.41557658, 0.1011914 ],
       [0.43381143, 0.44566298],
       [0.9468827 , 0.268731  ],
       [0.9468827 , 0.268731  ]], dtype=float32)>
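
An embedding lookup simply picks rows of the matrix, which gives the same result as multiplying a one-hot vector by the matrix, only without the wasted multiplications by zero. The plain-Python sketch below checks this equivalence; the function names and the values in `toy_matrix` are made up for illustration.

```python
def embedding_lookup(matrix, indices):
    """Select one row of `matrix` per index."""
    return [matrix[i] for i in indices]

def one_hot_matmul(matrix, indices):
    """Compute the same result the slow way: one-hot vector times matrix."""
    depth, dim = len(matrix), len(matrix[0])
    rows = []
    for index in indices:
        one_hot = [1.0 if i == index else 0.0 for i in range(depth)]
        rows.append([
            sum(one_hot[i] * matrix[i][j] for i in range(depth))
            for j in range(dim)
        ])
    return rows

# Made-up 4 x 2 embedding matrix: 4 categories, 2 dimensions each.
toy_matrix = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
indices = [3, 1, 1]

looked_up = embedding_lookup(toy_matrix, indices)
multiplied = one_hot_matmul(toy_matrix, indices)
print(looked_up)
print(looked_up == multiplied)
```

Both approaches return the rows for categories 3, 1 and 1, and repeated categories share the same vector.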

Putting everything together, we can create a model that learns categorical features through embeddings.

In [33]:
regular_inputs = keras.layers.Input(shape=[8])
categories = keras.layers.Input(shape=[], dtype=tf.string)
# look up the index of each category
cat_indices = keras.layers.Lambda(lambda cats: table.lookup(cats))(categories)
# look up the embedding of each index; input_dim must cover the vocabulary
# plus the oov buckets (5 + 2 = 7)
cat_embed = keras.layers.Embedding(input_dim=len(vocab) + num_oov_buckets,
                                   output_dim=2)(cat_indices)
encoded_inputs = keras.layers.concatenate([regular_inputs, cat_embed])
outputs = keras.layers.Dense(1)(encoded_inputs)
model = keras.models.Model(inputs=[regular_inputs, categories],
                           outputs=[outputs])

### TF Transform

When we deploy a model to production, reimplementing the preprocessing code for every platform becomes a maintenance problem and risks a mismatch between training and serving. TF Transform lets us define the preprocessing operations just once.

In [36]:
import tensorflow_transform as tft

def preprocess(inputs): # inputs = a batch of input features
    median_age = inputs["housing_median_age"]
    ocean_proximity = inputs["ocean_proximity"]
    standardized_age = tft.scale_to_z_score(median_age)
    ocean_proximity_id = tft.compute_and_apply_vocabulary(ocean_proximity)
    return {
        "standardized_median_age": standardized_age,
        "ocean_proximity_id": ocean_proximity_id
    }
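
The idea behind this function is a two-phase process: statistics such as the mean, the standard deviation and the vocabulary are computed once over the whole training set, and then the same transformation is applied everywhere. The plain-Python sketch below only illustrates that split and does not use TF Transform; the functions `analyze` and `transform` are hypothetical, and returning `-1` for an unseen category is an assumption made for this sketch.

```python
from collections import Counter
from statistics import fmean, pstdev

def analyze(training_rows):
    """Phase 1: one full pass over the training data to compute statistics."""
    ages = [row["housing_median_age"] for row in training_rows]
    counts = Counter(row["ocean_proximity"] for row in training_rows)
    # The most frequent category gets id 0, the next one id 1, and so on.
    vocabulary = [category for category, _ in counts.most_common()]
    return {"age_mean": fmean(ages), "age_std": pstdev(ages), "vocabulary": vocabulary}

def transform(row, stats):
    """Phase 2: apply the stored statistics to a single example."""
    vocabulary = stats["vocabulary"]
    category = row["ocean_proximity"]
    return {
        "standardized_median_age":
            (row["housing_median_age"] - stats["age_mean"]) / stats["age_std"],
        # -1 marks a category that was not seen during analysis (assumption)
        "ocean_proximity_id": vocabulary.index(category) if category in vocabulary else -1,
    }

training_rows = [
    {"housing_median_age": 10.0, "ocean_proximity": "INLAND"},
    {"housing_median_age": 20.0, "ocean_proximity": "INLAND"},
    {"housing_median_age": 30.0, "ocean_proximity": "NEAR BAY"},
]
stats = analyze(training_rows)
example = transform({"housing_median_age": 20.0, "ocean_proximity": "NEAR BAY"}, stats)
print(example)
```

Because `transform` depends only on the stored `stats`, the same logic can run at training time and at serving time without recomputing anything.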