In [1]:
import numpy as np
import tensorflow as tf
import pandas as pd

# Loading and Preprocessing Data with TensorFlow

Ingesting a large dataset and preprocessing it efficiently can be tricky to implement with other Deep Learning libraries, but TensorFlow makes it easy thanks to the *Data API*: you just create a dataset object, and tell it where to get the data and how to transform it.

Off the shelf, the Data API can read from text files, binary files with fixed-sized records, and binary files that use TensorFlow's TFRecord format. The Data API also has support for reading from SQL databases.

Reading huge datasets efficiently is not the only difficulty: the data also needs to be preprocessed, usually normalized. Moreover, it is not always composed strictly of convenient numerical fields: there may be text features, categorical features, and so on. These need to be encoded, for example using one-hot encoding, bag-of-words encoding, or _embeddings_ (as we will see, an embedding is a trainable dense vector that represents a category or token). One option to handle all this preprocessing is to write your own custom preprocessing layers. Another is to use the standard preprocessing layers provided by Keras.

## The Data API

In [2]:
X = tf.range(10)  # any data tensor
dataset = tf.data.Dataset.from_tensor_slices(X)

for item in dataset:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)


The from_tensor_slices() function takes a tensor and creates a tf.data.Dataset whose elements are all the slices of X (along the first dimension), so this dataset contains 10 items. In this case we would have obtained the same items if we had used tf.data.Dataset.range(10), except that they would be 64-bit integers instead of 32-bit integers. Note that you can simply iterate over the dataset's items with a for loop.

### Chaining Transformations

Once you have a dataset, you can apply all sorts of transformations to it by calling its transformation methods. Each method returns a new dataset, so you can chain transformations like the example below.

In [3]:
dataset = dataset.repeat(3).batch(7)
for item in dataset:
    print(item)

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int32)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int32)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int32)
tf.Tensor([8 9], shape=(2,), dtype=int32)


In this example, we first call the repeat() method on the original dataset, and it returns a new dataset that will repeat the items of the original dataset three times. Of course, this will not copy all the data in memory three times! (If you call this method with no arguments, the new dataset will repeat the source dataset forever, so the code that iterates over the dataset will have to decide when to stop.) Then we call the batch() method on this new dataset, and again this creates a new dataset. This one will group the items of the previous dataset in batches of seven items. As you can see, the batch() method had to output a final batch of size two instead of seven, but you can call it with drop_remainder=True if you want it to drop this final batch so that all batches have the exact same size.

The dataset methods do _not_ modify datasets, they create new ones, so make sure to keep a reference to these new datasets (e.g. with dataset = ...) or else nothing will happen.
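
As a quick check of the drop_remainder=True option mentioned above, here is a minimal sketch that rebuilds the same 30-item dataset and collects the batch shapes. It uses tf.data.Dataset.range(10) as the source, which is an assumption made only to keep the snippet self-contained.

```python
import tensorflow as tf

# 10 items repeated 3 times = 30 items; 30 // 7 = 4 full batches, and the
# 2 leftover items are dropped because drop_remainder=True.
dataset = tf.data.Dataset.range(10).repeat(3).batch(7, drop_remainder=True)

batch_shapes = [batch.numpy().shape for batch in dataset]
print(batch_shapes)  # → [(7,), (7,), (7,), (7,)]
```

Having batches of identical size can be convenient when a model or a TF Function expects a fixed batch dimension.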

In [4]:
dataset = dataset.map(lambda x: x * 2)
for item in dataset:
    print(item)

tf.Tensor([ 0  2  4  6  8 10 12], shape=(7,), dtype=int32)
tf.Tensor([14 16 18  0  2  4  6], shape=(7,), dtype=int32)
tf.Tensor([ 8 10 12 14 16 18  0], shape=(7,), dtype=int32)
tf.Tensor([ 2  4  6  8 10 12 14], shape=(7,), dtype=int32)
tf.Tensor([16 18], shape=(2,), dtype=int32)


You can also transform the items by calling the map() method. For example, the code above doubles every value in the dataset. This function is the one you will call to apply any preprocessing you want to your data. Sometimes this will include computations that can be quite intensive, such as reshaping or rotating an image, so you will usually want to spawn multiple threads to speed things up: it's as simple as setting the num_parallel_calls argument. Note that the function you pass to the map() method must be convertible to a TF Function.
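
Here is a minimal sketch of the num_parallel_calls argument in action. The choice of tf.data.experimental.AUTOTUNE (rather than a fixed thread count) is just one option, and the toy doubling function stands in for a genuinely expensive preprocessing step.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)

# Let TensorFlow pick the level of parallelism for the mapped function.
# By default the output order stays deterministic even when map() runs in parallel.
dataset = dataset.map(
    lambda x: x * 2,
    num_parallel_calls=tf.data.experimental.AUTOTUNE,
)

doubled = [int(item.numpy()) for item in dataset]
print(doubled)  # → [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]
```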

While the map() method applies a transformation to each item, the apply() method applies a transformation to the dataset as a whole. For example, the following code unbatches the dataset by calling its unbatch() method (the trailing comment shows the equivalent apply()-based call with the experimental unbatch() function). Each item in the new dataset will be a single-integer tensor instead of a batch of seven integers. It is also possible to simply filter the dataset using filter(), or look at just a few items using take().

In [5]:
dataset = dataset.unbatch() #dataset.apply(tf.data.experimental.unbatch())
for item in dataset:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(10, shape=(), dtype=int32)
tf.Tensor(12, shape=(), dtype=int32)
tf.Tensor(14, shape=(), dtype=int32)
tf.Tensor(16, shape=(), dtype=int32)
tf.Tensor(18, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(10, shape=(), dtype=int32)
tf.Tensor(12, shape=(), dtype=int32)
tf.Tensor(14, shape=(), dtype=int32)
tf.Tensor(16, shape=(), dtype=int32)
tf.Tensor(18, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(10, shape=(), dtype=int32)
tf.Tensor(12, shape=(), dtype=int32)
tf.Tensor(14, shape=(), dtype=int32)
tf.Tensor(16, shape=(), dtype=int32)
tf.Tensor(18, shape=(), dtype=int32)

In [6]:
dataset = dataset.filter(lambda x: x < 10)
for item in dataset:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)


In [7]:
for item in dataset.take(3):
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)


### Shuffling Data

Gradient Descent works best when the instances in the training set are independent and identically distributed. A simple way to ensure this is to shuffle the instances using the shuffle() method. You must specify the buffer size, and it is important to make it large enough, or else shuffling will not be very effective. Just don't exceed the amount of RAM you have, and even if you have plenty of it, there's no need to go beyond the dataset's size.
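
To see why the buffer size matters, here is a pure-Python sketch of how a shuffling buffer behaves. It is only a model of the idea, not TensorFlow's actual implementation, and buffered_shuffle is a hypothetical helper written for this illustration.

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        # Buffer is full: emit a random buffered item and replace it with the new one.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Source exhausted: emit whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer

source = list(range(20))
small = list(buffered_shuffle(source, buffer_size=3, seed=42))
large = list(buffered_shuffle(source, buffer_size=20, seed=42))
print(small)  # items stay close to their original positions
print(large)  # buffer covers the whole dataset, so this is a full shuffle
```

With a buffer of size 3, the item emitted at position k can only come from the first k + 3 source items, so the output order stays close to the input order. With a buffer of size 1, the "shuffled" output is simply the original sequence.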

For example, the following code creates and displays a dataset containing the integers 0 to 9, repeated 3 times, shuffled using a buffer of size 5 and a random seed of 42, then batched with a batch size of 7.

If you call repeat() on a shuffled dataset, by default it will generate a new order at every iteration. This is generally a good idea, but if you prefer to reuse the same order at each iteration, you can set reshuffle_each_iteration=False.

In [8]:
dataset = tf.data.Dataset.range(10).repeat(3).shuffle(buffer_size=5, seed=42).batch(7)
for item in dataset:
    print(item)

tf.Tensor([0 2 3 6 7 9 4], shape=(7,), dtype=int64)
tf.Tensor([5 0 1 1 8 6 5], shape=(7,), dtype=int64)
tf.Tensor([4 8 7 1 2 3 0], shape=(7,), dtype=int64)
tf.Tensor([5 4 2 7 8 9 9], shape=(7,), dtype=int64)
tf.Tensor([3 6], shape=(2,), dtype=int64)


For a large dataset that does not fit into memory, this simple shuffling-buffer approach may not be sufficient. One solution is to shuffle the source data itself. Then, to shuffle the instances some more, a common approach is to split the source data into multiple files, then read them in a random order during training. However, instances located in the same file will still end up close to each other. To avoid this you can pick multiple files randomly and read them simultaneously, interleaving their records. Then on top of that you can add a shuffling buffer using the shuffle() method. The Data API makes all of this possible in just a few lines of code.

#### Interleaving lines from multiple files

The following example assumes the California housing dataset has been downloaded, split into train, val, and test sets, and that each set is further broken down into multiple CSV files. I'm not going to do this because it is unnecessary bloat on my computer, so the code is left commented out.

In [9]:
# train_filepaths = ['datasets/housing/my_train_00.csv', 'datasets/housing/my_train_01.csv', ...]
# filepath_dataset = tf.data.Dataset.list_files(train_filepaths, seed=42)

# n_readers = 5
# dataset = filepath_dataset.interleave(
#     lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
#     cycle_length=n_readers
# )

# for line in dataset.take(5):
#     print(line.numpy())

Let's suppose train_filepaths contains the list of training file paths. By default, the list_files() function returns a dataset that shuffles the file paths. In general, this is a good thing, but you can set shuffle=False if you do not want that for some reason.

Next, you can call the interleave() method to read from five files (n_readers) at a time and interleave their lines (skipping the first line of each file, which is the header row, using the skip() method).

The interleave() method will create a dataset that will pull five file paths from the filepath_dataset, and for each one it will call the function you feed it (a lambda in this example) to create a new dataset (in this case a TextLineDataset).

For interleaving to work best, it is preferable to have files of identical length; otherwise the ends of the longest files will not be interleaved.

By default, interleave() does not use parallelism; it just reads one line at a time from each file, sequentially. If you want it to actually read files in parallel, you can set the num_parallel_calls argument to the number of threads you want. Alternatively, you can set num_parallel_calls=tf.data.experimental.AUTOTUNE to let TensorFlow choose the right number of threads dynamically based on the available CPU.
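
Since the housing files are not available here, the following sketch writes three tiny CSV files to a temporary directory and interleaves them. The file names and contents are made up for the illustration; each value encodes its file index and row index so the interleaving is easy to see when printed.

```python
import os
import tempfile

import tensorflow as tf

# Create 3 small CSV files, each with a header row and 4 data rows.
tmp_dir = tempfile.mkdtemp()
filepaths = []
for file_idx in range(3):
    path = os.path.join(tmp_dir, f"part_{file_idx:02d}.csv")
    with open(path, "w") as f:
        f.write("value\n")  # header row, skipped below
        for row_idx in range(4):
            f.write(f"{file_idx}{row_idx}\n")
    filepaths.append(path)

# shuffle=False keeps the file order fixed, which makes the output easier to follow.
filepath_dataset = tf.data.Dataset.list_files(filepaths, shuffle=False)
dataset = filepath_dataset.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
    cycle_length=3,
    num_parallel_calls=tf.data.experimental.AUTOTUNE,
)

lines = [line.numpy().decode("utf-8") for line in dataset]
print(lines)
```

The printed list should alternate between the three files (first digit) rather than listing one file after another.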

### Preprocessing the Data

In [10]:
# x_mean, x_std = [...] # mean and scale of each feature in the training set
# n_inputs = 8

# def preprocess(line):
#     defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
#     fields = tf.io.decode_csv(line, record_defaults=defs)
#     x = tf.stack(fields[:-1])
#     y = tf.stack(fields[-1:])
#     return (x - x_mean) / x_std, y

Let's walk through the code above:

1. First, the code assumes that we have precomputed the mean and standard deviation of each feature in the training set. x_mean and x_std are just 1D tensors (or NumPy arrays) containing 8 floats, one per input feature.
2. The preprocess() function takes one CSV line and starts by parsing it. For this it uses the tf.io.decode_csv() function, which takes 2 arguments: the first is the line to parse, and the second is an array containing the default value for each column in the CSV file. This array tells TensorFlow not only the default value for each column, but also the number of columns and their types. In this example, we tell it that all feature columns are floats and that missing values should default to 0, but we provide an empty array of tf.float32 as the default value for the last column (the target): the array tells TensorFlow that this column contains floats, but that there is no default value, so it will raise an exception if it encounters a missing value.
3. The decode_csv() function returns a list of scalar tensors (one per column), but we need to return 1D tensor arrays. So we call tf.stack() on all tensors except the last one (the target): this will stack these tensors into a 1D array. We then do the same for the target value (this makes it a 1D tensor array with a single value, rather than a scalar tensor).
4. Finally, we scale the input features by subtracting the feature means and then dividing by the feature standard deviations, and we return a tuple containing the scaled features and target.
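
The steps above can be tried on a single hand-written CSV line. This is a sketch under made-up assumptions: it uses 3 input features instead of 8, and the values of x_mean and x_std are invented so the arithmetic is easy to follow.

```python
import numpy as np
import tensorflow as tf

n_inputs = 3  # assumption: 3 features instead of the 8 housing features
x_mean = np.array([1.0, 2.0, 3.0], dtype=np.float32)  # made-up feature means
x_std = np.array([2.0, 2.0, 2.0], dtype=np.float32)   # made-up feature scales

def preprocess(line):
    # Feature columns default to 0.0; the target column has no default value.
    defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
    fields = tf.io.decode_csv(line, record_defaults=defs)
    x = tf.stack(fields[:-1])  # 1D tensor of features
    y = tf.stack(fields[-1:])  # 1D tensor holding the single target value
    return (x - x_mean) / x_std, y

# The second feature is missing, so it falls back to the default 0.0.
x, y = preprocess(b"3.0,,5.0,10.0")
print(x.numpy())  # → [ 1. -1.  1.]
print(y.numpy())  # → [10.]
```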

### Putting Everything Together

To make the code reusable, let's put together everything we have discussed so far into a small helper function:

In [11]:
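def csv_reader_dataset(
    filepaths,
    repeat=1,
    n_readers=5,
    n_read_threads=None,
    shuffle_buffer_size=100_000,  # must be an integer, not a float like 1e5
    n_parse_threads=5,
    batch_size=32
):
    # List (and shuffle) the file paths, then interleave lines from n_readers files at a time.
    dataset = tf.data.Dataset.list_files(filepaths)
    dataset = dataset.interleave(
        lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
        cycle_length=n_readers,
        num_parallel_calls=n_read_threads
    )
    # Parse and scale each CSV line; requires preprocess() from the previous section to be defined.
    dataset = dataset.map(preprocess, num_parallel_calls=n_parse_threads)
    dataset = dataset.shuffle(shuffle_buffer_size).repeat(repeat)
    return dataset.batch(batch_size).prefetch(1)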

### Prefetching

By calling prefetch(1) at the end, we are creating a dataset that will do its best to always be one batch ahead. In other words, while our training algorithm is working on one batch, the dataset will already be working in parallel on getting the next batch ready. This can improve performance dramatically.

In general, just prefetching one batch is fine, but in some cases you may want to prefetch a few more. Alternatively, you can let TensorFlow decide automatically by passing tf.data.experimental.AUTOTUNE.

With prefetching, the CPU and the GPU work in parallel: as the GPU works on one batch, the CPU works on the next. If you plan to purchase a GPU, its processing power and its memory size are of course very important (in particular, a large amount of RAM is crucial for computer vision). Just as important to get good performance is its _memory bandwidth_; this is the number of gigabytes of data it can get into or out of its RAM per second.

If the dataset is small enough to fit in memory, you can significantly speed up training by using the dataset's cache() method to cache its content to RAM. This way, each instance will only be read and preprocessed once.

We have discussed the most common dataset methods, but there are a few more you may want to look at:

1. concatenate()
2. zip()
3. window()
4. reduce()
5. shard()
6. flat_map()
7. padded_batch()

There are also a couple more class methods worth mentioning:

1. from_generator()
2. from_tensors()

which create a new dataset from a Python generator or a list of tensors, respectively. Also note that there are experimental features available in tf.data.experimental.
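
To get a feel for a few of the methods listed above, here is a short sketch on toy datasets. The datasets a and b are made up for the example, and only a subset of the methods is shown.

```python
import tensorflow as tf

a = tf.data.Dataset.range(3)       # 0, 1, 2
b = tf.data.Dataset.range(10, 13)  # 10, 11, 12

# concatenate(): items of a followed by items of b.
concatenated = [int(x.numpy()) for x in a.concatenate(b)]
print(concatenated)  # → [0, 1, 2, 10, 11, 12]

# zip(): pairs up the items of several datasets.
zipped = [(int(x.numpy()), int(y.numpy())) for x, y in tf.data.Dataset.zip((a, b))]
print(zipped)  # → [(0, 10), (1, 11), (2, 12)]

# reduce(): folds the whole dataset into a single value (here, a sum).
total = int(a.reduce(tf.constant(0, dtype=tf.int64), lambda state, x: state + x).numpy())
print(total)  # → 3

# shard(): keeps every num_shards-th item, starting at the given index.
shard_0 = [int(x.numpy()) for x in tf.data.Dataset.range(10).shard(num_shards=3, index=0)]
print(shard_0)  # → [0, 3, 6, 9]

# from_tensors(): wraps its argument as a single dataset element (no slicing).
n_elements = len(list(tf.data.Dataset.from_tensors(tf.range(3))))
print(n_elements)  # → 1
```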

### Using the Dataset with tf.keras

In [12]:
# train_set = csv_reader_dataset(train_filepaths)
# val_set = csv_reader_dataset(val_filepaths)
# test_set = csv_reader_dataset(test_filepaths)

# model = tf.keras.models.Sequential([...])
# model.compile([....])
# model.fit(train_set, epochs=10, validation_data=val_set)

# model.evaluate(test_set)
# new_set = test_set.take(3).map(lambda X, y: X) # pretend we have 3 new instances
# model.predict(new_set) # a dataset containing new instances

Now we can use the csv_reader_dataset() function to create the training, validation, and test sets. Note that we do not need to repeat the datasets, as this will be taken care of by tf.keras.

If you want to build your own custom training loop (as in Chapter 12), you can just iterate over the training set, very naturally:

In [13]:
# for X_batch, y_batch in train_set:
#     [...] # perform one gradient descent step

In fact, it is even possible to create a TF Function (see Chapter 12) that performs the whole training loop.

In [14]:
# @tf.function
# def train(model, optimizer, loss_fn, n_epochs, **kwargs):
#     train_set = csv_reader_dataset(train_filepaths, repeat=n_epochs, [...])
#     for X_batch, y_batch in train_set:
#         with tf.GradientTape() as tape:
#             y_pred = model(X_batch)
#             main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
#             loss = tf.add_n([main_loss] + model.losses)
#         grads = tape.gradient(loss, model.trainable_variables)
#         optimizer.apply_gradients(zip(grads, model.trainable_variables))

Note that CSV files do not support large or complex data structures (such as images or audio) very well. So let's see how to use TFRecords instead. If you are happy with CSV files (or whatever other format you are using), you do not _need_ to use TFRecords. As the saying goes, if it ain't broke, don't fix it! __TFRecords are useful when the bottleneck during training is loading and parsing data.__

## The TFRecord Format

The TFRecord format is TensorFlow's preferred format for storing large amounts of data and reading it efficiently. You can easily create a TFRecord file using the tf.io.TFRecordWriter class. By default, a TFRecordDataset will read files one by one, but you can make it read multiple files in parallel and interleave their records by setting num_parallel_reads. Alternatively, you could obtain the same results by using list_files() and interleave() as we did earlier to read multiple CSV files.

In [15]:
with tf.io.TFRecordWriter('my_data.tfrecord') as f:
    f.write(b'This is the first record')
    f.write(b'This is the second record')

filepaths = ['my_data.tfrecord']
dataset = tf.data.TFRecordDataset(filepaths)
for item in dataset:
    print(item)

tf.Tensor(b'This is the first record', shape=(), dtype=string)
tf.Tensor(b'This is the second record', shape=(), dtype=string)
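
To try the num_parallel_reads option mentioned above, the following sketch writes two small TFRecord files to a temporary directory and reads them back together. The file names and record contents are made up for the illustration.

```python
import os
import tempfile

import tensorflow as tf

# Write 2 TFRecord files with 3 records each.
tmp_dir = tempfile.mkdtemp()
filepaths = []
for file_idx in range(2):
    path = os.path.join(tmp_dir, f"part_{file_idx}.tfrecord")
    with tf.io.TFRecordWriter(path) as writer:
        for record_idx in range(3):
            writer.write(f"file {file_idx}, record {record_idx}".encode("utf-8"))
    filepaths.append(path)

# Read both files in parallel; their records get interleaved.
dataset = tf.data.TFRecordDataset(filepaths, num_parallel_reads=2)
records = [item.numpy().decode("utf-8") for item in dataset]
for record in records:
    print(record)
```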


### Compressed TFRecord Files

It can sometimes be useful to compress your TFRecord files, especially if they need to be loaded via a network connection. You can create a compressed TFRecord file by setting the options argument. When reading a compressed TFRecord file, you need to specify the compression type.

In [16]:
options = tf.io.TFRecordOptions(compression_type='GZIP')
with tf.io.TFRecordWriter('my_compressed_data.tfrecord', options) as f:
    f.write(b'This is the first record')
    f.write(b'This is the second record')

filepaths = ['my_compressed_data.tfrecord']
dataset = tf.data.TFRecordDataset(filepaths, compression_type='GZIP')
for item in dataset:
    print(item)

tf.Tensor(b'This is the first record', shape=(), dtype=string)
tf.Tensor(b'This is the second record', shape=(), dtype=string)


### A Brief Introduction to Protocol Buffers

TFRecord files usually contain serialized protocol buffers (also called _protobufs_). This is a portable, extensible, and efficient binary format developed at Google back in 2001. The following example says we are using version 3 of the protobuf format, and it specifies that each Person object may (optionally) have a name of type string, an id of type int32, and zero or more email fields, each of type string. The numbers 1, 2, and 3 are the field identifiers: they will be used in each record's binary representation. Once you have a definition in a .proto file, you can compile it. This requires protoc, the protobuf compiler, to generate access classes in Python. Fortunately, TensorFlow does include special protobuf definitions for which it provides parsing operations.

In [17]:
# syntax = 'proto3';
# message Person {
#     string name = 1;
#     int32 id = 2;
#     repeated string email = 3;
# }

### TensorFlow Protobufs

The main protobuf typically used in a TFRecord file is the __Example__ protobuf, which represents one instance in a dataset. Normally you would write much more than one Example. __Typically, you would create a conversion script that reads from your current format (say, CSV files), creates an Example protobuf for each instance, serializes them, and saves them to several TFRecord files, ideally shuffling them in the process. This requires a bit of work, so once again make sure it is really necessary__ (perhaps your pipeline works fine with CSV files).

In [18]:
from tensorflow.train import BytesList, FloatList, Int64List, Feature, Features, Example

person_example = Example(
    features=Features(
        feature={
            'name': Feature(bytes_list=BytesList(value=[b'Alice'])),
            'id': Feature(int64_list=Int64List(value=[123])),
            'emails': Feature(bytes_list=BytesList(value=[
                b'a@b.com',
                b'c@d.com'
            ]))
        })
)

with tf.io.TFRecordWriter('my_contacts.tfrecord') as f:
    f.write(person_example.SerializeToString())

### Loading and Parsing Examples

To load the serialized Example protobufs, we will use a tf.data.TFRecordDataset once again, and we will parse each Example using tf.io.parse_single_example(). This is a TensorFlow operation, so it can be included in a TF Function. It requires at least two arguments: a string scalar tensor containing the serialized data, and a description of each feature.

The following code defines a description dictionary, then it iterates over the TFRecordDataset and parses the serialized Example protobuf this dataset contains.

In [19]:
feature_description = {
    'name': tf.io.FixedLenFeature([], tf.string, default_value=''),
    'id': tf.io.FixedLenFeature([], tf.int64, default_value=0), 
    'emails': tf.io.VarLenFeature(tf.string),
}

for serialized_example in tf.data.TFRecordDataset(['my_contacts.tfrecord']):
    parsed_example = tf.io.parse_single_example(
        serialized_example,
        feature_description
    )

parsed_example['emails'].values

<tf.Tensor: shape=(2,), dtype=string, numpy=array([b'a@b.com', b'c@d.com'], dtype=object)>

A BytesList can contain any binary data you want, including any serialized object. You can also store any tensor you want in a BytesList by serializing the tensor using tf.io.serialize_tensor() then putting the resulting byte string in a BytesList feature. Later, when you parse the TFRecord, you can parse this data using tf.io.parse_tensor().
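
Here is a minimal round-trip sketch of that idea: a small float tensor is serialized, stored in a BytesList feature, and recovered after parsing. The feature name 'matrix' and the tensor values are made up for the illustration.

```python
import tensorflow as tf

tensor = tf.constant([[1.0, 2.0], [3.0, 4.0]])

# Serialize the tensor to a byte string and wrap it in an Example protobuf.
serialized_tensor = tf.io.serialize_tensor(tensor)
example = tf.train.Example(
    features=tf.train.Features(
        feature={
            "matrix": tf.train.Feature(
                bytes_list=tf.train.BytesList(value=[serialized_tensor.numpy()])
            )
        }
    )
)

# Parse the Example, then turn the byte string back into a tensor.
parsed = tf.io.parse_single_example(
    example.SerializeToString(),
    {"matrix": tf.io.FixedLenFeature([], tf.string)},
)
restored = tf.io.parse_tensor(parsed["matrix"], out_type=tf.float32)
print(restored.numpy())
```

Note that parse_tensor() needs to be told the dtype (out_type), since the feature itself is just an opaque byte string.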

As you can see, the Example protobuf will probably be sufficient for most use cases. However, it may be a bit cumbersome to use when you are dealing with lists of lists. For example, suppose you want to classify text documents. Each document may be represented as a list of sentences, where each sentence is represented as a list of words. And perhaps each document also has a list of comments, where each comment is represented as a list of words. __TensorFlow's SequenceExample protobuf is designed for such use cases__.

In [20]:
''' Instead of parsing examples one by one using tf.io.parse_single_example(), you may want to parse them batch by batch using tf.io.parse_example()'''

dataset = tf.data.TFRecordDataset(['my_contacts.tfrecord']).batch(10)
for serialized_examples in dataset:
    parsed_examples = tf.io.parse_example(
        serialized_examples,
        feature_description
    )

### Handling Lists of Lists Using the SequenceExample Protobuf

A SequenceExample contains a Features object for the contextual data and a FeatureLists object that contains one or more named FeatureList objects (e.g. a FeatureList named 'content' and a FeatureList named 'comments').

Each FeatureList contains a list of Feature objects, each of which may be a list of byte strings, a list of 64-bit integers, or a list of floats (in this example, each Feature would represent a sentence or a comment, perhaps in the form of a list of word identifiers). Building a SequenceExample, serializing it, and parsing it is similar to building, serializing, and parsing an Example, but you must use tf.io.parse_single_sequence_example() to parse a single SequenceExample or tf.io.parse_sequence_example() to parse a batch.

Both functions return a tuple containing the context features (as a dictionary) and the feature lists (also as a dictionary). If the feature lists contain sequences of varying sizes (as in the preceding example), you may want to convert them to ragged tensors, using tf.RaggedTensor.from_sparse().

In [21]:
# parsed_context, parsed_feature_lists = tf.io.parse_single_sequence_example(
#     serialized_sequence_example,
#     context_feature_descriptions,
#     sequence_feature_descriptions
# )
# parsed_content = tf.RaggedTensor.from_sparse(parsed_feature_lists['content'])
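
The snippet above assumes a serialized SequenceExample already exists. Here is a self-contained sketch that builds one, serializes it, and parses it back. The document content, the 'author_id' context feature, and the words_to_feature() helper are all made up for the illustration.

```python
import tensorflow as tf

def words_to_feature(words):
    """Hypothetical helper: wrap a list of words in a bytes Feature."""
    return tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[w.encode("utf-8") for w in words])
    )

content = [["hello", "world"], ["tensorflow", "reads", "records"]]  # 2 sentences
comments = [["nice", "post"]]                                       # 1 comment

sequence_example = tf.train.SequenceExample(
    context=tf.train.Features(feature={
        "author_id": tf.train.Feature(int64_list=tf.train.Int64List(value=[123])),
    }),
    feature_lists=tf.train.FeatureLists(feature_list={
        "content": tf.train.FeatureList(feature=[words_to_feature(s) for s in content]),
        "comments": tf.train.FeatureList(feature=[words_to_feature(c) for c in comments]),
    }),
)
serialized_sequence_example = sequence_example.SerializeToString()

context_feature_descriptions = {
    "author_id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
}
sequence_feature_descriptions = {
    "content": tf.io.VarLenFeature(tf.string),
    "comments": tf.io.VarLenFeature(tf.string),
}
parsed_context, parsed_feature_lists = tf.io.parse_single_sequence_example(
    serialized_sequence_example,
    context_feature_descriptions,
    sequence_feature_descriptions,
)

# Sentences have different lengths, so a ragged tensor is a natural fit.
parsed_content = tf.RaggedTensor.from_sparse(parsed_feature_lists["content"])
print(parsed_content.to_list())
```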

## Preprocessing the Input Features

Preparing your data for a neural network requires converting all features into numerical features, generally normalizing them, and more. If your data contains categorical features or text features, they need to be converted to numbers. This can be done ahead of time using NumPy, pandas, or Scikit-Learn, for example, or you can preprocess your data with the Data API (e.g. using the dataset's map() method). Alternatively, you can include a preprocessing layer directly in your model. Let's look at this last option now.

This is an example of how you can implement a standardization layer using a Lambda layer.

In [22]:
# means = np.mean(X_train, axis=0, keepdims=True)
# stds = np.std(X_train, axis=0, keepdims=True)
# eps = tf.keras.backend.epsilon()
# model = tf.keras.models.Sequential([
#     tf.keras.layers.Lambda(lambda inputs: (inputs - means) / (stds + eps)),
#     [...] # other layers
# ])

You may prefer to use a nice self-contained custom layer (much like Scikit-Learn's StandardScaler):

In [23]:
class Standardization(tf.keras.layers.Layer):
    def adapt(self, data_sample):
        self.means_ = np.mean(data_sample, axis=0, keepdims=True)
        self.stds_ = np.std(data_sample, axis=0, keepdims=True)

    def call(self, inputs):
        return (inputs - self.means_) / (self.stds_ + tf.keras.backend.epsilon())

std_layer = Standardization()
# std_layer.adapt(data_sample)

Before you can use this standardization layer, you will need to adapt it to your dataset by calling the adapt() method and passing it the data sample. This will allow it to use the appropriate mean and standard deviation for each feature.
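
As a quick sanity check, the layer can be adapted to a tiny data sample and applied to it. The class is repeated here only to keep the snippet self-contained, and the 3x2 data_sample is made up for the illustration.

```python
import numpy as np
import tensorflow as tf

class Standardization(tf.keras.layers.Layer):
    def adapt(self, data_sample):
        # Per-feature statistics, kept with shape (1, n_features) for broadcasting.
        self.means_ = np.mean(data_sample, axis=0, keepdims=True)
        self.stds_ = np.std(data_sample, axis=0, keepdims=True)

    def call(self, inputs):
        return (inputs - self.means_) / (self.stds_ + tf.keras.backend.epsilon())

# Made-up sample: two features on very different scales.
data_sample = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]], dtype=np.float32)

std_layer = Standardization()
std_layer.adapt(data_sample)
scaled = np.asarray(std_layer(data_sample))

print(scaled.mean(axis=0))  # close to 0 for each feature
print(scaled.std(axis=0))   # close to 1 for each feature
```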

The sample must be large enough to be representative of your dataset, but it does not have to be the full training set: in general, a few hundred randomly selected instances will suffice. Next, you can use this preprocessing layer like a normal layer:

In [24]:
# model = tf.keras.models.Sequential()
# model.add(std_layer)
# [...] # create the rest of the model
# model.compile([...])
# model.fit([...])

### Encoding Categorical Features Using One-Hot Vectors

Consider the ocean_proximity feature in the California housing dataset we explored in Chapter 2. We need to encode this feature before we feed it into a neural network. Since there are very few categories, we can use one-hot encoding. For this, we first need to map each category to its index (0 to 4), which can be done using a lookup table. 

In [25]:
vocab = ['<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'NEAR BAY', 'ISLAND']
indices = tf.range(len(vocab), dtype=tf.int64)
table_init = tf.lookup.KeyValueTensorInitializer(vocab, indices)
num_oov_buckets = 2
table = tf.lookup.StaticVocabularyTable(table_init, num_oov_buckets)

Let's go through this code:

1. We first define the _vocabulary_: this is the list of all possible categories
2. Then we create a tensor with the corresponding indices (0 to 4)
3. Next, we create an initializer for the lookup table, passing it the list of categories and their corresponding indices. In this example, we already have this data, so we use a KeyValueTensorInitializer; __but if the categories were listed in a text file (with one category per line), we would use a TextFileInitializer instead.__
4. In the last two lines we create the lookup table, giving it the initializer and specifying the number of _out-of-vocabulary_ (oov) buckets. If we look up a category that does not exist in the vocabulary, the lookup table will compute a hash of this category and use it to assign the unknown category to one of the oov buckets. Their indices start after the known categories, so in this example the indices of the two oov buckets are 5 and 6.
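
For the text-file case mentioned in step 3, here is a sketch using tf.lookup.TextFileInitializer. The vocabulary file is written to a temporary location just for the demo, and its file name is made up.

```python
import os
import tempfile

import tensorflow as tf

vocab = ['<1H OCEAN', 'INLAND', 'NEAR OCEAN', 'NEAR BAY', 'ISLAND']

# Write the vocabulary to a text file, one category per line.
vocab_path = os.path.join(tempfile.mkdtemp(), "ocean_proximity_vocab.txt")
with open(vocab_path, "w") as f:
    f.write("\n".join(vocab) + "\n")

# Keys are the whole lines; values are the (0-based) line numbers.
table_init = tf.lookup.TextFileInitializer(
    vocab_path,
    key_dtype=tf.string,
    key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
    value_dtype=tf.int64,
    value_index=tf.lookup.TextFileIndex.LINE_NUMBER,
)
table = tf.lookup.StaticVocabularyTable(table_init, num_oov_buckets=2)

indices = table.lookup(tf.constant(['NEAR BAY', 'DESERT', 'INLAND'])).numpy().tolist()
print(indices)  # 'DESERT' is unknown, so it lands in one of the oov buckets (5 or 6)
```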

Why use oov buckets? Well, if the number of categories is large (e.g. zip codes, cities, words, products, or users) and the dataset is large as well, or it keeps changing, then getting the full list of categories may not be convenient. One solution is to define the vocabulary based on a data sample (rather than the whole training set) and add some oov buckets for the other categories that were not in the data sample. The more unknown categories you expect to find during training, the more oov buckets you should use. Indeed, if there are not enough oov buckets, there will be collisions: different categories will end up in the same bucket, so the neural network will not be able to distinguish them (at least not based on this feature).

Now let's use the lookup table to encode a small batch of categorical features to one-hot vectors:

In [26]:
categories = tf.constant(['NEAR BAY', 'DESERT', 'INLAND', 'INLAND'])
cat_indices = table.lookup(categories)
cat_indices

<tf.Tensor: shape=(4,), dtype=int64, numpy=array([3, 5, 1, 1], dtype=int64)>

In [27]:
cat_one_hot = tf.one_hot(cat_indices, depth=len(vocab) + num_oov_buckets)
cat_one_hot

<tf.Tensor: shape=(4, 7), dtype=float32, numpy=
array([[0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0.]], dtype=float32)>

Notice that we have to tell this function the total number of indices, which is equal to the vocabulary size plus the number of oov buckets. Just like earlier, it wouldn't be too difficult to bundle all of this logic into a nice self-contained class. Its adapt() method would take a data sample and extract all the distinct categories it contains. It would create a lookup table to map each category to its index (including unknown categories using oov buckets). Then its call() method would use the lookup table to map the input categories to their indices.
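
One possible shape for such a class is sketched below. It is a plain Python class rather than a Keras layer (a simplification made here), its name OneHotCategoryEncoder is hypothetical, and it goes one step further than described by also applying tf.one_hot() to the indices.

```python
import tensorflow as tf

class OneHotCategoryEncoder:
    """Hypothetical helper: learns a vocabulary from a sample, then one-hot encodes."""

    def __init__(self, num_oov_buckets=1):
        self.num_oov_buckets = num_oov_buckets

    def adapt(self, data_sample):
        # Distinct categories found in the sample, in sorted order.
        self.vocab_ = sorted(set(data_sample))
        indices = tf.range(len(self.vocab_), dtype=tf.int64)
        table_init = tf.lookup.KeyValueTensorInitializer(self.vocab_, indices)
        self.table_ = tf.lookup.StaticVocabularyTable(table_init, self.num_oov_buckets)

    def __call__(self, categories):
        cat_indices = self.table_.lookup(tf.convert_to_tensor(categories))
        depth = len(self.vocab_) + self.num_oov_buckets
        return tf.one_hot(cat_indices, depth=depth)

encoder = OneHotCategoryEncoder(num_oov_buckets=1)
encoder.adapt(['INLAND', 'NEAR BAY', 'INLAND', 'ISLAND'])

# 'DESERT' was not in the sample, so it falls into the single oov bucket (index 3).
one_hot = encoder(['NEAR BAY', 'DESERT']).numpy()
print(encoder.vocab_)  # → ['INLAND', 'ISLAND', 'NEAR BAY']
print(one_hot)
```

Turning this into a real preprocessing layer would mostly mean subclassing tf.keras.layers.Layer and renaming __call__() to call().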

By the time you read this, tf.keras will probably include a layer called tf.keras.layers.TextVectorization, which will be capable of doing exactly this. You can add this layer at the beginning of your model, followed by a Lambda layer that would apply the tf.one_hot() function, if you want to convert these indices to one-hot vectors.

__One-hot encoding may not be the best solution, though. The size of each one-hot vector is the vocabulary length plus the number of oov buckets. This is fine when there are just a few possible categories, but if the vocabulary is large, it is much more efficient to encode them using _embeddings_ instead.__

___As a rule of thumb, if the number of categories is lower than 10, then one-hot encoding is generally the way to go. If the number of categories is greater than 50 (which is often the case when you use hash buckets), then embeddings are usually preferable. In between 10 and 50 you may want to experiment with both options and see which one works best for your use case.___

### Encoding Categorical Features Using Embeddings

An embedding is a trainable dense vector that represents a category. By default, embeddings are initialized randomly, so for example the 'NEAR BAY' category could be represented initially by a random vector such as [0.131, 0.890], while the 'NEAR OCEAN' category might be represented by another random vector such as [0.631, 0.791].

__In this example, we use 2D embeddings, but the number of dimensions is a hyperparameter you can tweak.__ Since these embeddings are trainable, they will gradually improve during training; and as they represent fairly similar categories, Gradient Descent will certainly end up pushing them closer together, while it will tend to move them away from the 'ISLAND' category's embedding. __This is called representation learning__ (more on this in Chapter 17).

Let's look at how we could implement embeddings manually, to understand how they work. First, we need to create an _embedding matrix_ containing each category's embedding, initialized randomly; it will have one row per category and per oov bucket, and one column per embedding dimension.

In [28]:
'''
This embedding matrix is a random 7x2 matrix (one row per category and per oov bucket) stored in a variable (so it can be tweaked by Gradient Descent during training)
'''
embedding_dim = 2
embed_init = tf.random.uniform([len(vocab) + num_oov_buckets, embedding_dim])
embedding_matrix = tf.Variable(embed_init)
embedding_matrix

<tf.Variable 'Variable:0' shape=(7, 2) dtype=float32, numpy=
array([[0.99118483, 0.6733229 ],
       [0.43409908, 0.46441567],
       [0.09780705, 0.30941188],
       [0.22888577, 0.67305243],
       [0.5651032 , 0.89478266],
       [0.9883096 , 0.96375036],
       [0.6182895 , 0.6394534 ]], dtype=float32)>

In [29]:
'''
Now let's encode the same batch of categorical features as earlier, but this time using these embeddings.

The tf.nn.embedding_lookup() function looks up the rows in the embedding matrix at the given indices; that's all it does.
'''
categories = tf.constant(['NEAR BAY', 'DESERT', 'INLAND', 'INLAND'])
cat_indices = table.lookup(categories)
tf.nn.embedding_lookup(embedding_matrix, cat_indices)

<tf.Tensor: shape=(4, 2), dtype=float32, numpy=
array([[0.22888577, 0.67305243],
       [0.9883096 , 0.96375036],
       [0.43409908, 0.46441567],
       [0.43409908, 0.46441567]], dtype=float32)>

In [30]:
'''
tf.keras provides a tf.keras.layers.Embedding layer that handles the embedding matrix (trainable by default);
when the layer is created it initializes the embedding matrix randomly, and then when it is called with some category
indices it returns the rows at those indices in the embedding matrix
'''
embedding = tf.keras.layers.Embedding(
    input_dim=len(vocab) + num_oov_buckets,
    output_dim=embedding_dim
)
embedding(cat_indices)

<tf.Tensor: shape=(4, 2), dtype=float32, numpy=
array([[-0.00919043, -0.00759076],
       [ 0.04987205,  0.03222804],
       [ 0.01170005, -0.00302599],
       [ 0.01170005, -0.00302599]], dtype=float32)>

In [31]:
'''
Putting everything together, we can now create a tf.keras model that can process categorical features (along with regular numerical features)
and learn an embedding for each category (as well as for each oov bucket)
'''
regular_inputs = tf.keras.layers.Input(shape=[8])
categories = tf.keras.layers.Input(shape=[], dtype=tf.string)
cat_indices = tf.keras.layers.Lambda(lambda cats: table.lookup(cats))(categories)
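# input_dim must cover every index the table can return (vocabulary + oov buckets)
cat_embed = tf.keras.layers.Embedding(input_dim=len(vocab) + num_oov_buckets, output_dim=2)(cat_indices)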
encoded_inputs = tf.keras.layers.concatenate([regular_inputs, cat_embed])
outputs = tf.keras.layers.Dense(1)(encoded_inputs)
model = tf.keras.models.Model(
    inputs=[regular_inputs, categories],
    outputs=[outputs]
)

__In this example we are using 2D embeddings, but as a rule of thumb embeddings typically have 10 to 300 dimensions, depending on the task and the vocabulary size__.

This model takes two inputs: a regular input containing 8 numerical features per instance, plus a categorical input (containing one categorical feature per instance). It uses a Lambda layer to look up each category's index, then it looks up the embeddings for these indices. Next, it concatenates the embeddings and the regular inputs in order to give the encoded inputs, which are ready to be fed to a neural network. We could add any kind of neural network at this point, but we just add a dense output layer, and we create the tf.keras.Model.

When the tf.keras.layers.TextVectorization layer is available, you can call its adapt() method to make it extract the vocabulary from a data sample (it will take care of creating the lookup table for you). Then you can add it to your model, and it will perform the index lookup (replacing the Lambda layer in the previous example).

One-hot encoding followed by a Dense layer (with no activation function and no biases) is equivalent to an Embedding layer. However, the Embedding layer uses far fewer computations (the performance difference becomes clear when the size of the embedding matrix grows). The Dense layer's weight matrix plays the role of the embedding matrix. For example, __using one-hot vectors of size 20 and a Dense layer with 10 units is equivalent to using an Embedding layer with input_dim=20 and output_dim=10. As a result, it would be wasteful to use more embedding dimensions than the number of units in the layer that follows the Embedding layer.__
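The equivalence can be checked numerically. The following NumPy sketch uses a random 20x10 matrix as a stand-in for both the Dense layer's kernel and the embedding matrix; the variable names and the sample indices are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
num_categories, embedding_dim = 20, 10

# Plays both roles: Dense kernel (no bias, no activation) and embedding matrix.
weights = rng.normal(size=(num_categories, embedding_dim))

cat_indices = np.array([3, 17, 3, 0])

# Route 1: one-hot encode, then multiply by the Dense layer's weight matrix.
one_hot = np.eye(num_categories)[cat_indices]   # shape (4, 20)
dense_output = one_hot @ weights                # shape (4, 10)

# Route 2: look the rows up directly, as an Embedding layer does.
embedding_output = weights[cat_indices]         # shape (4, 10)

print(np.allclose(dense_output, embedding_output))  # True
```

The matrix product performs 20 multiplications per output value, almost all of them by zero, whereas the lookup simply copies a row; this is where the computational saving comes from.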

### Word Embeddings

Not only will embeddings generally be useful representations for the task at hand, but quite often these same embeddings can be reused successfully for other tasks. The most common example of this is _word embeddings_: when you are working on a natural language processing task, you are often better off reusing pretrained word embeddings than training your own.

In 2013 Google researchers published a paper describing an efficient technique to learn word embeddings using neural networks. They trained a neural network to predict the words near any given word, and obtained astounding word embeddings. For example, synonyms had very close embeddings, and semantically related words such as France, Spain, and Italy ended up clustered together.

It's not just about proximity, though: word embeddings were also organized along meaningful axes in the embedding space. Here is a famous example: if you compute King - Man + Woman (adding and subtracting the embedding vectors of these words), then the result will be very close to the embedding of the word Queen.

In other words, the word embeddings encode the concept of gender! Similarly, you can compute Madrid - Spain + France and the result is close to Paris, which seems to show that the notion of capital city was also encoded in the embeddings.
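To make the vector arithmetic concrete, here is a toy sketch. The 2D vectors are made up by hand (one axis loosely standing for "royalty", the other for "maleness"); they are not real pretrained embeddings, and the `nearest()` helper is a hypothetical name introduced for this example.

```python
import numpy as np

# Hand-made toy vectors: axis 0 ~ "royalty", axis 1 ~ "maleness".
# These are NOT real word embeddings.
toy_embeddings = {
    "king":   np.array([0.9, 0.9]),
    "queen":  np.array([0.9, 0.1]),
    "prince": np.array([0.7, 0.9]),
    "man":    np.array([0.1, 0.9]),
    "woman":  np.array([0.1, 0.1]),
}


def nearest(vector, embeddings, exclude=()):
    """Return the word whose embedding has the highest cosine similarity to `vector`."""
    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    candidates = {w: v for w, v in embeddings.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(vector, candidates[w]))


result = toy_embeddings["king"] - toy_embeddings["man"] + toy_embeddings["woman"]
closest = nearest(result, toy_embeddings, exclude=("king", "man", "woman"))
print(closest)  # queen
```

With real pretrained embeddings the result is only approximately equal to the target vector, which is why a nearest-neighbor search is used rather than an exact comparison.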

Unfortunately, word embeddings sometimes capture biases, such as learning Man is to Doctor as Woman is to Nurse. Ensuring fairness in Deep Learning algorithms is an important and active research topic.

### Keras Preprocessing Layers

The TensorFlow team is working on providing a set of standard Keras preprocessing layers. This new API will likely supersede the existing Feature Columns API, which is harder to use and less intuitive (if you want to learn more about the Feature Columns API anyway, please check the notebook for this chapter).

We already discussed two of these layers: __the tf.keras.layers.Normalization layer that will perform feature standardization__ (it will be equivalent to the Standardization layer we defined earlier), and __the TextVectorization layer that will be capable of encoding each word in the inputs into its index in the vocabulary__. In both cases, you create the layer, you call its adapt() method on a data sample, and then you use the layer normally in your model. The other preprocessing layers will follow the same pattern. 

The API will also include a __tf.keras.layers.Discretization layer that will chop continuous data into different bins and encode each bin as a one-hot vector. For example, you could use it to discretize prices into three categories (low, medium, high), which would be encoded as [1, 0, 0], [0, 1, 0], and [0, 0, 1], respectively__. Of course, this loses a lot of information, but in some cases it can help the model detect patterns that would be otherwise not obvious when just looking at the continuous values.
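The binning idea itself can be sketched in plain NumPy. This is not the Keras layer, only an illustration of the same transformation; the prices and the two bin boundaries (50 and 100) are arbitrary values chosen for the example.

```python
import numpy as np

prices = np.array([12.0, 55.0, 140.0, 80.0])   # made-up sample values
bin_edges = [50.0, 100.0]                      # hypothetical: low < 50 <= medium < 100 <= high

# np.digitize returns the bin index of each price: 0 = low, 1 = medium, 2 = high.
bin_ids = np.digitize(prices, bin_edges)

# One row of the identity matrix per price gives the one-hot encoding.
one_hot = np.eye(len(bin_edges) + 1)[bin_ids]

print(bin_ids)   # [0 1 2 1]
print(one_hot)
```

Two prices as different as 55 and 80 end up with the same encoding, which is the information loss mentioned above.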

__The Discretization layer will not be differentiable, and it should only be used at the start of your model. Indeed, the model's preprocessing layers will be frozen during training, so their parameters will not be affected by Gradient Descent, and thus they do not need to be differentiable. This also means that you should not use an Embedding layer directly in a custom preprocessing layer, if you want it to be trainable: instead, it should be added separately to your model, as in the previous code example.__

___It will also be possible to chain multiple preprocessing layers using the PreprocessingStage class. For example, the following code will create a preprocessing pipeline that will first normalize the inputs, then discretize them (this may remind you of Scikit-Learn pipelines). After you adapt this pipeline to a data sample, you can use it like a regular layer in your models.___

In [32]:
' This is deprecated and no longer exists in the TF API. There are other ways to create preprocessing pipelines. TF Transform, for example'
# normalization = tf.keras.layers.Normalization()
# discretization = tf.keras.layers.Discretization([...])
# pipeline = tf.keras.layers.PreprocessingStage([normalization, discretization])
# pipeline.adapt(data_sample)

' This is deprecated and no longer exists in the TF API. There are other ways to create preprocessing pipelines. TF Transform, for example'

The TextVectorization layer will also have an option to output word-count vectors instead of word indices. For example, if the vocabulary contains three words, say ['and', 'basketball', 'more'], then the text 'more and more' will be mapped to the vector [1, 0, 2]: the word 'and' appears once and the word 'basketball' does not appear at all, and the word 'more' appears twice. This text representation is called a _bag of words_, since it completely loses the order of the words. Common words like 'and' will have a large value in most texts, even though they are usually the least interesting (e.g. in the text 'more and more basketball' the word 'basketball' is clearly the most important, precisely because it is not a very frequent word).
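The word-count mapping described above can be reproduced in a few lines of plain Python. The `bag_of_words()` helper is a hypothetical name for this sketch, and it splits on whitespace only, which is far simpler than what a real text vectorization layer does.

```python
from collections import Counter

vocabulary = ["and", "basketball", "more"]


def bag_of_words(text, vocabulary):
    """Count how many times each vocabulary word appears in `text`."""
    counts = Counter(text.lower().split())
    # Counter returns 0 for words that never appear in the text.
    return [counts[word] for word in vocabulary]


vector = bag_of_words("more and more", vocabulary)
print(vector)  # [1, 0, 2]
```

Note that 'and more more' would produce exactly the same vector, since word order is discarded.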

__So, the word counts should be normalized in a way that reduces the importance of frequent words. A common way to do this is to divide each word count by the log of the total number of training instances in which the word appears.__ This technique is called _Term-Frequency × Inverse-Document-Frequency_ (TF-IDF).

For example, let's imagine that the words 'and', 'basketball', and 'more' appear respectively in 200, 10 and 100 text instances in the training set: in this case, the final vector will be [1/log(200), 0/log(10), 2/log(100)], which (using the natural logarithm) is approximately equal to [0.19, 0., 0.43]. The TextVectorization layer will (likely) have an option to perform TF-IDF.
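The arithmetic of this example can be verified with NumPy. This sketch only implements the simple count / log(document frequency) weighting described here; library implementations of TF-IDF typically use somewhat different formulas (smoothing, normalization), so their numbers may not match.

```python
import numpy as np

# Word counts for the text 'more and more' over ['and', 'basketball', 'more'].
word_counts = np.array([1.0, 0.0, 2.0])

# Number of training instances in which each word appears (from the example above).
doc_frequencies = np.array([200.0, 10.0, 100.0])

# Divide each count by the natural log of the word's document frequency.
tfidf = word_counts / np.log(doc_frequencies)

print(np.round(tfidf, 2))  # approximately [0.19, 0., 0.43]
```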

If the standard preprocessing layers are insufficient for your task, you will still have the option to create your own custom preprocessing layer, much like we did earlier with the Standardization class.

To do so, create a subclass of the tf.keras.layers.PreprocessingLayer class with an adapt() method, which should take a data_sample argument and optionally an extra reset_state argument: if it is True, the adapt() method should reset any existing state before computing the new state; if it is False, it should try to update the existing state.
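The reset_state semantics can be illustrated with a plain NumPy class. This is only a sketch of the contract, not a Keras layer: `RunningStandardizer` is a hypothetical name, and it does not subclass anything from tf.keras. It keeps running sums so that a later adapt() call can either start over or extend the existing statistics.

```python
import numpy as np


class RunningStandardizer:
    """Hypothetical stand-in for a custom preprocessing layer (plain NumPy, not Keras)."""

    def __init__(self):
        self._reset()

    def _reset(self):
        self.count = 0
        self.total = 0.0
        self.total_sq = 0.0

    def adapt(self, data_sample, reset_state=True):
        if reset_state:
            self._reset()  # forget statistics from earlier adapt() calls
        data_sample = np.asarray(data_sample, dtype=np.float64)
        self.count += data_sample.shape[0]
        self.total = self.total + data_sample.sum(axis=0)
        self.total_sq = self.total_sq + (data_sample ** 2).sum(axis=0)

    @property
    def mean(self):
        return self.total / self.count

    @property
    def std(self):
        variance = self.total_sq / self.count - self.mean ** 2
        return np.sqrt(np.maximum(variance, 0.0))

    def __call__(self, inputs):
        eps = 1e-7  # avoids division by zero for constant features
        return (np.asarray(inputs, dtype=np.float64) - self.mean) / (self.std + eps)


first = np.array([[1.0], [2.0], [3.0]])
second = np.array([[4.0], [5.0]])

layer = RunningStandardizer()
layer.adapt(first)
layer.adapt(second, reset_state=False)  # statistics now cover all five rows
print(layer.mean, layer.std)
```

Calling `layer.adapt(second)` with the default reset_state=True would instead discard the first sample and compute the statistics from the two rows of `second` only.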

## TF Transform (TFX)

tf.Transform makes it possible to write a single preprocessing function that can be run in batch mode on your full training set, before training (to speed it up), and then exported to a TF Function and incorporated into your trained model so that once it is deployed in production it can take care of preprocessing new instances on the fly.

If preprocessing is computationally expensive, then handling it before training rather than on the fly may give you a significant speedup: the data will be preprocessed just once per instance _before_ training, rather than once per instance _during_ training. This is what TF Transform was designed for.

First, to use a TFX component such as TF Transform, you must install it; it does not come bundled with TensorFlow. Here is what the preprocessing function might look like if we just had two features:

In [33]:
# pip install tensorflow-transform

In [34]:
import tensorflow_transform as tft

def preprocess(inputs): # inputs = a batch of input features
    median_age = inputs['housing_median_age']
    ocean_proximity = inputs['ocean_proximity']
    standardized_age = tft.scale_to_z_score(median_age)
    ocean_proximity_id = tft.compute_and_apply_vocabulary(ocean_proximity)
    return {
        'standardized_median_age': standardized_age,
        'ocean_proximity_id': ocean_proximity_id
    }

Next, TF Transform lets you apply this preprocess() function to the whole training set using Apache Beam (it provides an AnalyzeAndTransformDataset class that you can use for this purpose in your Apache Beam pipeline). In this process, it will also compute all the necessary statistics over the whole training set: in this example, the mean and standard deviation of the housing_median_age feature, and the vocabulary for the ocean_proximity feature.

Importantly, TF Transform will also generate an equivalent TensorFlow function that you can plug into the model you deploy. With the Data API, TFRecord, the Keras preprocessing layers, and TF Transform, you can build highly scalable input pipelines for training and benefit from fast and portable preprocessing in production.

## The TensorFlow Datasets (TFDS) Project

The TensorFlow Datasets (TFDS) project makes it very easy to download many common datasets of all kinds, from small ones like MNIST or Fashion MNIST to huge ones like ImageNet. It also provides convenient dataset objects to manipulate them using the Data API.

You can visit https://homl.info/tfds to view the full list, along with a description of each dataset. TFDS is not bundled with TensorFlow, so you need to install the tensorflow-datasets library. 

Then call the tfds.load() function and it will download the data you want. The load() function shuffles each data shard it downloads (only for the training set). This may not be sufficient, so it's best to shuffle the training data some more.

Note that each item in the dataset is a dictionary containing both the features and the labels. But tf.keras expects each item to be a tuple containing two elements (again, the features and the labels). You can ask the load() function to do this for you by setting as_supervised=True (obviously this works only for labeled datasets).
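If you load a dataset without as_supervised=True, the same conversion can be done by hand with the dataset's map() method. The sketch below avoids any download by building a tiny in-memory dataset of dictionaries; the "image" and "label" keys and the zero-filled tensors are stand-ins chosen for this illustration, so check the actual feature names of the dataset you load.

```python
import tensorflow as tf

# Tiny in-memory stand-in for a TFDS dataset (made-up values, no download needed).
dict_dataset = tf.data.Dataset.from_tensor_slices({
    "image": tf.zeros([4, 28, 28, 1], dtype=tf.uint8),
    "label": tf.constant([0, 1, 2, 3], dtype=tf.int64),
})

# Convert each {"image": ..., "label": ...} dictionary into a (features, label) tuple.
tuple_dataset = dict_dataset.map(lambda item: (item["image"], item["label"]))

# Shuffle, batch and prefetch, as you would with the real training set.
tuple_dataset = tuple_dataset.shuffle(4).batch(2).prefetch(1)

for images, labels in tuple_dataset.take(1):
    print(images.shape, labels.shape)  # (2, 28, 28, 1) (2,)
```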

In [35]:
# pip install tensorflow-datasets

In [47]:
import tensorflow_datasets as tfds

mnist_dataset = tfds.load('mnist', batch_size=32, as_supervised=True)
mnist_train, mnist_test = mnist_dataset.get('train').shuffle(10000).prefetch(1), mnist_dataset.get('test')

imdb_dataset = tfds.load('imdb_reviews', batch_size=32, as_supervised=True)
imdb_train, imdb_test = imdb_dataset.get('train').shuffle(10000).prefetch(1), imdb_dataset.get('test')

Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to C:\Users\Steph\tensorflow_datasets\imdb_reviews\plain_text\1.0.0...


Dataset imdb_reviews downloaded and prepared to C:\Users\Steph\tensorflow_datasets\imdb_reviews\plain_text\1.0.0. Subsequent calls will reuse this data.


# Exercises

1. Why would you want to use the Data API?

>My Answer:



>Book Answer:

2. What are the benefits of splitting a large dataset into multiple files?

>My Answer:



>Book Answer:

3. During training, how can you tell that your input pipeline is the bottleneck? What can you do to fix it?

>My Answer:



>Book Answer:

4. Can you save any binary data to the TFRecord file, or only serialized protocol buffers?

>My Answer:



>Book Answer:

5. Why would you go through the hassle of converting all your data to the Example protobuf format? Why not use your own protobuf definition?

>My Answer:



>Book Answer:

6. When using TFRecords, when would you want to activate compression? Why not do it systematically?

>My Answer:



>Book Answer:

7. Data can be preprocessed directly when writing the data files, or within the tf.data pipeline, or in preprocessing layers within your model, or using TF Transform. Can you list a few pros and cons of each option?

>My Answer:



>Book Answer:

8. Name a few common techniques you can use to encode categorical features. What about text?

>My Answer:



>Book Answer:

9. Load the Fashion MNIST dataset; split it into a training set, a validation set, and a test set; shuffle the training set; and save each dataset to multiple TFRecord files. Each record should be a serialized Example protobuf with two features: the serialized image (use tf.io.serialize_tensor() to serialize each image), and the label. Then use tf.data to create an efficient dataset for each set. Finally, train a Keras model on these datasets, including a preprocessing layer to standardize each input feature. Try to make the input pipeline as efficient as possible, using TensorBoard to visualize profiling data.

>My Answer:



>Book Answer: