# Loading & Preprocessing Data with TensorFlow

So far, we've used only datasets that fit in memory, but deep learning systems are often trained on very large datasets that will not fit in RAM. Ingesting a large dataset & preprocessing it efficiently can be tricky to implement with other deep learning libraries, but TensorFlow makes it easy thanks to the *Data API*: you just create a dataset object, & tell it where to get the data & how to transform it. TensorFlow takes care of all the implementation details, such as multithreading, queuing, batching, & prefetching. Moreover, the Data API works seamlessly with tf.keras.

Off the shelf, the data API can read from text files (such as CSV files), binary files with fixed-size records, & binary files that use TensorFlow's tfrecord format, which supports records of varying sizes. TFRecord is a flexible & efficient format usually containing protocol buffers (an open source binary format). The data API also supports reading from SQL databases. Moreover, many open source extensions are available to read from all sorts of data sources, such as Google's BigQuery service.

Reading huge datasets efficiently is not the only difficulty: the data also needs to be preprocessed, usually normalised. Moreover, it is not always composed strictly of convenient numerical fields: there may be text features, categorical features, & so on. These need to be encoded, for example using one-hot encoding, bag-of-words encoding, or *embeddings* (as we will see, an embedding is a trainable dense vector that represents a category or token). One option to handle all this preprocessing is to write your own custom preprocessing layers. Another is to use the standard preprocessing layers provided by keras.

In this lesson, we will learn about the data API, the tfrecord format, & how to create custom preprocessing layers & use the standard keras ones. We will also take a quick look at a few related projects from TensorFlow's ecosystem:

* *TF Transform* (`tf.Transform`)
   - Makes it possible to write a single preprocessing function that can be run in batch mode on your full training set, before training (to speed it up), & then exported to a tf function & incorporated into your trained model so that once it is deployed in production, it can take care of preprocessing new instances on the fly.
* *TF Datasets* (TFDS)
   - Provides a convenient function to download many common datasets of all kinds, including large ones like imagenet, as well as convenient dataset objects to manipulate them using the data API.

---

# The Data API

The whole data API revolves around the concept of a *dataset*: as you might suspect, this represents a sequence of data items. Usually, you will use datasets that gradually read data from disk, but for simplicity let's create a dataset entirely in RAM using `tf.data.Dataset.from_tensor_slices()`:

In [2]:
import tensorflow as tf

X = tf.range(10)
dataset = tf.data.Dataset.from_tensor_slices(X)
dataset



<_TensorSliceDataset element_spec=TensorSpec(shape=(), dtype=tf.int32, name=None)>

The `from_tensor_slices()` function takes a tensor & creates a `tf.data.Dataset` whose elements are all the slices of `X` (along the first dimension), so this dataset contains 10 items: tensors 0, 1, 2, ..., 9. In this case, we would have obtained almost the same dataset if we had used `tf.data.Dataset.range(10)` (the only difference is that its elements would be 64-bit integers instead of 32-bit integers).

You can simply iterate over a dataset's items like so:

In [4]:
for item in dataset:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)




## Chaining Transformations

Once you have a dataset, you can apply all sorts of transformations to it by calling its transformation methods. Each method returns a new dataset, so you can chain transformations like this:

In [6]:
dataset = dataset.repeat(3).batch(7)
for item in dataset:
    print(item)

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int32)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int32)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int32)
tf.Tensor([8 9], shape=(2,), dtype=int32)




<img src = "Images/Chaining Transformations.png" width = "600" style = "margin:auto"/>

In this example, we first call the `repeat()` method on the original dataset, & it returns a new dataset that will repeat the items of the original dataset three times. Of course, this will not copy all the data in memory three times! (If you call this method with no arguments, the new dataset will repeat the source dataset forever, so the code that iterates over the dataset will have to decide when to stop). Then we call the `batch()` method on this new dataset, & again this creates a new dataset. This one will group the items of the previous dataset in batches of seven items. Finally, we iterate over the items of this final dataset. As you can see, the `batch()` method had to output a final batch of size two instead of seven, but you can call it with `drop_remainder = True` if you want it to drop this final batch so that all batches have the exact same size.
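As a quick check of the `drop_remainder` behaviour described above, here is a minimal sketch. It rebuilds the pipeline from `tf.data.Dataset.range(10)` (an assumption made only so that the snippet is self-contained) & collects the shape of each batch:

```python
import tensorflow as tf

# Same pipeline as above, rebuilt from scratch so the snippet is self-contained
dataset = tf.data.Dataset.range(10).repeat(3).batch(7, drop_remainder=True)

# 30 items at 7 per batch give 4 full batches; the 2 leftover items are dropped
batch_shapes = [batch.shape.as_list() for batch in dataset]
print(batch_shapes)
```

Every batch now has the same static shape, which can matter when a downstream model expects a fixed batch size.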

You can also transform the items by calling the `map()` method. For example, this creates a new dataset with all items doubled.

In [8]:
dataset = dataset.map(lambda x: x * 2)

This function is the one you will call to apply any preprocessing you want to your data. Sometimes this will include computations that can be quite intensive, such as reshaping or rotating an image, so you will usually want to spawn multiple threads to speed things up: it's as simple as setting the `num_parallel_calls` argument. Note that the function you pass to the `map()` method must be convertible to a tf function.
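The following sketch shows what setting `num_parallel_calls` can look like. It assumes a TensorFlow version that exposes `tf.data.AUTOTUNE` (older versions only have `tf.data.experimental.AUTOTUNE`), & the toy dataset is made up for illustration:

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(5)

# AUTOTUNE lets tf.data choose the level of parallelism at runtime
doubled = dataset.map(lambda x: x * 2, num_parallel_calls=tf.data.AUTOTUNE)

# map() is deterministic by default, so the original order is preserved
values = [int(item.numpy()) for item in doubled]
print(values)
```

For a function this cheap, parallelism brings no benefit; it starts to pay off when the mapped function does real work, such as decoding or augmenting images.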

While the `map()` method applies a transformation to each item, the `apply()` method applies a transformation to the dataset as a whole. For example, the following code applies the `unbatch()` function to the dataset (this experimental function is now deprecated: as the warning printed below the code indicates, recent TensorFlow versions provide an equivalent `unbatch()` method directly on the dataset). Each item in the new dataset will be a single-integer tensor instead of a batch of seven integers:

In [10]:
dataset = dataset.apply(tf.data.experimental.unbatch())

Instructions for updating:
Use `tf.data.Dataset.unbatch()`.


It is also possible to simply filter the dataset using the `filter()` method:

In [12]:
dataset = dataset.filter(lambda x: x < 10)

You will often want to look at just a few items from a dataset. You can use the `take()` method for that:

In [14]:
for item in dataset.take(3):
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)




## Shuffling the Data

As you know, gradient descent works best when the instances in the training set are independent & identically distributed. A simple way to ensure this is to shuffle the instances, using the `shuffle()` method. It will create a new dataset that will start by filling up a buffer with the first items of the source dataset. Then, whenever it is asked for an item, it will pull one out randomly from the buffer & replace it with a fresh one from the source dataset, until it has iterated entirely through the source dataset. At this point, it continues to pull out items randomly from the buffer until it is empty. You must specify the buffer size, & it is important to make it large enough, or else shuffling will not be very effective. Just don't exceed the amount of RAM you have, & even if you have plenty of it, there's no need to go beyond the dataset's size. You can provide a random seed if you want the same random order every time you run your program. For example, the following code creates & displays a dataset containing the integers 0 to 9, repeated 3 times, shuffled using a buffer size of 5 & a random seed of 42, & batched with a batch size of 7:

In [16]:
dataset = tf.data.Dataset.range(10).repeat(3)
dataset = dataset.shuffle(buffer_size = 5, seed = 42).batch(7)
for item in dataset:
    print(item)

tf.Tensor([0 2 3 6 7 9 4], shape=(7,), dtype=int64)
tf.Tensor([5 0 1 1 8 6 5], shape=(7,), dtype=int64)
tf.Tensor([4 8 7 1 2 3 0], shape=(7,), dtype=int64)
tf.Tensor([5 4 2 7 8 9 9], shape=(7,), dtype=int64)
tf.Tensor([3 6], shape=(2,), dtype=int64)
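To make the buffer mechanism described above more concrete, here is a pure-Python sketch of the same idea. It is only an illustration, not TensorFlow's implementation, & the name `shuffle_with_buffer` is hypothetical:

```python
import random

def shuffle_with_buffer(items, buffer_size, seed=None):
    """Yield items in a buffer-limited random order (illustrative sketch)."""
    rng = random.Random(seed)
    source = iter(items)
    buffer = []
    # Step 1: fill the buffer with the first items of the source
    for item in source:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    # Step 2: emit a random buffered item & replace it with a fresh one
    for item in source:
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    # Step 3: the source is exhausted, so drain the buffer in random order
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))

shuffled = list(shuffle_with_buffer(range(10), buffer_size=5, seed=42))
print(shuffled)
```

Note that the first item emitted can only come from the first `buffer_size` items of the source: this is why a small buffer shuffles poorly, & why a buffer of size 1 does not shuffle at all.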




For a large dataset that does not fit in memory, this simple shuffling-buffer approach may not be sufficient, since the buffer will be small compared to the dataset. One solution is to shuffle the source data itself. This will definitely improve shuffling a lot. Even if the source data is shuffled, you will usually want to shuffle it some more, or else the same order will be repeated at each epoch, & the model may end up being biased (e.g., due to some spurious patterns present by chance in the source data's order). To shuffle the instances some more, a common approach is to split the source data into multiple files, then read them in a random order during training. However, instances located in the same file will still end up close to each other. To avoid this, you can pick multiple files randomly & read them simultaneously, interleaving their records. Then on top of that, you can add a shuffling buffer using the `shuffle()` method. If all this sounds like a lot of work, don't worry: the data API makes all this possible in just a few lines of code. Let's see how to do this.

### Interleaving Lines from Multiple Files

First, let's suppose that you've loaded the California housing dataset, shuffled it, & split it into a training set, a validation set, & a test set. Then you split each set into many CSV files that each look like this (each row contains eight input features plus the target median house value):

In [28]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()
X_train, X_test, y_train, y_test = train_test_split(housing.data, 
                                                    housing.target.reshape(-1, 1))
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train)

scaler = StandardScaler()
scaler.fit(X_train)
X_mean = scaler.mean_
X_std = scaler.scale_


In [None]:
import os
import numpy as np

def save_to_multiple_csv_files(data, name_prefix, header=None, n_parts=10):
    housing_dir = os.path.join("datasets", "housing")
    os.makedirs(housing_dir, exist_ok=True)
    path_format = os.path.join(housing_dir, "my_{}_{:02d}.csv")

    filepaths = []
    m = len(data)
    for file_idx, row_indices in enumerate(np.array_split(np.arange(m), n_parts)):
        part_csv = path_format.format(name_prefix, file_idx)
        filepaths.append(part_csv)
        with open(part_csv, "wt", encoding="utf-8") as f:
            if header is not None:
                f.write(header)
                f.write("\n")
            for row_idx in row_indices:
                f.write(",".join([repr(col) for col in data[row_idx]]))
                f.write("\n")
    return filepaths

Let's also suppose `train_filepaths` contains the list of training file paths (& you also have `val_filepaths` & `test_filepaths`):

In [None]:
train_data = np.c_[X_train, y_train]
val_data = np.c_[X_val, y_val]
test_data = np.c_[X_test, y_test]
header_cols = housing.feature_names + ["MedianHouseValue"]
header = ",".join(header_cols)

train_filepaths = save_to_multiple_csv_files(train_data, "train", header, n_parts = 20)
val_filepaths = save_to_multiple_csv_files(val_data, "valid", header, n_parts = 10)
test_filepaths = save_to_multiple_csv_files(test_data, "test", header, n_parts = 10)

Alternatively, you could use file patterns; for example, `train_filepaths = "datasets/housing/my_train_*.csv"`. Now let's create a dataset containing only these file paths:

In [None]:
filepath_dataset = tf.data.Dataset.list_files(train_filepaths, seed = 42)

By default, the `list_files()` function returns a dataset that shuffles the file paths. In general this is a good thing, but you can set `shuffle = False` if you do not want that for some reason.

Next, you can call the `interleave()` method to read from five files at a time & interleave their lines (skipping the first line of each file, which is the header row, using the `skip()` method):

In [None]:
n_readers = 5
dataset = filepath_dataset.interleave(
    lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
    cycle_length = n_readers
)

The `interleave()` method will create a dataset that will pull five file paths from the `filepath_dataset`, & for each one it will call the function you gave it (a lambda in this example) to create a new dataset (in this case a `TextLineDataset`). To be clear, at this stage there will be seven datasets in all: the filepath dataset, the interleave dataset, & the five `TextLineDatasets` created internally by the interleave dataset. When we iterate over the interleave dataset, it will cycle through those five `TextLineDatasets`, reading one line at a time from each until all datasets are out of items. Then it will get the next five file paths from the `filepath_dataset` & interleave them the same way, & so on until it runs out of file paths.
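The interleaving order is easier to see on a tiny in-memory example. In the sketch below, each row of a made-up string tensor stands in for one file (an assumption made to avoid touching the disk), & `cycle_length = 3` makes `interleave()` read one "line" from each of the three in turn:

```python
import tensorflow as tf

# Three pretend "files", each holding two lines (hypothetical data)
files = tf.constant([["a1", "a2"], ["b1", "b2"], ["c1", "c2"]])
file_dataset = tf.data.Dataset.from_tensor_slices(files)

# Each element (one "file") is turned into its own dataset of lines;
# interleave() then cycles through these datasets one line at a time
lines = file_dataset.interleave(
    lambda file_lines: tf.data.Dataset.from_tensor_slices(file_lines),
    cycle_length=3)

result = [line.numpy().decode() for line in lines]
print(result)
```

The first line of every "file" comes out before any second line, which is exactly what keeps neighbouring records of one file apart.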

By default, `interleave()` does not use parallelism; it just reads one line at a time from each file, sequentially. If you want it to actually read files in parallel, you can set the `num_parallel_calls` argument to the number of threads you want (note that the `map()` method also has this argument). You can even set it to `tf.data.experimental.AUTOTUNE` (also exposed as `tf.data.AUTOTUNE` in recent versions) to make TensorFlow choose the right number of threads dynamically based on the available CPU. Let's look at what the dataset contains now:

In [None]:
for line in dataset.take(5):
    print(line.numpy())

These are the first rows (ignoring the header row) of five CSV files, chosen randomly. Looks good so far, but as you can see, these are just byte strings; we need to parse them & scale the data.

## Preprocessing the Data

Let's implement a small function that will perform this preprocessing:

In [None]:
n_inputs = 8

@tf.function
def preprocess(line):
    defs = [0.0] * n_inputs + [tf.constant([], dtype = tf.float32)]
    fields = tf.io.decode_csv(line, record_defaults = defs)
    x = tf.stack(fields[:-1])
    y = tf.stack(fields[-1:])
    return (x - X_mean) / X_std, y

Let's walk through this code:

* First, the code assumes that we have precomputed the mean & standard deviation of each feature in the training set. `X_mean` & `X_std` are just 1D tensors (or numpy arrays) containing eight floats, one per input feature.
* The `preprocess()` function takes one CSV line & starts by parsing it. For this it uses the `tf.io.decode_csv()` function, which takes two arguments: the first is the line to parse, & the second is an array containing the default value for each column in the CSV file. This array tells TensorFlow not only the default value for each column, but also the number of columns & their types. In this example, we tell it that all feature columns are floats & that missing values should default to 0, but we provide an empty array of type `tf.float32` as the default value for the last column (the target): the array tells TensorFlow that this column contains floats, but that there is no default value, so it will raise an exception if it encounters a missing value.
* The `decode_csv()` function returns a list of scalar tensors (one per column), but we need to return 1D tensor arrays. So we call `tf.stack()` on all tensors except for the last one (the target): this will stack these tensors into a 1D array. We then do the same for the target value (this makes it a 1D tensor array with a single value, rather than a scalar tensor).
* Finally, we scale the input features by subtracting the feature means & then dividing by the feature standard deviations, & we return a tuple containing the scaled features & the target.
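The role of the default values in the walkthrough above can be checked on a shorter, made-up record. This sketch assumes a three-column layout (two optional float columns & one required float column) instead of the nine housing columns:

```python
import tensorflow as tf

# Two float columns that default to 0.0, plus a required float column
defs = [0.0, 0.0, tf.constant([], dtype=tf.float32)]

# The second field is empty, so it falls back to its default value
fields = tf.io.decode_csv("1.5,,3.0", record_defaults=defs)
values = [float(field.numpy()) for field in fields]
print(values)

# The last field is empty but has no default, so parsing fails
try:
    tf.io.decode_csv("1.5,2.5,", record_defaults=defs)
    raised = False
except tf.errors.InvalidArgumentError:
    raised = True
print(raised)
```

Leaving the target without a default is a deliberate choice: silently replacing a missing label with 0 would corrupt training, whereas an exception surfaces the bad row immediately.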

Let's test this preprocessing function:

In [None]:
preprocess(b"4.2083,44.0,5.3232,0.9181,846,0.2,3370,37.47,-122.2,2.782")

Looks good! We can now apply the function to the dataset.

## Putting Everything Together

To make the code reusable, let's put together everything we have discussed so far into a small helper function: it will create & return a dataset that will efficiently load California housing data from multiple CSV files, preprocess it, shuffle it, optionally repeat it, & batch it.

In [None]:
def csv_reader_dataset(filepaths, repeat = 1, n_readers = 5, n_read_threads = None,
                       shuffle_buffer_size = 10000, n_parse_threads = 5, batch_size = 32):
    dataset = tf.data.Dataset.list_files(filepaths)
    dataset = dataset.interleave(
        lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
        cycle_length = n_readers, 
        num_parallel_calls = n_read_threads
    )
    dataset = dataset.map(preprocess, num_parallel_calls = n_parse_threads)
    dataset = dataset.shuffle(shuffle_buffer_size).repeat(repeat)
    return dataset.batch(batch_size).prefetch(1)

Everything should make sense in this code, except the very last line (`prefetch(1)`), which is important for performance.

<img src = "Images/Load & Preprocess Data From Multiple CSV Files.png" width = "600" style = "margin:auto"/>

## Prefetching

By calling `prefetch(1)` at the end, we are creating a dataset that will do its best to always be one batch ahead. In other words, while our training algorithm is working on one batch, the dataset will already be working in parallel on getting the next batch ready (e.g., reading the data from disk & preprocessing it). This can improve performance dramatically, as illustrated below.

<img src = "Images/Prefetching.png" width = "500" style = "margin:auto"/>

If we also ensure that loading & preprocessing are multithreaded (by setting `num_parallel_calls` when calling `interleave()` & `map()`), we can exploit multiple cores on the CPU & hopefully make preparing one batch of data shorter than running a training step on the GPU: this way the GPU will be almost 100% utilised (except for the data transfer time from the CPU to the GPU), & training will run much faster.
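The idea behind prefetching can be sketched in plain Python with a background thread & a bounded queue. This is only an illustration of the principle, not how TensorFlow implements it, & the name `prefetch` below is a hypothetical helper:

```python
import queue
import threading

def prefetch(iterable, buffer_size=1):
    """Yield items from iterable while a thread prepares the next ones."""
    buffer = queue.Queue(maxsize=buffer_size)
    end_marker = object()  # signals that the producer is done

    def producer():
        for item in iterable:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(end_marker)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = buffer.get()  # blocks until the next item is ready
        if item is end_marker:
            return
        yield item

batches = list(prefetch(range(5), buffer_size=1))
print(batches)
```

While the consumer works on one item, the producer thread is already preparing the next one; the bounded queue keeps it from running more than `buffer_size` items ahead.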

If the dataset is small enough to fit in memory, you can significantly speed up training by using the dataset's `cache()` method to cache its content to RAM. You should generally do this after loading & preprocessing the data, but before shuffling, repeating, batching, & prefetching. This way, each instance will only be read & preprocessed once (instead of once per epoch), but the data will still be shuffled differently at each epoch, & the next batch will still be prepared in advance.
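A minimal sketch of the ordering recommended above, using a tiny made-up dataset in place of the housing data: `cache()` comes right after the preprocessing step, & shuffling, batching & prefetching come after it.

```python
import tensorflow as tf

dataset = (
    tf.data.Dataset.range(6)
    .map(lambda x: x * 2)    # stands in for loading & preprocessing
    .cache()                 # keep the preprocessed items in RAM
    .shuffle(buffer_size=6)  # reshuffled at each iteration by default
    .batch(3)
    .prefetch(1)
)

# Iterate twice (two "epochs"): both passes contain the same items
epochs = [sorted(int(v) for batch in dataset for v in batch.numpy())
          for _ in range(2)]
print(epochs)
```

Placing `cache()` after `shuffle()` instead would freeze a single shuffled order, which is usually not what you want.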

You now know how to build efficient input pipelines to load & preprocess data from multiple text files. We have discussed the most common dataset methods, but there are a few more you may want to look at: `concatenate()`, `zip()`, `window()`, `reduce()`, `shard()`, `flat_map()`, & `padded_batch()`. There are also a couple more class methods: `from_generator()` & `from_tensors()`, which create a new dataset from a Python generator or a list of tensors, respectively. Check out the API documentation for more details.

## Using the Dataset with tf.keras

Now we can use the `csv_reader_dataset()` function to create a dataset for the training set. Note that we do not need to repeat it, as this will be taken care of by tf.keras. We also create datasets for the validation set & the test set:

In [None]:
train_set = csv_reader_dataset(train_filepaths)
val_set = csv_reader_dataset(val_filepaths)
test_set = csv_reader_dataset(test_filepaths)

Now we can simply build & train a keras model using these datasets. All we need to do is pass the training & validation datasets to the `fit()` method, instead of `X_train`, `y_train`, `X_val`, & `y_val`:

In [None]:
from tensorflow import keras

keras.backend.clear_session()

model = keras.models.Sequential([
    keras.layers.Dense(30, activation = "relu", input_shape = X_train.shape[1:]),
    keras.layers.Dense(1)
])

model.compile(loss = "mse", optimizer = keras.optimizers.SGD(learning_rate = 1e-3))
batch_size = 32
# train_set is finite (repeat = 1): keras iterates over it once per epoch,
# so there is no need to set steps_per_epoch
model.fit(train_set, epochs = 10, validation_data = val_set)
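The claim that a dataset does not need to be repeated for `fit()` can be checked on synthetic data. The sketch below makes up a small regression problem (an assumption, since the housing CSV files are not available here) & trains for three epochs on a finite dataset:

```python
import numpy as np
import tensorflow as tf

# Synthetic regression data (made up for this sketch): y = 3x + noise
rng = np.random.default_rng(42)
X = rng.normal(size=(64, 1)).astype(np.float32)
y = 3.0 * X + rng.normal(scale=0.1, size=(64, 1)).astype(np.float32)

# A finite dataset: there is no repeat() call
dataset = tf.data.Dataset.from_tensor_slices((X, y)).shuffle(64).batch(16)

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(loss="mse", optimizer="sgd")

# Without steps_per_epoch, each epoch is one full pass over the dataset
history = model.fit(dataset, epochs=3, verbose=0)
print(len(history.history["loss"]))
```

One loss value is recorded per epoch, which shows that the dataset was iterated over once for each of them.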

Similarly, we can pass a dataset to the `evaluate()` & `predict()` methods:

In [None]:
model.evaluate(test_set)
new_set = test_set.take(3).map(lambda X, y: X)
model.predict(new_set) # new_set is already batched, so no steps argument is needed

Unlike the other sets, the `new_set` will usually not contain labels (if it does, keras will ignore them). Note that in all these cases, you can still use numpy arrays instead of datasets if you want (but of course they need to have been loaded & preprocessed first).

If you want to build your own custom training loop, you can just iterate over the training set, very naturally:

In [None]:
optimizer = keras.optimizers.Nadam(learning_rate = 0.01)
loss_fn = keras.losses.mean_squared_error

n_epochs = 5
batch_size = 32
n_steps_per_epoch = len(X_train) // batch_size
total_steps = n_epochs * n_steps_per_epoch
global_step = 0

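# train_set is finite, so repeat it to cover every epoch before taking total_steps batches
for X_batch, y_batch in train_set.repeat(n_epochs).take(total_steps):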
    global_step += 1
    print("\rGlobal step {}/{}".format(global_step, total_steps), end = "")
    with tf.GradientTape() as tape:
        y_pred = model(X_batch)
        main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
        loss = tf.add_n([main_loss] + model.losses)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

In fact, it is even possible to create a tf function that performs the whole training loop:

In [51]:
keras.backend.clear_session()
optimizer = keras.optimizers.Nadam(learning_rate = 0.01)
loss_fn = keras.losses.mean_squared_error

@tf.function
def train(model, n_epochs, batch_size = 32, n_readers = 5, n_read_threads = 5,
          shuffle_buffer_size = 10000, n_parse_threads = 5):
    train_set = csv_reader_dataset(train_filepaths, repeat = n_epochs, n_readers = n_readers,
                                   n_read_threads = n_read_threads, 
                                   shuffle_buffer_size = shuffle_buffer_size,
                                   n_parse_threads = n_parse_threads, batch_size = batch_size)
    for X_batch, y_batch in train_set:
        with tf.GradientTape() as tape:
            y_pred = model(X_batch)
            main_loss = tf.reduce_mean(loss_fn(y_batch, y_pred))
            loss = tf.add_n([main_loss] + model.losses)
        gradients = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(gradients, model.trainable_variables))

train(model, 5)


Congratulations, you now know how to build powerful input pipelines using the data API. However, so far, we have used CSV files, which are common, simple & convenient but not really efficient, & do not support large or complex data structures (such as images or audio) very well. Let's see how to use tfrecords instead.

---

# The TFRecord Format

The tfrecord format is TensorFlow's preferred format for storing large amounts of data & reading it efficiently. It is a very simple binary format that just contains a sequence of binary records of varying sizes (each record is comprised of a length, a CRC checksum to check that the length was not corrupted, then the actual data, & finally a CRC checksum for the data). You can easily create a tfrecord file using the `tf.io.TFRecordWriter` class:

In [57]:
with tf.io.TFRecordWriter("my_data.tfrecord") as f:
    f.write(b"This is my first record")
    f.write(b"& this is the second record")

You can then use a `tf.data.TFRecordDataset` to read one or more tfrecord files:

In [60]:
filepaths = ["my_data.tfrecord"]
dataset = tf.data.TFRecordDataset(filepaths)
for item in dataset:
    print(item)

tf.Tensor(b'This is my first record', shape=(), dtype=string)
tf.Tensor(b'& this is the second record', shape=(), dtype=string)
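To see the record layout described at the start of this section, the following sketch writes a small file with `tf.io.TFRecordWriter`, then reads it back with plain Python. It assumes an 8-byte little-endian length & 4-byte checksums, as laid out in TensorFlow's documentation, & it skips the checksums rather than verifying them; the file name is made up:

```python
import os
import struct
import tempfile

import tensorflow as tf

path = os.path.join(tempfile.mkdtemp(), "layout_demo.tfrecord")
with tf.io.TFRecordWriter(path) as f:
    f.write(b"This is the first record")
    f.write(b"& this is the second record")

records = []
with open(path, "rb") as f:
    while True:
        header = f.read(8)
        if len(header) < 8:
            break  # end of file
        (length,) = struct.unpack("<Q", header)  # uint64 record length
        f.read(4)                                # checksum of the length (skipped)
        records.append(f.read(length))           # the record's data
        f.read(4)                                # checksum of the data (skipped)
print(records)
```

In practice you would always read such files with `tf.data.TFRecordDataset`, which also verifies the checksums; the manual reader is only there to show how little structure the format imposes.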




## Compressed TFRecord Files

It can sometimes be useful to compress your tfrecord files, especially if they need to be loaded via a network connection. You can create a compressed tfrecord file by setting the `options` argument:

In [63]:
options = tf.io.TFRecordOptions(compression_type = "GZIP")
with tf.io.TFRecordWriter("my_compressed.tfrecord", options) as f:
    f.write(b"This is the first record")
    f.write(b"& this is the second record")

When reading a compressed tfrecord file, you need to specify the compression type:

In [66]:
dataset = tf.data.TFRecordDataset(["my_compressed.tfrecord"],
                                  compression_type = "GZIP")

## A Brief Introduction to Protocol Buffers

Even though each record can use any binary format you want, tfrecord files usually contain serialised protocol buffers (also called *protobufs*). This is a portable, extensible, & efficient binary format developed at Google back in 2001 & made open source in 2008; protobufs are now widely used, in particular in gRPC, Google's remote procedure call system. They are defined using a simple language that looks like this:

In [None]:
syntax = "proto3";
message Person {
    string name = 1;
    int32 id = 2;
    repeated string email = 3;
}

This definition says we are using version 3 of the protobuf format, & it specifies that each `Person` object may (optionally) have a `name` of type `string`, an `id` of type `int32`, & zero or more `email` fields, each of type `string`. The numbers 1, 2, & 3 are the field identifiers: they will be used in each record's binary representation. Once you have a definition in a *.proto* file, you can compile it. This requires `protoc`, the protobuf compiler, to generate access classes in Python (or some other language). Note that the protobuf definitions we will use have already been compiled for you, & their Python classes are part of TensorFlow, so you will not need to use `protoc`. All you need to know is how to use protobuf access classes in Python. To illustrate the basics, let's look at a simple example that uses the access classes generated for the `Person` protobuf:

In [74]:
%%writefile person.proto
syntax = "proto3";
message Person {
  string name = 1;
  int32 id = 2;
  repeated string email = 3;
}

Writing person.proto


In [80]:
!protoc person.proto --python_out=. --descriptor_set_out=person.desc --include_imports

In [82]:
!ls person*

person.desc   person.proto  person_pb2.py


In [84]:
from person_pb2 import Person # Import the generated access class

person = Person(name = "Al", id = 123, email = ["a@b.com"]) # Create a person
print(person)

name: "Al"
id: 123
email: "a@b.com"



In [88]:
person.name # Read a field

'Al'

In [90]:
person.name = "Alice" # Modify a field

In [92]:
person.email[0] # Repeated fields can be accessed like arrays

'a@b.com'

In [94]:
person.email.append("c@d.com") # Add an email address

In [96]:
s = person.SerializeToString() # Serialize the object to a byte string
s

b'\n\x05Alice\x10{\x1a\x07a@b.com\x1a\x07c@d.com'

In [98]:
person2 = Person() # Create a new person

In [100]:
person2.ParseFromString(s) # Parse the byte string (27 bytes long)

27

In [102]:
person == person2 # Now they are equal

True

In short, we import the `Person` class generated by `protoc`, we create an instance & play with it, visualising it & reading & writing some fields, then we serialise it using the `SerializeToString()` method. This is the binary data that is ready to be saved or transmitted over the network. When reading or receiving this binary data, we can parse it using the `ParseFromString()` method, & we get a copy of the object that was serialised.
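The field identifiers 1, 2 & 3 from the `Person` definition are visible in the serialised bytes shown above. The following pure-Python sketch decodes them by hand; it only handles the two wire types this message uses (varints & length-delimited values), & the function names are hypothetical:

```python
def read_varint(data, pos):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:  # the high bit is clear on the last byte
            return value, pos
        shift += 7

def decode_fields(data):
    """Return (field_number, value) pairs found in a serialised message."""
    fields, pos = [], 0
    while pos < len(data):
        tag, pos = read_varint(data, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:    # varint, e.g. an int32 field
            value, pos = read_varint(data, pos)
        elif wire_type == 2:  # length-delimited, e.g. a string field
            length, pos = read_varint(data, pos)
            value = data[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields.append((field_number, value))
    return fields

s = b'\n\x05Alice\x10{\x1a\x07a@b.com\x1a\x07c@d.com'
decoded = decode_fields(s)
print(decoded)
```

Field 3 appears twice because `email` is a repeated field, & the field names themselves are never stored: only the numeric identifiers are, which is part of what keeps the format compact.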

We could save the serialised `Person` object to a tfrecord file, then we could load & parse it: everything would work fine. However, `SerializeToString()` & `ParseFromString()` are not TensorFlow operations (& neither are the other operations in this code), so they cannot be included in a TensorFlow function (except by wrapping them in a `tf.py_function()` operation, which would make the code slower & less portable). Fortunately, TensorFlow does include special protobuf definitions for which it provides parsing operations.

## TensorFlow Protobufs

The main protobuf typically used in a tfrecord file is the `Example` protobuf, which represents one instance in a dataset. It contains a list of named features, where each feature can either be a list of byte strings, a list of floats, or a list of integers. Here is the protobuf definition:

In [None]:
syntax = "proto3";
message BytesList {repeated bytes value = 1;}
message FloatList {repeated float value = 1 [packed = true];}
message Int64List {repeated int64 value = 1 [packed = true];}
message Feature {
    oneof kind {
        BytesList bytes_list = 1;
        FloatList float_list = 2;
        Int64List int64_list = 3;
    }
}
message Features {map<string, Feature> feature = 1;}
message Example {Features features = 1;}

The definitions of `BytesList`, `FloatList`, & `Int64List` are straightforward enough. Note that `[packed = true]` is used for repeated numerical fields, for a more efficient encoding. A `Feature` contains either a `BytesList`, a `FloatList`, or an `Int64List`. A `Features` (with an `s`) contains a dictionary that maps a feature name to the corresponding feature value. Finally, an `Example` contains only a `Features` object. Here is how you could create a `tf.train.Example` representing the same person as earlier & write it to a tfrecord file:

In [109]:
from tensorflow.train import BytesList, FloatList, Int64List
from tensorflow.train import Feature, Features, Example

person_example = Example(features = Features(
    feature = {"name": Feature(bytes_list = BytesList(value = [b"Alice"])),
               "id": Feature(int64_list = Int64List(value = [123])),
               "emails": Feature(bytes_list = BytesList(value = [b"a@b.com",
                                                                 b"c@d.com"]))}
))

The code is a bit verbose & repetitive, but it's rather straightforward (& you could easily wrap it inside a small helper function). Now that we have an `Example` protobuf, we can serialize it by calling its `SerializeToString()` method, then write the resulting data to a tfrecord file:

In [114]:
with tf.io.TFRecordWriter("my_contacts.tfrecord") as f:
    f.write(person_example.SerializeToString())
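The small helper function mentioned above could look something like the following sketch. The names `to_feature` & `make_example` are hypothetical, & the helper assumes that every value is a non-empty list of `bytes`, `float`, or `int` items:

```python
import tensorflow as tf

def to_feature(values):
    """Wrap a non-empty list of bytes, floats, or ints in a Feature."""
    first = values[0]
    if isinstance(first, bytes):
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))
    if isinstance(first, float):
        return tf.train.Feature(float_list=tf.train.FloatList(value=values))
    if isinstance(first, int):
        return tf.train.Feature(int64_list=tf.train.Int64List(value=values))
    raise TypeError(f"unsupported feature type: {type(first).__name__}")

def make_example(**fields):
    """Build an Example from keyword arguments mapping names to lists."""
    feature = {name: to_feature(values) for name, values in fields.items()}
    return tf.train.Example(features=tf.train.Features(feature=feature))

example = make_example(name=[b"Alice"], id=[123],
                       emails=[b"a@b.com", b"c@d.com"])
print(example)
```

Inferring the list type from the first item keeps the call site short, at the cost of rejecting empty lists; a stricter version could take the type as an explicit argument.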

Normally you would write much more than one `Example`! Typically, you would create a conversion script that reads from your current format (say, CSV files), creates an `Example` protobuf for each instance, serialises them, & saves them to several tfrecord files, ideally shuffling them in the process. This requires a bit of work, so once again, make sure it is really necessary (perhaps your pipeline works fine with CSV files). 

Now that we have a nice tfrecord file containing a serialised `Example`, let's try to load it.

## Loading & Parsing Examples

To load the serialised `Example` protobufs, we will use a `tf.data.TFRecordDataset` once again, & we will parse each `Example` using `tf.io.parse_single_example()`. This is a TensorFlow operation, so it can be included in a tf function. It requires at least two arguments: a string scalar tensor containing the serialised data, & a description of each feature. The description is a dictionary that maps each feature name to either a `tf.io.FixedLenFeature` descriptor indicating the feature's shape, type, & default value, or a `tf.io.VarLenFeature` descriptor indicating only the type (if the length of the feature's list may vary, such as the `"emails"` feature).

The following code defines a description dictionary, then it iterates over the `TFRecordDataset` & parses the serialised `Example` protobuf this dataset contains:

In [118]:
feature_description = {
    "name": tf.io.FixedLenFeature([], tf.string, default_value = ""),
    "id": tf.io.FixedLenFeature([], tf.int64, default_value = 0),
    "emails": tf.io.VarLenFeature(tf.string)
}

for serialised_example in tf.data.TFRecordDataset(["my_contacts.tfrecord"]):
    parsed_example = tf.io.parse_single_example(serialised_example, feature_description)

The fixed-length features are parsed as regular tensors, but the variable-length features are parsed as sparse tensors. You can convert a sparse tensor to a dense tensor using `tf.sparse.to_dense()`, but in this case it is simpler to just access its values:

In [121]:
tf.sparse.to_dense(parsed_example["emails"], default_value = b"")

<tf.Tensor: shape=(2,), dtype=string, numpy=array([b'a@b.com', b'c@d.com'], dtype=object)>

In [123]:
parsed_example["emails"].values

<tf.Tensor: shape=(2,), dtype=string, numpy=array([b'a@b.com', b'c@d.com'], dtype=object)>

A `BytesList` can contain any binary data you want, including any serialised object. For example, you can use `tf.io.encode_jpeg()` to encode an image using the JPEG format & put this binary data in a `BytesList`. Later, when your code reads the tfrecord, it will start by parsing the `Example`, then it will need to call `tf.io.decode_jpeg()` to parse the data & get the original image (or you can use `tf.io.decode_image()`, which can decode any BMP, GIF, JPEG, or PNG image). You can also store any tensor you want in a `BytesList` by serialising the tensor using `tf.io.serialize_tensor()`, then putting the resulting byte string in a `BytesList` feature. Later, when you parse the tfrecord, you can parse this data using `tf.io.parse_tensor()`.
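The tensor round trip described above might look like the following sketch. The feature name `"data"` & the tensor values are arbitrary choices made for the illustration.

```python
import tensorflow as tf

tensor = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

# Serialise the tensor to a scalar string tensor, then store its bytes
# in a BytesList feature ("data" is a hypothetical feature name).
serialized = tf.io.serialize_tensor(tensor)
example = tf.train.Example(features=tf.train.Features(feature={
    "data": tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[serialized.numpy()])),
}))

# Parse the Example back, then decode the byte string into a tensor.
# parse_tensor() needs to be told the dtype of the original tensor.
parsed = tf.io.parse_single_example(
    example.SerializeToString(),
    {"data": tf.io.FixedLenFeature([], tf.string)},
)
restored = tf.io.parse_tensor(parsed["data"], out_type=tf.float32)
```

Note that the dtype is not recovered automatically: you must pass the same `out_type` that the tensor had when it was serialised.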

Instead of parsing examples one by one using `tf.io.parse_single_example()`, you may want to parse them batch by batch using `tf.io.parse_example()`:

In [126]:
dataset = tf.data.TFRecordDataset(["my_contacts.tfrecord"]).batch(10)
for serialised_examples in dataset:
    parsed_examples = tf.io.parse_example(serialised_examples,
                                          feature_description)

As you can see, the `Example` protobuf will probably be sufficient for most use cases. However, it may be a bit cumbersome to use when you are dealing with lists of lists. For example, suppose you want to classify text documents. Each document may be represented as a list of sentences, where each sentence is represented as a list of words. Perhaps each document also has a list of comments, where each comment is represented as a list of words. There may be some contextual data too, such as the document's author, title, & publication date. TensorFlow's `SequenceExample` protobuf is designed for such use cases.

## Handling Lists of Lists Using the SequenceExample Protobuf

Here is the definition of the `SequenceExample` protobuf:

In [None]:
message FeatureList {repeated Feature feature = 1; };
message FeatureLists {map<string, FeatureList> feature_list = 1; };
message SequenceExample {
    Features context = 1;
    FeatureLists feature_lists = 2;
};

A `SequenceExample` contains a `Features` object for the contextual data & a `FeatureLists` object that contains one or more named `FeatureList` objects (e.g., a `FeatureList` named `"content"` & another named `"comments"`). Each `FeatureList` contains a list of `Feature` objects, each of which may be a list of byte strings, a list of 64-bit integers, or a list of floats (in this example, each `Feature` would represent a sentence or comment, perhaps in the form of a list of word identifiers). Building a `SequenceExample`, serialising it, & parsing it is similar to building, serialising, & parsing an `Example`, but you must use `tf.io.parse_single_sequence_example()` to parse a single `SequenceExample` or `tf.io.parse_sequence_example()` to parse a batch. Both functions return a tuple containing the context features (as a dictionary) & the feature lists (also a dictionary). If the feature lists contain sequences of varying sizes (as in the preceding example), you may want to convert them to ragged tensors, using `tf.RaggedTensor.from_sparse()`:

In [137]:
FeatureList = tf.train.FeatureList
FeatureLists = tf.train.FeatureLists
SequenceExample = tf.train.SequenceExample

context = Features(feature = {"author_id": Feature(int64_list = Int64List(value = [123])),
                              "title": Feature(bytes_list = BytesList(value = [b"A", b"desert", b"place", b"."])),
                              "pub_date": Feature(int64_list = Int64List(value = [1623, 12, 25]))})

content = [["When", "shall", "we", "three", "meet", "again", "?"], 
           ["In", "thunder", ",", "lightning", ",", "or", "in", "rain", "?"]]
comments = [["When", "the", "hurlyburly", "'s", "done", "."],
            ["When", "the", "battle", "'s", "lost", "and", "won", "."]]

def words_to_feature(words):
    return Feature(bytes_list = BytesList(value = [word.encode("utf-8")
                                                   for word in words]))

content_features = [words_to_feature(sentence) for sentence in content]
comments_features = [words_to_feature(comment) for comment in comments]

sequence_example = SequenceExample(context = context,
                                   feature_lists = FeatureLists(feature_list = {
                                       "content": FeatureList(feature = content_features),
                                       "comments": FeatureList(feature = comments_features)
                                   }))
sequence_example

context {
  feature {
    key: "author_id"
    value {
      int64_list {
        value: 123
      }
    }
  }
  feature {
    key: "pub_date"
    value {
      int64_list {
        value: 1623
        value: 12
        value: 25
      }
    }
  }
  feature {
    key: "title"
    value {
      bytes_list {
        value: "A"
        value: "desert"
        value: "place"
        value: "."
      }
    }
  }
}
feature_lists {
  feature_list {
    key: "comments"
    value {
      feature {
        bytes_list {
          value: "When"
          value: "the"
          value: "hurlyburly"
          value: "\'s"
          value: "done"
          value: "."
        }
      }
      feature {
        bytes_list {
          value: "When"
          value: "the"
          value: "battle"
          value: "\'s"
          value: "lost"
          value: "and"
          value: "won"
          value: "."
        }
      }
    }
  }
  feature_list {
    key: "content"
    value {
      feature {
      

In [141]:
serialised_sequence_example = sequence_example.SerializeToString()

context_feature_descriptions = {
    "author_id": tf.io.FixedLenFeature([], tf.int64, default_value = 0),
    "title": tf.io.VarLenFeature(tf.string),
    "pub_date": tf.io.FixedLenFeature([3], tf.int64, default_value = [0, 0, 0])
}

sequence_feature_descriptions = {
    "content": tf.io.VarLenFeature(tf.string),
    "comments": tf.io.VarLenFeature(tf.string),
}

parsed_context, parsed_feature_lists = tf.io.parse_single_sequence_example(
    serialised_sequence_example, context_feature_descriptions, sequence_feature_descriptions
)
parsed_content = tf.RaggedTensor.from_sparse(parsed_feature_lists["content"])
parsed_content

<tf.RaggedTensor [[b'When', b'shall', b'we', b'three', b'meet', b'again', b'?'],
 [b'In', b'thunder', b',', b'lightning', b',', b'or', b'in', b'rain', b'?']]>

Now that you know how to efficiently store, load, & parse data, the next step is to prepare it so that it can be fed to a neural network.

---

# Preprocessing the Input Features

Preparing your data for a neural network requires converting all features into numerical features, generally normalising them, & more. In particular, if your data contains categorical features or text features, they need to be converted to numbers. This can be done ahead of time when preparing your data files, using any tool you like (e.g., numpy, pandas, scikit-learn). Alternatively, you can preprocess your data on the fly when loading it with the data API (e.g., using the dataset's `map()` method, as we saw earlier), or you can include a preprocessing layer directly in your model. Let's look at this last option now. 

For example, here is how you can implement a standardisation layer using a `Lambda` layer. For each feature, it subtracts the mean & divides by its standard deviation (plus a tiny smoothing term to avoid divisions by zero):

In [None]:
means = np.mean(X_train, axis = 0, keepdims = True)
stds = np.std(X_train, axis = 0, keepdims = True)
eps = keras.backend.epsilon()
model = keras.models.Sequential([
    keras.layers.Lambda(lambda inputs: (inputs - means) / (stds + eps)),
    [...] # other layers
])

That's not too hard! However, you may prefer to use a nice self-contained custom layer (much like scikit-learn's `StandardScaler`), rather than having global variables like `means` & `stds` dangling around:

In [None]:
class Standardization(keras.layers.Layer):
    def adapt(self, data_sample):
        self.means_ = np.mean(data_sample, axis = 0, keepdims = True)
        self.stds_ = np.std(data_sample, axis = 0, keepdims = True)
    def call(self, inputs):
        return (inputs - self.means_) / (self.stds_ + keras.backend.epsilon())

Before you can use this standardisation layer, you will need to adapt it to your dataset by calling the `adapt()` method & passing it a data sample. This will allow it to use the appropriate mean & standard deviation for each feature:

In [None]:
std_layer = Standardization()
std_layer.adapt(data_sample)

This sample must be large enough to be representative of your dataset, but it does not have to be the full training set: in general, a few hundred randomly selected instances will suffice (however, this depends on your task). Next, you can use this preprocessing layer like a normal layer:

In [None]:
model = keras.Sequential()
model.add(std_layer)
[...] # Create the rest of the model
model.compile([...])
model.fit([...])

If you are thinking that keras should contain a standardisation layer like this one, here's some good news for you: by the time you read this, the `keras.layers.Normalization` layer will probably be available. It will work very much like our custom `Standardization` layer: first, create the layer, then adapt it to your dataset by passing a data sample to the `adapt()` method, & finally use the layer normally.
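For reference, here is a minimal sketch of that create, adapt, & use pattern, assuming a recent TensorFlow version in which `tf.keras.layers.Normalization` is available. The random `data_sample` is made up for the illustration.

```python
import numpy as np
import tensorflow as tf

# Hypothetical data sample: 500 instances, 3 features on very different scales.
rng = np.random.default_rng(42)
data_sample = rng.normal(loc=[10.0, -5.0, 0.5],
                         scale=[2.0, 0.1, 30.0],
                         size=(500, 3)).astype(np.float32)

norm_layer = tf.keras.layers.Normalization()
norm_layer.adapt(data_sample)  # computes each feature's mean & variance

# After adapt(), the layer standardises its inputs feature by feature.
standardised = np.asarray(norm_layer(data_sample))
print(standardised.mean(axis=0), standardised.std(axis=0))
```

Since the same sample was used for adapting & for the check, each feature's mean should come out close to 0 & its standard deviation close to 1.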

Now, let's look at categorical features. We will start by encoding them as one-hot vectors.

## Encoding Categorical Features Using One-Hot Vectors

Consider the `ocean_proximity` feature in the California housing dataset we explored before: it is a categorical feature with five possible values, `"<1H OCEAN"`, `"INLAND"`, `"NEAR OCEAN"`, `"NEAR BAY"`, & `"ISLAND"`. We need to encode this feature before we feed it to a neural network. Since there are very few categories, we can use one-hot encoding. For this, we first need to map each category to its index (0 to 4), which can be done using a lookup table:

In [154]:
vocab = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]
indices = tf.range(len(vocab), dtype = tf.int64)
table_init = tf.lookup.KeyValueTensorInitializer(vocab, indices)
num_oov_buckets = 2
table = tf.lookup.StaticVocabularyTable(table_init, num_oov_buckets)

Let's go through this code:

* We first define the *vocabulary*: this is the list of all possible categories.
* Then we create a tensor with the corresponding indices (0 to 4).
* Next, we create an initializer for the lookup table, passing it the list of categories & their corresponding indices. In this example, we already have this data, so we use a `KeyValueTensorInitializer`; but if the categories were listed in a text file (with one category per line), we would use a `TextFileInitializer` instead.
* In the last two lines, we create the lookup table, giving it the initializer & specifying the number of *out-of-vocabulary* (oov) buckets. If we look up a category that does not exist in the vocabulary, the lookup table will compute a hash of this category & use it to assign the unknown category to one of the oov buckets. Their indices start after the known categories, so in this example, the indices of the two oov buckets are 5 & 6.

Why use oov buckets? Well, if the number of categories is large (e.g., zip codes, cities, words, products, or users) & the dataset is large as well, or it keeps changing, then getting the full list of categories may be inconvenient. One solution is to define the vocabulary based on a data sample rather than the whole training set & add some oov buckets for the other categories that were not in the data sample. The more unknown categories you expect to find during training, the more oov buckets you should use. Indeed, if there are not enough oov buckets, there will be collisions: different categories will end up in the same bucket, so the neural network will not be able to distinguish them (at least not based on this feature).
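To make the collision issue concrete, here is a small pure-Python sketch. It uses MD5 purely as a deterministic stand-in hash: TensorFlow's lookup tables use their own fingerprint function, so the actual bucket assignments will differ. The helper names & the `CITY_i` categories are made up for the illustration.

```python
import hashlib

def oov_bucket(category, vocab_size, num_oov_buckets):
    """Map an unknown category to one of the oov bucket indices.

    The bucket indices start right after the known categories, so they
    range from vocab_size to vocab_size + num_oov_buckets - 1.
    """
    digest = hashlib.md5(category.encode("utf-8")).digest()
    return vocab_size + int.from_bytes(digest[:8], "big") % num_oov_buckets

def count_collisions(categories, vocab_size, num_oov_buckets):
    """Count the categories that land in an already occupied bucket."""
    seen = set()
    collisions = 0
    for category in categories:
        bucket = oov_bucket(category, vocab_size, num_oov_buckets)
        if bucket in seen:
            collisions += 1
        seen.add(bucket)
    return collisions

# Twenty hypothetical unknown categories, hashed into 2, 20, or 200 buckets.
unknown = [f"CITY_{i}" for i in range(20)]
for n in (2, 20, 200):
    print(n, count_collisions(unknown, vocab_size=5, num_oov_buckets=n))
```

With only 2 buckets, at least 18 of the 20 unknown categories must share a bucket with another one; adding buckets leaves more room to spread them out.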

Now let's use the lookup table to encode a small batch of categorical features into one-hot vectors:

In [157]:
categories = tf.constant(["NEAR BAY", "DESERT", "INLAND", "INLAND"])
cat_indices = table.lookup(categories)
cat_indices

<tf.Tensor: shape=(4,), dtype=int64, numpy=array([3, 5, 1, 1])>

In [159]:
cat_one_hot = tf.one_hot(cat_indices, depth = len(vocab) + num_oov_buckets)
cat_one_hot

<tf.Tensor: shape=(4, 7), dtype=float32, numpy=
array([[0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0.]], dtype=float32)>

As you can see, `"NEAR BAY"` was mapped to index 3, the unknown category `"DESERT"` was mapped to one of the two oov buckets (at index 5), & `"INLAND"` was mapped to index 1, twice. Then we used `tf.one_hot()` to one-hot encode these indices. Notice that we have to tell this function the total number of indices, which is equal to the vocabulary size plus the number of oov buckets. Now you know how to encode categorical features to one-hot vectors using TensorFlow!

Just like earlier, it wouldn't be too difficult to bundle all of this logic into a nice self-contained class. Its `adapt()` method would take a data sample & extract all the distinct categories it contains. It would create a lookup table to map each category to its index (including unknown categories using oov buckets). Then its `call()` method would use the lookup table to map the input categories to their indices. 
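Such a class might be sketched as follows, in plain Python so the logic is easy to follow. Everything here is an assumption made for illustration: the `CategoryEncoder` name, the extra `one_hot()` helper, & the use of `zlib.crc32` as a stand-in hash for the oov buckets. Also note that it sorts the vocabulary, so the indices differ from the hand-written `vocab` list used earlier.

```python
import zlib

class CategoryEncoder:
    """Toy category encoder with adapt() & call(), not a real Keras layer."""

    def __init__(self, num_oov_buckets=2):
        self.num_oov_buckets = num_oov_buckets
        self.index_ = {}

    def adapt(self, data_sample):
        # Extract the distinct categories & give each one an index.
        vocab = sorted(set(data_sample))
        self.index_ = {category: i for i, category in enumerate(vocab)}

    def _lookup(self, category):
        if category in self.index_:
            return self.index_[category]
        # Unknown category: hash it into one of the oov buckets, whose
        # indices come right after the known categories.
        bucket = zlib.crc32(category.encode("utf-8")) % self.num_oov_buckets
        return len(self.index_) + bucket

    def call(self, categories):
        return [self._lookup(category) for category in categories]

    def one_hot(self, categories):
        depth = len(self.index_) + self.num_oov_buckets
        return [[1.0 if j == i else 0.0 for j in range(depth)]
                for i in self.call(categories)]

encoder = CategoryEncoder(num_oov_buckets=2)
encoder.adapt(["INLAND", "NEAR BAY", "<1H OCEAN", "INLAND",
               "NEAR OCEAN", "ISLAND", "<1H OCEAN"])
print(encoder.call(["NEAR BAY", "DESERT", "INLAND", "INLAND"]))
```

The known categories get indices 0 to 4 (in sorted order), while `"DESERT"` ends up in one of the two oov buckets, at index 5 or 6.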

Keras has a `keras.layers.TextVectorization` layer, which is capable of doing exactly that: its `adapt()` method will extract the vocabulary from a data sample, & its `call()` method will convert each category to its index in the vocabulary. You could add this layer at the beginning of your model, followed by a `Lambda` layer that would apply the `tf.one_hot()` function, if you want to convert these indices to one-hot vectors.

This may not be the best solution, though. The size of each one-hot vector is the vocabulary length plus the number of oov buckets. This is fine when there are just a few possible categories, but if the vocabulary is large, it is much more efficient to encode them using *embeddings* instead.

## Encoding Categorical Features Using Embeddings

An embedding is a trainable dense vector that represents a category. By default, embeddings are initialised randomly, so for example, the `"NEAR BAY"` category could be represented initially by a random vector such as `[0.131, 0.890]`, while the `"NEAR OCEAN"` category might be represented by another random vector such as `[0.631, 0.791]`. In this example, we use 2D embeddings, but the number of dimensions is a hyperparameter you can tweak. Since these embeddings are trainable, they will gradually improve during training; & as they represent fairly similar categories, gradient descent will certainly end up pushing them closer together, while it will tend to move them away from the `"INLAND"` category's embedding. Indeed, the better the representation, the easier it will be for the neural network to make accurate predictions, so training tends to make embeddings useful representations of the categories. This is called *representation learning*.

<img src = "Images/Embeddings.png" width = "450" style = "margin:auto"/>

Let's look at how we could implement embeddings manually, to understand how they work (then we will use a simple keras layer instead). First, we need to create an *embedding matrix* containing each category's embedding, initialised randomly; it will have one row per category & per oov bucket, & one column per embedding dimension:

In [None]:
embedding_dim = 2
embed_init = tf.random.uniform([len(vocab) + num_oov_buckets, embedding_dim])
embedding_matrix = tf.Variable(embed_init)

In this example, we are using 2D embeddings, but as a rule of thumb embeddings typically have 10 to 300 dimensions, depending on the task & the vocabulary size (you will have to tune this hyperparameter).

This embedding matrix is a random 7 x 2 matrix (one row for each of the 5 categories & each of the 2 oov buckets), stored in a variable (so it can be tweaked by gradient descent during training):

In [None]:
embedding_matrix

Now, let's encode the same batch of categorical features as earlier, but this time using those embeddings:

In [8]:
categories = tf.constant(["NEAR BAY", "DESERT", "INLAND", "INLAND"])
cat_indices = table.lookup(categories)
cat_indices

In [None]:
tf.nn.embedding_lookup(embedding_matrix, cat_indices)

The `tf.nn.embedding_lookup()` function looks up the rows in the embedding matrix, at the given indices -- that's all it does. For example, the lookup table says that the `"INLAND"` category is at index 1, so the `tf.nn.embedding_lookup()` function returns the embedding at row 1 in the embedding matrix (twice): `[0.3528825, 0.46448255]`.
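Since the lookup is nothing more than row selection, it can be mimicked with plain NumPy indexing. The sketch below uses a made-up random matrix (so the values will not match the ones above) & also checks that the result is the same as multiplying one-hot vectors by the embedding matrix, which is one way to see why embeddings are a cheaper alternative to one-hot encoding.

```python
import numpy as np

# Stand-in embedding matrix: 5 categories + 2 oov buckets, 2D embeddings.
rng = np.random.default_rng(42)
embedding_matrix = rng.uniform(size=(7, 2))

# Same category indices as in the lookup table example.
cat_indices = np.array([3, 5, 1, 1])

# An embedding lookup is just row selection.
embeddings = embedding_matrix[cat_indices]

# Equivalent (but wasteful) formulation: one-hot vectors times the matrix.
one_hot = np.eye(7)[cat_indices]
via_one_hot = one_hot @ embedding_matrix

print(embeddings.shape)
print(np.allclose(embeddings, via_one_hot))
```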

Keras provides a `keras.layers.Embedding` layer that handles the embedding matrix (trainable, by default); when the layer is created, it initialises the embedding matrix randomly, & then when it is called with some category indices it returns the rows at those indices in the embedding matrix:

In [None]:
embedding = keras.layers.Embedding(input_dim = len(vocab) + num_oov_buckets, 
                                   output_dim = embedding_dim)
embedding(cat_indices)

Putting everything together, we can now create a keras model that can process categorical features (along with regular numerical features) & learn an embedding for each category (as well as for each oov bucket):

In [None]:
regular_inputs = keras.layers.Input(shape = [8])
categories = keras.layers.Input(shape = [], dtype = tf.string)
cat_indices = keras.layers.Lambda(lambda cats: table.lookup(cats))(categories)
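# One row per category & per oov bucket (5 + 2 = 7), as in the previous cell
cat_embed = keras.layers.Embedding(input_dim = len(vocab) + num_oov_buckets,
                                   output_dim = embedding_dim)(cat_indices)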
encoded_inputs = keras.layers.concatenate([regular_inputs, cat_embed])
outputs = keras.layers.Dense(1)(encoded_inputs)
model = keras.models.Model(inputs = [regular_inputs, categories],
                           outputs = [outputs])

This model takes two inputs: a regular input containing eight numerical features per instance, plus a categorical input (containing one categorical feature per instance). It uses a `Lambda` layer to look up each category's index, then it looks up the embeddings for these indices. Next, it concatenates the embeddings & the regular inputs in order to give the encoded inputs, which are ready to be fed to a neural network. We could add any kind of neural network at this point, but we just add a dense output layer, & we create the keras model.

When the `keras.layers.TextVectorization` layer is available, you can call its `adapt()` method to make it extract the vocabulary from a data sample (it will take care of creating the lookup table for you). Then you can add it to your model, & it will perform the index lookup (replacing the `Lambda` layer in the previous code example).

## Keras Preprocessing Layers

The TensorFlow team is working on providing a set of standard keras preprocessing layers. They will probably be available by now; however, the API may change slightly, so refer to the documentation if anything behaves unexpectedly. This API will likely supersede the existing feature columns API, which is harder to use & less intuitive.

We already discussed two of these layers: the `keras.layers.Normalization` layer that will perform feature standardisation (it will be equivalent to the `Standardization` layer we defined earlier), & the `TextVectorization` layer that will be capable of encoding each word in the inputs into its index in the vocabulary. In both cases, you create the layer, you call its `adapt()` method with a data sample, & then you use the layer normally in your model. The other preprocessing layers will follow the same pattern.

The API will also include a `keras.layers.Discretization` layer that will chop continuous data into different bins & encode each bin as a one-hot vector. For example, you could use it to discretize prices into three categories (low, medium, high), which would be encoded as [1, 0, 0], [0, 1, 0], & [0, 0, 1], respectively. Of course, this loses a lot of information, but in some cases, it can help the model detect patterns that would otherwise not be obvious when just looking at the continuous values.
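The idea can be illustrated with plain NumPy, without relying on the Keras layer's exact API. The prices & the two bin boundaries below are hypothetical values chosen for the example.

```python
import numpy as np

prices = np.array([12.0, 45.0, 180.0, 99.0])

# Hypothetical thresholds: low < 50 <= medium < 150 <= high.
bin_boundaries = [50.0, 150.0]

# np.digitize() returns the index of the bin each value falls into.
bin_indices = np.digitize(prices, bin_boundaries)

# Encode each bin index as a one-hot vector (3 bins for 2 boundaries).
one_hot = np.eye(len(bin_boundaries) + 1)[bin_indices]

print(bin_indices)  # [0 0 2 1]
print(one_hot)
```

Here 12 & 45 both fall in the low bin, so they get the same encoding `[1, 0, 0]`: this is exactly the loss of information mentioned above.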

It will also be possible to chain multiple preprocessing layers using the `PreprocessingStage` class. For example, the following code will create a preprocessing pipeline that will first normalize the inputs, then discretize them (this may remind you of scikit-learn pipelines). After you adapt this pipeline to a data sample, you can use it like a regular layer in your models (but again, only at the start of the model, since it contains a nondifferentiable preprocessing layer):

In [None]:
normalisation = keras.layers.Normalization()
discretisation = keras.layers.Discretization([...])
pipeline = keras.layers.PreprocessingStage([normalisation, discretisation])
pipeline.adapt(data_sample)

The `TextVectorization` layer will also have an option to output word-count vectors instead of word indices. For example, if the vocabulary contains three words, say `["and", "basketball", "more"]`, then the text `"more and more"` will be mapped to the vector `[1, 0, 2]`: the word `"and"` appears once, the word `"basketball"` does not appear at all, & the word `"more"` appears twice. This text representation is called a *bag of words*, since it completely loses the order of the words. Common words like `"and"` will have a large value in most texts, even though they are usually the least interesting (e.g., in the text `"more and more basketball"`, the word `"basketball"` is clearly the most important, precisely because it is not a very frequent word). So, the word counts should be normalised in a way that reduces the importance of frequent words. A common way to do this is to divide each word count by the log of the total number of training instances in which the word appears. This technique is called *Term-Frequency x Inverse-Document-Frequency* (TF-IDF). For example, let's imagine that the words `"and"`, `"basketball"`, & `"more"` appear respectively in 200, 10, & 100 text instances in the training set: in this case, the final vector will be `[1/log(200), 0/log(10), 2/log(100)]`, which is approximately equal to `[0.19, 0.0, 0.43]`. The `TextVectorization` layer will (likely) have an option to perform TF-IDF.
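The arithmetic can be checked with a few lines of plain Python. This sketch follows the simplified weighting described above (each count divided by the natural log of the word's document count), with document counts of 200, 10, & 100 to match the denominators of the final vector; full TF-IDF implementations typically use a slightly different formula. The helper names are made up for the example.

```python
import math

vocabulary = ["and", "basketball", "more"]

# Number of training instances in which each word appears.
document_counts = {"and": 200, "basketball": 10, "more": 100}

def bag_of_words(text):
    """Count how many times each vocabulary word appears in the text."""
    words = text.split()
    return [words.count(word) for word in vocabulary]

def weighted(counts):
    """Divide each count by the log of the word's document count."""
    return [count / math.log(document_counts[word])
            for word, count in zip(vocabulary, counts)]

counts = bag_of_words("more and more")
weights = weighted(counts)

print(counts)                          # [1, 0, 2]
print([round(w, 2) for w in weights])  # [0.19, 0.0, 0.43]
```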

As you can see, these keras preprocessing layers will make preprocessing much easier! Now, whether you choose to write your own preprocessing layers or use Keras's (or even the feature columns API), all the preprocessing will be done on the fly. During training, however, it may be preferable to perform preprocessing ahead of time. Let's see why we'd want to do that & how we'd go about it.

---

# TF Transform

If preprocessing is computationally expensive, then handling it before training rather than on the fly may give you a significant speedup: the data will be preprocessed just once per instance *before* training, rather than once per instance & per epoch *during* training. As mentioned earlier, if the dataset is small enough to fit in RAM, you can use its `cache()` method. But if it is too large, then tools like apache beam or spark will help. They let you run efficient data processing pipelines over large amounts of data, even distributed across multiple servers, so you can use them to preprocess all the training data before training.

This works great & indeed can speed up training, but there is one problem: once your model is trained, suppose you want to deploy it to a mobile app. In that case, you will need to write some code in your app to take care of preprocessing the data before it is fed to the model. And suppose you also want to deploy the model to TensorFlow.js so that it runs in a web browser. Once again, you will need to write some preprocessing code. This can become a maintenance nightmare: whenever you want to change the preprocessing logic, you will need to update your apache beam code, your mobile app code, & your javascript code. This is not only time-consuming, but also error-prone: you may end up with subtle differences between the preprocessing operations performed before training & the ones performed in your app or in the browser. This *training/serving skew* will lead to bugs or degraded performance.

One improvement would be to take the trained model (trained on data that was preprocessed by your apache beam or spark code) & before deploying it to your app or the browser, add extra preprocessing layers to take care of preprocessing on the fly. That's definitely better, since now you just have two versions of your preprocessing code: the apache beam or spark code, & the preprocessing layers' code.

But what if you could define your preprocessing operations just once? This is what tf transform was designed for. It is part of TensorFlow Extended (TFX), an end-to-end platform for productionising TensorFlow models. First, to use a TFX component such as tf transform, you must install it; it does not come bundled with TensorFlow. You then define your preprocessing function just once (in Python), by using tf transform functions for scaling, bucketising, & more. You can also use any TensorFlow operation you need. Here is what this preprocessing function might look like if we just had two features:

In [None]:
import tensorflow_transform as tft

def preprocess(inputs):  # inputs = a batch of input features
    median_age = inputs["housing_median_age"]
    ocean_proximity = inputs["ocean_proximity"]
    standardised_age = tft.scale_to_z_score(median_age)
    ocean_proximity_id = tft.compute_and_apply_vocabulary(ocean_proximity)
    return {"standardised_median_age": standardised_age,
            "ocean_proximity_id": ocean_proximity_id}

Next, tf transform lets you apply this `preprocess()` function to the whole training set using apache beam (it provides an `AnalyzeAndTransformDataset` class that you can use for this purpose in your apache beam pipeline). In the process, it will also compute all the necessary statistics over the whole training set: in this example, the mean & standard deviation of the `housing_median_age` feature, & the vocabulary for the `ocean_proximity` feature. The components that compute these statistics are called *analyzers*.

Importantly, tf transform will also generate an equivalent TensorFlow function that you can plug into the model you deploy. This tf function includes some constants that correspond to all the necessary statistics computed by apache beam (the mean, standard deviation, & vocabulary).
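The underlying idea (analyze the training set once, then bake the resulting constants into a single preprocessing function shared by training & serving) can be sketched in plain Python. To be clear, this is not tf transform's API: the `analyze()` & `make_preprocess_fn()` helpers, the JSON export, & the `-1` id for unknown categories are all assumptions made for this conceptual illustration.

```python
import json
import statistics

def analyze(training_rows):
    """Full pass over the training set to compute the needed statistics."""
    ages = [row["housing_median_age"] for row in training_rows]
    vocab = sorted({row["ocean_proximity"] for row in training_rows})
    return {"age_mean": statistics.fmean(ages),
            "age_std": statistics.pstdev(ages),
            "vocab": vocab}

def make_preprocess_fn(stats):
    """Build a preprocessing function with the statistics baked in."""
    vocab_index = {category: i for i, category in enumerate(stats["vocab"])}

    def preprocess(row):
        age = row["housing_median_age"]
        return {
            "standardised_median_age":
                (age - stats["age_mean"]) / stats["age_std"],
            # Unknown categories get the id -1 in this sketch.
            "ocean_proximity_id": vocab_index.get(row["ocean_proximity"], -1),
        }

    return preprocess

training_rows = [
    {"housing_median_age": 10.0, "ocean_proximity": "INLAND"},
    {"housing_median_age": 20.0, "ocean_proximity": "NEAR BAY"},
    {"housing_median_age": 30.0, "ocean_proximity": "INLAND"},
]

stats = analyze(training_rows)
training_preprocess = make_preprocess_fn(stats)

# The constants are exported alongside the model, then reloaded at
# serving time to rebuild the very same preprocessing function.
exported = json.dumps(stats)
serving_preprocess = make_preprocess_fn(json.loads(exported))

print(training_preprocess(training_rows[0]))
print(serving_preprocess(training_rows[0]))
```

Because both functions are built from the same constants by the same code, there is no room for training/serving skew.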

With the data API, tfrecords, the keras preprocessing layers, & tf transform, you can build highly scalable input pipelines for training & benefit from fast & portable data preprocessing in production.

But what if you just wanted to use a standard dataset? Well in that case, things are much simpler: just use TFDS!

---

# The TensorFlow Datasets (TFDS) Project

The TensorFlow Datasets project makes it very easy to download common datasets, from small ones like MNIST or Fashion MNIST to huge datasets like ImageNet (you will need quite a bit of disk space). The list includes image datasets, text datasets (including translation datasets), & audio & video datasets.

TFDS is not bundled with TensorFlow, so you need to install the `tensorflow_datasets` library (e.g., using pip). Then call the `tfds.load()` function, & it will download the data you want (unless it was already downloaded earlier) & return the data as a dictionary of datasets (typically one for training & one for testing, but this depends on the dataset you choose). For example, let's download MNIST:

In [23]:
import tensorflow_datasets as tfds

dataset = tfds.load(name = "mnist")
mnist_train, mnist_test = dataset["train"], dataset["test"]

You can then apply any transformation you want (typically shuffling, batching, & prefetching), & you're ready to train your model. Here is a simple example:

In [None]:
mnist_train = mnist_train.shuffle(10000).batch(32).prefetch(1)
for item in mnist_train:
    images = item["image"]
    labels = item["label"]
    [...]

Note that each item in the dataset is a dictionary containing both the features & the labels. But keras expects each item to be a tuple containing two elements (again, the features & the labels). You could transform the dataset using the `map()` method, like this:

In [None]:
mnist_train = mnist_train.shuffle(10000).batch(32)
mnist_train = mnist_train.map(lambda items: (items["image"], items["label"]))
mnist_train = mnist_train.prefetch(1)

But it's simpler to ask the `load()` function to do this for you by setting `as_supervised = True` (obviously this works only for labeled datasets). You can also specify the batch size if you want. Then you can pass the dataset directly to your tf.keras model:

In [None]:
dataset = tfds.load(name = "mnist", batch_size = 32, as_supervised = True)
mnist_train = dataset["train"].prefetch(1)
model = keras.models.Sequential([...])
model.compile(loss = "sparse_categorical_crossentropy", optimizer = "sgd")
model.fit(mnist_train, epochs = 5)

This was quite a technical lesson, & you may feel that it is a bit far from the abstract beauty of neural networks, but the fact is deep learning often involves large amounts of data, & knowing how to load, parse, & preprocess it efficiently is a crucial skill to have.