# Loading and Preprocessing Data with TensorFlow

So far we have used only datasets that fit in memory, but Deep Learning systems are often trained on very large datasets that will not fit in RAM.

Ingesting a large dataset and preprocessing it efficiently can be tricky to implement with other Deep Learning libraries, but TensorFlow makes it easy thanks to the Data API, you just create a dataset object, and tell it where to get the data and how to transform it.

TensorFlow takes care of all the implementation details, such as multithreading, queuing, batching, and prefetching. Moreover, the Data API works seamlessly with tf.keras.

Off the shelf, the Data API can read from text files (such as CSV files), binary files with fixed-size records, and binary files that use TensorFlow’s TFRecord format, which supports records of varying sizes. 

TFRecord is a flexible and efficient binary format usually containing protocol buffers (an open source binary format).

The Data API also has support for reading from SQL databases. Moreover, many open source extensions are available to read from all sorts of data sources, such as Google’s Big‐Query service.

Reading huge datasets efficiently is not the only difficulty: the data also needs to be preprocessed, usually normalized. Moreover, it is not always composed strictly of convenient numerical fields: there may be text features, categorical features, and so on. 

These need to be encoded, for example using one-hot encoding, bag-of-words encoding, or embedding. An embedding is a trainable dense vector that represents a category or token.

One option to handle all this preprocessing is to write your own custom preprocessing layers. Another is to use the standard preprocessing layers provided by Keras.

## Chapter Contents

In this chapter, we will cover the Data API, the TFRecord format, and how to create
custom preprocessing layers and use the standard Keras ones. We will also take a quick look at a few related projects from TensorFlow’s ecosystem:

In [1]:
import os
import numpy as np

## TensorFlow Functions

### TF Transform (tf.Transform)

Makes it possible to write a single preprocessing function that can be run in batch mode on your full training set, before training (to speed it up), and then exported to a TF Function and incorporated into your trained model so that once it is deployed in production it can take care of preprocessing new instances on the fly.

### TF Datasets (TFDS)

Provides a convenient function to download many common datasets of all kinds, including large ones like ImageNet, as well as convenient dataset objects to manipulate them using the Data API.

## The Data API

The whole Data API revolves around the concept of a dataset: as you might suspect, this represents a sequence of data items. Usually you will use datasets that gradually read data from disk, but for simplicity let’s create a dataset entirely in RAM using tf.data.Dataset.from_tensor_slices():

In [2]:
import tensorflow as tf
import tensorflow.keras as keras




In [3]:
X = tf.range(10)
X

<tf.Tensor: shape=(10,), dtype=int32, numpy=array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])>

In [4]:
dataset = tf.data.Dataset.from_tensor_slices(X)
dataset

<_TensorSliceDataset element_spec=TensorSpec(shape=(), dtype=tf.int32, name=None)>

The from_tensor_slices() function takes a tensor and creates a tf.data.Dataset whose elements are all the slices of X (along the first dimension), so this dataset contains 10 items: tensors 0, 1, 2, …, 9. In this case we would have obtained the same dataset if we had used tf.data.Dataset.range(10).

You can simply iterate over a dataset’s items like this:

In [5]:
for item in dataset:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)


### Chaining Transformations

Once you have a dataset, you can apply all sorts of transformations to it by calling its transformation methods. Each method returns a new dataset, so you can chain transformations like this.

In [6]:
dataset = dataset.repeat(3)

In [7]:
for item in dataset:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype

In [8]:
dataset = dataset.repeat(3).batch(7)

In [9]:
for item in dataset:
    print(item)

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int32)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int32)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int32)
tf.Tensor([8 9 0 1 2 3 4], shape=(7,), dtype=int32)
tf.Tensor([5 6 7 8 9 0 1], shape=(7,), dtype=int32)
tf.Tensor([2 3 4 5 6 7 8], shape=(7,), dtype=int32)
tf.Tensor([9 0 1 2 3 4 5], shape=(7,), dtype=int32)
tf.Tensor([6 7 8 9 0 1 2], shape=(7,), dtype=int32)
tf.Tensor([3 4 5 6 7 8 9], shape=(7,), dtype=int32)
tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int32)
tf.Tensor([4 5 6 7 8 9], shape=(6,), dtype=int32)


#### Explanation of the transformation

##### repeat()

In this example, we first call the repeat() method on the original dataset, and it returns a new dataset that will repeat the items of the original dataset three times see the dataset = dataset.repeat(3) block and the for block after that . Of course, this will not copy all the data in memory three times!

##### WARNING REPEAT!!!!!!!!!!!!!!!!!!!!!!!!!
If you call this method with no arguments, the new dataset will repeat the source dataset forever, so the code that iterates over the dataset will have to decide when to stop. Try running the command dataset = dataset.repeat(3) and then run the for loop in the next block.

#### batch()

Then we call the batch() method on this new dataset, and again this creates a new dataset.

This one will group the items of the previous dataset in batches of seven items. Finally, we iterate over the items of this final dataset.

As you can see, the batch() method had to output a final batch of size two instead of seven, but you can call it with drop_remainder=True if you want it to drop this final batch so that all batches have the exact same size.

#### map method

You can also transform the items by calling the map() method. For example, this creates a new dataset with all items doubled:

In [10]:
dataset = dataset.map(lambda x:x**3)
dataset

<_MapDataset element_spec=TensorSpec(shape=(None,), dtype=tf.int32, name=None)>

In [11]:
for item in dataset:
    print(item)

tf.Tensor([  0   1   8  27  64 125 216], shape=(7,), dtype=int32)
tf.Tensor([343 512 729   0   1   8  27], shape=(7,), dtype=int32)
tf.Tensor([ 64 125 216 343 512 729   0], shape=(7,), dtype=int32)
tf.Tensor([  1   8  27  64 125 216 343], shape=(7,), dtype=int32)
tf.Tensor([512 729   0   1   8  27  64], shape=(7,), dtype=int32)
tf.Tensor([125 216 343 512 729   0   1], shape=(7,), dtype=int32)
tf.Tensor([  8  27  64 125 216 343 512], shape=(7,), dtype=int32)
tf.Tensor([729   0   1   8  27  64 125], shape=(7,), dtype=int32)
tf.Tensor([216 343 512 729   0   1   8], shape=(7,), dtype=int32)
tf.Tensor([ 27  64 125 216 343 512 729], shape=(7,), dtype=int32)
tf.Tensor([  0   1   8  27  64 125 216], shape=(7,), dtype=int32)
tf.Tensor([343 512 729   0   1   8  27], shape=(7,), dtype=int32)
tf.Tensor([ 64 125 216 343 512 729], shape=(6,), dtype=int32)


#### map method usage
This function is the one you will call to apply any preprocessing you want to your data. Sometimes this will include computations that can be quite intensive, such as reshaping or rotating an image, so you will usually want to spawn multiple threads to speed things up: it’s as simple as setting the num_parallel_calls argument. Note that the function you pass to the map() method must be convertible to a TF Function

#### apply method

While the map() method applies a transformation to each item, the apply() method applies a transformation to the dataset as a whole.

For example, the following code. Each item in the new dataset will be a single-integer tensor instead of a batch of seven integers:applies the unbatch() function to the dataset.

In [12]:
dataset = dataset.apply(tf.data.experimental.unbatch())
dataset

Instructions for updating:
Use `tf.data.Dataset.unbatch()`.


<_UnbatchDataset element_spec=TensorSpec(shape=(), dtype=tf.int32, name=None)>

#### It is also possible to simply filter the dataset using the filter() method:

In [13]:
dataset = dataset.filter(lambda x : x<7)
dataset

<_FilterDataset element_spec=TensorSpec(shape=(), dtype=tf.int32, name=None)>

In [14]:
for item in dataset:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)


#### You will often want to look at just a few items from a dataset. You can use the take() method for that:

In [15]:
for item in dataset.take(533):
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)


## Shuffling the Data

As you know, Gradient Descent works best when the instances in the training set are independent and identically distributed. A simple way to ensure this is to shuffle the instances, using the shuffle() method.

#### Shuffle method

1) It will create a new dataset that will start by filling up a buffer with the first items of the source dataset.

2) Then, whenever it is asked for an item, it will pull one out randomly from the buffer and replace it with a fresh one from the source dataset, until it has iterated entirely through the source dataset.

3) At this point it continues to pull out items randomly from the buffer until it is empty.

4) You must specify the buffer size, and it is important to make it large enough, or else shuffling will not be very effective.

#### A simple analogy to explain why it is important to have a bigger buffer size when shuffling
Imagine a sorted deck of cards on your left: suppose you just take the top three cards and shuffle them, then pick one randomly and put it to your right, keeping the other two in your hands. Take another card on your left, shuffle the three cards in your hands and pick one of them randomly, and put it on your right. When you are done going through all the cards like this, you will have a deck of cards on your right: do you think it will be perfectly shuffled?

5) Just don’t exceed the amount of RAM you have, and even if you have plenty of it, there’s no need to go beyond the dataset’s size.

6) You can provide a random seed if you want the same random order every time you run your program.

For example, the following code creates and displays a dataset containing the integers 0 to 9, repeated 3 times, shuffled using a buffer of size 5 and a random seed of 42, and batched with a batch size of 7:

In [16]:
dataset = tf.data.Dataset.range(10).repeat(3)
dataset

<_RepeatDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>

In [17]:
for item in dataset:
    print(item)

tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(7, shape=(), dtype=int64)
tf.Tensor(8, shape=(), dtype=int64)
tf.Tensor(9, shape=(), dtype=int64)
tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(7, shape=(), dtype=int64)
tf.Tensor(8, shape=(), dtype=int64)
tf.Tensor(9, shape=(), dtype=int64)
tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(7, shape=(), dtype

In [18]:
dataset = dataset.shuffle(buffer_size=13,seed=231213).batch(7)
dataset

<_BatchDataset element_spec=TensorSpec(shape=(None,), dtype=tf.int64, name=None)>

In [19]:
for item in dataset:
    print(item)

tf.Tensor([9 6 3 1 1 7 6], shape=(7,), dtype=int64)
tf.Tensor([5 8 5 8 9 0 3], shape=(7,), dtype=int64)
tf.Tensor([4 2 2 0 8 6 4], shape=(7,), dtype=int64)
tf.Tensor([2 0 7 5 7 9 1], shape=(7,), dtype=int64)
tf.Tensor([3 4], shape=(2,), dtype=int64)


#### Note about repeat on a shuffled dataset
If you call repeat() on a shuffled dataset, by default it will generate a new order at every iteration. This is generally a good idea, but if you prefer to reuse the same order at each iteration (e.g., for tests or debugging), you can set reshuffle_each_iteration=False.


#### Shuffling for Large Dataset

For a large dataset that does not fit in memory, this simple shuffling-buffer approach may not be sufficient, since the buffer will be small compared to the dataset.

One solution is to shuffle the source data itself (for example, on Linux you can shuffle text files using the shuf command)

This will definitely improve shuffling a lot! Even if the source data is shuffled, you will usually want to shuffle it some more, or else the same order will be repeated at each epoch, and the model may end up being biased(e.g., due to some spurious patterns present by chance in the source data’s order).

To shuffle the instances some more, a common approach is to split the source data into multiple files, then read them in a random order during training. However, instances located in the same file will still end up close to each other

To avoid this you can pick multiple files randomly and read them simultaneously, interleaving their records(basically mixing their records, like you read first line from the first file and the next line from the seventh file).

#### Interleaving lines from multiple files

First, let’s suppose that you’ve loaded the California housing dataset, shuffled it (unless it was already shuffled), and split it into a training set, a validation set, and a test set. Then you split each set into many CSV files.

In [20]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, housing.target.reshape(-1, 1), random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, random_state=42)

scaler = StandardScaler()
scaler.fit(X_train)
X_mean = scaler.mean_
X_std = scaler.scale_

In [21]:
X_mean

array([ 3.89175860e+00,  2.86245478e+01,  5.45593655e+00,  1.09963474e+00,
        1.42428122e+03,  2.95886657e+00,  3.56464315e+01, -1.19584363e+02])

In [22]:
X_std

array([1.90927329e+00, 1.26409177e+01, 2.55038070e+00, 4.65460128e-01,
       1.09576000e+03, 2.36138048e+00, 2.13456672e+00, 2.00093304e+00])

In [23]:
dir = "C:/Users/manas/Downloads/"

In [24]:
def save_to_multiple_csv_files(data, name_prefix, header=None, n_parts=10):
    housing_dir = os.path.join(dir,"datasets", "housing")
    os.makedirs(housing_dir, exist_ok=True)
    path_format = os.path.join(housing_dir, "my_{}_{:02d}.csv")

    filepaths = []
    m = len(data)
    for file_idx, row_indices in enumerate(np.array_split(np.arange(m), n_parts)):
        part_csv = path_format.format(name_prefix, file_idx)
        filepaths.append(part_csv)
        with open(part_csv, "wt", encoding="utf-8") as f:
            if header is not None:
                f.write(header)
                f.write("\n")
            for row_idx in row_indices:
                f.write(",".join([repr(col) for col in data[row_idx]]))
                f.write("\n")
    return filepaths

In [25]:
train_data = np.c_[X_train, y_train]
valid_data = np.c_[X_valid, y_valid]
test_data = np.c_[X_test, y_test]
header_cols = housing.feature_names + ["MedianHouseValue"]
header = ",".join(header_cols)

train_filepaths = save_to_multiple_csv_files(train_data, "train", header, n_parts=20)
valid_filepaths = save_to_multiple_csv_files(valid_data, "valid", header, n_parts=10)
test_filepaths = save_to_multiple_csv_files(test_data, "test", header, n_parts=10)

In [26]:
len(train_data)

11610

In [27]:
import pandas as pd

pd.read_csv(train_filepaths[0]).head()

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedianHouseValue
0,3.5214,15.0,3.049945,1.106548,1447.0,1.605993,37.63,-122.43,1.442
1,5.3275,5.0,6.49006,0.991054,3464.0,3.44334,33.69,-117.39,1.687
2,3.1,29.0,7.542373,1.591525,1328.0,2.250847,38.44,-122.98,1.621
3,7.1736,12.0,6.289003,0.997442,1054.0,2.695652,33.55,-117.7,2.621
4,2.0549,13.0,5.312457,1.085092,3297.0,2.244384,33.93,-116.93,0.956


In [28]:
with open(train_filepaths[0]) as f:
    for i in range(5):
        print(f.readline(), end="")

MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedianHouseValue
3.5214,15.0,3.0499445061043287,1.106548279689234,1447.0,1.6059933407325193,37.63,-122.43,1.442
5.3275,5.0,6.490059642147117,0.9910536779324056,3464.0,3.4433399602385686,33.69,-117.39,1.687
3.1,29.0,7.5423728813559325,1.5915254237288134,1328.0,2.2508474576271187,38.44,-122.98,1.621
7.1736,12.0,6.289002557544757,0.9974424552429667,1054.0,2.6956521739130435,33.55,-117.7,2.621


#### Now continuing with interleaving

By default, the list_files() function returns a dataset that shuffles the file paths. In general this is a good thing, but you can set shuffle=False if you do not want that for some reason.

In [29]:
train_datasets_filepath = tf.data.Dataset.list_files(train_filepaths, seed=42)

In [30]:
train_filepaths

['C:/Users/manas/Downloads/datasets\\housing\\my_train_00.csv',
 'C:/Users/manas/Downloads/datasets\\housing\\my_train_01.csv',
 'C:/Users/manas/Downloads/datasets\\housing\\my_train_02.csv',
 'C:/Users/manas/Downloads/datasets\\housing\\my_train_03.csv',
 'C:/Users/manas/Downloads/datasets\\housing\\my_train_04.csv',
 'C:/Users/manas/Downloads/datasets\\housing\\my_train_05.csv',
 'C:/Users/manas/Downloads/datasets\\housing\\my_train_06.csv',
 'C:/Users/manas/Downloads/datasets\\housing\\my_train_07.csv',
 'C:/Users/manas/Downloads/datasets\\housing\\my_train_08.csv',
 'C:/Users/manas/Downloads/datasets\\housing\\my_train_09.csv',
 'C:/Users/manas/Downloads/datasets\\housing\\my_train_10.csv',
 'C:/Users/manas/Downloads/datasets\\housing\\my_train_11.csv',
 'C:/Users/manas/Downloads/datasets\\housing\\my_train_12.csv',
 'C:/Users/manas/Downloads/datasets\\housing\\my_train_13.csv',
 'C:/Users/manas/Downloads/datasets\\housing\\my_train_14.csv',
 'C:/Users/manas/Downloads/datasets\\hou

In [31]:
train_datasets_filepath

<_ShuffleDataset element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>

In [32]:
# for i in train_datasets_filepath:
#     print(i)

#### Interleave()

Next, you can call the interleave() method to read from five files at a time and interleave their lines (skipping the first line of each file, which is the header row, using the skip() method):

In [33]:
n_readers = 5

In [34]:
dataset = train_datasets_filepath.interleave( lambda train_datasets_filepath: tf.data.TextLineDataset(train_datasets_filepath).skip(1),cycle_length=n_readers)

#### Working of Interleave()

1) The interleave() method will create a dataset that will pull five file paths (using the cyle length parameter) from the filepath_dataset, and for each one it will call the function you gave it (a lambda in this example) to create a new dataset (in this case a TextLineDataset).

#### This point is important and explained
2) To be clear, at this stage there will be seven datasets in all: the train_datasets_filepath dataset, the interleave dataset, and the five TextLineDatasets created internally by the interleave dataset.

#### Explanation:- 
2.i) Here although train_datasets_filepath actually contains file paths, it's still considered as a dataset.


2.ii) Now, the interleave dataset, this is the dataset that gets created by the first 5 files that were pulled and clubbed together by TextLineDatasets, hence since this TextLineDatasets is called by the interleave, thus it's said in the above statement TextLineDatasets created internally by the interleave dataset.


2.iii) Now the 5 TextLineDatasets, here, the n_reader is used, basically, since there are 5 files pulled out everytime, hence each file is considered as a dataset, thus 5 TextLineDatasets.

3) When we iterate over the interleave dataset, it will cycle through these five TextLineDatasets, reading one line at a time from each until all datasets are out of items. Then it will get the next five file paths from the filepath_dataset and interleave them the same way,
and so on until it runs out of file paths

So in a way you can say that although in the end you get the full shuffled dataset, the actual interleave dataset is created in iteration.

In [35]:
c = 0
for i in dataset:
    # print(i)
    c+=1
    

In [36]:
c

11610

By default, interleave() does not use parallelism; it just reads one line at a time from each file, sequentially. If you want it to actually read files in parallel, you can set the num_parallel_calls argument to the number of threads you want (note that the map() method also has this argument). You can even set it to tf.data.experimental.AUTOTUNE to make TensorFlow choose the right number of threads dynamically based on the available CPU

## Preprocessing the Data

#### Remember at this time our data is concatenated that means the X_train and y_train are present in one single file.

Now that our data is shuffled and loaded, let's pre-process this data.

Let’s implement a small function that will perform this preprocessing:

In [37]:
X_mean

array([ 3.89175860e+00,  2.86245478e+01,  5.45593655e+00,  1.09963474e+00,
        1.42428122e+03,  2.95886657e+00,  3.56464315e+01, -1.19584363e+02])

In [38]:
X_std

array([1.90927329e+00, 1.26409177e+01, 2.55038070e+00, 4.65460128e-01,
       1.09576000e+03, 2.36138048e+00, 2.13456672e+00, 2.00093304e+00])

In [39]:
n_inputs = 8

In [40]:
def preprocess(line):
    defs = [0.] * n_inputs + [tf.constant([], dtype=tf.float32)]
    fields = tf.io.decode_csv(line, record_defaults=defs)
    print('fields hai ',fields,'\n\n\n')
    x = tf.stack(fields[:-1])
    y = tf.stack(fields[-1:])
    return (x - X_mean) / X_std, y


Let’s walk through this code:

1) First, the function assumes that we have precomputed the mean and standard deviation of each feature in the training set. X_mean and X_std are just 1D tensors (or NumPy arrays) containing eight floats, one per input feature.

2) The preprocess() function takes one CSV line and starts by parsing it. For this it uses the tf.io.decode_csv() function, which takes two arguments: the first is the line to parse, and the second is an array containing the default value for each column, the defs that we created in the CSV file.

3) This array tells TensorFlow not only the default value for each column, but also the number of columns and their types.

4) In this example, we tell it that all feature columns are floats and that missing values should default to 0, this is for the first 8 columns, but we provide an empty array of type tf.float32 as the default value for the last column (the target) which is done using [tf.constant([], dtype=tf.float32)], hence we just concatenate the two lists, the [0.] * n_inputs and [tf.constant([], dtype=tf.float32)].

#### Important point
5) The array [tf.constant([], dtype=tf.float32)] tells TensorFlow that this column contains floats, but that there is no default value, so it will raise an exception if it encounters a missing value.

6) The decode_csv() function returns a list of scalar tensors (one per column), but we need to return 1D tensor arrays. So we call tf.stack() on all tensors except for the last one (the target): this will stack these tensors into a 1D array. We then do the same for the target value (this makes it a 1D tensor array with a single value, rather than a scalar tensor).

7) Finally, we scale the input features by subtracting the feature means and then dividing by the feature standard deviations, and we return a tuple containing the scaled features and the target.

Let’s test this preprocessing function:

In [41]:
preprocess(b'4.2083,44.0,5.3232,0.9171,846.0,2.3370,37.47,-122.2,2.782')

fields hai  [<tf.Tensor: shape=(), dtype=float32, numpy=4.2083>, <tf.Tensor: shape=(), dtype=float32, numpy=44.0>, <tf.Tensor: shape=(), dtype=float32, numpy=5.3232>, <tf.Tensor: shape=(), dtype=float32, numpy=0.9171>, <tf.Tensor: shape=(), dtype=float32, numpy=846.0>, <tf.Tensor: shape=(), dtype=float32, numpy=2.337>, <tf.Tensor: shape=(), dtype=float32, numpy=37.47>, <tf.Tensor: shape=(), dtype=float32, numpy=-122.2>, <tf.Tensor: shape=(), dtype=float32, numpy=2.782>] 





(<tf.Tensor: shape=(8,), dtype=float32, numpy=
 array([ 0.16579159,  1.216324  , -0.05204564, -0.39215982, -0.5277444 ,
        -0.2633488 ,  0.8543046 , -1.3072058 ], dtype=float32)>,
 <tf.Tensor: shape=(1,), dtype=float32, numpy=array([2.782], dtype=float32)>)

## Putting Everything Together

To make the code reusable, let’s put together everything we have discussed so far into a small helper function: it will create and return a dataset that will efficiently load California housing data from multiple CSV files, preprocess it, shuffle it, optionally repeat it, and batch it

In [42]:
def csv_reader_dataset(filepaths, repeat=1, n_readers=5,n_read_threads=None, shuffle_buffer_size=10000,n_parse_threads=5, batch_size=32):

    #Below we get the list of path in a tensor
    train_datasets_filepath = tf.data.Dataset.list_files(train_filepaths, seed=42)

    # Then we read the data from those files as explained before in text line format
    dataset = train_datasets_filepath.interleave(lambda train_datasets_filepath: tf.data.TextLineDataset(train_datasets_filepath).skip(1),cycle_length=n_readers, num_parallel_calls=n_read_threads) # Here read_threads is used because we are reading the data from the disk

    # Now we call the pre-process function that we just defined to get our data in desired tensor format
    dataset = dataset.map(preprocess, num_parallel_calls=n_parse_threads) # Here it's parse threads because we're just parsing the already read data for processing

    dataset = dataset.shuffle(shuffle_buffer_size).repeat(repeat)

    return dataset.batch(batch_size).prefetch(1) # Prefetching is explained below, the dataset.batch just batches the entire data into the specified batches. So you have the entire data, just in 32 tensor batches which is the batch size for our case.


## Prefetching

1) By calling prefetch(1) at the end, we are creating a dataset that will do its best to always be one batch ahead.

2) In general, just prefetching one batch is fine, but in some cases you may need to prefetch a few more. Alternatively, you can let TensorFlow decide automatically by passing tf.data.experimental.AUTOTUNE (this is an experimental feature for now).

3) In other words, while our training algorithm is working on one batch, the dataset will already be working in parallel on getting the next batch ready (e.g., reading the data from disk and preprocessing it). This can improve performance dramatically.

4) If we also ensure that loading and preprocessing are multithreaded (by setting num_parallel_calls when calling interleave() and map()), we can exploit multiple cores on the CPU and hopefully make preparing one batch of data shorter than running a training step on the GPU.

5) This way the GPU will be almost 100% utilized (except for the data transfer time from the CPU to the GPU. But check out the tf.data.experimental.prefetch_to_device() function, which can prefetch data directly to the GPU), and training will run much faster.

6) If the dataset is small enough to fit in memory, you can significantly speed up training by using the dataset’s cache() method to cache its content to RAM. You should generally do this after loading and preprocessing the data, but before shuffling, repeating, batching, and prefetching. This way, each instance will only be read and preprocessed once (instead of once per epoch as you've already loaded the entire data in your RAM and now you just shuffle and batch it for every epoch), but the data will still be shuffled differently at each epoch, and the next batch will still be prepared in advance.

### Other Methods

There are a few more you may want to look at: concatenate(), zip(), window(), reduce(), shard(), flat_map(), and padded_batch(). There are also a couple more class methods: from_generator() and from_tensors(), which create a new dataset from a Python generator or a list of tensors, respectively. Please check the API documentation for more details. Also note that there are experimental features available in tf.data.experimental, many of which will likely make it to the core API in future releases (e.g., check out the CsvDataset class, as well as the make_csv_dataset() method, which takes care of inferring the type of each column).

## Using the Dataset with tf.keras

Now we can use the csv_reader_dataset() function to create a dataset for the training set. Note that we do not need to repeat it, as this will be taken care of by tf.keras. We also create datasets for the validation set and the test set:


In [43]:
train_set = csv_reader_dataset(train_filepaths)
valid_set = csv_reader_dataset(valid_filepaths)
test_set = csv_reader_dataset(test_filepaths)
test_set

fields hai  [<tf.Tensor 'DecodeCSV:0' shape=() dtype=float32>, <tf.Tensor 'DecodeCSV:1' shape=() dtype=float32>, <tf.Tensor 'DecodeCSV:2' shape=() dtype=float32>, <tf.Tensor 'DecodeCSV:3' shape=() dtype=float32>, <tf.Tensor 'DecodeCSV:4' shape=() dtype=float32>, <tf.Tensor 'DecodeCSV:5' shape=() dtype=float32>, <tf.Tensor 'DecodeCSV:6' shape=() dtype=float32>, <tf.Tensor 'DecodeCSV:7' shape=() dtype=float32>, <tf.Tensor 'DecodeCSV:8' shape=() dtype=float32>] 



fields hai  [<tf.Tensor 'DecodeCSV:0' shape=() dtype=float32>, <tf.Tensor 'DecodeCSV:1' shape=() dtype=float32>, <tf.Tensor 'DecodeCSV:2' shape=() dtype=float32>, <tf.Tensor 'DecodeCSV:3' shape=() dtype=float32>, <tf.Tensor 'DecodeCSV:4' shape=() dtype=float32>, <tf.Tensor 'DecodeCSV:5' shape=() dtype=float32>, <tf.Tensor 'DecodeCSV:6' shape=() dtype=float32>, <tf.Tensor 'DecodeCSV:7' shape=() dtype=float32>, <tf.Tensor 'DecodeCSV:8' shape=() dtype=float32>] 



fields hai  [<tf.Tensor 'DecodeCSV:0' shape=() dtype=float32>, <tf

<_PrefetchDataset element_spec=(TensorSpec(shape=(None, 8), dtype=tf.float32, name=None), TensorSpec(shape=(None, 1), dtype=tf.float32, name=None))>

In [44]:
train_set

<_PrefetchDataset element_spec=(TensorSpec(shape=(None, 8), dtype=tf.float32, name=None), TensorSpec(shape=(None, 1), dtype=tf.float32, name=None))>

In [45]:
c = 0
for item in train_set:
    c+=1
    # print(item)

In [46]:
c

363

And now we can simply build and train a Keras model using these datasets. All we need to do is pass the training and validation datasets to the fit() method, instead of X_train, y_train, X_valid, and y_valid:


In [47]:
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

model = keras.models.Sequential([ keras.layers.Dense(30, activation="relu", input_shape=X_train.shape[1:]), 
                                 keras.layers.Dense(1), ])




In [48]:
model.compile(loss="mse", optimizer=keras.optimizers.SGD(learning_rate=1e-3))

In [49]:
batch_size = 32
model.fit(train_set,  epochs=20,
          validation_data=valid_set)

Epoch 1/20

Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.src.callbacks.History at 0x29ae4f5a510>

In [50]:
model.evaluate(test_set)



0.4256276488304138

In [51]:
c = 0
for k in test_set.take(3):
    # print(k[1])
    c=0
    

In [52]:
new_set = test_set.take(3)
new_set

<_TakeDataset element_spec=(TensorSpec(shape=(None, 8), dtype=tf.float32, name=None), TensorSpec(shape=(None, 1), dtype=tf.float32, name=None))>

In [53]:
new_set = test_set.take(3).map(lambda X, y: X) # Here we are basically taking the features, i.e. X and the labels i.e. y and only keeping the features i.e. X.
# If you look at the test set, it consists of batches and each batch has two tensors, the features, i.e. X and the labels, i.e, y. Remember that X and y that
# we are using here are just for naming in general. Now you can refer the below for loop and try printing k[0], which will give you the features tensor
# or k[1] that will give you the label tensor.

In [54]:
for k in new_set:
    c= 0
    # print(k)

## The TFRecord Format

The TFRecord format is TensorFlow’s preferred format for storing large amounts of data and reading it efficiently. 

It is a very simple binary format that just contains a sequence of binary records of varying sizes (each record is comprised of a length, a CRC checksum to check that the length was not corrupted, then the actual data, and finally a CRC checksum for the data).

You can easily create a TFRecord file using the tf.io.TFRecordWriter class:


In [55]:
with tf.io.TFRecordWriter("my_data.tfrecord") as f:
    f.write(b"This is the first record")
    f.write(b"And this is the second rord")

And you can then use a tf.data.TFRecordDataset to read one or more TFRecord files:

In [56]:
filepaths = ["my_data.tfrecord"]
dataset = tf.data.TFRecordDataset(filepaths)

In [57]:
dataset

<TFRecordDatasetV2 element_spec=TensorSpec(shape=(), dtype=tf.string, name=None)>

In [58]:
for item in dataset:
    print(item)

tf.Tensor(b'This is the first record', shape=(), dtype=string)
tf.Tensor(b'And this is the second rord', shape=(), dtype=string)


#### Note about TFRecord
By default, a TFRecordDataset will read files one by one, but you can make it read multiple files in parallel and interleave their records by setting num_parallel_reads. Alternatively, you could obtain the same result by using list_files() and interleave() as we did earlier to read multiple CSV files.

## Compressed TFRecord Files

It can sometimes be useful to compress your TFRecord files, especially if they need to be loaded via a network connection. You can create a compressed TFRecord file by setting the options argument:

In [59]:
options = tf.io.TFRecordOptions(compression_type="GZIP")

In [60]:
with tf.io.TFRecordWriter("my_compressed.tfrecord", options) as myfile:
    myfile.write('Record 1 for comp file')
    myfile.write('Record 2 for my comp file')
    

#### When reading a compressed TFRecord file, you need to specify the compression type:

In [61]:
dataset = tf.data.TFRecordDataset(["my_compressed.tfrecord"],compression_type="GZIP")

In [62]:
for item in dataset:
    print(item)

tf.Tensor(b'Record 1 for comp file', shape=(), dtype=string)
tf.Tensor(b'Record 2 for my comp file', shape=(), dtype=string)


## A Brief Introduction to Protocol Buffers

Even though each record can use any binary format you want, TFRecord files usually contain serialized protocol buffers (also called protobufs).

This is a portable, extensible, and efficient binary format developed at Google back in 2001 and made open source in 2008; protobufs are now widely used, in particular in gRPC, Google’s remote procedure call system.

### What is Remote Procedure Call (RPC)?
Remote Procedure Call (RPC) is a type of technology used in computing to enable a program to request a service from software located on another computer in a network without needing to understand the network’s details. RPC abstracts the complexities of the network by allowing the developer to think in terms of function calls rather than network details, facilitating the process of making a piece of software distributed across different systems.

RPC works by allowing one program (a client) to directly call procedures (functions) on another machine (the server). The client makes a procedure call that appears to be local but is run on a remote machine. When an RPC is made, the calling arguments are packaged and transmitted across the network to the server. The server unpacks the arguments, performs the desired procedure, and sends the results back to the client. 

They are defined using a simple language that looks like this:


First let's write a simple protobuf definition:

In [63]:
%%writefile person.proto
syntax = "proto3";
message Person {
  string name = 1;
  int32 id = 2;
  repeated string email = 3;
}

Overwriting person.proto


1) This definition says we are using version 3 of the protobuf format, and it specifies that each Person object may (optionally) have a name of type string, an id of type int32, and zero or more email fields, each of type string.

2) The numbers 1, 2, and 3 are the field identifiers: they will be used in each record’s binary representation.

3) Once you have a definition in a .proto file, you can compile it. This requires protoc, the protobuf compiler, to generate access classes in Python (or some other language).


4) To illustrate the basics, let’s look at a simple example that uses the access classes generated for the Person protobuf (the code is explained in the comments):

In [64]:
# >>> from person_pb2 import Person # import the generated access class
# >>> person = Person(name="Al", id=123, email=["a@b.com"]) # create a Person
# >>> print(person) # display the Person
# name: "Al"
# id: 123
# email: "a@b.com"
# >>> person.name # read a field
# "Al"
# >>> person.name = "Alice" # modify a field
# >>> person.email[0] # repeated fields can be accessed like arrays
# "a@b.com"
# >>> person.email.append("c@d.com") # add an email address
# >>> s = person.SerializeToString() # serialize the object to a byte string
# >>> s
# b'\n\x05Alice\x10{\x1a\x07a@b.com\x1a\x07c@d.com'
# >>> person2 = Person() # create a new Person
# >>> person2.ParseFromString(s) # parse the byte string (27 bytes long)
# 27
# >>> person == person2 # now they are equal
# True


#### Code summarization
In short, we import the Person class generated by protoc, we create an instance and play with it, visualizing it and reading and writing some fields, then we serialize it using the SerializeToString() method. This is the binary data that is ready to be saved or transmitted over the network. When reading or receiving this binary data, we can parse it using the ParseFromString() method, and we get a copy of the object that was serialized.

### To learn more about protobufs, please visit https://homl.info/protobuf

### Important Note about SerializeToString and ParseFromString functions

We could save the serialized Person object to a TFRecord file, then we could load and parse it: everything would work fine.

However, SerializeToString() and ParseFromString() are not TensorFlow operations (and neither are the other operations in this code), so they cannot be included in a TensorFlow Function (except by wrapping them in a tf.py_function() operation, which would make the code slower and less portable.

Fortunately, TensorFlow does include special protobuf definitions for which it provides parsing operations.

## TensorFlow Protobufs

The main protobuf typically used in a TFRecord file is the Example protobuf, which represents one instance in a dataset. It contains a list of named features, where each feature can either be a list of byte strings, a list of floats, or a list of integers. Here is the protobuf definition:

In [65]:
%%writefile person.proto
syntax = "proto3";
message BytesList { repeated bytes value = 1; }
message FloatList { repeated float value = 1 [packed = true]; }
message Int64List { repeated int64 value = 1 [packed = true]; }
message Feature {
 oneof kind {
 BytesList bytes_list = 1;
 FloatList float_list = 2;
 Int64List int64_list = 3;
 }
};
message Features { map<string, Feature> feature = 1; };
message Example { Features features = 1; };

Overwriting person.proto


#### Explanation of above code
The definitions of BytesList, FloatList, and Int64List are straightforward enough. Note that [packed = true] is used for repeated numerical fields, for a more efficient encoding. A Feature contains either a BytesList, a FloatList, or an Int64List. A Features (with an s) contains a dictionary that maps a feature name to the corresponding feature value. And finally, an Example contains only a Features object(Why was Example even defined, since it contains no more than a Features object? Well, TensorFlow’s developers may one day decide to add more fields to it. As long as the new Example definition still contains the features field, with the same ID, it will be backward compatible. This extensibility is one of the great features of protobufs.). Here is how you could create a tf.train.Example representing the same person as earlier and write it to a TFRecord file:

In [66]:
from tensorflow.train import BytesList, FloatList, Int64List
from tensorflow.train import Feature, Features, Example

In [67]:
person_example = Example(
 features=Features(
 feature={
 "name": Feature(bytes_list=BytesList(value=[b"Alice"])),"id": Feature(int64_list=Int64List(value=[123])),
 "emails": Feature(bytes_list=BytesList(value=[b"a@b.com",
 b"c@d.com"]))
 }))

Now that we have an Example protobuf, we can serialize it by calling its SerializeToString() method, then write the resulting data to a TFRecord file:

In [68]:
with tf.io.TFRecordWriter("my_contacts.tfrecord") as f:
    f.write(person_example.SerializeToString())

#### Important!!

Normally you would write much more than one Example! Typically, you would create a conversion script that reads from your current format (say, CSV files), creates an Example protobuf for each instance, serializes them, and saves them to several TFRecord files, ideally shuffling them in the process. This requires a bit of work, so once again make sure it is really necessary (perhaps your pipeline works fine with CSV files).

#### Now that we have a nice TFRecord file containing a serialized Example, let’s try to load it.

## Loading and Parsing Examples

1) To load the serialized Example protobufs, we will use a tf.data.TFRecordDataset once again, and we will parse each Example using tf.io.parse_single_example(). This is a TensorFlow operation, so it can be included in a TF Function. It requires at least two arguments: a string scalar tensor containing the serialized data, and a description of each feature.

2) The description is a dictionary that maps each feature name to either a tf.io.FixedLenFeature descriptor indicating the feature’s shape, type, and default value, or a tf.io.VarLenFeature descriptor indicating only the type (if the length of the feature’s list may vary, such as for the "emails" feature).

The following code defines a description dictionary, then it iterates over the TFRecord Dataset and parses the serialized Example protobuf this dataset contains:

In [69]:
feature_description = {
 "name": tf.io.FixedLenFeature([], tf.string, default_value=""),
 "id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
 "emails": tf.io.VarLenFeature(tf.string),
}


In [70]:
for serialized_example in tf.data.TFRecordDataset(["my_contacts.tfrecord"]):
    parsed_example = tf.io.parse_single_example(serialized_example,feature_description)

The fixed-length features are parsed as regular tensors, but the variable-length features are parsed as sparse tensors. You can convert a sparse tensor to a dense tensor using tf.sparse.to_dense(), but in this case it is simpler to just access its values:

In [71]:
tf.sparse.to_dense(parsed_example["emails"], default_value=b"")


<tf.Tensor: shape=(2,), dtype=string, numpy=array([b'a@b.com', b'c@d.com'], dtype=object)>

In [72]:
parsed_example["emails"].values


<tf.Tensor: shape=(2,), dtype=string, numpy=array([b'a@b.com', b'c@d.com'], dtype=object)>

### Important Note about ByteList and tf.parser!!!!!!!!!!!!!!!!!!!!!!!!!!!!

A BytesList can contain any binary data you want, including any serialized object. For example, you can use tf.io.encode_jpeg() to encode an image using the JPEG format and put this binary data in a BytesList. Later, when your code reads the TFRecord, it will start by parsing the Example, then it will need to call tf.io.decode_jpeg() to parse the data and get the original image (or you can use tf.io.decode_image(), which can decode any BMP, GIF, JPEG, or PNG image). You can also store any tensor you want in a BytesList by serializing the tensor using tf.io.serialize_tensor() then putting the resulting byte string in a BytesList feature. Later, when you parse the TFRecord, you can parse this data using tf.io.parse_tensor().

Instead of parsing examples one by one using tf.io.parse_single_example(), you may want to parse them batch by batch using tf.io.parse_example():

In [73]:
dataset = tf.data.TFRecordDataset(["my_contacts.tfrecord"]).batch(10)

In [74]:
for j in dataset:
    print(j)

tf.Tensor([b'\n@\n\x11\n\x04name\x12\t\n\x07\n\x05Alice\n\x1e\n\x06emails\x12\x14\n\x12\n\x07a@b.com\n\x07c@d.com\n\x0b\n\x02id\x12\x05\x1a\x03\n\x01{'], shape=(1,), dtype=string)


In [75]:
for serialized_examples in dataset:
    parsed_examples = tf.io.parse_example(serialized_examples,feature_description)

In [76]:
parsed_examples

{'emails': SparseTensor(indices=tf.Tensor(
 [[0 0]
  [0 1]], shape=(2, 2), dtype=int64), values=tf.Tensor([b'a@b.com' b'c@d.com'], shape=(2,), dtype=string), dense_shape=tf.Tensor([1 2], shape=(2,), dtype=int64)),
 'id': <tf.Tensor: shape=(1,), dtype=int64, numpy=array([123], dtype=int64)>,
 'name': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Alice'], dtype=object)>}

## SequenceExample

As you can see, the Example protobuf will probably be sufficient for most use cases. However, it may be a bit cumbersome to use when you are dealing with lists of lists. For example, suppose you want to classify text documents. Each document may be represented as a list of sentences, where each sentence is represented as a list of words. And perhaps each document also has a list of comments, where each comment is represented as a list of words. There may be some contextual data too, such as the document’s author, title, and publication date. TensorFlow’s SequenceExample protobuf is designed for such use cases.

### Handling Lists of Lists Using the SequenceExample Protobuf

Here is the definition of the SequenceExample protobuf:

In [77]:
%%writefile sequence_exmp.proto
message FeatureList { repeated Feature feature = 1; };
message FeatureLists { map<string, FeatureList> feature_list = 1; };
message SequenceExample {
 Features context = 1;
 FeatureLists feature_lists = 2;
};

Overwriting sequence_exmp.proto


1) A SequenceExample contains a Features object for the contextual data and a FeatureLists object that contains one or more named FeatureList objects (e.g., a FeatureList named "content" and another named "comments").

2) Each FeatureList contains a list of Feature objects, each of which may be a list of byte strings, a list of 64-bit integers, or a list of floats (in this example, each Feature would represent a sentence or a comment, perhaps in the form of a list of word identifiers).

3) Building a SequenceExample, serializing it, and parsing it is similar to building, serializing, and parsing an Example, but you must use tf.io.parse_single_sequence_example() to parse a single SequenceExample or tf.io.parse_sequence_example() to parse a batch.

4) Both functions return a tuple containing the context features (as a dictionary) and the feature lists (also as a dictionary). If the feature lists contain sequences of varying sizes (as in the preceding example), you may want to convert them to ragged tensors, using tf.RaggedTensor.from_sparse()

#### Below we will see the example of working with sequenceExample

In [78]:
FeatureList = tf.train.FeatureList
FeatureLists = tf.train.FeatureLists
SequenceExample = tf.train.SequenceExample

context = Features(feature={
    "author_id": Feature(int64_list=Int64List(value=[123])),
    "title": Feature(bytes_list=BytesList(value=[b"A", b"desert", b"place", b"."])),
    "pub_date": Feature(int64_list=Int64List(value=[1623, 12, 25]))
})

content = [["When", "shall", "we", "three", "meet", "again", "?"],
           ["In", "thunder", ",", "lightning", ",", "or", "in", "rain", "?"]]
comments = [["When", "the", "hurlyburly", "'s", "done", "."],
            ["When", "the", "battle", "'s", "lost", "and", "won", "."]]

def words_to_feature(words):
    return Feature(bytes_list=BytesList(value=[word.encode("utf-8")
                                               for word in words]))

content_features = [words_to_feature(sentence) for sentence in content]
comments_features = [words_to_feature(comment) for comment in comments]
            
sequence_example = SequenceExample(
    context=context,
    feature_lists=FeatureLists(feature_list={
        "content": FeatureList(feature=content_features),
        "comments": FeatureList(feature=comments_features)
    }))

In [79]:
serialized_sequence_example = sequence_example.SerializeToString()

In [80]:
context_feature_descriptions = {
    "author_id": tf.io.FixedLenFeature([], tf.int64, default_value=0),
    "title": tf.io.VarLenFeature(tf.string),
    "pub_date": tf.io.FixedLenFeature([3], tf.int64, default_value=[0, 0, 0]),
}
sequence_feature_descriptions = {
    "content": tf.io.VarLenFeature(tf.string),
    "comments": tf.io.VarLenFeature(tf.string),
}
parsed_context, parsed_feature_lists = tf.io.parse_single_sequence_example(
    serialized_sequence_example, context_feature_descriptions,
    sequence_feature_descriptions)

In [81]:
parsed_context, parsed_feature_lists = tf.io.parse_single_sequence_example(
 serialized_sequence_example, context_feature_descriptions,
 sequence_feature_descriptions)
parsed_content = tf.RaggedTensor.from_sparse(parsed_feature_lists["content"])

In [82]:

parsed_context

{'title': SparseTensor(indices=tf.Tensor(
 [[0]
  [1]
  [2]
  [3]], shape=(4, 1), dtype=int64), values=tf.Tensor([b'A' b'desert' b'place' b'.'], shape=(4,), dtype=string), dense_shape=tf.Tensor([4], shape=(1,), dtype=int64)),
 'author_id': <tf.Tensor: shape=(), dtype=int64, numpy=123>,
 'pub_date': <tf.Tensor: shape=(3,), dtype=int64, numpy=array([1623,   12,   25], dtype=int64)>}

In [83]:
print(tf.RaggedTensor.from_sparse(parsed_feature_lists["content"]))

<tf.RaggedTensor [[b'When', b'shall', b'we', b'three', b'meet', b'again', b'?'],
 [b'In', b'thunder', b',', b'lightning', b',', b'or', b'in', b'rain', b'?']]>


## Preprocessing the Input Features

Preparing your data for a neural network requires converting all features into numerical features, generally normalizing them, and more.

In particular, if your data contains categorical features or text features, they need to be converted to numbers.

This can be done ahead of time when preparing your data files, using any tool you like (e.g., NumPy, pandas, or Scikit-Learn).

Alternatively, you can preprocess your data on the fly when loading it with the Data API (e.g., using the dataset’s map() method, as we saw earlier), or you can include a preprocessing layer directly in your model.

Let’s look at this last option now.

For example, here is how you can implement a standardization layer using a Lambda layer. For each feature, it subtracts the mean and divides by its standard deviation (plus a tiny smoothing term to avoid division by zero):

In [84]:
means = np.mean(X_train, axis=0, keepdims=True)
stds = np.std(X_train, axis=0, keepdims=True)

In [85]:
eps = keras.backend.epsilon()

In [86]:
model = keras.models.Sequential([keras.layers.Lambda(lambda inputs: (inputs - means) / (stds + eps))
                                 ### [Other layers]
                                ])
                                 

However, you may prefer to use a nice self-contained custom layer (much like Scikit-Learn’s StandardScaler), rather than having global variables like means and stds dangling around:

In [87]:
class Standardization(keras.layers.Layer):
    def adapt(self, data_sample):
        self.means_ = np.mean(data_sample, axis=0, keepdims=True)
        self.stds_ = np.std(data_sample, axis=0, keepdims=True)
    def call(self, inputs):
        return (inputs - self.means_) / (self.stds_ + keras.backend.epsilon())


####  IMPORTANT POINT!!!!!!
#### Before you can use this standardization layer, you will need to adapt it to your dataset by calling the adapt() method and passing it a data sample. This will allow it to use the appropriate mean and standard deviation for each feature:

In [88]:
(X_train_full, y_train_full), (X_test, y_test) = keras.datasets.fashion_mnist.load_data()
X_valid, X_train = X_train_full[:5000], X_train_full[5000:]
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]

In [89]:
train_set = tf.data.Dataset.from_tensor_slices((X_train, y_train)).shuffle(len(X_train))
valid_set = tf.data.Dataset.from_tensor_slices((X_valid, y_valid))
test_set = tf.data.Dataset.from_tensor_slices((X_test, y_test))

In [90]:
def preprocess(tfrecord):
    feature_descriptions = {
        "image": tf.io.FixedLenFeature([], tf.string, default_value=""),
        "label": tf.io.FixedLenFeature([], tf.int64, default_value=-1)
    }
    example = tf.io.parse_single_example(tfrecord, feature_descriptions)
    image = tf.io.parse_tensor(example["image"], out_type=tf.uint8)
    #image = tf.io.decode_jpeg(example["image"])
    image = tf.reshape(image, shape=[28, 28])
    return image, example["label"]

def mnist_dataset(filepaths, n_read_threads=5, shuffle_buffer_size=None,
                  n_parse_threads=5, batch_size=32, cache=True):
    dataset = tf.data.TFRecordDataset(filepaths,
                                      num_parallel_reads=n_read_threads)
    if cache:
        dataset = dataset.cache()
    if shuffle_buffer_size:
        dataset = dataset.shuffle(shuffle_buffer_size)
    dataset = dataset.map(preprocess, num_parallel_calls=n_parse_threads)
    dataset = dataset.batch(batch_size)
    return dataset.prefetch(1)

In [91]:
train_set = mnist_dataset(train_filepaths, shuffle_buffer_size=60000)
valid_set = mnist_dataset(valid_filepaths)
test_set = mnist_dataset(test_filepaths)

In [92]:
std_layer = Standardization()

In [93]:
# sample_image_batches = train_set.take(100).map(lambda image, label: image)
# sample_images = np.concatenate(list(sample_image_batches.as_numpy_iterator()),
#                                axis=0).astype(np.float32)
# std_layer.adapt(sample_images)


This sample must be large enough to be representative of your dataset, but it does not have to be the full training set: in general, a few hundred randomly selected instances will suffice (however, this depends on your task). Next, you can use this preprocessing layer like a normal layer:

In [94]:
model = keras.models.Sequential([
    std_layer,
    keras.layers.Flatten(),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(10, activation="softmax")
])


If you are thinking that Keras should contain a standardization layer like this one the keras.layers.Normalization works very much like our custom Standardization layer: first, create the layer, then adapt it to your dataset by passing
a data sample to the adapt() method, and finally use the layer normally For adapt example see https://www.tensorflow.org/guide/keras/preprocessing_layers#the_adapt_method.

In [95]:
#standardization = keras.layers.Normalization()

## Encoding Categorical Features Using One-Hot Vectors

Now let’s look at categorical features. We will start by encoding them as one-hot vectors.

Consider the ocean_proximity feature in the California housing dataset we explored in Chapter 2: it is a categorical feature with five possible values: "<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", and "ISLAND".

We need to encode this feature before we feed it to a neural network. Since there are very few categories, we can use one-hot encoding. For this, we first need to map each category to its index (0 to 4), which can be done using a lookup table:

In [96]:
vocab = ["<1H OCEAN", "INLAND", "NEAR OCEAN", "NEAR BAY", "ISLAND"]

In [97]:
indices = tf.range(len(vocab), dtype=tf.int64)
indices

<tf.Tensor: shape=(5,), dtype=int64, numpy=array([0, 1, 2, 3, 4], dtype=int64)>

In [98]:
table_init = tf.lookup.KeyValueTensorInitializer(vocab, indices)
table_init

<tensorflow.python.ops.lookup_ops.KeyValueTensorInitializer at 0x29ae4f14810>

In [99]:
num_oov_buckets = 2


In [100]:
table = tf.lookup.StaticVocabularyTable(table_init, num_oov_buckets)

In [101]:
table

<tensorflow.python.ops.lookup_ops.StaticVocabularyTable at 0x29ae853c910>

#### Let’s go through this code:

1) We first define the vocabulary: this is the list of all possible categories.

2) Then we create a tensor with the corresponding indices (0 to 4).

#### Important Point!!!

3) Next, we create an initializer for the lookup table, passing it the list of categories and their corresponding indices. In this example, we already have this data, so we use a KeyValueTensorInitializer; but if the categories were listed in a text file (with one category per line), we would use a TextFileInitializer instead.

4) In the last two lines we create the lookup table, giving it the initializer and specifying the number of out-of-vocabulary (oov) buckets.

#### Important Point!!!
If we look up a category that does not exist in the vocabulary, the lookup table will compute a hash of this category and use it to assign the unknown category to one of the oov buckets. Their indices start after the known categories, so in this example the indices of the two oov buckets are 5 and 6. So what this means is that the new category can either be assigned at index 5 or 6, the index here is being considered as bucket. It depends on the lookup table. 

#### In the below example of categories = tf.constant(["NEAR BAY", "DESERT", "INLAND", "INLAND","HAWAII"]), there are two new values, and hence one got assigned to index 5 and other to index 6 as per our num_oov_value parameter.

#### However, if we use categories = tf.constant(["NEAR BAY", "DESERT", "INLAND", "INLAND","HAWAII","BALI"]), then both Hawaii and bali got assigned to index 6.

### oov buckets

Why use oov buckets? Well, if the number of categories is large (e.g., zip codes, cities, words, products, or users) and the dataset is large as well, or it keeps changing, then getting the full list of categories may not be convenient.

One solution is to define the vocabulary based on a data sample (rather than the whole training set) and add some oov buckets for the other categories that were not in the data sample.

The more unknown categories you expect to find during training, the more oov buckets you should use. Indeed, if there are not enough oov buckets, there will be collisions: different categories will end up in the same bucket, so the neural network will not be able to distinguish them (at least not based on this feature).

Now let’s use the lookup table to encode a small batch of categorical features to onehot vectors:

In [102]:
categories = tf.constant(["NEAR BAY", "DESERT", "INLAND", "INLAND","HAWAII","BALI"])
categories

<tf.Tensor: shape=(6,), dtype=string, numpy=
array([b'NEAR BAY', b'DESERT', b'INLAND', b'INLAND', b'HAWAII', b'BALI'],
      dtype=object)>

In [103]:
for j in categories:
    print(j)

tf.Tensor(b'NEAR BAY', shape=(), dtype=string)
tf.Tensor(b'DESERT', shape=(), dtype=string)
tf.Tensor(b'INLAND', shape=(), dtype=string)
tf.Tensor(b'INLAND', shape=(), dtype=string)
tf.Tensor(b'HAWAII', shape=(), dtype=string)
tf.Tensor(b'BALI', shape=(), dtype=string)


In [104]:
cat_indices = table.lookup(categories)
cat_indices

<tf.Tensor: shape=(6,), dtype=int64, numpy=array([3, 5, 1, 1, 6, 6], dtype=int64)>

In [105]:
cat_one_hot = tf.one_hot(cat_indices, depth=len(vocab) + num_oov_buckets)
cat_one_hot

<tf.Tensor: shape=(6, 7), dtype=float32, numpy=
array([[0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0.],
       [0., 1., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1.],
       [0., 0., 0., 0., 0., 0., 1.]], dtype=float32)>

As you can see, "NEAR BAY" was mapped to index 3, the unknown category "DESERT" was mapped to one of the two oov buckets (at index 5), and "INLAND" was mapped to index 1, twice and HAWAII got mapped to index 6. Then we used tf.one_hot() to one-hot encode these indices.

Notice that we have to tell this function the total number of indices, which is equal to the vocabulary size plus the number of oov buckets.

### Creating the Pipeline, Instructions

1) Just like earlier, it wouldn’t be too difficult to bundle all of this logic into a nice selfcontained class. Its adapt() method would take a data sample and extract all the distinct categories it contains. 

2) It would create a lookup table to map each category to its index (including unknown categories using oov buckets). Then its call() method would use the lookup table to map the input categories to their indices.

3) Keras has a layer called keras.layers.TextVectorization, which will be capable of doing exactly that: its adapt() method will extract the vocabulary from a data sample, and its call() method will convert each category to its index in the vocabulary. For adapt method see https://www.tensorflow.org/guide/keras/preprocessing_layers#the_adapt_method

4) You could add this layer at the beginning of your model, followed by a Lambda layer that would apply the tf.one_hot() function, if you want to convert these indices to one-hot vectors.

In [106]:
keras.layers.TextVectorization

keras.src.layers.preprocessing.text_vectorization.TextVectorization

### IMPORTANT POINT!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

As a rule of thumb, if the number of categories is lower than 10, then one-hot encoding is generally the way to go (but your mileage may vary!). If the number of categories is greater than 50 (which is often the case when you use hash buckets), then embeddings are usually preferable. In between 10 and 50 categories, you may want to experiment with both options and see which one works best for your use case.

## IMPORTANT!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! Encoding Categorical Features Using Embeddings

An embedding is a trainable dense vector that represents a category. By default, embeddings are initialized randomly, so for example the "NEAR BAY" category could be represented initially by a random vector such as [0.131, 0.890]. while the "NEAR OCEAN" category might be represented by another random vector such as [0.631,0.791].

In this example, we use 2D embeddings, but the number of dimensions is a hyperparameter you can tweak.

#### IMPORTANT POINT !!!!!!!!!!!
Since these embeddings of "NEAR BAY" and "NEAR OCEAN" are trainable, they will gradually improve during training; and as they represent fairly similar categories, Gradient Descent will certainly end up pushing them closer together, while it will tend to move them away from the "INLAND" category’s embedding.

### Representation learning 

The better the representation, the easier it will be for the neural network to make accurate predictions, so training tends to make embeddings useful representations of the categories. This is called representation learning.


## Word Embeddings

Not only will embeddings generally be useful representations for the task at hand, but quite often these same embeddings can be reused successfully for other tasks. The most common example of this is word embeddings (i.e., embeddings of individual words): when you are working on a natural language processing task, you are often better off reusing pretrained word embeddings than training your own, i.e. good vector representations of the words.

The idea of using vectors to represent words dates back to the 1960s, and many sophisticated techniques have been used to generate useful vectors, including using neural networks. But things really took off in 2013, when Tomáš Mikolov and other Google researchers published a paper describing an efficient technique to learn word embeddings using neural networks, significantly outperforming previous attempts.

This allowed them to get embeddings on a very large corpus of text: they trained a neural network to predict the words near any given word, and obtained astounding word embeddings. For example, synonyms had very close embeddings, and semantically related words such as France, Spain, and Italy ended up clustered together.

It’s not just about proximity, though: word embeddings were also organized along meaningful axes in the embedding space. Here is a famous example: if you compute King – Man + Woman (adding and subtracting the embedding vectors of these words), then the result will be very close to the embedding of the word Queen. In other words, the word embeddings encode the concept of gender!

Similarly, you can compute Madrid – Spain + France, and the result is close to Paris, which seems to show that the notion of capital city was also encoded in the embeddings.

## Implementing Word Embeddings

Let’s look at how we could implement embeddings manually, to understand how they work (then we will use a simple Keras layer instead). 

1) First, we need to create an embedding matrix containing each category’s embedding, initialized randomly; it will have one row per category and per oov bucket, and one column per embedding dimension:

In [107]:
embedding_dim = 2

In [108]:
embed_init = tf.random.uniform([len(vocab) + num_oov_buckets, embedding_dim])
embed_init

<tf.Tensor: shape=(7, 2), dtype=float32, numpy=
array([[0.7413678 , 0.62854624],
       [0.01738465, 0.3431449 ],
       [0.51063764, 0.3777541 ],
       [0.07321596, 0.02137029],
       [0.2871771 , 0.4710616 ],
       [0.6936141 , 0.07321334],
       [0.93251204, 0.20843053]], dtype=float32)>

In [109]:
embedding_matrix = tf.Variable(embed_init)
embedding_matrix

<tf.Variable 'Variable:0' shape=(7, 2) dtype=float32, numpy=
array([[0.7413678 , 0.62854624],
       [0.01738465, 0.3431449 ],
       [0.51063764, 0.3777541 ],
       [0.07321596, 0.02137029],
       [0.2871771 , 0.4710616 ],
       [0.6936141 , 0.07321334],
       [0.93251204, 0.20843053]], dtype=float32)>

In this example we are using 2D embeddings, but as a rule of thumb embeddings typically have 10 to 300 dimensions, depending on the task and the vocabulary size (you will have to tune this hyperparameter).

2) This embedding matrix is a random 6 × 2 matrix, stored in a variable (so it can be
tweaked by Gradient Descent during training):

In [110]:
embedding_matrix

<tf.Variable 'Variable:0' shape=(7, 2) dtype=float32, numpy=
array([[0.7413678 , 0.62854624],
       [0.01738465, 0.3431449 ],
       [0.51063764, 0.3777541 ],
       [0.07321596, 0.02137029],
       [0.2871771 , 0.4710616 ],
       [0.6936141 , 0.07321334],
       [0.93251204, 0.20843053]], dtype=float32)>

3) Now let’s encode the same batch of categorical features as earlier, but this time using these embeddings:

In [111]:
categories = tf.constant(["NEAR BAY", "DESERT", "INLAND", "INLAND"])

In [112]:
cat_indices = table.lookup(categories)
cat_indices

<tf.Tensor: shape=(4,), dtype=int64, numpy=array([3, 5, 1, 1], dtype=int64)>

In [113]:
tf.nn.embedding_lookup(embedding_matrix, cat_indices)

<tf.Tensor: shape=(4, 2), dtype=float32, numpy=
array([[0.07321596, 0.02137029],
       [0.6936141 , 0.07321334],
       [0.01738465, 0.3431449 ],
       [0.01738465, 0.3431449 ]], dtype=float32)>

The tf.nn.embedding_lookup() function looks up the rows in the embedding matrix, at the given indices—that’s all it does. For example, the lookup table says that the "INLAND" category is at index 1, so the tf.nn.embedding_lookup() function returns the embedding at row 1 in the embedding matrix (twice): [0.3528825, 0.46448255].

### keras.layers.Embedding

Keras provides a keras.layers.Embedding layer that handles the embedding matrix (trainable, by default); when the layer is created it initializes the embedding matrix randomly, and then when it is called with some category indices it returns the rows at those indices in the embedding matrix:

In [114]:
embedding = keras.layers.Embedding(input_dim=len(vocab) + num_oov_buckets, output_dim=embedding_dim)
embedding

<keras.src.layers.core.embedding.Embedding at 0x29ae8578a90>

In [115]:
embedding(cat_indices)

<tf.Tensor: shape=(4, 2), dtype=float32, numpy=
array([[ 0.04739643,  0.02759985],
       [ 0.04873708,  0.02481348],
       [-0.01103915,  0.01602587],
       [-0.01103915,  0.01602587]], dtype=float32)>

#### Creating the Model


4) Putting everything together, we can now create a Keras model that can process categorical features (along with regular numerical features) and learn an embedding for each category (as well as for each oov bucket):

In [116]:
regular_inputs = keras.layers.Input(shape=[8]) # First Input
categories = keras.layers.Input(shape=[], dtype=tf.string) # Second Input
cat_indices = keras.layers.Lambda(lambda cats: table.lookup(cats))(categories)
cat_embed = keras.layers.Embedding(input_dim=6, output_dim=2)(cat_indices)
encoded_inputs = keras.layers.concatenate([regular_inputs, cat_embed])
outputs = keras.layers.Dense(1)(encoded_inputs)
model = keras.models.Model(inputs=[regular_inputs, categories],
 outputs=[outputs ]) # The encoded inputs are passed into the output layer

#### Explanation of the Model

This model takes two inputs: a regular input containing eight numerical features per instance, plus a categorical input (containing one categorical feature per instance). It uses a Lambda layer to look up each category’s index, then it looks up the embeddings for these indices. Next, it concatenates the embeddings and the regular inputs in order to give the encoded inputs, which are ready to be fed to a neural network. We could add any kind of neural network at this point, but we just add a dense output layer, and we create the Keras model.

### keras.layers.TextVectorization

With keras.layers.TextVectorization you can call its adapt() method to make it extract the vocabulary from a data sample (it will take care of creating the lookup table for you). Then you can add it to your model, and it will perform the index lookup (replacing the Lambda layer in the previous code example). For adapt example see https://www.tensorflow.org/guide/keras/preprocessing_layers#the_adapt_method

### Important Tip
One-hot encoding followed by a Dense layer (with no activation function and no biases) is equivalent to an Embedding layer. However, the Embedding layer uses way fewer computations (the performance difference becomes clear when the size of the embedding matrix grows). The Dense layer’s weight matrix plays the role of the embedding matrix. For example, using one-hot vectors of size 20 and a Dense layer with 10 units is equivalent to using an Embedding layer with input_dim=20 and output_dim=10. As a result, it would be wasteful to use more embedding dimensions than the number of units in the layer that follows the Embedding layer.

## Keras Preprocessing Layers

https://github.com/keras-team/governance/blob/master/rfcs/20190502-preprocessing-layers.md

TensorFlow provides a set of standard Keras preprocessing layers. We already discussed two of these layers: the keras.layers.Normalization layer that will perform feature standardization (it will be equivalent to the Standardization layer we defined earlier), and the TextVectorization layer that will be capable of encoding each word in the inputs into its index in the vocabulary. In both cases, you create the layer, you call its adapt() method with a data sample, and then you use the layer normally in your model. The other preprocessing layers will follow the same pattern.


The API will also include a keras.layers.Discretization layer that will chop continuous data into different bins and encode each bin as a one-hot vector. For example, you could use it to discretize prices into three categories, (low, medium, high), which would be encoded as [1, 0, 0], [0, 1, 0], and [0, 0, 1], respectively. Of course this loses a lot of information, but in some cases it can help the model detect patterns that would otherwise not be obvious when just looking at the continuous values.

### WARNING!!!!
The Discretization layer will not be differentiable, and it should only be used at the start of your model. Indeed, the model’s preprocessing layers will be frozen during training, so their parameters will not be affected by Gradient Descent, and thus they do not need to be differentiable. This also means that you should not use an Embedding layer directly in a custom preprocessing layer, if you want it to be trainable: instead, it should be added separately to your model, as in the previous code example.

#### PreprocessingStage

It will also be possible to chain multiple preprocessing layers using the Preproces singStage class. For example, the following code will create a preprocessing pipeline that will first normalize the inputs, then discretize them. After you adapt this pipeline to a data sample, you can use it like a regular layer in your models (but again, only at the start of the model, since it contains a nondifferentiable preprocessing layer):

In [117]:
normalization = keras.layers.Normalization()
discretization = keras.layers.Discretization()
# Removed in TF 2.0 pipeline = keras.layers.PreprocessingStage([normalization, discretization])
# pipeline.adapt(data_sample)

## TextVectorization and Bag of Words

The TextVectorization layer will also have an option to output word-count vectors instead of word indices. For example, if the vocabulary contains three words, say ["and", "basketball", "more"], then the text "more and more" will be mapped to the vector [1, 0, 2]: the word "and" appears once, the word "basketball" does not appear at all, and the word "more" appears twice. This text representation is called a bag of words, since it completely loses the order of the words.

### Bag of Words Problem

Common words like "and" will have a large value in most texts, even though they are usually the least interesting (e.g., in the text "more and more basketball" the word "basketball" is clearly the most important, precisely because it is not a very frequent word). So, the word counts should be normalized in a way that reduces the importance of frequent words.

A common way to do this is to divide each word count by the log of the total number of training instances( the documents) in which the word appears. This technique is called Term-Frequency × Inverse-Document-Frequency (TF-IDF).

For example, let’s imagine that the words "and", "basketball", and "more" appear respectively in 200, 10, and 100 text instances in the training set: Then for the text "more and more", the final vector will be [1/log(200), 0/log(10), 2/log(100)], which is approximately equal to [0.19, 0., 0.43]. The TextVectorization layer will (likely) have an option to perform TF-IDF. 

### subclassing the keras.layers.PreprocessingLayer

##### keras.layers.PreprocessingLayer not present in current TF

If the standard preprocessing layers are insufficient for your task, you will still have the option to create your own custom preprocessing layer, much like we did earlier with the Standardization class. Create a subclass of the keras.layers.PreprocessingLayer class with an adapt() method, which should take a data_sample argument and optionally an extra reset_state argument: if True, then the adapt() method should reset any existing state before computing the new state; if False, it should try to update the existing state.

## TF Transform

During training, it may be preferable to perform preprocessing ahead of time. Let’s see why we’d want to do that and how we’d go about it.

If preprocessing is computationally expensive, then handling it before training rather than on the fly may give you a significant speedup: the data will be preprocessed just once per instance before training, rather than once per instance and per epoch during training as in preprocessing during training, essentially there is a layer in the model that is being called for preprocessing, hence for every epoch it will be called, and hence every training instance will get preprocessed for epoch.

As mentioned earlier, if the dataset is small enough to fit in RAM, you can use its cache() method. But if it is too large, then tools like Apache Beam or Spark will help.

They let you run efficient data processing pipelines over large amounts of data, even distributed across multiple servers, so you can use them to preprocess all the training data before training.

### Roadblock Pre-Processing

This methods described above sound great and indeed can speed up training, but there is one problem: once your model is trained, suppose you want to deploy it to a mobile app. 

1) In that case you will need to write some code in your app to take care of preprocessing the data before it is fed to the model. And suppose you also want to deploy the model to TensorFlow.js so that it runs in a web browser? Once again, you will need to write some pre‐processing code.

2) This can become a maintenance nightmare: whenever you want to change the preprocessing logic, you will need to update your Apache Beam code, your mobile app code, and your JavaScript code.

3) This is not only time-consuming, but also error-prone: you may end up with subtle differences between the preprocessing operations performed before training and the ones performed in your app or in the browser. This training/serving skew will lead to bugs or degraded performance.

### Possible Solution

One improvement would be to take the trained model (trained on data that was preprocessed by your Apache Beam or Spark code) and, before deploying it to your app or the browser, add extra preprocessing layers to take care of preprocessing on the fly. That’s definitely better, since now you just have two versions of your preprocessing code: the Apache Beam or Spark code, and the preprocessing layers’ code.


### TF Transform (Better Solution)

But what if you could define your preprocessing operations just once? This is what TF Transform was designed for.

It is part of TensorFlow Extended (TFX), an end-toend platform for productionizing TensorFlow models. https://www.tensorflow.org/tfx

In [118]:
from importlib.metadata import version

In [119]:
version('numpy')

'1.24.4'

You then define your preprocessing function just once (in Python), by using TF Transform functions for scaling, bucketizing, and more. You can also use any TensorFlow operation you need. Here is what this preprocessing function might look like if we just had two features:

In [None]:
# import tensorflow_transform as tft

In [121]:
def preprocess(inputs): # inputs = a batch of input features
    median_age = inputs["housing_median_age"]
    ocean_proximity = inputs["ocean_proximity"]
    standardized_age = tft.scale_to_z_score(median_age)
    ocean_proximity_id = tft.compute_and_apply_vocabulary(ocean_proximity)
    return {"standardized_median_age": standardized_age,"ocean_proximity_id": ocean_proximity_id}

Next, TF Transform lets you apply this preprocess() function to the whole training set using Apache Beam (it provides an AnalyzeAndTransformDataset class that you can use for this purpose in your Apache Beam pipeline).

#### Working of the Function!!!!

The function will compute all the necessary statistics over the whole training set: in this example, the mean and standard deviation of the housing_median_age feature, and the vocabulary for the ocean_proximity feature. The components that compute these statistics are called analyzers.

Importantly, TF Transform will also generate an equivalent TensorFlow Function that you can plug into the model you deploy. This TF Function includes some constants that correspond to all the all the necessary statistics computed by Apache Beam (the mean, standard deviation, and vocabulary).


## The TensorFlow Datasets (TFDS) Project

The TensorFlow Datasets project makes it very easy to download common datasets, from small ones like MNIST or Fashion MNIST to huge datasets like ImageNet (you will need quite a bit of disk space!). The list includes image datasets, text datasets (including translation datasets), and audio and video datasets. You can visit https://homl.info/tfds to view the full list, along with a description of each dataset

TFDS is not bundled with TensorFlow, so you need to install the tensorflowdatasets library (e.g., using pip). Then call the tfds.load() function, and it will download the data you want (unless it was already downloaded earlier) and return the data as a dictionary of datasets (typically one for training and one for testing, but this depends on the dataset you choose). For example, let’s download MNIST:

In [123]:
# import tensorflow_datasets as tfds

In [124]:
# dataset = tfds.load(name="mnist")
# mnist_train, mnist_test = dataset["train"], dataset["test"]

You can then apply any transformation you want (typically shuffling, batching, and prefetching), and you’re ready to train your model. Here is a simple example:

In [125]:
# mnist_train = mnist_train.shuffle(10000).batch(32).prefetch(1)
# for item in mnist_train:
#  images = item["image"]
#  labels = item["label"]
#  [...]


#### Note that each item in the dataset is a dictionary containing both the features and the labels. But Keras expects each item to be a tuple containing two elements (again, the features and the labels). You could transform the dataset using the map() method, like this:

In [126]:
# mnist_train = mnist_train.shuffle(10000).batch(32)
# mnist_train = mnist_train.map(lambda items: (items["image"], items["label"]))
# mnist_train = mnist_train.prefetch(1)


#### But it’s simpler to ask the load() function to do this for you by setting as_supervised=True (obviously this works only for labeled datasets). You can also specify the batch size if you want. Then you can pass the dataset directly to your tf.keras model:

In [127]:
# dataset = tfds.load(name="mnist", batch_size=32, as_supervised=True)
# mnist_train = dataset["train"].prefetch(1)
# model = keras.models.Sequential([...])
# model.compile(loss="sparse_categorical_crossentropy", optimizer="sgd")
# model.fit(mnist_train, epochs=5)