# Optimize Data Load and Preprocessing with tf.data

**Learning Objectives**
1. Learn how to use tf.data to read data from memory
1. Learn how to use tf.data to read data from disk
1. Learn how to write production input pipelines with feature engineering (batching, shuffling, etc.)
1. Learn how to optimize a pipeline with tf.data


In this notebook, we will start by refactoring the linear regression we implemented in the previous lab so that it takes its data from a `tf.data.Dataset`, and we will learn how to implement **stochastic gradient descent** with it. In this case, the original dataset will be synthetic and read by the `tf.data` API directly from memory.

We will use TensorFlow as the framework, but **`tf.data` also works with other frameworks such as JAX or PyTorch**.

In the second part, we will learn how to load a dataset with the `tf.data` API when the dataset resides on disk, and then learn how to optimize the data pipeline.

In [None]:
import os
import warnings

os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
warnings.filterwarnings("ignore")

import json
import math
from pprint import pprint

import numpy as np
import tensorflow as tf

print(tf.version.VERSION)

## Loading data from memory

### Creating the dataset

Let's consider the synthetic dataset of the previous section:

In [None]:
N_POINTS = 10
X = tf.constant(range(N_POINTS), dtype=tf.float32)
Y = 2 * X + 10

We begin by implementing a function that takes as input

- our $X$ and $Y$ vectors of synthetic data generated by the linear function $y = 2x + 10$
- the number of passes over the dataset we want to train on (`epochs`)
- the size of the batches in the dataset (`batch_size`)

and returns a `tf.data.Dataset`:

**Remark:** The last batch may contain fewer elements than `batch_size` if the dataset is exhausted before the batch is full.

To get batches that all have exactly the same number of elements, discard the last incomplete batch by setting:

```python
dataset = dataset.batch(batch_size, drop_remainder=True)
```

We will do that here.
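
As a quick illustration of the remark above, the sketch below batches a toy dataset of 10 integers (not the lab's `X` and `Y` tensors) with and without `drop_remainder`. The variable names are ours and only serve this example.

```python
import tensorflow as tf

# Toy dataset holding the integers 0..9 (illustrative only).
toy = tf.data.Dataset.range(10)

# Default behaviour: the final, smaller batch is kept.
sizes_keep = [int(batch.shape[0]) for batch in toy.batch(3)]

# drop_remainder=True: the final incomplete batch is discarded.
sizes_drop = [
    int(batch.shape[0]) for batch in toy.batch(3, drop_remainder=True)
]

print(sizes_keep)  # [3, 3, 3, 1]
print(sizes_drop)  # [3, 3, 3]
```

With `drop_remainder=True`, every batch has a statically known size, which is why the assertions in the test cell below can rely on `len(x) == BATCH_SIZE`.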

**Exercise 1**: Implement the `create_dataset` function. <br>
Your function should create a `tf.data.Dataset` from the input tensors `X` and `Y`, and then configure it to repeat for the specified number of epochs and create batches of the specified `batch_size`. Drop the last batch if it's smaller than the batch size.

In [None]:
def create_dataset(X, Y, epochs, batch_size):
    dataset = tf.data.Dataset.from_tensor_slices((X, Y))
    # TODO: Your code goes here
    return dataset

Let's test our function by iterating twice over our dataset in batches of 3 data points:

In [None]:
BATCH_SIZE = 3
EPOCH = 2

dataset = create_dataset(X, Y, epochs=EPOCH, batch_size=BATCH_SIZE)

for i, (x, y) in enumerate(dataset):
    print("x:", x.numpy(), "y:", y.numpy())
    assert len(x) == BATCH_SIZE
    assert len(y) == BATCH_SIZE
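# 10 points repeated for 2 epochs in batches of 3, remainder dropped
assert i + 1 == (N_POINTS * EPOCH) // BATCH_SIZE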

### Loss function and gradients

The loss function and the function that computes the gradients are the same as before:

In [None]:
def loss_mse(X, Y, w0, w1):
    Y_hat = w0 * X + w1
    errors = (Y_hat - Y) ** 2
    return tf.reduce_mean(errors)


def compute_gradients(X, Y, w0, w1):
    with tf.GradientTape() as tape:
        loss = loss_mse(X, Y, w0, w1)
    return tape.gradient(loss, [w0, w1])

### Training loop

The main difference is that, in the training loop, we will now iterate directly over the `tf.data.Dataset` generated by our `create_dataset` function.

We will configure the dataset so that it iterates 250 times over our synthetic dataset in batches of 2.

**Exercise 2**: Implement the training loop. For each batch from the dataset, you should:
1. Compute the gradients of the loss with respect to the weights `w0` and `w1`.
2. Update the weights using the gradients and the learning rate.
3. Every 100 steps, calculate the loss and print the step number, loss, and current values of `w0` and `w1`.

Make sure to update the `loss` variable within the loop so the final assertions can check your model's performance.
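
For reference, steps 1 and 2 correspond to the update below, written for a batch of $B$ points $(x_i, y_i)$. The symbols $B$ and $\eta$ (the learning rate, `LEARNING_RATE` in the code) are our notation; `tf.GradientTape` computes these gradients for you, so the formulas are only meant as a check on what the loop should do.

$$
L(w_0, w_1) = \frac{1}{B} \sum_{i=1}^{B} \left( w_0 x_i + w_1 - y_i \right)^2
$$

$$
\frac{\partial L}{\partial w_0} = \frac{2}{B} \sum_{i=1}^{B} \left( w_0 x_i + w_1 - y_i \right) x_i,
\qquad
\frac{\partial L}{\partial w_1} = \frac{2}{B} \sum_{i=1}^{B} \left( w_0 x_i + w_1 - y_i \right)
$$

$$
w_j \leftarrow w_j - \eta \, \frac{\partial L}{\partial w_j}, \qquad j \in \{0, 1\}
$$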

In [None]:
EPOCHS = 250
BATCH_SIZE = 2
LEARNING_RATE = 0.02

MSG = "STEP {step} - loss: {loss}, w0: {w0}, w1: {w1}\n"

w0 = tf.Variable(0.0)
w1 = tf.Variable(0.0)

dataset = create_dataset(X, Y, epochs=EPOCHS, batch_size=BATCH_SIZE)

for step, (X_batch, Y_batch) in enumerate(dataset):
    # TODO: Your code goes here
    pass

assert loss < 0.0001
assert abs(w0 - 2) < 0.001
assert abs(w1 - 10) < 0.001

## Loading data from disk

### Locating the CSV files

We will start with the **taxifare dataset** CSV files that we wrote out in a previous lab. 

The taxifare dataset files have been saved into `../data`.

Check that this is the case by running the cell below. If the files are missing, regenerate the taxifare
dataset by running the previous lab notebook:

In [None]:
!ls -l ../data/taxi*.csv

### Use Low-level tf.data API to read the CSV files

To get a more flexible pipeline, we can utilize low-level tf.data APIs to fully control the behavior of the pipeline.

For text-based data including CSV, we can use `TextLineDataset` to load data.

In [None]:
ds = tf.data.TextLineDataset("../data/taxi-train.csv")
ds

Note that the Dataset object (`ds`) is still just a definition; it hasn't loaded the actual data yet.<br>
Let's iterate over the first two elements of this dataset using `ds.take(2)`:

In [None]:
for data in ds.take(2):
    print(data)

The dataset loads the header row as its first element. Since the header is not part of the training data, let's skip it with the `skip()` method.

In [None]:
ds = tf.data.TextLineDataset("../data/taxi-train.csv").skip(1)

for data in ds.take(2):
    print(data)

### Transforming the features with `.map()`

At this point, we've loaded the CSV file as a text file, and each row is represented as a single string value containing separators (`,`).

Let's write a parsing function that takes a row and splits it into multiple values.
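
Before writing the function, it may help to see the two string ops involved on a made-up row. The sketch below is not the taxifare layout; the row contents and variable names are assumptions chosen for illustration.

```python
import tensorflow as tf

# A made-up CSV row with four fields, one of them non-numeric.
row = tf.constant("3.5,hello,-1.25,42")

# Splitting a scalar string on "," yields a 1-D tensor of strings.
fields = tf.strings.split(row, ",")
print(fields.shape)

# Pick the numeric fields and convert them to float32 numbers.
numeric_fields = tf.gather(fields, [0, 2, 3])
numbers = tf.strings.to_number(numeric_fields)
print(numbers.numpy())
```

Note that `tf.strings.to_number` fails on fields that are not numeric (such as `hello` here), which is why the numeric fields are selected before the conversion.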

**Exercise 3**: Implement the `parse_csv` function to split a single CSV row string into a list of strings.

In [None]:
def parse_csv(row):
    # TODO: Your code goes here
    return ...

Let's make sure it works by calling this function in the for loop with `.take()`. 

In [None]:
ds = tf.data.TextLineDataset("../data/taxi-train.csv").skip(1)

for data in ds.take(2):
    values = parse_csv(data)
    pprint(values)

Instead of calling the function in a for loop, we can wrap it in a `.map()` method to include it in a pipeline.

In [None]:
ds = tf.data.TextLineDataset("../data/taxi-train.csv").skip(1).map(parse_csv)

for data in ds.take(2):
    print(data)

Now let's extend the `parse_csv` function.<br>
In machine learning training, we want to pass training data in tuples `(features, label)`.

In this CSV file we have these columns:

In [None]:
!head -1 ../data/taxi-train.csv

Let's say we want to predict the `fare_amount` value, using `pickuplon`, `pickuplat`, `dropofflon` and `dropofflat` as features.

**Exercise 4**: Update the `parse_csv` function to return a tuple of `(features, label)`. <br>
The label should be the first column (`fare_amount`), and the features should be the 3rd through 6th columns (`pickuplon`, `pickuplat`, `dropofflon`, `dropofflat`). 
Remember to convert the extracted values to numbers.

In [None]:
def parse_csv(row):
    columns = tf.strings.split(row, ",")
    # Label: fare_amount
    label = ...  # TODO: Your code goes here
    # Features: pickuplon, pickuplat, dropofflon, dropofflat
    features = ...  # TODO: Your code goes here
    return features, label

In [None]:
ds = tf.data.TextLineDataset("../data/taxi-train.csv").skip(1)
ds = ds.map(parse_csv)

for features, label in ds.take(2):
    print(f"features: \n  {features}, \nlabel: \n  {label} \n++++")

### Batching

Typically, a machine learning training module requires batched data. Let's refactor our pipeline to batch the data by adding `.batch(BATCH_SIZE)`.

In [None]:
BATCH_SIZE = 4

ds = tf.data.TextLineDataset("../data/taxi-train.csv").skip(1)
ds = ds.map(parse_csv).batch(BATCH_SIZE)

for features, label in ds.take(2):
    print(f"features: \n  {features}, \nlabel: \n  {label} \n++++")

Now our dataset yields *batches* instead of *rows*, which is suitable for mini-batch training of neural networks.

### Shuffling & Repeating

When training a deep learning model in batches over multiple workers, it helps to shuffle the data. That way, different workers work on different parts of the input file at the same time, so averaging gradients across workers is more effective.<br>
We can add shuffling with `.shuffle()`. Note that the shuffle buffer, whose size is set by `buffer_size`, is held in memory. A perfectly uniform shuffle requires a buffer at least as large as the dataset, which is not practical for very large datasets.
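
To make the effect of a small buffer concrete, here is a minimal sketch on a toy dataset of 10 integers. The buffer size, seed, and variable names are arbitrary choices for illustration. Because each output element is drawn from a buffer that has only seen a limited prefix of the input, an element can move forward in the order by fewer than `buffer_size` positions.

```python
import tensorflow as tf

BUFFER_SIZE = 3  # deliberately much smaller than the dataset

toy = tf.data.Dataset.range(10).shuffle(buffer_size=BUFFER_SIZE, seed=0)
order = [int(value) for value in toy]
print(order)

# How far ahead of its original position did any element appear?
# With a buffer of 3, this can never reach 3.
max_lookahead = max(value - position for position, value in enumerate(order))
print(max_lookahead)
```

With a buffer this small the output stays close to the input order, which is the trade-off between memory use and shuffle quality described above.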

Let's wrap our data pipeline in a `create_dataset` function so that we can control its behaviour and shuffle data only when the dataset is used for training.

We will introduce an additional argument `mode` to our function to allow the function body to distinguish the case when it needs to shuffle the data (`mode == "train"`) from when it shouldn't (`mode == "eval"`).

Also, let's add `.repeat()` to read the data indefinitely during training. 

**Exercise 5**: Implement the `create_dataset` function that takes a file pattern (in this lab, a single file path, since `TextLineDataset` does not expand wildcards itself). It should:
1. Create a `TextLineDataset` from the file pattern and skip the header row.
2. Map the `parse_csv` function to each row.
3. Make the dataset repeat indefinitely.
4. If the mode is 'train', shuffle the dataset.
5. Batch the dataset, dropping the remainder.

In [None]:
def create_dataset(pattern, batch_size, mode="eval"):
    # TODO: Your code goes here
    pass

Let's check that our function works well in both modes:

In [None]:
# Run this cell multiple times to see that the results differ.
tempds = create_dataset("../data/taxi-train.csv", 2, "train")
print(list(tempds.take(1)))

In [None]:
tempds = create_dataset("../data/taxi-valid.csv", 2, "eval")
print(list(tempds.take(1)))

## Better Performance with tf.data

Maximizing the performance of the data loading and preprocessing phase is critical for many machine learning use cases.

`tf.data` offers a number of ways to optimize the process, depending on the cause of performance bottlenecks.<br>
Let's take a look at some scenarios.

For comparison, we use this `benchmark` function that simulates a training application loop.

In [None]:
import time


def benchmark(dataset, num_epochs=2):
    start_time = time.perf_counter()
    for epoch_num in range(num_epochs):
        for sample in dataset:
            # Performing a training step
            time.sleep(0.01)
    print("Execution time:", time.perf_counter() - start_time)

### Case 1: Performance bottleneck in heavy map operation 

While feature transformation `.map()` is flexible and convenient, this process can be a performance bottleneck when the preprocessing function contains heavy operations.

Let's simulate that case by adding sleep time into our parse function.

In [None]:
def heavy_parse_csv(row):
    columns = tf.strings.split(row, ",")
    label = tf.strings.to_number(columns[0])
    features = tf.strings.to_number(columns[2:6])

    # Perform a heavy preprocessing...
    tf.py_function(lambda: time.sleep(0.001), [], ())

    return features, label

In [None]:
def create_dataset(pattern, batch_size=128):
    ds = tf.data.TextLineDataset(pattern).skip(1)
    ds = ds.map(heavy_parse_csv)
    ds = ds.batch(batch_size, drop_remainder=True)
    return ds

In [None]:
tempds = create_dataset("../data/taxi-train.csv")
benchmark(tempds)

The flow looks like this. The map operation between data read and training is the bottleneck in this case.

![Map bottleneck](https://www.tensorflow.org/guide/images/data_performance/sequential_map.svg)

Let's see how we can optimize this process.

#### Solution 1: Parallelize map

Because input elements are independent of one another, the preprocessing can be parallelized across multiple CPU cores. To make this possible, the `map` transformation provides the `num_parallel_calls` argument, which sets the level of parallelism.

Pass `num_parallel_calls` to `.map()` along with the function. Setting it to `tf.data.AUTOTUNE` lets `tf.data` tune the level of parallelism dynamically at runtime.

In [None]:
def create_dataset(pattern, batch_size=128):
    ds = tf.data.TextLineDataset(pattern).skip(1)
    ds = ds.map(heavy_parse_csv, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.batch(batch_size, drop_remainder=True)
    return ds

In [None]:
tempds = create_dataset("../data/taxi-train.csv")
benchmark(tempds)

Now it is much faster!

The flow now looks like this.
![parallelized](https://www.tensorflow.org/guide/images/data_performance/parallel_map.svg)

#### Solution 2: Vectorize the map operation

`.map()` processes each individual element returned by a `Dataset`. Our current function is structured to work on one CSV row at a time. However, processing data in batches is usually more efficient when feasible, because the per-call overhead is paid once per batch instead of once per row.

Let's vectorize our function (that is, have it operate over a batch of inputs at once) and apply the `batch` transformation before the `map` transformation.
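
The main change when vectorizing is that splitting a batch of strings returns a `tf.RaggedTensor` rather than a plain tensor. The sketch below shows this on a made-up batch of two rows (not the taxifare columns); all names are ours.

```python
import tensorflow as tf

# A made-up batch of two CSV rows with four fields each.
rows = tf.constant(["1.0,a,2.0,3.0", "4.0,b,5.0,6.0"])

# Splitting a batch yields a RaggedTensor, since rows could in
# principle contain different numbers of fields.
ragged = tf.strings.split(rows, ",")

# Convert to a dense [batch, fields] string tensor so that we can
# slice whole columns at once.
columns = ragged.to_tensor()

first_column = tf.strings.to_number(columns[:, 0])
last_columns = tf.strings.to_number(columns[:, 2:4])

print(columns.shape)         # (2, 4)
print(first_column.numpy())  # [1. 4.]
print(last_columns.numpy())
```

Slicing with `[:, ...]` selects a column for every row in the batch, which replaces the per-row indexing used in the non-vectorized function.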

In [None]:
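def heavy_parse_csv_batch(rows):
    # `rows` is a batch of CSV lines, so the split result is ragged
    # until it is converted to a dense tensor.
    columns = tf.strings.split(rows, ",").to_tensor()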
    label = tf.strings.to_number(columns[:, 0])
    features = tf.strings.to_number(columns[:, 2:6])

    # Perform a heavy preprocessing...
    tf.py_function(lambda: time.sleep(0.001), [], ())

    return features, label

In [None]:
def create_dataset(pattern, batch_size=128):
    ds = tf.data.TextLineDataset(pattern).skip(1)
    ds = ds.batch(batch_size, drop_remainder=True)
    ds = ds.map(heavy_parse_csv_batch, num_parallel_calls=tf.data.AUTOTUNE)
    return ds

In [None]:
tempds = create_dataset("../data/taxi-train.csv")
benchmark(tempds)

### Case 2: Performance Bottleneck in I/O (Data Loading) 
Let's take a look at the next scenario.

In a real-world setting, the input data may be stored remotely (for example on Google Cloud Storage in a different location). A dataset pipeline that works well when reading data locally might become bottlenecked on I/O when reading data remotely because of the following differences between local and remote storage:

- **Time-to-first-byte**: Reading the first byte of a file from remote storage can take orders of magnitude longer than from local storage.
- **Read throughput**: While remote storage typically offers large aggregate bandwidth, reading a single file might only be able to utilize a small fraction of this bandwidth.

In addition, once the raw bytes are loaded into memory, it may also be necessary to deserialize and/or decrypt the data (e.g. protobuf), which requires additional computation. This overhead is present irrespective of whether the data is stored locally or remotely, but can be worse in the remote case if data is not prefetched effectively.

Let's create a custom dataset to simulate this scenario.

In [None]:
class IOBoundDataset(tf.data.Dataset):
    def _generator(file_name, num_samples):
        # Opening the file
        time.sleep(0.3)

        for sample_idx in range(num_samples):
            # Reading each line from the file
            time.sleep(0.15)

            yield (sample_idx,)

    def __new__(cls, file_name, num_samples=5):
        return tf.data.Dataset.from_generator(
            cls._generator,
            output_signature=tf.TensorSpec(shape=(1,), dtype=tf.int64),
            args=(
                file_name,
                num_samples,
            ),
        )

In [None]:
ds = IOBoundDataset("dummy_file.csv").repeat(20)
benchmark(ds)
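
As a rough sanity check on the time printed by this benchmark, the simulated delays can simply be added up. The sketch below ignores any `tf.data` overhead, so it is only a lower bound; the constant names are ours, with values copied from `IOBoundDataset` and `benchmark`.

```python
# Delays copied from IOBoundDataset and benchmark; names are ours.
OPEN_DELAY_S = 0.3     # simulated cost of opening the file, per pass
READ_DELAY_S = 0.15    # simulated cost of reading one sample
SAMPLES_PER_PASS = 5   # default num_samples
REPEATS = 20           # from .repeat(20)
STEP_DELAY_S = 0.01    # simulated training step in benchmark
NUM_EPOCHS = 2         # benchmark default

# One pass over the simulated file: open it, then read every sample.
pass_s = OPEN_DELAY_S + SAMPLES_PER_PASS * READ_DELAY_S

loading_s = NUM_EPOCHS * REPEATS * pass_s
training_s = NUM_EPOCHS * REPEATS * SAMPLES_PER_PASS * STEP_DELAY_S

print(f"loading: {loading_s:.1f}s, training: {training_s:.1f}s")
# loading: 42.0s, training: 2.0s
```

Under these assumptions, almost all of the time is spent waiting on simulated I/O rather than on training steps, which is what makes this pipeline I/O bound.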

#### Solution 1: Cache
Since machine learning training often involves using the same dataset repeatedly, a good strategy is to cache the data during the first epoch and then retrieve it from the cache in subsequent epochs, rather than reloading it from a remote source each time.

You can simply insert `.cache()` to use caching.

In [None]:
ds = IOBoundDataset("dummy_file.csv").cache().repeat(20)
benchmark(ds)

It looks much faster! However, be careful about what you cache, since `.cache()` without arguments keeps the cached data in memory. It is not realistic to cache a terabyte-scale dataset this way.

For example, if you have a CSV file that contains the paths of videos for training, consider caching the small parsed CSV data rather than the decoded videos, by inserting `cache()` before the video load function.

```python 
dataset.map(parse_csv_fn).cache().map(load_video_fn)
```
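
To see what caching changes without relying on timings, the sketch below counts how often a source generator is actually started. The generator, the counter list, and the variable names are hypothetical stand-ins for an expensive data source.

```python
import tensorflow as tf

generator_runs = []  # one entry per time the source is actually read


def slow_source():
    # Hypothetical stand-in for an expensive read (e.g. a remote file).
    generator_runs.append(1)
    for i in range(3):
        yield i


spec = tf.TensorSpec(shape=(), dtype=tf.int64)

# Without caching, every full pass restarts the generator.
uncached = tf.data.Dataset.from_generator(slow_source, output_signature=spec)
for _ in range(3):
    list(uncached)
runs_without_cache = len(generator_runs)

# With caching, only the first full pass reads from the generator.
generator_runs.clear()
cached = tf.data.Dataset.from_generator(
    slow_source, output_signature=spec
).cache()
for _ in range(3):
    list(cached)
runs_with_cache = len(generator_runs)

print(runs_without_cache, runs_with_cache)
```

The source should be read three times without the cache and only once with it. The cache is only filled once the first pass has been consumed completely, so a partial first pass (for example with `.take()` applied after `.cache()`) does not populate it.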

#### Solution 2: Interleave

To mitigate the impact of the various data extraction overheads, the `tf.data.Dataset.interleave` transformation can be used to parallelize the data loading step by interleaving the contents of several datasets (such as data file readers).

Let's say we have multiple sharded files.

In [None]:
files = [
    "dummy_file_shard001.csv",
    "dummy_file_shard002.csv",
    "dummy_file_shard003.csv",
]

You can insert `interleave()` to interleave multiple file load operations.

In [None]:
ds = tf.data.Dataset.from_tensor_slices(files)
ds = ds.interleave(IOBoundDataset)

benchmark(ds)

Let's take a look at how they are loaded.

In [None]:
for d in ds:
    print(d)

Each dataset contains the values 0 to 4, and the output shows that the reads from the 3 files are interleaved.
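
Because every simulated file yields the same values, it can be hard to tell which element came from which file. The sketch below uses a hypothetical `fake_shard` function whose values encode the shard they come from, and sets `cycle_length` and `block_length` explicitly so the order is easy to follow.

```python
import tensorflow as tf


def fake_shard(shard_id):
    # Hypothetical stand-in for a file reader: shard 0 yields 0, 1, 2;
    # shard 1 yields 10, 11, 12; shard 2 yields 20, 21, 22.
    start = shard_id * 10
    return tf.data.Dataset.range(start, start + 3)


shards = tf.data.Dataset.range(3)

# cycle_length=1 reads one shard at a time, i.e. no interleaving.
sequential = shards.interleave(fake_shard, cycle_length=1)

# cycle_length=3 cycles over all three shards, one element at a time.
interleaved = shards.interleave(fake_shard, cycle_length=3, block_length=1)

sequential_order = [int(v) for v in sequential]
interleaved_order = [int(v) for v in interleaved]

print(sequential_order)   # [0, 1, 2, 10, 11, 12, 20, 21, 22]
print(interleaved_order)  # [0, 10, 20, 1, 11, 21, 2, 12, 22]
```

Here `cycle_length` controls how many input datasets are read from concurrently, and `block_length` controls how many consecutive elements are taken from each before moving on to the next.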

This is the flow diagram of interleaving (with 2 files in this case):
![sequential interleave](https://www.tensorflow.org/guide/images/data_performance/sequential_interleave.svg)

As with `.map()`, you can parallelize the interleave operation by adding `num_parallel_calls` for a further performance gain.

In [None]:
ds = tf.data.Dataset.from_tensor_slices(files)
ds = ds.interleave(IOBoundDataset, num_parallel_calls=len(files))

benchmark(ds)

![parallel interleave](https://www.tensorflow.org/guide/images/data_performance/parallel_interleave.svg)

By combining these techniques, you can design a highly optimized data loading and transformation pipeline.

Copyright 2025 Google LLC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

     https://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.