# Chapter 13: Loading and Preprocessing Data with TensorFlow

In [1]:
import numpy as np
import tensorflow as tf
from tensorflow import keras

Deep Learning systems are often trained on very large datasets that do not fit in RAM. TensorFlow's **Data API** takes care of the implementation details, so you only need to:
- Create a dataset object
- Tell it where to get the data
- Tell it how to transform the data

## 13.1 The Data API

The Data API revolves around the concept of a **dataset**: a sequence of data items.

In [10]:
# Create a dataset entirely in RAM
X = tf.range(10) # any data tensor
dataset = tf.data.Dataset.from_tensor_slices(X)
dataset

<TensorSliceDataset shapes: (), types: tf.int32>

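The `from_tensor_slices()` function takes a tensor and creates a `tf.data.Dataset` whose elements are the slices of `X` along its first dimension. The result holds the same values as `tf.data.Dataset.range(10)`, except that `range()` yields 64-bit integers rather than 32-bit ones.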
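
To make the comparison concrete, here is a minimal sketch that builds both datasets and inspects their values and element dtypes. It also slices a small 2-D tensor to show that slicing happens along the first dimension. The variable names (`sliced`, `ranged`, `rows`) are illustrative only.

```python
import tensorflow as tf

X = tf.range(10)  # int32 tensor holding 0..9
sliced = tf.data.Dataset.from_tensor_slices(X)
ranged = tf.data.Dataset.range(10)

# Both datasets yield the scalars 0..9.
sliced_values = [int(v) for v in sliced.as_numpy_iterator()]
ranged_values = [int(v) for v in ranged.as_numpy_iterator()]

# The element dtypes can be read from element_spec without iterating.
print(sliced.element_spec.dtype)
print(ranged.element_spec.dtype)

# Slicing a (3, 2) tensor gives 3 items, each of shape (2,).
rows = tf.data.Dataset.from_tensor_slices(tf.reshape(tf.range(6), (3, 2)))
row_values = [r.tolist() for r in rows.as_numpy_iterator()]
```

Checking `element_spec` is a cheap way to catch a dtype mismatch before it surfaces later in a pipeline.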

In [11]:
# Iterate over dataset's items
for item in dataset:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)


### 13.1.1 Chaining Transformations

Once you have a dataset, you can apply transformations by calling its transformation methods.

In [12]:
# See Figure 13-1. Chaining dataset transformations
dataset = dataset.repeat(3).batch(7)
for item in dataset:
    print(item)

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int32)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int32)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int32)
tf.Tensor([8 9], shape=(2,), dtype=int32)


Starting from the original dataset:
1. Call `repeat(3)` to return a new dataset that repeats the items of the original dataset 3 times.
    - Called with no arguments, `repeat()` returns a dataset that repeats forever, so the code that iterates over it must decide when to stop.

2. Call `batch(7)` to return a new dataset that groups the items into batches of 7. The 30 items do not divide evenly by 7, so the final batch holds only the 2 remaining items.
    - Pass `drop_remainder=True` to drop this short final batch, so that every batch has the same size.
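
The two options mentioned above can be sketched as follows. This is a small illustration on a fresh `range(10)` dataset, not part of the original notebook; `take(4)` is used here as one possible way to bound an endlessly repeating dataset.

```python
import tensorflow as tf

base = tf.data.Dataset.range(10)

# 30 items in batches of 7: four full batches plus a final batch of 2.
with_remainder = [b.tolist() for b in base.repeat(3).batch(7).as_numpy_iterator()]

# drop_remainder=True discards the short final batch.
full_only = [
    b.tolist()
    for b in base.repeat(3).batch(7, drop_remainder=True).as_numpy_iterator()
]

# repeat() with no argument never ends, so the consumer bounds it with take().
first_four = [b.tolist() for b in base.repeat().batch(7).take(4).as_numpy_iterator()]

print(len(with_remainder), len(full_only), len(first_four))
```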

> Note: Dataset methods **do not** modify datasets; they return new ones. Keep a reference to the result (e.g. `dataset = ...`), or the call has no effect.
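
A short sketch of what this means in practice, using an illustrative six-item dataset: the first `batch(3)` call discards its result, so `dataset` still yields scalars, while the second call keeps the returned dataset.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(6)

dataset.batch(3)  # result discarded: `dataset` itself is unchanged
unchanged = [int(x) for x in dataset.as_numpy_iterator()]

batched = dataset.batch(3)  # keep the returned dataset
batches = [b.tolist() for b in batched.as_numpy_iterator()]

print(unchanged)
print(batches)
```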

In [13]:
# Create a new dataset with all items doubled
dataset = dataset.map(lambda x: x * 2) # First batch: [0,2,4,6,8,10,12]
for item in dataset:
    print(item)

tf.Tensor([ 0  2  4  6  8 10 12], shape=(7,), dtype=int32)
tf.Tensor([14 16 18  0  2  4  6], shape=(7,), dtype=int32)
tf.Tensor([ 8 10 12 14 16 18  0], shape=(7,), dtype=int32)
tf.Tensor([ 2  4  6  8 10 12 14], shape=(7,), dtype=int32)
tf.Tensor([16 18], shape=(2,), dtype=int32)


While the `map()` method applies a transformation to each item, the `apply()` method applies a transformation to the dataset as a whole.
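
The difference can be sketched with a small example. The helper `drop_odd_then_pair` is hypothetical and exists only to show the shape of a function accepted by `apply()`: it takes a dataset and returns a dataset, whereas the function given to `map()` takes a single item.

```python
import tensorflow as tf

def drop_odd_then_pair(ds):
    """Hypothetical dataset-level transformation: dataset in, dataset out."""
    return ds.filter(lambda x: x % 2 == 0).batch(2)

dataset = tf.data.Dataset.range(10)

# map(): the function receives one item at a time.
squared = dataset.map(lambda x: x * x)

# apply(): the function receives the whole dataset object.
paired = dataset.apply(drop_odd_then_pair)

squared_values = [int(x) for x in squared.as_numpy_iterator()]
paired_values = [b.tolist() for b in paired.as_numpy_iterator()]

print(squared_values)
print(paired_values)
```

Wrapping several steps in one function and passing it to `apply()` is mainly a convenience for reusing the same chain of transformations on different datasets.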

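> Note: `apply()` is not needed for unbatching here. `tf.data.experimental.unbatch()`, which had to be passed to `apply()`, is deprecated; `unbatch()` is now a regular `tf.data.Dataset` method that operates on the dataset it is called on.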

In [16]:
# tf.data.experimental.unbatch() is now deprecated;
# use the Dataset.unbatch() method instead.
# Each item in the new dataset will be a single-integer tensor
dataset = dataset.unbatch()

# Filter the dataset
dataset = dataset.filter(lambda x: x < 10) # Items: 0 2 4 6 8 0 2 4 6...

# Look at just a few items from dataset
for item in dataset.take(3):
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
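
As a recap, the cells in this section can be condensed into one self-contained chain. This is a sketch rather than a replacement for the step-by-step cells. It unbatches before filtering because the `filter()` predicate has to return a scalar boolean, which `x < 10` only does for scalar items.

```python
import tensorflow as tf

dataset = (
    tf.data.Dataset.from_tensor_slices(tf.range(10))
    .repeat(3)                  # 30 items: 0..9 three times
    .batch(7)                   # batches of 7, last batch has 2 items
    .map(lambda x: x * 2)       # double every value in each batch
    .unbatch()                  # back to scalar items
    .filter(lambda x: x < 10)   # keep 0, 2, 4, 6, 8 from each repetition
)

all_items = [int(x) for x in dataset.as_numpy_iterator()]
first_three = [int(x) for x in dataset.take(3).as_numpy_iterator()]

print(first_three)
print(len(all_items))
```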


### 13.1.2 Shuffling the Data

### 13.1.3 Preprocessing the Data

### 13.1.4 Putting Everything Together

### 13.1.5 Prefetching

### 13.1.6 Using the Dataset with tf.keras

## 13.2 The TFRecord Format

### 13.2.1 Compressed TFRecord Files

### 13.2.2 A Brief Introduction to Protocol Buffers

### 13.2.3 TensorFlow Protobufs

### 13.2.4 Loading and Parsing Examples

### 13.2.5 Handling Lists of Lists Using the SequenceExample Protobuf

## 13.3 Preprocessing the Input Features

### 13.3.1 Encoding Categorical Features Using One-Hot Vectors

### 13.3.2 Encoding Categorical Features Using Embeddings

### 13.3.3 Keras Preprocessing Layers

## 13.4 TF Transform

## 13.5 The TensorFlow Datasets (TFDS) Project