# The Data API
In TensorFlow, the Data API (`tf.data`) is a set of tools for loading and preprocessing data efficiently. It offers a flexible way to work with large datasets, which makes it easier to build input pipelines for training machine learning models.

* The Data API in TensorFlow centers on the notion of a **dataset**, which is essentially a sequence of data items. While datasets typically read data from disk incrementally, for simplicity, one can create a dataset entirely in RAM.

## 1. Creating a Dataset
* The `from_tensor_slices()` method takes a tensor and creates a `tf.data.Dataset` whose elements are the slices of that tensor along its first dimension. For example, if the input tensor has shape (10, ...), the resulting dataset contains 10 items, one for each index 0 through 9 along the first dimension.

In [12]:
import tensorflow as tf

# Generate a tensor containing values from 0 to 9 using tf.range()
X = tf.range(10)

# Create a tf.data.Dataset from the tensor X using from_tensor_slices()
# This function creates a dataset where each element is a slice of X along its first dimension
dataset = tf.data.Dataset.from_tensor_slices(X)

# Print the dataset to observe its structure
print(dataset)

# Alternatively, create a dataset of the values 0 to 9 directly with tf.data.Dataset.range()
# Note: this replaces the dataset above, and its elements are int64 rather than int32
dataset = tf.data.Dataset.range(10)

# Iterate through the dataset and print each item
for item in dataset:
    print(item)


<TensorSliceDataset element_spec=TensorSpec(shape=(), dtype=tf.int32, name=None)>
tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(7, shape=(), dtype=int64)
tf.Tensor(8, shape=(), dtype=int64)
tf.Tensor(9, shape=(), dtype=int64)
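
The example above uses a 1-D tensor, so each slice is a scalar. As a minimal sketch of the more general case, the snippet below slices a small 2-D tensor. The names `X_2d`, `dataset_2d`, and `rows` are illustrative only and not part of the original notebook.

```python
import tensorflow as tf

# A 2-D tensor of shape (3, 2): three rows, two columns
X_2d = tf.constant([[1, 2], [3, 4], [5, 6]])

# Slicing along the first dimension yields three elements, each of shape (2,)
dataset_2d = tf.data.Dataset.from_tensor_slices(X_2d)

# The element spec describes a single element, not the whole tensor
print(dataset_2d.element_spec)

# Collect the elements as plain Python lists to inspect them
rows = [item.numpy().tolist() for item in dataset_2d]
print(rows)  # [[1, 2], [3, 4], [5, 6]]
```

Each row of the input becomes one dataset element, so the dataset has as many items as the size of the tensor's first dimension.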


## 2. Chaining Transformations

In the context of TensorFlow's Data API, transformations refer to the operations applied to datasets to **modify** or **preprocess** the data in various ways. These transformations are used to prepare the data for training machine learning models.

**Common transformations include:**

* **Batching**: Grouping multiple examples into batches, which enables processing multiple examples in parallel, typically to improve efficiency during training.

* **Repeating**: The `repeat()` transformation is used to repeat the elements of a dataset for a specified number of epochs or indefinitely if no argument is provided. This transformation is often used to ensure that the dataset provides enough data for training over multiple epochs.

In [13]:
# Repeat the dataset 3 times, so the new dataset contains three copies of the original data back to back
# Then, batch the dataset into batches of size 7, meaning each batch will contain 7 elements
dataset = dataset.repeat(3).batch(7)

# Iterate through the transformed dataset
for item in dataset:
    # Print each batch of the dataset
    print(item)

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int64)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int64)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int64)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int64)
tf.Tensor([8 9], shape=(2,), dtype=int64)


In [16]:
import tensorflow as tf

# Create a dataset containing elements from 0 to 9
dataset = tf.data.Dataset.range(10)

# Repeat the dataset three times
dataset = dataset.repeat(3)

# Batch the dataset into batches of size 7, dropping any remainder
dataset = dataset.batch(7, drop_remainder=True)

# Iterate through the dataset
for item in dataset:
    print(item)

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int64)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int64)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int64)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int64)
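
The description of `repeat()` above mentions that omitting the argument repeats the dataset indefinitely. A minimal sketch of that case follows. Because such a dataset never ends, the example uses `take()` to read only the first few elements. The names `endless` and `first_seven` are illustrative only.

```python
import tensorflow as tf

# A small dataset with the elements 0, 1, 2
dataset = tf.data.Dataset.range(3)

# With no argument, repeat() cycles through the data without end
endless = dataset.repeat()

# take(7) limits iteration to the first 7 elements, so the loop terminates
first_seven = [int(x.numpy()) for x in endless.take(7)]
print(first_seven)  # [0, 1, 2, 0, 1, 2, 0]
```

Iterating over `endless` directly in a `for` loop would not terminate, so an infinite dataset is normally paired with `take()` or with a fixed number of training steps.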


* **Mapping**: Applying a function to each element of the dataset. This function can be used for various purposes, such as data preprocessing, feature engineering, or data augmentation.

In [31]:
import tensorflow as tf

# Define a simple transformation function
def square(x):
    return x ** 2

# Create a dataset containing elements from 0 to 9
dataset = tf.data.Dataset.range(10)

# Apply the square function to each element of the dataset in parallel
# Specify num_parallel_calls to control the degree of parallelism
# Here, tf.data.AUTOTUNE lets tf.data choose the degree of parallelism dynamically
# (tf.data.experimental.AUTOTUNE is the older alias for the same constant)
dataset = dataset.map(square, num_parallel_calls=tf.data.AUTOTUNE)

# Iterate through the transformed dataset
for item in dataset:
    print(item.numpy())  # Print each transformed element

0
1
4
9
16
25
36
49
64
81


* **Applying**: The `apply()` method is used to apply a transformation that operates on the dataset as **a whole** rather than individual elements.

   * It allows for more complex transformations that involve **aggregating**, **filtering**, or **modifying** the dataset **as a whole**.

   * The `apply()` method can be used to perform operations such as **batch-wise normalization**, or custom dataset preprocessing.

   * Unlike the `map()` method, the transformation function passed to `apply()` operates on the entire dataset or subsets of it rather than individual elements. 

   * The transformation function passed to the `apply()` method must return a new dataset.

In [None]:
import tensorflow as tf

# Create a dataset containing elements from 0 to 4
dataset = tf.data.Dataset.range(5)

# Define a dataset-level transformation function; this trivial one returns the dataset unchanged
def copy_dataset(ds):
    return ds

# Apply the copy_dataset function to the dataset using the apply() method
copied_dataset = dataset.apply(copy_dataset)

# Iterate through the copied dataset
for item in copied_dataset:
    print(item.numpy())
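
The function passed to `apply()` above returns its input unchanged, so it does not show why a dataset-level function is useful. As a sketch under that caveat, the example below bundles two transformations into one reusable function. The helper name `keep_even_and_batch` is hypothetical and chosen only for illustration.

```python
import tensorflow as tf

# Hypothetical helper: receives a whole dataset and returns a new dataset
def keep_even_and_batch(ds):
    # Keep only the even elements, then group them into batches of 2
    return ds.filter(lambda x: x % 2 == 0).batch(2)

# Create a dataset containing elements from 0 to 9
dataset = tf.data.Dataset.range(10)

# apply() calls the helper once with the dataset itself, not once per element
result = dataset.apply(keep_even_and_batch)

batches = [b.numpy().tolist() for b in result]
print(batches)  # [[0, 2], [4, 6], [8]]
```

Packaging several steps this way keeps a pipeline readable, because the same function can be applied to training and validation datasets alike.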


* **Filtering**: Removing examples from the dataset based on certain criteria, such as removing outliers or selecting specific classes for classification tasks.

In [32]:
import tensorflow as tf

# Create a dataset containing elements from 0 to 9
dataset = tf.data.Dataset.range(10)

# Apply a filter using a lambda function to keep only elements greater than 5
filtered_dataset = dataset.filter(lambda x: x > 5)

# Iterate through the filtered dataset
for item in filtered_dataset:
    print(item.numpy())


6
7
8
9


* **Shuffling**: Randomly reordering the examples so that the model does not learn from the order in which they appear.
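
Shuffling has no example in the notebook, so here is a minimal sketch using the `shuffle()` method. `buffer_size` and `seed` are parameters of `shuffle()`; the values 10 and 42 are arbitrary choices for this illustration, and the names `shuffled` and `values` are illustrative only.

```python
import tensorflow as tf

# Create a dataset containing elements from 0 to 9
dataset = tf.data.Dataset.range(10)

# shuffle() fills a buffer and draws elements from it at random.
# A buffer at least as large as the dataset lets every element
# be drawn at any position; smaller buffers shuffle only locally.
shuffled = dataset.shuffle(buffer_size=10, seed=42)

# The order is random, but the set of elements is unchanged
values = [int(x.numpy()) for x in shuffled]
print(values)
print(sorted(values) == list(range(10)))  # True
```

For datasets too large to fit in memory, the buffer size is a trade-off: a larger buffer mixes the data more thoroughly but uses more RAM.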