# The Data API
TensorFlow's Data API (`tf.data`) is a set of tools for loading and preprocessing data efficiently. It offers a streamlined, flexible way to work with large datasets, which makes it easier to build and train machine learning models.

* The Data API in TensorFlow centers on the notion of a **dataset**, which is essentially a sequence of data items. While datasets typically read data from disk incrementally, for simplicity, one can create a dataset entirely in RAM.

## 1. Creating a Dataset
* The `tf.data.Dataset.from_tensor_slices()` method takes a tensor and creates a `tf.data.Dataset` whose elements are the slices of that tensor along its first dimension. For example, if the input tensor has shape (10, ...), the resulting dataset contains 10 items. In the code below the input is `tf.range(10)`, so the items are the scalar tensors 0 through 9.

In [24]:
import tensorflow as tf

# Generate a tensor containing values from 0 to 9 using tf.range()
X = tf.range(10)

# Create a tf.data.Dataset from the tensor X using from_tensor_slices()
# This function creates a dataset where each element is a slice of X along its first dimension
dataset = tf.data.Dataset.from_tensor_slices(X)

# Print the dataset to observe its structure
print(dataset)

# Alternatively, you can create a dataset containing a range of values from 0 to 9 using tf.data.Dataset.range()
dataset = tf.data.Dataset.range(10)

# Iterate through the dataset and print each item
for item in dataset:
    print(item)


<TensorSliceDataset element_spec=TensorSpec(shape=(), dtype=tf.int32, name=None)>
tf.Tensor(0, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(5, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(7, shape=(), dtype=int64)
tf.Tensor(8, shape=(), dtype=int64)
tf.Tensor(9, shape=(), dtype=int64)
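
To make the slicing along the first dimension more concrete, here is a minimal sketch with a 2-D tensor. The names `matrix`, `row_dataset` and `rows` are illustrative only and do not appear elsewhere in this notebook. A tensor of shape (3, 2) should give a dataset of 3 items, each of shape (2,).

```python
import tensorflow as tf

# A 3x2 tensor: [[0, 1], [2, 3], [4, 5]]
matrix = tf.reshape(tf.range(6), (3, 2))

# Each dataset element is one row, i.e. one slice along the first dimension
row_dataset = tf.data.Dataset.from_tensor_slices(matrix)

# Collect the elements as plain Python lists to inspect them
rows = [row.numpy().tolist() for row in row_dataset]
print(rows)
```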


## 2. Chaining Transformations

In the context of TensorFlow's Data API, transformations refer to the operations applied to datasets to **modify** or **preprocess** the data in various ways. These transformations are used to prepare the data for training machine learning models.

**Common transformations include:**

* **Batching**: Grouping multiple examples into batches, which enables processing multiple examples in parallel, typically to improve efficiency during training.

* **Repeating**: The `repeat()` transformation is used to repeat the elements of a dataset for a specified number of epochs or indefinitely if no argument is provided. This transformation is often used to ensure that the dataset provides enough data for training over multiple epochs.

In [25]:
# Repeat the dataset three times to create a new dataset that contains three repetitions of the original data
# Then, batch the dataset into batches of size 7, meaning each batch will contain 7 elements
dataset = dataset.repeat(3).batch(7)

# Iterate through the transformed dataset
for item in dataset:
    # Print each batch of the dataset
    print(item)

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int64)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int64)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int64)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int64)
tf.Tensor([8 9], shape=(2,), dtype=int64)


In [26]:
import tensorflow as tf

# Create a dataset containing elements from 0 to 9
dataset = tf.data.Dataset.range(10)

# Repeat the dataset three times
dataset = dataset.repeat(3)

# Batch the dataset into batches of size 7, dropping any remainder
dataset = dataset.batch(7, drop_remainder=True)

# Iterate through the dataset
for item in dataset:
    print(item)

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int64)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int64)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int64)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int64)


* **Mapping**: Applying a function to each element of the dataset. This function can be used for various purposes, such as data preprocessing, feature engineering, or data augmentation.

In [27]:
import tensorflow as tf

# Define a simple transformation function
def square(x):
    return x ** 2

# Create a dataset containing elements from 0 to 9
dataset = tf.data.Dataset.range(10)

# Apply the square function to each element of the dataset in parallel
# Specify num_parallel_calls to control the degree of parallelism
# Here, tf.data.experimental.AUTOTUNE lets tf.data tune the degree of parallelism dynamically
# (in TF 2.4+, the same constant is also available as tf.data.AUTOTUNE)
dataset = dataset.map(square, num_parallel_calls=tf.data.experimental.AUTOTUNE)

# Iterate through the transformed dataset
for item in dataset:
    print(item.numpy())  # Print each transformed element

0
1
4
9
16
25
36
49
64
81


* **Applying**: The `apply()` method is used to apply a transformation that operates on the dataset as **a whole** rather than individual elements.

   * It allows for more complex transformations that involve **aggregating**, **filtering**, or **modifying** the dataset **as a whole**.

   * The `apply()` method can be used to perform operations such as **batch-wise normalization**, or custom dataset preprocessing.

   * Unlike the `map()` method, the transformation function passed to `apply()` operates on the entire dataset or subsets of it rather than individual elements. 

   * The transformation function passed to the `apply()` method must return a new dataset.

In [28]:
import tensorflow as tf

# Create a dataset containing elements from 0 to 4
dataset = tf.data.Dataset.range(5)

# Define a transformation function that takes a dataset and returns a dataset
# (here it simply returns the input dataset unchanged, without actually copying it)
def copy_dataset(ds):
    return ds

# Apply the copy_dataset function to the dataset using the apply() method
copied_dataset = dataset.apply(copy_dataset)

# Iterate through the copied dataset
for item in copied_dataset:
    print(item.numpy())


0
1
2
3
4
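
Since the identity function above does not change anything, a slightly richer sketch may help show what a dataset-level transformation looks like. The function name `keep_even_and_batch` and its `batch_size` parameter are hypothetical, chosen only for illustration. It bundles a `filter()` and a `batch()` into one reusable step and returns a new dataset, as `apply()` requires.

```python
import tensorflow as tf

def keep_even_and_batch(ds, batch_size=2):
    """Dataset-level transformation: takes a dataset and returns a new dataset."""
    # Keep only even elements, then group them into batches
    return ds.filter(lambda x: tf.math.equal(x % 2, 0)).batch(batch_size)

dataset = tf.data.Dataset.range(10)

# apply() calls keep_even_and_batch(dataset) and uses the dataset it returns
even_batches = dataset.apply(keep_even_and_batch)

batches = [batch.numpy().tolist() for batch in even_batches]
print(batches)
```

Wrapping several steps in one function like this is mainly a convenience: the same preprocessing can then be reused on the training, validation and test datasets.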


* **Filtering**: Removing examples from the dataset based on certain criteria, such as removing outliers or selecting specific classes for classification tasks.

In [29]:
import tensorflow as tf

# Create a dataset containing elements from 0 to 9
dataset = tf.data.Dataset.range(10)

# Apply a filter using a lambda function to keep only elements greater than 5
filtered_dataset = dataset.filter(lambda x: x > 5)

# Iterate through the filtered dataset
for item in filtered_dataset:
    print(item.numpy())

6
7
8
9


* **Taking**: Sometimes you only need to inspect the first few elements of a dataset. That's where the `take()` method comes in handy.

In [30]:
import tensorflow as tf

# Create a dataset containing elements from 0 to 9
dataset = tf.data.Dataset.range(10)

# Use the dataset's take() method to select the first five items
subset = dataset.take(5)

# Iterating over the subset
for item in subset:
    print(item.numpy())

0
1
2
3
4


* **Shuffling**: Randomly shuffling the data, using the `shuffle()` method, to introduce randomness and prevent the model from learning the order of the examples.
   * **Here's how it works:** `shuffle()` creates a new dataset that first fills a buffer with items from the source dataset. Then, whenever you request an item, it picks one at random from the buffer and replaces it with a fresh item from the source dataset, until it has gone through the entire source dataset. After that, it keeps picking items at random from the buffer until the buffer is empty.

   * It's crucial to set the buffer size large enough for effective shuffling, but not so large that it exceeds your available RAM. Even if you have plenty of memory, it's unnecessary to surpass the dataset's size. 
   
   * If you want the shuffle to produce the same random order each time you run your program, you can specify a random seed. 

In [31]:
# Create a dataset with numbers from 0 to 9, repeated three times
dataset = tf.data.Dataset.range(10).repeat(3) 

# Shuffle the dataset with a buffer size of 5 and a random seed of 42,
# then batch the shuffled dataset into groups of 7
dataset = dataset.shuffle(buffer_size=5, seed=42).batch(7)

# Iterate over the dataset and print each batch
for item in dataset:
    print(item)


tf.Tensor([0 2 3 6 7 9 4], shape=(7,), dtype=int64)
tf.Tensor([5 0 1 1 8 6 5], shape=(7,), dtype=int64)
tf.Tensor([4 8 7 1 2 3 0], shape=(7,), dtype=int64)
tf.Tensor([5 4 2 7 8 9 9], shape=(7,), dtype=int64)
tf.Tensor([3 6], shape=(2,), dtype=int64)
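
The buffer mechanism described above can be sketched in plain Python. This is only an illustration of the idea, not TensorFlow's actual implementation, and it will not reproduce the same order as `shuffle()` for a given seed. The function name `buffered_shuffle` is hypothetical.

```python
import random

def buffered_shuffle(source, buffer_size, seed=None):
    """Yield the items of `source` in a buffer-shuffled order (illustrative sketch)."""
    rng = random.Random(seed)
    buffer = []
    for item in source:
        if len(buffer) < buffer_size:
            # Initial phase: fill the buffer
            buffer.append(item)
            continue
        # Pick a random slot, emit its item, and refill the slot from the source
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    # The source is exhausted: drain the buffer in random order
    while buffer:
        idx = rng.randrange(len(buffer))
        yield buffer.pop(idx)

shuffled = list(buffered_shuffle(range(10), buffer_size=5, seed=42))
unshuffled = list(buffered_shuffle(range(10), buffer_size=1, seed=42))
print(shuffled)
print(unshuffled)
```

With `buffer_size=1` the order is unchanged, and with a small buffer an item can only move a limited distance towards the front. This is why a buffer that is small relative to the dataset gives weak shuffling.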


**Mixing lines from various files together**

* If you have a **big dataset** that **can't fit in memory**, a shuffling buffer alone is not enough, because the buffer is small compared to the dataset.

* To add further shuffling to the instances, a typical method involves **dividing the original data** into several files, then reading them in a random sequence during training. Nonetheless, instances within the same file may still be grouped together. To prevent this, one can randomly select multiple files and read them concurrently, mixing their entries. Additionally, a shuffling buffer can be applied using the `shuffle()` method.
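
As a rough sketch of reading several files concurrently and mixing their lines, one option is `tf.data.Dataset.list_files()` combined with `interleave()` and `tf.data.TextLineDataset`. The three tiny CSV files below, their names and their contents are made up for this illustration. Each file has a header line, which `skip(1)` drops.

```python
import os
import tempfile
import tensorflow as tf

with tempfile.TemporaryDirectory() as tmp_dir:
    # Write three toy CSV files, each with a header and two data lines
    filepaths = []
    for file_idx in range(3):
        path = os.path.join(tmp_dir, f"part_{file_idx:02d}.csv")
        with open(path, "wt", encoding="utf-8") as f:
            f.write("value\n")
            for row_idx in range(2):
                f.write(f"{file_idx}{row_idx}\n")
        filepaths.append(path)

    # A dataset of file paths, in a shuffled order
    filepath_dataset = tf.data.Dataset.list_files(filepaths, seed=42)

    # Read from 3 files at a time, taking one line from each in turn
    lines_dataset = filepath_dataset.interleave(
        lambda filepath: tf.data.TextLineDataset(filepath).skip(1),
        cycle_length=3)

    lines = [line.numpy().decode("utf-8") for line in lines_dataset]

print(lines)
```

The first digit of each line identifies its source file, so the printed list shows consecutive lines coming from different files. A `shuffle()` call could then be chained after `interleave()` to mix the lines further.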

**Split the California dataset into multiple CSV files**

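1. Let's start by loading and preparing the California housing dataset.
   * We first load it, then split it into a training set, a validation set and a test set, and finally we fit a `StandardScaler` on the training set to obtain the mean and standard deviation of each feature (the data itself is not scaled at this point).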

In [32]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, housing.target.reshape(-1, 1), random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_full, y_train_full, random_state=42)

scaler = StandardScaler()
scaler.fit(X_train)
X_mean = scaler.mean_
X_std = scaler.scale_

2. **Splitting CSV file**: For a very large dataset that does not fit in memory, you will typically want to split it into many files first, then have TensorFlow read these files in parallel.
   * To demonstrate this, let's split the housing dataset and save it to multiple CSV files: 20 for the training set, and 10 each for the validation and test sets.

In [33]:
import os  # Importing the os module for file path manipulation
import numpy as np  # Importing numpy for array operations

# Define a function to save data to multiple CSV files
def save_to_multiple_csv_files(data, name_prefix, header=None, n_parts=10):
    # Define the directory where CSV files will be stored
    housing_dir = os.path.join("datasets", "housing")
    # Create the directory if it doesn't exist
    os.makedirs(housing_dir, exist_ok=True)
    # Define the format for the file path, where {} will be replaced by name_prefix and {:02d} will be replaced by file_idx
    path_format = os.path.join(housing_dir, "my_{}_{:02d}.csv")
    # Initialize an empty list to store file paths
    filepaths = []
    # Get the total number of rows in the data
    m = len(data)

    # Split the indices of rows into approximately equal parts (number of parts specified by n_parts)
    for file_idx, row_indices in enumerate(np.array_split(np.arange(m), n_parts)):
        # Generate the file path for the current part
        part_csv = path_format.format(name_prefix, file_idx)
        # Append the file path to the list of file paths
        filepaths.append(part_csv)
        # Open the file for writing
        with open(part_csv, "wt", encoding="utf-8") as f:
            # Write the header if provided
            if header is not None:
                f.write(header)
                f.write("\n")
            # Write each row of data to the file
            for row_idx in row_indices:
                # Write each column of the row, separated by commas
                # (float() avoids the "np.float64(...)" repr that NumPy 2.x gives to scalars)
                f.write(",".join([repr(float(col)) for col in data[row_idx]]))
                f.write("\n")
    # Return the list of file paths
    return filepaths
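
The function above relies on `np.array_split()`, which, unlike `np.split()`, accepts a number of parts that does not divide the number of rows evenly. A small sketch with toy numbers (10 rows, 3 parts, chosen only for illustration) shows how the row indices are distributed:

```python
import numpy as np

# Split the row indices 0..9 into 3 parts of nearly equal size
parts = np.array_split(np.arange(10), 3)

sizes = [len(part) for part in parts]
print(sizes)                       # the first part gets the extra row
print([part.tolist() for part in parts])
```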


In [34]:
# Concatenate the features (X) and target variable (y) for training, validation and test data
train_data = np.c_[X_train, y_train]
valid_data = np.c_[X_valid, y_valid]
test_data = np.c_[X_test, y_test]

# Define the header for the CSV files by concatenating feature names and target variable name
header_cols = housing.feature_names + ["MedianHouseValue"]
header = ",".join(header_cols)

# Save the training data to 20 CSV files, and the validation and test data to 10 CSV files each
train_filepaths = save_to_multiple_csv_files(train_data, "train", header, n_parts=20)
valid_filepaths = save_to_multiple_csv_files(valid_data, "valid", header, n_parts=10)
test_filepaths = save_to_multiple_csv_files(test_data, "test", header, n_parts=10)

# Now, let's read and display the first few lines of one of the CSV files using Pandas
import pandas as pd

# Read the first few lines of the first CSV file of the training dataset
pd.read_csv(train_filepaths[0]).head()

# # Or, alternatively, we can read the first few lines in text mode
# # Open the first CSV file of the training dataset
# with open(train_filepaths[0]) as f:
#     # Read and print the first 5 lines
#     for i in range(5):
#         print(f.readline(), end="")

# # Finally, we have the file paths of the saved CSV files for training data
# train_filepaths


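|   | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedianHouseValue |
|---|--------|----------|----------|-----------|------------|----------|----------|-----------|------------------|
| 0 | 3.5214 | 15.0 | 3.049945 | 1.106548 | 1447.0 | 1.605993 | 37.63 | -122.43 | 1.442 |
| 1 | 5.3275 | 5.0 | 6.49006 | 0.991054 | 3464.0 | 3.44334 | 33.69 | -117.39 | 1.687 |
| 2 | 3.1 | 29.0 | 7.542373 | 1.591525 | 1328.0 | 2.250847 | 38.44 | -122.98 | 1.621 |
| 3 | 7.1736 | 12.0 | 6.289003 | 0.997442 | 1054.0 | 2.695652 | 33.55 | -117.7 | 2.621 |
| 4 | 2.0549 | 13.0 | 5.312457 | 1.085092 | 3297.0 | 2.244384 | 33.93 | -116.93 | 0.956 |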
