# Loading and Preprocessing Data with TensorFlow

In [24]:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import os
import pandas as pd

## The Data API

The whole Data API revolves around the concept of a *dataset*. Usually, we will use datasets that gradually read data from disk, but for simplicity let's create a dataset entirely in RAM using `tf.data.Dataset.from_tensor_slices()`:

In [14]:
X = tf.range(10) # any data tensor
dataset = tf.data.Dataset.from_tensor_slices(X)
dataset

<_TensorSliceDataset element_spec=TensorSpec(shape=(), dtype=tf.int32, name=None)>

The `from_tensor_slices()` function takes a tensor and creates a `tf.data.Dataset` whose elements are all the slices of `X` (along the first dimension), so this dataset contains 10 items: tensors 0, 1, 2, ..., 9. In this case we would have obtained the same items if we had used `tf.data.Dataset.range(10)`, except that their dtype would be `int64` rather than `int32`.
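As a quick check of that equivalence, the following minimal sketch builds the dataset both ways and compares the values and element dtypes. The variable names are only illustrative.

```python
import tensorflow as tf

from_slices = tf.data.Dataset.from_tensor_slices(tf.range(10))
from_range = tf.data.Dataset.range(10)

# Both datasets yield the values 0 through 9.
slice_values = [int(item) for item in from_slices]
range_values = [int(item) for item in from_range]
print(slice_values == range_values)

# The element dtypes differ: tf.range() defaults to int32, Dataset.range() to int64.
print(from_slices.element_spec.dtype, from_range.element_spec.dtype)
```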

We can simply iterate over a dataset's items like this:

In [15]:
for item in dataset:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(7, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(9, shape=(), dtype=int32)


### Chaining Transformations

Once we have a dataset, we can apply all sorts of transformations to it by calling its transformation methods. Each method returns a new dataset, so we can chain transformations like this:

In [16]:
dataset = dataset.repeat(3).batch(7)
for item in dataset:
    print(item)

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int32)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int32)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int32)
tf.Tensor([8 9], shape=(2,), dtype=int32)


In this example, we first call the `repeat()` method on the original dataset, and it returns a new dataset that repeats the items of the original dataset three times. Then we call the `batch()` method on this new dataset, and again this creates a new dataset. This one groups the items of the previous dataset into batches of seven items. Finally, we iterate over the items of this final dataset.

As we can see, the `batch()` method had to output a final batch of size two instead of seven, but we can call it with `drop_remainder=True` if we want it to drop this final batch so that all batches have exactly the same size.
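As a minimal sketch of the `drop_remainder=True` option, the snippet below batches 30 items into groups of seven and records the size of each batch. The dataset here is rebuilt from scratch for illustration rather than reusing the one above.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10).repeat(3)  # 30 items in total

# 30 is not a multiple of 7, so the last 2 items are dropped.
batched = dataset.batch(7, drop_remainder=True)
batch_sizes = [int(batch.shape[0]) for batch in batched]
print(batch_sizes)  # [7, 7, 7, 7]
```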

We can also transform the items by calling the `map()` method. For example, this creates a new dataset with all items doubled:

In [17]:
dataset = dataset.map(lambda x: x*2)
for item in dataset:
    print(item)

tf.Tensor([ 0  2  4  6  8 10 12], shape=(7,), dtype=int32)
tf.Tensor([14 16 18  0  2  4  6], shape=(7,), dtype=int32)
tf.Tensor([ 8 10 12 14 16 18  0], shape=(7,), dtype=int32)
tf.Tensor([ 2  4  6  8 10 12 14], shape=(7,), dtype=int32)
tf.Tensor([16 18], shape=(2,), dtype=int32)


This method is the one we will call to apply any preprocessing we want to our data. Sometimes this will include computations that can be quite intensive, such as reshaping or rotating an image, so we will usually want to spawn multiple threads to speed things up: it's as simple as setting the `num_parallel_calls` argument.

**NOTE**:
The function we pass to the `map()` method must be convertible to a TF Function.
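Here is a hedged sketch of a parallel `map()` call. It assumes a TensorFlow version that provides `tf.data.AUTOTUNE`, which lets TensorFlow choose the number of threads. The `double()` helper is hypothetical and uses only TensorFlow operations, so it can be converted to a TF Function.

```python
import tensorflow as tf

def double(x):
    # Only TensorFlow operations are used, so this converts cleanly to a TF Function.
    return x * 2

dataset = tf.data.Dataset.range(10)

# Let TensorFlow pick the level of parallelism; item order is preserved by default.
doubled = dataset.map(double, num_parallel_calls=tf.data.AUTOTUNE)
doubled_values = [int(x) for x in doubled]
print(doubled_values)
```

We could also pass an explicit integer such as `num_parallel_calls=4` if we prefer to fix the number of parallel calls ourselves.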

While the `map()` method applies a transformation to each item, the `apply()` method applies a transformation to the dataset as a whole.

The `unbatch()` method is another whole-dataset transformation. For example, calling it on our dataset creates a new dataset where each item is a single-integer tensor instead of a batch of seven integers:

In [18]:
dataset = dataset.unbatch()
for item in dataset:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(10, shape=(), dtype=int32)
tf.Tensor(12, shape=(), dtype=int32)
tf.Tensor(14, shape=(), dtype=int32)
tf.Tensor(16, shape=(), dtype=int32)
tf.Tensor(18, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(10, shape=(), dtype=int32)
tf.Tensor(12, shape=(), dtype=int32)
tf.Tensor(14, shape=(), dtype=int32)
tf.Tensor(16, shape=(), dtype=int32)
tf.Tensor(18, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(10, shape=(), dtype=int32)
tf.Tensor(12, shape=(), dtype=int32)
tf.Tensor(14, shape=(), dtype=int32)
tf.Tensor(16, shape=(), dtype=int32)
tf.Tensor(18, shape=(), dtype=int32)
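To illustrate the `apply()` method mentioned earlier, here is a minimal sketch. The function passed to `apply()` receives the whole dataset and must return a new dataset. The `keep_even_then_batch()` helper is hypothetical and exists only for this example.

```python
import tensorflow as tf

def keep_even_then_batch(ds):
    # A dataset-level transformation: takes a whole dataset, returns a new one.
    return ds.filter(lambda x: x % 2 == 0).batch(3)

dataset = tf.data.Dataset.range(10)
transformed = dataset.apply(keep_even_then_batch)
batches = [batch.numpy().tolist() for batch in transformed]
print(batches)  # [[0, 2, 4], [6, 8]]
```

Packaging several steps into one function like this makes it easy to reuse the same pipeline fragment on several datasets.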

It is also possible to simply filter the dataset using the `filter()` method:

In [19]:
dataset = dataset.filter(lambda x: x < 10)
for item in dataset:
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(6, shape=(), dtype=int32)
tf.Tensor(8, shape=(), dtype=int32)


We will often want to look at just a few items from a dataset. We can use the `take()` method for that:

In [20]:
for item in dataset.take(3):
    print(item)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)


### Shuffling the Data

Gradient Descent works best when the instances in the training set are independent and identically distributed. A simple way to ensure this is to shuffle the instances using the `shuffle()` method. For example:

In [21]:
dataset = tf.data.Dataset.range(10).repeat(3) # 0 to 9, 3 times
# We must specify the buffer size, and it is important to make it large enough,
# or else shuffling will not be very effective.
dataset = dataset.shuffle(buffer_size=5, seed=42).batch(7)
for item in dataset:
    print(item)

tf.Tensor([0 2 3 6 7 9 4], shape=(7,), dtype=int64)
tf.Tensor([5 0 1 1 8 6 5], shape=(7,), dtype=int64)
tf.Tensor([4 8 7 1 2 3 0], shape=(7,), dtype=int64)
tf.Tensor([5 4 2 7 8 9 9], shape=(7,), dtype=int64)
tf.Tensor([3 6], shape=(2,), dtype=int64)
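To see why the buffer size matters, the following sketch compares a few buffer sizes on a tiny dataset. It relies on the fact that `shuffle()` draws each output item at random from a buffer filled with the next `buffer_size` items of the source dataset.

```python
import tensorflow as tf

dataset = tf.data.Dataset.range(10)

# With a buffer of size 1 there is nothing to choose from, so the order is unchanged.
no_shuffle = [int(x) for x in dataset.shuffle(buffer_size=1)]
print(no_shuffle)

# With a buffer of size 3, the first item drawn must be one of the first 3 items.
first_small = [int(x) for x in dataset.shuffle(buffer_size=3).take(1)][0]
print(first_small in (0, 1, 2))

# A buffer as large as the dataset allows any permutation of the items.
full_shuffle = [int(x) for x in dataset.shuffle(buffer_size=10, seed=42)]
print(sorted(full_shuffle) == list(range(10)))
```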


**TIP:**

If we call `repeat()` on a shuffled dataset, by default it will generate a new order at every iteration.
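The sketch below illustrates this behavior. It also shows the `reshuffle_each_iteration=False` argument of `shuffle()`, which makes every repetition reuse the same order. The actual orders depend on the random seed, so no specific output is shown.

```python
import tensorflow as tf

base = tf.data.Dataset.range(5)

# Default behavior: each repetition is a complete pass in a newly drawn order.
reshuffled = base.shuffle(buffer_size=5, seed=42).repeat(2)
values = [int(x) for x in reshuffled]
first_pass, second_pass = values[:5], values[5:]
print(first_pass, second_pass)

# With reshuffle_each_iteration=False, every repetition reuses the same order.
fixed = base.shuffle(buffer_size=5, seed=42, reshuffle_each_iteration=False).repeat(2)
fixed_values = [int(x) for x in fixed]
fixed_first, fixed_second = fixed_values[:5], fixed_values[5:]
print(fixed_first == fixed_second)
```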

For a large dataset that does not fit in memory, this simple shuffling-buffer approach may not be sufficient, since the buffer will be small compared to the dataset. One solution is to shuffle the source data itself, which improves shuffling considerably. Even if the source data is already shuffled, we will usually want to shuffle it some more. A common approach is to split the source data into multiple files, then read them in random order during training. However, instances located in the same file will still end up close to each other. To avoid this, we can pick multiple files randomly and read them simultaneously, interleaving their records. Then, on top of that, we can add a shuffling buffer using the `shuffle()` method. Let's see how it works.

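Let's start by loading and preparing the California housing dataset. We first load it, then split it into a training set, a validation set, and a test set, and finally we compute the mean and standard deviation of each feature so that we can scale the data: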

In [23]:
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

housing = fetch_california_housing()
X_train_full, X_test, y_train_full, y_test = train_test_split(
    housing.data, housing.target.reshape(-1, 1), random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full, random_state=42)

scaler = StandardScaler()
scaler.fit(X_train)
X_mean = scaler.mean_
X_std = scaler.scale_
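The `mean_` and `scale_` attributes are all we need to standardize features by hand. As a minimal sketch on synthetic stand-in data (the `X_demo` array is hypothetical, not the housing data), subtracting the mean and dividing by the scale reproduces what `scaler.transform()` computes:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # synthetic stand-in data

scaler = StandardScaler()
scaler.fit(X_demo)

# Standardize manually with the fitted per-feature mean and standard deviation.
manual = (X_demo - scaler.mean_) / scaler.scale_
matches = np.allclose(manual, scaler.transform(X_demo))
print(matches)  # True
```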

For a very large dataset that does not fit in memory, we will typically need to split it into many files first, then have TensorFlow read these files in parallel. To demonstrate this, let's split the housing dataset and save the training set to 20 CSV files:

In [45]:
def save_to_multiple_files(data, name_prefix, header=None, n_parts=10):
    housing_dir = os.path.join(os.curdir, "datasets", "california_housing")
    # print(housing_dir)
    os.makedirs(housing_dir, exist_ok=True)
    path_format = os.path.join(housing_dir, "my_{}_{:02d}.csv")
    
    file_paths = []
    m = len(data)
    for file_idx, row_indices in enumerate(np.array_split(np.arange(m), n_parts)):
        part_csv = path_format.format(name_prefix, file_idx)
        file_paths.append(part_csv)
        with open(part_csv, "wt", encoding="utf-8") as f:
            if header is not None:
                f.write(header)
                f.write("\n")
            for row_idx in row_indices:
                # repr() of a Python float gives a string that round-trips exactly.
                # Converting with float() first avoids the "np.float64(...)" repr
                # that NumPy 2 produces for NumPy scalars.
                f.write(",".join([repr(float(col)) for col in data[row_idx]]))
                f.write("\n")
    return file_paths
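The function above relies on `np.array_split()` to divide the row indices into nearly equal chunks, and on a format string to build zero-padded file names. Here is a small sketch of both pieces in isolation, using a toy size of 10 rows and 3 parts:

```python
import numpy as np

m = 10       # pretend the dataset has 10 rows
n_parts = 3  # and we want to write 3 files

# array_split() tolerates uneven splits: the first chunks get one extra row.
chunks = np.array_split(np.arange(m), n_parts)
chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)  # [4, 3, 3]

# {:02d} pads the file index to two digits, so file names sort correctly.
path_format = "my_{}_{:02d}.csv"
file_names = [path_format.format("train", idx) for idx in range(n_parts)]
print(file_names)  # ['my_train_00.csv', 'my_train_01.csv', 'my_train_02.csv']
```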

In [46]:
train_data = np.c_[X_train, y_train]
valid_data = np.c_[X_valid, y_valid]
test_data = np.c_[X_test, y_test]
header_cols = housing.feature_names + ["MedianHouseValue"]
header = ",".join(header_cols)

train_filepaths = save_to_multiple_files(train_data, "train", header, n_parts=20)
valid_filepaths = save_to_multiple_files(valid_data, "valid", header, n_parts=10)
test_filepaths = save_to_multiple_files(test_data, "test", header, n_parts=10)

In [47]:
train_filepaths

['./datasets/california_housing/my_train_00.csv',
 './datasets/california_housing/my_train_01.csv',
 './datasets/california_housing/my_train_02.csv',
 './datasets/california_housing/my_train_03.csv',
 './datasets/california_housing/my_train_04.csv',
 './datasets/california_housing/my_train_05.csv',
 './datasets/california_housing/my_train_06.csv',
 './datasets/california_housing/my_train_07.csv',
 './datasets/california_housing/my_train_08.csv',
 './datasets/california_housing/my_train_09.csv',
 './datasets/california_housing/my_train_10.csv',
 './datasets/california_housing/my_train_11.csv',
 './datasets/california_housing/my_train_12.csv',
 './datasets/california_housing/my_train_13.csv',
 './datasets/california_housing/my_train_14.csv',
 './datasets/california_housing/my_train_15.csv',
 './datasets/california_housing/my_train_16.csv',
 './datasets/california_housing/my_train_17.csv',
 './datasets/california_housing/my_train_18.csv',
 './datasets/california_housing/my_train_19.csv']

Okay, now let's look at the first few lines of one of these CSV files:

In [48]:
pd.read_csv(train_filepaths[0]).head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedianHouseValue |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.5214 | 15.0 | 3.049945 | 1.106548 | 1447.0 | 1.605993 | 37.63 | -122.43 | 1.442 |
| 1 | 5.3275 | 5.0 | 6.49006 | 0.991054 | 3464.0 | 3.44334 | 33.69 | -117.39 | 1.687 |
| 2 | 3.1 | 29.0 | 7.542373 | 1.591525 | 1328.0 | 2.250847 | 38.44 | -122.98 | 1.621 |
| 3 | 7.1736 | 12.0 | 6.289003 | 0.997442 | 1054.0 | 2.695652 | 33.55 | -117.7 | 2.621 |
| 4 | 2.0549 | 13.0 | 5.312457 | 1.085092 | 3297.0 | 2.244384 | 33.93 | -116.93 | 0.956 |


Or in text mode:

In [51]:
with open(train_filepaths[0]) as f:
    for i in range(5):
        print(f.readline(), end="")

MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,MedianHouseValue
3.5214,15.0,3.0499445061043287,1.106548279689234,1447.0,1.6059933407325193,37.63,-122.43,1.442
5.3275,5.0,6.490059642147117,0.9910536779324056,3464.0,3.4433399602385686,33.69,-117.39,1.687
3.1,29.0,7.5423728813559325,1.5915254237288134,1328.0,2.2508474576271187,38.44,-122.98,1.621
7.1736,12.0,6.289002557544757,0.9974424552429667,1054.0,2.6956521739130435,33.55,-117.7,2.621


#### Interleaving lines from multiple files

Now let's create a dataset containing only the training file paths:

In [52]:
filepath_data = tf.data.Dataset.list_files(train_filepaths, seed=42)

By default, `list_files()` returns a dataset that shuffles the file paths. In general this is a good thing, but we can set `shuffle=False` if we do not want that for some reason.
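As a hedged sketch of the `shuffle=False` option, the snippet below writes three small throwaway CSV files to a temporary directory (the file names are hypothetical, not the housing files) and lists them without shuffling, so the paths come back in a deterministic order:

```python
import os
import tempfile
import tensorflow as tf

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create three tiny CSV files just for this demonstration.
    paths = []
    for i in range(3):
        path = os.path.join(tmp_dir, f"part_{i:02d}.csv")
        with open(path, "wt", encoding="utf-8") as f:
            f.write("a,b\n1,2\n")
        paths.append(path)

    # With shuffle=False, the file paths are returned in a deterministic order.
    ordered = tf.data.Dataset.list_files(paths, shuffle=False)
    listed = [os.path.basename(p.decode("utf-8")) for p in ordered.as_numpy_iterator()]

print(listed)
```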