# ETL Pipeline

The main focus of the ETL pipeline is to keep the CPU and GPU working in step so that neither device sits idle and we get the most out of the hardware we have. The CPU's job is to run the ETL process (extract, transform, load); the GPU's job is to take the data from the ETL pipeline and train the model.

Because the data is preprocessed on the CPU, the CPU can often become the bottleneck. An efficient pipeline avoids this.

## Caching

One way to build an efficient ETL pipeline is data caching. Datasets/tensors can be cached in two ways:
- Caching in memory
- Caching in storage

### Caching in memory

Here the caching takes place in RAM.
We use `tf.data.Dataset.cache()` to cache the dataset so that the preprocessing placed before the cache does not have to be repeated for every epoch.
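
As a minimal sketch of this effect, the snippet below counts how often a preprocessing function runs over two epochs when the dataset is cached in memory. The toy dataset, the `preprocess` function, and the `calls` counter are illustrative assumptions, not part of a real pipeline.

```python
import tensorflow as tf

calls = []  # records every Python-side invocation of preprocess

def preprocess(x):
    # Hypothetical stand-in for an expensive preprocessing step.
    calls.append(int(x))
    return x * 2

def tf_preprocess(x):
    # tf.py_function runs the Python function (and its side effect) per element.
    return tf.py_function(preprocess, [x], tf.int64)

dataset = tf.data.Dataset.range(5).map(tf_preprocess).cache()

epoch_1 = [int(v) for v in dataset]  # runs preprocess and fills the cache
epoch_2 = [int(v) for v in dataset]  # served from the cache, preprocess is skipped
```

Note that the cache is only complete once the dataset has been iterated to the end, so the first epoch still pays the full preprocessing cost.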

### Caching on disk

Here we use `tf.data.Dataset.cache(filename='<file name here>')` to store the cache on disk when the data is too big to be stored in memory (RAM).
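
A small sketch of the disk variant follows, assuming a temporary directory and a hypothetical file prefix `train_cache`; in a real project the path would point at fast local storage.

```python
import os
import tempfile
import tensorflow as tf

cache_dir = tempfile.mkdtemp()
cache_path = os.path.join(cache_dir, "train_cache")  # hypothetical file prefix

dataset = tf.data.Dataset.range(5).map(lambda x: x * 2).cache(filename=cache_path)

first_pass = [int(v) for v in dataset]   # writes cache files with this prefix
second_pass = [int(v) for v in dataset]  # reads the cached elements back

cache_files = os.listdir(cache_dir)  # files created by the cache
```

The cache files stay on disk between runs, so they should be deleted whenever the source data or the preprocessing changes; otherwise the stale contents are reused.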

## Parallel APIs


The main APIs in `tf.data.Dataset` that let us take advantage of parallelism and get the most out of our hardware are:
- map
- prefetch
- interleave

### AUTOTUNE

Most of the parallel APIs need parameter values that depend on the system configuration. In most cases these values can be hardcoded, but on cloud architectures, where systems are constantly scaled horizontally or vertically and the configuration keeps changing, we can't keep editing the code to supply new hardcoded values.

This is where AUTOTUNE helps. We can set almost all of these parameters to AUTOTUNE so that TensorFlow itself picks, at runtime, values that make the best use of the available resources.

```python
import tensorflow as tf
AUTOTUNE = tf.data.AUTOTUNE  # tf.data.experimental.AUTOTUNE on TF < 2.4
```
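
The sketch below shows how the constant is typically passed in place of hardcoded numbers. The toy dataset and the `+ 1` transformation are assumptions made only for illustration, and `tf.data.AUTOTUNE` assumes TensorFlow 2.4 or newer.

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE  # tf.data.experimental.AUTOTUNE on older releases

dataset = (
    tf.data.Dataset.range(8)
    .map(lambda x: x + 1, num_parallel_calls=AUTOTUNE)  # parallelism chosen at runtime
    .batch(4)
    .prefetch(AUTOTUNE)  # buffer size chosen at runtime
)

batches = [batch.numpy().tolist() for batch in dataset]
```

Because no core count or buffer size appears in the code, the same pipeline can run unchanged on machines of different sizes.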

### Map

Most often we need to perform preprocessing such as augmentation on the data before it is passed on to the model. This can be a very expensive task if we don't use all the CPU cores to parallelize it.
For example:
```python
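import tensorflow as tf
import tensorflow_datasets as tfds

def augmentation(features):
    # Apply random flips and color jitter, then resize and scale to [0, 1].
    x = tf.image.random_flip_left_right(features['image'])
    x = tf.image.random_flip_up_down(x)
    x = tf.image.random_brightness(x, max_delta=0.1)
    x = tf.image.random_saturation(x, lower=0.75, upper=1.5)
    x = tf.image.random_hue(x, max_delta=0.15)
    x = tf.image.random_contrast(x, lower=0.75, upper=1.5)
    x = tf.image.resize(x, (224, 224))
    image = x / 255.0
    return image, features['label']

# Load the dataset. 'cats_vs_dogs' is assumed to be the TFDS dataset meant
# by the original, undefined `cats_dogs` name.
dataset = tfds.load('cats_vs_dogs', split=tfds.Split.TRAIN)
# Number of CPU cores on this machine (hardcoded here).
cores = 8
# Run the augmentation on all but one core.
augmented_dataset = dataset.map(augmentation, num_parallel_calls=cores - 1)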
```
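
If a fixed number is still preferred over AUTOTUNE, the core count can at least be read from the machine instead of being typed in. The sketch below assumes synthetic random images in place of a real dataset and a hypothetical `scale` function; it is one possible approach, not the only one.

```python
import os
import tensorflow as tf

def scale(image, label):
    # Resize and scale pixel values to [0, 1].
    image = tf.image.resize(image, (224, 224))
    return image / 255.0, label

# Synthetic stand-in for a real image dataset: 4 random 32x32 RGB images.
images = tf.random.uniform((4, 32, 32, 3), maxval=255.0)
labels = tf.constant([0, 1, 0, 1])
dataset = tf.data.Dataset.from_tensor_slices((images, labels))

# Use all but one core, as in the example above, but never fewer than 1.
workers = max(1, (os.cpu_count() or 1) - 1)
scaled = dataset.map(scale, num_parallel_calls=workers)

image, label = next(iter(scaled))
```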

### Prefetch

We can use prefetch so that the CPU prepares the data for the next training step while the GPU is still executing the current one. The CPU and GPU are then used at the same time, which increases system throughput.

For example:
```python
# Reuses the `augmentation` function from the Map example.
prepped_dataset = dataset.map(augmentation).prefetch(AUTOTUNE)
```
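
To make the buffer explicit, here is a minimal sketch with a fixed `buffer_size` instead of AUTOTUNE. The toy dataset and the `preprocess` function are assumptions; the loop body stands in for a real training step.

```python
import tensorflow as tf

def preprocess(x):
    # Hypothetical stand-in for CPU-side preprocessing.
    return tf.cast(x, tf.float32) / 10.0

dataset = (
    tf.data.Dataset.range(10)
    .map(preprocess)
    .batch(5)
    .prefetch(buffer_size=2)  # keep up to 2 batches ready in the background
)

seen = []
for batch in dataset:
    # A training step would run here while upcoming batches are being prepared.
    seen.append(int(batch.shape[0]))
```

Since `prefetch` comes after `batch`, the buffer holds whole batches rather than single elements. Ending the pipeline with `prefetch` is the usual placement.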

### Interleave

We can also optimize data extraction so that data which has already been extracted (loaded into memory and ready for further use) can be preprocessed while more data is still being read, keeping the CPU busy. In other words, we parallelize the I/O with the preprocessing (map) operation.

For example:
```python
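# list_files expects a glob pattern (for example '*.tfrecord'), not a regex.
files = tf.data.Dataset.list_files('glob pattern to cover all files')

num_parallel_reads = 4

dataset = files.interleave(
    tf.data.TFRecordDataset,  # map function: opens each file as a dataset
    cycle_length=num_parallel_reads,  # number of files read concurrently
    num_parallel_calls=tf.data.experimental.AUTOTUNE)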
```
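
For a version that can be run end to end, the sketch below first writes two tiny TFRecord files into a temporary directory and then reads them back with `interleave`. The file names, record contents, and `block_length` value are assumptions chosen for illustration.

```python
import os
import tempfile
import tensorflow as tf

# Write two tiny TFRecord files so the example is self-contained.
data_dir = tempfile.mkdtemp()
for shard in range(2):
    path = os.path.join(data_dir, f"shard-{shard}.tfrecord")
    with tf.io.TFRecordWriter(path) as writer:
        for i in range(3):
            writer.write(f"shard{shard}-record{i}".encode())

# shuffle=False keeps the file listing deterministic.
files = tf.data.Dataset.list_files(os.path.join(data_dir, "*.tfrecord"), shuffle=False)

dataset = files.interleave(
    tf.data.TFRecordDataset,
    cycle_length=2,   # read from both files concurrently
    block_length=1,   # take one record from a file before moving to the next
    num_parallel_calls=tf.data.AUTOTUNE,
)

records = [r.numpy().decode() for r in dataset]
```

All six records come back from a single dataset, drawn from both files rather than reading one file to the end before opening the next.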