# Introduction

TensorFlow provides an Data-API to to create dataset objects. TensorFlow takes care of: 

1. Multithreading
2. Queing
3. Batching
4. Prefetching

The TensorFlow DataAPI can read multiple file format: 

1. CSV
2. JSON
3. TFRecord
4. SQL Databases

__Usualy Datasets are used to gradually read data from disk__

In [37]:
import tensorflow as tf

## Important Functions

### Creating a Dataset Object

### ```.from_tensor_slices()```

Used to create a Dataset which contain the sliced tensors. Tensors are slices across their first dimension: 

1. __From 1-Dimensional Array__  

```python
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3])
list(dataset.as_numpy_iterator())
# dataset = [1,2,3]
```

2. __From 2-Dimensional Array__  

```python
dataset = tf.data.Dataset.from_tensor_slices([[1, 2], [3, 4]])
list(dataset.as_numpy_iterator())
# [[1,2], [1,2]]
```

3. __From Tuples (ATTENTION)__  

```python
dataset = tf.data.Dataset.from_tensor_slices(([1, 2], [3, 4], [5, 6]))
list(dataset.as_numpy_iterator())
# [(1,3,5), (2,4,6)]
```



In [38]:
X = tf.range(10)

In [39]:
dataset = tf.data.Dataset.from_tensor_slices(X)

In [40]:
dataset

<TensorSliceDataset shapes: (), types: tf.int32>

### Chaining Transformation

Once a dataset is created we can apply multiple transformation methods - e.g.:

- ```.batch()```
- ```.repeat()```
- ```.map()```
- ```.apply()```
- ```.filter()```
- ```.shuffle()```
- ```.take()```
- ```.list_files()```

__Each transformation method returns a new dataset object - so they can be changed together!__

In [41]:
dataset.batch(7)

<BatchDataset shapes: (None,), types: tf.int32>

In [42]:
dataset_ = dataset.repeat(3).batch(7)

In [43]:
for item in dataset_:
    print(item) # The last Tensor will have a different shape size. 

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int32)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int32)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int32)
tf.Tensor([8 9], shape=(2,), dtype=int32)


In [45]:
# Turning on drop_remainder
for item in dataset.repeat(3).batch(7, drop_remainder=True):
    print(item)

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int32)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int32)
tf.Tensor([1 2 3 4 5 6 7], shape=(7,), dtype=int32)


In [48]:
# Only taking few examples from dataset
for item in dataset_.take(3):
    print(item)

tf.Tensor([0 1 2 3 4 5 6], shape=(7,), dtype=int32)
tf.Tensor([7 8 9 0 1 2 3], shape=(7,), dtype=int32)
tf.Tensor([4 5 6 7 8 9 0], shape=(7,), dtype=int32)


## Shuffling Data & Buffer

__Buffers:__ 

-> Buffers are used to store temporiarly data   
-> Often used when there is in asymetry between input and output of data (e.g. Inputing and Image to a neural net [input] and outputing the classification result[output])  
-> To high buffers sizes often lead to OUtOfMemory Errors  

__```.shuffling()```__

-> Will create a new dataset thats starts filling up a Buffer  
-> Whenever it's asked to pull an item, it's feeding it from the buffer  
-> Buffer filled up with new elements  
-> Loop continues till StopIteration  


__Large Dataset__  
-> Maybe shuffling is not sufficient -> e.g. Buffer to small due to memory limitations  
-> We could shuffle the dataset before  
-> __Optimal:__ Divide the data in different files and feed them randomly during training.  

In [54]:
 dataset = tf.data.Dataset.range(10).repeat(3)

In [57]:
for item in dataset.shuffle(buffer_size=3, seed=2).batch(7, drop_remainder=True):
    print(item)

tf.Tensor([2 3 4 5 6 0 7], shape=(7,), dtype=int64)
tf.Tensor([1 0 1 2 8 4 5], shape=(7,), dtype=int64)
tf.Tensor([3 6 8 9 0 9 2], shape=(7,), dtype=int64)
tf.Tensor([3 4 1 6 5 8 7], shape=(7,), dtype=int64)


## Interleaving

The idea to divide the data set into different files and to process them incremently.



In [60]:
train_fps = ["../..data/housing_1.csv", "../../data/housing_2.csv"]

### ```.list_files()```

Produces "globbed" file names. 

In [67]:
# Dataset only containing file paths
fp_dataset = tf.data.Dataset.list_files(train_fps, shuffle=True, seed=42)

### ```data.TextLineDataset()```

Creates a dataset containing lines of text file. With CSV-Files we use ```.skip(1)``` to skip the csv header. Read Data are only byte string they need to __parsed!!__

In [80]:
import os

In [81]:
for item in tf.data.TextLineDataset("../../data/housing_1.csv").skip(1):
    print(item)
    break

tf.Tensor(b'-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY', shape=(), dtype=string)


### ```.interleave()```

__General__: Execute a function on multiple files at one time, in random order.   
__More specific__: Pulls from multiple files one line at the time in random order.   


_num_parallel_cells:_ Argument to run interleave in parallel. This argument is later used to multithread the loading!

In [82]:
n_readers = 5
dataset = fp_dataset.interleave(
    lambda file_path: tf.data.TextLineDataset(file_path).skip(1),
    cycle_length = n_readers
)

In [83]:
for line in dataset.take(5):
    print(line)

tf.Tensor(b'-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY', shape=(), dtype=string)
tf.Tensor(b'-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY', shape=(), dtype=string)
tf.Tensor(b'-122.24,37.85,52.0,1467.0,190.0,496.0,177.0,7.2574,352100.0,NEAR BAY', shape=(), dtype=string)
tf.Tensor(b'-122.25,37.85,52.0,1274.0,235.0,558.0,219.0,5.6431,341300.0,NEAR BAY', shape=(), dtype=string)
tf.Tensor(b'-122.25,37.85,52.0,1627.0,280.0,565.0,259.0,3.8462,342200.0,NEAR BAY', shape=(), dtype=string)


# Prefetech & Summary

<img src="../../img/1302.png" />

### ```.prefetch()```

Tries to be one dataset on batch ahead. The Prefetch dataset will work in parallel to always try to keep a batch ready. 

Benefits: 

1. Performance
2. Ensure that preprocccing and loading are multithreaded
    

<img src="../../img/1303.png" />