# Datasets
* The `tf.data` module provides classes for feeding data into a model and for manipulating that data
* In particular, a `Dataset` can
    - read data from memory
    - read data from a CSV file
    - apply transformations
* Datasets are designed to handle large amounts of data
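
The notes mention reading from a CSV file but do not show it. A minimal sketch, assuming a small hypothetical file `toy.csv` that is written to a temporary directory just for this illustration; it uses `tf.data.TextLineDataset` together with `tf.io.decode_csv`, which is one of several ways to read CSV data:

```python
import os
import tempfile

import tensorflow as tf

# Hypothetical toy file, created only for this illustration
path = os.path.join(tempfile.mkdtemp(), "toy.csv")
with open(path, "w") as f:
    f.write("a,b\n1,2\n3,4\n5,6\n")

# One dataset element per text line; skip the header row
lines = tf.data.TextLineDataset(path).skip(1)


def parse(line):
    # record_defaults fixes the column types (here: two integer columns)
    a, b = tf.io.decode_csv(line, record_defaults=[[0], [0]])
    return a, b


csv_dataset = lines.map(parse)
rows = [(int(a), int(b)) for a, b in csv_dataset.as_numpy_iterator()]
print(rows)
```

The file is read line by line, so this approach does not require the whole file to fit into memory.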

- A dataset can be accessed by
    - looping over it
    - creating a Python iterator with `iter(dataset)`

### Loading from memory
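Two methods create a dataset from data held in memory:
- `from_tensors`: the whole input becomes a single dataset element
- `from_tensor_slices`: the input is sliced along axis 0, one element per slice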

In [1]:
import tensorflow as tf
from tensorflow.data import Dataset
import numpy as np
xs = np.array([[1, 2, 3, 4], [4, 5, 6, 7], [1, 1, 1, 1]])

# the whole array becomes a single dataset element
dataset = Dataset.from_tensors(xs)

# access via iteration
for x in dataset:
    print(x)

# access via a Python iterator
ts = iter(dataset)
next(ts)

tf.Tensor(
[[1 2 3 4]
 [4 5 6 7]
 [1 1 1 1]], shape=(3, 4), dtype=int64)


<tf.Tensor: shape=(3, 4), dtype=int64, numpy=
array([[1, 2, 3, 4],
       [4, 5, 6, 7],
       [1, 1, 1, 1]])>

In [2]:
# return slices along axis 0
dataset = Dataset.from_tensor_slices(xs)
print(type(dataset))
for x in dataset:
    print(x)

<class 'tensorflow.python.data.ops.dataset_ops.TensorSliceDataset'>
tf.Tensor([1 2 3 4], shape=(4,), dtype=int64)
tf.Tensor([4 5 6 7], shape=(4,), dtype=int64)
tf.Tensor([1 1 1 1], shape=(4,), dtype=int64)


In [3]:
ts = iter(dataset)
next(ts)

<tf.Tensor: shape=(4,), dtype=int64, numpy=array([1, 2, 3, 4])>

### Inspecting the shape of a dataset
- use the `element_spec` property, which describes the shape and dtype of a single element
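
A short sketch of how `element_spec` differs between the two constructors shown earlier. It reuses the same array `xs`; the variable names `whole` and `slices` are chosen only for this illustration:

```python
import numpy as np
import tensorflow as tf

xs = np.array([[1, 2, 3, 4], [4, 5, 6, 7], [1, 1, 1, 1]])

whole = tf.data.Dataset.from_tensors(xs)         # one element of shape (3, 4)
slices = tf.data.Dataset.from_tensor_slices(xs)  # three elements of shape (4,)

# element_spec is a property (no parentheses) describing a single element
print(whole.element_spec)
print(slices.element_spec)
```

Each call prints a `tf.TensorSpec`; the reported dtype follows the dtype of the NumPy array.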
### Transformations
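- `batch` combines consecutive elements into batches
- `map` applies a function to every element
- `flat_map` maps each element to a dataset and flattens the results into a single dataset
- `repeat` repeats the dataset
- `shuffle` randomly shuffles the elements using a buffer
- `window` groups elements into windows (e.g., for time series)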
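
The transformations listed above can be tried on a tiny dataset. A minimal sketch, assuming a toy `Dataset.range(5)`; the variable names are illustrative, and `batch` and `window` are left out because they are covered in the following sections:

```python
import tensorflow as tf

ds = tf.data.Dataset.range(5)  # elements 0, 1, 2, 3, 4

# map: apply a function to every element
doubled = [int(v) for v in ds.map(lambda x: x * 2).as_numpy_iterator()]

# repeat: iterate over the dataset twice
repeated = [int(v) for v in ds.repeat(2).as_numpy_iterator()]

# flat_map: each element becomes a small dataset; the results are flattened
duplicated = [
    int(v)
    for v in ds.flat_map(
        lambda x: tf.data.Dataset.from_tensors(x).repeat(2)
    ).as_numpy_iterator()
]

# shuffle: same elements in random order (buffer covers the whole dataset)
shuffled = [int(v) for v in ds.shuffle(buffer_size=5).as_numpy_iterator()]

print(doubled)     # [0, 2, 4, 6, 8]
print(repeated)    # [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]
print(duplicated)  # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
print(shuffled)    # a random permutation of 0..4
```

Every transformation returns a new dataset, so calls can be chained.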


### Using with Keras

In [4]:
dataset = Dataset.from_tensor_slices(tf.range(102))

In [5]:
# print only the first 5 elements
for x in dataset.take(5):
    print(x)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)


In [6]:
batch = dataset.batch(10, drop_remainder=False)
for x in batch:
    print(x) # batch transformation returns a dataset containing tensors

tf.Tensor([0 1 2 3 4 5 6 7 8 9], shape=(10,), dtype=int32)
tf.Tensor([10 11 12 13 14 15 16 17 18 19], shape=(10,), dtype=int32)
tf.Tensor([20 21 22 23 24 25 26 27 28 29], shape=(10,), dtype=int32)
tf.Tensor([30 31 32 33 34 35 36 37 38 39], shape=(10,), dtype=int32)
tf.Tensor([40 41 42 43 44 45 46 47 48 49], shape=(10,), dtype=int32)
tf.Tensor([50 51 52 53 54 55 56 57 58 59], shape=(10,), dtype=int32)
tf.Tensor([60 61 62 63 64 65 66 67 68 69], shape=(10,), dtype=int32)
tf.Tensor([70 71 72 73 74 75 76 77 78 79], shape=(10,), dtype=int32)
tf.Tensor([80 81 82 83 84 85 86 87 88 89], shape=(10,), dtype=int32)
tf.Tensor([90 91 92 93 94 95 96 97 98 99], shape=(10,), dtype=int32)
tf.Tensor([100 101], shape=(2,), dtype=int32)
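
The cells above only prepare batches. As a rough sketch of how a batched dataset might be handed to Keras, assuming hypothetical toy data (`y = 2x`) and a single `Dense` layer that are not part of the original notes:

```python
import tensorflow as tf

# Hypothetical toy regression data: y = 2x
features = tf.reshape(tf.range(20, dtype=tf.float32), (-1, 1)) / 20.0
targets = 2.0 * features

# A dataset of (features, target) pairs, shuffled and batched
train = tf.data.Dataset.from_tensor_slices((features, targets))
train = train.shuffle(buffer_size=20).batch(4)

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")

# fit accepts a dataset directly; the batch size comes from the dataset
history = model.fit(train, epochs=2, verbose=0)
print(history.history["loss"])
```

Because the dataset is already batched, no `batch_size` argument is passed to `fit`.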


### Example: Windows for Time Series
- Use the `window` method
- It returns a dataset of datasets: each window is itself a `Dataset`

In [7]:
xs = Dataset.range(10)
xs = xs.window(5, shift=1)
for window in xs:
    for val in window:
        print(val.numpy(), end=" ")
    print()

0 1 2 3 4 
1 2 3 4 5 
2 3 4 5 6 
3 4 5 6 7 
4 5 6 7 8 
5 6 7 8 9 
6 7 8 9 
7 8 9 
8 9 
9 


In [8]:
xs = Dataset.range(10)
xs = xs.window(5, shift=1, drop_remainder=True)
xs = xs.flat_map(lambda window: window.batch(5))  # turn each window (a dataset) into one tensor of 5 timesteps
for window in xs:
    print(window.numpy())

[0 1 2 3 4]
[1 2 3 4 5]
[2 3 4 5 6]
[3 4 5 6 7]
[4 5 6 7 8]
[5 6 7 8 9]


In [9]:
xs = Dataset.range(10)
xs = xs.window(5, shift=1, drop_remainder=True)
xs = xs.flat_map(lambda window: window.batch(5))  # turn each window (a dataset) into one tensor of 5 timesteps
xs = xs.map(lambda window: (window[:-1], window[-1:])) # take the first 4 values as features and the last as target
for x, y in xs:
    print(x.numpy(), y.numpy())

[0 1 2 3] [4]
[1 2 3 4] [5]
[2 3 4 5] [6]
[3 4 5 6] [7]
[4 5 6 7] [8]
[5 6 7 8] [9]


- So far we have only worked with the inner windows.
- That is, we arranged the data for a single sample: 4 input values and the value to predict.
- A model should be fed training batches, that is, several such samples per training step.
- So we have to batch the outer dataset
- ... and shuffle it

In [22]:
xs = Dataset.range(10)
xs = xs.window(5, shift=1, drop_remainder=True)
xs = xs.flat_map(lambda window: window.batch(5))  # turn each window (a dataset) into one tensor of 5 timesteps
xs = xs.map(lambda window: (window[:-1], window[-1:])) # take the first 4 values as features and the last as target
# 1. shuffle the data to avoid sequence bias
# 2. set the outer batch size
# 3. prefetch data: this lets the pipeline prepare the next batch while other code (training) is being executed.
# Prefetching is generally recommended because it speeds up the pipeline.
xs = xs.shuffle(buffer_size=20).batch(2).prefetch(1)
for x, y in xs:
    print(f"Input: \n {x.numpy()}")
    print(f"Output: \n {y.numpy()}")    

Input: 
 [[1 2 3 4]
 [3 4 5 6]]
Output: 
 [[5]
 [7]]
Input: 
 [[5 6 7 8]
 [0 1 2 3]]
Output: 
 [[9]
 [4]]
Input: 
 [[2 3 4 5]
 [4 5 6 7]]
Output: 
 [[6]
 [8]]

