# Intro

I followed along with [this video](https://www.youtube.com/watch?v=fou31n3Win0) for an example.

In [3]:
import numpy as np
import tensorflow as tf
from IPython.display import HTML

# Create a dataset pipeline from numpy or lists

TensorFlow's `tf.data` module is used to work with large datasets and to build complex input pipelines from simple, reusable pieces. This is especially helpful when a whole dataset does not fit into memory, which is the rule rather than the exception in many machine learning applications. It is analogous to how a file stream reads a file line by line, rather than loading the whole file at once.

## Tensor-inception

Tensors are a class of objects that, especially in TensorFlow, can lead to some confusing nomenclature. This is because, abstractly, a tensor is 'an algebraic object that describes a (multilinear) relationship between sets of algebraic objects related to a vector space. Objects that tensors may map between include vectors and scalars, and even other tensors' ([Wikipedia - Tensor](https://en.wikipedia.org/wiki/Tensor#:~:text=In%20mathematics%2C%20a%20tensor%20is,scalars%2C%20and%20even%20other%20tensors.)). Concretely, this means that the list `data` below is a rank-1 tensor of shape `(4,)`. We create a TensorFlow dataset by slicing the `data` tensor along its first axis, hence the name `tf.data.Dataset.from_tensor_slices`.

In [4]:
# Consider data that is already in memory; we want to create a pipeline from it

data = [1, 2, 3, 4] # each piece of data is an example.

# Okay - so, here, tensor-inception begins: slice the `data` tensor into a dataset
dataset = tf.data.Dataset.from_tensor_slices(data)

dataset

<TensorSliceDataset shapes: (), types: tf.int32>

Did something go wrong? It looks like our data is not there... What we see is the representation of the dataset object, which acts like a pointer to our data rather than the data itself. The shape describes the shape of each example in the dataset: in this case a rank-0 tensor, which is just a scalar. We also see the type of each example in the dataset, in this case `tf.int32`.
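
The same information can also be read programmatically. The following is a minimal sketch, assuming TensorFlow 2.3 or later; `element_spec` and `cardinality()` are not used in the original notebook and are shown here only for illustration.

```python
import tensorflow as tf

data = [1, 2, 3, 4]
dataset = tf.data.Dataset.from_tensor_slices(data)

# element_spec describes a single example: its shape and dtype.
spec = dataset.element_spec
print(spec.shape, spec.dtype)

# cardinality() reports how many examples the dataset holds,
# without iterating over it.
num_examples = int(dataset.cardinality().numpy())
print(num_examples)
```

The empty shape `()` again says that each example is a scalar.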

In [5]:
# Access Data (iteration)

for i in dataset:
  print(i)

tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)


Okay - so now we see more information. Each item in the dataset is a `tf.Tensor`. The first field printed is its value, followed by its shape `()` (remember - a scalar) and its dtype `int32`.
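
To make "slicing along the first axis" more concrete, here is a small sketch with a rank-2 input. The array `matrix` is a made-up example, not data from the original notebook.

```python
import numpy as np
import tensorflow as tf

# A rank-2 array: 3 examples, each with 2 features.
matrix = np.array([[1, 2], [3, 4], [5, 6]])
dataset = tf.data.Dataset.from_tensor_slices(matrix)

# Slicing along the first axis yields one rank-1 tensor per row.
rows = [row.numpy().tolist() for row in dataset]
print(rows)

# Each example now has shape (2,) instead of ().
print(dataset.element_spec.shape)
```

The first axis becomes the "example" axis, and whatever remains is the shape of each example.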

# Apply instructions / transformations

Since the dataset is accessed piece by piece, rather than 'all at once', we also have to define the transformations that are applied each time a new element or batch is read from the dataset.

Instructions are applied when the dataset is accessed.

## Map

Let's say we want to take each element of our dataset and turn it into a pair of elements: the first is the original element, and the second is the original element multiplied by 2.

In [6]:
def make_new_dataset():
  dataset = tf.data.Dataset.from_tensor_slices(np.arange(start=1, stop=16, step=1))
  dataset = dataset.map(lambda x: (x, x*2))
  return dataset

In [7]:
dataset = make_new_dataset()

for example in dataset:
  print(example)

(<tf.Tensor: shape=(), dtype=int64, numpy=1>, <tf.Tensor: shape=(), dtype=int64, numpy=2>)
(<tf.Tensor: shape=(), dtype=int64, numpy=2>, <tf.Tensor: shape=(), dtype=int64, numpy=4>)
(<tf.Tensor: shape=(), dtype=int64, numpy=3>, <tf.Tensor: shape=(), dtype=int64, numpy=6>)
(<tf.Tensor: shape=(), dtype=int64, numpy=4>, <tf.Tensor: shape=(), dtype=int64, numpy=8>)
(<tf.Tensor: shape=(), dtype=int64, numpy=5>, <tf.Tensor: shape=(), dtype=int64, numpy=10>)
(<tf.Tensor: shape=(), dtype=int64, numpy=6>, <tf.Tensor: shape=(), dtype=int64, numpy=12>)
(<tf.Tensor: shape=(), dtype=int64, numpy=7>, <tf.Tensor: shape=(), dtype=int64, numpy=14>)
(<tf.Tensor: shape=(), dtype=int64, numpy=8>, <tf.Tensor: shape=(), dtype=int64, numpy=16>)
(<tf.Tensor: shape=(), dtype=int64, numpy=9>, <tf.Tensor: shape=(), dtype=int64, numpy=18>)
(<tf.Tensor: shape=(), dtype=int64, numpy=10>, <tf.Tensor: shape=(), dtype=int64, numpy=20>)
(<tf.Tensor: shape=(), dtype=int64, numpy=11>, <tf.Tensor: shape=(), dtype=int64, n

See what happened? When we access each element of the dataset, `map` turns it into a tuple of the element and its double.
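
If the `tf.Tensor` wrappers make the output hard to read, the mapped values can be pulled out as plain numbers. This is a sketch on a shorter range than the one above; `as_numpy_iterator()` is not used in the original notebook and assumes TensorFlow 2.1 or later.

```python
import numpy as np
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(np.arange(start=1, stop=4, step=1))
dataset = dataset.map(lambda x: (x, x * 2))

# as_numpy_iterator() yields NumPy values instead of tf.Tensor objects.
pairs = [(int(x), int(y)) for x, y in dataset.as_numpy_iterator()]
print(pairs)  # [(1, 2), (2, 4), (3, 6)]
```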

## Shuffle

`shuffle` randomizes the order of the dataset. It does not load the whole dataset into memory, because these datasets can be enormous. Instead, we specify a `buffer_size`: the number of elements held in memory from which the next element is drawn at random.

In [8]:
dataset = make_new_dataset()
dataset = dataset.shuffle(buffer_size=5)

for x, y in dataset:
  print(x, y)

tf.Tensor(5, shape=(), dtype=int64) tf.Tensor(10, shape=(), dtype=int64)
tf.Tensor(1, shape=(), dtype=int64) tf.Tensor(2, shape=(), dtype=int64)
tf.Tensor(3, shape=(), dtype=int64) tf.Tensor(6, shape=(), dtype=int64)
tf.Tensor(4, shape=(), dtype=int64) tf.Tensor(8, shape=(), dtype=int64)
tf.Tensor(2, shape=(), dtype=int64) tf.Tensor(4, shape=(), dtype=int64)
tf.Tensor(9, shape=(), dtype=int64) tf.Tensor(18, shape=(), dtype=int64)
tf.Tensor(10, shape=(), dtype=int64) tf.Tensor(20, shape=(), dtype=int64)
tf.Tensor(11, shape=(), dtype=int64) tf.Tensor(22, shape=(), dtype=int64)
tf.Tensor(6, shape=(), dtype=int64) tf.Tensor(12, shape=(), dtype=int64)
tf.Tensor(14, shape=(), dtype=int64) tf.Tensor(28, shape=(), dtype=int64)
tf.Tensor(12, shape=(), dtype=int64) tf.Tensor(24, shape=(), dtype=int64)
tf.Tensor(15, shape=(), dtype=int64) tf.Tensor(30, shape=(), dtype=int64)
tf.Tensor(7, shape=(), dtype=int64) tf.Tensor(14, shape=(), dtype=int64)
tf.Tensor(8, shape=(), dtype=int64) tf.Tensor(16, 

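See? The order has been randomized, but only locally. Each element is drawn at random from a buffer of five elements, and the freed slot is then refilled with the next element of the dataset. That is why the first value printed can only be one of the first five elements.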
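
The effect of the buffer can be checked directly. The sketch below uses `tf.data.Dataset.range` and a fixed `seed` purely for illustration (neither appears in the original notebook), and it relies on the buffer behaviour described in the `tf.data` documentation: fill the buffer, draw one element at random, refill the freed slot.

```python
import tensorflow as tf

buffer_size = 5
dataset = tf.data.Dataset.range(15).shuffle(buffer_size=buffer_size, seed=0)
order = [int(x) for x in dataset.as_numpy_iterator()]

# Every element still appears exactly once.
print(sorted(order) == list(range(15)))

# The i-th output is drawn from a buffer that has only seen the first
# i + buffer_size inputs, so it can never exceed i + buffer_size - 1.
print(all(value <= i + buffer_size - 1 for i, value in enumerate(order)))

# With buffer_size=1 there is nothing to choose from, so the order is unchanged.
unshuffled = [
    int(x)
    for x in tf.data.Dataset.range(15).shuffle(buffer_size=1).as_numpy_iterator()
]
print(unshuffled == list(range(15)))
```

A buffer at least as large as the dataset gives a full shuffle, at the cost of holding every element in memory.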

## Batch

`batch` groups consecutive elements into chunks of `batch_size`. Each batch is loaded separately.

In [9]:
dataset = make_new_dataset()
dataset = dataset.batch(batch_size=3)

for i, batch in enumerate(dataset):
  print("Batch #", i)
  batch_x, batch_y = batch
  print("\t", "Batch X: ", batch_x)
  print("\t", "Batch Y: ", batch_y)

Batch # 0
	 Batch X:  tf.Tensor([1 2 3], shape=(3,), dtype=int64)
	 Batch Y:  tf.Tensor([2 4 6], shape=(3,), dtype=int64)
Batch # 1
	 Batch X:  tf.Tensor([4 5 6], shape=(3,), dtype=int64)
	 Batch Y:  tf.Tensor([ 8 10 12], shape=(3,), dtype=int64)
Batch # 2
	 Batch X:  tf.Tensor([7 8 9], shape=(3,), dtype=int64)
	 Batch Y:  tf.Tensor([14 16 18], shape=(3,), dtype=int64)
Batch # 3
	 Batch X:  tf.Tensor([10 11 12], shape=(3,), dtype=int64)
	 Batch Y:  tf.Tensor([20 22 24], shape=(3,), dtype=int64)
Batch # 4
	 Batch X:  tf.Tensor([13 14 15], shape=(3,), dtype=int64)
	 Batch Y:  tf.Tensor([26 28 30], shape=(3,), dtype=int64)


Look how the data has been reshaped. Rather than storing each example as a scalar, the examples have now been batched together into rank-1 tensors of length 3 (the batch size). This is done (I guess) for efficiency reasons: a model can then process several examples in one step.
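
The example above divides evenly (15 examples, batches of 3). When it does not, the last batch is simply smaller. A short sketch, using a batch size of 4 and `tf.data.Dataset.range` as illustrative choices that are not in the original notebook:

```python
import tensorflow as tf

# 15 examples do not divide evenly into batches of 4.
sizes = [int(batch.shape[0]) for batch in tf.data.Dataset.range(15).batch(4)]
print(sizes)  # [4, 4, 4, 3]

# drop_remainder=True discards the final, smaller batch.
sizes_dropped = [
    int(batch.shape[0])
    for batch in tf.data.Dataset.range(15).batch(4, drop_remainder=True)
]
print(sizes_dropped)  # [4, 4, 4]
```

Dropping the remainder is mostly useful when downstream code needs every batch to have exactly the same shape.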

# Datasets From Generators

Generators can be used to create TensorFlow datasets, but this is inefficient: the generator has to run in the Python process that created it and is subject to Python's global interpreter lock (GIL), so it does not scale well and is hard to parallelize.

Pipeline transformations are **preferred**.

In [10]:
def dataset_generator():
  for x in np.arange(start=0, stop=6, step=1):
    yield x
    
dataset = tf.data.Dataset.from_generator(generator=dataset_generator, output_types=tf.int32)

for i in dataset:
  print(i)

tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
tf.Tensor(2, shape=(), dtype=int32)
tf.Tensor(3, shape=(), dtype=int32)
tf.Tensor(4, shape=(), dtype=int32)
tf.Tensor(5, shape=(), dtype=int32)


In [11]:
dataset

<FlatMapDataset shapes: <unknown>, types: tf.int32>
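
The `<unknown>` shape appears because `output_types` only declares the dtype. As a sketch of one alternative, assuming TensorFlow 2.4 or later, the `output_signature` argument declares both dtype and shape. The generator below uses a plain `range` instead of `np.arange` to keep the example small.

```python
import tensorflow as tf

def dataset_generator():
    for x in range(6):
        yield x

# output_signature declares the dtype and the shape of each example.
dataset = tf.data.Dataset.from_generator(
    dataset_generator,
    output_signature=tf.TensorSpec(shape=(), dtype=tf.int32),
)

# The element spec now reports a known scalar shape.
print(dataset.element_spec)

values = [int(x) for x in dataset.as_numpy_iterator()]
print(values)  # [0, 1, 2, 3, 4, 5]
```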

# Working with Tensors

When applying transformations to a `tf.data.Dataset`, we are working with Tensors, not NumPy arrays. We might want to apply such a transformation, for example, when encoding data with categorical values.

In [12]:
# What's the one-hot (categorical) encoding of class 5, given 10 different classes?
tf.keras.utils.to_categorical(y=5, num_classes=10)

array([0., 0., 0., 0., 0., 1., 0., 0., 0., 0.], dtype=float32)

This works for integer class labels only. Inside `map` we use `tf.one_hot`, which operates on Tensors.

In [26]:
def cat_encode(x, y):
  return x, tf.one_hot(y, 30)

dataset = make_new_dataset()
dataset = dataset.map(cat_encode)

for example in dataset:
  x, y = example
  print(x, y)

tf.Tensor(1, shape=(), dtype=int64) tf.Tensor(
[0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0.], shape=(30,), dtype=float32)
tf.Tensor(2, shape=(), dtype=int64) tf.Tensor(
[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0.], shape=(30,), dtype=float32)
tf.Tensor(3, shape=(), dtype=int64) tf.Tensor(
[0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0.], shape=(30,), dtype=float32)
tf.Tensor(4, shape=(), dtype=int64) tf.Tensor(
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0.], shape=(30,), dtype=float32)
tf.Tensor(5, shape=(), dtype=int64) tf.Tensor(
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0.], shape=(30,), dtype=float32)
tf.Tensor(6, shape=(), dtype=int64) tf.Tensor(
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
 0. 0. 0. 0. 0. 0.], shape=(30,), d
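
One detail worth checking in the cell above is the `depth` argument. `tf.one_hot` with a depth of 30 covers the labels 0 through 29, while the labels here are `2 * x` for `x` from 1 to 15, so the largest label is 30. The sketch below shows what happens to an out-of-range label; the depth of 31 is a suggested adjustment, not something from the original notebook.

```python
import tensorflow as tf

# A label inside [0, depth) gets a single 1 at its own index.
in_range = tf.one_hot(28, depth=30).numpy()
print(int(in_range.argmax()), float(in_range.sum()))

# A label outside [0, depth) is encoded as a row of all zeros.
out_of_range = tf.one_hot(30, depth=30).numpy()
print(float(out_of_range.sum()))

# A depth of 31 gives the label 30 its own position.
fixed = tf.one_hot(30, depth=31).numpy()
print(int(fixed.argmax()))
```

No error is raised for the out-of-range label, so a mismatch like this is easy to miss.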

# Training with Dataset API

Just pass the dataset to a Keras model's `.fit`!
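
As a rough sketch of what that can look like, assuming a Keras model: the single-layer model, the Adam optimizer, the two epochs, and the `(features, labels)` tuple passed to `from_tensor_slices` are all illustrative choices, not part of the original notebook.

```python
import numpy as np
import tensorflow as tf

# Features x and labels y = 2x, shaped (num_examples, 1) for a Dense layer.
xs = np.arange(1, 16, dtype=np.float32).reshape(-1, 1)
ys = 2.0 * xs

# Slicing a tuple yields (x, y) pairs, which is the format fit() expects.
dataset = tf.data.Dataset.from_tensor_slices((xs, ys))
dataset = dataset.shuffle(buffer_size=15).batch(3)

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="adam", loss="mse")

# The dataset already defines the batches, so no batch_size is passed to fit().
history = model.fit(dataset, epochs=2, verbose=0)
print(history.history["loss"])
```

Shuffling, batching, and any `map` steps all stay in the dataset pipeline, so the training call itself needs nothing more than the dataset and the number of epochs.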