<a href="https://colab.research.google.com/github/probml/pyprobml/blob/master/book1/mlp/tfds_intro.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to TensorFlow Datasets

[TFDS](https://www.tensorflow.org/datasets) is a handy way to handle large datasets as a stream of batches. It can be used by both TensorFlow and JAX code.




In [1]:
# Standard Python libraries
from __future__ import absolute_import, division, print_function, unicode_literals

import os
import time
import numpy as np
import glob
import matplotlib.pyplot as plt
import PIL
import imageio

from IPython import display

import sklearn

import seaborn as sns
sns.set(style="ticks", color_codes=True)

import pandas as pd
pd.set_option('display.precision', 2)  # 2 decimal places (full option name; the bare 'precision' is ambiguous in recent pandas)
pd.set_option('display.max_rows', 20)
pd.set_option('display.max_columns', 30)
pd.set_option('display.width', 100) # wide windows



In [3]:

try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
    IS_COLAB = True
except Exception:
    IS_COLAB = False

# TensorFlow ≥2.0 is required
import tensorflow as tf
from tensorflow import keras
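# Compare the major version as an integer; comparing version strings can misorder them.
assert int(tf.__version__.split(".")[0]) >= 2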

print("tf version {}".format(tf.__version__))

if not tf.config.list_physical_devices('GPU'):
    print("No GPU was detected. DNNs can be very slow without a GPU.")
    if IS_COLAB:
        print("Go to Runtime > Change runtime and select a GPU hardware accelerator.")

# Converting NumPy data to a stream of batches

`tf.data.Dataset` offers functionality similar to the PyTorch DataLoader, but natively supports infinite streams via the `repeat` method. Also, because we call `repeat` before `batch`, all minibatches have the same size (note how the batches 'wrap around' the end of the dataset).
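The wrap-around behaviour can be sketched without TensorFlow. The helper `repeat_and_batch` below is hypothetical (it is not part of any library) and only mimics what `repeat()` followed by `batch()` does to a small array; it is meant as an illustration, not as a description of how `tf.data` is implemented.

```python
import itertools
import numpy as np

def repeat_and_batch(data, batch_size):
    """Yield fixed-size batches forever, wrapping around the end of `data`.

    Hypothetical helper that mimics `dataset.repeat().batch(batch_size)`.
    """
    stream = itertools.cycle(data)  # endless stream of examples, like repeat()
    while True:
        # Take the next `batch_size` examples from the shared stream, like batch()
        yield np.array(list(itertools.islice(stream, batch_size)))

values = np.arange(5)  # toy dataset: 0, 1, 2, 3, 4
batch_iter = repeat_and_batch(values, batch_size=2)
first_four = [next(batch_iter).tolist() for _ in range(4)]
print(first_four)  # [[0, 1], [2, 3], [4, 0], [1, 2]]
```

The third batch contains the last example followed by the first one, which is the wrap-around the text refers to.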


In [4]:
N_train = 5
D = 4            
np.random.seed(0)
X = np.random.randn(N_train, D)
y = np.random.randn(N_train)
print(y)
batch_size = 2
dataset = tf.data.Dataset.from_tensor_slices({"X": X, "y": y})
batches = dataset.repeat().batch(batch_size)

print('batchified version')
step = 0
num_minibatches = 4
for batch in batches:
    if step >= num_minibatches:
        break
    # print(type(batch["X"]))  # <class 'tensorflow.python.framework.ops.EagerTensor'>
    x, y = batch["X"].numpy(), batch["y"].numpy()
    print(y)
    step = step + 1



print('batchified version v2')
batch_stream = batches.as_numpy_iterator()
for step in range(num_minibatches):
    batch = next(batch_stream)  # the built-in next() advances the iterator
    # print(type(batch["X"]))  # <class 'numpy.ndarray'>
    x, y = batch["X"], batch["y"]
    print(y)

[-2.55298982  0.6536186   0.8644362  -0.74216502  2.26975462]
batchified version
[-2.55298982  0.6536186 ]
[ 0.8644362  -0.74216502]
[ 2.26975462 -2.55298982]
[0.6536186 0.8644362]
batchified version v2
[-2.55298982  0.6536186 ]
[ 0.8644362  -0.74216502]
[ 2.26975462 -2.55298982]
[0.6536186 0.8644362]
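The order of `repeat` and `batch` matters for batch sizes. The sketch below uses a toy dataset of the integers 0 to 4 (an assumption made for illustration, not the data used above) and compares three pipelines; `first_batches` is a hypothetical helper defined only for this example.

```python
import tensorflow as tf

def first_batches(dataset, n=4):
    """Return the first n batches of a dataset as plain Python lists."""
    return [b.tolist() for b in dataset.take(n).as_numpy_iterator()]

ds = tf.data.Dataset.range(5)  # toy dataset: 0, 1, 2, 3, 4

# repeat() first: batches wrap around the end, so every batch is full.
repeat_then_batch = first_batches(ds.repeat().batch(2))
# batch() first: each pass over the data ends with a short batch.
batch_then_repeat = first_batches(ds.batch(2).repeat())
# batch() first with drop_remainder=True: the short batch is discarded.
drop_remainder = first_batches(ds.batch(2, drop_remainder=True).repeat())

print(repeat_then_batch)  # [[0, 1], [2, 3], [4, 0], [1, 2]]
print(batch_then_repeat)  # [[0, 1], [2, 3], [4], [0, 1]]
print(drop_remainder)     # [[0, 1], [2, 3], [0, 1], [2, 3]]
```

Repeating before batching keeps every example in play and gives fixed-size batches, at the cost of batches that straddle epoch boundaries.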


# Using pre-packaged datasets

There are many standard datasets available from https://www.tensorflow.org/datasets. We give some examples below.


In [None]:
import tensorflow_datasets as tfds
dataset = tfds.load(name="mnist", split=tfds.Split.TRAIN)

batches = dataset.repeat().batch(batch_size)

step = 0
for batch in batches:
    if step >= num_minibatches:
        break
    X, y = batch['image'], batch['label']
    print(type(X))
    print(X.shape)
    step = step + 1

<class 'tensorflow.python.framework.ops.EagerTensor'>
(2, 28, 28, 1)
<class 'tensorflow.python.framework.ops.EagerTensor'>
(2, 28, 28, 1)
<class 'tensorflow.python.framework.ops.EagerTensor'>
(2, 28, 28, 1)
<class 'tensorflow.python.framework.ops.EagerTensor'>
(2, 28, 28, 1)
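To hand such batches to NumPy or JAX code, one option is to iterate with `as_numpy_iterator()` and scale the images to floats first. The sketch below avoids downloading anything by using a synthetic stand-in dataset; the assumption is that it has the same keys, shapes, and dtypes as the MNIST examples above (`uint8` images of shape 28×28×1 and integer labels). The function `to_float` is a hypothetical preprocessing step, not something TFDS applies for you.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in for MNIST examples (assumption: same keys, shapes and dtypes).
rng = np.random.default_rng(0)
fake = {
    "image": rng.integers(0, 256, size=(6, 28, 28, 1), dtype=np.uint8),
    "label": rng.integers(0, 10, size=(6,), dtype=np.int64),
}
ds = tf.data.Dataset.from_tensor_slices(fake)

def to_float(example):
    """Hypothetical preprocessing: scale uint8 pixels to floats in [0, 1]."""
    image = tf.cast(example["image"], tf.float32) / 255.0
    return {"image": image, "label": example["label"]}

batches = ds.map(to_float).repeat().batch(2)

# as_numpy_iterator() yields dicts of NumPy arrays instead of EagerTensors.
batch = next(batches.as_numpy_iterator())
print(type(batch["image"]), batch["image"].shape, batch["image"].dtype)
```

With the real dataset, the same `map(...).repeat().batch(...)` chain should apply to the object returned by `tfds.load`, provided the feature keys match.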
