# Using TensorFlow Datasets
Guide:
[https://www.tensorflow.org/datasets/overview](https://www.tensorflow.org/datasets/overview)

Datasets:
[https://www.tensorflow.org/datasets/catalog/overview](https://www.tensorflow.org/datasets/catalog/overview)

In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds

In [None]:
tfds.__version__

'4.9.3'

## Download a dataset by name

In [None]:
(train_dataset, test_dataset), info = tfds.load('mnist', split=['train', 'test'],
                                                shuffle_files=True, as_supervised=True,
                                                with_info=True)

Downloading and preparing dataset 11.06 MiB (download: 11.06 MiB, generated: 21.00 MiB, total: 32.06 MiB) to /root/tensorflow_datasets/mnist/3.0.1...


Dataset mnist downloaded and prepared to /root/tensorflow_datasets/mnist/3.0.1. Subsequent calls will reuse this data.


## Preview samples

In [None]:
# tfds.show_examples(train_dataset, info)
tfds.as_dataframe(train_dataset.take(5), info)

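| | image | label |
|---|---|---|
| 0 | (image not rendered) | 4 |
| 1 | (image not rendered) | 1 |
| 2 | (image not rendered) | 0 |
| 3 | (image not rendered) | 7 |
| 4 | (image not rendered) | 8 |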


In [None]:
type(train_dataset)

tensorflow.python.data.ops.prefetch_op._PrefetchDataset

## Pre-process data

In [None]:
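def preprocess_data(images, labels):
    # Called by Dataset.map before batching, so it receives one example at a time.
    # Normalize pixel values from [0, 255] to [0, 1]
    images = tf.cast(images, tf.float32) / 255.
    # Flatten the 28x28x1 image into a 784-element vector
    images = tf.reshape(images, (784,))
    return images, labels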
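As a quick sanity check, the same two steps can be run on a single synthetic image. This is a minimal sketch, not part of the original notebook: `preprocess_example`, `fake_image`, and `fake_label` are hypothetical names, and the all-white image is made-up data chosen so the maximum value after scaling is easy to verify.

```python
import numpy as np
import tensorflow as tf

def preprocess_example(image, label):
    # Same steps as preprocess_data: scale to [0, 1], then flatten 28x28x1 to 784.
    image = tf.cast(image, tf.float32) / 255.
    image = tf.reshape(image, (784,))
    return image, label

# Synthetic stand-in for one MNIST example (not real data): every pixel is 255.
fake_image = tf.constant(np.full((28, 28, 1), 255, dtype=np.uint8))
fake_label = tf.constant(7, dtype=tf.int64)

flat_image, same_label = preprocess_example(fake_image, fake_label)
print(flat_image.shape, flat_image.dtype)
print(float(tf.reduce_max(flat_image)))
```

The label passes through unchanged; only the image is cast, scaled, and reshaped.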

## Prepare the datasets for training

In [None]:
# Train Data
ds_train = train_dataset.map(
    preprocess_data, num_parallel_calls=tf.data.AUTOTUNE)
ds_train = ds_train.cache()
# Cache before shuffling: random transformations placed after cache()
# are re-applied on every epoch instead of being frozen in the cache.
ds_train = ds_train.shuffle(info.splits['train'].num_examples)
ds_train = ds_train.batch(128)
ds_train = ds_train.prefetch(tf.data.AUTOTUNE)

# Test Data
# The test dataset does not need tf.data.Dataset.shuffle.
# Caching is done after batching here: without shuffling, the batches are identical in every epoch.
ds_test = test_dataset.map(
    preprocess_data, num_parallel_calls=tf.data.AUTOTUNE)
ds_test = ds_test.batch(128)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.AUTOTUNE)
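The effect of this ordering can be checked without downloading anything. The sketch below rebuilds the training pipeline on a small synthetic dataset (300 random images, an arbitrary size chosen for illustration) and reads it twice. `scale_and_flatten`, `label_order`, and the random data are assumptions made for this example, not part of the notebook.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in for MNIST: 300 random uint8 images with integer labels.
num_examples = 300
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(num_examples, 28, 28, 1), dtype=np.uint8)
labels = rng.integers(0, 10, size=(num_examples,), dtype=np.int64)
raw_ds = tf.data.Dataset.from_tensor_slices((images, labels))

def scale_and_flatten(image, label):
    image = tf.cast(image, tf.float32) / 255.
    return tf.reshape(image, (784,)), label

pipeline = (
    raw_ds
    .map(scale_and_flatten, num_parallel_calls=tf.data.AUTOTUNE)
    .cache()                 # deterministic work is computed once
    .shuffle(num_examples)   # placed after cache(), so it runs again on each pass
    .batch(128)
    .prefetch(tf.data.AUTOTUNE)
)

# 300 examples in batches of 128 leave a smaller final batch.
batch_sizes = [int(x.shape[0]) for x, _ in pipeline]
print(batch_sizes)

def label_order(ds):
    # Concatenate the labels of all batches in the order they are produced.
    return np.concatenate([y.numpy() for _, y in ds])

epoch1 = label_order(pipeline)
epoch2 = label_order(pipeline)
print(sorted(epoch1.tolist()) == sorted(epoch2.tolist()))  # same examples
print(np.array_equal(epoch1, epoch2))                      # same order?
```

Both passes contain the same examples, while the order is expected to differ because `shuffle` reshuffles on each iteration by default. Note that the last batch is smaller than 128 unless `drop_remainder=True` is passed to `batch`.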

## Check data shapes/values

In [None]:
# Input dataset (one raw, unbatched example)
sample1 = train_dataset.take(1)
for img, label in sample1:
    print(img.shape)
    print(label)

(28, 28, 1)
tf.Tensor(4, shape=(), dtype=int64)


In [None]:
# Pre-processed dataset
sample2 = ds_train.take(1)
for img, label in sample2:
    print(img.shape)
    print(label)

(128, 784)
tf.Tensor(
[4 9 6 9 0 4 6 9 2 0 9 9 2 1 1 3 0 6 8 3 2 1 7 5 3 9 2 1 4 3 6 1 7 5 4 7 3
 2 7 1 5 4 6 1 9 0 4 2 6 3 9 0 3 9 8 5 8 5 4 5 7 5 8 7 1 5 0 0 3 2 1 1 1 9
 9 1 2 4 0 9 9 2 0 6 8 5 1 1 8 8 9 1 4 6 7 9 1 0 8 8 1 3 2 0 7 9 8 0 3 8 6
 6 4 5 1 7 5 3 6 0 4 5 3 7 2 4 4 0], shape=(128,), dtype=int64)
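Shapes and dtypes can also be read from a dataset without iterating over it, through its `element_spec` property. The sketch below uses a hypothetical all-zeros dataset that mimics the pre-processed shapes; the arrays and variable names are assumptions for illustration only.

```python
import numpy as np
import tensorflow as tf

# Hypothetical data shaped like the pre-processed MNIST examples.
images = np.zeros((256, 784), dtype=np.float32)
labels = np.zeros((256,), dtype=np.int64)
ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(128)

# element_spec describes one element (here, one batch) of the dataset.
image_spec, label_spec = ds.element_spec
print(image_spec)
print(label_spec)
```

The batch dimension is reported as `None` because `batch` without `drop_remainder=True` may produce a smaller final batch, so the static shape cannot fix it at 128.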


## Create a test model and train it

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])

In [None]:
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy')

In [None]:
# No batch_size argument is passed: the dataset is already batched (128 examples per batch)
model.fit(ds_train,
          epochs=10,
          validation_data=ds_test)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x78263d367a30>
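The `compile` call above sets only a loss, so the training history tracks loss values and nothing else. If accuracy is also of interest, one option is to pass `metrics=['accuracy']`. The sketch below is an assumption-laden illustration rather than the notebook's own run: it trains the same architecture on random synthetic data (so the accuracy itself is meaningless) purely to show which keys end up in the history and in the evaluation result.

```python
import numpy as np
import tensorflow as tf

# Random synthetic data shaped like the pre-processed MNIST batches (not real data).
rng = np.random.default_rng(0)
x = rng.random((256, 784), dtype=np.float32)
y = rng.integers(0, 10, size=(256,), dtype=np.int64)
ds = tf.data.Dataset.from_tensor_slices((x, y)).batch(128)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])  # added here; the notebook compiles with loss only

history = model.fit(ds, epochs=2, verbose=0)
results = model.evaluate(ds, verbose=0, return_dict=True)
print(sorted(history.history.keys()))
print(results)
```

With real data, passing `validation_data` to `fit` as in the notebook would additionally record `val_`-prefixed entries in `history.history`.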