# TFDS Hello World

In this notebook we will take a look at the simple Hello World scenario of TensorFlow Datasets (TFDS). We'll use TFDS to perform the extract, transform, and load processes for the MNIST dataset.

## Setup

We'll start by installing the required packages and then importing TensorFlow, TensorFlow Datasets, and Matplotlib.

In [1]:
%%bash
pip install --no-cache-dir -qU pip wheel
pip install --no-cache-dir -qU numpy pandas matplotlib seaborn scikit-learn
pip install --no-cache-dir -qU tensorflow tensorflow-datasets
pip check

No broken requirements found.


In [2]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import warnings
warnings.filterwarnings('ignore')

import numpy as np
np.random.seed(42)

import pandas as pd
import json

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt

import seaborn as sns
# Set style and font in one call; a later sns.set() would reset the style to its default
sns.set_theme(style='whitegrid', font='DejaVu Sans')

import tensorflow as tf
tf.keras.utils.set_random_seed(42)
tf.get_logger().setLevel('ERROR')

import tensorflow_datasets as tfds

print("\u2022 Using TensorFlow Version:", tf.__version__)

• Using TensorFlow Version: 2.11.0


## Extract - Transform - Load (ETL)

Now we'll run the **ETL** code. First, to perform the **Extract** process we use `tfds.load`. This handles everything from downloading the raw data to parsing and splitting it, giving us a `tf.data.Dataset`. Next, we perform the **Transform** process. In this simple example, the transform consists only of shuffling the dataset. Finally, we **Load** one record with the `take(1)` method. Each record consists of an image and its corresponding label. After loading the record, we plot the image and print its label.

In [None]:
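# General ETL template. The constants below are placeholder values.
NUM_SAMPLES = 60000  # shuffle buffer size (the full MNIST training split)
NUM_EPOCHS = 1
BATCH_SIZE = 32

# Extract: download (if needed) and construct a tf.data.Dataset
dataset = tfds.load(name="mnist", split=tfds.Split.TRAIN)
# Transform
dataset = dataset.shuffle(NUM_SAMPLES)  # buffer size
dataset = dataset.repeat(NUM_EPOCHS)
# Example transform (illustrative only): scale pixel values to [0, 1]
dataset = dataset.map(lambda x: (tf.cast(x["image"], tf.float32) / 255.0, x["label"]))
dataset = dataset.batch(BATCH_SIZE)
# Load
iterator = dataset.take(10)  # fetches 10 batches, because take() follows batch()
for images, labels in iterator:
    # Access the data and use it
    print(images.shape, labels.shape)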
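
The buffer size passed to `shuffle` matters more than it looks: a buffer smaller than the dataset only mixes elements locally. The sketch below is a pure-Python simulation of buffer-based shuffling (fill a buffer, emit a random element, refill from the input). It is not TensorFlow's implementation, and `buffered_shuffle` is a hypothetical helper written only for this illustration.

```python
import random


def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in pseudo-random order using a fixed-size buffer.

    Hypothetical helper that mimics the idea behind a shuffle buffer.
    buffer_size must be at least 1.
    """
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Still filling the buffer: nothing is emitted yet.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and replace it with the new item.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # Input exhausted: emit whatever is left, in random order.
    rng.shuffle(buffer)
    yield from buffer


data = list(range(10))
print(list(buffered_shuffle(data, buffer_size=1)))   # order is unchanged
print(list(buffered_shuffle(data, buffer_size=3)))   # only local mixing
print(list(buffered_shuffle(data, buffer_size=10)))  # full shuffle
```

With a buffer of 1 the order is unchanged, and an element can never appear more than `buffer_size - 1` positions earlier than its original position. This is one reason a buffer that covers the whole split is a common choice for small datasets such as MNIST.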

In [17]:
# Construct a tf.data.Dataset from MNIST
dataset = tfds.load(name="mnist", split=tfds.Split.TRAIN)
# Inspecting shapes and datatypes
print(dataset)
# Checking if the dataset is an instance of tf.data.Dataset
assert isinstance(dataset, tf.data.Dataset)

<PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>


In [10]:
print(tfds.list_builders()[:10])

['abstract_reasoning', 'accentdb', 'aeslc', 'aflw2k3d', 'ag_news_subset', 'ai2_arc', 'ai2_arc_with_ir', 'amazon_us_reviews', 'anli', 'answer_equivalence']


In [15]:
mnist, info = tfds.load(name="mnist", with_info=True)
print(info)
print("Data dir: ", info.data_dir)
print("Image features: ", info.features['image'])
print("Label features: ", info.features['label'])
print("Number of training examples ", info.splits['train'].num_examples)
print("Number of test examples ", info.splits['test'].num_examples)

tfds.core.DatasetInfo(
    name='mnist',
    full_name='mnist/3.0.1',
    description="""
    The MNIST database of handwritten digits.
    """,
    homepage='http://yann.lecun.com/exdb/mnist/',
    data_path='/home/meng/tensorflow_datasets/mnist/3.0.1',
    file_format=tfrecord,
    download_size=11.06 MiB,
    dataset_size=21.00 MiB,
    features=FeaturesDict({
        'image': Image(shape=(28, 28, 1), dtype=uint8),
        'label': ClassLabel(shape=(), dtype=int64, num_classes=10),
    }),
    supervised_keys=('image', 'label'),
    disable_shuffling=False,
    splits={
        'test': <SplitInfo num_examples=10000, num_shards=1>,
        'train': <SplitInfo num_examples=60000, num_shards=1>,
    },
    citation="""@article{lecun2010mnist,
      title={MNIST handwritten digit database},
      author={LeCun, Yann and Cortes, Corinna and Burges, CJ},
      journal={ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist},
      volume={2},
      year={2010}
    }""",
)
Data dir
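
The split sizes reported in `info.splits` are useful for sizing a pipeline. As a small worked example, the sketch below computes how many batches each split yields for a given batch size. The batch size of 32 and the helper `num_batches` are assumptions made for illustration; `drop_remainder` mirrors the argument of the same name on `tf.data.Dataset.batch`.

```python
import math

# Split sizes as reported in the DatasetInfo output above.
NUM_TRAIN = 60000
NUM_TEST = 10000
BATCH_SIZE = 32  # hypothetical value chosen for illustration


def num_batches(num_examples, batch_size, drop_remainder=False):
    """Number of batches produced when batching num_examples elements."""
    if drop_remainder:
        # A final partial batch is discarded.
        return num_examples // batch_size
    # A final partial batch is kept.
    return math.ceil(num_examples / batch_size)


print(num_batches(NUM_TRAIN, BATCH_SIZE))                       # 1875
print(num_batches(NUM_TEST, BATCH_SIZE))                        # 313
print(num_batches(NUM_TEST, BATCH_SIZE, drop_remainder=True))   # 312
```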

In [21]:
dataset = tfds.load(name="mnist", as_supervised=True)
# Inspecting the shapes of a single (image, label) example; the dataset is not batched
for image, label in dataset['train'].take(1):
    print(image.shape, label.shape)

(28, 28, 1) ()


In [None]:
ds = tfds.load(name="coco", split=tfds.Split("test2015"))
ds



Downloading and preparing dataset 37.57 GiB (download: 37.57 GiB, generated: Unknown size, total: 37.57 GiB) to /home/meng/tensorflow_datasets/coco/2014/1.1.0...


Dl Completed...: 0 url [00:00, ? url/s]

Dl Size...: 0 MiB [00:00, ? MiB/s]

Extraction completed...: 0 file [00:00, ? file/s]

In [None]:
mnist_builder = tfds.builder(name="mnist")
mnist_builder.download_and_prepare()
mnist_builder.as_dataset(split=tfds.Split.TRAIN)

In [None]:
# EXTRACT
dataset = tfds.load(name="mnist", split="train")
# TRANSFORM
# shuffle() returns a new dataset, so the result must be assigned
dataset = dataset.shuffle(100)

In [None]:
# LOAD
for data in dataset.take(1):
    image = data["image"].numpy().squeeze()
    label = data["label"].numpy()
    
    print("Label: {}".format(label))
    plt.imshow(image, cmap=plt.cm.binary)
    plt.show()
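
The call to `squeeze()` is there because each MNIST image has shape `(28, 28, 1)`, while Matplotlib's `imshow` documents `(M, N)` arrays for grayscale data. The sketch below shows the effect on a synthetic NumPy array that stands in for one image; no dataset download is needed, and the white square is made-up data for illustration.

```python
import numpy as np

# Synthetic stand-in for one MNIST image: height x width x channels, uint8.
image = np.zeros((28, 28, 1), dtype=np.uint8)
image[10:18, 10:18, 0] = 255  # draw a white square so the array is not all zeros

# squeeze() drops every axis of length 1, here the trailing channel axis.
squeezed = image.squeeze()
print(image.shape, "->", squeezed.shape)

# Squeezing only the last axis is safer if another axis could have length 1.
squeezed_last = np.squeeze(image, axis=-1)
print(squeezed_last.shape)
```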