<a href="https://colab.research.google.com/github/https-deeplearning-ai/tensorflow-2-public/blob/adding_C3/C3/W1/ungraded_labs/C3_W1_Lab_1_tfds_hello_world.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TFDS Data Pipelines

In [None]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

In this notebook we will take a look at a simple data pipeline built with TensorFlow Datasets (TFDS). We'll use TFDS to perform the extract, transform, and load (ETL) steps for the MNIST dataset.

## Setup

We'll start by installing and importing TensorFlow and TensorFlow Datasets, along with the plotting and data libraries used later (Matplotlib, seaborn, NumPy, and pandas).

In [1]:
%%bash
pip install -qU pip wheel
pip install -qU tensorflow-gpu tensorflow-datasets
pip install -qU numpy pandas matplotlib seaborn
pip check

No broken requirements found.


In [1]:
try:
    %tensorflow_version 2.x
except Exception:
    pass

In [2]:
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt

import seaborn as sns
# Set style and font in one call; a later sns.set() would reset the style
sns.set(style='whitegrid', font='DejaVu Sans')

import numpy as np
import pandas as pd

import tensorflow as tf
import tensorflow_datasets as tfds

print("\u2022 Using TensorFlow Version:", tf.__version__)

• Using TensorFlow Version: 2.8.0


In [3]:
# Restrict TensorFlow to the first GPU and enable memory growth on it
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        # Make only the first GPU visible to TensorFlow
        tf.config.set_visible_devices(gpus[0], 'GPU')
        # Allocate GPU memory on demand instead of reserving it all up front
        tf.config.experimental.set_memory_growth(gpus[0], True)
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Visible devices and memory growth must be set before GPUs have been initialized
        print(e)

1 Physical GPUs, 1 Logical GPUs



## Configuring TensorBoard

In [4]:
# Load the TensorBoard notebook extension.
%load_ext tensorboard

In [5]:
# Clear any logs from previous runs.
!rm -rf ./logs/

## Extract - Transform - Load (ETL)

Now we'll run the **ETL** code. First, to perform the **Extract** process we use `tfds.load`. This handles everything from downloading the raw data to parsing and splitting it, giving us a dataset. Next, we perform the **Transform** process. In this simple example, the transform consists of scaling the pixel values and then shuffling and batching the dataset. Finally, we **Load** a few records with the `take` method. Each record consists of an image and its corresponding label. After loading each record we plot the image and print its label.
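
The three stages can be pictured without TensorFlow at all. The following is a minimal sketch in plain Python, not the TFDS implementation: `extract`, `transform`, and `load` are hypothetical helper names, and the records are made-up stand-ins that only mirror the `'image'`/`'label'` keys of MNIST.

```python
import itertools
import random

def extract(num_records=20):
    """Extract: yield raw records (a stand-in for what a dataset loader returns)."""
    for i in range(num_records):
        # Hypothetical record layout mirroring MNIST's 'image' and 'label' keys.
        yield {"image": [i] * 4, "label": i % 10}

def transform(records, seed=0):
    """Transform: shuffle the records (the only transform in this sketch)."""
    records = list(records)
    random.Random(seed).shuffle(records)
    return iter(records)

def load(records, n):
    """Load: take the first n records, similar in spirit to Dataset.take(n)."""
    return list(itertools.islice(records, n))

samples = load(transform(extract()), n=3)
for record in samples:
    print("Label:", record["label"])
```

Each stage only consumes the output of the previous one, which is the same chaining pattern the `tf.data` calls in the next cells follow.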

In [6]:
# See available datasets
tfds.list_builders()



['abstract_reasoning',
 'accentdb',
 'aeslc',
 'aflw2k3d',
 'ag_news_subset',
 'ai2_arc',
 'ai2_arc_with_ir',
 'amazon_us_reviews',
 'anli',
 'arc',
 'asset',
 'assin2',
 'bair_robot_pushing_small',
 'bccd',
 'beans',
 'bee_dataset',
 'big_patent',
 'bigearthnet',
 'billsum',
 'binarized_mnist',
 'binary_alpha_digits',
 'blimp',
 'booksum',
 'bool_q',
 'c4',
 'caltech101',
 'caltech_birds2010',
 'caltech_birds2011',
 'cardiotox',
 'cars196',
 'cassava',
 'cats_vs_dogs',
 'celeb_a',
 'celeb_a_hq',
 'cfq',
 'cherry_blossoms',
 'chexpert',
 'cifar10',
 'cifar100',
 'cifar10_1',
 'cifar10_corrupted',
 'citrus_leaves',
 'cityscapes',
 'civil_comments',
 'clevr',
 'clic',
 'clinc_oos',
 'cmaterdb',
 'cnn_dailymail',
 'coco',
 'coco_captions',
 'coil100',
 'colorectal_histology',
 'colorectal_histology_large',
 'common_voice',
 'coqa',
 'cos_e',
 'cosmos_qa',
 'covid19',
 'covid19sum',
 'crema_d',
 'cs_restaurants',
 'curated_breast_imaging_ddsm',
 'cycle_gan',
 'd4rl_adroit_door',
 'd4rl_adr

In [7]:
# Pick dataset
mnist_builder = tfds.builder('mnist')
# Download
mnist_builder.download_and_prepare()
# Extract dataset
mnist_builder.as_dataset(split=tfds.Split.TRAIN)

Downloading and preparing dataset 11.06 MiB (download: 11.06 MiB, generated: 21.00 MiB, total: 32.06 MiB) to /home/meng/tensorflow_datasets/mnist/3.0.1...

Dataset mnist downloaded and prepared to /home/meng/tensorflow_datasets/mnist/3.0.1. Subsequent calls will reuse this data.


<PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>

In [8]:
dataset, info = tfds.load(name='mnist', with_info=True)
info

tfds.core.DatasetInfo(
    name='mnist',
    full_name='mnist/3.0.1',
    description="""
    The MNIST database of handwritten digits.
    """,
    homepage='http://yann.lecun.com/exdb/mnist/',
    data_path='/home/meng/tensorflow_datasets/mnist/3.0.1',
    download_size=11.06 MiB,
    dataset_size=21.00 MiB,
    features=FeaturesDict({
        'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    }),
    supervised_keys=('image', 'label'),
    disable_shuffling=False,
    splits={
        'test': <SplitInfo num_examples=10000, num_shards=1>,
        'train': <SplitInfo num_examples=60000, num_shards=1>,
    },
    citation="""@article{lecun2010mnist,
      title={MNIST handwritten digit database},
      author={LeCun, Yann and Cortes, Corinna and Burges, CJ},
      journal={ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist},
      volume={2},
      year={2010}
    }""",
)

In [12]:
num_train_examples = info.splits['train'].num_examples
num_test_examples = info.splits['test'].num_examples
num_classes = info.features['label'].num_classes
input_img_size_original = info.features['image'].shape[0]
input_img_shape_original = info.features['image'].shape

print("Number of train examples:", num_train_examples)
print("Number of test examples:", num_test_examples)
print("Number of label classes:", num_classes)
print("Input image size (original):", input_img_size_original)
print("Input image shape (original):", input_img_shape_original)

Number of train examples: 60000
Number of test examples: 10000
Number of label classes: 10
Input image size (original): 28
Input image shape (original): (28, 28, 1)


In [10]:
# Construct a tf.data.Dataset from MNIST
dataset = tfds.load(name='mnist')
# Inspecting shapes and datatypes
dataset

{'test': <PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>,
 'train': <PrefetchDataset element_spec={'image': TensorSpec(shape=(28, 28, 1), dtype=tf.uint8, name=None), 'label': TensorSpec(shape=(), dtype=tf.int64, name=None)}>}

In [None]:
# EXTRACT
# Construct a tf.data.Dataset by downloading and extracting
dataset = tfds.load(name='mnist', split='test')
# Checking if the dataset is an instance of tf.data.Dataset
assert isinstance(dataset, tf.data.Dataset)

# Target image size; MNIST images are already 28x28
INPUT_IMG_SIZE = input_img_size_original

def scale(elem):
    # Convert pixel values to float32
    image = tf.cast(elem['image'], tf.float32)
    # Rescale pixel values to the [0, 1] range
    image /= 255.0
    # Make sure the image has the expected size
    image = tf.image.resize(image, [INPUT_IMG_SIZE, INPUT_IMG_SIZE])
    return {'image': image, 'label': elem['label']}

# TRANSFORM
dataset = dataset.map(scale)
dataset = dataset.shuffle(100) # shuffle buffer size
dataset = dataset.repeat(1) # number of epochs
dataset = dataset.batch(1) # batch size

# LOAD
iterator = dataset.take(10) # To fetch 10 samples from the dataset
for data in iterator:
    image = data['image'].numpy().squeeze()
    label = data['label'].numpy()
    
    print("Label: {}".format(label))
    plt.imshow(image, cmap=plt.cm.binary)
    plt.show()
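
The argument passed to `shuffle` above is a buffer size, not a number of samples to draw. As a rough illustration of what a buffered shuffle does, here is a simplified plain-Python model; `buffered_shuffle` is a hypothetical helper written for this sketch and is not the `tf.data` implementation.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot.
        idx = rng.randrange(buffer_size)
        yield buffer[idx]
        buffer[idx] = item
    # The input is exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(1000), buffer_size=100))
print(shuffled[:10])

# With a buffer of one element the order is unchanged.
unshuffled = list(buffered_shuffle(range(5), buffer_size=1))
print(unshuffled)
```

In this model an element can only be emitted once it has entered the buffer, so a small buffer mixes elements locally. A buffer at least as large as the dataset is needed for a uniform shuffle.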

In [None]:
dataset = tfds.load(name='mnist', as_supervised=True)
# Inspecting shapes of a batch
# tuples of data and label
# A few batches are enough to inspect shapes; without take() this prints 60,000 lines
iterator = dataset['train'].batch(1).take(5)
for image, label in iterator:
    print(image.shape, label.shape)
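
To see how the batch size affects the shapes printed above, the following plain-Python sketch computes the batch shapes for a dataset of a given length. `batch_shapes` is a hypothetical helper, and the sketch assumes batching in order with a smaller final batch when the dataset size is not a multiple of the batch size.

```python
def batch_shapes(num_examples, batch_size, example_shape):
    """Yield the shape of each batch when batching examples in order."""
    for start in range(0, num_examples, batch_size):
        # The last batch may hold fewer examples than batch_size.
        size = min(batch_size, num_examples - start)
        yield (size,) + tuple(example_shape)

# 60,000 MNIST training images of shape (28, 28, 1)
shapes_32 = list(batch_shapes(60000, 32, (28, 28, 1)))
shapes_64 = list(batch_shapes(60000, 64, (28, 28, 1)))

print(len(shapes_32), shapes_32[0], shapes_32[-1])
print(len(shapes_64), shapes_64[0], shapes_64[-1])
```

With a batch size of 64 the final batch holds only 32 images, because 60,000 is not a multiple of 64.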

## Fashion MNIST

In [None]:
(x_train, y_train), (x_test, y_test) = \
    tf.keras.datasets.fashion_mnist.load_data()

print("x_train shape:", x_train.shape)
print("y_train shape:", y_train.shape)

In [None]:
(x_train, y_train), (x_valid, y_valid), (x_test, y_test) = \
    tfds.as_numpy(tfds.load(name='fashion_mnist', split=['train[:-10%]', 'train[-10%:]', 'test'],
        batch_size=-1, as_supervised=True))
x_train = x_train / 255.0
x_valid = x_valid / 255.0
x_test = x_test / 255.0
print("x_train shape:", x_train.shape)
print("x_valid shape:", x_valid.shape)
print("x_test shape:", x_test.shape)

inputs = tf.keras.Input(shape=(28, 28, 1))
h = tf.keras.layers.Flatten()(inputs)
h = tf.keras.layers.Dense(128, activation=tf.nn.relu)(h)
h = tf.keras.layers.Dropout(0.2)(h)
outputs = tf.keras.layers.Dense(10, activation=tf.nn.softmax)(h)

model = tf.keras.Model(inputs=inputs, outputs=outputs)
model.summary()

model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(),
    optimizer=tf.keras.optimizers.Adam(),
    metrics=[tf.keras.metrics.SparseCategoricalAccuracy()])

history = model.fit(x_train, y_train, validation_data=(x_valid, y_valid),
    epochs=5, batch_size=64, verbose=0)
history_df = pd.DataFrame(history.history, index=history.epoch)
_, axs = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
for metric, ax in zip(['loss', 'sparse_categorical_accuracy'], axs):
    sns.lineplot(ax=ax, data=history_df, x=history_df.index,
        y=metric, label='Train')
    sns.lineplot(ax=ax, data=history_df, x=history_df.index,
        y='val_' + metric, label='Valid')
    ax.set_xlabel('epoch')
    ax.set_title(metric)
plt.show()

result = model.evaluate(x_test, y_test, 
    batch_size=128, verbose=0)
print(f"Test loss: {result[0]:.4f} \n"
    f"Test accuracy: {result[1]:.4f}")

predict = model.predict(x_test[:3])
print(f"Predict: {np.argmax(predict, axis=1)}\n"
    f"Confidence: {100*np.max(predict, axis=1)}")
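
The split strings `'train[:-10%]'` and `'train[-10%:]'` used in the cell above divide the 60,000 training examples into a training part and a validation part. The arithmetic can be sketched in plain Python; `percent_slice` is a hypothetical helper, and its floor rounding is an assumption that only matters when the percentage does not divide the split size evenly (TFDS's own rounding may differ in that case).

```python
def percent_slice(num_examples, start_pct=None, stop_pct=None):
    """Return (start, stop) indices for a 'split[start%:stop%]' style slice."""
    def to_index(pct, default):
        if pct is None:
            return default
        # Assumption: round down when the percentage is not an exact count.
        boundary = num_examples * abs(pct) // 100
        # Negative percentages count from the end, like Python slices.
        return boundary if pct >= 0 else num_examples - boundary
    return to_index(start_pct, 0), to_index(stop_pct, num_examples)

train_range = percent_slice(60000, stop_pct=-10)   # 'train[:-10%]'
valid_range = percent_slice(60000, start_pct=-10)  # 'train[-10%:]'

print("train:", train_range, "size:", train_range[1] - train_range[0])
print("valid:", valid_range, "size:", valid_range[1] - valid_range[0])
```

The two ranges do not overlap, so the validation examples are never seen during training.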

## Horses or Humans

In [None]:
_, info = tfds.load(name='horses_or_humans', with_info=True)
info
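
The convolutional model in this section takes 300x300x3 inputs and stacks five blocks of `Conv2D(3x3)` followed by `MaxPooling2D(2, 2)`. The sketch below traces the spatial size through those blocks in plain Python, assuming the Keras defaults of `'valid'` padding and stride 1 for the convolutions; `conv2d_valid` and `max_pool` are hypothetical helper names.

```python
def conv2d_valid(size, kernel=3):
    """Output size of a stride-1 convolution with 'valid' padding."""
    return size - kernel + 1

def max_pool(size, pool=2):
    """Output size of non-overlapping pooling with 'valid' padding."""
    return size // pool

size = 300
filters_per_block = [16, 32, 64, 64, 64]
for filters in filters_per_block:
    size = max_pool(conv2d_valid(size))
    print(f"after Conv2D({filters}) + MaxPooling2D: {size}x{size}x{filters}")

flattened = size * size * filters_per_block[-1]
dense_params = flattened * 512 + 512  # weights plus biases of Dense(512)
print("Flatten output:", flattened)
print("Dense(512) parameters:", dense_params)
```

Under these assumptions most of the model's parameters sit in the first dense layer, which is what `model.summary()` should also report.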

In [None]:
inputs = tf.keras.Input(shape=(300, 300, 3))
h = tf.keras.layers.Conv2D(16, (3, 3), activation=tf.nn.relu)(inputs)
h = tf.keras.layers.MaxPooling2D(2, 2)(h)
h = tf.keras.layers.Conv2D(32, (3, 3), activation=tf.nn.relu)(h)
h = tf.keras.layers.MaxPooling2D(2, 2)(h)
h = tf.keras.layers.Conv2D(64, (3, 3), activation=tf.nn.relu)(h)
h = tf.keras.layers.MaxPooling2D(2, 2)(h)
h = tf.keras.layers.Conv2D(64, (3, 3), activation=tf.nn.relu)(h)
h = tf.keras.layers.MaxPooling2D(2, 2)(h)
h = tf.keras.layers.Conv2D(64, (3, 3), activation=tf.nn.relu)(h)
h = tf.keras.layers.MaxPooling2D(2, 2)(h)
h = tf.keras.layers.Flatten()(h)
h = tf.keras.layers.Dense(512, activation=tf.nn.relu)(h)
outputs = tf.keras.layers.Dense(1, activation=tf.nn.sigmoid)(h)

model = tf.keras.Model(inputs=inputs, outputs=outputs)
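model.compile(optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.BinaryCrossentropy(),
    # BinaryAccuracy thresholds the sigmoid output at 0.5; Accuracy() would
    # require predictions to equal the labels exactly
    metrics=[tf.keras.metrics.BinaryAccuracy(name='accuracy')])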
model.summary()

train_data = tfds.load(name='horses_or_humans', split='train', 
    as_supervised=True)
valid_data = tfds.load(name='horses_or_humans', split='test',
    as_supervised=True)

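# Scale pixel values from [0, 255] to [0, 1], as in the earlier examples
def normalize(image, label):
    return tf.cast(image, tf.float32) / 255.0, label

train_batch = train_data.map(normalize).shuffle(256).batch(32)
valid_batch = valid_data.map(normalize).batch(32)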

history = model.fit(train_batch, validation_data=valid_batch,
    epochs=10, verbose=1)

history_df = pd.DataFrame(history.history, index=history.epoch)
_, axs = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
for metric, ax in zip(['loss', 'accuracy'], axs):
    sns.lineplot(ax=ax, data=history_df, x=history_df.index, 
        y=metric, label='Train')
    sns.lineplot(ax=ax, data=history_df, x=history_df.index,
        y='val_' + metric, label='Valid')
    ax.set_xlabel('epoch')
    ax.set_title(metric)
plt.show()

result = model.evaluate(valid_batch, verbose=1)
print(f"Test loss: {result[0]:.4f} \n"
    f"Test accuracy: {result[1]:.4f}")
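
A note on measuring accuracy for a single sigmoid output: `tf.keras.metrics.Accuracy` counts a prediction as correct only when it equals the label exactly, while `tf.keras.metrics.BinaryAccuracy` first thresholds the predicted probability (0.5 by default). The sketch below reimplements both ideas in plain Python on made-up values; the function names and the sample data are hypothetical.

```python
def exact_match_accuracy(y_true, y_pred):
    """Fraction of predictions that are exactly equal to the label."""
    return sum(p == t for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_accuracy(y_true, y_pred, threshold=0.5):
    """Fraction of thresholded predictions that match the label."""
    hits = sum((p > threshold) == bool(t) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

labels = [0, 1, 1, 0]
probabilities = [0.1, 0.9, 0.6, 0.7]  # made-up sigmoid outputs

exact = exact_match_accuracy(labels, probabilities)
thresholded = binary_accuracy(labels, probabilities)
print("Exact-match accuracy:", exact)
print("Thresholded accuracy:", thresholded)
```

Because sigmoid outputs are rarely exactly 0 or 1, the exact-match figure stays near zero even when the classifier is mostly right, so the thresholded figure is the one to read for binary classification.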