<a href="https://colab.research.google.com/github/Starksood/Experimental_Conundrums/blob/main/Chapter_4_Tensorflow_Datasets.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

The goal behind TensorFlow Datasets (TFDS) is to expose datasets in a way that’s easy to consume, where all
the preprocessing steps of acquiring the data and getting it into TensorFlow-friendly
APIs are done for you.

Keras’s built-in datasets already follow this pattern; loading Fashion MNIST takes only a couple of lines:

import tensorflow as tf

data = tf.keras.datasets.fashion_mnist

(training_images, training_labels), (test_images, test_labels) = data.load_data()

The list of available datasets is growing all the
time, in categories such as:

Audio

Speech and music data

Image

From simple learning datasets like Horses or Humans up to advanced research
datasets for uses such as diabetic retinopathy detection

Object detection

COCO, Open Images, and more

Structured data

Titanic survivors, Amazon reviews, and more

Summarization

News from CNN and the Daily Mail, scientific papers, wikiHow, and more

Text

IMDb reviews, natural language questions, and more

Translate

Various translation training datasets

Video 

Moving MNIST, Starcraft, and more

TensorFlow Datasets is installed separately from TensorFlow, so be
sure to install it before trying out any samples. If you are using
Google Colab, it’s preinstalled.
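
As a small convenience, the check below is one way to confirm the package is present before running the samples. It is only a sketch: `is_installed` is a hypothetical helper written for this note, not part of TFDS, and it relies solely on the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Print a reminder rather than failing later with an ImportError.
if not is_installed("tensorflow_datasets"):
    print("tensorflow_datasets is missing; run: pip install tensorflow-datasets")
```

Because the helper never imports the module, it stays fast even for large packages.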

In [1]:
pip install tensorflow-datasets




In [2]:
import tensorflow as tf
import tensorflow_datasets as tfds
mnist_data = tfds.load("fashion_mnist")
for item in mnist_data:
  print(item)

Downloading and preparing dataset fashion_mnist/3.0.1 (download: 29.45 MiB, generated: 36.42 MiB, total: 65.87 MiB) to /root/tensorflow_datasets/fashion_mnist/3.0.1...
Dataset fashion_mnist downloaded and prepared to /root/tensorflow_datasets/fashion_mnist/3.0.1. Subsequent calls will reuse this data.
test
train


In [4]:
mnist_train = tfds.load(name="fashion_mnist", split="train")
assert isinstance(mnist_train, tf.data.Dataset)
print(type(mnist_train))


<class 'tensorflow.python.data.ops.dataset_ops.PrefetchDataset'>


In [5]:
for item in mnist_train.take(1):
  print(type(item))
  print(item.keys())

<class 'dict'>
dict_keys(['image', 'label'])


In [6]:
for item in mnist_train.take(1):
  print(type(item))
  print(item.keys())
  print(item['image'])
  print(item['label'])
  

<class 'dict'>
dict_keys(['image', 'label'])
tf.Tensor(
[[[  0]
  [  0]
  [  0]
  [  0]
  [  0]
  [  0]
  [  0]
  [  0]
  [  0]
  [ 18]
  [ 77]
  [227]
  [227]
  [208]
  [210]
  [225]
  [216]
  [ 85]
  [ 32]
  [  0]
  [  0]
  [  0]
  [  0]
  [  0]
  [  0]
  [  0]
  [  0]
  [  0]]

 [[  0]
  [  0]
  [  0]
  [  0]
  [  0]
  [  0]
  [  0]
  [ 61]
  [100]
  [ 97]
  [ 80]
  [ 57]
  [117]
  [227]
  [238]
  [115]
  [ 49]
  [ 78]
  [106]
  [108]
  [ 71]
  [  0]
  [  0]
  [  0]
  [  0]
  [  0]
  [  0]
  [  0]]

 [[  0]
  [  0]
  [  0]
  [  0]
  [  0]
  [  0]
  [ 81]
  [105]
  [ 80]
  [ 69]
  [ 72]
  [ 64]
  [ 44]
  [ 21]
  [ 13]
  [ 44]
  [ 69]
  [ 75]
  [ 75]
  [ 80]
  [114]
  [ 80]
  [  0]
  [  0]
  [  0]
  [  0]
  [  0]
  [  0]]

 [[  0]
  [  0]
  [  0]
  [  0]
  [  0]
  [ 26]
  [ 92]
  [ 69]
  [ 68]
  [ 75]
  [ 75]
  [ 71]
  [ 74]
  [ 83]
  [ 75]
  [ 77]
  [ 78]
  [ 74]
  [ 74]
  [ 83]
  [ 77]
  [108]
  [ 34]
  [  0]
  [  0]
  [  0]
  [  0]
  [  0]]

 [[  0]
  [  0]
  [  0]
  [  0]
  [  0]


In [8]:
mnist_test, info = tfds.load(name="fashion_mnist", with_info=True)
print(info)

tfds.core.DatasetInfo(
    name='fashion_mnist',
    version=3.0.1,
    description='Fashion-MNIST is a dataset of Zalando's article images consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28x28 grayscale image, associated with a label from 10 classes.',
    homepage='https://github.com/zalandoresearch/fashion-mnist',
    features=FeaturesDict({
        'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    }),
    total_num_examples=70000,
    splits={
        'test': 10000,
        'train': 60000,
    },
    supervised_keys=('image', 'label'),
    citation="""@article{DBLP:journals/corr/abs-1708-07747,
      author    = {Han Xiao and
                   Kashif Rasul and
                   Roland Vollgraf},
      title     = {Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning
                   Algorithms},
      journal   = {CoRR},
      volume

In [14]:
# Simply call tfds.load, passing it the splits that you want (here train and test).
# batch_size=-1 returns each split as one full tensor, and tfds.as_numpy converts
# the result to NumPy arrays; model.fit then batches and shuffles the data itself.

import tensorflow as tf
import tensorflow_datasets as tfds

# The right-hand side is wrapped in the as_numpy(...) call so it can span lines.
(training_images, training_labels), (test_images, test_labels) = tfds.as_numpy(
    tfds.load('fashion_mnist', split=['train', 'test'],
              batch_size=-1, as_supervised=True))

# Scale pixel values from [0, 255] to [0, 1]
training_images = training_images / 255.0
test_images = test_images / 255.0

model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
    tf.keras.layers.Dense(128, activation=tf.nn.relu),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(training_images, training_labels, epochs=5)

In [15]:
pip install tensorflow_addons

Collecting tensorflow_addons
  Downloading tensorflow_addons-0.15.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.1 MB)

In [16]:
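import tensorflow_addons as tfa

def augmentimages(image, label):
    # Convert to float and scale pixel values to [0, 1]
    image = tf.cast(image, tf.float32)
    image = (image/255)
    image = tf.image.random_flip_left_right(image)
    # Note: tfa.image.rotate takes the angle in radians; use math.radians(40)
    # instead if a 40-degree rotation is intended.
    image = tfa.image.rotate(image, 40, interpolation='NEAREST')
    return image, label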

In [None]:
For example, if you want the first 10,000 records of train to be your training data,
you can omit <start> and just call for train[:10000] (a useful mnemonic is to read
the leading colon as “the first,” so this would read “train the first 10,000 records”):

data = tfds.load('cats_vs_dogs', split='train[:10000]', as_supervised=True)

You can also use % to specify the split. For example, if you want the first 20% of the
records to be used for training, you could use :20% like this:

data = tfds.load('cats_vs_dogs', split='train[:20%]', as_supervised=True)

You could even get a little crazy and combine splits. That is, if you want your training
data to be a combination of the first and last thousand records, you could do the following
(where -1000: means “the last 1,000 records” and :1000 means “the first 1,000
records”):

data = tfds.load('cats_vs_dogs', split='train[-1000:]+train[:1000]',
                 as_supervised=True)

The Dogs vs. Cats dataset doesn’t have fixed training, test, and validation splits, but,
with TFDS, creating your own is simple. Suppose you want the split to be 80%, 10%,
10%. You could create the three sets like this:

train_data = tfds.load('cats_vs_dogs', split='train[:80%]',
                       as_supervised=True)
validation_data = tfds.load('cats_vs_dogs', split='train[80%:90%]',
                            as_supervised=True)
test_data = tfds.load('cats_vs_dogs', split='train[-10%:]',
                      as_supervised=True)

Once you have them, you can use them as you would any named split.
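
To check that slices like these do not overlap, it can help to work out the record ranges by hand. The sketch below does that arithmetic in plain Python. `slice_bounds` is a hypothetical helper written for this note, not a TFDS function, and rounding percentages to the nearest record is an assumption; TFDS applies its own rounding rules, so boundaries may differ by a record on real datasets.

```python
def slice_bounds(spec, n):
    """Map a slice spec such as ':80%', '80%:90%' or '-1000:' to (start, stop)."""

    def to_index(token, default):
        if token == "":
            return default
        if token.endswith("%"):
            # Percentages become record counts; nearest-record rounding is an assumption.
            value = round(n * int(token[:-1]) / 100)
        else:
            value = int(token)
        # Negative values count back from the end, as in Python slicing.
        if value < 0:
            return max(0, n + value)
        return min(value, n)

    start_token, stop_token = spec.split(":")
    return to_index(start_token, 0), to_index(stop_token, n)


n = 1000  # illustrative record count, not the real size of any dataset
for spec in (":80%", "80%:90%", "-10%:"):
    print(spec, slice_bounds(spec, n))
```

With 1,000 records the three specs cover (0, 800), (800, 900), and (900, 1000), so the 80/10/10 split is disjoint and uses every record.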

In [None]:
If we already know the features, we can create a feature description and use it to parse
the data. Here’s the code:

# Create a description of the features
feature_description = {
    'image': tf.io.FixedLenFeature([], dtype=tf.string),
    'label': tf.io.FixedLenFeature([], dtype=tf.int64),
}

def _parse_function(example_proto):
    # Parse the input `tf.Example` proto using the dictionary above
    return tf.io.parse_single_example(example_proto, feature_description)

# raw_dataset is assumed to be a tf.data.TFRecordDataset created earlier
parsed_dataset = raw_dataset.map(_parse_function)
for parsed_record in parsed_dataset.take(1):
    print(parsed_record)
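
The parsing step can be tried without any TFRecord file on disk. The sketch below builds a single serialized `tf.train.Example` in memory and parses it with the same kind of feature description. The byte string and the label value are made-up placeholders, and the in-memory dataset is only a stand-in for a real `TFRecordDataset`.

```python
import tensorflow as tf

# Build one serialized tf.train.Example holding the same two features.
# The byte string and label value are placeholders, not real image data.
example = tf.train.Example(features=tf.train.Features(feature={
    "image": tf.train.Feature(
        bytes_list=tf.train.BytesList(value=[b"fake-image-bytes"])),
    "label": tf.train.Feature(
        int64_list=tf.train.Int64List(value=[7])),
}))
serialized = example.SerializeToString()

feature_description = {
    "image": tf.io.FixedLenFeature([], dtype=tf.string),
    "label": tf.io.FixedLenFeature([], dtype=tf.int64),
}

# A dataset of serialized protos stands in for a TFRecordDataset read from disk.
raw_dataset = tf.data.Dataset.from_tensor_slices([serialized])
parsed_dataset = raw_dataset.map(
    lambda proto: tf.io.parse_single_example(proto, feature_description))

# Each parsed record is a dict keyed by feature name.
parsed_record = next(iter(parsed_dataset))
print(parsed_record["label"].numpy())
print(parsed_record["image"].numpy())
```

With real data, the `image` bytes would normally be decoded next, for example with `tf.io.decode_jpeg` or `tf.io.decode_png`, depending on how the records were written.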

In [None]:
Consider the full code to train the Horses or Humans classifier, shown here. I’ve
added comments to show where the Extract, Transform, and Load phases take place:

import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_addons as tfa

# MODEL DEFINITION START #
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(16, (3,3), activation='relu',
                           input_shape=(300, 300, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(32, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2,2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='Adam', loss='binary_crossentropy',
              metrics=['accuracy'])
# MODEL DEFINITION END #

# EXTRACT PHASE START #
data = tfds.load('horses_or_humans', split='train', as_supervised=True)
val_data = tfds.load('horses_or_humans', split='test', as_supervised=True)
# EXTRACT PHASE END #

# TRANSFORM PHASE START #
def augmentimages(image, label):
    image = tf.cast(image, tf.float32)
    image = (image/255)
    image = tf.image.random_flip_left_right(image)
    # Note: tfa.image.rotate takes the angle in radians
    image = tfa.image.rotate(image, 40, interpolation='NEAREST')
    return image, label

train = data.map(augmentimages)
train_batches = train.shuffle(100).batch(32)
validation_batches = val_data.batch(32)
# TRANSFORM PHASE END #

# LOAD PHASE START #
history = model.fit(train_batches, epochs=10,
                    validation_data=validation_batches, validation_steps=1)
# LOAD PHASE END #

Using this process can make your data pipelines less susceptible to changes in the
data and the underlying schema.
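
The three phases can also be seen in isolation, without a download or a model. The following is a minimal sketch on synthetic data: the random 8x8 images and 0/1 labels are invented stand-ins for a real dataset, and `augment` is a simplified version of the function above that leaves out the rotation so it does not need TensorFlow Addons.

```python
import numpy as np
import tensorflow as tf

# EXTRACT: a synthetic stand-in for a downloaded dataset
# (random pixels and random 0/1 labels, not real images).
rng = np.random.default_rng(0)
images = rng.integers(0, 256, size=(64, 8, 8, 3), dtype=np.uint8)
labels = rng.integers(0, 2, size=(64,), dtype=np.int64)
data = tf.data.Dataset.from_tensor_slices((images, labels))


# TRANSFORM: scale to [0, 1] and randomly flip, then shuffle and batch.
def augment(image, label):
    image = tf.cast(image, tf.float32) / 255.0
    image = tf.image.random_flip_left_right(image)
    return image, label


train_batches = data.map(augment).shuffle(64).batch(32)

# LOAD: hand the batches to whatever consumes them; here we only inspect one.
first_images, first_labels = next(iter(train_batches))
print(first_images.shape, first_images.dtype)
```

Because the model only ever sees `train_batches`, the Extract step can be swapped for `tfds.load` or a TFRecord reader without touching the Transform or Load code.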

In [None]:
The mapping step can be made to work in parallel, which is done when we call the
mapping function. Here’s how to do that:

import multiprocessing

# train_dataset and read_tfrecord are assumed to be defined earlier
cores = multiprocessing.cpu_count()
print(cores)
train_dataset = train_dataset.map(read_tfrecord, num_parallel_calls=cores)
train_dataset = train_dataset.cache()

First, if you don’t want to autotune, you can use the multiprocessing library to get a
count of your CPUs. Then, when you call the mapping function, you just pass this as
the number of parallel calls that you want to make. It’s really as simple as that.

The cache method will cache the dataset in memory. If you have a lot of RAM available,
this is a really useful speedup. Trying this in Colab with Dogs vs. Cats will likely
crash your VM because the dataset does not fit in RAM. After that, if available, the
Colab infrastructure will give you a new, higher-RAM machine.

Loading and training can also be parallelized. As well as shuffling and batching the
data, you can prefetch batches so that data preparation overlaps with training;
AUTOTUNE lets tf.data pick the buffer size at runtime. Here’s the code:

train_dataset = train_dataset.shuffle(1024).batch(32)
train_dataset = train_dataset.prefetch(tf.data.experimental.AUTOTUNE)

Once your training set is all parallelized, you can train the model as before:

model.fit(train_dataset, epochs=10, verbose=1)

When I tried this in Google Colab, I found that this extra code to parallelize the ETL
process reduced the training time to about 40 seconds per epoch, as opposed to 75
seconds without it. These simple changes cut my training time almost in half!
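
The same chain of calls can be exercised on a toy dataset to confirm that it runs end to end. This is only a sketch: `scale` is a hypothetical stand-in for `read_tfrecord`, the 256 integers stand in for real records, and no timing claims are made for it.

```python
import multiprocessing

import tensorflow as tf

cores = multiprocessing.cpu_count()


def scale(x):
    # Hypothetical stand-in for read_tfrecord: any per-element function works here.
    return tf.cast(x, tf.float32) / 255.0


dataset = tf.data.Dataset.range(256)
# Run the mapping function on a fixed number of parallel workers.
dataset = dataset.map(scale, num_parallel_calls=cores)
# Keep the mapped elements in memory so later epochs skip the map step.
dataset = dataset.cache()
dataset = dataset.shuffle(256).batch(32)
# Prepare upcoming batches in the background while the current one is consumed.
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)

num_batches = sum(1 for _ in dataset)
print(num_batches)
```

Placing `cache()` before `shuffle()` means the cached elements are reshuffled on every epoch; caching after the shuffle would freeze a single ordering.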