<a href="https://colab.research.google.com/github/JAMES-YI/00_Tensorflow_Tutorials/blob/master/%E2%80%9Ctensorflow_datasets%E2%80%9D%E7%9A%84%E5%89%AF%E6%9C%AC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Code from www.Tensorflow.org

Modified by JYI, 04/13/2020

- TFDS provides a collection of ready-to-use datasets. It handles downloading and preparing the data and constructing a [`tf.data.Dataset`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset).
- Do not confuse [TFDS](https://www.tensorflow.org/datasets) (this library) with [tf.data](https://www.tensorflow.org/guide/data) (the TensorFlow API for building efficient data pipelines). TFDS is a high-level wrapper around `tf.data`.
- Please include the following citation when using `tensorflow-datasets` for a paper, in addition to any citation specific to the datasets used.

```
@misc{TFDS,
  title = { {TensorFlow Datasets}, A collection of ready-to-use datasets},
  howpublished = {\url{https://www.tensorflow.org/datasets}},
}
```


Copyright 2018 The TensorFlow Datasets Authors, Licensed under the Apache License, Version 2.0

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/datasets/overview"><img src="https://www.tensorflow.org/images/tf_logo_32px.png" />View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/datasets/blob/master/docs/overview.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/datasets/blob/master/docs/overview.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

## Eager execution

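- TensorFlow Datasets is compatible with both TensorFlow [Eager mode](https://www.tensorflow.org/guide/eager) and Graph mode.
- In Eager mode, operations run immediately and return concrete values. In Graph mode, operations are first assembled into a computation graph that TensorFlow executes later.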
- For this colab, we'll run in Eager mode, which is the default in TensorFlow 2.
- Each dataset is implemented as a [`tfds.core.DatasetBuilder`](https://www.tensorflow.org/datasets/api_docs/python/tfds/core/DatasetBuilder) and you can list all available builders with `tfds.list_builders()`.
- You can see all the datasets with additional documentation on the [datasets documentation page](https://www.tensorflow.org/datasets/catalog/overview).
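
The difference between the two modes can be illustrated with a minimal sketch that uses only core TensorFlow. The function name `graph_sum` and the list `trace_log` are hypothetical names introduced for this illustration; they are not part of TFDS.

```python
import tensorflow as tf

x = tf.constant([1.0, 2.0, 3.0])

# Eager mode: the operation runs immediately and returns a concrete value.
is_eager = tf.executing_eagerly()
eager_sum = tf.reduce_sum(x)
print(is_eager, eager_sum.numpy())

# Graph mode: tf.function traces the Python body into a graph on the first
# call and reuses that graph for later calls with the same input signature.
trace_log = []  # hypothetical helper list, used only to count traces


@tf.function
def graph_sum(values):
    trace_log.append("traced")  # Python side effect: runs only during tracing
    return tf.reduce_sum(values)


first = graph_sum(x)
second = graph_sum(x + 1.0)
print(first.numpy(), second.numpy(), len(trace_log))
```

Both calls to `graph_sum` receive a float tensor of the same shape, so the Python body should be traced only once even though the graph runs twice.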

In [0]:
!pip install -q tensorflow tensorflow-datasets matplotlib
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

import tensorflow_datasets as tfds
tfds.disable_progress_bar()

print(tf.executing_eagerly())  # True when running in Eager mode
tfds.list_builders()  # list the names of all available dataset builders

## `tfds.load`: A dataset in one line

- [`tfds.load`](https://www.tensorflow.org/datasets/api_docs/python/tfds/load) is a convenience method that's the simplest way to build and load a `tf.data.Dataset`.
- `tf.data.Dataset` is the standard TensorFlow API to build input pipelines. If you're not familiar with this API, we **strongly** encourage you to read [the official TensorFlow guide](https://www.tensorflow.org/guide/datasets).
- Once the data has been prepared, subsequent calls to `load` reuse the prepared data.
- You can customize where the data is saved/loaded by specifying `data_dir=` (defaults to `~/tensorflow_datasets/`).
- You can request a specific version by appending a version specifier to the dataset name, such as `"mnist:3.*.*"`. See the [documentation on datasets versioning](https://github.com/tensorflow/datasets/blob/master/docs/) for more details.

In [0]:
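ds_train = tfds.load(name="mnist", split="train")
assert isinstance(ds_train, tf.data.Dataset)
print(ds_train)

# Without `split`, `load` returns a dictionary containing every split.
# The "3.*.*" specifier pins the major version of the dataset.
ds_all = tfds.load("mnist:3.*.*")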

## Feature dictionaries

- All `tfds` datasets contain feature dictionaries mapping feature names to Tensor values. A typical dataset, like MNIST, will have 2 keys: `"image"` and `"label"`. Below we inspect a single example.
- In Eager mode, you can iterate over the dataset directly and read each value with `.numpy()`.
- In Graph mode, see the [tf.data guide](https://www.tensorflow.org/guide/datasets#creating_an_iterator) to understand how to iterate on a `tf.data.Dataset`.
- You can define the rest of an input pipeline suitable for model training by using the [`tf.data` API](https://www.tensorflow.org/guide/datasets).
- Here, we repeat the dataset so that we have an infinite stream of examples, shuffle it, and create batches of 32.
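
The same access pattern and pipeline can be tried without downloading anything. The sketch below assumes a synthetic stand-in for MNIST (blank images and made-up labels built with NumPy); it only mimics the `{"image", "label"}` layout and is not real TFDS data.

```python
import numpy as np
import tensorflow as tf

# Synthetic stand-in for MNIST: 100 blank 28x28x1 images with labels 0-9.
images = np.zeros((100, 28, 28, 1), dtype=np.uint8)
labels = np.arange(100, dtype=np.int64) % 10
ds = tf.data.Dataset.from_tensor_slices({"image": images, "label": labels})

# Each element is a dictionary mapping feature names to tensors.
for example in ds.take(1):
    first_image_shape = tuple(example["image"].shape)
    first_label = int(example["label"].numpy())
print(first_image_shape, first_label)

# Infinite stream, shuffled with a 1024-element buffer, in batches of 32,
# with prefetching so batches are prepared asynchronously.
pipeline = (
    ds.repeat()
    .shuffle(1024)
    .batch(32)
    .prefetch(tf.data.experimental.AUTOTUNE)
)

# Because of repeat(), the stream never ends: bound the loop with take().
batch_shapes = [tuple(batch["image"].shape) for batch in pipeline.take(5)]
print(batch_shapes)
```

Five batches of 32 draw 160 examples from a 100-example dataset, which only works because `repeat()` restarts the data once it is exhausted.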

In [0]:
# Access the values of a single example
for example in ds_train.take(1):  # Only take a single example
  image, label = example["image"], example["label"]

  plt.imshow(image.numpy()[:, :, 0].astype(np.float32), cmap=plt.get_cmap("gray"))
  print("Label: %d" % label.numpy())

# Another way to load the dataset: use the builder directly
mnist_builder = tfds.builder("mnist")
mnist_builder.download_and_prepare()
ds_train = mnist_builder.as_dataset(split="train")
print(ds_train)

# Preprocessing for training: repeat indefinitely, shuffle, and batch
ds_train = ds_train.repeat().shuffle(1024).batch(32)

# prefetch will enable the input pipeline to asynchronously fetch batches while
# your model is training.
ds_train = ds_train.prefetch(tf.data.experimental.AUTOTUNE)

# Now you could loop over batches of the dataset and train
# for batch in ds_train:
#   ...

## DatasetInfo

After generation, the builder contains useful information on the dataset:

In [0]:
info = mnist_builder.info
print(info)

print(info.features)
print(info.features["label"].num_classes)
print(info.features["label"].names)

ds_test, info = tfds.load("mnist", split="test", with_info=True)
print(info)

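# Display a grid of sample images with their labels.
# Note: newer TFDS releases document the argument order as (ds_test, info).
fig = tfds.show_examples(info, ds_test)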