# TFDS Datasets

### Topic

This notebook covers accessing and examining the built-in datasets in TensorFlow Datasets (TFDS). The datasets fall into the following categories:

    - Vision (CV)
        - Binary Classification (BC)
        - Multiple Class Classification (MC)
        - Multiple Class Multiple Label Classification (ML)
        - Object Detection (OD)
        - Video Intelligence (VI)
    - Natural Language Processing (NL)
        - Text Classification (TC)
        - Sentiment (SN)
        - Entity Recognition (ER)
        - Translation (TR)
    - Structured (ST)
    - Audio (AU)
    
### Audience

This notebook is intended for data scientists and machine learning engineers.

### Prerequisites

You should be familiar with:

    - Python 3.X
    - Deep Learning
    - TensorFlow

### Dataset

This notebook uses built-in datasets that are part of TensorFlow Datasets:

https://www.tensorflow.org/datasets/api_docs/python/tfds
    


### Objective

The objective of this tutorial is to learn what the TFDS built-in datasets are and how to use them.

### Costs 

This tutorial does not use any billable items.

### Set up your local development environment

**If you are using Colab or AI Platform Notebooks**, your environment already meets
all the requirements to run this notebook. You can skip this step.

**Otherwise**, make sure your environment meets this notebook's requirements.
You need the following:

* Git
* Python 3
* virtualenv
* Jupyter notebook running in a virtual environment with Python 3
* TensorFlow
* TensorFlow Datasets (TFDS)
* NumPy
* Matplotlib

The Google Cloud guide to [Setting up a Python development
environment](https://cloud.google.com/python/setup) and the [Jupyter
installation guide](https://jupyter.org/install) provide detailed instructions
for meeting these requirements. The following steps provide a condensed set of
instructions:

1. [Install Python 3.](https://cloud.google.com/python/setup#installing_python)

2. [Install
   virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv)
   and create a virtual environment that uses Python 3.
   
3. [Install TensorFlow](https://www.tensorflow.org/install/pip)

4. [Install TensorFlow Datasets](https://pypi.org/project/tensorflow-datasets)

5. [Install NumPy](https://pypi.org/project/numpy/)

6. [Install Matplotlib for Python](https://pypi.org/project/matplotlib/)

7. Activate that environment and run `pip install jupyter` in a shell to install
   Jupyter.

8. Run `jupyter notebook` in a shell to launch Jupyter.

9. Open this notebook in the Jupyter Notebook Dashboard.

## Setup

You will be using the following Python modules:

    - tensorflow (TF)    : Deep Learning Framework
    - tensorflow_datasets: TF builtin datasets
    - numpy              : Scientific/Math libraries for Arrays/Matrices/Tensors
    - matplotlib         : Plotting library

### Installs

You will need to install the following packages, which are not pre-installed on AI Platform Notebooks or Colab:

In [None]:
%pip install apache_beam

%pip install tensorflow==2.0.0-beta1

If you run locally, you may need to install these additional modules.

In [None]:
%pip install tensorflow_datasets
%pip install tfds-nightly

# Tutorial

### Imports

Now import the Python libraries used in this notebook.

In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
import matplotlib.pyplot as plt

### Verify you are using TF 2.0

In [None]:
print(tf.__version__)
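
Printing the version leaves the actual check to the reader. A minimal sketch of an explicit check follows. The helper `is_tf2` is hypothetical (it is not part of TensorFlow) and looks only at the major version number; in the notebook you would pass it `tf.__version__`.

```python
def is_tf2(version):
    """Return True if a version string such as '2.0.0-beta1' has major version >= 2."""
    # Hypothetical helper: only the part before the first dot is inspected.
    major = int(version.split('.')[0])
    return major >= 2


# In the notebook: assert is_tf2(tf.__version__), 'This notebook expects TF 2.x'
print(is_tf2('2.0.0-beta1'))
print(is_tf2('1.14.0'))
```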

## Specify Dataset Category/Subcategory

Use the cell below to specify the category and subcategory for a dataset.

    | Category   | Abbr | Subcategory                | Abbr |
    | Vision     | CV   | Binary Classification      | BC   |
    |            |      | Multi-Class Classification | MC   |
    |            |      | Multi-Label Classification | ML   |
    |            |      | Object Detection           | OD   |
    |            |      | Video Intelligence         | VI   |
    | Language   | NL   | Text Classification        | TC   |
    |            |      | Sentiment Analysis         | SN   |
    |            |      | Entity Recognition         | ER   |
    |            |      | Translation                | TR   |
    | Structured | ST   |  
    | Audio      | AU   |                  

In [None]:
# Top Level Categories
CATEGORIES=['AU', 'CV', 'NL', 'ST']

# Select a category (CV by default)
CATEGORY='CV'

if CATEGORY == 'CV':
    SUBCATEGORIES=[ 'BC', 'MC', 'ML', 'OD', 'VI']
    # Select a subcategory (BC by default)
    SUBCATEGORY='BC'
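elif CATEGORY == 'NL':
    SUBCATEGORIES=[ 'TC', 'SN', 'ER', 'TR']
    # Select a subcategory (TC by default)
    SUBCATEGORY='TC'
elif CATEGORY == 'ST':
    # Structured datasets have no subcategories
    SUBCATEGORIES=[]
    SUBCATEGORY=''
elif CATEGORY == 'AU':
    # Audio datasets have no subcategories
    SUBCATEGORIES=[]
    SUBCATEGORY=''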

## Load the Dataset

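Now let's use TFDS to load the selected built-in dataset. The `tfds.load()` function downloads and prepares the dataset on local disk the first time it is called, then returns it as `tf.data.Dataset` objects keyed by split name. Setting the keyword parameter `with_info=True` also returns metadata (information) about the dataset.

Whether the input files are shuffled on load depends on the TFDS version and the `shuffle_files` argument, so do not assume the examples arrive in random order.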

In [None]:
# Let's look at our memory footprint before the download.
!free -m

# Load the Dataset from builtin source
if CATEGORY == 'CV':
    if SUBCATEGORY == 'BC':
        name = 'cats_vs_dogs'
    elif SUBCATEGORY == 'MC':
        name = 'rock_paper_scissors'
    elif SUBCATEGORY == 'ML':
        # NOTE: bigearthnet does not yet work in this notebook.
        name = 'bigearthnet'
    elif SUBCATEGORY == 'OD':
        name = 'voc2007'
    elif SUBCATEGORY == 'VI':
        name = 'ucf101'

elif CATEGORY == 'NL':
    if SUBCATEGORY == 'TC':
        name = 'snli'
    elif SUBCATEGORY == 'SN':
        name = 'imdb_reviews'
    elif SUBCATEGORY == 'ER':
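        # No ER dataset is selected yet; fail here rather than with a NameError below
        raise NotImplementedError('No entity recognition (ER) dataset is configured yet.')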
    elif SUBCATEGORY == 'TR':
        name = 'para_crawl'
        
elif CATEGORY == 'ST':
    name = 'iris'
elif CATEGORY == 'AU':
    name = 'nsynth'

data, info = tfds.load(name, with_info=True)

# Let's look at our memory footprint after the download.
!free -m
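
The `if`/`elif` chain above can also be written as a table lookup, which makes unsupported selections fail with a clear message. This is only a sketch: the `DATASETS` table and the `dataset_name` helper are hypothetical names, and the dataset names are the ones already used in this notebook.

```python
# Hypothetical lookup table keyed by (category, subcategory).
DATASETS = {
    ('CV', 'BC'): 'cats_vs_dogs',
    ('CV', 'MC'): 'rock_paper_scissors',
    ('CV', 'ML'): 'bigearthnet',
    ('CV', 'OD'): 'voc2007',
    ('CV', 'VI'): 'ucf101',
    ('NL', 'TC'): 'snli',
    ('NL', 'SN'): 'imdb_reviews',
    ('NL', 'TR'): 'para_crawl',
    ('ST', ''): 'iris',
    ('AU', ''): 'nsynth',
}


def dataset_name(category, subcategory=''):
    """Return the dataset name for a selection, or raise ValueError."""
    try:
        return DATASETS[(category, subcategory)]
    except KeyError:
        raise ValueError(
            f'No dataset configured for {category!r}/{subcategory!r}'
        ) from None


print(dataset_name('CV', 'BC'))
```

With this helper, the load step would become `tfds.load(dataset_name(CATEGORY, SUBCATEGORY), with_info=True)`, and a selection such as `NL`/`ER` raises a `ValueError` before any download starts.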

### Metadata

Now let's display (print) the dataset's metadata (information on the dataset).

In [None]:
# print dataset information
print(info)

### Splitting the dataset

Some of the datasets come presplit into train and test (holdout), while others are stored as a single dataset (not split). 

In either case, the training data (or the entire dataset, if it is not presplit) is accessed from the dictionary entry `train`. If the dataset is presplit, the test data is accessed from the dictionary entry `test`.

If it is not presplit, we split it manually. In this case, we use the `take()` and `skip()` methods to move the first 10% of the examples from the train data into the test data.
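
The split arithmetic can be checked without TensorFlow. The sketch below uses a plain Python list as a stand-in for the dataset, with slicing standing in for `take()` and `skip()`. The helper `split_sizes` and the 25-example dataset are hypothetical.

```python
def split_sizes(n_examples, test_fraction=0.1):
    """Return (n_train, n_test) when the holdout is carved off the front."""
    # int() truncates, so a fractional count rounds down.
    n_test = int(n_examples * test_fraction)
    return n_examples - n_test, n_test


examples = list(range(25))            # hypothetical 25-example dataset
n_train, n_test = split_sizes(len(examples))

test = examples[:n_test]              # analogous to train.take(n_test)
train = examples[n_test:]             # analogous to train.skip(n_test)

print(n_train, n_test)
print(test)
```

Because the holdout is taken from the front, the two parts never overlap. The holdout is only representative, though, if the examples are not stored in a meaningful order (for example, sorted by class).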

In [None]:
# Split the dataset into train and test (holdout), if not already.

train = data['train']
      
# Let's get the test (holdout) portion of the dataset
try:
    test = data['test']
# Not pre-split (no 'test' entry), so let's split it ourselves
except KeyError:
    # Let's use 10% of dataset for test
    n_examples = info.splits['train'].num_examples
    n_test = int(n_examples * 0.1)
    test  = train.take(n_test)
    train = train.skip(n_test)

## Data Inspection

Let's now do a quick inspection by looking at one example in the dataset, based on the dataset category.

    - CV : show as an image
    - NL : show as text
    - ST : show as table

In [None]:
for example in tfds.as_numpy(train.take(1)):
    if CATEGORY == 'CV':
        if SUBCATEGORY == 'VI':
            image = example['video'][0]
        else:
            image = example['image']
        plt.imshow(image)

        # Multi-label datasets store their classes under 'labels'
        try:
            label = example['label']
        except KeyError:
            label = example['labels']
        plt.title('Class: ' + str(label))
        plt.show()
    elif CATEGORY == 'NL':
        if SUBCATEGORY == 'TC':
            text = example['hypothesis']
        elif SUBCATEGORY == 'TR':
            text = example['en']
        else:
            text = example['text']

        print(text)
        try:
            print(example['label'])
        except KeyError:
            # Some text datasets (e.g., translation pairs) have no label
            pass
    elif CATEGORY == 'ST':
        print(example['features'])
        print(example['label'])
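
In the NL branch, `tfds.as_numpy` returns string features as Python `bytes`, so the printed text appears with a `b'...'` prefix. A minimal sketch of decoding it for display follows. The helper `to_text` is hypothetical, it assumes the text is UTF-8 encoded, and the sample value is a made-up stand-in for `example['text']`.

```python
def to_text(value, encoding='utf-8'):
    """Decode bytes to str; return any other value unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


raw = b'This film was great.'   # made-up stand-in for example['text']
print(raw)                      # shows the bytes literal
print(to_text(raw))             # shows the decoded string
```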