
# Exploring the Splits API

## Setup

We'll start by importing TensorFlow and TensorFlow Datasets.

In [1]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'

import warnings
warnings.filterwarnings('ignore')

import numpy as np
np.random.seed(42)

import pandas as pd
import json

%matplotlib inline
%config InlineBackend.figure_format = 'retina'
import matplotlib.pyplot as plt

import seaborn as sns
sns.set_style('whitegrid')
sns.set(font='DejaVu Sans')

import tensorflow as tf
tf.keras.utils.set_random_seed(42)
tf.get_logger().setLevel('ERROR')

import tensorflow_datasets as tfds

# Good to run this to ensure you are using TF2.x
print("\u2022 Using TensorFlow Version:", tf.__version__)

• Using TensorFlow Version: 2.12.0


## Exploring the Splits API

In [3]:
# Distinct splits
# Load the full `train` split and the full `test` split as
# two distinct datasets.
train_ds, test_ds = tfds.load(name='mnist:3.*.*',
    split=['train', 'test'])

print("#train records:", len(list(train_ds)))
print("#test records:", len(list(test_ds)))

#train records: 60000
#test records: 10000


The Splits API accepts strings as slicing instructions. For example, the cell below merges the training and test sets by passing the string `'train+test'` to the `split` argument.

In [4]:
# Merging
# The full `train` and `test` splits, concatenated together.
combined_ds = tfds.load('mnist:3.*.*', split='train+test')

print("# train+test records:", len(list(combined_ds)))

# train+test records: 70000


We can also use Python-style slice notation to select the records we want. For example, the string `'train[:10000]'` takes the first 10,000 records of the `train` split, as shown below:

In [5]:
# Slicing by index
first10k = tfds.load('mnist:3.*.*', split='train[:10000]')

print("# first 10k train:", len(list(first10k)))

# first 10k train: 10000


The Splits API also lets us specify a percentage of the data. For example, the string `'train[:20%]'` selects the first 20% of the training set, as shown below:

In [6]:
# Slicing by percentage
first20p = tfds.load('mnist:3.*.*', split='train[:20%]')

print("# first 20% train:", len(list(first20p)))

# first 20% train: 12000


We can see that `first20p` contains 12,000 records, which is indeed 20% of the 60,000 records in the training set.
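The arithmetic behind the 12,000 figure can be checked in a few lines. This is a minimal sketch that assumes the split size is a multiple of 100, as it is for MNIST; TFDS applies its own rounding rules otherwise, which the sketch does not model. `percent_to_index` is a hypothetical helper written for illustration, not a TFDS function.

```python
def percent_to_index(num_examples: int, pct: int) -> int:
    """Map a percent boundary to an absolute record index.

    Exact only when num_examples is a multiple of 100 (assumption).
    """
    return num_examples * pct // 100


TRAIN_SIZE = 60_000  # size of the MNIST `train` split, as printed earlier

stop = percent_to_index(TRAIN_SIZE, 20)
equivalent_split = f"train[:{stop}]"

print(stop)              # 12000
print(equivalent_split)  # train[:12000]
```

Under that assumption, `'train[:20%]'` and `'train[:12000]'` describe the same number of records for MNIST.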

Because the slicing instructions are plain strings, we can generate them programmatically to build fairly complex splits. For example, the list comprehensions below create five complementary pairs of training and validation sets, 10 datasets in total (each `tfds.load` call returns a list of 5 datasets).

In [7]:
# K-fold splits
train_ds = tfds.load(name='mnist:3.*.*',
    split=['train[:{}%]+train[{}%:]'.format(k, k+20)
        for k in range(0, 100, 20)])
print("# train datasets:", len(train_ds))

val_ds = tfds.load(name='mnist:3.*.*',
    split=['train[{}%:{}%]'.format(k, k+20)
        for k in range(0, 100, 20)])
print("# valid datasets:", len(val_ds))

# train datasets: 5
# valid datasets: 5
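To see what the comprehensions above actually pass to `tfds.load`, it can help to print the generated strings on their own. The sketch below uses only plain Python (no dataset is loaded) and also checks, at the percentage level, that each validation slice and its training counterpart are disjoint and together cover the whole `train` split. The variable names are illustrative.

```python
# Rebuild the same split strings used in the K-fold cell, without loading data.
train_splits = ['train[:{}%]+train[{}%:]'.format(k, k + 20)
                for k in range(0, 100, 20)]
val_splits = ['train[{}%:{}%]'.format(k, k + 20)
              for k in range(0, 100, 20)]

for val, train in zip(val_splits, train_splits):
    print(f"validation: {val:<16} training: {train}")

# Check complementarity one percentage point at a time.
for k in range(0, 100, 20):
    val_pct = set(range(k, k + 20))
    train_pct = set(range(0, k)) | set(range(k + 20, 100))
    assert val_pct.isdisjoint(train_pct)
    assert val_pct | train_pct == set(range(100))
```

Note that the first and last folds contain an empty piece (`train[:0%]` and `train[100%:]`), so their training sets come entirely from the other half of the expression.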


We can also compose new datasets from pieces of different splits. For example, we can create a new dataset from the first 10% of the test set and the last 80% of the training set, as shown below.

In [8]:
# Composing operations
# The first 10% of test + the last 80% of train.
composed_ds = tfds.load(name='mnist:3.*.*',
    split='test[:10%]+train[-80%:]')
print("# composed records:", len(list(composed_ds)))

# composed records: 49000
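The 49,000 figure is the sum of the two pieces: 1,000 records from `test` plus 48,000 from `train`. The sketch below reproduces that count by borrowing Python's own slice semantics on a `range`, including the negative (count-from-the-end) boundary. It assumes split sizes that are multiples of 100 and does not model TFDS rounding; `percent_slice_len` is a hypothetical helper, not part of TFDS.

```python
SPLIT_SIZES = {"train": 60_000, "test": 10_000}  # MNIST sizes printed earlier


def percent_slice_len(size, start_pct=None, stop_pct=None):
    """Count the records selected by `[start_pct%:stop_pct%]` on a split of `size`."""
    def to_index(pct):
        # None means "open-ended", exactly as in a Python slice.
        return None if pct is None else size * pct // 100

    # Slicing a range resolves negative and missing boundaries for us.
    return len(range(size)[to_index(start_pct):to_index(stop_pct)])


test_part = percent_slice_len(SPLIT_SIZES["test"], stop_pct=10)      # test[:10%]
train_part = percent_slice_len(SPLIT_SIZES["train"], start_pct=-80)  # train[-80%:]

print(test_part, train_part, test_part + train_part)  # 1000 48000 49000
```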


# TFRecords

In [9]:
data, info = tfds.load(name='mnist', with_info=True)
print(info)

tfds.core.DatasetInfo(
    name='mnist',
    full_name='mnist/3.0.1',
    description="""
    The MNIST database of handwritten digits.
    """,
    homepage='http://yann.lecun.com/exdb/mnist/',
    data_path='/root/tensorflow_datasets/mnist/3.0.1',
    file_format=tfrecord,
    download_size=11.06 MiB,
    dataset_size=21.00 MiB,
    features=FeaturesDict({
        'image': Image(shape=(28, 28, 1), dtype=uint8),
        'label': ClassLabel(shape=(), dtype=int64, num_classes=10),
    }),
    supervised_keys=('image', 'label'),
    disable_shuffling=False,
    splits={
        'test': <SplitInfo num_examples=10000, num_shards=1>,
        'train': <SplitInfo num_examples=60000, num_shards=1>,
    },
    citation="""@article{lecun2010mnist,
      title={MNIST handwritten digit database},
      author={LeCun, Yann and Cortes, Corinna and Burges, CJ},
      journal={ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist},
      volume={2},
      year={2010}
    }""",
)


In [10]:
# If you are running this notebook on your local machine, change the
# path below to where the dataset was downloaded (see `data_path` in
# the output above) and to the dataset version you are using.
filename = "/root/tensorflow_datasets/mnist/3.0.1/mnist-test.tfrecord-00000-of-00001"
raw_dataset = tf.data.TFRecordDataset(filenames=filename)

for raw_record in raw_dataset.take(1):
    print(repr(raw_record))

<tf.Tensor: shape=(), dtype=string, numpy=b"\n\x85\x03\n\xf2\x02\n\x05image\x12\xe8\x02\n\xe5\x02\n\xe2\x02\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\x1c\x00\x00\x00\x1c\x08\x00\x00\x00\x00Wf\x80H\x00\x00\x01)IDAT(\x91\xc5\xd2\xbdK\xc3P\x14\x05\xf0S(v\x13)\x04,.\x82\xc5Aq\xac\xedb\x1d\xdc\n.\x12\x87n\x0e\x82\x93\x7f@Q\xb2\x08\xba\tbQ0.\xe2\xe2\xd4\xb1\xa2h\x9c\x82\xba\x8a(\nq\xf0\x83Fh\x95\n6\x88\xe7R\x87\x88\xf9\xa8Y\xf5\x0e\x8f\xc7\xfd\xdd\x0b\x87\xc7\x03\xfe\xbeb\x9d\xadT\x927Q\xe3\xe9\x07:\xab\xbf\xf4\xf3\xcf\xf6\x8a\xd9\x14\xd29\xea\xb0\x1eKH\xde\xab\xea%\xaba\x1b=\xa4P/\xf5\x02\xd7\\\x07\x00\xc4=,L\xc0,>\x01@2\xf6\x12\xde\x9c\xde[t/\xb3\x0e\x87\xa2\xe2\xc2\xe0A<\xca\xb26\xd5(\x1b\xa9\xd3\xe8\x0e\xf5\x86\x17\xceE\xdarV\xae\xb7_\xf3AR\r!I\xf7(\x06m\xaaE\xbb\xb6\xac\r*\x9b$e<\xb8\xd7\xa2\x0e\x00\xd0l\x92\xb2\xd5\x15\xcc\xae'\x00\xf4m\x08O'+\xc2y\x9f\x8d\xc9\x15\x80\xfe\x99[q\x962@CN|i\xf7\xa9!=\xd7 \xab\x19\x00\xc8\xd6\xb8\xeb\xa1\xf0\xd8l\xca\xfb]\xee\xfb]*\x9fV\xe1\x07\xb7\xc
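The raw record is a serialized `tf.train.Example` protocol buffer, which is why the string `image` is readable near the start. As a rough illustration, the sketch below decodes the first few length-delimited headers by hand, using only the leading bytes copied from the output above. `read_varint` is a hypothetical helper; the field numbers in the comments follow the `tf.train.Example` message layout (`Example.features = 1`, map entry `key = 1`).

```python
def read_varint(buf: bytes, pos: int):
    """Decode one protobuf base-128 varint at `pos`; return (value, next_pos)."""
    value, shift = 0, 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear: last byte of the varint
            return value, pos
        shift += 7


# Leading bytes of the raw record, copied from the printed output.
prefix = b"\n\x85\x03\n\xf2\x02\n\x05image\x12\xe8\x02\n\xe5\x02\n\xe2\x02"

pos = 0
headers = []
# Three nested headers: Example.features, one feature-map entry, the entry key.
for _ in range(3):
    tag, pos = read_varint(prefix, pos)
    length, pos = read_varint(prefix, pos)
    headers.append((tag >> 3, tag & 0x07, length))  # (field, wire type, length)

key = prefix[pos:pos + length].decode()

print(headers)  # [(1, 2, 389), (1, 2, 370), (1, 2, 5)]
print(key)      # image
```

Wire type 2 means "length-delimited", so the whole `features` message is 389 bytes long and the first map entry, keyed `image`, takes 370 of them. In practice this decoding is left to `tf.io.parse_single_example`, as in the next cell.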

In [11]:
# Create a description of the features.
feature_description = {
    'image': tf.io.FixedLenFeature(shape=[], dtype=tf.string),
    'label': tf.io.FixedLenFeature(shape=[], dtype=tf.int64)
}

def _parse_function(example_proto):
    # Parse the input `tf.Example` proto using the dictionary above.
    return tf.io.parse_single_example(
        serialized=example_proto, features=feature_description)

parsed_dataset = raw_dataset.map(_parse_function)
for parsed_record in parsed_dataset.take(1):
    print(parsed_record)

{'image': <tf.Tensor: shape=(), dtype=string, numpy=b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\x1c\x00\x00\x00\x1c\x08\x00\x00\x00\x00Wf\x80H\x00\x00\x01)IDAT(\x91\xc5\xd2\xbdK\xc3P\x14\x05\xf0S(v\x13)\x04,.\x82\xc5Aq\xac\xedb\x1d\xdc\n.\x12\x87n\x0e\x82\x93\x7f@Q\xb2\x08\xba\tbQ0.\xe2\xe2\xd4\xb1\xa2h\x9c\x82\xba\x8a(\nq\xf0\x83Fh\x95\n6\x88\xe7R\x87\x88\xf9\xa8Y\xf5\x0e\x8f\xc7\xfd\xdd\x0b\x87\xc7\x03\xfe\xbeb\x9d\xadT\x927Q\xe3\xe9\x07:\xab\xbf\xf4\xf3\xcf\xf6\x8a\xd9\x14\xd29\xea\xb0\x1eKH\xde\xab\xea%\xaba\x1b=\xa4P/\xf5\x02\xd7\\\x07\x00\xc4=,L\xc0,>\x01@2\xf6\x12\xde\x9c\xde[t/\xb3\x0e\x87\xa2\xe2\xc2\xe0A<\xca\xb26\xd5(\x1b\xa9\xd3\xe8\x0e\xf5\x86\x17\xceE\xdarV\xae\xb7_\xf3AR\r!I\xf7(\x06m\xaaE\xbb\xb6\xac\r*\x9b$e<\xb8\xd7\xa2\x0e\x00\xd0l\x92\xb2\xd5\x15\xcc\xae'\x00\xf4m\x08O'+\xc2y\x9f\x8d\xc9\x15\x80\xfe\x99[q\x962@CN|i\xf7\xa9!=\xd7 \xab\x19\x00\xc8\xd6\xb8\xeb\xa1\xf0\xd8l\xca\xfb]\xee\xfb]*\x9fV\xe1\x07\xb7\xc9\x8b55\xe7M\xef\xb0\x04\xc0\xfd&\x89\x01<\xbe\xf9\x0
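After parsing, the `image` feature is still an encoded PNG byte string, not a pixel array. As a quick sanity check, the sketch below reads the PNG `IHDR` header from the leading bytes copied from the output above, using only the standard library. The variable names are illustrative; to get actual pixels one would typically decode the string with `tf.io.decode_png`.

```python
import struct

# Leading bytes of the parsed `image` feature, copied from the printed output.
png_prefix = (b"\x89PNG\r\n\x1a\n"                   # 8-byte PNG signature
              b"\x00\x00\x00\rIHDR"                  # chunk length (13) + chunk type
              b"\x00\x00\x00\x1c\x00\x00\x00\x1c"    # width, height
              b"\x08\x00\x00\x00\x00")               # depth, color type, and 3 more bytes

assert png_prefix[:8] == b"\x89PNG\r\n\x1a\n"
assert png_prefix[12:16] == b"IHDR"

# Big-endian: two 4-byte unsigned ints, then two single bytes.
width, height, bit_depth, color_type = struct.unpack(">IIBB", png_prefix[16:26])

print(width, height, bit_depth, color_type)  # 28 28 8 0
```

A 28x28 image with 8-bit depth and color type 0 (grayscale) matches the `Image(shape=(28, 28, 1), dtype=uint8)` feature reported by `info`.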