
# Import the Packages and Check the Environment
[See the full TensorFlow Datasets tutorial](https://www.tensorflow.org/datasets/overview)

**Be careful: two lines of code in the tutorial do not work:**
- **fig = tfds.show_examples(ds, info)**
- **print(info.splits['train'].filenames)**


In [0]:
import tensorflow_datasets as tfds
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np

In [2]:
! pwd
# mnist dataset is stored in the path below
#! mkdir /content/drive/My\ Drive/colab/mnist

/content


# How to Use tfds

## See What Datasets are Included in tfds



In [0]:
# To see what datasets are available (included in tfds)
if False:
  tfds.list_builders()

## How to Load the Dataset, What Type of Structure the Data Has

There are three key things you need to know:
1. the as_supervised argument of tfds.load
2. the batch_size argument of tfds.load
3. the tfds.as_numpy() function

Items 1 and 2 determine the structure of what tfds.load returns.

Item 3 converts a **tf.data.Dataset** into a **Generator[np.ndarray]** and a **tf.Tensor** into an **np.ndarray**.
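
As a rough illustration of item 1, the sketch below mimics in plain Python what `as_supervised` changes about each element. The `example` record and the `to_supervised` helper are hypothetical stand-ins invented for this note, not TFDS APIs; real elements hold tensors rather than lists. The key pair is the `supervised_keys` value that `ds_info` reports for MNIST.

```python
# Hypothetical stand-in for one dataset element (real elements hold tf.Tensor values).
example = {"image": [[0, 255], [128, 64]], "label": 7}

# The (input, target) key pair; ds_info reports supervised_keys=('image', 'label') for MNIST.
supervised_keys = ("image", "label")


def to_supervised(record, keys):
    """Mimic as_supervised=True: pull the (input, target) pair out of the feature dict."""
    input_key, target_key = keys
    return record[input_key], record[target_key]


as_dict = example                                   # shape of an element with as_supervised=False
as_tuple = to_supervised(example, supervised_keys)  # shape of an element with as_supervised=True

print(type(as_dict).__name__, sorted(as_dict))
print(type(as_tuple).__name__, as_tuple[1])
```

With `batch_size=-1` the same two shapes apply, but to the whole split at once instead of to one element.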

In [4]:
# all the configs
as_supervised = True
# If True, each element of ds_train is a tuple; if False, each element is a dict.

batch_size = None
# -1: load the whole split as a single batch; ds_train is then a tuple of tensors if as_supervised=True, or a dict of tensors if as_supervised=False.
# None: ds_train is a tf.data.Dataset, regardless of as_supervised.


# load the data
# shuffle_files is useful when a dataset is stored in multiple files: setting it to True makes the program read the files in random order,
# which gives better randomness. Otherwise the files are always read in the same order, so the data is not truly randomized.
# MNIST has only 1 file, as you can see from the output of this cell, so it doesn't really matter here.

(ds_train, ds_val, ds_test), ds_info = tfds.load(
    'mnist',
    split=['train[:80%]', 'train[80%:]', 'test'],
    shuffle_files=True,
    as_supervised=as_supervised,     
    with_info=True,
    data_dir='/content/drive/My Drive/colab/mnist',
    batch_size=batch_size
)
print(ds_info)

print('We can access the features of the dataset by this way:')
print(ds_info.features.shape, ds_info.features.dtype)
print(ds_info.features['image'].shape, ds_info.features['image'].dtype)
print(ds_info.features['label'].num_classes, ds_info.features['label'].names, ds_info.features['label'].int2str(7), ds_info.features['label'].str2int('7'))
print("\n")

print('We can access the split of the dataset by this way:')
print(ds_info.splits)
print(list(ds_info.splits.keys()))
print(dir(ds_info.splits['train']))
print(ds_info.splits['train'].num_examples)
print(ds_info.splits['train'].file_instructions)
print(ds_info.splits['train'].file_instructions[0]['filename'])
print(ds_info.splits['train'].num_shards)
print(ds_info.splits['train[15%:25%]'].num_examples)
print(ds_info.splits['train[15%:25%]'].file_instructions)
print("\n")




print('ds_train type is (tuple, dict or tf.data.Dataset): ', type(ds_train))
# take a look at the internal structure of the data
if batch_size == -1:
  # ds_train will be dict or tuple
  if as_supervised:
    print('ds_train will be a tuple (images, labels)')
    print('How many images in ds_train: ', len(ds_train[0]))
    print('Type of elements in ds_train\'s images: ', type(ds_train[0][0]))
    print('Shape of the image: ', ds_train[0][0].shape)
    print('Type of elements in ds_train\'s labels: ', type(ds_train[1][0]))
    print('The value of the label: ', ds_train[1][0].numpy())
    print('\n')
    print('We can use tfds.as_numpy(something) to convert something into: Generator[np.ndarray] from tf.data.Dataset, or np.ndarray from tf.Tensor')
    print('We can convert ds_train[0] which is originally a tf.Tensor, now the type is: ', type(tfds.as_numpy(ds_train[0])))
    print('We can also convert the ds_train (a tuple) directly, after convert it\'s still a', type(tfds.as_numpy(ds_train)), ', the ds_train[0] is now the type of: ', type(tfds.as_numpy(ds_train)[0]))
  else:
    print('ds_train will be a dict {\'image\': image, \'label\': label}')
    print('Type of ds_train[\'image\']: ', type(ds_train['image']))
    print('Shape of ds_train["image"]: ', ds_train['image'].shape)
    print('Type of ds_train[\'label\']: ', type(ds_train['label']))
    print('Shape of ds_train["label"]: ', ds_train['label'].shape)
    print('\n')
    print('We can convert ds_train["image"] which is originally a tf.Tensor, now the type is: ', type(tfds.as_numpy(ds_train["image"])))
    print('We can also convert the ds_train (a dict) directly, after convert it\'s still a', type(tfds.as_numpy(ds_train)), ', the ds_train["image"] is now the type of: ', type(tfds.as_numpy(ds_train)["image"]))
elif batch_size is None:
  print('ds_train is a tf.data.Dataset object')
  ds = ds_train.take(1)
  print('Use element_spec to see the structure of element in the dataset', ds_train.element_spec)
  for data in ds:
    if as_supervised:
      print('The element in ds_train is tuple, as you can see: ', type(data))
      print('The type of the first element of this tuple: ', type(data[0]))
      print('The shape of this image (tf.Tensor): ', data[0].shape)
      print('The type of the second element of this tuple: ', type(data[1]))
      print('The shape of this label (tf.Tensor): ', data[1].shape)
      print('The value of this label (tf.Tensor): ', data[1].numpy())
      print('\n')
      print('We can convert the tf.data.Dataset using tfds.as_numpy() method, the type will be: ', type(tfds.as_numpy(ds)))
      print('The element from this generator will be the type of tuple: ', type(list(tfds.as_numpy(ds))[0]))
      print('The first element (an image) of this tuple will be the type of np.ndarray: ', type(list(tfds.as_numpy(ds))[0][0]))
      print('The shape of this image will be: ', list(tfds.as_numpy(ds))[0][0].shape)
    else:
      print('The element in ds_train is dict, as you can see: ', type(data))
      print('The keys are: ', list(data.keys()))
      print('The type of the first element of this dict: ', type(data['image']))
      print('The shape of this image (tf.Tensor): ', data['image'].shape)
      print('The type of the second element of this dict: ', type(data['label']))
      print('The shape of this label (tf.Tensor): ', data['label'].shape)
      print('The value of this label (tf.Tensor): ', data['label'].numpy())
      print('\n')
      print('We can convert the tf.data.Dataset using tfds.as_numpy() method, the type will be: ', type(tfds.as_numpy(ds)))
      print('The element from this generator will be the type of dict: ', type(list(tfds.as_numpy(ds))[0]))
      print('The image of this dict will be the type of np.ndarray: ', type(list(tfds.as_numpy(ds))[0]['image']))
      print('The shape of this image will be: ', list(tfds.as_numpy(ds))[0]['image'].shape)



tfds.core.DatasetInfo(
    name='mnist',
    version=3.0.0,
    description='The MNIST database of handwritten digits.',
    homepage='http://yann.lecun.com/exdb/mnist/',
    features=FeaturesDict({
        'image': Image(shape=(28, 28, 1), dtype=tf.uint8),
        'label': ClassLabel(shape=(), dtype=tf.int64, num_classes=10),
    }),
    total_num_examples=70000,
    splits={
        'test': 10000,
        'train': 60000,
    },
    supervised_keys=('image', 'label'),
    citation="""@article{lecun2010mnist,
      title={MNIST handwritten digit database},
      author={LeCun, Yann and Cortes, Corinna and Burges, CJ},
      journal={ATT Labs [Online]. Available: http://yann. lecun. com/exdb/mnist},
      volume={2},
      year={2010}
    }""",
    redistribution_info=,
)

We can access the features of the dataset by this way:
{'image': (28, 28, 1), 'label': ()} {'image': tf.uint8, 'label': tf.int64}
(28, 28, 1) <dtype: 'uint8'>
10 ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'] 7 7
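
The percent slices passed to `split` in the load call can be checked with plain arithmetic. This is a minimal sketch that assumes both slice boundaries land on whole examples, which holds for the 60000-example `train` split; TFDS applies its own rounding in other cases, which the hypothetical `percent_slice_size` helper does not model.

```python
def percent_slice_size(total, start_pct, end_pct):
    """Size of a [start%:end%] slice when both boundaries fall on whole examples."""
    start = total * start_pct // 100
    end = total * end_pct // 100
    return end - start


train_total = 60000  # the 'train' split size reported by ds_info

n_train = percent_slice_size(train_total, 0, 80)    # 'train[:80%]'
n_val = percent_slice_size(train_total, 80, 100)    # 'train[80%:]'
n_slice = percent_slice_size(train_total, 15, 25)   # 'train[15%:25%]'

print(n_train, n_val, n_slice)  # 48000 12000 6000
```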


## How to Use the Dataset to Train a Neural Network
[Refer to the full tutorial from TensorFlow](https://www.tensorflow.org/datasets/keras_example)

### Build Training Pipeline

In [0]:
# We need to convert the images from tf.uint8 to tf.float32 and normalize them.
# This function will be passed to the map method of a tf.data.Dataset.
# map calls the function once per element, using the element as the input. As we can see from the cell above, every element is a tuple (image, label).
# So after mapping, the images in the Dataset are tf.float32 values normalized to [0, 1], and the labels are unchanged.
def normalize_img(image, label):
  return tf.cast(image, tf.float32) / 255., label

# When num_parallel_calls is set to AUTOTUNE, tf.data chooses the number of parallel calls dynamically based on the available CPU.
# The input signature of map_func is determined by the structure of each element in this dataset.
# The deterministic argument controls whether determinism should be traded for performance by allowing elements to be produced out of order.
# If deterministic is None, the tf.data.Options.experimental_deterministic dataset option (True by default) is used to decide whether to produce elements deterministically.
ds_train = ds_train.map(map_func=normalize_img, num_parallel_calls=tf.data.experimental.AUTOTUNE, deterministic=None)

# cache is a very useful method for improving training performance.
# The first time the dataset is iterated over, its elements are cached either in the specified file or in memory (the cache location can be set through the argument).
# Subsequent iterations use the cached data.
# Since we want to shuffle the data before training, it's better to cache right after normalization and before shuffle.
# This way, each epoch reads the normalized samples from the cache and then shuffles them, so the shuffle order is not frozen into the cache.
# From the TensorFlow Datasets tutorial: random transformations should be applied after caching and batching.
# TFDS automatically caches some small datasets; check the specific dataset's doc page to see whether this applies. Refer to: https://www.tensorflow.org/datasets/performances
# Large datasets are sharded and usually don't fit in memory, so they should not be cached.
ds_train = ds_train.cache()

# When shuffle is enabled, this dataset fills a buffer with buffer_size elements, then randomly samples elements from this buffer, replacing the selected elements with new elements.
# For perfect shuffling, a buffer size greater than or equal to the full size of the dataset is required.
# Here the buffer size is the size of the full 'train' split (60000), which is larger than ds_train (train[:80%]), so the shuffle is perfect.
# Suggestion from TF: for bigger datasets which do not fit in memory, a standard value is 1000 if your system allows it.
# tf.data.Dataset objects are Python iterables.
# If we want the Dataset to yield a different order each time we iterate over it, we set reshuffle_each_iteration to True.
# If we set reshuffle_each_iteration to False, the order is the same on every iteration. It defaults to True.
ds_train = ds_train.shuffle(ds_info.splits['train'].num_examples,seed=None,reshuffle_each_iteration=True)

# batch combines consecutive elements of this dataset into batches. Batching after shuffling gives different batches at each epoch.
# If drop_remainder is True, the final batch is dropped when it has fewer than batch_size elements. It defaults to False.
ds_train = ds_train.batch(128,drop_remainder=True)

# Data augmentation: randomly change the brightness and contrast.
# Where the application allows it, we can also flip images left-right or top-bottom, or randomly change the saturation (for RGB images), and so on.
# TensorFlow recommends performing data augmentation after batching, because many augmentation ops accept 4-D (batched) image tensors
# and can therefore process a whole batch in one vectorized call.
# Note that tf.image.random_brightness and tf.image.random_contrast draw one random factor per call, so every image in a batch gets the same adjustment.
# Randomly changing brightness, contrast and so on is covered in more detail later.
def augmentation(image, label):
  image = tf.image.random_brightness(image, 0.2)
  image = tf.image.random_contrast(image, 0.2, 0.5)
  return image, label

ds_train = ds_train.map(augmentation, num_parallel_calls=tf.data.experimental.AUTOTUNE, deterministic=None)

# It is good practice to end the pipeline with prefetch for performance.
# Prefetching overlaps the preprocessing and model execution of a training step.
# Prefetch operates on the elements of the input dataset. It has no concept of examples vs. batches.
# examples.prefetch(2) will prefetch two elements (2 examples).
# examples.batch(20).prefetch(2) will prefetch 2 elements (in this case, 2 batches of 20 examples each).
# The number of elements to prefetch should be equal to (or possibly greater than) the number of batches consumed by a single training step.
# You can either tune this value manually or set it to tf.data.experimental.AUTOTUNE, which prompts the tf.data runtime to tune the value dynamically.
# Refer to: https://www.tensorflow.org/guide/data_performance#prefetching
# Refer to: https://www.tensorflow.org/api_docs/python/tf/data/Dataset#prefetch
ds_train = ds_train.prefetch(buffer_size=tf.data.experimental.AUTOTUNE)
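
The buffer-based shuffle described in the comments of the cell above can be simulated in a few lines of plain Python. This is only a sketch of the idea (fill a buffer, emit a random buffered element, replace it with the next incoming one), not the tf.data implementation; `buffered_shuffle` is a hypothetical helper written for this note.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in an order produced by sampling from a bounded buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)          # fill phase: nothing is emitted yet
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]                # emit a randomly chosen buffered element
        buffer[idx] = item               # and put the incoming element in its slot
    rng.shuffle(buffer)                  # input exhausted: drain what is left in random order
    yield from buffer


data = list(range(10))
small = list(buffered_shuffle(data, buffer_size=2, seed=0))
full = list(buffered_shuffle(data, buffer_size=len(data), seed=0))

print(small)  # elements stay close to their original position
print(full)   # any element can land anywhere
```

With `buffer_size=2`, the element emitted at position `i` can only come from the first `i + 2` inputs, so the order is barely randomized. A buffer at least as large as the dataset removes that restriction, which is why the pipeline uses the full split size.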

### Build Validation and Testing Pipeline

Shuffling has no effect on validation or test accuracy, so the validation and test sets are not shuffled.

Because they are not shuffled, we can cache the data after batching. Each pass then reads ready-made batches straight from the cache.
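
To make the caching behaviour concrete, here is a small pure-Python sketch of "compute on the first pass, replay afterwards". `CachedIterable` is a hypothetical class written for this note and only approximates what an in-memory cache does; it is not how tf.data implements it.

```python
class CachedIterable:
    """Compute elements on the first full pass, then replay them from memory."""

    def __init__(self, make_iterator):
        self._make_iterator = make_iterator  # zero-argument callable returning a fresh iterator
        self._cache = None

    def __iter__(self):
        if self._cache is not None:
            yield from self._cache           # later passes: no recomputation
            return
        collected = []
        for item in self._make_iterator():
            collected.append(item)
            yield item
        self._cache = collected              # stored only once the first pass completes


calls = {"normalize": 0}


def normalize(pixel):
    """Stand-in for the per-element preprocessing; counts how often it runs."""
    calls["normalize"] += 1
    return pixel / 255.0


pixels = [0, 51, 255]
cached = CachedIterable(lambda: map(normalize, pixels))

first_pass = list(cached)
second_pass = list(cached)

print(first_pass == second_pass, calls["normalize"])  # True 3
```

The preprocessing runs three times in total, not six, because the second pass is served from the cache. Anything placed before the cache point is therefore frozen, which is why random steps belong after it.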

In [0]:
# Validation set.
ds_val = ds_val.map(normalize_img, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds_val = ds_val.batch(128, drop_remainder=True)
ds_val = ds_val.cache()

ds_val = ds_val.prefetch(tf.data.experimental.AUTOTUNE)

# Test set.
ds_test = ds_test.map(normalize_img, num_parallel_calls=tf.data.experimental.AUTOTUNE)
ds_test = ds_test.batch(128, drop_remainder=True)
ds_test = ds_test.cache()
ds_test = ds_test.prefetch(tf.data.experimental.AUTOTUNE)
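
The effect of `drop_remainder` on the three splits used in this note can be worked out by hand. In the sketch below, the split sizes 48000 and 12000 are derived from the 80%/20% slices of the 60000-example `train` split and 10000 is the `test` split size; `batch_stats` is a hypothetical helper, not a TensorFlow function.

```python
def batch_stats(num_examples, batch_size, drop_remainder):
    """Return (number of batches, number of examples left out)."""
    full, leftover = divmod(num_examples, batch_size)
    if drop_remainder:
        return full, leftover
    return full + (1 if leftover else 0), 0


for name, n in [("train[:80%]", 48000), ("train[80%:]", 12000), ("test", 10000)]:
    print(name, batch_stats(n, 128, True), batch_stats(n, 128, False))
```

With a batch size of 128 the training split divides evenly into 375 batches. With `drop_remainder=True`, the validation split leaves out 96 examples and the test split leaves out 16, so the reported metrics cover slightly fewer examples than the splits contain.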

### Now Train the Model

The model is not the focus of this note, so it is not explained in detail.


In [7]:
model = tf.keras.models.Sequential([
  tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(10, activation='softmax')
])
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=tf.keras.optimizers.Adam(0.001),
    metrics=['accuracy'],
)
model.summary()
history = model.fit(
    ds_train,
    epochs=10,
    validation_data=ds_val,
    )

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
flatten (Flatten)            (None, 784)               0         
_________________________________________________________________
dense (Dense)                (None, 128)               100480    
_________________________________________________________________
dense_1 (Dense)              (None, 10)                1290      
Total params: 101,770
Trainable params: 101,770
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
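
The parameter counts in the summary above can be reproduced by hand: a dense layer has one weight per input-unit pair plus one bias per unit. A minimal check, where `dense_params` is a hypothetical helper:

```python
def dense_params(inputs, units):
    """Weights (inputs * units) plus one bias per unit."""
    return inputs * units + units


flat = 28 * 28 * 1                # Flatten turns a (28, 28, 1) image into a vector
hidden = dense_params(flat, 128)  # first Dense layer
output = dense_params(128, 10)    # second Dense layer

print(flat, hidden, output, hidden + output)  # 784 100480 1290 101770
```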


In [8]:
model.evaluate(ds_test)
print(history)

<tensorflow.python.keras.callbacks.History object at 0x7f1adc396358>
