# Exercises

1. Why would you want to use the Data API?
2. What are the benefits of splitting a large dataset into multiple files?
3. During training, how can you tell that your input pipeline is the bottleneck? What can you do to fix it?
4. Can you save any binary data to a tfrecord file, or only serialised protocol buffers?
5. Why would you go through the hassle of converting all your data to the `Example` protobuf format? Why not use your own protobuf definition?
6. When using tf records, when would you want to activate compression? Why not do it systematically?
7. Data can be preprocessed directly when writing the data files, or within the tf.data pipeline, or in preprocessing layers within your model, or using tf transform. Can you list a few pros & cons of each option?
8. Name a few common techniques you can use to encode categorical features. What about text?
9. Load the fashion MNIST dataset; split it into a training set, validation set, & a test set; shuffle the training set; & save each dataset to multiple tfrecord files. Each record should be a serialised `Example` protobuf with two features: the serialised image (use `tf.io.serialize_tensor()` to serialise each image), & the label. Then use tf.data to create an efficient dataset for each set. Finally, use a keras model to train these datasets, including a preprocessing layer to standardise each input feature. Try to make the input pipeline as efficient as possible, using tensorboard to visualise profiling data.
10. In this exercise, you will download a dataset, split it, create a `tf.data.Dataset` to load it & preprocess it efficiently, then build & train a binaary classification model containing an `Embedding` layer.
   - Download the Large Movie Review dataset, which contains 50,000 movie reviews from the internet movie database. The data is organised in two directories *train* & *test*, each containing a *pos* subdirectory with 12,500 positive reviews & a *neg* subdirectory with 12,500 negative reviews. Each review is stored in a separate text file. There are other files & folders (including pre-processed bag-of-words), but we will ignore them in this exercise.
   - Split the test set into a validation set (15,000) & a test set (15,000).
   - Use tf.data to create an efficient dataset for each set.
   - Create a binary classification model, using `TextVectorization` layer to preprocess each review.
   - Add an `Embedding` layer & compute the mean embedding for each review, multiplied by the square root of the number of words. This rescaled mean embedding can then be passed to the rest of your model.
   - Train the model & see what accuracy you get. Try to optimise your pipelines to make training as fast as possible.
   - Use TFDS to load the same dataset more easily: `tfds.load("imdb_reviews")`.

---

1. The data API makes loading & preprocessing data with TensorFlow easier. It can ingest large datasets & preprocess them simply: you just create a dataset object & specify where to get the data & how to transfrom it. It can also read from a variety of file types & supports reading from SQL databases.
2. Gradient descent works best when the instances of a dataset are independent & identically distributed. This means that we need to shuffle the instances of a dataset. For large datasets that cannot fit into memory, one possible solution is to split the dataset into multiple files. You can then pick multiple files randomly & read them simultaneously, interleaving their records. On top of that, you could also add a shuffling buffer, which will improve shuffling a lot more.
3. You want to make sure your GPU is close to 100% utilised. To do this, you employ prefetching & multithreading. Prefetching means that while our training algorithm is working on one batch, the dataset will be working in parallel on getting the next batch ready (load & preprocess). If you also ensure that loading & preprocessing is multi-threaded (set `num_parallel_calls` when calling `interleave()` & `map()`), then you can exploit multiple cores on the CPU so that the time for loading & preprocessing is less than the time for training on the GPU.
4. You can save binary data to both a tfrecord file & serialised protocol buffers. To serialise your protocol buffer, use the `SerializetToString()` method.
5. We go through the hassle of converting all our data to the `Example` protobuf format instead of using our protobuf definition because the custom protobuf methods are not TensorFlow operations, which mean that they cannot be included in a TensorFlow function (except for wrapping them in a `tf.py_function()` operation, which slows down training & makes the code less portable.
6. You want to compress your tfrecord files if they need to be loaded via network connection. This means that if you are going to download the compressed tf record files, then you will want to activate compression. Otherwise, if the files are on the same machine, then it's preferable to leave compression off.
7. *(1)* If you preprocess the data when creating the data files, then training will run faster, since it will not have to perform preprocessing. *(2)* If the data is preprocessed with a tf.data pipeline, it can be very efficient (with multithreading & prefetching). However, preprocessing data this way slows down training, because it is done on the fly. This also means that each training instance will be preprocessed multiple times (once per epoch) rather than just once, if the data was preprocessing when creating the data files. *(3)* If you add preprocessing layers to your model, then it will also slow down training & each instance will be preprocessed multiple times (once per epoch). However, you can speed this up by multithreading & prefetching. *(4)* With tf transform, each instance is preprocessed just once, which speeds up training. It also automatically generates preprocessing layers, which is great for maintenance because your preprocessing is all in one place.
8. For categorical features, you can use encode them as one-hot vectors. For large datasets, where learning the full vocabulary of the categorical features is a hassle, you can use out-of-vocabulary buckets. You can also use embeddings for categorical features.

# 9.

In [1]:
import tensorflow as tf

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
X_train, y_train = X_train[5000:], y_train[5000:]
X_val, y_val = X_train[:5000], y_train[:5000]

In [2]:
train_set = tf.data.Dataset.from_tensor_slices((X_train, y_train)).shuffle(buffer_size = len(X_train))
val_set = tf.data.Dataset.from_tensor_slices((X_val, y_val))
test_set = tf.data.Dataset.from_tensor_slices((X_test, y_test))

In [3]:
BytesList = tf.train.BytesList
FloatList = tf.train.FloatList
Int64List = tf.train.Int64List
Feature = tf.train.Feature
Features = tf.train.Features
Example = tf.train.Example

def create_example(image, label):
    image_data = tf.io.serialize_tensor(image)
    return Example(features = Features(feature = {"image": Feature(bytes_list = BytesList(value = [image_data.numpy()])),
                                                  "label": Feature(int64_list = Int64List(value = [label]))}))

In [4]:
for image, label in val_set.take(1):
    print(create_example(image, label))

features {
  feature {
    key: "label"
    value {
      int64_list {
        value: 4
      }
    }
  }
  feature {
    key: "image"
    value {
      bytes_list {
        value: "\010\004\022\010\022\002\010\034\022\002\010\034\"\220\006\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\001\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000,\177\266\271\241x7\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000*\306\373\377\373\371\367\377\374\326d\000\000\000\000\000\000\000\000\000\000\000\000\000\000\002\000\000\351\374\355\357\352\355\353\355\355\376\343\000\000\000\000\001\000\000\000\000\000\000\000\000\002\000\000\020\322\341\327\257\331\330\301\304\342\335\3212\000\000\002\000\000\000\000\000\000\000\000\002\000\000\307\345\350\346\365\314\333\375\365\317\302\337\347\354\353\000\000\003\000\000\000\000\000\000\000\001\000\211\353\314\321\311\321\352\276\352\332\327\356\357\314\275\340\232\000\000\000\0

In [5]:
from contextlib import ExitStack

def write_tfrecords(name, dataset, n_shards = 10):
    paths = ["{}.tfrecord-{:05d}-of-{:05d}".format(name, index, n_shards)
             for index in range(n_shards)]
    with ExitStack() as stack:
        writers = [stack.enter_context(tf.io.TFRecordWriter(path))
                   for path in paths]
        for index, (image, label) in dataset.enumerate():
            shard = index % n_shards
            example = create_example(image, label)
            writers[shard].write(example.SerializeToString())
    return paths

In [6]:
train_filepaths = write_tfrecords("my_fashion_mnist.train", train_set)
val_filepaths = write_tfrecords("my_fashion_mnist.val", val_set)
test_filepaths = write_tfrecords("my_fashion_mnist.test", test_set)

In [7]:
def preprocess(tfrecord):
    feature_descriptions = {
        "image": tf.io.FixedLenFeature([], tf.string, default_value=""),
        "label": tf.io.FixedLenFeature([], tf.int64, default_value=-1)
    }
    example = tf.io.parse_single_example(tfrecord, feature_descriptions)
    image = tf.io.parse_tensor(example["image"], out_type = tf.uint8)
    #image = tf.io.decode_jpeg(example["image"])
    image = tf.reshape(image, shape=[28, 28])
    return image, example["label"]

def mnist_dataset(filepaths, n_read_threads = 5, shuffle_buffer_size = None,
                  n_parse_threads = 5, batch_size = 32, cache = True):
    dataset = tf.data.TFRecordDataset(filepaths,
                                      num_parallel_reads = n_read_threads)
    if cache:
        dataset = dataset.cache()
    if shuffle_buffer_size:
        dataset = dataset.shuffle(shuffle_buffer_size)
    dataset = dataset.map(preprocess, num_parallel_calls = n_parse_threads)
    dataset = dataset.batch(batch_size)
    return dataset.prefetch(1)

In [8]:
train_set = mnist_dataset(train_filepaths, shuffle_buffer_size = 60000)
val_set = mnist_dataset(val_filepaths)
test_set = mnist_dataset(test_filepaths)

In [9]:
from tensorflow import keras
import numpy as np

class Standardization(keras.layers.Layer):
    def adapt(self, data_sample):
        self.means_ = np.mean(data_sample, axis=0, keepdims=True)
        self.stds_ = np.std(data_sample, axis=0, keepdims=True)
    def call(self, inputs):
        return (inputs - self.means_) / (self.stds_ + keras.backend.epsilon())

standardization = Standardization(input_shape=[28, 28])

sample_image_batches = train_set.take(100).map(lambda image, label: image)
sample_images = np.concatenate(list(sample_image_batches.as_numpy_iterator()),
                               axis = 0).astype(np.float32)
standardization.adapt(sample_images)

model = keras.models.Sequential([
    standardization,
    keras.layers.Flatten(),
    keras.layers.Dense(100, activation = "relu"),
    keras.layers.Dense(10, activation = "softmax")
])
model.compile(loss = "sparse_categorical_crossentropy",
              optimizer = "nadam", metrics = ["accuracy"])

  standardization = Standardization(input_shape=[28, 28])





In [12]:
from datetime import datetime
import os

logs = os.path.join(os.curdir, "my_logs",
                    "run_" + datetime.now().strftime("%Y%m%d_%H%M%S"))

tensorboard_cb = tf.keras.callbacks.TensorBoard(
    log_dir = logs, histogram_freq = 1, profile_batch = 10)

model.fit(train_set, epochs = 5, validation_data = val_set,
          callbacks = [tensorboard_cb])

Epoch 1/5
   1707/Unknown [1m3s[0m 1ms/step - accuracy: 0.8056 - loss: 0.5495

  self.gen.throw(typ, value, traceback)


[1m1719/1719[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 1ms/step - accuracy: 0.8059 - loss: 0.5488 - val_accuracy: 0.8790 - val_loss: 0.3301
Epoch 2/5
[1m1719/1719[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 1ms/step - accuracy: 0.8794 - loss: 0.3334 - val_accuracy: 0.8934 - val_loss: 0.2791
Epoch 3/5
[1m1719/1719[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 1ms/step - accuracy: 0.8936 - loss: 0.2900 - val_accuracy: 0.9038 - val_loss: 0.2505
Epoch 4/5
[1m1719/1719[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 1ms/step - accuracy: 0.9038 - loss: 0.2572 - val_accuracy: 0.9152 - val_loss: 0.2188
Epoch 5/5
[1m1719/1719[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 1ms/step - accuracy: 0.9125 - loss: 0.2406 - val_accuracy: 0.9166 - val_loss: 0.2238


<keras.src.callbacks.history.History at 0x1eca64c7d90>

In [13]:
%load_ext tensorboard
%tensorboard --logdir="C:\Users\yuj22\Desktop\ML_Python\Chapter 13\my_logs" --port=6006

# 10.