# Training on MNIST Dataset using TensorFlow Operator

## Prerequisites
Before we proceed, let's check that we're using the right image, that is, [TensorFlow](https://www.tensorflow.org/api_docs/) is available:

In [None]:
# ! pip3 list | grep tensorflow 
! pip3 install --user tensorflow==2.1.0
! pip3 install --user ipywidgets tensorflow-datasets nbconvert

To package the trainer in a container image, we need a file (on our cluster) that contains the code, as well as a file with the resource definition of the job for the Kubernetes cluster:

In [None]:
TRAINER_FILE = "tfjob.py"
KUBERNETES_FILE = "tfjob-mnist.yaml"

We also want to capture the output of a cell with [`%%capture`](https://ipython.readthedocs.io/en/stable/interactive/magics.html#cellmagic-capture); for `kubectl`, that output usually looks like `some-resource created`.
To extract the resource name from it, let's define a helper function:

In [None]:
import re

from IPython.utils.capture import CapturedIO


def get_resource(captured_io: CapturedIO) -> str:
    """
    Gets a resource name from `kubectl create -f <configuration.yaml>` (or `kubectl apply -f`).

    :param CapturedIO captured_io: Output captured by using `%%capture` cell magic
    :return: Name of the Kubernetes resource
    :rtype: str
    :raises Exception: if the resource could not be created
    """
    out = captured_io.stdout
    matches = re.search(r"^(.+)\s+created", out)
    if matches is not None:
        return matches.group(1)
    else:
        raise Exception(f"Cannot get resource as its creation failed: {out}. It may already exist.")
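
As a quick check of what the helper extracts, here is a minimal standalone sketch that applies the same regular expression to a sample line. The sample string is an assumption about what `kubectl create` prints for a `TFJob`; the output on your cluster may differ.

```python
import re

# Hypothetical sample of captured stdout; real output depends on your cluster.
sample_stdout = "tfjob.kubeflow.org/mnistjob created\n"

# Same pattern as in `get_resource`: everything before the word "created".
match = re.search(r"^(.+)\s+created", sample_stdout)
resource = match.group(1) if match else None
print(resource)
```

The extracted name includes the resource type prefix, so it can be passed directly to commands such as `kubectl describe`.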

## How to Load and Inspect the Data
We grab the MNIST data set with the aid of `tensorflow_datasets`.

In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds
from matplotlib import pyplot as plt

mnist, info = tfds.load('mnist', split='train', shuffle_files=True, with_info=True)
tfds.show_examples(mnist, info)

The shape of the input tensors shows that the images are all 28x28 pixels, but we do not yet know whether their greyscale values have been scaled to the [0, 1] range:

In [None]:
for example in mnist.take(1):
    squeezed = tf.squeeze(example["image"])
    print(tf.math.reduce_min(squeezed), tf.math.reduce_max(squeezed))

No, they have not: the values range from 0 to 255.
This means we have to scale them ourselves, both during training and before serving!
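
The scaling itself is a single division by 255. Below is a plain-Python sketch of that arithmetic, for instance for preparing a request before serving. The helper name `scale_pixels` and the sample values are illustrative assumptions; the trainer does the equivalent with `tf.cast(image, tf.float32) / 255`.

```python
def scale_pixels(pixels):
    """Map 8-bit greyscale values (0 to 255) to floats in [0, 1]."""
    return [value / 255.0 for value in pixels]


raw = [0, 128, 255]  # illustrative pixel values, not taken from the dataset
scaled = scale_pixels(raw)
print(min(scaled), max(scaled))
```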

In [None]:
# Clear variables as we have no need for these any longer
del mnist, squeezed

In [None]:
optimizer = 'rmsprop'

## How to Train the Model in the Notebook
Since we want to train the model in a distributed fashion, we put all the code in a single cell.
That way we can save the file and include it in a container image:

In [None]:
%%writefile $TRAINER_FILE
import argparse
import logging
import json
import os
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

import tensorflow_datasets as tfds
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, models
from tensorflow.keras.optimizers import SGD, Adam, RMSprop

logging.getLogger().setLevel(logging.INFO)




def make_datasets_unbatched():
  BUFFER_SIZE = 10000

  datasets, ds_info = tfds.load(name="mnist", download=True, with_info=True, as_supervised=True)
  mnist_train, mnist_test = datasets["train"], datasets["test"]

  def scale(image, label):
      image = tf.cast(image, tf.float32) / 255
      return image, label

  train_dataset = mnist_train.map(scale).cache().shuffle(BUFFER_SIZE).repeat()
  test_dataset = mnist_test.map(scale)

  return train_dataset, test_dataset


def model(args):
  model = models.Sequential()
  model.add(
      layers.Conv2D(64, (3, 3), activation='relu', input_shape=(28, 28, 1)))
  model.add(layers.MaxPooling2D((2, 2)))
  model.add(layers.Conv2D(128, (3, 3), activation='relu'))
  model.add(layers.Flatten())
  model.add(layers.Dense(256, activation='relu'))
  model.add(layers.Dense(10, activation='softmax'))

  model.summary()
  opt = args.optimizer
  model.compile(optimizer=opt,
                loss='sparse_categorical_crossentropy',
                metrics=['accuracy'])
  tf.keras.backend.set_value(model.optimizer.learning_rate, args.learning_rate)
  return model


def main(args):
  # MultiWorkerMirroredStrategy creates copies of all variables in the model's
  # layers on each device across all workers
  strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
      communication=tf.distribute.experimental.CollectiveCommunication.AUTO)
  # Logged at INFO level because the root logger is set to INFO above
  logging.info(f"num_replicas_in_sync: {strategy.num_replicas_in_sync}")
  BATCH_SIZE_PER_REPLICA = args.batch_size
  BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

  # Datasets need to be created after instantiation of `MultiWorkerMirroredStrategy`
  train_dataset, test_dataset = make_datasets_unbatched()
  train_dataset = train_dataset.batch(batch_size=BATCH_SIZE)
  test_dataset = test_dataset.batch(batch_size=BATCH_SIZE)

  # See: https://www.tensorflow.org/api_docs/python/tf/data/experimental/DistributeOptions
  options = tf.data.Options()
  options.experimental_distribute.auto_shard_policy = \
        tf.data.experimental.AutoShardPolicy.DATA

  train_datasets_sharded = train_dataset.with_options(options)
  test_dataset_sharded = test_dataset.with_options(options)

  with strategy.scope():
    # Model building/compiling need to be within `strategy.scope()`.
    multi_worker_model = model(args)

  # Keras' `model.fit()` trains the model with specified number of epochs and
  # number of steps per epoch. 
  multi_worker_model.fit(train_datasets_sharded,
                         epochs=10,
                         steps_per_epoch=10)
  
  eval_loss, eval_acc = multi_worker_model.evaluate(test_dataset_sharded, 
                                                    verbose=0, steps=10)

  # Log metrics for Katib
  logging.info("loss={:.4f}".format(eval_loss))
  logging.info("accuracy={:.4f}".format(eval_acc))


if __name__ == '__main__':
  parser = argparse.ArgumentParser()
  parser.add_argument("--batch_size",
                      type=int,
                      default=128,
                      metavar="N",
                      help="Batch size for training (default: 128)")
  parser.add_argument("--learning_rate", 
                      type=float,  
                      default=0.001,
                      metavar="N",
                      help='Initial learning rate')
  parser.add_argument("--optimizer", 
                      type=str, 
                      default='adam',
                      metavar="N",
                      help='optimizer')

  parsed_args, _ = parser.parse_known_args()
  main(parsed_args)

That saves the file as defined by `TRAINER_FILE` but it does not run it.

Let's see if our code is correct by running it from within our notebook:

In [None]:
%run $TRAINER_FILE --optimizer $optimizer
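
The trainer multiplies the per-replica batch size by the number of replicas in sync, and `fit` runs a fixed number of steps. The sketch below works through that arithmetic. It assumes one replica per worker (CPU-only workers), and the helper name `training_budget` is made up for illustration.

```python
def training_budget(batch_size_per_replica, num_replicas, epochs=10, steps_per_epoch=10):
    """Return (global batch size, total examples consumed by `fit`)."""
    global_batch = batch_size_per_replica * num_replicas
    total_examples = global_batch * steps_per_epoch * epochs
    return global_batch, total_examples


# Notebook run: default --batch_size of 128 on a single replica.
print(training_budget(128, 1))  # (128, 12800)
# Two workers with --batch_size=150, assuming one replica per worker.
print(training_budget(150, 2))  # (300, 30000)
```

Both totals are below the 60,000 MNIST training images, so with these settings the run is closer to a smoke test than to a full pass over the data.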

## How to Create a Docker Image Manually


The Dockerfile looks as follows:

```
FROM tensorflow/tensorflow:2.4.0
RUN pip install tensorflow_datasets
COPY tfjob.py /
ENTRYPOINT ["python", "/tfjob.py", "--batch_size", "100", "--learning_rate", "0.001", "--optimizer", "adam"]
```

If GPU support is needed, add the `-gpu` suffix to the image tag; the tag shown above is the CPU-only variant.
`tfjob.py` is the trainer code, which you have to download to your local machine.

With the Dockerfile and `tfjob.py` in the same directory, build the image and push it to your container registry:

```bash
docker build -t <docker_image_name_with_tag> .
docker push <docker_image_name_with_tag>
```

The image is available as `mavencodev/tf_job:5.0` in case you want to skip this step for now.

## How to Create a Distributed `TFJob`
For large training jobs, we wish to run our trainer in a distributed mode.
Once the notebook server cluster can access the Docker image from the registry, we can launch a distributed TensorFlow job.

The specification for a distributed `TFJob` is defined using YAML:

In [None]:
%%writefile $KUBERNETES_FILE
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "mnistjob"
  namespace: demo01 # your-user-namespace
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - name: tensorflow
            # modify this property if you would like to use a custom image
            image: mavencodev/tf_job:5.0
            command:
                - "python"
                - "/tfjob.py"
                - "--batch_size=150"
                - "--learning_rate=0.001"
                - "--optimizer=adam"
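
For each replica, the TFJob operator sets a `TF_CONFIG` environment variable, which `MultiWorkerMirroredStrategy` reads to discover its peers. The sketch below parses such a value. The host names, the port, and the helper name `describe_task` are illustrative assumptions, not values read from a real cluster.

```python
import json

# Illustrative TF_CONFIG for worker 0 of a two-worker job (hosts and port are assumed).
example_tf_config = json.dumps({
    "cluster": {"worker": ["mnistjob-worker-0:2222", "mnistjob-worker-1:2222"]},
    "task": {"type": "worker", "index": 0},
})


def describe_task(tf_config_json):
    """Return (task type, task index, number of workers) from a TF_CONFIG string."""
    config = json.loads(tf_config_json)
    task = config["task"]
    return task["type"], task["index"], len(config["cluster"]["worker"])


print(describe_task(example_tf_config))
```

Inside a running worker pod, the same function could be called with `os.environ["TF_CONFIG"]` to confirm that the worker sees the expected number of peers.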

Let's deploy the distributed training job:

In [None]:
%%capture tf_output --no-stderr
! kubectl create -f $KUBERNETES_FILE

In [None]:
TF_JOB = get_resource(tf_output)

To see the job status, use the following command:

In [None]:
! kubectl describe $TF_JOB
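
If you prefer to check the status programmatically, one option is to read the job as JSON (for example, with `kubectl get tfjob mnistjob -o json`) and inspect `status.conditions`. The sketch below assumes each condition carries a `type` and a `status` field; the sample document is hand-written and trimmed, not real cluster output, and the helper name `latest_condition` is made up.

```python
import json


def latest_condition(tfjob_json):
    """Return the type of the last condition whose status is "True", or None."""
    status = json.loads(tfjob_json).get("status", {})
    active = [c["type"] for c in status.get("conditions", []) if c.get("status") == "True"]
    return active[-1] if active else None


# Hand-written sample, trimmed to the fields used above (not real cluster output).
sample = json.dumps({"status": {"conditions": [
    {"type": "Created", "status": "True"},
    {"type": "Running", "status": "True"},
]}})
print(latest_condition(sample))
```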

You should now be able to see the created pods matching the specified number of workers.

In [None]:
! kubectl get pods -l job-name=mnistjob

In case of issues, it may be helpful to see the last ten events within the cluster:

In [None]:
! kubectl get events --sort-by='.lastTimestamp' | tail

To stream logs from the worker-0 pod to check the training progress, run the following command:

In [None]:
! kubectl logs -f mnistjob-worker-0
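
The trainer logs its final metrics as `loss=...` and `accuracy=...` so that Katib can pick them up. A rough sketch of extracting those values from log text follows. The log prefix and the numbers in the sample are made up, and the pattern relies only on the `name=value` part of each line.

```python
import re

METRIC_PATTERN = re.compile(r"(loss|accuracy)=([0-9.]+)")


def parse_metrics(log_text):
    """Collect the last reported value of each metric from the log text."""
    return {name: float(value) for name, value in METRIC_PATTERN.findall(log_text)}


# Made-up log lines; the prefix and the numbers are placeholders.
sample_logs = "INFO:root:loss=0.2500\nINFO:root:accuracy=0.9000\n"
print(parse_metrics(sample_logs))
```

Keras progress lines use `loss: ...` with a colon, so they do not match this pattern and only the explicitly logged evaluation metrics are returned.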

To delete the job, run the following command:

In [None]:
! kubectl delete tfjob --all

Check whether the pod is still up and running:

In [None]:
! kubectl -n demo01 logs -f mnistjob-worker-0