### Training on a Birds Image Dataset using the TensorFlow Operator

#### Prerequisites

##### Before we proceed, let's make sure that we're using the right image, that is, that TensorFlow and the other required packages are available:

In [14]:
# ! pip3 list | grep tensorflow 
! pip3 install --user tensorflow==2.1.0
! pip3 install --user ipywidgets tensorflow-datasets nbconvert

You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.


##### To package the trainer in a container image, we need a file (on our cluster) that contains the code, as well as a file with the resource definition of the job for the Kubernetes cluster:

In [15]:
TRAINER_FILE = "tfjob.py"
KUBERNETES_FILE = "tfjob-birds.yaml"

##### We also want to capture the output of a cell with %%capture; for a successful creation it usually looks like some-resource created. To that end, let's define a helper function:

In [16]:
import re

from IPython.utils.capture import CapturedIO


def get_resource(captured_io: CapturedIO) -> str:
    """
    Gets a resource name from `kubectl apply -f <configuration.yaml>`.

    :param CapturedIO captured_io: Output captured by using `%%capture` cell magic
    :return: Name of the Kubernetes resource
    :rtype: str
    :raises Exception: if the resource could not be created
    """
    out = captured_io.stdout
    matches = re.search(r"^(.+)\s+created", out)
    if matches is not None:
        return matches.group(1)
    else:
        raise Exception(f"Cannot get resource as its creation failed: {out}. It may already exist.")
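
As a quick illustration of what the helper extracts, the sketch below applies the same regular expression to a sample string. The sample output line and the function name `parse_created_resource` are assumptions made for illustration. `re.MULTILINE` is added here (it is not used in the helper above) so that the match also works when the `created` line is not the first line of the output.

```python
import re


def parse_created_resource(stdout: str) -> str:
    """Return the resource name from a line such as '<resource> created'."""
    # re.MULTILINE lets ^ match at the start of any line, not only the first.
    match = re.search(r"^(.+)\s+created", stdout, flags=re.MULTILINE)
    if match is None:
        raise ValueError(f"No 'created' line found in: {stdout!r}")
    return match.group(1)


# Hypothetical output of `kubectl create -f tfjob-birds.yaml`.
sample = "tfjob.kubeflow.org/birdsjob created\n"
print(parse_created_resource(sample))
```

The returned string has the form `<kind>.<group>/<name>`, which is why it can later be passed directly to `kubectl describe`.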

In [17]:
cd /home/jovyan

/home/jovyan


### How to Load and Inspect the Data

In [18]:
import tensorflow as tf
import pandas as pd
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Conv2D,MaxPool2D,Flatten,Dropout,BatchNormalization,Activation
from tensorflow.keras.preprocessing.image import ImageDataGenerator
import matplotlib.pyplot as plt
from tensorflow.keras.preprocessing.image import load_img,img_to_array



train_directory='input/train'
val_directory='input/valid'
test_directory='input/test'

train_datagen=ImageDataGenerator(rescale=1/255)
val_datagen=ImageDataGenerator(rescale=1/255)
test_datagen=ImageDataGenerator(rescale=1/255)

train_generator=train_datagen.flow_from_directory(train_directory,
                                                 target_size=(224,224),
                                                 color_mode='rgb',
                                                 class_mode='sparse',batch_size=256)

val_generator=val_datagen.flow_from_directory(val_directory,
                                                 target_size=(224,224),
                                                 color_mode='rgb',
                                                 class_mode='sparse',batch_size=256)

test_generator=test_datagen.flow_from_directory(test_directory,
                                                 target_size=(224,224),
                                                 color_mode='rgb',
                                                 class_mode='sparse',batch_size=256)

Found 36611 images belonging to 260 classes.
Found 1300 images belonging to 260 classes.
Found 1300 images belonging to 260 classes.
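
The image counts reported above determine how many batches one full pass over each split takes. The following is a small arithmetic sketch using those counts and the configured `batch_size` of 256; the helper name `batches_per_pass` is made up for illustration.

```python
import math

# Image counts reported by flow_from_directory above; batch size as configured.
NUM_TRAIN_IMAGES = 36611
NUM_VAL_IMAGES = 1300
BATCH_SIZE = 256


def batches_per_pass(num_images: int, batch_size: int) -> int:
    """Number of batches needed to see every image once (the last may be partial)."""
    return math.ceil(num_images / batch_size)


train_steps = batches_per_pass(NUM_TRAIN_IMAGES, BATCH_SIZE)
val_steps = batches_per_pass(NUM_VAL_IMAGES, BATCH_SIZE)
print(train_steps, val_steps)  # 144 6
```

These values should agree with `len(train_generator)` and `len(val_generator)` in the notebook.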


In [19]:
optimizer='rmsprop'

### How to Train the Model in the Notebook
##### Since we want to train the model in a distributed fashion, we put all the code in a single cell. That way we can save the file and include it in a container image:

In [22]:
%%writefile $TRAINER_FILE
import argparse
import logging
import json
import os
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Only import what the trainer uses: pandas, matplotlib and standalone keras
# are not needed here and are not installed in the TensorFlow base image.
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Flatten,Dropout,BatchNormalization,Activation
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications import ResNet101V2

logging.getLogger().setLevel(logging.INFO)




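def to_unbatched_dataset(iterator, repeat=False):
    # `flow_from_directory` returns a Keras iterator, which has no `.batch()` or
    # `.with_options()`; wrap it in a tf.data.Dataset of single (image, label) pairs.
    def one_pass():
        for i in range(len(iterator)):
            yield iterator[i]
        # Reshuffle the file order before the next pass
        iterator.on_epoch_end()

    dataset = tf.data.Dataset.from_generator(
        one_pass,
        output_types=(tf.float32, tf.float32),
        output_shapes=(tf.TensorShape([None, 224, 224, 3]), tf.TensorShape([None])))
    dataset = dataset.unbatch()
    return dataset.repeat() if repeat else dataset


def make_datasets_unbatched():
    train_directory='input/train'
    val_directory='input/valid'
    test_directory='input/test'

    train_datagen=ImageDataGenerator(rescale=1/255)
    val_datagen=ImageDataGenerator(rescale=1/255)
    test_datagen=ImageDataGenerator(rescale=1/255)

    train_iterator=train_datagen.flow_from_directory(train_directory,
                                                 target_size=(224,224),
                                                 color_mode='rgb',
                                                 class_mode='sparse',batch_size=256)

    val_iterator=val_datagen.flow_from_directory(val_directory,
                                                 target_size=(224,224),
                                                 color_mode='rgb',
                                                 class_mode='sparse',batch_size=256)

    test_iterator=test_datagen.flow_from_directory(test_directory,
                                                 target_size=(224,224),
                                                 color_mode='rgb',
                                                 class_mode='sparse',batch_size=256)

    # Training and test data repeat because `fit` and `evaluate` are called with
    # explicit step counts; validation data stays finite (one full pass per epoch).
    return (to_unbatched_dataset(train_iterator, repeat=True),
            to_unbatched_dataset(val_iterator),
            to_unbatched_dataset(test_iterator, repeat=True))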



def model(args):
    convlayer=ResNet101V2(input_shape=(224,224,3),weights='imagenet',include_top=False)
    convlayer.trainable = False
    model=Sequential()
    model.add(convlayer)
    model.add(Dropout(0.5))
    model.add(Flatten())
    model.add(BatchNormalization())
    model.add(Dense(2048,kernel_initializer='he_uniform'))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(1024,kernel_initializer='he_uniform'))
    model.add(BatchNormalization())
    model.add(Activation('relu'))
    model.add(Dropout(0.5))
    model.add(Dense(260,activation='softmax'))
    
    opt = args.optimizer
    
    model.compile(optimizer=opt,
                loss='sparse_categorical_crossentropy',
                metrics=['accuracy'])
    tf.keras.backend.set_value(model.optimizer.learning_rate, args.learning_rate)
    return model


def main(args):
    # MultiWorkerMirroredStrategy creates copies of all variables in the model's
    # layers on each device across all workers
    strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy(
      communication=tf.distribute.experimental.CollectiveCommunication.AUTO)
    logging.debug(f"num_replicas_in_sync: {strategy.num_replicas_in_sync}")
    BATCH_SIZE_PER_REPLICA = args.batch_size
    BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

    # Datasets need to be created after instantiation of `MultiWorkerMirroredStrategy`
    train_dataset, val_dataset, test_dataset= make_datasets_unbatched()
    train_dataset = train_dataset.batch(batch_size=BATCH_SIZE)
    val_dataset = val_dataset.batch(batch_size=BATCH_SIZE)
    test_dataset = test_dataset.batch(batch_size=BATCH_SIZE)

    # See: https://www.tensorflow.org/api_docs/python/tf/data/experimental/DistributeOptions
    options = tf.data.Options()
    options.experimental_distribute.auto_shard_policy = \
        tf.data.experimental.AutoShardPolicy.DATA

    train_datasets_sharded  = train_dataset.with_options(options)
    val_datasets_sharded  = val_dataset.with_options(options)
    test_dataset_sharded = test_dataset.with_options(options)

    with strategy.scope():
        # Model building/compiling need to be within `strategy.scope()`.
        multi_worker_model = model(args)

    # Keras' `model.fit()` trains the model with specified number of epochs and
    # number of steps per epoch. 
    multi_worker_model.fit(train_datasets_sharded,validation_data=val_datasets_sharded,
                         epochs=5,
                         steps_per_epoch=10)

  
    eval_loss, eval_acc = multi_worker_model.evaluate(test_dataset_sharded, 
                                                    verbose=0, steps=10)


    # Log metrics for Katib
    logging.info("loss={:.4f}".format(eval_loss))
    logging.info("accuracy={:.4f}".format(eval_acc))


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch_size",
                      type=int,
                      default=128,
                      metavar="N",
                      help="Batch size for training (default: 128)")
    parser.add_argument("--learning_rate", 
                      type=float,  
                      default=0.001,
                      metavar="N",
                      help='Initial learning rate')
    parser.add_argument("--optimizer", 
                      type=str, 
                      default='adam',
                      metavar="N",
                      help='optimizer')

    parsed_args, _ = parser.parse_known_args()
    main(parsed_args)

Overwriting tfjob.py


##### That saves the file as defined by TRAINER_FILE but it does not run it.
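
The trainer ends by logging `loss=...` and `accuracy=...` lines so that a metrics collector can pick them up. As a rough sketch of how such `name=value` lines can be read back from log text, consider the following. The sample log lines, the pattern and the function name `parse_metrics` are assumptions for illustration, not Katib's actual implementation.

```python
import re

# Matches "name=value" where value is a decimal number, e.g. "loss=0.1234".
METRIC_PATTERN = re.compile(r"\b(loss|accuracy)=([+-]?\d+(?:\.\d+)?)")


def parse_metrics(log_text: str) -> dict:
    """Collect the last reported value for each metric name in the log text."""
    metrics = {}
    for name, value in METRIC_PATTERN.findall(log_text):
        metrics[name] = float(value)
    return metrics


# Made-up log lines in the format produced by "loss={:.4f}".format(...).
sample_log = "INFO:root:loss=1.2345\nINFO:root:accuracy=0.6789\n"
print(parse_metrics(sample_log))
```

Keeping each metric on its own line in a fixed `name=value` format is what makes this kind of extraction straightforward.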

##### Let's see if our code is correct by running it from within our notebook:

In [None]:
%run $TRAINER_FILE --optimizer $optimizer
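
It helps to know how much data such a run actually consumes. In `main`, the global batch size is the per-replica batch size multiplied by `strategy.num_replicas_in_sync`, and `fit` is called with `epochs=5` and `steps_per_epoch=10`. The sketch below works through that arithmetic; the replica counts are assumed example values and the function names are made up for illustration.

```python
def global_batch_size(per_replica_batch_size: int, num_replicas: int) -> int:
    """Mirror of BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync."""
    return per_replica_batch_size * num_replicas


def examples_seen(per_replica_batch_size, num_replicas, steps_per_epoch=10, epochs=5):
    """Total training examples consumed by model.fit with the settings in tfjob.py."""
    return global_batch_size(per_replica_batch_size, num_replicas) * steps_per_epoch * epochs


# Notebook run: default --batch_size of 128, assuming a single replica.
print(examples_seen(128, 1))  # 6400
# Assumed two-replica run with the same per-replica batch size.
print(examples_seen(128, 2))  # 12800
```

Both totals are well below the 36611 training images, so with these settings the run is a quick check that the code works rather than a full training pass.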

### Create a Docker Image Manually
##### The Dockerfile looks as follows:

FROM tensorflow/tensorflow:2.4.0
RUN pip install tensorflow_datasets Pillow
COPY tfjob.py /
ENTRYPOINT ["python", "/tfjob.py", "--batch_size", "100", "--learning_rate", "0.001", "--optimizer", "adam"]

Pillow is installed because flow_from_directory needs it to load image files. If GPU support is needed, use the -gpu variant of the base image (tensorflow/tensorflow:2.4.0-gpu). tfjob.py is the trainer code, which you have to download to your local machine.

Then build the image and push it to your container registry:

docker build -t <docker_image_name_with_tag> .
docker push <docker_image_name_with_tag>

A prebuilt image is available as mavencodev/tfbirds:v.0.3 (the image used in the job specification below) in case you want to skip this step for now.

### How to Create a Distributed TFJob
##### For large training jobs, we wish to run our trainer in a distributed mode. Once the notebook server cluster can access the Docker image from the registry, we can launch a distributed TensorFlow job.
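
In a distributed run, `MultiWorkerMirroredStrategy` learns about the other workers from the `TF_CONFIG` environment variable, which the TFJob operator sets in each worker pod. The following is a minimal sketch of reading that variable. The host names and ports in the sample are illustrative assumptions, and `describe_task` is a made-up helper name.

```python
import json


def describe_task(environ) -> tuple:
    """Return (task type, task index, number of workers) from TF_CONFIG."""
    config = json.loads(environ.get("TF_CONFIG", "{}"))
    task = config.get("task", {})
    workers = config.get("cluster", {}).get("worker", [])
    # Without TF_CONFIG (e.g. in the notebook) training runs as a single worker.
    return task.get("type", "worker"), task.get("index", 0), max(len(workers), 1)


# Hypothetical TF_CONFIG for worker 0 of a two-worker job.
sample_env = {
    "TF_CONFIG": json.dumps({
        "cluster": {"worker": ["birdsjob-worker-0:2222", "birdsjob-worker-1:2222"]},
        "task": {"type": "worker", "index": 0},
    })
}
print(describe_task(sample_env))  # ('worker', 0, 2)
print(describe_task({}))          # ('worker', 0, 1)
```

Inside a worker pod, passing `os.environ` instead of `sample_env` would show the configuration that the strategy actually sees.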

##### The specification for a distributed TFJob is defined using YAML:

In [None]:
%%writefile $KUBERNETES_FILE
apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "birdsjob"
  namespace: ekemini # your-user-namespace
spec:
  cleanPodPolicy: None
  tfReplicaSpecs:
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
          - name: tensorflow
            # modify this property if you would like to use a custom image
            image: mavencodev/tfbirds:v.0.3
            command:
                - "python"
                - "/tfjob.py"
                - "--batch_size=150"
                - "--learning_rate=0.001"
                - "--optimizer=adam"
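
The `command` entries in the specification above end up as command-line arguments of `tfjob.py`, where they are read with `argparse`. The sketch below recreates the three options from the trainer to show the mapping. The `--extra=1` flag is made up to show that `parse_known_args` tolerates options it does not know, and `build_parser` is a hypothetical helper name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Recreate the three options defined in tfjob.py."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch_size", type=int, default=128)
    parser.add_argument("--learning_rate", type=float, default=0.001)
    parser.add_argument("--optimizer", type=str, default="adam")
    return parser


# Arguments as listed under `command` in the job specification, plus one
# made-up flag that the parser does not define.
argv = ["--batch_size=150", "--learning_rate=0.001", "--optimizer=adam", "--extra=1"]
args, unknown = build_parser().parse_known_args(argv)
print(args.batch_size, args.learning_rate, args.optimizer, unknown)
```

Because `command` in the pod specification overrides the image's `ENTRYPOINT`, the values given here take precedence over the ones baked into the Dockerfile.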

##### Let's deploy the distributed training job:

In [None]:
%%capture tf_output --no-stderr
! kubectl create -f $KUBERNETES_FILE

In [None]:
TF_JOB = get_resource(tf_output)

##### To see the job status, use the following command:

In [None]:
! kubectl describe $TF_JOB

##### You should now be able to see the created pods matching the specified number of workers.

In [None]:
! kubectl get pods -l job-name=birdsjob
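
The pod names in that listing follow from the job specification. As a small sketch, assuming the operator names pods `<job name>-<replica type>-<index>`, the expected names can be derived as follows; `expected_pod_names` is a made-up helper, and the result should be compared against the actual `kubectl get pods` output.

```python
def expected_pod_names(job_name: str, replica_type: str, replicas: int) -> list:
    """Pod names of the form <job>-<replica type>-<index>, one per replica."""
    return [f"{job_name}-{replica_type.lower()}-{i}" for i in range(replicas)]


# Values from the job specification above: name "birdsjob", 2 Worker replicas.
print(expected_pod_names("birdsjob", "Worker", 2))
```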

##### In case of issues, it may be helpful to see the last ten events within the cluster:


In [None]:
! kubectl get events --sort-by='.lastTimestamp' | tail

##### To stream logs from the worker-0 pod to check the training progress, run the following command:

In [None]:
! kubectl logs -f birdsjob-worker-0

##### To delete the job, run the following command:

In [None]:
#! kubectl delete tfjob --all

##### Check whether the pod is still up and running:

In [None]:
#! kubectl -n ekemini logs -f birdsjob-worker-0