## Training and serving TensorFlow models with Kubernetes Jobs and Seldon

In this notebook we will use the OpenShift client tool (`oc`) to build, train and deploy a TensorFlow model.

First, we need to install the `oc` command.

In [None]:
%%bash

curl -O https://mirror.openshift.com/pub/openshift-v4/clients/oc/4.1/linux/oc.tar.gz
tar xzf oc.tar.gz
cp oc /opt/app-root/bin/

The next step is to log in to the OpenShift server and switch to the project where this Jupyter server is running. We rely on two preconfigured environment variables here, `$TOKEN` and `$NAMESPACE`, for two reasons: (1) to make the notebook reproducible without users having to edit it manually, and (2) to avoid displaying the secret (`$TOKEN`) in the Jupyter UI.

_If this step fails, you might need to go to `Control Panel > Stop My Server` and provide those environment variables in the Spawner UI._

In [None]:
%%bash

oc login --server https://openshift.default.svc.cluster.local --insecure-skip-tls-verify --token=$TOKEN
oc project ${NAMESPACE}
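
Before running the login cell, it can help to confirm that both variables are actually present. The following is a minimal sketch of such a check; the helper name `missing_env` is hypothetical and not part of the workshop repository. It reports only variable names, never their values, so the token stays hidden.

```python
import os


def missing_env(names, env=None):
    """Return the names that are unset or empty in the environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Report missing variables without ever printing their values.
missing = missing_env(["TOKEN", "NAMESPACE"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("TOKEN and NAMESPACE are set")
```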

We need to apply the resources available in the https://gitlab.com/opendatahub/data-engineering-and-machine-learning-workshop repository. These contain the `BuildConfigs` and `Templates` needed to build and deploy the training `Job` and the serving `SeldonDeployment`.

In [None]:
%%bash

oc apply -f ../tf-random-forest/openshift

To run the training job successfully, we need to wait for the container image build to finish. You can watch the build logs below.

In [None]:
!oc logs -f buildconfig.build.openshift.io/forest-mnist-train

Let's take a look at the parameters we can configure for the training job. Some of them come with default values, but others need to be configured by the user.

In [None]:
%%bash

oc process forest-mnist-train --parameters

We use the predefined environment variables here again. The `MODEL_VERSION` parameter lets you version your models: its value is used to generate the exported model file name, so you will be able to switch between trained models in the serving part.

You can also experiment with `NUM_STEPS` to see whether and how it influences the model accuracy. Do not forget to change `MODEL_VERSION` for each training run, though; otherwise the following command will fail.

In [None]:
%%bash

oc process forest-mnist-train \
-p S3_ENDPOINT_URL=${S3_ENDPOINT_URL} \
-p AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID} \
-p AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY} \
-p MODEL_VERSION="1" | oc apply -f -
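
When experimenting with several parameter combinations, it may be convenient to assemble the `oc process` command from a dictionary instead of editing the cell by hand. The sketch below only builds and prints the argument list; the helper `build_process_args` is hypothetical, and the `MODEL_VERSION` and `NUM_STEPS` values are arbitrary placeholders.

```python
import shlex


def build_process_args(template, params):
    """Build the argument list for `oc process <template> -p KEY=VALUE ...`."""
    args = ["oc", "process", template]
    for key, value in params.items():
        args += ["-p", f"{key}={value}"]
    return args


# Placeholder values only; real credentials should come from the environment.
args = build_process_args(
    "forest-mnist-train",
    {"MODEL_VERSION": "2", "NUM_STEPS": "500"},
)
print(shlex.join(args))
```

The resulting list could be passed to `subprocess.run` and piped into `oc apply -f -`, mirroring the bash cell above.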

You can watch the training output by running the cell below.

Do not forget to change the job name based on the output of the command above!

You can find the `Test Accuracy` value close to the end of the logs.

In [None]:
!oc logs -f job.batch/forest-mnist-train-1
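
If you capture the log text instead of streaming it, the accuracy can be pulled out programmatically. This is a sketch under an assumption: the exact log format is not shown in this notebook, so the pattern below assumes a line such as `Test Accuracy: 0.9`, and the sample log is made up for illustration.

```python
import re


def find_test_accuracy(log_text):
    """Return the last 'Test Accuracy' value found in the log text, or None."""
    matches = re.findall(r"Test Accuracy\s*[:=]?\s*([0-9]*\.?[0-9]+)", log_text)
    return float(matches[-1]) if matches else None


# Made-up log excerpt for illustration, not real training output.
sample_log = "Step 100, Loss: 0.0\nTest Accuracy: 0.9\n"
print(find_test_accuracy(sample_log))
```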

The training job writes a compressed model to S3 object storage (using the endpoint and credentials from the environment variables). It also creates the bucket if it does not exist.

Let's take a look at which buckets exist in the object storage and find the trained model stored in the bucket.

If you changed the bucket name for the training job, make sure you use the same value here in the `Bucket=` parameter.

In [None]:
import boto3
import os
from pprint import pprint

conn = boto3.client(service_name='s3', 
                    endpoint_url=os.environ['S3_ENDPOINT_URL'])

pprint(conn.list_buckets()['Buckets'])
objects = conn.list_objects(Bucket="RHTE")

pprint(objects)
print("Stored models: ", ", ".join([x['Key'] for x in objects['Contents']]))
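
Note that a `list_objects` response for an empty bucket has no `Contents` entry, so indexing it directly raises a `KeyError`. A defensive variant is sketched below; `model_keys` is a hypothetical helper, and the dictionaries are hand-written stand-ins for a real response.

```python
def model_keys(response):
    """Extract object keys from a list_objects-style response dictionary.

    An empty bucket has no 'Contents' entry, so default to an empty list.
    """
    return [item["Key"] for item in response.get("Contents", [])]


# Hand-written dictionaries that mimic the response shape; the key is made up.
print(model_keys({"Contents": [{"Key": "example-model-1.tar.gz"}]}))
print(model_keys({}))
```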

Now that our model is trained, exported and stored in object storage, we can serve it through Seldon. Let's take a look at the parameters for the deployment.

In [None]:
%%bash

oc process forest-mnist-serve --parameters

You can see they are very similar to the training job parameters, which means we need to provide the S3 storage credentials again and make sure `MODEL_NAME` and `MODEL_VERSION` match the training run so that we deploy the correct model.

In [None]:
%%bash

oc process forest-mnist-serve \
-p S3_ENDPOINT_URL=${S3_ENDPOINT_URL} \
-p AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID} \
-p AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY} \
-p MODEL_VERSION="1" | oc apply -f -

In [None]:
!oc get pods -o name | grep forest-mnist-predictor

In [None]:
!oc logs -c forest-experiment pod/forest-mnist-predictor-28e5946-79c4996dd8-fp9z8
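
The pod name in the cell above is specific to one deployment, so it needs to be replaced with the name printed by `oc get pods`. A small sketch of selecting it from captured output follows; `matching_pods` is a hypothetical helper and the sample output is made up.

```python
def matching_pods(oc_output, fragment):
    """Return resource names from `oc get pods -o name` output containing fragment."""
    return [line.strip() for line in oc_output.splitlines() if fragment in line]


# Made-up output in the `pod/<name>` form that `oc get pods -o name` prints.
sample_output = "pod/jupyterhub-example\npod/forest-mnist-predictor-example\n"
print(matching_pods(sample_output, "forest-mnist-predictor"))
```

In the notebook, the output could be captured with `pods = !oc get pods -o name` and joined with newlines before calling the helper.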

If the serving container started successfully, we can load some test data (using the TensorFlow examples library).

In [None]:
!pip install tensorflow
import os, sys
import tensorflow as tf


# Import MNIST data
# Note: tensorflow.examples.tutorials ships only with TensorFlow 1.x releases
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=False)

We will use the `oc` command again to get the URL of the model prediction endpoint and store it as both a Python variable and a shell environment variable.

In [None]:
route=!oc get route forest-mnist -o "jsonpath={.spec.host}"
route=route[0]
%env SELDON_ROUTE=$route

Next, we can select our test sample. You can change the value of the variable `y` to get a different image from the test dataset. The cell prints the actual label, which should later match the prediction.

In [None]:
y=666
x=[mnist.test.images[y].tolist()]
print("Label: ", mnist.test.labels[y])

There are multiple ways to query the model for predictions. Let's take a look at two of them: the command-line tool `curl` and the Python package `requests`.

We pass the variable `x` from the cell above to the shell script as its first argument (`$1`) and use it as part of the payload sent to the `/api/v0.1/predictions` endpoint.

You will get back a JSON document containing the probabilities for all classes. The highest probability represents the predicted label.

In [None]:
%%bash -s "$x"

curl -k -X POST -H 'Content-Type: application/json' \
    -d "{\"data\": {\"ndarray\": $1}}" \
https://${SELDON_ROUTE}/api/v0.1/predictions 2>/dev/null
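
Strict JSON requires double-quoted keys, which is easy to get wrong when quoting by hand in a shell. As a sketch, the same request body can be produced with `json.dumps`; `build_payload` is a hypothetical helper, and the three-value sample stands in for a full 784-pixel MNIST image.

```python
import json


def build_payload(samples):
    """Serialize samples into the {"data": {"ndarray": ...}} request body."""
    return json.dumps({"data": {"ndarray": samples}})


# Tiny stand-in for a 784-pixel MNIST sample.
payload = build_payload([[0.0, 0.5, 1.0]])
print(payload)
```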

It is a bit easier to work with JSON objects in Python, so we can print the predicted label together with its probability.

Does it match the `Label` printed above?

In [None]:
import requests
import json

def get_label(predictions, names):
    result = max(predictions)
    return names[predictions.index(result)].split(":")[1], result
    

response = requests.post("https://%s/api/v0.1/predictions" % route, json={'data': {'ndarray': x}}, verify=False).json()
print("Predicted number is %s (%f) " % (get_label(response['data']['ndarray'][0], response['data']['names'])))
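
To see what `get_label` does without calling the endpoint, it can be run on a hand-written stand-in for `response['data']`. The class names and probabilities below are made up; the only assumption taken from the code above is that each name contains a `:` followed by the label.

```python
def get_label(predictions, names):
    """Return (label, probability) for the highest-probability class."""
    result = max(predictions)
    return names[predictions.index(result)].split(":")[1], result


# Hand-written stand-in for response['data']; names and values are made up.
fake_data = {
    "names": ["class:0", "class:1", "class:2"],
    "ndarray": [[0.1, 0.7, 0.2]],
}
print(get_label(fake_data["ndarray"][0], fake_data["names"]))
```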

## Training with GPUs

You will now attempt to train the same model on a GPU using the `tensorflow-gpu` package.

First, we need to change the training script's dependency by editing its `requirements.txt`. The command below will do that for you.

In [None]:
!sed -i 's/tensorflow.*/tensorflow-gpu==1.13.*/' ../tf-random-forest/train/requirements.txt
!cat ../tf-random-forest/train/requirements.txt
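
The `sed` expression replaces everything from the word `tensorflow` to the end of the line with the GPU package pin. The same substitution is sketched below on an in-memory string so its effect is easy to inspect; `switch_to_gpu` is a hypothetical helper and the sample requirements are made up.

```python
import re


def switch_to_gpu(requirements_text):
    """Rewrite each line that mentions tensorflow to the GPU package pin."""
    return re.sub(r"tensorflow.*", "tensorflow-gpu==1.13.*", requirements_text)


# Made-up file contents for illustration.
sample = "boto3\ntensorflow==1.13.1\n"
print(switch_to_gpu(sample))
```

Applying the substitution a second time leaves the text unchanged, so re-running the cell is harmless.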

Because you made a local change (i.e. local to the container where Jupyter is running) to the training code's dependencies, it is necessary to rebuild the training container image. The build must also use the local directory as its source (notice the `--from-dir` parameter) instead of pulling the code from the Git repository.

The following command will start a build and use local changes as a source.

In [None]:
!oc start-build -F forest-mnist-train-gpu --from-dir=../..

You can now start the training job. In its logs you should see messages like

```
2019-09-06 18:21:09.159963: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10805 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
```

which means the job was actually scheduled on a GPU node and TensorFlow will use the GPU to train the model.
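
If you save the job logs to a string, the presence of such a message can be checked automatically. The sketch below extracts the GPU name from the "Created TensorFlow device" line quoted above; `gpu_devices` is a hypothetical helper, and the pattern assumes the message layout stays the same as in that example.

```python
import re


def gpu_devices(log_text):
    """Return GPU names from 'Created TensorFlow device' log lines."""
    pattern = r"Created TensorFlow device .*?name: ([^,]+),"
    return re.findall(pattern, log_text)


line = ("Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 "
        "with 10805 MB memory) -> physical GPU (device: 0, name: Tesla K80, "
        "pci bus id: 0000:00:1e.0, compute capability: 3.7)")
print(gpu_devices(line))
```

An empty list would suggest that the job did not pick up a GPU.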

In [None]:
%%bash

oc process forest-mnist-train-gpu \
-p S3_ENDPOINT_URL=${S3_ENDPOINT_URL} \
-p AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID} \
-p AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY} \
-p MODEL_VERSION="g1" | oc apply -f -

In [None]:
!oc logs -f job.batch/forest-mnist-train-gpu-g1

After successful training, you can deploy the newly built model the same way we did previously.

In [None]:
%%bash

oc process forest-mnist-serve \
-p S3_ENDPOINT_URL=${S3_ENDPOINT_URL} \
-p AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID} \
-p AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY} \
-p MODEL_VERSION="g1" | oc apply -f -

In [None]:
!oc get pods -o name | grep forest-mnist-predictor

In [None]:
!oc logs -f -c forest-experiment pod/forest-mnist-predictor-28e5946-74cc875f94-cnfm8


Once the Seldon deployment is running, you can scroll up in the notebook and use the same code as before to call the prediction endpoint.