# Chapter 9: Advanced Model Deployments with TensorFlow Serving

## Decoupling Deployment Cycles

The basic deployments shown in the previous chapter work well, but they have one restriction: the trained and validated model must either be included in the deployment container image during the build step or be mounted into the container at runtime. This requires either DevOps processes or close coordination between the data science and DevOps teams whenever a new model version is deployed.<br>
Loading models from a remote storage location removes this coupling: since TensorFlow Serving frequently polls the model storage location, unloads the previously loaded model and loads a newer model upon detection, we only need to deploy our model serving container once.

### Workflow Overview

See figure 9-1 on page 259 for the separation of workflows.<br>
If your bucket folders are publicly accessible, you can serve the remote models by simply updating the model base path to the remote path:

    # Ports 8500 (gRPC) and 8501 (REST) are the defaults.
    # MODEL_NAME specifies your model; MODEL_BASE_PATH points to the remote bucket path.
    # The remaining configuration stays the same.
    !docker run -p 8500:8500 \
                -p 8501:8501 \
                -e MODEL_NAME=my_model \
                -e MODEL_BASE_PATH=s3://bucketname/model_path/ \
                -t tensorflow/serving

#### Accessing Private Models from AWS S3

    # The names of the two AWS credential environment variables must match exactly.
    # The remaining configuration stays the same.
    !docker run -p 8500:8500 \
                -p 8501:8501 \
                -e MODEL_NAME=my_model \
                -e MODEL_BASE_PATH=s3://bucketname/model_path/ \
                -e AWS_ACCESS_KEY_ID=XXXXX \
                -e AWS_SECRET_ACCESS_KEY=XXXXX \
                -t tensorflow/serving

For details regarding further configuration options, see page 260f.<br>
With these few additional environment variables provided to TensorFlow Serving, you are now able to load models from remote AWS S3 buckets.
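
Hard-coding access keys in a notebook cell makes them easy to leak. As a minimal sketch, the same *docker run* call can be assembled in Python so that the credentials are taken from the environment of the calling process. The helper name *build_serving_command* and its parameters are hypothetical; the flags and environment variable names are the ones used in the command above.

```python
import os
import shlex


def build_serving_command(model_name, model_base_path, env=None):
    """Assemble the docker run arguments for TensorFlow Serving.

    Hypothetical helper for illustration: it only reuses the flags shown
    in this section and forwards the AWS credentials if they are set.
    """
    env = os.environ if env is None else env
    command = [
        "docker", "run",
        "-p", "8500:8500",
        "-p", "8501:8501",
        "-e", f"MODEL_NAME={model_name}",
        "-e", f"MODEL_BASE_PATH={model_base_path}",
    ]
    # Forward the credentials only when they are present in the environment.
    for key in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
        if key in env:
            command += ["-e", f"{key}={env[key]}"]
    command += ["-t", "tensorflow/serving"]
    return command


# Placeholder values; in practice, omit env to read from os.environ.
command = build_serving_command(
    "my_model",
    "s3://bucketname/model_path/",
    env={"AWS_ACCESS_KEY_ID": "XXXXX", "AWS_SECRET_ACCESS_KEY": "XXXXX"},
)
print(shlex.join(command))
```

The returned list can be passed to *subprocess.run* unchanged, which avoids shell quoting issues.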

#### Accessing Private Models from GCP Buckets

GCP authenticates users through service accounts. To access private GCP Storage buckets, you need to create a service account file. Unlike AWS, GCP authentication expects a JSON file with the service account credentials. For the following example, we assume that you have downloaded your newly created service account credential file and saved it as sa-credentials.json under /home/your_username/.credentials/ on your host machine.

    # Mount the host directory with the credentials into the container and
    # point GOOGLE_APPLICATION_CREDENTIALS to the file path inside the container.
    # The remaining configuration stays the same.
    !docker run -p 8500:8500 \
                -p 8501:8501 \
                -e MODEL_NAME=my_model \
                -e MODEL_BASE_PATH=gs://bucketname/model_path/ \
                -v /home/your_username/.credentials/:/credentials/ \
                -e GOOGLE_APPLICATION_CREDENTIALS=/credentials/sa-credentials.json \
                -t tensorflow/serving

### Optimization of Remote Model Loading

It is recommended to reduce the polling frequency to once every 120 seconds, which still allows up to 30 potential updates per hour but generates 60 times less traffic than the default. You can change this in docker run with:
 - --file_system_poll_wait_seconds=120

If the value is set to 0, TensorFlow Serving will not attempt to refresh the loaded model.
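
The numbers above follow from simple arithmetic, which the short sketch below reproduces. Note that the 60-fold reduction implies a default polling interval of 2 seconds; that value is an assumption derived from the ratio, not a documented setting quoted here. The function names are illustrative.

```python
def polls_per_hour(poll_wait_seconds):
    """Number of times the storage location is polled per hour."""
    return 3600 / poll_wait_seconds


def traffic_reduction(new_wait_seconds, default_wait_seconds):
    """Factor by which polling traffic shrinks compared to the default."""
    return new_wait_seconds / default_wait_seconds


# Assumption: a default interval of 2 seconds, implied by the 60x figure.
ASSUMED_DEFAULT_WAIT_SECONDS = 2

print(polls_per_hour(120))                                   # 30.0
print(traffic_reduction(120, ASSUMED_DEFAULT_WAIT_SECONDS))  # 60.0
```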

## Model Optimization for Deployments

There are three optimization methods that produce smaller models and allow faster model inference:
 - Model Quantization (reduces the computational complexity of a model by reducing the precision of the weights' representation, e.g. from 32-bit floats to the bfloat16 format.)
  - Model quantization is applied after model training and is therefore often called post-training quantization. Since a quantized model can underfit, it is important to test the quantized model. Useful tools are:
    - <a href="https://developer.nvidia.com/tensorrt">Nvidia's TensorRT</a>
    - <a href="https://www.tensorflow.org/lite">TFLite Library</a>
 - Model Pruning (the idea is to reduce the trained network to a smaller one by removing unnecessary weights, which in practice means setting them to 0. This speeds up inference and compresses the model to a smaller size.) You can prune models during their training phase with tools like TensorFlow's model optimization package *tensorflow-model-optimization*.
 - Model Distillation (the idea is to train a smaller, less complex neural network to learn trained tasks from a much more extensive network.)
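
Two of the ideas above can be illustrated without any ML library. The following is a minimal sketch on a plain list of weights, not how TensorRT, TFLite, or *tensorflow-model-optimization* implement these steps. It assumes truncation (rather than rounding) for the bfloat16 conversion and uses the weight magnitude as the pruning criterion; both choices and both function names are assumptions made for illustration.

```python
import struct


def to_bfloat16(value):
    """Truncate a float to bfloat16 precision.

    The value is encoded as a 32-bit float and its lower 16 bits are
    dropped, which keeps the sign, the 8 exponent bits and 7 mantissa bits.
    Real converters usually round to nearest instead of truncating.
    """
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]


def prune_by_magnitude(weights, sparsity):
    """Set the given fraction of weights with the smallest magnitude to 0."""
    num_to_prune = int(len(weights) * sparsity)
    smallest_first = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for index in smallest_first[:num_to_prune]:
        pruned[index] = 0.0
    return pruned


weights = [0.8125, -0.013, 0.4, 0.002, -0.65, 0.07]
print([to_bfloat16(w) for w in weights])
print(prune_by_magnitude(weights, sparsity=0.5))
# [0.8125, 0.0, 0.4, 0.0, -0.65, 0.0]
```

The reduced precision introduces a small error per weight, which is why the notes above stress testing the optimized model before deploying it.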

## Using TensorRT with TensorFlow Serving

After a model is trained, you need to optimize the model with TensorRT's own optimizer or with saved_model_cli. The optimized model can then be loaded into TensorFlow Serving.

    !saved_model_cli convert --dir saved_models/ \
                             --output_dir trt-savedmodel/ \
                             --tag_set serve tensorrt

After the conversion, you can load the model in our GPU setup of TensorFlow Serving as follows:

    # GPU setup: uses the GPU image of TensorFlow Serving
    # (assumes the NVIDIA container runtime is installed on the host).
    !docker run --runtime=nvidia \
                -p 8500:8500 \
                -p 8501:8501 \
                --mount type=bind,source=/path/to/models,target=/models/my_model \
                -e MODEL_NAME=my_model \
                -t tensorflow/serving:latest-gpu

## TFLite

Traditionally, TFLite was used to convert ML models to smaller model sizes for deployment to mobile or IoT devices. However, these models can also be used with TensorFlow Serving.

### Steps to Optimize Your Model with TFLite

The conversion process consists of four steps:

 - 1. Loading the exported model
 - 2. Defining your model optimization goals
 - 3. Converting the model
 - 4. Saving the optimized model as a TFLite model

    import tensorflow as tf

    saved_model_dir = "path_to_saved_model"
    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    # Set the optimization strategy
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_model = converter.convert()

    with open("/tmp/model.tflite", "wb") as f:
        f.write(tflite_model)

There are some further optimization options for TFLite, see page 269f. in the book.

### Serving TFLite Models with TensorFlow Serving

You only need to start TensorFlow Serving with the *use_tflite_model* flag enabled, and it will load the optimized model as shown in the following example:

    # --use_tflite_model=true enables TFLite model loading; it is passed to
    # TensorFlow Serving after the image name. The remaining configuration stays the same.
    !docker run -p 8501:8501 \
                --mount type=bind,source=/path/to/models,target=/models/my_model \
                -e MODEL_NAME=my_model \
                -e MODEL_BASE_PATH=/models \
                -t tensorflow/serving \
                --use_tflite_model=true
**Deploy Your Models to Edge Devices**
After optimizing your model with TFLite, you can also deploy the model to a variety of mobile and edge devices:
 - Android and iOS mobile phones
 - ARM64-based computers
 - Microcontrollers and other embedded devices (e.g. Raspberry Pi)
 - Edge devices (e.g. IoT devices)
 - Edge TPUs (e.g. Coral)

For details, read *Practical Deep Learning for Cloud, Mobile, and Edge* by *Anirudh Koul et al. (O'Reilly)*. If you are looking for materials on edge devices with a focus on TFMicro, we recommend *TinyML* by *Pete Warden and Daniel Situnayake (O'Reilly)*.

## Monitoring Your TensorFlow Serving Instances

Use Prometheus, a free application for real-time event logging and alerting that is currently released under the Apache License 2.0, to monitor your inference setup. It is usually used with Kubernetes, but it can easily be used without it.<br>
TensorFlow Serving and Prometheus have to run simultaneously so that Prometheus can pull metrics from TensorFlow Serving via a REST endpoint. This requires the REST endpoints to be enabled for TensorFlow Serving, even if you only use gRPC endpoints in your application.

### Prometheus Setup

Before configuring TensorFlow Serving to provide metrics to Prometheus, we need to set up and configure our Prometheus instance. For simplicity, you can run two Docker instances as shown in Figure-9 on page 272. In a more elaborate setup, the applications would be Kubernetes deployments.<br>
Before starting Prometheus, create a configuration file at /tmp/prometheus.yml with the following configuration details:

    global:
        scrape_interval: 15s
        evaluation_interval: 15s
        external_labels:
            monitor: "tf-serving-monitor"

    scrape_configs:
        - job_name: "prometheus"
          scrape_interval: 5s # Interval at which metrics are pulled
          metrics_path: /monitoring/prometheus/metrics # Metrics endpoint of TensorFlow Serving
          static_configs:
              - targets: ["host.docker.internal:8501"] # Replace with the IP address of your application

Once you have created your Prometheus configuration file, you can start the Docker container, which runs the Prometheus instance:

    # Expose port 9090 and mount your configuration file into the container.
    !docker run -p 9090:9090 \
                -v /tmp/prometheus.yml:/etc/prometheus/prometheus.yml \
                prom/prometheus

### TensorFlow Serving Configuration

We need to write a small configuration file to configure the logging settings
and save it, e.g. as /tmp/monitoring_config.txt:

    prometheus_config {
        enable: true,
        path: "/monitoring/prometheus/metrics" # URL path for the metrics data; it needs to match the path specified in the Prometheus configuration created previously (/tmp/prometheus.yml)
    }
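
Because the metrics path has to be identical in both configuration files, a mismatch is an easy mistake to make. The following is a minimal sketch of a consistency check. It uses simple regular expressions instead of real YAML or protobuf parsers, the function names are hypothetical, and the two configurations are embedded as strings so the example runs on its own; in practice you would read them from /tmp/prometheus.yml and /tmp/monitoring_config.txt.

```python
import re

# Excerpts of the two configuration files from this section.
PROMETHEUS_CONFIG = """
scrape_configs:
    - job_name: "prometheus"
      metrics_path: /monitoring/prometheus/metrics
"""

MONITORING_CONFIG = """
prometheus_config {
    enable: true,
    path: "/monitoring/prometheus/metrics"
}
"""


def prometheus_metrics_path(config_text):
    """Return the metrics_path value from the Prometheus YAML text, or None."""
    match = re.search(r"^\s*metrics_path:\s*([^\s#]+)", config_text, re.MULTILINE)
    return match.group(1).strip("\"'") if match else None


def serving_metrics_path(config_text):
    """Return the path value from the TensorFlow Serving monitoring config, or None."""
    match = re.search(r'^\s*path:\s*"([^"]+)"', config_text, re.MULTILINE)
    return match.group(1) if match else None


paths_match = (
    prometheus_metrics_path(PROMETHEUS_CONFIG)
    == serving_metrics_path(MONITORING_CONFIG)
)
print("Metrics paths match:", paths_match)
```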

Pass the path of this file via the *monitoring_config_file* argument, and TensorFlow Serving will provide a REST endpoint with the metrics data for Prometheus:

    !docker run -p 8501:8501 \
                --mount type=bind,source=`pwd`,target=/models/my_model \
                --mount type=bind,source=/tmp,target=/model_config \
                -e MODEL_NAME=my_model \
                tensorflow/serving \
                --monitoring_config_file=/model_config/monitoring_config.txt

## Simple Scaling with TensorFlow Serving and Kubernetes

The previous deployment methods are well suited to serving one or more model versions in many deployments, but they are not enough for applications experiencing a high volume of prediction requests. In this case, the Docker container with TensorFlow Serving needs to be replicated to reply to the additional prediction requests. The orchestration of the containers is usually managed by tools like Docker Swarm or Kubernetes.<br>
For the following example, it is assumed that you have a Kubernetes cluster running and that you access the cluster via *kubectl*.
The first source code example highlights two aspects:
 - Deploying via Kubernetes without building specific Docker containers
 - Handling the Google Cloud authentication to access the remote model storage location.

 For code details see page 276f. in the book.

With this example, you can now deploy and scale your TensorFlow or Keras models without building custom Docker images. The service account credential file can be made available within the Kubernetes environment as a secret with the following command:

    !kubectl create secret generic gcp-credentials --from-file=/path/to/your/user-gcp-sa.json

For the corresponding service setup in Kubernetes look at page 279.<br>

**Further Reading on Kubernetes and Kubeflow**
 - Kubernetes: Up and Running, 2nd edition by Brendan Burns et al. (O'Reilly)
 - Kubeflow Operations Guide by Josh Patterson et al. (O'Reilly)
 - Kubeflow for Machine Learning (forthcoming) by Holden Karau et al. (O'Reilly)

# References and Additional Resources

 - <a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html#Using_CreateAccessKey9">AWS Access Keys</a>
 - <a href="https://www.tensorflow.org/model_optimization/guide/pruning">TF Optimization Methods</a>
 - <a href="https://www.tensorflow.org/model_optimization/guide/pruning/pruning_with_keras">TF indepth pruning example</a>
 - <a href="https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html">Nvidia TensorRT Documentation</a>