# Install Requirements

In [3]:
!apt-get install tree

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following NEW packages will be installed:
  tree
0 upgraded, 1 newly installed, 0 to remove and 6 not upgraded.
Need to get 40.7 kB of archives.
After this operation, 105 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu bionic/universe amd64 tree amd64 1.7.0-5 [40.7 kB]
Fetched 40.7 kB in 1s (38.5 kB/s)
Selecting previously unselected package tree.
(Reading database ... 144617 files and directories currently installed.)
Preparing to unpack .../tree_1.7.0-5_amd64.deb ...
Unpacking tree (1.7.0-5) ...
Setting up tree (1.7.0-5) ...
Processing triggers for man-db (2.8.3-2ubuntu0.1) ...


# Chapter 8: Model Deployment with TensorFlow Serving

This Chapter bridges the gap between DevOps engineers and Data Scientists.

Machine Learning Models can be deployed in three main ways:
 - Model Server
 - User's Browser
 - Edge Device

In [2]:
import tensorflow as tf

base_dir = "drive/My Drive/Building ML Pipelines/"
chap_dir = base_dir + "Chapter 8/"
data_dir = base_dir + "Data/"
out_dir = chap_dir + "Outputs/"
csv_data_dir = base_dir + "CSV Data/"
csv_dir = csv_data_dir + "consumer_complaints_with_narrative.csv"

## A Simple Model Server

Most introductions to deploying machine learning models follow roughly the same workflow:
 - Create a web app with Python (with Flask or Django)
 - Create an API endpoint in the web app
 - Load the model structure and its weights
 - Call the predict method on the loaded model
 - Return the prediction results as an HTTP response

In [8]:
import json
from flask import Flask, request
from tensorflow.keras.models import load_model
# from utils import preprocess

This code is for demonstration purposes only; it is not recommended for production use.

In [None]:
# Load your trained model
model = load_model("model.h5")
app = Flask(__name__)

@app.route("/classify", methods=["POST"])
def classify():
    complaint_data = request.form["complaint_data"]
    preprocessed_complaint_data = preprocess(complaint_data)
    # Perform the prediction
    prediction = model.predict([preprocessed_complaint_data])

    # Return the prediction in an HTTP response
    # (the NumPy array is converted to a list so that it is JSON serializable)
    return json.dumps({"score": prediction.tolist()})

## The Downsides of Model Deployments with Python-Based APIs

### Lack of Code Separation

Lack of code separation can lead to many problems between the Data Scientists creating the model and the API team deploying it.



### Lack of Model Version Control

Creating model versions requires extra attention to keep all endpoints structurally the same, which itself requires a lot of boilerplate code.

### Inefficient Model Inference

This happens because, e.g. in Flask, each request is preprocessed and inferred individually. As in training, you can run inference in batches, which makes it a lot more efficient, especially when run on GPUs.

## TensorFlow Serving

TensorFlow Serving allows you to deploy any TensorFlow graph and to make predictions from the graph through its standardized endpoints.
TensorFlow Serving also handles the model and version management for you.

## TensorFlow Architecture Overview

## Exporting Models for TensorFlow Serving

Depending on the type of TensorFlow model, the export steps are slightly different.<br>
For Keras models you can use:

    model.save(filepath="./saved_models", save_format="tf")


**Add a Timestamp to Your Export Path:**<br>
If the model is saved manually, it is recommended to add a timestamp to the model path.

    import time

    ts = int(time.time())
    file_path = "./saved_models/{}".format(ts)
    model.save(filepath=file_path, save_format="tf")


For TensorFlow Estimator models, you need to first declare a receiver function:

    import tensorflow as tf

    def serving_input_receiver_fn():
        # an example input feature
        input_feature = tf.compat.v1.placeholder(
            dtype=tf.string,
            shape=[None, 1],
            name="input"
            )

        fn = tf.estimator.export.build_raw_serving_input_receiver_fn(
            features={"input_feature": input_feature}
        )

        return fn

Export the Estimator model with the *export_saved_model* method for the Estimator:

    estimator = tf.estimator.Estimator(model_fn, "model", params={})
    estimator.export_saved_model(
        export_dir_base="saved_models/",
        serving_input_receiver_fn=serving_input_receiver_fn
        )
        

## Model Signatures

Model signatures identify the model graph's inputs and outputs as well as the method of the graph signature. This allows us to map serving inputs to a given graph node for the inference, which is useful if we want to update the model without changing the requests to the model server.

### Signature Methods

There are three supported signature methods:
 - predict
 - classify
 - regress

Predict is the default method.<br>
Predict:
 - Inputs: Keys
 - Outputs: Scores (Values)

Classify:
 - Inputs: Keys
 - Outputs: Classes and Scores

Regress:
 - Inputs: Keys
 - Outputs: Scores (Values)

## Inspecting Exported Models

You can install the TensorFlow Serving API by running the following command:

    pip install tensorflow-serving-api

Now you have a useful command-line tool called SavedModel Command Line Interface (CLI), which lets you:
 - Inspect the signatures of exported models
 - Test the exported models

### Inspecting the Model

*saved_model_cli* helps you understand the model dependencies without inspecting the original graph code:

    saved_model_cli show --dir saved_models/

*Output*<br>
  The given SavedModel contains the following tag-sets:<br>
  - *serve*

Once you know the tag_set you want to inspect, add it as an argument and saved_model_cli will provide you the available model signatures. Run:

    saved_model_cli show --dir saved_models/ --tag_set serve

This returns the SignatureDefs with their keys.<br>
With the tag_set and signature_def information you can now inspect the model's inputs and outputs. To do so, add the signature_def to the CLI arguments.

In [3]:
# !pip install tensorflow_serving_api

In [14]:
saved_models_dir = "drive/My Drive/Building ML Pipelines/Chapter 7/Outputs/Trainer/model/18/serving_model_dir"
print(saved_models_dir)

# Escape each space with a backslash so the shell treats the path as one argument
def to_raw(string):
    return string.replace(" ", "\\ ")

raw_saved_models_dir = to_raw(saved_models_dir)
print(raw_saved_models_dir)

drive/My Drive/Building ML Pipelines/Chapter 7/Outputs/Trainer/model/18/serving_model_dir
drive/My\ Drive/Building\ ML\ Pipelines/Chapter\ 7/Outputs/Trainer/model/18/serving_model_dir
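
As an alternative to escaping spaces by hand, Python's standard library offers `shlex.quote`, which wraps a string in single quotes whenever the shell would otherwise split it. The following is a minimal sketch that assumes the same path as above; the variable name `quoted_saved_models_dir` is only illustrative.

```python
import shlex

saved_models_dir = "drive/My Drive/Building ML Pipelines/Chapter 7/Outputs/Trainer/model/18/serving_model_dir"

# shlex.quote returns a shell-safe version of the string; a string that
# contains spaces is wrapped in single quotes.
quoted_saved_models_dir = shlex.quote(saved_models_dir)
print(quoted_saved_models_dir)

# shlex.split parses the quoted string the way a POSIX shell would,
# so the round trip should give back the original path as one argument.
print(shlex.split(quoted_saved_models_dir) == [saved_models_dir])
```

The quoted value can be passed to the `!saved_model_cli` cells in the same way as the manually escaped path.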


In [15]:
!saved_model_cli show --dir {raw_saved_models_dir}

The given SavedModel contains the following tag-sets:
serve


In [16]:
!saved_model_cli show --dir {raw_saved_models_dir} --tag_set serve

The given SavedModel MetaGraphDef contains SignatureDefs with the following keys:
SignatureDef key: "__saved_model_init_op"
SignatureDef key: "serving_default"


The following example signature is taken from the model defined and trained in Chapter 7.

In [17]:
!saved_model_cli show --dir {raw_saved_models_dir} --tag_set serve --signature_def serving_default

The given SavedModel SignatureDef contains the following input(s):
  inputs['examples'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: serving_default_examples:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['outputs'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 1)
      name: StatefulPartitionedCall_2:0
Method name is: tensorflow/serving/predict


If you want to see all signatures regardless of the tag_set and signature_def run the following:

In [18]:
!saved_model_cli show --dir {raw_saved_models_dir} --all


MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['__saved_model_init_op']:
  The given SavedModel SignatureDef contains the following input(s):
  The given SavedModel SignatureDef contains the following output(s):
    outputs['__saved_model_init_op'] tensor_info:
        dtype: DT_INVALID
        shape: unknown_rank
        name: NoOp
  Method name is: 

signature_def['serving_default']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['examples'] tensor_info:
        dtype: DT_STRING
        shape: (-1)
        name: serving_default_examples:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['outputs'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 1)
        name: StatefulPartitionedCall_2:0
  Method name is: tensorflow/serving/predict

### Testing the Model

*saved_model_cli* also lets you test the exported model with sample input data.<br>
You have three different ways to submit the sample input data for the model test inference:
 - --inputs (for arguments pointing at NumPy files)
 - --input_exprs (for arguments pointing at Python expressions to specify the input data)
 - --input_examples (for arguments pointing to data formatted as tf.Example data structure)

For testing the model, you can specify exactly one of the input arguments.<br>
Furthermore, *saved_model_cli* provides three optional arguments:
 - --outdir (target_directory, else file is written to stdout)
 - --overwrite (if output is written to a file you can specify if it can be overwritten)
 - --tf_debug (for further model inspection)

In [24]:
!saved_model_cli run --dir {raw_saved_models_dir} --tag_set serve --signature_def serving_default --input_examples "examples=[{'company': ['HSBC']}]"

## Setting Up TensorFlow Serving

TensorFlow Serving can run either on Docker or, if you run an Ubuntu OS, from the Ubuntu package.

### Docker Installation

Install TensorFlow Serving by downloading the prebuilt Docker image.

In [41]:
!docker pull tensorflow/serving

If you run the Docker container on an instance with GPUs available, you will need to download the latest build with GPU support.

In [None]:
!docker pull tensorflow/serving:latest-gpu

For the Native Ubuntu Installation see page 214 of the Book.

### Building TensorFlow Serving from Source

Currently this option is only supported on Linux operating systems, and the build tool *bazel* is required.<br>
You can find detailed instructions in the <a href="https://www.tensorflow.org/tfx/serving/setup#building_from_source">TensorFlow Serving Documentation</a>.<br>

**Optimize Your TensorFlow Serving Instances**<br>
If you build TensorFlow Serving from scratch, it is highly recommended to compile the Serving version for the specific TensorFlow version of your models and the available hardware of your serving instance.

## Configuring a TensorFlow Server

TensorFlow Serving can run in two different modes:
 - Single Model Configuration
 - Multiple Model Configuration

### Single Model Configuration

This mode is preferred when you want to run TF Serving by loading a single model and switching to newer model versions when they become available.
If you run it in a Docker environment, execute:

In [None]:
# Publish the default gRPC (8500) and REST (8501) ports, mount the model
# directory, set the model name, and specify the Docker image.
# The image serves the model from MODEL_BASE_PATH/MODEL_NAME.
!docker run -p 8500:8500 \
            -p 8501:8501 \
            --mount type=bind,source=/tmp/models,target=/models/my_model \
            -e MODEL_NAME=my_model \
            -e MODEL_BASE_PATH=/models \
            -t tensorflow/serving

For more information, see p. 216 ff. of the book.

If you want to run the Docker image prebuilt for GPUs, swap the name of the Docker image for the latest GPU build:

In [None]:
# Same command as before, but with the Docker image for GPUs.
# The image serves the model from MODEL_BASE_PATH/MODEL_NAME.
!docker run -p 8500:8500 \
            -p 8501:8501 \
            --mount type=bind,source=/tmp/models,target=/models/my_model \
            -e MODEL_NAME=my_model \
            -e MODEL_BASE_PATH=/models \
            -t tensorflow/serving:latest-gpu

In the output you should see that the server loaded our model *my_model* successfully and that it created two endpoints:
 - 1 REST endpoint
 - 1 gRPC endpoint

One great advantage of TensorFlow Serving is the *hot swap*: if a new model version is uploaded, TensorFlow Serving automatically unloads the existing model and loads the newer one for inference. In general, TensorFlow Serving loads the model with the highest version number. Rolling back a model version works the same way: delete the current version and the previous one is loaded.
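
The "highest version number wins" rule can be illustrated with a few lines of plain Python. This is only a sketch of the selection rule, not TensorFlow Serving's own code; the function name `pick_version_to_serve` and the directory names are hypothetical.

```python
def pick_version_to_serve(version_dirs):
    """Return the highest numeric version from a list of directory names."""
    versions = [int(name) for name in version_dirs if name.isdigit()]
    return max(versions) if versions else None


# Hypothetical content of a model base path: two versions and one other file.
available = ["1556250435", "1556251435", "README.txt"]
print(pick_version_to_serve(available))

# Rolling back: once the newest version directory is deleted,
# the previous version is the highest one again.
available.remove("1556251435")
print(pick_version_to_serve(available))
```

Because versions are compared as integers, a directory named `10` ranks above one named `2`, which a plain string comparison would get wrong.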

### Multiple Model Configuration

You can also configure TensorFlow Serving to load multiple models at the same time by creating a configuration file such as the following.

In [None]:
model_config_list {
    config {
        name: "my_model"
        base_path: "models/my_model"
        model_platform: "tensorflow"
    }
    config {
        name: "another_model"
        base_path: "models/another_model"
        model_platform: "tensorflow"
    }
}

Then run Docker as follows:

In [None]:
# Publish the default ports, mount the model directory and the configuration
# file, and pass the path of the configuration file to the model server.
!docker run -p 8500:8500 \
            -p 8501:8501 \
            --mount type=bind,source=/tmp/models,target=/models/my_model \
            --mount type=bind,source=/tmp/models_config,target=/models/model_config \
            -e MODEL_NAME=my_model \
            -t tensorflow/serving:latest-gpu \
            --model_config_file=/models/model_config

If TensorFlow Serving is used outside of a Docker container, you can point the model server to the configuration file with the argument *model_config_file*, which loads the configuration from the file.

In [None]:
!tensorflow_model_server --port=8500 \
                         --rest_api_port=8501 \
                         --model_config_file=/models/model_config
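
When the list of models changes often, the configuration file can be generated instead of edited by hand. The following sketch renders a `model_config_list` from name and path pairs; the helper `render_model_config` is hypothetical, and the absolute base paths are an assumption chosen to match the mount target used in the Docker examples.

```python
def render_model_config(models):
    """Render a model_config_list from (name, base_path) pairs."""
    lines = ["model_config_list {"]
    for name, base_path in models:
        lines += [
            "    config {",
            '        name: "{}"'.format(name),
            '        base_path: "{}"'.format(base_path),
            '        model_platform: "tensorflow"',
            "    }",
        ]
    lines.append("}")
    return "\n".join(lines) + "\n"


config_text = render_model_config([
    ("my_model", "/models/my_model"),
    ("another_model", "/models/another_model"),
])
print(config_text)
```

The resulting text can be written to the file that is mounted into the container or passed via *model_config_file*.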

**Running specific model versions:**

If you want to load a set of available model versions, e.g. for A/B testing, you can extend the model configuration file with:

In [None]:
model_config_list {
    config {
        name: "my_model"
        base_path: "models/my_model"
        model_platform: "tensorflow"
        model_version_policy: {all: {}}
    }
    config {
        name: "another_model"
        base_path: "models/another_model"
        model_platform: "tensorflow"
    }
}

If you want to load specific model versions, you can define them as well.

In [None]:
model_config_list {
    config {
        name: "my_model"
        base_path: "models/my_model"
        model_platform: "tensorflow"
        model_version_policy {
            specific {
                versions: 1556250435
                versions: 1556251435
            }
        }
    }
    config {
        name: "another_model"
        base_path: "models/another_model"
        model_platform: "tensorflow"
    }
}

You can even give the model version labels, which comes in handy for making predictions. This is only available through TensorFlow Serving's gRPC endpoints.

In [None]:
model_config_list {
    config {
        name: "my_model"
        base_path: "models/my_model"
        model_platform: "tensorflow"
        model_version_policy {
            specific {
                versions: 1556250435
                versions: 1556251435
            }
        }
        version_labels {
            key: "stable"
            value: 1556250435
        }
        
        version_labels {
            key: "testing"
            value: 1556251435
        }
    }
    config {
        name: "another_model"
        base_path: "models/another_model"
        model_platform: "tensorflow"
    }
}

With that you can, for example, run a model A/B test.
Starting with TensorFlow Serving 2.3, the *version_label* functionality will be available for REST endpoints too.

## REST versus gRPC

### REST

REST is a communication "protocol" used by today's web services.

### gRPC

gRPC is a remote procedure call protocol developed by Google.

#### Which protocol to use?

REST is very easy to use and is already widely available for all sorts of clients. gRPC APIs have a higher barrier to entry, but they often lead to significant performance improvements, depending on the data structures required for the model inference.
<br>
Internally TensorFlow Serving converts JSON data structures submitted via REST to tf.Example data structures, and this can lead to slower performance.

## Making Predictions from the Model Server

All following code examples concerning REST or gRPC requests are executed on the client side.

### Getting Model Predictions via REST

In [44]:
import requests

An example of a POST request:

In [47]:
url = "http://some-domain.abc"
payload = {"key_1": "value_1"}
# Submit the request
r = requests.post(url, json=payload)
# View the HTTP response
print(r.json())

### URL structure

The URL for your HTTP request to the model server contains information about which model and which version you would like to infer:

    http://{HOST}:{PORT}/v1/models/{MODEL_NAME}:{VERB}

For details, see page 225 in the book.
If you want to specify the model version for a prediction, you will need to extend the URL with the model version identifier.

    http://{HOST}:{PORT}/v1/models/{MODEL_NAME}[/versions/{MODEL_VERSION}]:{VERB}


### Payloads

TensorFlow Serving expects the input data as a JSON data structure.

    {
        "signature_name": <string> # not necessary
        "instances": <value> # list of input values or objects
    }

To submit multiple data samples, you can submit them as a list under the instances key, but if you want to submit one data example for the inference, you can use inputs and list all input values as a list.

    {
        "signature_name": <string> # not necessary
        "inputs": <value>
    }

**Either "inputs" or "instances" must be provided, but not both at the same time!**
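
The "exactly one of the two keys" rule can be enforced on the client before a request is sent. The helper below is a hypothetical sketch using only the standard library; the signature name `serving_default` is taken from the model inspected earlier.

```python
import json


def build_payload(instances=None, inputs=None, signature_name=None):
    """Build a request body with exactly one of "instances" or "inputs"."""
    if (instances is None) == (inputs is None):
        raise ValueError('Provide exactly one of "instances" or "inputs".')
    payload = {}
    if signature_name is not None:
        payload["signature_name"] = signature_name
    if instances is not None:
        payload["instances"] = instances
    else:
        payload["inputs"] = inputs
    return payload


# Several samples under "instances"
print(json.dumps(build_payload(instances=["first text", "second text"])))
# One sample under "inputs", with an explicit signature name
print(json.dumps(build_payload(inputs=["single text"], signature_name="serving_default")))
```

The returned dictionary can be passed as the `json` argument of `requests.post`.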

Example model prediction request with a Python client:

In [48]:
import requests

In [None]:
def get_rest_request(text, model_name="my_model"):
    # Exchange localhost with an IP address if the server is not running on the same machine
    url = "http://localhost:8501/v1/models/{}:predict".format(model_name)
    # Add more examples to the instances list if you want to infer more samples
    payload = {"instances": [text]}
    response = requests.post(url=url, json=payload)
    
    return response

rs_rest = get_rest_request(text="classify my text")
rs_rest.json()

### Using TensorFlow Serving via gRPC

First you need to establish a gRPC channel, which provides the connection to the gRPC server at a given host address and over a given port. Once the channel is established, you create a stub, i.e. a local object that replicates the available methods from the server.

In [4]:
import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

In [5]:
def create_grpc_stub(host, port=8500):
    hostport = "{}:{}".format(host, port)
    channel = grpc.insecure_channel(hostport)
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    return stub

Once the gRPC stub is created, we can set the model and the signature to access predictions from the correct model and submit our data for the inference.

In [6]:
def grpc_request(stub, data_sample, model_name="my_model", signature_name="classification"):
    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_name
    request.model_spec.signature_name = signature_name
    # inputs is the name of the input of our neural network
    request.inputs["inputs"].CopyFrom(tf.make_tensor_proto(data_sample, shape=[1, 1]))
    # 10 is the max time in seconds before the function times out
    result_future = stub.Predict.future(request, 10)
    
    return result_future

Infer the example dataset, with the two function calls:

In [None]:
stub = create_grpc_stub(host, port=8500)
rs_grpc = grpc_request(stub, data)

**Secure Connections**

The grpc library also provides functionality to connect securely to the gRPC endpoints from the client side:

In [None]:
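# client_cert_file, client_key_file, ca_cert_file, and hostport are
# placeholders that need to be defined for your setup
cert = open(client_cert_file, "rb").read()
key = open(client_key_file, "rb").read()
ca_cert = open(ca_cert_file, "rb").read() if ca_cert_file else None
credentials = grpc.ssl_channel_credentials(
    root_certificates=ca_cert,
    private_key=key,
    certificate_chain=cert
)

channel = grpc.secure_channel(hostport, credentials)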

TF Serving can terminate secure connections if SSL is configured; to set this up, follow the example on page 229 in the book.<br>
Once the configuration file is created, you can pass it to the TensorFlow Serving argument --ssl_config_file when starting TensorFlow Serving.

In [None]:
!tensorflow_model_server --port=8500 \
                         --rest_api_port=8501 \
                         --model_name=my_model \
                         --model_base_path=/models/my_model \
                         --ssl_config_file="<path_to_config_file>"

#### Getting predictions from classification and regression models

To get predictions from a classification model, you can use the gRPC API as follows:

In [None]:
from tensorflow_serving.apis import classification_pb2

def grpc_request(stub, data_sample, model_name="my_model", signature_name="classification"):
    request = classification_pb2.ClassificationRequest()
    request.model_spec.name = model_name
    request.model_spec.signature_name = signature_name
    # Classification requests carry tf.train.Example protos in the input field,
    # so data_sample is expected to be a tf.train.Example here
    request.input.example_list.examples.extend([data_sample])
    # 10 is the max time in seconds before the function times out
    result_future = stub.Classify.future(request, 10)

    return result_future

For predictions from a regression model, do:

In [7]:
from tensorflow_serving.apis import regression_pb2

In [None]:
def grpc_request(stub, data_sample, model_name="my_model", signature_name="classification"):
    request = regression_pb2.RegressionRequest()
    request.model_spec.name = model_name
    request.model_spec.signature_name = signature_name
    # Regression requests carry tf.train.Example protos in the input field,
    # so data_sample is expected to be a tf.train.Example here
    request.input.example_list.examples.extend([data_sample])
    # 10 is the max time in seconds before the function times out
    result_future = stub.Regress.future(request, 10)

    return result_future

#### Payloads

The gRPC API uses Protocol Buffers as the data structures for the API request, which use less bandwidth than JSON payloads. Depending on the model input data structure, you can even get faster predictions than with REST endpoints. For the conversion of the data, TensorFlow provides the function *tf.make_tensor_proto*, which accepts various data formats, including scalars, lists, NumPy scalars and NumPy arrays.

## Model A/B Testing with TensorFlow Serving

In [1]:
from random import random

In [3]:
def get_rest_url(model_name, host="localhost", port=8501, verb="predict", version=None):
    # No trailing slash: the verb is appended directly with a colon
    url = "http://{}:{}/v1/models/{}".format(host, port, model_name)
    if version:
        url += "/versions/{}".format(version)
    url += ":{}".format(verb)

    return url

In [None]:
# Submit 10% of all requests from this client to version 1.
# 90 % of the requests should go to the default models.
threshold = 0.1
# If version = None, TensorFlow Serving will infer with the default version.
version = 1 if random() < threshold else None

url = get_rest_url(model_name="complaints_classification", version=version)
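
To check that such a threshold produces the intended split, the routing decision can be simulated without a model server. This is a minimal sketch; the helper `choose_version`, the fixed seed, and the number of simulated requests are illustrative assumptions.

```python
from collections import Counter
from random import Random


def choose_version(rng, threshold=0.1):
    """Return 1 for roughly `threshold` of the calls, otherwise None."""
    return 1 if rng.random() < threshold else None


rng = Random(42)  # fixed seed so that the simulation is repeatable
n_requests = 10_000
counts = Counter(choose_version(rng) for _ in range(n_requests))

share_v1 = counts[1] / n_requests
print("share routed to version 1: {:.3f}".format(share_v1))
print("share routed to the default version: {:.3f}".format(counts[None] / n_requests))
```

With a threshold of 0.1, the share routed to version 1 should settle close to 10% as the number of requests grows.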

If you would like to extend these capabilities by performing the random routing of the model inference on the server side, we highly recommend routing tools like <a href="https://istio.io/">Istio</a> for this purpose. Istio was originally designed for web traffic and can be used to route traffic to specific models.

## Requesting Model Metadata from the Model Server

A critical component of the continuous life cycle is generating accuracy or general performance feedback about your model versions. Therefore, we have to know which model version made a correct or a false prediction in order to improve it afterwards.

### REST Requests for Model Metadata

TensorFlow Serving provides you an endpoint for model metadata:

    http://{HOST}:{PORT}/v1/models/{MODEL_NAME}[/versions/{MODEL_VERSION}]/metadata

Similar to the REST API inference requests, you have the option to specify the model version in the request URL. We can request model metadata with a single GET request.

In [4]:
import requests

In [5]:
def metadata_rest_request(model_name, host="localhost", port=8501, version=None):
    url = "http://{}:{}/v1/models/{}".format(host, port, model_name)
    if version:
        url += "/versions/{}".format(version)
    # Append /metadata for model information
    url += "/metadata"
    # Perform GET request
    response = requests.get(url=url)

    return response

The response resembles the specification dictionaries in the model_config_list examples above, but contains a *model_spec* and a *metadata* dictionary.
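
Once the response is parsed as JSON, the loaded version and the signature inputs can be read from these two dictionaries. The sketch below works on a hand-written sample rather than output captured from a server; the nested layout (`metadata`, then `signature_def` twice) is an assumption that should be checked against a real response, and `list_signature_inputs` is a hypothetical helper.

```python
def list_signature_inputs(metadata_response):
    """Map each signature name to the sorted names of its inputs."""
    signature_defs = (
        metadata_response.get("metadata", {})
        .get("signature_def", {})
        .get("signature_def", {})
    )
    return {
        name: sorted(signature.get("inputs", {}))
        for name, signature in signature_defs.items()
    }


# Hand-written sample, not output captured from a real server.
sample_response = {
    "model_spec": {"name": "my_model", "signature_name": "", "version": "1556250435"},
    "metadata": {
        "signature_def": {
            "signature_def": {
                "serving_default": {
                    "inputs": {"examples": {"dtype": "DT_STRING"}},
                    "outputs": {"outputs": {"dtype": "DT_FLOAT"}},
                    "method_name": "tensorflow/serving/predict",
                }
            }
        }
    },
}

print(sample_response["model_spec"]["version"])
print(list_signature_inputs(sample_response))
```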

### gRPC for Model Metadata

Here you file a GetModelMetadataRequest, add the model name to the specifications and submit the request via the GetModelMetadata method of the stub.

In [None]:
def get_model_version(model_name, stub):
    request = get_model_metadata_pb2.GetModelMetadataRequest()
    request.model_spec.name = model_name
    request.metadata_field.append("signature_def")
    # 5 is the max time in seconds before the function times out
    response = stub.GetModelMetadata(request, 5)

    return response

model_name = "complaints_classification"
stub = create_grpc_stub("localhost")
get_model_version(model_name, stub)

The gRPC response contains a ModelSpec object that contains the version number of the loaded model. Even more useful is obtaining the model signature information of the loaded model.

In [7]:
from tensorflow_serving.apis import get_model_metadata_pb2

In [None]:
def get_model_metadata(model_name, stub):
    request = get_model_metadata_pb2.GetModelMetadataRequest()
    request.model_spec.name = model_name
    request.metadata_field.append("signature_def")
    # 5 is the max time in seconds before the function times out
    response = stub.GetModelMetadata(request, 5)

    return response.metadata["signature_def"]


model_name = "complaints_classification"
stub = create_grpc_stub("localhost")
meta = get_model_metadata(model_name, stub)

# The information needs to be serialized to be human readable
# Use the SerializeToString function to convert the protocol buffer information
print(meta.SerializeToString().decode("utf-8", "ignore"))

## Batching Inference Requests

If we handle each inference request individually instead of batching requests as in training, we under-utilize the available memory of the CPU or GPU. With batching, multiple clients can request model predictions and the server combines the different client requests into one "batch" to compute.
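
The idea can be sketched in plain Python: queued requests are grouped into batches, and the model is called once per batch instead of once per request. This is a simplified illustration, not TensorFlow Serving's scheduler; `make_batches` and `fake_model` are hypothetical stand-ins.

```python
def make_batches(requests, max_batch_size):
    """Split queued requests into batches of at most max_batch_size."""
    return [
        requests[start:start + max_batch_size]
        for start in range(0, len(requests), max_batch_size)
    ]


def fake_model(batch):
    """Stand-in for a model call: returns one score per input."""
    return [len(text) for text in batch]


queued = ["a", "bb", "ccc", "dddd", "eeeee"]
batches = make_batches(queued, max_batch_size=2)
predictions = [score for batch in batches for score in fake_model(batch)]

print(len(batches), "model calls instead of", len(queued))
print(predictions)
```

A real server additionally has to decide how long to wait for a batch to fill up before running it.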

## Configuring Batch Predictions

In TensorFlow Serving you have five configuration options for batching predictions:
 - max_batch_size (Int)
 - batch_timeout_micros (Int)
 - num_batch_threads (Int)
 - max_enqueued_batches (Int)
 - pad_variable_length_inputs (Bool)

Setting the parameters for optimal batching requires some tuning and is application dependent; see page 239 of the book for initial guidance. TensorFlow Serving will make predictions on the batch when either the max_batch_size or the timeout is reached.<br>
For CPU predictions, tune num_batch_threads to match the number of CPU cores, and for GPU predictions, set max_batch_size to get optimal utilization of the GPU memory.<br>
The parameters can be set in a text file like:

    max_batch_size { value: 32 }
    batch_timeout_micros { value: 5000 }
    pad_variable_length_inputs: true
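
If the values are tuned from a script, the file content can be rendered from Python and then written to the path that is mounted into the container. This is a minimal sketch covering only the three parameters shown above; `render_batching_parameters` is a hypothetical helper.

```python
def render_batching_parameters(max_batch_size, batch_timeout_micros, pad_variable_length_inputs):
    """Render the batching parameters in the text format shown above."""
    return (
        "max_batch_size {{ value: {} }}\n"
        "batch_timeout_micros {{ value: {} }}\n"
        "pad_variable_length_inputs: {}\n"
    ).format(
        max_batch_size,
        batch_timeout_micros,
        str(bool(pad_variable_length_inputs)).lower(),  # True -> "true"
    )


batching_text = render_batching_parameters(32, 5000, True)
print(batching_text)
```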

To enable batching, you need to pass two additional parameters to the Docker container running TensorFlow Serving:
 - enable_batching = true
 - batching_parameters_file = "path/of/the/batching/configuration/file/inside/of/the/container"

**Complete Example of the *docker run* command:**

    # Publish the default ports, mount the model directory and the batching
    # configuration, and enable batching on the Docker image for GPUs.
    !docker run -p 8500:8500 \
                -p 8501:8501 \
                --mount type=bind,source=/tmp/models,target=/models/my_model \
                --mount type=bind,source=/path/to/batch_config,target=/server_config \
                -e MODEL_NAME=my_model \
                -t tensorflow/serving:latest-gpu \
                --enable_batching=true \
                --batching_parameters_file=/server_config/batching_parameters.txt

It is highly recommended to make use of TensorFlow Serving's batching feature, because it is especially useful for inferring a large number of data samples with offline batch processes.

## Other TensorFlow Serving Optimizations

 - --file_system_poll_wait_seconds=1
 - --tensorflow_session_parallelism=0
 - --tensorflow_intra_op_parallelism=0
 - --tensorflow_inter_op_parallelism=0

As above, you can pass these arguments to the docker run command to improve performance and avoid unnecessary cloud provider charges.

## TensorFlow Serving Alternatives

 - <a href="https://docs.bentoml.org/en/latest/">BentoML</a>
 - <a href="https://www.seldon.io/tech/products/core/">Seldon</a>
 - <a href="https://oracle.github.io/graphpipe/#/">GraphPipe</a>
 - <a href="https://stfs.readthedocs.io/en/latest/">Simple TensorFlow Serving</a>
 - <a href="https://mlflow.org/">MLflow</a>
 - <a href="https://ray.io/">Ray Serve</a>

## Deploying with Cloud Providers

Until now, every model server solution had to be managed by you personally.<br>
But all primary cloud providers - Google Cloud, AWS and Microsoft Azure - offer machine learning products, including hosting of machine learning models.

### Use Cases

 - Seamless model deployment
 - No scaling problems

But these advantages come at a cost, and they have the further downside of requiring deployment via the providers' own software development kits.


### Example Deployment with GCP

Instead of writing configuration files and executing terminal commands, we can set up model endpoints through a web UI.

**Limits of Model Size on GCP's AI Platform**

GCP's endpoints are limited to model sizes up to 500 MB. However there are options to increase that up to 2GB with the compute engines of type N1.

#### Model Deployment

The deployment consists of three steps:
 - Make the model accessible on Google Cloud
 - Create a new model instance with Google Cloud's AI Platform
 - Create a new version with the model instance

For the details see pages 246 - 252 in the book.

#### Model Inference

To connect to the Google Cloud API, you will need to install the library google-api-python-client with:

In [None]:
!pip install google-api-python-client

All Google services can be connected via a service object.

In [8]:
import googleapiclient.discovery

In [9]:
def _connect_service():
    return googleapiclient.discovery.build(serviceName="ml", version="v1")

Similar to the REST and gRPC examples, we nest our inference data under a fixed instances key, which carries a list of input dictionaries.

In [10]:
def _generate_payload(sentence):
    return {"instances": [{"sentence": sentence}]}

Now request the prediction from the Google Cloud-hosted machine learning model:

In [None]:
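project = "yourGCPProjectName"
model_name = "demo_model"
version_name = "v1"
# Placeholder input text; replace it with your own data
sentence = "classify my text"
# Create the service object with the helper defined above
service = _connect_service()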
request = service.projects().predict(
    name="projects/{}/models/{}/versions/{}".format(
        project,
        model_name,
        version_name
    ),
    body=_generate_payload(sentence)
)
response = request.execute()

The Google Cloud AI Platform response contains the predicted scores for the different categories, similar to a REST response from a TensorFlow Serving instance.

# Model Deployment with TFX Pipelines

See page 255.

# References and Additional Resources

 - <a href="https://github.com/NVIDIA/nvidia-docker#quick-start">Nvidia Container Toolkit</a> to use Docker with GPU support
 - <a href="https://istio.io/">Istio</a> for phasing models, performing A/B tests or creating policies for data routed to specific models.
 - <a href="https://docs.bentoml.org/en/latest/">BentoML</a>
 - <a href="https://www.seldon.io/tech/products/core/">Seldon</a>
 - <a href="https://oracle.github.io/graphpipe/#/">GraphPipe</a>
 - <a href="https://stfs.readthedocs.io/en/latest/">Simple TensorFlow Serving</a>
 - <a href="https://mlflow.org/">MLflow</a>
 - <a href="https://ray.io/">Ray Serve</a>
 - <a href="https://github.com/tensorflow/serving/blob/master/tensorflow_serving/config/ssl_config.proto">SSL config file</a> for configuration of a secure gRPC channel
 - <a href="https://github.com/tensorflow/serving">TensorFlow Serving Github</a>