## Lesson Overview 

1. TF Serving Overview & Architecture

2. Review & launch model server

3. Predict using TensorFlow Serving

Before you continue:

+ Please make sure you have installed Docker in your environment.
+ Ensure your trained model exists and has been exported for TF Serving.

# TensorFlow Serving

[TensorFlow Serving](https://www.tensorflow.org/tfx/serving/) is an open-source software library for serving machine learning models. Here are some features of TF Serving:

1. A high-performance model hosting system, designed primarily for synchronous inference but also supporting bulk processing (e.g. map-reduce) in production environments.

2. TF Serving includes support for model lifecycle management. Multiple models, or multiple versions of the same model can be served simultaneously.
 
3. Facilitates canarying new versions, migrating clients to new models or versions, and A/B testing experimental models.

4. TensorFlow Serving comes with a scheduler that groups individual inference requests into batches for joint execution on a GPU, with configurable latency controls.

5. TensorFlow Serving has out-of-the-box support for TensorFlow models. 

6. In addition to trained TensorFlow models, TF servables can include other assets needed for inference such as embeddings, vocabularies and feature transformation configs, or even non-TensorFlow-based machine learning models.

7. The architecture is highly modular. You can use some parts individually (e.g. batch scheduling) or use all the parts together. 
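To make the request batching mentioned in the list above more concrete, here is a toy sketch of grouping queued requests into batches bounded by a maximum size and a maximum wait. It is not TF Serving's scheduler: `Request`, `group_into_batches`, `max_batch_size` and `max_wait_seconds` are names invented for this illustration, and the arrival times are made up.

```python
from collections import namedtuple

# Hypothetical request record: an id plus its arrival time in seconds.
Request = namedtuple('Request', ['request_id', 'arrival_time'])


def group_into_batches(requests, max_batch_size, max_wait_seconds):
    """Groups requests (sorted by arrival time) into batches.

    A batch is closed when it is full, or when the next request arrives more
    than max_wait_seconds after the first request in the batch.
    """
    batches = []
    current = []
    for request in requests:
        too_late = bool(current) and (
            request.arrival_time - current[0].arrival_time > max_wait_seconds)
        if current and (len(current) >= max_batch_size or too_late):
            batches.append(current)
            current = []
        current.append(request)
    if current:
        batches.append(current)
    return batches


arrival_times = [0.000, 0.001, 0.002, 0.050, 0.051]
requests = [Request(i, t) for i, t in enumerate(arrival_times)]
batches = group_into_batches(requests, max_batch_size=2, max_wait_seconds=0.010)
print([[r.request_id for r in batch] for batch in batches])
# prints [[0, 1], [2], [3, 4]]
```

A larger `max_batch_size` improves throughput on an accelerator, while a smaller `max_wait_seconds` bounds the extra latency a request can pick up while waiting for its batch to fill.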

## Key Concepts

[Servables](https://www.tensorflow.org/tfx/serving/overview#servables) are the underlying objects that clients use to perform computation (for example, a lookup or inference). Servables do not manage their own lifecycle. Typical servables include the following:

+ a TensorFlow SavedModelBundle (tensorflow::Session)
+ a lookup table for embedding or vocabulary lookups


[Servable Versions](https://www.tensorflow.org/tfx/serving/overview#servable_versions): TensorFlow Serving can handle one or more versions of a servable over the lifetime of a single server instance.

[Servable Streams](https://www.tensorflow.org/tfx/serving/overview#servable_streams): A servable stream is the sequence of versions of a servable, sorted by increasing version numbers.
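As a rough illustration of a servable stream, the sketch below lists the version subdirectories of a model base path in increasing order. It assumes versions are stored as all-digit directory names, matching the `/models/nyc-taxi/1551042367` layout that appears in the server logs in this lesson. `list_servable_stream` is a hypothetical helper, and the directory used here is a throwaway temp directory.

```python
import os
import tempfile


def list_servable_stream(model_base_path):
    """Returns the numeric version subdirectories in increasing order."""
    versions = []
    for entry in os.listdir(model_base_path):
        full_path = os.path.join(model_base_path, entry)
        # Only all-digit directory names are treated as versions in this sketch.
        if entry.isdigit() and os.path.isdir(full_path):
            versions.append(int(entry))
    return sorted(versions)


# Demonstration on a temporary directory; 'tmp' is ignored as a non-version.
base = tempfile.mkdtemp()
for name in ['1551167001', '1551042367', 'tmp', '1551254584']:
    os.mkdir(os.path.join(base, name))

stream = list_servable_stream(base)
print(stream)      # versions in increasing order
print(stream[-1])  # the newest version in the stream
```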

[Models](https://www.tensorflow.org/tfx/serving/overview#models): TensorFlow Serving represents a model as one or more servables.

[Loaders](https://www.tensorflow.org/tfx/serving/overview#loaders) manage a servable's lifecycle.

[Sources](https://www.tensorflow.org/tfx/serving/overview#sources) are plugin modules that find and provide servables.

[Aspired versions](https://www.tensorflow.org/tfx/serving/overview#aspired_versions) represent the set of servable versions that should be loaded and ready. 

[Managers](https://www.tensorflow.org/tfx/serving/overview#managers) handle the full lifecycle of Servables, including:

+ loading Servables
+ serving Servables
+ unloading Servables

Managers listen to Sources and track all versions. 
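The interplay between aspired versions and a Manager can be sketched with a toy class. This is an illustration of the idea only, not the TF Serving C++ API: `ToyManager` and `set_aspired_versions` are hypothetical names, and "loading" is reduced to set bookkeeping.

```python
class ToyManager(object):
    """Tracks which versions of one servable are loaded (illustration only)."""

    def __init__(self):
        self.loaded = set()

    def set_aspired_versions(self, aspired):
        """Loads versions that are aspired but not loaded; unloads the rest.

        Returns a (to_load, to_unload) pair of sorted version lists.
        """
        aspired = set(aspired)
        to_load = sorted(aspired - self.loaded)
        to_unload = sorted(self.loaded - aspired)
        # Load before unloading so that some version stays available.
        self.loaded |= set(to_load)
        self.loaded -= set(to_unload)
        return to_load, to_unload


manager = ToyManager()
print(manager.set_aspired_versions([1]))     # ([1], [])
print(manager.set_aspired_versions([1, 2]))  # ([2], [])
print(manager.set_aspired_versions([2]))     # ([], [1])
```

In this picture a Source would be whatever calls `set_aspired_versions`, for example after noticing a new version directory on disk.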

![title](../assets/tf_serving_servable.png)

## Model Server

We will serve our TensorFlow model using Docker. Please ensure you have Docker installed. For more details, visit [Using TensorFlow Serving with Docker](https://www.tensorflow.org/tfx/serving/docker). You will also need the gRPC packages, which you can install with `pip install grpcio grpcio-tools` or by installing the package dependencies listed in the `requirements.txt` available in this repo.

### Running a serving image
The serving images (both CPU and GPU) have the following properties:

```
Port 8500 exposed for gRPC
Port 8501 exposed for the REST API
Optional environment variable MODEL_NAME (defaults to model)
Optional environment variable MODEL_BASE_PATH (defaults to /models)
```
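For the REST port, the sketch below builds the URLs and a JSON request body using only the standard library. The URL patterns (`/v1/models/<name>` for status and `/v1/models/<name>:predict` for prediction) and the `{"b64": ...}` encoding for binary strings follow the TF Serving REST API. Everything else is an assumption made for illustration: the helper names, that port 8501 is published on your host (the `docker run` command used in this lesson publishes a single port mapping), and that the `predict` signature takes serialized `tf.Example` strings under the input name `examples`.

```python
import base64
import json


def model_status_url(host, rest_port, model_name):
    """URL that reports the status of a served model."""
    return 'http://{}:{}/v1/models/{}'.format(host, rest_port, model_name)


def predict_url(host, rest_port, model_name):
    """URL that accepts POSTed predict requests."""
    return model_status_url(host, rest_port, model_name) + ':predict'


def build_predict_body(serialized_examples, signature_name='predict',
                       input_name='examples'):
    """Builds a row-format JSON body; bytes are sent base64-encoded."""
    instances = [
        {input_name: {'b64': base64.b64encode(example).decode('ascii')}}
        for example in serialized_examples
    ]
    return json.dumps({'signature_name': signature_name,
                       'instances': instances})


print(predict_url('127.0.0.1', 8501, 'nyc-taxi'))
# The bytes below are a placeholder, not a real serialized tf.Example.
print(build_predict_body([b'placeholder-bytes']))
```

The body can then be sent as an HTTP POST to the predict URL with any HTTP client.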


If you look through the source code for `start_model_server.sh`, you'll find the following CLI command. It runs the Docker container, publishes the container's port to a port on your host, and mounts the host path of your SavedModel at the location where the container expects models:

```
docker run -it \
  -p 127.0.0.1:$HOST_PORT:$CONTAINER_PORT \
  -v $MODEL_BASE_PATH:$CONTAINER_MODEL_BASE_PATH \
  -e MODEL_NAME=nyc-taxi \
  --rm $DOCKER_IMAGE_NAME
```
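The same invocation can be assembled from Python, which makes the role of each flag explicit. This is a sketch only: the values below (host port 9000, container port 8500, image `tensorflow/serving`, and the placeholder host path) are assumptions and are not read from `start_model_server.sh`, and `build_docker_run_args` is a hypothetical helper.

```python
def build_docker_run_args(host_port, container_port, host_model_path,
                          container_model_path, model_name, image):
    """Returns a docker run argument list mirroring the command above."""
    return [
        'docker', 'run', '-it',
        # Publish the container port on the host's loopback interface only.
        '-p', '127.0.0.1:{}:{}'.format(host_port, container_port),
        # Mount the exported SavedModel where the container expects models.
        '-v', '{}:{}'.format(host_model_path, container_model_path),
        # Tell the model server which model name to serve.
        '-e', 'MODEL_NAME={}'.format(model_name),
        # Remove the container when it exits.
        '--rm', image,
    ]


args = build_docker_run_args(
    host_port=9000,                                  # assumed value
    container_port=8500,                             # gRPC port of the image
    host_model_path='/path/to/export/nyc-taxi',      # placeholder path
    container_model_path='/models/nyc-taxi',
    model_name='nyc-taxi',
    image='tensorflow/serving')
print(' '.join(args))
# To launch the container for real: subprocess.call(args)
```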

#### The following command is an example of how to run the model server...

```bash ./start_model_server.sh```

#### Output...

```
Download TF Serving docker image: tensorflow/serving
Using default tag: latest
latest: Pulling from tensorflow/serving
Digest: sha256:1aaf111b4abb9f2aee618d13f556ab24fee4fff4c44993683772643a7c513b1d
Status: Image is up to date for tensorflow/serving:latest
Starting the Model Server to serve from: /Users/arm/code/tfx/oreilly/tf/run_0/serving_model_dir/export/nyc-taxi
Model directory: /Users/arm/code/tfx/oreilly/tf/run_0/serving_model_dir/export/nyc-taxi
2019-02-25 00:49:10.535859: I tensorflow_serving/model_servers/server.cc:82] Building single TensorFlow model file config:  model_name: nyc-taxi model_base_path: /models/nyc-taxi
2019-02-25 00:49:10.536171: I tensorflow_serving/model_servers/server_core.cc:461] Adding/updating models.
2019-02-25 00:49:10.536244: I tensorflow_serving/model_servers/server_core.cc:558]  (Re-)adding model: nyc-taxi
2019-02-25 00:49:10.653355: I tensorflow_serving/core/basic_manager.cc:739] Successfully reserved resources to load servable {name: nyc-taxi version: 1551042367}
2019-02-25 00:49:10.653463: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: nyc-taxi version: 1551042367}
2019-02-25 00:49:10.653501: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: nyc-taxi version: 1551042367}
2019-02-25 00:49:10.654161: I external/org_tensorflow/tensorflow/contrib/session_bundle/bundle_shim.cc:363] Attempting to load native SavedModelBundle in bundle-shim from: /models/nyc-taxi/1551042367
2019-02-25 00:49:10.654272: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:31] Reading SavedModel from: /models/nyc-taxi/1551042367
2019-02-25 00:49:10.660314: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:54] Reading meta graph with tags { serve }
2019-02-25 00:49:10.690752: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:162] Restoring SavedModel bundle.
2019-02-25 00:49:10.712442: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:138] Running MainOp with key saved_model_main_op on SavedModel bundle.
2019-02-25 00:49:10.722428: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:259] SavedModel load for tags { serve }; Status: success. Took 68199 microseconds.
2019-02-25 00:49:10.723922: I tensorflow_serving/servables/tensorflow/saved_model_warmup.cc:83] No warmup data file found at /models/nyc-taxi/1551042367/assets.extra/tf_serving_warmup_requests
2019-02-25 00:49:10.738422: I tensorflow_serving/core/loader_harness.cc:86] Successfully loaded servable version {name: nyc-taxi version: 1551042367}
2019-02-25 00:49:10.741308: I tensorflow_serving/model_servers/server.cc:286] Running gRPC ModelServer at 0.0.0.0:8500 ...
[warn] getaddrinfo: address family for nodename not supported
2019-02-25 00:49:10.743122: I tensorflow_serving/model_servers/server.cc:302] Exporting HTTP/REST API at:localhost:8501 ...
[evhttp_server.cc : 237] RAW: Entering the event loop ...
```
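When starting the server from a script, it can be handy to confirm from the log that the servable actually loaded. The sketch below scans log text for the `Successfully loaded servable version` line shown above. The regular expression is written against this sample output only, so treat the exact log format as an assumption that may differ between TF Serving releases. `find_loaded_servables` is a hypothetical helper.

```python
import re

# Matches e.g. "Successfully loaded servable version {name: nyc-taxi version: 1551042367}"
LOADED_PATTERN = re.compile(
    r'Successfully loaded servable version \{name: (\S+) version: (\d+)\}')


def find_loaded_servables(log_text):
    """Returns (name, version) pairs for every 'Successfully loaded' line."""
    return [(name, int(version))
            for name, version in LOADED_PATTERN.findall(log_text)]


sample_log = (
    '2019-02-25 00:49:10.653501: I tensorflow_serving/core/loader_harness.cc:74] '
    'Loading servable version {name: nyc-taxi version: 1551042367}\n'
    '2019-02-25 00:49:10.738422: I tensorflow_serving/core/loader_harness.cc:86] '
    'Successfully loaded servable version {name: nyc-taxi version: 1551042367}\n'
)
print(find_loaded_servables(sample_log))
# prints [('nyc-taxi', 1551042367)]
```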

## Third party packages already installed!

Third-party dependencies are listed in `requirements.txt` and have already been installed.

## Predict using TensorFlow Serving

To make predictions with our trained model, you need a client that can send requests over the [gRPC protocol](https://grpc.io/).

If you look through the source code for `start_predict.sh`, you'll find a CLI command that sends a set of instances for prediction. The cells below show Python code for the same task.

In [1]:
from __future__ import division
from __future__ import print_function

from tensorflow_transform import coders as tft_coders
from tensorflow_transform.tf_metadata import dataset_schema
from tensorflow_transform.tf_metadata import schema_utils

from google.protobuf import text_format
from tensorflow.python.lib.io import file_io
from tensorflow_metadata.proto.v0 import schema_pb2

In [2]:
# Categorical features are assumed to each have a maximum value in the dataset.
MAX_CATEGORICAL_FEATURE_VALUES = [2]

CATEGORICAL_FEATURE_KEYS = []

DENSE_FLOAT_FEATURE_KEYS = ['trip_distance', 'passenger_count', 'tip_amount']

# Number of buckets used by tf.transform for encoding each feature.
FEATURE_BUCKET_COUNT = 10

BUCKET_FEATURE_KEYS = ['pickup_hour', 
                       'pickup_month', 
                       'pickup_day_of_week', 
                       'dropoff_month',
                       'dropoff_hour',
                       'dropoff_day_of_week']

# Number of vocabulary terms used for encoding VOCAB_FEATURES by tf.transform
VOCAB_SIZE = 1000

# Count of out-of-vocab buckets in which unrecognized VOCAB_FEATURES are hashed.
OOV_SIZE = 10

VOCAB_FEATURE_KEYS = []

LABEL_KEY = 'fare_amount'

CSV_COLUMN_NAMES = [
    'vendor_id',
    'pickup_month',
    'pickup_hour',
    'pickup_day_of_week',
    'dropoff_month',
    'dropoff_hour',
    'dropoff_day_of_week',
    'passenger_count',
    'trip_distance',
    'fare_amount',
    'tip_amount',
    'payment_type',
    'trip_type',]

    # 'store_and_fwd_flag',

def transformed_name(key):
    return key + '_xf'


def transformed_names(keys):
    return [transformed_name(key) for key in keys]


# Tf.Transform considers these features as "raw"
def get_raw_feature_spec(schema):
    return schema_utils.schema_as_feature_spec(schema).feature_spec


def make_proto_coder(schema):
    raw_feature_spec = get_raw_feature_spec(schema)
    raw_schema = dataset_schema.from_feature_spec(raw_feature_spec)
    return tft_coders.ExampleProtoCoder(raw_schema)


def make_csv_coder(schema):
    """Return a coder for tf.transform to read csv files."""
    raw_feature_spec = get_raw_feature_spec(schema)
    parsing_schema = dataset_schema.from_feature_spec(raw_feature_spec)
    return tft_coders.CsvCoder(CSV_COLUMN_NAMES, parsing_schema)


def clean_raw_data_dict(input_dict, raw_feature_spec):
    """Clean raw data dict."""
    output_dict = {}

    for key in raw_feature_spec:
        if key not in input_dict or not input_dict[key]:
            output_dict[key] = []
        else:
            output_dict[key] = [input_dict[key]]
    return output_dict


def read_schema(path):
    """Reads a schema from the provided location.

    Args:
        path: The location of the file holding a Schema proto in text format.

    Returns:
        An instance of Schema parsed from the file.
    """
    result = schema_pb2.Schema()
    contents = file_io.read_file_to_string(path)
    text_format.Parse(contents, result)
    return result
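One detail of `clean_raw_data_dict` is worth seeing in action: it treats every falsy value as missing, so a legitimate `0` or `0.0` (for example a `tip_amount` of zero) is turned into an empty list. The sketch below repeats the helper so the block runs on its own and compares it with a hypothetical variant, `clean_raw_data_dict_keep_zeros`, that only treats `None` and the empty string as missing. Whether zeros should be kept depends on your data, so this is an option to consider rather than a required change.

```python
def clean_raw_data_dict(input_dict, raw_feature_spec):
    """Same logic as the helper above, repeated so this block runs alone."""
    output_dict = {}
    for key in raw_feature_spec:
        if key not in input_dict or not input_dict[key]:
            output_dict[key] = []
        else:
            output_dict[key] = [input_dict[key]]
    return output_dict


def clean_raw_data_dict_keep_zeros(input_dict, raw_feature_spec):
    """Hypothetical variant: only None and '' count as missing."""
    output_dict = {}
    for key in raw_feature_spec:
        value = input_dict.get(key)
        if value is None or value == '':
            output_dict[key] = []
        else:
            output_dict[key] = [value]
    return output_dict


# Only the keys of the feature spec are iterated, so a list of names suffices.
feature_keys = ['trip_distance', 'passenger_count', 'tip_amount']
row = {'trip_distance': 2.5, 'passenger_count': 1, 'tip_amount': 0.0}

original = clean_raw_data_dict(row, feature_keys)
variant = clean_raw_data_dict_keep_zeros(row, feature_keys)
print(original['tip_amount'])  # [] because 0.0 is falsy
print(variant['tip_amount'])   # [0.0]
```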

In [3]:
from __future__ import print_function
import argparse
import base64
import os
import subprocess
import tempfile
from grpc.beta import implementations
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2

from tensorflow.python.lib.io import file_io  # pylint: disable=g-direct-tensorflow-import

INFERENCE_TIMEOUT_SECONDS = 5.0


def do_local_inference(host, port, serialized_examples):
    """Performs inference on a model hosted by the host:port server."""

    # create a connection
    channel = implementations.insecure_channel(host, int(port))
    stub = prediction_service_pb2.beta_create_PredictionService_stub(channel)

    # initialize a request
    request = predict_pb2.PredictRequest()
    request.model_spec.name = 'nyc-taxi'
    request.model_spec.signature_name = 'predict'

    tfproto = tf.contrib.util.make_tensor_proto([serialized_examples],
                                              shape=[len(serialized_examples)],
                                              dtype=tf.string)
    # The name of the input tensor is 'examples' based on
    # https://github.com/tensorflow/tensorflow/blob/r1.11/tensorflow/python/estimator/export/export.py#L306
    request.inputs['examples'].CopyFrom(tfproto)
    
    # call predict
    print(stub.Predict(request, INFERENCE_TIMEOUT_SECONDS))


def do_inference(model_handle, examples_file, num_examples, schema):
    """Sends examples to the model and prints the results.

    Args:
        model_handle: handle to the model, in the form "host:port"
        examples_file: path to csv file containing examples, with the first
            line assumed to have the column headers
        num_examples: number of examples to send to the server
        schema: a Schema describing the input data

    Returns:
        None. The response from the model server is printed.
    """
    filtered_features = [
      feature for feature in schema.feature if feature.name != LABEL_KEY
    ]
    del schema.feature[:]
    schema.feature.extend(filtered_features)

    csv_coder = make_csv_coder(schema)
    proto_coder = make_proto_coder(schema)

    serialized_examples = []
    # Use a context manager so the file is closed even if decoding fails.
    with open(examples_file, 'r') as input_file:
        # skip header line
        input_file.readline()

        for _ in range(num_examples):
            one_line = input_file.readline()
            if not one_line:
                print('End of example file reached')
                break
            one_example = csv_coder.decode(one_line)

            serialized_example = proto_coder.encode(one_example)
            serialized_examples.append(serialized_example)

    parsed_model_handle = model_handle.split(':')
    do_local_inference(
      host=parsed_model_handle[0],
      port=parsed_model_handle[1],
      serialized_examples=serialized_examples)
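The client above relies on `grpc.beta`, which the `grpcio` documentation marks as deprecated. A minimal sketch of the same call with the non-beta API follows. It assumes your installed `tensorflow-serving-api` package provides the generated module `prediction_service_pb2_grpc`, and `split_model_handle` and `predict_serialized_examples` are hypothetical names. The third-party imports are kept inside the function so the handle parsing can be tried without TensorFlow installed.

```python
def split_model_handle(model_handle):
    """Splits a "host:port" handle into (host, port) with basic validation."""
    host, _, port = model_handle.rpartition(':')
    if not host or not port.isdigit():
        raise ValueError('expected "host:port", got {!r}'.format(model_handle))
    return host, int(port)


def predict_serialized_examples(model_handle, serialized_examples,
                                model_name='nyc-taxi',
                                signature_name='predict',
                                timeout_seconds=5.0):
    """Sends serialized tf.Example strings to the server; returns the response."""
    # Imported lazily: these packages are assumed to be installed.
    import grpc
    import tensorflow as tf
    from tensorflow_serving.apis import predict_pb2
    from tensorflow_serving.apis import prediction_service_pb2_grpc

    host, port = split_model_handle(model_handle)
    channel = grpc.insecure_channel('{}:{}'.format(host, port))
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_name
    request.model_spec.signature_name = signature_name
    # 'examples' is the input name of the predict signature in this lesson.
    request.inputs['examples'].CopyFrom(
        tf.make_tensor_proto(serialized_examples,
                             dtype=tf.string,
                             shape=[len(serialized_examples)]))
    return stub.Predict(request, timeout_seconds)


print(split_model_handle('127.0.0.1:9000'))
# prints ('127.0.0.1', 9000)
```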

Before running the next cell, make sure the model server started by `bash ./start_model_server.sh` is running.

In [4]:
SERVER="127.0.0.1:9000"
SCHEMA_FILE="./schema.pbtxt"
NUM_INSTANCES=15
INSTANCE_FILE="../data/train/train.csv"

do_inference(SERVER,
              INSTANCE_FILE, 
              NUM_INSTANCES,
              read_schema(SCHEMA_FILE))

outputs {
  key: "predictions"
  value {
    dtype: DT_FLOAT
    tensor_shape {
      dim {
        size: 15
      }
      dim {
        size: 1
      }
    }
    float_val: 32.9800186157
    float_val: 17.2102165222
    float_val: 31.7519054413
    float_val: 33.0067481995
    float_val: 36.803691864
    float_val: 33.8259849548
    float_val: 34.7675018311
    float_val: 22.1285438538
    float_val: 43.9800224304
    float_val: 22.4845104218
    float_val: 37.6693458557
    float_val: 34.8329048157
    float_val: 31.9676322937
    float_val: 70.7163391113
    float_val: 37.2004470825
  }
}
model_spec {
  name: "nyc-taxi"
  version {
    value: 1551254584
  }
  signature_name: "predict"
}
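In the response above, `float_val` holds the predictions as one flat list, and `tensor_shape` says how to read it: 15 rows of 1 value each. The sketch below shows that mapping with plain Python on the first three values. `reshape_flat_values` is a hypothetical helper that handles only 2-D shapes. With TensorFlow installed, `tf.make_ndarray(response.outputs['predictions'])` should perform the conversion directly.

```python
def reshape_flat_values(flat_values, dims):
    """Reshapes a flat list into nested lists for a 2-D shape (rows, cols)."""
    rows, cols = dims
    if rows * cols != len(flat_values):
        raise ValueError(
            'shape {} does not match {} values'.format(dims, len(flat_values)))
    return [flat_values[r * cols:(r + 1) * cols] for r in range(rows)]


# First three float_val entries from the response above, read as shape (3, 1).
float_vals = [32.9800186157, 17.2102165222, 31.7519054413]
predictions = reshape_flat_values(float_vals, (3, 1))
print(predictions)
# prints [[32.9800186157], [17.2102165222], [31.7519054413]]
```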





## Command line interface to inspect a SavedModel

A **SavedModel** contains one or more MetaGraphDefs, identified by their tag-sets. To serve a model, you may want to know which SignatureDefs each model contains and what their inputs and outputs are. The `show` command lets you examine the contents of a SavedModel in hierarchical order. You can find more details [here](https://www.tensorflow.org/guide/saved_model#cli_to_inspect_and_execute_savedmodel).

To inspect a SavedModel, run the following command in your terminal (the leading `!` is only needed inside a notebook cell):

In [6]:
!saved_model_cli show --dir=./tf/run_0/serving_model_dir/export/nyc-taxi/1551167001/ --all


MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['predict']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['examples'] tensor_info:
        dtype: DT_STRING
        shape: (-1)
        name: input_example_tensor:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['predictions'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 1)
        name: dnn/logits/BiasAdd:0
  Method name is: tensorflow/serving/predict

signature_def['regression']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['inputs'] tensor_info:
        dtype: DT_STRING
        shape: (-1)
        name: input_example_tensor:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['outputs'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, 1)
        name: dnn/logits/BiasAdd:0
  Method name is: tensorflow/serving/regress
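The signature names in this output are the values a client can put in `request.model_spec.signature_name`. As a small sketch, the snippet below pulls them out of captured CLI text with a regular expression. It is written against the output format shown above, which is an assumption that may not hold for other TensorFlow versions. `list_signature_names` is a hypothetical helper.

```python
import re

# Matches header lines such as: signature_def['predict']:
SIGNATURE_PATTERN = re.compile(r"^signature_def\['([^']+)'\]:", re.MULTILINE)


def list_signature_names(cli_output):
    """Returns the SignatureDef names found in `saved_model_cli show` output."""
    return SIGNATURE_PATTERN.findall(cli_output)


sample_output = """MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['predict']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['examples'] tensor_info:

signature_def['regression']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['inputs'] tensor_info:
"""
print(list_signature_names(sample_output))
# prints ['predict', 'regression']
```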

If you installed TensorFlow through a pre-built TensorFlow binary, then the SavedModel CLI is already installed on your system. 