# Model Deployment using TFX

We can develop our models any way we like. But when it comes to real-world applications, the most important factor is how the model is going to be used, so model serving becomes a major consideration.

ML models can be served in 3 major ways.

1. ML inference server (An API or scheduled process)
2. User browser (When data is sensitive)
3. Edge device  (IoT, Remote sensors etc.)

In this notebook, we will mainly consider the first option.

When creating an inference server, we roughly follow the same steps every time:

* Create a web app (Django, Flask, FastAPI)
* Create the API endpoint
* Load the model weights and parameters
* Define the predict function based on the expected input
* Return the predictions

A simple example server is shown below.

In [None]:
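## Do not run the code. This is for demonstration only.

import json
from flask import Flask, request
from tensorflow.keras.models import load_model

model = load_model('my_model_saved_path')
app = Flask('app_name')

def preprocess(data):
    # Placeholder: convert the raw form value into the array the model expects.
    raise NotImplementedError

@app.route('/classify', methods=['POST'])
def classify():
    data = request.form['data']
    processed_data = preprocess(data)
    prediction = model.predict(processed_data)
    # NumPy arrays are not JSON serializable, so convert to nested lists first.
    return json.dumps({'preds': prediction.tolist()})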


The above is great for demonstration, but such approaches carry risks.

1. **Lack of code separation** - Data science code and deployment (web development) code now live in the same module. This is problematic because ML models change much more frequently than the web application, which causes unnecessary dependencies and ownership issues.
2. **Lack of Model version control**
3. **Inefficiencies in model inference under high load**


### TensorFlow Serving

TensorFlow Serving (TFS) loads models from a given source and notifies the loader when the source changes (for example, when a new version appears). Everything inside TensorFlow Serving is coordinated by a model manager component. Its overall architecture is shown below.

<center><image src="imgs/3.jpg" width="400"/></center>

The model manager handles model loading based on the defined policies and provides a serving handler. Data scientists can therefore publish new model versions, and TFS updates automatically once it detects one.

Before TF models can be used in TFS, they need to be exported in the SavedModel format.

<center>

`model.save(filepath="./saved_models", save_format="tf")`
</center>

It is also recommended to add a timestamp to the model name when saving the model manually like above. This makes the models easier to recognize later.
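A minimal sketch of this naming idea, assuming we use the Unix timestamp as the name of the version directory (TFS looks for numeric version subdirectories under a model's base path). The helper name `versioned_export_path` is hypothetical, not part of TensorFlow.

```python
import time
from pathlib import Path


def versioned_export_path(base_dir, timestamp=None):
    """Return base_dir/<unix timestamp> as the export location for one model version."""
    # Hypothetical helper: an integer timestamp doubles as a sortable version number.
    version = int(time.time()) if timestamp is None else int(timestamp)
    return Path(base_dir) / str(version)


# A fixed timestamp is passed here only to keep the example repeatable.
export_path = versioned_export_path("saved_models", timestamp=1700000000)
print(export_path)
```

The resulting path can then be passed as the `filepath` argument when saving the model.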

TFS uses a concept called model signatures to identify the model graph's inputs and outputs and the signature method. This definition allows us to update the model without changing the requests sent to the model server, since the signature maps request inputs to the related graph nodes. More details can be found in the TensorFlow Serving documentation on SignatureDefs.

To install the TensorFlow Serving Python API, use the command below.

<center>

`pip install tensorflow-serving-api`
</center>

Installing it also brings in TensorFlow, which ships the `saved_model_cli` command-line tool. It can be used to inspect exported model signatures and to test exported models without deploying them.

Below are some usage examples for this tool.




In [3]:
!saved_model_cli show --dir data/tfx/Trainer/model/6/format-Serving/

The given SavedModel contains the following tag-sets:
'serve'


Depending on our computation graph, we may have different tags for CPU-based inference, GPU-based inference, training, or serving. The tag identifies which graph to execute. Once we have the tag, we can inspect the model signatures as below.

In [4]:
!saved_model_cli show --dir data/tfx/Trainer/model/6/format-Serving/ --tag_set serve

The given SavedModel MetaGraphDef contains SignatureDefs with the following keys:
SignatureDef key: "__saved_model_init_op"
SignatureDef key: "serving_default"


The output above shows that, under this tag, our model has a single serving signature named 'serving_default' (`__saved_model_init_op` is an internal initialization entry, not something we call). To obtain the details of a signature, we pass both the tag and the signature name as below.

In [5]:
!saved_model_cli show --dir data/tfx/Trainer/model/6/format-Serving/ --tag_set serve --signature_def serving_default

The given SavedModel SignatureDef contains the following input(s):
  inputs['examples'] tensor_info:
      dtype: DT_STRING
      shape: (-1)
      name: serving_default_examples:0
The given SavedModel SignatureDef contains the following output(s):
  outputs['outputs'] tensor_info:
      dtype: DT_FLOAT
      shape: (-1, 1)
      name: StatefulPartitionedCall_11:0
Method name is: tensorflow/serving/predict


The above shows the model signature in terms of its inputs and outputs. There are several model signature types; more details can be found in the TF documentation.

Besides inspecting the model, we can also test it from the CLI as below.

In [None]:
!saved_model_cli run --dir data/tfx/Trainer/model/6/format-Serving/ \
                --tag_set serve --signature_def serving_default \
                --input_examples "examples=[{'key':'value', ...}]"

### Setting up TensorFlow Serving

There are 2 main ways of running TF Serving (technically the same!).

1. **Docker usage**
    - This is the easiest way as we can download the prebuilt docker image. 
    - `docker pull tensorflow/serving` (for CPU based)
    - `docker pull tensorflow/serving:latest-gpu` (for GPU based)

2. **Native installation**
    - If we have a dedicated server, we can use this method.
    - We need to set up the environment explicitly by adding custom packages.


Once we are done with the setup, we need to configure the TensorFlow server. Out of the box, it can run in 2 modes. In the first, we specify a single model and the server always serves its latest version. In the second, we provide a configuration file listing the models and versions we want to load, and the server serves only those.


#### Single model configuration

We can run TensorFlow Serving so that it loads a single model and switches to new versions whenever they become available. This setup is called single model configuration.

In a Docker environment, we can define it as below.

In [None]:
# -p publishes the gRPC (8500) and REST (8501) ports.
# --mount binds the host model directory into the container.
# MODEL_NAME is the model name; the image serves from MODEL_BASE_PATH/MODEL_NAME.
# For the GPU version, change the image to tensorflow/serving:latest-gpu.
!docker run -p 8500:8500 \
            -p 8501:8501 \
            --mount type=bind,source=/tmp/models,target=/models/my_model \
            -e MODEL_NAME=my_model \
            -e MODEL_BASE_PATH=/models \
            -t tensorflow/serving

The 2 ports are for the gRPC (Remote Procedure Call) endpoint on 8500 and the REST API endpoint on 8501.

If we want to deploy the model in a native installation (not in Docker), we can use the command below.

In [None]:
!tensorflow_model_server --port=8500 \
                          --rest_api_port=8501 \
                          --model_name=my_model \
                          --model_base_path=/models/my_model

One of the main advantages of TensorFlow Serving is that models are hot-swappable. When a new model version is delivered, the TFS manager unloads the old version and loads the new one automatically. Together with model versioning, this is also useful in rollback situations.
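As a rough illustration of what "new model" means here: by default, TFS serves the highest-numbered version directory under the model's base path. The sketch below mimics that selection in plain Python; `latest_version` is a hypothetical helper, not a TFS API, and the directory layout is made up for the demo.

```python
import tempfile
from pathlib import Path


def latest_version(base_path):
    """Return the highest numeric version directory under base_path, or None."""
    versions = [
        int(p.name)
        for p in Path(base_path).iterdir()
        if p.is_dir() and p.name.isdigit()  # non-numeric directories are ignored
    ]
    return max(versions) if versions else None


# Throwaway layout with version directories 1, 2 and 10 plus an unrelated folder.
with tempfile.TemporaryDirectory() as base:
    for name in ("1", "2", "10", "notes"):
        (Path(base) / name).mkdir()
    newest = latest_version(base)
    print(newest)
```

Note that versions are compared as integers, so `10` is newer than `2`. Rolling back then amounts to removing the newest directory so that the previous one becomes the highest again.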

#### Multi model configuration

In this method, we configure TensorFlow Serving to load multiple models at the same time. To do that, we need to define a configuration file that specifies the models.

<pre>
model_config_list {
  config {
    name: 'my_model'
    base_path: '/models/my_model/'
    model_platform: 'tensorflow'
  }
  config {
    name: 'another_model'
    base_path: '/models/another_model/'
    model_platform: 'tensorflow'
  }
}
</pre>

If needed, we can also define model loading policies, such as specific versions or labels, for advanced use cases (like A/B testing).
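The sketch below generates such a config file from Python, including a `model_version_policy` block that pins specific versions. It is only an illustration: `render_model_config` is a hypothetical helper, and the version numbers are made up.

```python
def render_model_config(models):
    """Render a model_config_list text block.

    models: list of dicts with 'name', 'base_path' and an optional 'versions' list.
    """
    lines = ["model_config_list {"]
    for m in models:
        lines.append("  config {")
        lines.append(f"    name: '{m['name']}'")
        lines.append(f"    base_path: '{m['base_path']}'")
        lines.append("    model_platform: 'tensorflow'")
        versions = m.get("versions")
        if versions:
            # Pin the server to the listed versions instead of the latest one.
            lines.append("    model_version_policy {")
            lines.append("      specific {")
            for v in versions:
                lines.append(f"        versions: {int(v)}")
            lines.append("      }")
            lines.append("    }")
        lines.append("  }")
    lines.append("}")
    return "\n".join(lines)


config_text = render_model_config([
    {"name": "my_model", "base_path": "/models/my_model/", "versions": [1, 2]},
    {"name": "another_model", "base_path": "/models/another_model/"},
])
print(config_text)
```

Loading two versions of the same model side by side is what makes it possible to send requests to a specific version later.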

After defining the model config file, we can pass it to TensorFlow Serving (Docker environment) as below.

In [None]:
!docker run -p 8500:8500 \
             -p 8501:8501 \
             --mount type=bind,source=/tmp/models,target=/models/my_model \
             --mount type=bind,source=/tmp/model_config,target=/models/model_config \
             -e MODEL_NAME=my_model \
             -t tensorflow/serving \
             --model_config_file=/models/model_config

Or in a native environment as below.

In [None]:
!tensorflow_model_server --port=8500 \
                          --rest_api_port=8501 \
                          --model_config_file=/models/model_config

Once the model server is set up, we can get predictions by calling its REST API endpoint (or invoking the RPC). Generic Python libraries are enough for that.

**Calling REST**

The generic URL pattern would be something like below.

<center>

`http://{HOST}:{PORT}/v1/models/{MODEL_NAME}:{VERB}`
</center>

Here VERB is the type of signature we want to use (predict, regress or classify). If we need to target a specific model version, the URL pattern looks like below.

<center>

`http://{HOST}:{PORT}/v1/models/{MODEL_NAME}[/versions/{MODEL_VERSION}]:{VERB}`
</center>

The request payload is simple JSON data with either an `instances` key (row format: a list of examples) or an `inputs` key (columnar format: one entry per named input), but not both in the same request.
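A minimal sketch of such a REST call using only the Python standard library. It assumes a server reachable on the REST port and uses the `instances` key with the `predict` verb; the function names `build_predict_request` and `predict` are hypothetical, and the input values are made up.

```python
import json
import urllib.request


def build_predict_request(host, port, model_name, instances, version=None):
    """Return (url, body) for a predict call, optionally pinned to a model version."""
    version_part = f"/versions/{version}" if version is not None else ""
    url = f"http://{host}:{port}/v1/models/{model_name}{version_part}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return url, body


def predict(host, port, model_name, instances, version=None, timeout=10.0):
    """Send the request and return the 'predictions' field of the JSON response."""
    url, body = build_predict_request(host, port, model_name, instances, version)
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["predictions"]


# Building the request does not need a running server.
url, body = build_predict_request("localhost", 8501, "my_model", [[1.0, 2.0]], version=6)
print(url)
```

With a running server, `predict("localhost", 8501, "my_model", [[1.0, 2.0]])` would return the predictions as plain Python lists. The shape of `instances` has to match what the model signature expects.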

**Calling gRPC**

Invoking RPCs is a bit different from calling a REST API. 
[gRPC Documentation](https://grpc.io/docs/what-is-grpc/introduction/)

First we need to establish a gRPC channel. This channel provides the connection to the gRPC server at a given host IP and port. Then we create a stub object, a local object that exposes the server's methods (such as `Predict`) as ordinary function calls. (I don't remember the exact workings of RPC methods, so it's better to read about them!)

This is a bit more complex than using plain REST APIs, but generally more performant as well. Read more about the usage in the [documentation of Tensorflow serving](https://www.tensorflow.org/tfx/serving/api_rest).
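A hedged sketch of the channel and stub steps described above. It assumes the `grpc`, `tensorflow` and `tensorflow-serving-api` packages are installed and a server is listening on port 8500. The input name `examples`, the output name `outputs` and the signature `serving_default` are taken from the CLI output shown earlier; the function name `grpc_predict` is hypothetical.

```python
def grpc_predict(serialized_examples, host="localhost", port=8500,
                 model_name="my_model", timeout=10.0):
    """Send serialized examples to TFS over gRPC and return the output array."""
    # Imports live inside the function so the sketch can be loaded
    # even where the serving packages are not installed.
    import grpc
    import tensorflow as tf
    from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

    # 1. Open a channel to the gRPC endpoint.
    channel = grpc.insecure_channel(f"{host}:{port}")
    # 2. Create the stub that exposes the server's methods locally.
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    # 3. Describe which model and signature to call, and attach the inputs.
    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_name
    request.model_spec.signature_name = "serving_default"
    request.inputs["examples"].CopyFrom(
        tf.make_tensor_proto(serialized_examples, dtype=tf.string)
    )

    # 4. Invoke the remote Predict method and convert the result to NumPy.
    response = stub.Predict(request, timeout=timeout)
    return tf.make_ndarray(response.outputs["outputs"])
```

Here `serialized_examples` would be a list of serialized `tf.Example` byte strings, since the signature above takes a string tensor named `examples`.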


### A/B Testing with TensorFlow Serving

A/B testing helps us understand how 2 (or more, for that matter) different models behave in a production setting. However, TFS does not provide functionality to split requests between 2 models on the server side. Instead, we can direct requests from the client side to different models at random. We will still need to calculate the statistics manually from the server side.
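A small sketch of client-side routing, assuming both models are loaded on the same server (for example through a multi model configuration). The function name `pick_model_url` and the 80/20 split are illustrative assumptions.

```python
import random


def pick_model_url(host, port, model_a, model_b, share_a=0.5, rng=random):
    """Randomly choose one of two model names and return (name, predict URL)."""
    # rng.random() is in [0, 1), so share_a is the fraction routed to model_a.
    name = model_a if rng.random() < share_a else model_b
    return name, f"http://{host}:{port}/v1/models/{name}:predict"


# Route roughly 80% of requests to my_model and the rest to another_model.
counts = {"my_model": 0, "another_model": 0}
rng = random.Random(0)  # seeded only to make the demo repeatable
for _ in range(1000):
    name, _url = pick_model_url("localhost", 8501, "my_model", "another_model", 0.8, rng)
    counts[name] += 1
print(counts)
```

Logging the chosen model name together with each request makes it possible to compare the two models afterwards.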

#### Batch inferencing using TFS

This is one of the most useful features provided by TensorFlow Serving for making proper use of computing resources. Batch inference needs to be enabled before use, and we need to set a few configuration values.

- **max_batch_size** => maximum number of requests collected into one batch
- **batch_timeout_micros** => maximum wait time (in microseconds) for filling a batch
- **max_enqueued_batches** => maximum number of batches that can be queued for prediction. Setting this helps avoid congestion; once the queue is full, the user gets an error instead
- **num_batch_threads** => how many CPU/GPU cores can be used in parallel
- **pad_variable_length_inputs** => boolean that pads variable-length inputs to the same size
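To build intuition for how `max_batch_size` and `batch_timeout_micros` interact, here is a deliberately simplified simulation. It is not how TFS is implemented; it only assumes that a batch is closed either when it is full or when its oldest request has waited for the timeout. The function name `simulate_batches` and the arrival times are made up.

```python
def simulate_batches(arrivals_us, max_batch_size, batch_timeout_micros):
    """Group request arrival times (in microseconds) into batches."""
    batches = []
    current = []
    for t in sorted(arrivals_us):
        # Close the open batch if its oldest request has waited past the timeout.
        if current and t - current[0] >= batch_timeout_micros:
            batches.append(current)
            current = []
        current.append(t)
        # Close the batch as soon as it is full.
        if len(current) == max_batch_size:
            batches.append(current)
            current = []
    if current:
        batches.append(current)
    return batches


full_then_partial = simulate_batches([0, 100, 200, 6000, 6100], 3, 5000)
timed_out = simulate_batches([0, 100, 7000], 4, 5000)
print(full_then_partial)  # [[0, 100, 200], [6000, 6100]]
print(timed_out)          # [[0, 100], [7000]]
```

A larger batch size improves throughput when traffic is dense, while a shorter timeout keeps latency low when traffic is sparse.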

Once we decide on the values, we need to create a file containing the parameters, like below.

<pre>
max_batch_size { value: 32 }
batch_timeout_micros { value: 5000 }
pad_variable_length_inputs: true
</pre>

After saving it, we can rerun the TFS service with 2 additional parameters: one enabling batch mode and one pointing to the config file.

In [None]:
!docker run -p 8500:8500 \
            -p 8501:8501 \
            --mount type=bind,source=/path/to/models,target=/models/my_model \
            --mount type=bind,source=/path/to/batch_config,target=/server_config \
            -e MODEL_NAME=my_model -t tensorflow/serving \
            --enable_batching=true \
            --batching_parameters_file=/server_config/batching_parameters.txt

There are also several other parameters we can tune in TensorFlow Serving to provide a better inference service. Please read the documentation for details on them.

> Other than TFS, there are a few other options to serve ML models. These include `BentoML`, `Seldon`, `GraphPipe`, `MLflow` and `Ray Serve`.