# Serving Signatures

This notebook explores possibilities for serving signatures.

The serving signature is basically the interface between our model and the client that calls our model. Unfortunately, there aren't many good examples of the different possibilities for the serving signature. In particular, there seems to be a lack of documentation and examples linking TFX pipelines and TensorFlow Serving.

In every single example on the [TFX github repo](https://github.com/tensorflow/tfx/tree/master/tfx/examples), the serving input function takes in serialized `tf.Example`s. In addition, this page of the [TFS documentation](https://www.tensorflow.org/tfx/serving/performance) states the following:

```
In general, it is advised to use the Classify and Regress endpoints as they accept tf.Example, which is a higher-level abstraction; however, in rare cases of large (O(Mb)) structured requests, savvy users may find using PredictRequest and directly encoding their Protobuf messages into a TensorProto, and skipping the serialization into and deserialization from tf.Example a source of slight performance gain.
```

However, in the actual example on the [Tensorflow Serving Github](https://github.com/tensorflow/serving/tree/master/tensorflow_serving/example), the example client code does not use tf.examples and instead directly converts data to TensorProto. 

So the question then becomes, how should our serving signatures be structured? Should they accept serialized examples? Should pre/post processing happen in the serving signature? Should we also return the tags in the serving signature? 

## **Background**
---

**TF Record**

TFRecord is a lightweight format optimized for streaming large datasets. TFRecord files contain tf.Example records, each record containing one or more features that would represent the columns in our data. tf.Example, the data structure representing every data row within TFRecord, is also the default data structure in the TF ecosystem and, therefore, is used in all TFX components. 

TF Records are very useful for training, but what is their purpose during serving? The proto definition of [tf.examples/features](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/example/feature.proto) explicitly describes these as "Protocol messages for describing features for machine learning model training **or inference.**"

**TensorProto**

Tensorflow Serving accepts [TensorProtos](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/tensor.proto). There is a helpful utility function to create tensor protos called [`make_tensor_proto`](https://www.tensorflow.org/api_docs/python/tf/make_tensor_proto). In the documentation it says (rather cryptically), "In TensorFlow 2.0, representing tensors as protos should no longer be a common workflow. That said, this utility function is still useful for generating TF Serving request protos". 

The question then becomes: why can't we go directly from data --> TensorProto, especially since `make_tensor_proto` accepts plain Python values? There's nothing stopping us from doing this. As far as I can tell, the main reasons we would consider not doing this are the following:

 + Creates serialization overhead during RPC call by doing serialization of many repeated small items 
 + Need to list out all the inputs twice (client side and in serving signature) 
 
From [this example](https://github.com/yu-iskw/tensorflow-serving-example/blob/master/python/grpc_iris_client.py) it's possible to see how this would be annoying. 
 
Therefore, as far as I can tell, we'd want to use TF examples during serving in either of the following scenarios: 

 + Batch inference on files 
 + We have a LOT of features. 
 
In order to use TF examples we create them as normal ([see example](https://stackoverflow.com/questions/53888284/how-to-send-a-tf-example-into-a-tensorflow-serving-grpc-predict-request)) and then we can do the following when creating our request: 

`request.inputs['inputs'].CopyFrom(tf.make_tensor_proto(tf_ex.SerializeToString()))`


**TFX** 

So it seems like it would be ideal to just avoid tf examples entirely! But wait, why did all of those TFX examples take in serialized TF Examples? It seems it's because TFMA (and the evaluator component) expects the model to have a signature called `serving_default` that takes in serialized examples.

This seems to either be a baffling design decision or just a lack of documentation linking TFX and TF Serving. 

These github issues describe it well: 
 + https://github.com/tensorflow/tfx/issues/1108
 + https://github.com/tensorflow/tfx/issues/2476
 + https://github.com/tensorflow/tfx/issues/1885


## **Testing our Model's Serving Signature**

Right now we have a serving signature that looks like the example below. Notice that we take in `raw_text`. This makes it very easy to work with, but it also means we won't be able to use this signature if we want to use the evaluator component.

```python
def _get_serve_tf_examples_fn(model, tf_transform_output):
    """Returns a serving function that takes raw text strings (not serialized tf.Examples)."""
    # TODO: Create alternative serving function, especially if using evaluator
    model.tft_layer = tf_transform_output.transform_features_layer()
    
    @tf.function
    def serve_tf_examples_fn(raw_text):
        """Returns the output to be used in the serving signature."""
        reshaped_text = tf.reshape(raw_text, [-1, 1])
        transformed_features = model.tft_layer({"synopsis": reshaped_text})

        outputs = model(transformed_features)
        return {"outputs": outputs}

    return serve_tf_examples_fn


signatures = {
        "serving_default": _get_serve_tf_examples_fn(model, tf_transform_output).get_concrete_function(
            tf.TensorSpec(shape=[None], dtype=tf.string, name="examples")
        ),
    }
```

Now we can test our model. In this example, I assume I have a saved model at the source path. 

```bash
docker run -p 8501:8501 -p 8500:8500 --mount type=bind,source=/home/jupyter/nbcu-metadata-enhancement/training/notebooks/serving_test/,target=/models/bert-classifier -e MODEL_NAME=bert-classifier -t tensorflow/serving
```

The first thing we can do is just straight up curl it:

In [170]:
%%time
!curl -d '{"instances": [{"examples": "Signing of the candidate city contract and joint press conference of the IOC and the elected host city for the Olympic Winter Games 2026. Announcement show for the election of the candidate city for the Olympic Winter Games 2026. Includes press conferences and news announcements from the IOC. Candidate cities present the pros of hosting the 2026 Oympics"}]}' -X POST http://localhost:8501/v1/models/bert-classifier:predict

{
    "predictions": [[0.0287234783, 0.000263780355, 0.955139756, 0.00108590722, 0.911280632, 0.0352513194, 0.000111004389, 0.00389823318, 0.00236684084, 0.000164091587, 0.000111891335, 8.95087942e-05, 7.82489587e-05, 0.00047364831, 0.000185072422, 0.000910818577, 0.000251799822, 7.06151582e-07, 0.00423595309, 0.00720736384, 3.51062081e-05, 4.85675373e-05, 0.000198423862, 6.08200498e-05, 2.17204142e-05, 0.000507026911, 0.00048276782, 0.0161753, 0.000574171543, 3.19167739e-05, 0.000298529863, 2.17966517e-05, 9.03453038e-05, 0.000260174274, 0.000126898289, 3.42027465e-06, 0.000142335892, 0.0163213313, 7.61453e-05, 0.000127226114, 6.68214561e-05, 9.09229493e-05, 0.000314205885, 0.000213056803, 0.000723511, 8.86259222e-05, 4.07925218e-05, 0.000167876482, 7.92019127e-05, 0.000378519297, 0.000180214643, 0.000167548656, 6.15779572e-05, 0.00246241689, 0.0338238776, 1.58994535e-05, 0.00144681334, 0.0135127306, 0.000135421753, 9.5048279e-05, 0.000115033879, 0.0010626018, 8.51453296e-05, 0.000745

In [125]:
model_dir = '/home/jupyter/nbcu-metadata-enhancement/training/notebooks/serving_test/1611052279'
sample_request = "Signing of the candidate city contract and joint press conference of the IOC and the elected host city for the Olympic Winter Games 2026. Announcement show for the election of the candidate city for the Olympic Winter Games 2026. Includes press conferences and news announcements from the IOC. Candidate cities present the pros of hosting the 2026 Oympics"
labels = pd.read_csv('/home/jupyter/nbcu-metadata-enhancement/training/notebooks/serving_test/1611052279/assets/tags', header=None)

It's also super easy to do this in python via the requests library: 

In [171]:
import requests 

def get_rest_request(text, model_name='bert-classifier'):
    url = 'http://localhost:8501/v1/models/{}:predict'.format(model_name)
    payload = {"instances": text}
    response = requests.post(url=url, json=payload)
    
    return response.json()

In [172]:
%%time
response = get_rest_request(text=[sample_request])

CPU times: user 3.89 ms, sys: 2.57 ms, total: 6.45 ms
Wall time: 62.7 ms


In [173]:
response

{'predictions': [[0.0287234783,
   0.000263780355,
   0.955139756,
   0.00108590722,
   0.911280632,
   0.0352513194,
   0.000111004389,
   0.00389823318,
   0.00236684084,
   0.000164091587,
   0.000111891335,
   8.95087942e-05,
   7.82489587e-05,
   0.00047364831,
   0.000185072422,
   0.000910818577,
   0.000251799822,
   7.06151582e-07,
   0.00423595309,
   0.00720736384,
   3.51062081e-05,
   4.85675373e-05,
   0.000198423862,
   6.08200498e-05,
   2.17204142e-05,
   0.000507026911,
   0.00048276782,
   0.0161753,
   0.000574171543,
   3.19167739e-05,
   0.000298529863,
   2.17966517e-05,
   9.03453038e-05,
   0.000260174274,
   0.000126898289,
   3.42027465e-06,
   0.000142335892,
   0.0163213313,
   7.61453e-05,
   0.000127226114,
   6.68214561e-05,
   9.09229493e-05,
   0.000314205885,
   0.000213056803,
   0.000723511,
   8.86259222e-05,
   4.07925218e-05,
   0.000167876482,
   7.92019127e-05,
   0.000378519297,
   0.000180214643,
   0.000167548656,
   6.15779572e-05,
   0.002

If we inspect the response, we can see that the results are exactly the same as when we curled the model, which is a great sign. 
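
Note that the curl call sent a named input (`{"examples": ...}`) while the Python helper sent a bare list of strings. Both are row-format (`instances`) bodies, and the bare form works here because the signature has a single input. The sketch below builds both variants, plus the columnar (`inputs`) form, using only the standard library. The helper names `row_payload` and `columnar_payload` are made up for illustration and are not part of any TF Serving client.

```python
import json


def row_payload(texts, input_name=None):
    """Build a row-format ("instances") body for the REST predict endpoint."""
    if input_name is None:
        # Bare values: suitable when the signature has exactly one input.
        instances = list(texts)
    else:
        # Named inputs: one JSON object per instance, keyed by the input name.
        instances = [{input_name: text} for text in texts]
    return json.dumps({"instances": instances})


def columnar_payload(texts, input_name="examples"):
    """Build a columnar ("inputs") body: one list per named input."""
    return json.dumps({"inputs": {input_name: list(texts)}})


texts = ["action action", "way too much action"]
named = row_payload(texts, input_name="examples")
bare = row_payload(texts)
columnar = columnar_payload(texts)
```

With the columnar form, the response should come back under an `outputs` key rather than `predictions`, so client code that reads `response['predictions']` would need adjusting.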

In the following cells we can associate our labels with the predictions. 

In [175]:
predictions = pd.Series(response['predictions'][0], index=labels)

In [176]:
predictions.sort_values(ascending=False)

(Sports,)               9.551398e-01
(Olympics,)             9.314756e-01
(Sports non-event,)     9.112806e-01
(News,)                 3.525132e-02
(Basketball,)           3.382388e-02
                            ...     
(Trains,)               6.128239e-07
(Arts & Literature,)    5.713443e-07
(Highlights,)           5.397718e-07
(Senior Citizen,)       4.916000e-07
(Intl soccer,)          3.275808e-07
Length: 408, dtype: float64

Let's also see what happens when we just predict with a loaded model. This will help us to ensure that our predictions are correct:

In [128]:
import tensorflow as tf
import tensorflow_text as text
import numpy as np
import pandas as pd

In [129]:
loaded_model = tf.keras.models.load_model(model_dir)


Two checkpoint references resolved to different objects (<tensorflow.python.keras.saving.saved_model.load.TensorFlowTransform>TransformFeaturesLayer object at 0x7f2ccb58f510> and <tensorflow.python.keras.engine.input_layer.InputLayer object at 0x7f2cc8bdbb50>).


In [131]:
%%time
loaded_model_prediction = loaded_model.predict([sample_request])

CPU times: user 161 ms, sys: 94.9 ms, total: 256 ms
Wall time: 135 ms


In [177]:
np.allclose(response['predictions'], loaded_model_prediction)

True

This is good! Calling the loaded model directly gives the same result (within `np.allclose`'s floating-point tolerance) as querying the model using TF Serving.
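
For reference, `np.allclose(a, b)` does not test for bit-identical values. With its default settings it checks `|a - b| <= atol + rtol * |b|` element-wise, with `rtol=1e-05` and `atol=1e-08`. A minimal pure-Python sketch of that check for flat lists of finite floats follows. The `all_close` name is hypothetical and the sample values are illustrative roundings of the first two scores above, not new measurements.

```python
def all_close(a, b, rtol=1e-05, atol=1e-08):
    """Tolerance check in the style of np.allclose, for equal-length flat lists."""
    if len(a) != len(b):
        return False
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))


# Illustrative values: the same scores printed at two different precisions.
served = [0.0287234783, 0.955139756]
local = [0.02872348, 0.95513976]

same_within_tolerance = all_close(served, local)
clearly_different = all_close([0.5], [0.51])
```

A check like this is the appropriate one here, since the REST response passes through JSON text and may not round-trip every float bit for bit.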

**Now we can try with grpc**

In [178]:
import grpc
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc
import tensorflow as tf

In [179]:
def create_grpc_stub(host, port=8500):
    hostport = "{}:{}".format(host, port)
    channel = grpc.insecure_channel(hostport)
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
    return stub

def grpc_request(stub, data_sample, model_name='bert-classifier', signature_name='serving_default'):
    request = predict_pb2.PredictRequest()
    request.model_spec.name = model_name
    request.model_spec.signature_name = signature_name
    
    request.inputs['examples'].CopyFrom(tf.make_tensor_proto(data_sample))
    result = stub.Predict(request, 10)
    return result

In [180]:
request = predict_pb2.PredictRequest()
request.model_spec.name = 'bert-classifier'
request.model_spec.signature_name = 'serving_default'

In [181]:
stub = create_grpc_stub('localhost')
rs_grpc = grpc_request(stub, [sample_request])
outputs = rs_grpc.outputs['outputs'].float_val
grpc_prediction = np.reshape(outputs, (rs_grpc.outputs['outputs'].tensor_shape.dim[0].size, -1))

In [183]:
np.allclose(response['predictions'], grpc_prediction)

True
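
The `np.reshape` call above works because `float_val` is a flat list and the batch size is read from `tensor_shape.dim[0].size`. As a rough illustration of what that reshape does, here is a dependency-free sketch. It assumes the values are laid out row by row, and `reshape_rows` is a made-up helper name.

```python
def reshape_rows(flat_values, batch_size):
    """Split a flat, row-major list into `batch_size` rows of equal width."""
    if batch_size <= 0 or len(flat_values) % batch_size != 0:
        raise ValueError("flat list length is not a multiple of the batch size")
    width = len(flat_values) // batch_size
    return [flat_values[i * width:(i + 1) * width] for i in range(batch_size)]


# Toy example: two requests with three scores each.
rows = reshape_rows([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], batch_size=2)
```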

**NOTE:** In the example above we have tensorflow as a dependency, which we may want to avoid. Fortunately, there is a way to create the tensorprotos directly. There are a few examples of this scattered across the internet, but [this resource](https://towardsdatascience.com/tensorflow-serving-client-make-it-slimmer-and-faster-b3e5f71208fb) does a good job explaining the problem and giving an example solution. Other solutions I've seen have been very similar. 

In [145]:
rs_grpc.outputs['outputs'].tensor_shape.dim[0].size

1

As a final sanity check, we can use the grpcurl tool. The [MLOps Examples repo](https://github.com/sky-uk/mlops-examples/blob/grpc-tutorial-added/grpc_tutorial_tfs/grpc_tutorial_tfs.md) shows how to use grpcurl. To run the command below, we would need to clone that repo, switch to the appropriate branch, enter the grpc_tutorial_tfs directory and then run the command.

In [147]:
!echo {sample_request} | base64

U2lnbmluZyBvZiB0aGUgY2FuZGlkYXRlIGNpdHkgY29udHJhY3QgYW5kIGpvaW50IHByZXNzIGNv
bmZlcmVuY2Ugb2YgdGhlIElPQyBhbmQgdGhlIGVsZWN0ZWQgaG9zdCBjaXR5IGZvciB0aGUgT2x5
bXBpYyBXaW50ZXIgR2FtZXMgMjAyNi4gQW5ub3VuY2VtZW50IHNob3cgZm9yIHRoZSBlbGVjdGlv
biBvZiB0aGUgY2FuZGlkYXRlIGNpdHkgZm9yIHRoZSBPbHltcGljIFdpbnRlciBHYW1lcyAyMDI2
LiBJbmNsdWRlcyBwcmVzcyBjb25mZXJlbmNlcyBhbmQgbmV3cyBhbm5vdW5jZW1lbnRzIGZyb20g
dGhlIElPQy4gQ2FuZGlkYXRlIGNpdGllcyBwcmVzZW50IHRoZSBwcm9zIG9mIGhvc3RpbmcgdGhl
IDIwMjYgT3ltcGljcwo=


```bash
grpcurl -plaintext -import-path ./proto -d '{"model_spec":{"name":"bert-classifier","signature_name":"serving_default"},"inputs":{"examples":{"dtype": "DT_STRING","string_val":"U2lnbmluZyBvZiB0aGUgY2FuZGlkYXRlIGNpdHkgY29udHJhY3QgYW5kIGpvaW50IHByZXNzIGNvbmZlcmVuY2Ugb2YgdGhlIElPQyBhbmQgdGhlIGVsZWN0ZWQgaG9zdCBjaXR5IGZvciB0aGUgT2x5bXBpYyBXaW50ZXIgR2FtZXMgMjAyNi4gQW5ub3VuY2VtZW50IHNob3cgZm9yIHRoZSBlbGVjdGlvbiBvZiB0aGUgY2FuZGlkYXRlIGNpdHkgZm9yIHRoZSBPbHltcGljIFdpbnRlciBHYW1lcyAyMDI2LiBJbmNsdWRlcyBwcmVzcyBjb25mZXJlbmNlcyBhbmQgbmV3cyBhbm5vdW5jZW1lbnRzIGZyb20gdGhlIElPQy4gQ2FuZGlkYXRlIGNpdGllcyBwcmVzZW50IHRoZSBwcm9zIG9mIGhvc3RpbmcgdGhlIDIwMjYgT3ltcGljcwo="}}}' -proto ./proto/tensorflow_serving/apis/prediction_service.proto  localhost:8500 tensorflow.serving.PredictionService/Predict
```
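
The `string_val` field is a bytes field, so its JSON form has to be base64-encoded, which is why the text was piped through `base64` first. One detail worth noting: `echo` appends a newline, so the encoded payload above ends with `\n` (the trailing `wo=`). The sketch below builds the same request body in Python without that newline. `predict_body` is a hypothetical helper, and `string_val` is written as a list here because it is a repeated field in `tensor.proto`.

```python
import base64
import json


def predict_body(text, model_name="bert-classifier",
                 signature_name="serving_default", input_name="examples"):
    """Build a JSON-serializable PredictRequest body for use with grpcurl."""
    # Encode the raw text directly, so no trailing newline sneaks in.
    encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return {
        "model_spec": {"name": model_name, "signature_name": signature_name},
        "inputs": {input_name: {"dtype": "DT_STRING", "string_val": [encoded]}},
    }


body = predict_body("so much action")
body_json = json.dumps(body)  # pass this string to grpcurl via -d
```

Depending on the signature, a `tensor_shape` entry may also be needed. It is left out here to mirror the command above.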

---
# Larger Batch

The next example shows that the current serving signature can handle multiple inputs in a single request in a very straightforward manner.

In [184]:
response = get_rest_request(text=['action action', 'way too much action'])

In [188]:
response['predictions'][0]

[0.0542888343,
 3.87648688e-05,
 0.538566053,
 0.0245588124,
 0.567848,
 0.000898718834,
 0.00673767924,
 0.0486128032,
 0.0142953396,
 0.0124898851,
 0.00227579474,
 0.037035495,
 0.0166816413,
 0.00766772032,
 0.000373363495,
 0.00285717845,
 0.000179231167,
 0.0102823079,
 0.000175088644,
 0.000519990921,
 0.00117817521,
 0.00194945931,
 8.58289277e-05,
 2.75128768e-05,
 0.00205248594,
 0.000337272882,
 0.000909686089,
 0.000476270914,
 0.00112292171,
 0.000196695328,
 0.000695765,
 3.91050635e-05,
 0.000229388475,
 8.83908506e-05,
 0.000112064758,
 0.000135272741,
 3.06612055e-05,
 0.000792622566,
 0.00196874142,
 0.00406339765,
 0.00357395411,
 0.000888139,
 0.00347402692,
 0.00295218825,
 0.000267475843,
 0.000167131424,
 0.00235310197,
 0.000202089548,
 0.000712394714,
 0.000681519508,
 0.000185608864,
 2.79893247e-05,
 0.0046248436,
 0.000167936087,
 0.000129431486,
 0.00277385116,
 0.000687599182,
 0.00773528218,
 0.000129073858,
 0.000137895346,
 0.000613987446,
 0.0001257658

---
# Including Labels

We had previously explored returning labels from the serving signature. An example serving signature implementation is presented below: 

```python
def _get_serve_tf_examples_fn(model, tf_transform_output):
    """Returns a serving function that takes raw text strings and also returns the tag labels."""
    # TODO: Create alternative serving function, especially if using evaluator
    model.tft_layer = tf_transform_output.transform_features_layer()
    tag_vocab= tf_transform_output.vocabulary_by_name('tags')
    
    @tf.function
    def serve_tf_examples_fn(raw_text):
        """Returns the output to be used in the serving signature."""
        reshaped_text = tf.reshape(raw_text, [-1, 1])
        transformed_features = model.tft_layer({"synopsis": reshaped_text})

        outputs = model(transformed_features)
        return {"outputs": outputs[0], "label": tf.constant(tag_vocab)}

    return serve_tf_examples_fn
```

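You'll notice that we take the first element in outputs. This is because, in the REST row format (`instances`/`predictions`) used here, TF Serving requires every named output to share the same first (batch) dimension, and the label vocabulary has one entry per tag rather than one per request.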
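
To make the constraint concrete, the following pure-Python sketch mimics how a row-format response is assembled: each named output is split along its first dimension and one JSON object is emitted per index. This is an illustration of the behaviour, not TF Serving's actual code. The function name and the toy values are made up.

```python
def to_row_format(named_outputs):
    """Emit one dict per index along the first dimension of every named output."""
    sizes = {name: len(values) for name, values in named_outputs.items()}
    if len(set(sizes.values())) != 1:
        raise ValueError(f"inconsistent first dimension: {sizes}")
    count = next(iter(sizes.values()))
    return [
        {name: values[i] for name, values in named_outputs.items()}
        for i in range(count)
    ]


labels = ["Local", "Sports", "News"]            # one entry per tag
scores = [[0.1, 0.7, 0.2], [0.3, 0.3, 0.4]]     # one row per request (toy values)

# Using only the first request's scores makes both outputs length 3,
# so each "prediction" ends up pairing one label with one score.
rows = to_row_format({"label": labels, "outputs": scores[0]})
```

Passing the full `scores` (first dimension 2) together with `labels` (first dimension 3) raises the `ValueError`, which is the situation the signature above sidesteps by returning `outputs[0]`.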

Now we could do some awkward logic to make sure that the labels have the same batch dimension as the outputs, but then the question becomes what is the real advantage of doing this in the serving function versus in the service itself? In theory it could be easier to ensure that the labels are exactly up-to-date using the serving function approach, but I don't really see this being a problem in the service, since we know what model version we are serving and each model version has its associated tags stored alongside it. 
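
As a rough idea of what the service-side alternative could look like, the sketch below attaches labels to every row of a batch response and keeps the highest-scoring tags. It assumes the service has already loaded the tag list for the model version it is serving. `label_predictions` and the toy values are hypothetical.

```python
def label_predictions(predictions, labels, top_k=3):
    """Pair each row of scores with its labels and keep the top_k per row."""
    results = []
    for row in predictions:
        if len(row) != len(labels):
            raise ValueError("number of scores does not match number of labels")
        ranked = sorted(zip(labels, row), key=lambda pair: pair[1], reverse=True)
        results.append(ranked[:top_k])
    return results


labels = ["Local", "Sports", "News"]             # loaded once per model version
predictions = [[0.1, 0.7, 0.2], [0.3, 0.3, 0.4]]  # toy response['predictions']

top_tags = label_predictions(predictions, labels, top_k=2)
```

Because this works on the full batch, the serving signature can keep returning `outputs` for every request instead of only the first one.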

```bash
docker run -p 8501:8501 --mount type=bind,source=/home/jupyter/nbcu-metadata-enhancement/training/notebooks/serving/,target=/models/bert-classifier -e MODEL_NAME=bert-classifier -t tensorflow/serving
```

```bash
curl -d '{"instances": [{"examples": "so much action"}]}' -X POST http://localhost:8501/v1/models/bert-classifier:predict
```

```bash
curl -d '{"instances": [{"examples": [["so much action"], ["too much action"]]}]}' -X POST http://localhost:8501/v1/models/bert-classifier:predict
```

In [148]:
def get_rest_request(text, model_name='bert-classifier'):
    url = 'http://localhost:8501/v1/models/{}:predict'.format(model_name)
    payload = {"instances": text}
    response = requests.post(url=url, json=payload)
    
    return response

In [153]:
response = get_rest_request(text=[['so much action'], ['too much action']])

Only the first example is returned, as expected... 

In [154]:
response.json()

{'predictions': [{'label': 'Local', 'outputs': 0.352799237},
  {'label': 'Sports', 'outputs': 0.584652662},
  {'label': 'Sports non-event', 'outputs': 0.435713768},
  {'label': "Children's/Family Entertainment", 'outputs': 0.362205923},
  {'label': 'Documentary', 'outputs': 0.325834334},
  {'label': 'Talk', 'outputs': 0.289208353},
  {'label': 'Drama', 'outputs': 0.50265044},
  {'label': 'Comedy', 'outputs': 0.408233553},
  {'label': 'Action & Adventure', 'outputs': 0.325080395},
  {'label': 'Reality', 'outputs': 0.414543211},
  {'label': 'Animated', 'outputs': 0.282785118},
  {'label': 'Entertainment', 'outputs': 0.440046251},
  {'label': 'News', 'outputs': 0.953991413},
  {'label': 'Interview', 'outputs': 0.484029591},
  {'label': 'Educational', 'outputs': 0.415718853},
  {'label': 'kids (ages 5-9)', 'outputs': 0.317569911},
  {'label': 'Fantasy', 'outputs': 0.497406572},
  {'label': 'older teens (ages 15+)', 'outputs': 0.757915437},
  {'label': 'Outdoors', 'outputs': 0.573759437},
 

---


# TF Examples

The following is just to demonstrate what a serving signature that takes in serialized TF examples would look like, as well as example helper functions we would need in our service in order to wrap requests as a TF example before passing them on to the model. 

```python 
def _get_serve_tf_examples_fn(model, tf_transform_output):
    """Returns a function that parses a serialized tf.Example."""

    model.tft_layer = tf_transform_output.transform_features_layer()

    @tf.function
    def serve_tf_examples_fn(serialized_tf_examples):
        """Returns the output to be used in the serving signature."""
        feature_spec = tf_transform_output.raw_feature_spec()
        feature_spec.pop('series_ep_tags')
        
        parsed_features = tf.io.parse_example(
            serialized_tf_examples, feature_spec
        )

        transformed_features = model.tft_layer(parsed_features)

        outputs = model(transformed_features)
        return {"outputs": outputs}

    return serve_tf_examples_fn
```

```python 
def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def serialize_example(feature):
    input_example = {
          'features': _bytes_feature(feature)
    }
    
    example_proto = tf.train.Example(features=tf.train.Features(feature=input_example))
    return example_proto.SerializeToString()
```

---
# Multiple Signatures 

This example illustrates how we could go about creating multiple serving signatures for the same model.

```python
class ServingSignatures(tf.Module):
    def __init__(self, model, tf_transform_output):
        self.model = model
        self.tf_transform_output = tf_transform_output
        self.model.tft_layer = self.tf_transform_output.transform_features_layer()

    @tf.function()
    def eval_input_fn(self, serialized_tf_examples):
        feature_spec = self.tf_transform_output.raw_feature_spec()
        parsed_features = tf.io.parse_example(
            serialized_tf_examples, feature_spec
        )
        transformed_features = self.model.tft_layer(parsed_features)
        outputs = self.model(transformed_features)
        return {"outputs": outputs}
    
    @tf.function()
    def serve_tf_examples_input_fn(self, serialized_tf_examples):
        feature_spec = self.tf_transform_output.raw_feature_spec()
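        # Assumes `component_utils` is a project module exposing the label and slicing keys;
        # they are dropped because clients do not send them at serving time.
        feature_spec.pop(component_utils._LABEL_KEY)
        feature_spec.pop(component_utils._SLICING_KEY)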
        parsed_features = tf.io.parse_example(
            serialized_tf_examples, feature_spec
        )
        transformed_features = self.model.tft_layer(parsed_features)
        outputs = self.model(transformed_features)
        return {"outputs": outputs}
    
    @tf.function
    def serve_raw_requests(self, features):
        transformed_features = self.model.tft_layer(features)
        outputs = self.model(transformed_features)
        
        return {"outputs": outputs}
    
input_functions = ServingSignatures(model, tf_transform_output)

signatures = {"serving_default": input_functions.eval_input_fn.get_concrete_function(
                tf.TensorSpec(shape=[None], dtype=tf.string, name="examples")),
              "tf_example_input": input_functions.serve_tf_examples_input_fn.get_concrete_function(
                tf.TensorSpec(shape=[None], dtype=tf.string, name="examples")),
              "raw_input": input_functions.serve_raw_requests.get_concrete_function(
                {'features': tf.TensorSpec(shape=[None, 1], dtype=tf.string, name="features")})
             }

model.save(fn_args.serving_model_dir, save_format="tf", signatures=signatures)
```

# Bonus Resources

 + https://medium.com/mostly-ai/tensorflow-records-what-they-are-and-how-to-use-them-c46bc4bbb564
 + https://towardsdatascience.com/tfrecords-explained-24b8f2133282