<a href="https://colab.research.google.com/github/Muzhi1920/awesome-models/blob/main/13-tf_serving/TF_Serving_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

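# TensorFlow Serving (gRPC) in Practice for Beginners

Previous articles introduced training a simple wide&deep model with tf.estimator (see [wide&deep](https://zhuanlan.zhihu.com/p/510886354)) and the basics of gRPC services (see [grpc-demo](https://zhuanlan.zhihu.com/p/518605682)). This article continues with model deployment and inference serving built on TensorFlow Serving, covering:

1. Preparing the model to serve
2. Setting up the tf-serving environment
3. Implementing a client that calls the model service
4. Model outputs and multi-objective fusion

For JSON calls over the RESTful API, see [rest_simple](https://www.tensorflow.org/tfx/tutorials/serving/rest_simple).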

## Preparing the Wide&Deep Model

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


The model is included in the current folder. When running on Google Colab, use the model path below; a simple model is stored here: [wide&deep](https://drive.google.com/file/d/1DpUKkqvGYs2kTSvUWglG-yRzOzIdSCUU/view?usp=sharing).

In [2]:
model_path = "/content/drive/MyDrive/saved_model"
version = "1"

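### Inspecting the saved model

TensorFlow provides the `saved_model_cli` tool for inspecting a saved model's inputs and outputs. A model can define multiple signatures, such as `serving_default`, `predict`, and so on. In this model, each signature takes the same inputs and produces different outputs, to serve different needs.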

In [3]:
!saved_model_cli show --dir {model_path}/{version} --all


MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:

signature_def['ctr/predict']:
  The given SavedModel SignatureDef contains the following input(s):
    inputs['dense_000'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, -1)
        name: Placeholder_21:0
    inputs['dense_002'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, -1)
        name: Placeholder_22:0
    inputs['dense_003'] tensor_info:
        dtype: DT_FLOAT
        shape: (-1, -1)
        name: Placeholder_23:0
    inputs['sparse_000'] tensor_info:
        dtype: DT_INT64
        shape: (-1, -1)
        name: Placeholder_18:0
    inputs['sparse_001'] tensor_info:
        dtype: DT_INT64
        shape: (-1, -1)
        name: Placeholder_19:0
    inputs['sparse_002'] tensor_info:
        dtype: DT_INT64
        shape: (-1, -1)
        name: Placeholder_20:0
  The given SavedModel SignatureDef contains the following output(s):
    outputs['all_class_ids'] tensor_info:
        dtype: 

## Setting up the TensorFlow Serving service

Set up the TensorFlow model server environment by running the following commands to install it:

In [4]:
!echo "deb http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal" | sudo tee /etc/apt/sources.list.d/tensorflow-serving.list && \
curl https://storage.googleapis.com/tensorflow-serving-apt/tensorflow-serving.release.pub.gpg | sudo apt-key add -
!sudo apt update

deb http://storage.googleapis.com/tensorflow-serving-apt stable tensorflow-model-server tensorflow-model-server-universal
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2943  100  2943    0     0  14217      0 --:--:-- --:--:-- --:--:-- 14217
OK
Hit:1 https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64  InRelease
Get:2 https://cloud.r-project.org/bin/linux/ubuntu bionic-cran40/ InRelease [3,626 B]
Get:3 http://storage.googleapis.com/tensorflow-serving-apt stable InRelease [3,012 B]
Ign:4 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  InRelease
Hit:5 https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu1804/x86_64  Release
Get:6 http://security.ubuntu.com/ubuntu bionic-security InRelease [88.7 kB]
Get:7 http://ppa.launchpad.net/c2d4u.team/c2d4u4.0+/ubuntu bionic InRelease [15.9 kB]
Hit:8 

In [5]:
!sudo apt-get install tensorflow-model-server

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-460
Use 'sudo apt autoremove' to remove it.
The following NEW packages will be installed:
  tensorflow-model-server
0 upgraded, 1 newly installed, 0 to remove and 45 not upgraded.
Need to get 340 MB of archives.
After this operation, 0 B of additional disk space will be used.
Get:1 http://storage.googleapis.com/tensorflow-serving-apt stable/tensorflow-model-server amd64 tensorflow-model-server all 2.8.0 [340 MB]
Fetched 340 MB in 6s (59.7 MB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 76, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)

### Starting the TF Serving service

In [6]:
!nohup tensorflow_model_server \
  --port=8502   \
  --model_name='wide&deep' \
  --model_base_path=$model_path >server.log 2>&1 &

For more configuration flags, see the [official code](https://github.com/tensorflow/serving/blob/master/tensorflow_serving/model_servers/main.cc#L59).
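
`--model_base_path` points at the directory that holds the version subdirectories, not at a version itself; the log below shows the server reading `.../saved_model/1`. As a hedged sketch, the directory layout can be checked before starting the server. The helper names `servable_versions` and `latest_version` are hypothetical, and the sketch assumes the server's default policy of serving the highest numeric version.

```python
import os


def servable_versions(model_base_path):
    """Return the numeric version subdirectories under model_base_path, sorted ascending."""
    versions = []
    for entry in os.listdir(model_base_path):
        full_path = os.path.join(model_base_path, entry)
        # Only directories whose names are plain ASCII digits count as versions here.
        if entry.isascii() and entry.isdigit() and os.path.isdir(full_path):
            versions.append(int(entry))
    return sorted(versions)


def latest_version(model_base_path):
    """Return the largest version number, or None if no version directory exists."""
    versions = servable_versions(model_base_path)
    return versions[-1] if versions else None
```

In this notebook, `latest_version(model_path)` would be expected to return `1`, since the only version directory is `saved_model/1`.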

In [7]:
!cat server.log

2022-06-06 12:51:58.645494: I tensorflow_serving/model_servers/server.cc:89] Building single TensorFlow model file config:  model_name: wide&deep model_base_path: /content/drive/MyDrive/saved_model
2022-06-06 12:51:58.645789: I tensorflow_serving/model_servers/server_core.cc:465] Adding/updating models.
2022-06-06 12:51:58.645828: I tensorflow_serving/model_servers/server_core.cc:594]  (Re-)adding model: wide&deep
2022-06-06 12:51:58.750550: I tensorflow_serving/core/basic_manager.cc:740] Successfully reserved resources to load servable {name: wide&deep version: 1}
2022-06-06 12:51:58.750645: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: wide&deep version: 1}
2022-06-06 12:51:58.750669: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: wide&deep version: 1}
2022-06-06 12:51:58.751063: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:43] Reading SavedModel from: /content/drive/MyDrive/saved_model/1
2

In [8]:
# Check which ports are listening
!sudo lsof -i -P -n | grep LISTEN

node         7 root   21u  IPv6  26401      0t0  TCP *:8080 (LISTEN)
colab-fil   29 root    5u  IPv6  26379      0t0  TCP *:3453 (LISTEN)
colab-fil   29 root    6u  IPv4  26380      0t0  TCP *:3453 (LISTEN)
jupyter-n   42 root    6u  IPv4  27118      0t0  TCP 172.28.0.2:9000 (LISTEN)
python3     60 root   15u  IPv4  30693      0t0  TCP 127.0.0.1:44337 (LISTEN)
python3     60 root   18u  IPv4  30697      0t0  TCP 127.0.0.1:47445 (LISTEN)
python3     60 root   21u  IPv4  30701      0t0  TCP 127.0.0.1:44631 (LISTEN)
python3     60 root   24u  IPv4  30705      0t0  TCP 127.0.0.1:59491 (LISTEN)
python3     60 root   30u  IPv4  30711      0t0  TCP 127.0.0.1:34909 (LISTEN)
python3     60 root   44u  IPv4  32779      0t0  TCP 127.0.0.1:45987 (LISTEN)
python3     80 root    3u  IPv4  33061      0t0  TCP 127.0.0.1:21623 (LISTEN)
python3     80 root    4u  IPv4  33062      0t0  TCP 127.0.0.1:33817 (LISTEN)
python3     80 root    9u  IPv4  32093      0t0  TCP 127.0.0.1:40787 (LISTEN)
tensorflo 157

The gRPC port `8502` specified for the tf-serving service shows up in the list.
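
Instead of scanning the `lsof` output by eye, a client can poll the port until it accepts TCP connections. This is a minimal sketch using only the standard library; `wait_for_port` and its parameters are hypothetical names. A successful TCP connect only shows that something is listening on the port, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host, port, timeout_s=30.0, interval_s=0.5):
    """Poll until a TCP connection to (host, port) succeeds or timeout_s elapses.

    Returns True once a connection is accepted, False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # The connection is closed again immediately; only reachability matters.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

For the server started above, a call such as `wait_for_port('localhost', 8502)` could run before the gRPC channel is created.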

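## Building the TensorFlow Serving client

The TensorFlow runtime is lazily loaded (initialized), so the first requests incur significant latency (see [a complete optimization walkthrough for latency spikes during TensorFlow Serving model updates](https://mp.weixin.qq.com/s/DkCGusznH8F8p39oRLuNBQ)). To reduce this lazy-loading latency, a few random samples are built and sent to initialize the variables and graph nodes, and external requests are admitted only afterwards. This process is called model warmup. To generate a warmup file for the model, see the [official document](https://www.tensorflow.org/tfx/serving/saved_model_warmup).

> The server.log from the deployment above also reports `No warmup data file found`.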
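
The official warmup mechanism is a file shipped with the SavedModel. As a lighter illustration of the same idea, the sketch below warms the server from the client side by replaying a dummy request and recording each call's latency. `warm_up` and its parameters are hypothetical names, and `predict_fn` stands for any callable that sends one request.

```python
import time


def warm_up(predict_fn, request, rounds=20):
    """Call predict_fn(request) `rounds` times and return per-call latency in milliseconds."""
    latencies_ms = []
    for _ in range(rounds):
        start = time.perf_counter()
        predict_fn(request)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms
```

With the `stub` and `request` objects built later in this notebook, a call could look like `warm_up(lambda r: stub.Predict(r, 10.0), request)`. If lazy initialization is the dominant cost, the first entries of the returned list should be noticeably larger than the rest.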

In [9]:
# Install the packages the local client needs
!pip install -q requests
!pip install -q tensorflow-serving-api

In [10]:
import os
import tempfile
import pandas as pd
import tensorflow as tf
import numpy as np
import json
import requests

import grpc
from tensorflow_serving.apis import predict_pb2
from tensorflow_serving.apis import prediction_service_pb2_grpc

`prediction_service_pb2_grpc` is generated from [`prediction_service.proto`](https://github.com/tensorflow/serving/blob/master/tensorflow_serving/apis/prediction_service.proto), which defines the RPC services for nearly every request type: `Classify`, `Regress`, `Predict`, `MultiInference`, and `GetModelMetadata`.

This article uses the `Predict` RPC, whose request and response are defined in [`predict.proto`](https://github.com/tensorflow/serving/blob/master/tensorflow_serving/apis/predict.proto). The request specifies the TensorFlow model to run, the input tensors, and an output filter. The features needed for inference are declared as `map<string, TensorProto> inputs = 2;`, so what gets passed in is a feature dict mapping each feature name to its feature tensor.

All other `proto` definitions can be found in [apis_proto](https://github.com/tensorflow/serving/tree/master/tensorflow_serving/apis).
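
Because `inputs` is a plain name-to-tensor map, a misspelled or missing feature name only surfaces when the server rejects the request. The sketch below checks a feature dict on the client first. It is an illustration under assumptions: `EXPECTED_INPUTS` is copied by hand from the `saved_model_cli` output above, `check_feature_dict` is a hypothetical helper, features are represented as nested Python lists, and the requirement that all inputs share one batch size is assumed rather than stated by the signature.

```python
# Input names and dtypes copied from the saved_model_cli output earlier in this notebook.
EXPECTED_INPUTS = {
    "dense_000": "float", "dense_002": "float", "dense_003": "float",
    "sparse_000": "int", "sparse_001": "int", "sparse_002": "int",
}


def check_feature_dict(features, expected=EXPECTED_INPUTS):
    """Return a list of problems found in `features`; an empty list means it looks fine.

    `features` maps feature name -> 2-D nested list, mirroring the (-1, -1) shapes.
    """
    problems = []
    for name in sorted(set(expected) - set(features)):
        problems.append(f"missing input: {name}")
    for name in sorted(set(features) - set(expected)):
        problems.append(f"unexpected input: {name}")

    batch_sizes = set()
    for name, rows in features.items():
        if name not in expected:
            continue
        wanted = float if expected[name] == "float" else int
        batch_sizes.add(len(rows))
        for row in rows:
            # bool is a subclass of int in Python, so it is excluded explicitly.
            if not all(isinstance(v, wanted) and not isinstance(v, bool) for v in row):
                problems.append(f"{name}: expected {expected[name]} values")
                break

    # Assumption: every input carries the same number of examples.
    if len(batch_sizes) > 1:
        problems.append(f"inconsistent batch sizes: {sorted(batch_sizes)}")
    return problems
```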

### Connecting to the model's gRPC service

The main predict proto messages are:
```proto
message PredictRequest {
  ModelSpec model_spec = 1;  // which model to call; model_name + model_version identify it uniquely
  map<string, TensorProto> inputs = 2;  // dict of input features
  repeated string output_filter = 3;  // restricts which outputs are returned
}

message PredictResponse {
  ModelSpec model_spec = 2;  // the model that served the request
  map<string, TensorProto> outputs = 1;  // dict of returned scores
}
```

In [11]:
channel = grpc.insecure_channel('localhost:8502')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

### Creating the model server request

In [12]:
request = predict_pb2.PredictRequest()
request.model_spec.name = 'wide&deep'
request.model_spec.signature_name = 'predict'

#### Randomly initializing the feature tensors

In [13]:
sparse_feature = ['sparse_000', 'sparse_001', 'sparse_002']
dense_feature = ['dense_000', 'dense_002', 'dense_003']

In [14]:
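for sf in sparse_feature:
    # Random IDs of shape [batch=8, 3]; dtype is pinned to int64 to match DT_INT64 in the signature.
    dummy_sparse_feature = np.random.randint(10000000, 99999999, size=[8, 3], dtype=np.int64)
    request.inputs[sf].CopyFrom(tf.make_tensor_proto(dummy_sparse_feature))

for df in dense_feature:
    # Random values of shape [batch=8, 1], cast to float32 to match DT_FLOAT.
    dummy_dense_feature = np.random.normal(size=(8, 1))
    request.inputs[df].CopyFrom(tf.make_tensor_proto(dummy_dense_feature, dtype=tf.float32))
request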

model_spec {
  name: "wide&deep"
  signature_name: "predict"
}
inputs {
  key: "dense_000"
  value {
    dtype: DT_FLOAT
    tensor_shape {
      dim {
        size: 8
      }
      dim {
        size: 1
      }
    }
    tensor_content: "\363P3?\006\034V\277#.\262?\341\360\321=|\245\234\277\325\277\265\277^\nJ\277)\311\237?"
  }
}
inputs {
  key: "dense_002"
  value {
    dtype: DT_FLOAT
    tensor_shape {
      dim {
        size: 8
      }
      dim {
        size: 1
      }
    }
    tensor_content: "\206\235\273?\340\305\227\275\204\351\267\276\374&\276\277#\\\302?_t\367\276\232\261?\277\221`\264\276"
  }
}
inputs {
  key: "dense_003"
  value {
    dtype: DT_FLOAT
    tensor_shape {
      dim {
        size: 8
      }
      dim {
        size: 1
      }
    }
    tensor_content: "`\374\376?<\214&\277\264\226t?\037K\022\300\032\271B?>\263\365<,X\376\277Q\367\330?"
  }
}
inputs {
  key: "sparse_000"
  value {
    dtype: DT_INT64
    tensor_shape {
      dim {
        size: 8
      }

### Calling the model service

In [15]:
%%timeit
result = stub.Predict(request, 10.0)  # 10-second timeout
result

The slowest run took 180.38 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 5: 975 µs per loop


### Model outputs and multi-objective fusion ranking

In [16]:
grpc_predictions = stub.Predict(request, 10.0)
ctr_output = grpc_predictions.outputs['ctr/logistic'].float_val
grpc_predictions.model_spec,ctr_output

(name: "wide&deep"
 version {
   value: 1
 }
 signature_name: "predict",
 [0.6135268807411194, 0.6138688325881958, 0.6135810613632202, 0.611910879611969, 0.6130681037902832, 0.6125918626785278, 0.6142157912254333, 0.6139649748802185])

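> The model was trained on random labels and the features sent are random too, so the predicted scores here are effectively random as well.


Suppose this were a multi-objective model; it would then output several predicted scores. Multi-objective fusion can be defined inside the model graph, or applied afterwards to the multiple returned scores. The outputs field of `PredictResponse` is defined in the proto as `map<string, TensorProto> outputs = 1;`.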

In [17]:
ctr = grpc_predictions.outputs['ctr/logistic'].float_val
# Demo for the multi-objective case:
# vtr = grpc_predictions.outputs['vtr/logistic'].float_val  # this is not a multi-objective model, so only the ctr output exists
# pt = grpc_predictions.outputs['pt/logistic'].float_val
# action = grpc_predictions.outputs['action/logistic'].float_val
# final_score = mix_func(ctr, vtr, pt, action)  # add, mul and so on...
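
The commented-out `mix_func` above is left undefined in the notebook. One possible shape for it is sketched below as a weighted product of the per-objective scores, followed by a ranking step. The formula, the `weights` parameter, and the `rank_by_score` helper are assumptions for illustration, not the fusion the author actually uses.

```python
import math


def mix_func(*scores, weights=None):
    """Fuse per-example scores from several objectives into one score per example.

    Each positional argument is a sequence of per-example scores for one objective.
    The fused score is prod_k(score_k ** weight_k); weights default to 1.0 each.
    """
    if not scores:
        raise ValueError("at least one score list is required")
    if weights is None:
        weights = [1.0] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("one weight per objective is required")
    # zip(*scores) walks the batch: each item holds one example's scores across objectives.
    return [
        math.prod(s ** w for s, w in zip(example, weights))
        for example in zip(*scores)
    ]


def rank_by_score(fused):
    """Return example indices ordered from highest to lowest fused score."""
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
```

With only the `ctr` output available in this notebook, `mix_func(ctr)` simply returns the ctr scores as a list; the exponents matter once more objectives are passed in.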

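## Conclusion

Reading through the tensorflow_serving proto definitions makes clear what information model inference needs:

1. model_spec: specifies the model name and its version. Each served model path is uniquely identified by `model_name/version`.
2. inputs: a dict mapping feature names to feature tensors.
3. outputs: the signature determines which set of key-value outputs is returned.

With these three pieces, the forward pass of the model graph runs and the results are returned. Different service interface types also differ in performance (for example, throughput and maximum concurrency). The most widely used setup in industry at present is still tf-serving deployed on k8s; see [ml-deployment-k8s-tfserving](https://github.com/deep-diver/ml-deployment-k8s-tfserving).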
