# End-to-End BERT (Inference)

This document explains how to convert TensorFlow BERT pretrained weights into a TensorRT engine and then serve inference through TRTIS (TensorRT Inference Server). It covers two main steps.

1. BERT TensorRT inference engine build
<img src="https://developer.nvidia.com/sites/default/files/akamai/deeplearning/tensorrt/trt-info.png" width="600" />
Convert BERT pretrained weights trained with TensorFlow into a TensorRT engine file. This example also covers building the plugins that the conversion requires.

2. Setting up the TRTIS model repository and launching the TRTIS server
<img src="https://developer.nvidia.com/sites/default/files/pictures/2018/trt-inference-server-diagram-1200px.png" width="600" />
Using the finished engine file, this example sets up a model repository for TRTIS and launches TensorRT Inference Server to see it in operation.

## I. Building BERT TensorRT Inference Engine

### 1. Build TensorRT Docker Container

In [1]:
%%bash
cd ../trt
docker build . -f Dockerfile -t bert_trt --rm \
    --build-arg FROM_IMAGE_NAME=nvcr.io/nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04 \
    --build-arg TRT_PKG_VERSION=6.0.1-1+cuda10.1 \
    --build-arg myuid=$(id -u) --build-arg mygid=1000 # $(id -g)

Sending build context to Docker daemon    209MB
Step 1/18 : ARG FROM_IMAGE_NAME
Step 2/18 : FROM ${FROM_IMAGE_NAME}
10.1-cudnn7-devel-ubuntu18.04: Pulling from nvidia/cuda
7ddbc47eeb70: Already exists
c1bbdc448b72: Already exists
8c3b70e39044: Already exists
45d437916d57: Already exists
d8f1569ddae6: Pulling fs layer
85386706b020: Pulling fs layer
ee9b457b77d0: Pulling fs layer
be4f3343ecd3: Pulling fs layer
30b4effda4fd: Pulling fs layer
b398e882f414: Pulling fs layer
be4f3343ecd3: Waiting
b398e882f414: Waiting
30b4effda4fd: Waiting
ee9b457b77d0: Verifying Checksum
ee9b457b77d0: Download complete
d8f1569ddae6: Verifying Checksum
d8f1569ddae6: Download complete
85386706b020: Verifying Checksum
85386706b020: Download complete
d8f1569ddae6: Pull complete
b398e882f414: Download complete
be4f3343ecd3: Verifying Checksum
be4f3343ecd3: Download complete
85386706b020: Pull complete
ee9b457b77d0: Pull complete
30b4effda4fd: Verifying Checksum
30b4effda4fd: Download complete
be4f3343ecd3: Pul

### 2. Run the Container

Use the following docker command to start the container in which the BERT TensorRT engine will be built.

In [2]:
%%bash
cd ..

GPU_ID=${1:-"ALL"}

docker rm -f bert_trt
docker run -d -ti \
    --name bert_trt \
    --runtime=nvidia \
    --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
    -u $(id -u):$(id -g) \
    -e NVIDIA_VISIBLE_DEVICES=${GPU_ID} \
    -v $(pwd)/outputs:/workspace/outputs \
    -v $(pwd)/results/models:/workspace/models \
    -v $(pwd)/trt/TensorRT:/workspace/TensorRT \
    bert_trt bash

e6d760222dce91761fa26b1307871282a70d2730369d4f83ae4a98aea2544383


Error: No such container: bert_trt


In [3]:
!docker ps -a

CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS                  PORTS               NAMES
e6d760222dce        bert_trt            "bash"              1 second ago        Up Less than a second                       bert_trt


### 3. Build TensorRT Plugin Layer library and download pretrained weights

Build the plugin library needed to build the BERT TensorRT engine, and download the pretrained weights used in this example from NGC.

#### 1. Plugin build

In [4]:
%%bash
if [[ -e ../trt/TensorRT/demo/BERT/build ]]; then
    rm -rf ../trt/TensorRT/demo/BERT/build
fi
docker exec -t bert_trt bash /workspace/TensorRT/demo/BERT/python/build_plugins.sh

Building TensorRT plugins for BERT
-- The CXX compiler identification is GNU 7.4.0
-- The CUDA compiler identification is NVIDIA 10.1.243
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc -- works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Configuring done
-- Generating done
-- Build files have been written to: /workspace/TensorRT/demo/BERT/build
[35m[1mScanning dependencies of target common[0m
[  7%] [32mBuilding CXX object CMakeFiles/common.dir/workspace/TensorRT/samples/common/logger.cpp.o[0m
[ 15%] [32m[1mLinking CXX shared library libcommon.so[0m
[ 15%] Built target common
[35m[1mScanning d

The build output is stored in the following path.

In [5]:
!ls ../trt/TensorRT/demo/BERT/build

CMakeCache.txt	cmake_install.cmake  libcommon.so  sample_bert
CMakeFiles	libbert_plugins.so   Makefile


#### 2. Pre-trained weight download
Download the pretrained weights from NGC. NGC provides pretrained weights for the following options, so you can pick the combination you need.

| | options |
|:---:|:--- |
| model | large, base |
| precision | fp32, fp16 |
| seq. length | 128, 384 |

You can of course use weights (ckpt) that you trained yourself.

This example uses bert-large, fp16, and sequence length 128.

In [6]:
%%bash -s 'large' 'fp16' '128'

MODEL=${1:-'large'}
FT_PRECISION=${2:-'fp16'}
SEQ_LEN=${3:-'128'}

if [[ ! -d ../results/models/fine-tuned/bert_tf_v2_${MODEL}_${FT_PRECISION}_${SEQ_LEN}_v2/ ]]; then
    docker exec -t bert_trt bash /workspace/TensorRT/demo/BERT/python/download_fine-tuned_model.sh ${MODEL} ${FT_PRECISION} ${SEQ_LEN}
fi

After the download, the checkpoint directory contains the following files.

In [7]:
%%bash -s 'large' 'fp16' '128'

MODEL=${1:-'large'}
FT_PRECISION=${2:-'fp16'}
SEQ_LEN=${3:-'128'}

ls ../results/models/fine-tuned/bert_tf_v2_${MODEL}_${FT_PRECISION}_${SEQ_LEN}_v2/

bert_config.json
model.ckpt-8144.data-00000-of-00001
model.ckpt-8144.index
model.ckpt-8144.meta
tf_bert_squad_1n_fp16_gbs32.190523100044.log
vocab.txt
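
The two cells above assemble the checkpoint directory name from the model size, precision, and sequence length. As a minimal sketch, the helper below (`bert_ckpt_dir` is a hypothetical name, not part of the TensorRT demo scripts) builds that name and rejects options outside the table above. It assumes the other NGC variants follow the same `bert_tf_v2_..._v2` pattern, which this notebook only shows for large/fp16/128.

```shell
# Hypothetical helper: build the checkpoint directory name used in this notebook.
# Assumption: every model/precision/seq-length combination uses the same pattern.
bert_ckpt_dir() {
    model=${1:-large}
    precision=${2:-fp16}
    seq_len=${3:-128}

    # Accept only the options listed in the table above.
    case "$model" in
        large|base) ;;
        *) echo "unknown model: $model" >&2; return 1 ;;
    esac
    case "$precision" in
        fp32|fp16) ;;
        *) echo "unknown precision: $precision" >&2; return 1 ;;
    esac
    case "$seq_len" in
        128|384) ;;
        *) echo "unknown sequence length: $seq_len" >&2; return 1 ;;
    esac

    printf 'bert_tf_v2_%s_%s_%s_v2\n' "$model" "$precision" "$seq_len"
}

bert_ckpt_dir large fp16 128   # prints bert_tf_v2_large_fp16_128_v2
```

Deriving the name in one place keeps the download, build, and inference cells from drifting apart when the options change.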


### 4. Build TensorRT Engine

Now we build the BERT TensorRT engine. Build it under the same conditions as the pretrained weights downloaded from the NGC model repository earlier, and specify a batch size. The current version supports only batch size 1, so this test uses batch size 1. The build takes about **20 minutes** on a Tesla V100, so start it only when you are ready to wait.

You can also open a separate terminal and run ```watch -n1 nvidia-smi``` to see TensorRT occupying the GPU while it builds the engine. The build uses the GPU because TensorRT measures the performance of the target GPU during the build to find a suitable configuration of GPU kernels.

In [8]:
%%time
%%bash -s 'large' 'fp16' '128' '1' 

MODEL=${1:-'large'}
FT_PRECISION=${2:-'fp16'}
SEQ_LEN=${3:-'128'}
BATCH_SIZE=${4:-'1'}

ENGINE_OUTPUT_DIR="outputs"

if [[ ! -e ${ENGINE_OUTPUT_DIR} ]]; then
    mkdir -p ${ENGINE_OUTPUT_DIR}
fi

docker exec -t \
    bert_trt \
        python3 -W ignore /workspace/TensorRT/demo/BERT/python/bert_builder.py \
            -m /workspace/models/fine-tuned/bert_tf_v2_${MODEL}_${FT_PRECISION}_${SEQ_LEN}_v2/model.ckpt-8144 \
            -c /workspace/models/fine-tuned/bert_tf_v2_${MODEL}_${FT_PRECISION}_${SEQ_LEN}_v2 \
            -o /workspace/outputs/bert_${MODEL}_${SEQ_LEN}.engine \
            -s ${SEQ_LEN} -b ${BATCH_SIZE}

[TensorRT] INFO: Using configuration file: /workspace/models/fine-tuned/bert_tf_v2_large_fp16_128_v2/bert_config.json
[TensorRT] INFO: Found 394 entries in weight map
[TensorRT] INFO: Detected 3 inputs and 1 output network tensors.
[TensorRT] INFO: Detected 3 inputs and 1 output network tensors.
[TensorRT] INFO: Detected 3 inputs and 1 output network tensors.
[TensorRT] INFO: Saving Engine to /workspace/outputs/bert_large_128.engine
[TensorRT] INFO: Done.
CPU times: user 15 ms, sys: 5.97 ms, total: 21 ms
Wall time: 14min 3s


After the script above finishes, the engine file appears in the following path.

In [9]:
!ls ../outputs

bert_large_128.engine
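
Because the engine build is slow, it can help to confirm that the engine file exists and is non-empty before moving on. The following is a small sketch of such a check; `require_engine` is a hypothetical helper name and the check only looks at the file size, not at whether the engine is valid.

```shell
# Hypothetical guard: fail early if the engine file is missing or empty.
require_engine() {
    engine=$1
    # -s is true only for an existing file with size greater than zero.
    if [ ! -s "$engine" ]; then
        echo "engine file missing or empty: $engine" >&2
        return 1
    fi
    echo "engine ready: $engine"
}

# Example call from the notebook directory (path taken from the cell above):
# require_engine ../outputs/bert_large_128.engine
```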


### 5. Inference Test

Now test inference with the TensorRT engine we just built.

In [10]:
%%bash -s 'large' 'fp16' '128' '1'
MODEL=${1:-'large'}
FT_PRECISION=${2:-'fp16'}
SEQ_LEN=${3:-'128'}
BATCH_SIZE=${4:-'1'}

docker exec -t \
    bert_trt \
        python3 -W ignore /workspace/TensorRT/demo/BERT/python/bert_inference.py \
            -e /workspace/outputs/bert_${MODEL}_${SEQ_LEN}.engine -s ${SEQ_LEN} \
            -p "TensorRT is a high performance deep learning inference platform that delivers low latency and high throughput for apps such as recommenders, speech and image/video on NVIDIA GPUs. It includes parsers to import models, and plugins to support novel ops and layers before applying optimizations for inference. Today NVIDIA is open sourcing parsers and plugins in TensorRT so that the deep learning community can customize and extend these components to take advantage of powerful TensorRT optimizations for your apps." \
            -q "What is TensorRT?" \
            -v /workspace/models/fine-tuned/bert_tf_v2_${MODEL}_${FT_PRECISION}_${SEQ_LEN}_v2/vocab.txt


Passage: TensorRT is a high performance deep learning inference platform that delivers low latency and high throughput for apps such as recommenders, speech and image/video on NVIDIA GPUs. It includes parsers to import models, and plugins to support novel ops and layers before applying optimizations for inference. Today NVIDIA is open sourcing parsers and plugins in TensorRT so that the deep learning community can customize and extend these components to take advantage of powerful TensorRT optimizations for your apps.


Question: What is TensorRT?

Running Inference...
------------------------
Running inference in 242.068 Sentences/Sec
------------------------
Processing output 0 in batch
Answer: 'a high performance deep learning inference platform'
With probability: 46.155
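
The throughput reported above can be turned into an approximate per-sentence latency. The sketch below assumes sentences are processed one at a time (batch size 1, no overlap), so latency is simply the reciprocal of throughput; `per_item_latency_ms` is a hypothetical helper name.

```shell
# Convert a throughput in items/sec into milliseconds per item.
# Assumption: items are processed sequentially, one per inference call.
per_item_latency_ms() {
    awk -v rate="$1" 'BEGIN {
        if (rate <= 0) exit 1          # reject zero or negative rates
        printf "%.2f\n", 1000 / rate
    }'
}

per_item_latency_ms 242.068   # prints 4.13
```

About 4.13 ms per sentence is in the same range as the average latency that `perf_client` reports later in this notebook for a single concurrent request.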


### 6. Closing the container

In [11]:
!docker rm -f bert_trt

bert_trt


## II. Building BERT Inference Platform with TRTIS

### 1. Pulling TRTIS docker image

An advantage of TensorRT Inference Server is that it can serve models without a new build. To keep the example running smoothly, first pull the image with the following command.

In [12]:
%%bash
docker pull nvcr.io/nvidia/tensorrtserver:19.10-py3

19.10-py3: Pulling from nvidia/tensorrtserver
5667fdb72017: Already exists
d83811f270d5: Already exists
ee671aafb583: Already exists
7fc152dfb3a6: Already exists
dbc57626691b: Already exists
e20092842144: Already exists
d64c76da70d5: Already exists
429f0b34bf97: Already exists
39d853a0098c: Already exists
dc9dfc23df66: Already exists
1a32524cb863: Already exists
d3d394313ced: Already exists
857b6050fd78: Already exists
3a51649b9b50: Already exists
885e286ed6cc: Already exists
62be33d17790: Already exists
6a7d05a28b83: Already exists
11ff4c1b1e9b: Already exists
252fb308c785: Already exists
4749ee710260: Already exists
47668c0cb079: Already exists
4f9ec6b1521d: Already exists
292b425b68e8: Already exists
93e46b746825: Already exists
d66e2a94ffdd: Pulling fs layer
9ec0ad11e3f4: Pulling fs layer
28efceee1d39: Pulling fs layer
026a283c83f0: Pulling fs layer
af0f2fe8c66a: Pulling fs layer
ef30f655718e: Pulling fs layer
0b20230b4afa: Pulling fs layer
bd575020981a: Pulling fs layer
27fab5730d

### 2. Setting TRTIS model repository

Set up the TRTIS model repository with the following commands. The number 1 is the model version. You can choose any version number, and the inference client can later request a specific version for inference.

In [13]:
%%bash
mkdir -p ../results/trtis_models/bert_large_128_fp16
mkdir -p ../results/trtis_models/bert_large_128_fp16/1

The model repository holds a ```config.pbtxt``` file that describes the model, together with the TensorRT engine file, renamed to ```model.plan``` and stored under its version directory.

In [14]:
%%file ../results/trtis_models/bert_large_128_fp16/config.pbtxt

name: "bert_large_128_fp16"
platform: "tensorrt_plan"
max_batch_size: 1

instance_group [
    {
        count: 1
        kind: KIND_GPU
        gpus: [0]
        profile: "0"
    }
]


Overwriting ../results/trtis_models/bert_large_128_fp16/config.pbtxt


In [15]:
%%bash
cp ../outputs/bert_large_128.engine ../results/trtis_models/bert_large_128_fp16/1/model.plan
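
The layout steps in cells In [13] and In [15] can be bundled into one function. This is a sketch only: `add_model_version` is a hypothetical helper, and it covers just the version directory and ```model.plan```, not ```config.pbtxt```.

```shell
# Hypothetical helper: place an engine file into <repo>/<model>/<version>/model.plan.
add_model_version() {
    repo=$1
    name=$2
    version=$3
    engine=$4

    # Refuse to create an empty version directory for a missing engine.
    if [ ! -f "$engine" ]; then
        echo "no such engine: $engine" >&2
        return 1
    fi

    mkdir -p "$repo/$name/$version" &&
        cp "$engine" "$repo/$name/$version/model.plan"
}

# Example call mirroring the cells above:
# add_model_version ../results/trtis_models bert_large_128_fp16 1 ../outputs/bert_large_128.engine
```

Adding a second version is then a matter of calling the function again with a different version number.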

Besides setting up the model repository, an important step is to tell TRTIS which plugins TensorRT needs for inference. TRTIS has no way of knowing in advance which plugins a TensorRT engine uses. Here, before launching the TRTIS server, we prepare the plugin libraries so that the docker image can reference them.

In [16]:
%%bash
mkdir -p ../trt/plugins
cp ../trt/TensorRT/demo/BERT/build/*.so ../trt/plugins

### 3. Launch Server

Now it is time to launch TensorRT Inference Server. The cell below sets TensorFlow models to run inference in FP16 and sets the paths for the libraries that the TensorRT engine depends on.

In [17]:
%%bash -s "fp16"

cd ..

precision=${1:-"fp16"}
NV_VISIBLE_DEVICES=${NVIDIA_VISIBLE_DEVICES:-"all"}

if [ "$precision" = "fp16" ] ; then
   echo "fp16 activated!"
   export TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE=1
else
   echo "fp32 activated!"
   export TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE=0
fi

# Start TRTIS server in detached state
docker run -d --rm \
   --runtime=nvidia \
   --shm-size=1g \
   --ulimit memlock=-1 \
   --ulimit stack=67108864 \
   -p8000:8000 \
   -p8001:8001 \
   -p8002:8002 \
   --name trt_server_cont \
   -e NVIDIA_VISIBLE_DEVICES=$NV_VISIBLE_DEVICES \
   -e TF_ENABLE_AUTO_MIXED_PRECISION_GRAPH_REWRITE \
   -v $(pwd)/results/trtis_models:/models \
   -v $(pwd)/trt/plugins:/opt/tensorrtserver/lib/plugins \
   -e LD_PRELOAD="/opt/tensorrtserver/lib/plugins/libcommon.so:/opt/tensorrtserver/lib/plugins/libbert_plugins.so" \
   nvcr.io/nvidia/tensorrtserver:19.10-py3 \
        trtserver --model-store=/models --strict-model-config=false

fp16 activated!
1065a3723c3e074ea5b269149a71461730c96962a02cafa9dd9956e3416360ce
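
The `LD_PRELOAD` value in the launch cell is a colon-separated list of plugin libraries under the mounted plugin directory. As a sketch, the hypothetical helper `preload_path` below assembles that string from a directory and an ordered list of library names. The order mirrors the launch cell; whether the order is actually required is an assumption here.

```shell
# Hypothetical helper: join library names under one directory into an
# LD_PRELOAD-style, colon-separated string, preserving the given order.
preload_path() {
    dir=$1
    shift
    out=
    for lib in "$@"; do
        # Prepend ":" only when out already holds an entry.
        out="${out:+$out:}$dir/$lib"
    done
    printf '%s\n' "$out"
}

preload_path /opt/tensorrtserver/lib/plugins libcommon.so libbert_plugins.so
# prints /opt/tensorrtserver/lib/plugins/libcommon.so:/opt/tensorrtserver/lib/plugins/libbert_plugins.so
```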


### 4. Performance Test

#### 1. BERT client build (optional)
If this node is the same node that holds the BERT docker image used for BERT training, you can skip the following command.

In [18]:
%%bash 
cd ..
bash ./scripts/docker/build.sh

19.08-py3: Pulling from nvidia/tensorrtserver
7413c47ba209: Already exists
0fe7e7cbb2e8: Already exists
1d425c982345: Already exists
344da5c95cec: Already exists
ae62549b429d: Already exists
e275e0ef6c20: Already exists
4090c4d315fe: Already exists
00a11b299176: Already exists
74a29ca83919: Already exists
a1abd2d74110: Already exists
90d7249fe09b: Already exists
5db1b1a35ea4: Already exists
b160969adc93: Already exists
0179f14b1047: Already exists
a58b5dcd3fa6: Already exists
e7af950e37dd: Already exists
e880be2d991d: Already exists
b7c0ae26dc75: Already exists
423736729fa4: Already exists
9595d4b4fa6d: Already exists
d18ab9b3cee4: Already exists
d13f74634ff4: Already exists
6465f099eaee: Already exists
1d25a5143caf: Already exists
1488e34e1ef6: Pulling fs layer
c0b9035f7b0d: Pulling fs layer
e12a027580b2: Pulling fs layer
2195a5a8e51b: Pulling fs layer
68d9a4bdc44b: Pulling fs layer
79ac09aadede: Pulling fs layer
4dfca455860d: Pulling fs layer
8031f1622bfe: Pulling fs layer
d70a9aeed3

./scripts/docker/build.sh: line 7: cd: tensorrt-inference-server: No such file or directory


#### 2. BERT TRTIS performance (1 GPU)

In [19]:
%%bash -s "large" "128" "fp16" "1" "1" "3000" "10" "10" "localhost"

MODEL=${1:-"large"}
SEQ_LEN=${2:-"128"}
FT_PRECISION=${3:-"fp16"}
BATCH_SIZE=${4:-1}
MODEL_VERSION=${5:-1}
MAX_LATENCY=${6:-500}
MAX_CLIENT_THREADS=${7:-10}
MAX_CONCURRENCY=${8:-50}
SERVER_HOSTNAME=${9:-"localhost"}

MODEL_NAME="bert_${MODEL}_${SEQ_LEN}_${FT_PRECISION}"
NV_VISIBLE_DEVICES=${NVIDIA_VISIBLE_DEVICES:-"all"}

if [[ $SERVER_HOSTNAME == *":"* ]]; then
  echo "ERROR! Do not include the port when passing the Server Hostname. These scripts require that the TRTIS HTTP endpoint is on Port 8000 and the gRPC endpoint is on Port 8001. Exiting..."
  exit 1
fi

if [[ ! -e ../results/perf_client/${MODEL_NAME} ]]; then
    mkdir -p ../results/perf_client/${MODEL_NAME}
fi

TIMESTAMP=$(date "+%y%m%d_%H%M")
OUTPUT_FILE_CSV="/results/perf_client/${MODEL_NAME}/results_${TIMESTAMP}.csv"

docker run --rm -t \
    --net=host \
    --shm-size=1g \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    -e NVIDIA_VISIBLE_DEVICES=$NV_VISIBLE_DEVICES \
    -u $(id -u):$(id -g) \
    -v $(pwd):/workspace/bert \
    -v $(pwd)/results:/results \
    bert \
        /workspace/install/bin/perf_client \
            --max-threads ${MAX_CLIENT_THREADS} \
            -m ${MODEL_NAME} \
            -x ${MODEL_VERSION} \
            -p 3000 \
            -d \
            -v \
            -i gRPC \
            -u ${SERVER_HOSTNAME}:8001 \
            -b ${BATCH_SIZE} \
            -l ${MAX_LATENCY} \
            -c ${MAX_CONCURRENCY} \
            -f ${OUTPUT_FILE_CSV} \
            -z

                                                                                                                                                
== TensorFlow ==

NVIDIA Release 19.08 (build 7791926)
TensorFlow Version 1.14.0

Container image Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
Copyright 2017-2019 The TensorFlow Authors.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

NOTE: Detected MOFED driver 4.6-1.0.1; version automatically updated.

*** Measurement Settings ***
  Batch size: 1
  Measurement window: 3000 msec
  Latency limit: 3000 msec
  Concurrency limit: 10 concurrent requests
  Stabilizing using average latency

Request concurrency: 1
  Pass [1] throughput: 230 infer/sec. Avg latency: 4336 usec (std 113 usec)
  Pass [2] throughput: 232 infer/sec. Avg latency: 4288 usec (std 7

#### 3. Model reconfiguration & updated inference performance - 2 GPU

In [20]:
%%file ../results/trtis_models/bert_large_128_fp16/config.pbtxt

name: "bert_large_128_fp16"
platform: "tensorrt_plan"
max_batch_size: 1

instance_group [
    {
        count: 1
        kind: KIND_GPU
        gpus: [0, 1]
        profile: "0"
    }
]


Overwriting ../results/trtis_models/bert_large_128_fp16/config.pbtxt
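
The two ```config.pbtxt``` files in this notebook differ only in the `gpus` list. The sketch below generates the same file content for a given model name and GPU list; `write_trt_config` is a hypothetical helper, and the remaining fields are copied verbatim from the cells above.

```shell
# Hypothetical helper: print a config.pbtxt like the ones above for a
# given model name and comma-separated GPU list (for example "0" or "0, 1").
write_trt_config() {
    name=$1
    gpus=$2
    cat <<EOF
name: "$name"
platform: "tensorrt_plan"
max_batch_size: 1

instance_group [
    {
        count: 1
        kind: KIND_GPU
        gpus: [$gpus]
        profile: "0"
    }
]
EOF
}

# Example: redirect the output into the model directory.
# write_trt_config bert_large_128_fp16 "0, 1" > ../results/trtis_models/bert_large_128_fp16/config.pbtxt
```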


In [21]:
%%bash
SERVER_URI=${1:-"localhost"}

echo "Waiting for TRTIS Server to be ready at http://$SERVER_URI:8000..."

live_command="curl -m 1 -L -s -o /dev/null -w %{http_code} http://$SERVER_URI:8000/api/health/live"
ready_command="curl -m 1 -L -s -o /dev/null -w %{http_code} http://$SERVER_URI:8000/api/health/ready"

current_status=$($live_command)

# First check the current status. If that passes, check the json. If either fail, loop
while [[ ${current_status} != "200" ]] || [[ $($ready_command) != "200" ]]; do

   printf "."
   sleep 1
   current_status=$($live_command)
done

echo "TRTIS Server is ready!"

Waiting for TRTIS Server to be ready at http://localhost:8000...
TRTIS Server is ready!
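
The wait loop above polls forever if the server never becomes ready. A minimal sketch of a bounded variant follows; `wait_until` and `server_ready` are hypothetical names, and the health URL is the one used in the cell above.

```shell
# Retry a command once per second until it succeeds or the timeout
# (in seconds) is reached. Returns 0 on success, 1 on timeout.
wait_until() {
    timeout=$1
    shift
    elapsed=0
    until "$@"; do
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
}

# Probe built from the readiness endpoint used in this notebook.
server_ready() {
    [ "$(curl -m 1 -L -s -o /dev/null -w '%{http_code}' \
        "http://${SERVER_URI:-localhost}:8000/api/health/ready")" = "200" ]
}

# Example: give the server up to 60 seconds.
# wait_until 60 server_ready || echo "TRTIS did not become ready in time" >&2
```

A timeout turns a silent hang into an explicit failure, which is easier to notice when the notebook is run unattended.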


In [22]:
%%bash -s "large" "128" "fp16" "1" "1" "3000" "10" "10" "localhost"

MODEL=${1:-"large"}
SEQ_LEN=${2:-"128"}
FT_PRECISION=${3:-"fp16"}
BATCH_SIZE=${4:-1}
MODEL_VERSION=${5:-1}
MAX_LATENCY=${6:-500}
MAX_CLIENT_THREADS=${7:-10}
MAX_CONCURRENCY=${8:-50}
SERVER_HOSTNAME=${9:-"localhost"}

MODEL_NAME="bert_${MODEL}_${SEQ_LEN}_${FT_PRECISION}"
NV_VISIBLE_DEVICES=${NVIDIA_VISIBLE_DEVICES:-"all"}

if [[ $SERVER_HOSTNAME == *":"* ]]; then
  echo "ERROR! Do not include the port when passing the Server Hostname. These scripts require that the TRTIS HTTP endpoint is on Port 8000 and the gRPC endpoint is on Port 8001. Exiting..."
  exit 1
fi

if [[ ! -e ../results/perf_client/${MODEL_NAME} ]]; then
    mkdir -p ../results/perf_client/${MODEL_NAME}
fi

TIMESTAMP=$(date "+%y%m%d_%H%M")
OUTPUT_FILE_CSV="/results/perf_client/${MODEL_NAME}/results_${TIMESTAMP}.csv"

docker run --rm -t \
    --net=host \
    --shm-size=1g \
    --ulimit memlock=-1 \
    --ulimit stack=67108864 \
    -e NVIDIA_VISIBLE_DEVICES=$NV_VISIBLE_DEVICES \
    -u $(id -u):$(id -g) \
    -v $(pwd):/workspace/bert \
    -v $(pwd)/results:/results \
    bert \
        /workspace/install/bin/perf_client \
            --max-threads ${MAX_CLIENT_THREADS} \
            -m ${MODEL_NAME} \
            -x ${MODEL_VERSION} \
            -p 3000 \
            -d \
            -v \
            -i gRPC \
            -u ${SERVER_HOSTNAME}:8001 \
            -b ${BATCH_SIZE} \
            -l ${MAX_LATENCY} \
            -c ${MAX_CONCURRENCY} \
            -f ${OUTPUT_FILE_CSV} \
            -z

                                                                                                                                                
== TensorFlow ==

NVIDIA Release 19.08 (build 7791926)
TensorFlow Version 1.14.0

Container image Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
Copyright 2017-2019 The TensorFlow Authors.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

NOTE: Detected MOFED driver 4.6-1.0.1; version automatically updated.

*** Measurement Settings ***
  Batch size: 1
  Measurement window: 3000 msec
  Latency limit: 3000 msec
  Concurrency limit: 10 concurrent requests
  Stabilizing using average latency

Request concurrency: 1
  Pass [1] throughput: 234 infer/sec. Avg latency: 4269 usec (std 59 usec)
  Pass [2] throughput: 235 infer/sec. Avg latency: 4245 usec (std 40

CalledProcessError: the %%bash cell above (In [22]) returned non-zero exit status 1.