<a href="https://www.nvidia.com/dli"> <img src="images/DLI_Header.png" alt="Header" style="width: 400px;"/> </a>

# 2.0 Hosting the Model

In this notebook, you'll learn strategies to optimize Triton Server to improve the performance of your deployment.


**[2.1 Concurrent Model Execution](#2.1-Concurrent-Model-Execution)**<br>
&nbsp; &nbsp; &nbsp; &nbsp; [2.1.1 Exercise: Usage Considerations](#2.1.1-Exercise:-Usage-Considerations)<br>
&nbsp; &nbsp; &nbsp; &nbsp; [2.1.2 Implementation](#2.1.2-Implementation)<br>
&nbsp; &nbsp; &nbsp; &nbsp; [2.1.3 Exercise: Configure Multiple Instance Groups](#2.1.3-Exercise:-Configure-Multiple-Instance-Groups)<br>
**[2.2 Scheduling Strategies](#2.2-Scheduling-Strategies)**<br>
&nbsp; &nbsp; &nbsp; &nbsp; [2.2.1 Stateless Inference](#2.2.1-Stateless-Inference)<br>
&nbsp; &nbsp; &nbsp; &nbsp; [2.2.2 Stateful Inference](#2.2.2-Stateful-Inference)<br>
&nbsp; &nbsp; &nbsp; &nbsp; [2.2.3 Pipelines / Ensembles](#2.2.3-Pipelines-/-Ensembles)<br>
**[2.3 Dynamic Batching](#2.3-Dynamic-Batching)**<br>
&nbsp; &nbsp; &nbsp; &nbsp; [2.3.1 Exercise: Implement Dynamic Batching](#2.3.1-Exercise:-Implement-Dynamic-Batching)<br>

So far, we've executed customer requests sequentially, in the order they have arrived at the server, and used a static batch of size 8 for any requests to our server. This has not only left our GPUs heavily underutilized, but has also significantly affected the latency of responses received from the server. This is not an uncommon situation. Unless you are developing an application that processes large volumes of data in batch, you will likely be sending individual inference requests from the user application, leading to even further underutilization. As we have seen in the previous notebook, model optimizations do help considerably to accelerate model execution.  However, they do not change the fact that when serving is implemented naively, the nature of the inference workload leads to GPU underutilization.

Inference servers, such as NVIDIA Triton, implement a wide range of features that allow us to improve GPU utilization and reduce request latency. The three that we will discuss in this class are:<br/>
- <a href="https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/architecture.html#section-concurrent-model-execution">Concurrent model execution</a><br/>
- <a href="https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/models_and_schedulers.html">Scheduling</a> <br/>
- <a href="https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/model_configuration.html#section-dynamic-batcher">Dynamic batching</a> <br/>


Please refer to the <a href="https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/quickstart.html">Triton documentation</a> and its <a href="https://github.com/NVIDIA/triton-inference-server">source code</a> for further information about the mechanisms and configurations that can help improve model inference performance.

# 2.1 Concurrent Model Execution
The Triton architecture allows multiple models and/or multiple instances of the same model to execute in parallel on a single GPU. The following figure shows an example with two models: `model0` and `model1`. Assuming Triton is not currently processing any request, when two requests arrive simultaneously, one for each model, Triton immediately schedules both of them onto the GPU, and the GPU’s hardware scheduler begins working on both computations in parallel. <br/>

<img src="images/multi_model_exec.png"/><br/>

#### Default Behavior

By default, if multiple requests for the same model arrive at the same time, Triton will serialize their execution by scheduling only one at a time on the GPU, as shown in the following figure.

<img src="images/multi_model_serial_exec.png"/><br/>

Triton provides an instance-group feature that allows each model to specify how many parallel executions of that model should be allowed. Each such enabled parallel execution is referred to as an *execution instance*. By default, Triton gives each model a single execution instance, which means that only a single execution of the model is allowed to be in progress at a time as shown in the above figure. 

#### Instance Groups
By using the *instance-group* setting, the number of execution instances for a model can be increased. The following figure shows model execution when `model1` is configured to allow three execution instances. As shown in the figure, the first three `model1` inference requests are immediately executed in parallel on the GPU. The fourth `model1` inference request must wait until one of the first three executions completes before beginning.

<img src="images/multi_model_parallel_exec.png"/><br/>
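
The effect of the instance count on queueing can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: the function name `simulate_instances` is made up for this example, every execution is assumed to take the same time, and instances are assumed not to slow each other down when sharing the GPU (which real workloads generally do).

```python
import heapq

def simulate_instances(arrivals, exec_time, instance_count):
    """Return the start time of each request when at most `instance_count`
    executions of one model may be in progress at once.
    Requests are served in arrival order (toy model, not Triton code)."""
    free_at = [0] * instance_count  # min-heap: time at which each instance is free
    heapq.heapify(free_at)
    starts = []
    for arrival in sorted(arrivals):
        earliest_free = heapq.heappop(free_at)
        start = max(arrival, earliest_free)  # wait if no instance is free yet
        starts.append(start)
        heapq.heappush(free_at, start + exec_time)
    return starts

# Four requests for the same model arrive at time 0; each takes 10 time units.
print(simulate_instances([0, 0, 0, 0], 10, 1))  # one instance: fully serialized
print(simulate_instances([0, 0, 0, 0], 10, 3))  # three instances: the fourth request waits
```

With a single instance the start times are 0, 10, 20 and 30. With three instances the first three requests start immediately and the fourth starts at 10, matching the behavior described for the figure above.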


## 2.1.1 Exercise: Usage Considerations

For most models, the Triton feature that provides the largest performance improvement is *dynamic batching*. The key advantages of dynamic batching over setting up multiple instance execution are:
- No overhead for model parameter storage
- No overhead related to model parameter fetch from the GPU memory
- Better utilization of the GPU resources

Before we look at the configuration for multiple-instance execution, let's execute our model again using a single instance, and observe the resource utilization of the GPU. <br>


#### Exercise Steps
1. Launch a terminal window from the JupyterLab launch page.  If you need to open a new launch page, click the '+' icon on the left sidebar menu. You can then use a drag-and-drop action to move the terminal to a sub-window configuration  for better viewing.
2. Execute the following command in the terminal before you run the performance tool:<br>

```
watch -n0.5 nvidia-smi
```
    You should see an output that resembles:
<img src="images/NVIDIASMI.png" style="position:relative; left:30px;" width=800/>

3. Execute the same benchmark we used in the previous notebook, but with the batch size reduced to 1, and observe the <code>nvidia-smi</code> output again.  Pay special attention to the memory consumption and GPU utilization.

In [1]:
# Set the server hostname and check it - you should get a message that "Triton Server is ready!"
tritonServerHostName = "triton"
!./utilities/wait_for_triton_server.sh {tritonServerHostName}

Waiting for Triton Server to be ready at triton:8000...
200
Triton Server is ready!


In [2]:
# Load the previous configuration.
modelVersion="1"
precision="fp32"
batchSize="1"
maxLatency="500"
maxClientThreads="10"
maxConcurrency="2"
dockerBridge="host"
resultsFolderName="1"
profilingData="utilities/profiling_data_int64"

In [3]:
# Update configuration parameters and run profiler.
modelName = "bertQA-onnx-trt-fp16"
maxConcurrency= "10"
batchSize="1"
print("Running: " + modelName)
!bash ./utilities/run_perf_client_local.sh \
                    {modelName} \
                    {modelVersion} \
                    {precision} \
                    {batchSize} \
                    {maxLatency} \
                    {maxClientThreads} \
                    {maxConcurrency} \
                    {tritonServerHostName} \
                    {dockerBridge} \
                    {resultsFolderName} \
                    {profilingData}

Running: bertQA-onnx-trt-fp16
Waiting for Triton Server to be ready at triton:8000...
200
Triton Server is ready!
*** Measurement Settings ***
  Batch size: 1
  Measurement window: 3000 msec
  Latency limit: 500 msec
  Concurrency limit: 10 concurrent requests
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Pass [1] throughput: 36.3333 infer/sec. Avg latency: 27508 usec (std 167 usec)
  Pass [2] throughput: 36 infer/sec. Avg latency: 27664 usec (std 166 usec)
  Pass [3] throughput: 36.6667 infer/sec. Avg latency: 27481 usec (std 59 usec)
  Client: 
    Request count: 110
    Throughput: 36.6667 infer/sec
    Avg latency: 27481 usec (standard deviation 59 usec)
    p50 latency: 27471 usec
    p90 latency: 27527 usec
    p95 latency: 27559 usec
    p99 latency: 27635 usec
    Avg HTTP time: 27471 usec (send 5 usec + response wait 27465 usec + receive 1 usec)
  Server: 
    Inference count: 131
    Execution count: 131
    Successful r

Hopefully, you have observed utilization similar to the following:<br/>
<img src="images/NVIDIASMI2.png" width=800/><br/>

Do you think you will observe a major acceleration as a consequence of increasing the number of instance groups?<br>
Discuss with the instructor.
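
The `nvidia-smi` readings used in this exercise can also be sampled from a notebook cell instead of a terminal. The sketch below relies on the `--query-gpu` and `--format=csv,noheader,nounits` options of `nvidia-smi`; the helper names `parse_gpu_stats` and `read_gpu_stats` are made up for this example, and the sample line passed to the parser is invented, not a measurement from this class.

```python
import subprocess

QUERY_FIELDS = "utilization.gpu,memory.used,memory.total"

def parse_gpu_stats(csv_text):
    """Parse the CSV output of the nvidia-smi query into one dict per GPU.
    Assumes every field is numeric (a GPU reporting [N/A] would fail here)."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (int(field.strip()) for field in line.split(","))
        stats.append({"util_percent": util, "mem_used_mib": used, "mem_total_mib": total})
    return stats

def read_gpu_stats():
    """Run nvidia-smi once and return the parsed statistics (requires a GPU host)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_stats(result.stdout)

# Parsing an invented sample line, so the example runs without a GPU.
print(parse_gpu_stats("35, 4521, 16160\n"))
```

Calling `read_gpu_stats()` in a loop while the benchmark runs gives a rough record of utilization and memory consumption that can be compared between configurations.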

## 2.1.2 Implementation
Let's look at how to enable concurrent execution and what impact it will have on our model performance. Execute the following code cells to export the model in the ONNX format.

In [4]:
modelName = "bertQA-onnx-conexec"
exportFormat = "onnx"

In [5]:
!python ./deployer/deployer.py \
    --{exportFormat} \
    --save-dir ./candidatemodels \
    --triton-model-name {modelName} \
    --triton-model-version 1 \
    --triton-max-batch-size 8 \
    --triton-dyn-batching-delay 0 \
    --triton-engine-count 1 \
    -- --checkpoint ./data/bert_qa.pt \
    --config_file ./bert_config.json \
    --vocab_file ./vocab \
    --predict_file ./squad/v1.1/dev-v1.1.json \
    --do_lower_case \
    --batch_size=8 

deploying model bertQA-onnx-conexec in format onnxruntime_onnx
  'Automatically generated names will be applied to each dynamic axes of input {}'.format(key))
  'Automatically generated names will be applied to each dynamic axes of input {}'.format(key))
  'Automatically generated names will be applied to each dynamic axes of input {}'.format(key))
  'Automatically generated names will be applied to each dynamic axes of input {}'.format(key))
  'Automatically generated names will be applied to each dynamic axes of input {}'.format(key))

conversion correctness test results
-----------------------------------
maximal absolute error over dataset (L_inf):  0.00022935867309570312

average L_inf error over output tensors:  0.0001423954963684082
variance of L_inf error over output tensors:  6.657553323445124e-09
stddev of L_inf error over output tensors:  8.159383140559784e-05

time of error check of native model:  0.4295821189880371 seconds
time of error check of onnx model:  20.64755845069

In [6]:
!ls -alh ./candidatemodels/bertQA-onnx-conexec

total 16K
drwxr-xr-x 3 root root 4.0K Oct  6 08:09 .
drwxr-xr-x 3 root root 4.0K Oct  6 08:08 ..
drwxr-xr-x 2 root root 4.0K Oct  6 08:08 1
-rw-r--r-- 1 root root  569 Oct  6 08:09 config.pbtxt


## 2.1.3 Exercise: Configure Multiple Instance Groups
In order to specify multiple instances, we need to change the "count" value from '1' to a larger number in the `instance_group` section of the "config.pbtxt" configuration file. 


```
instance_group [
    {
        count: 2
        kind: KIND_GPU
        gpus: [ 0 ]
    }
]
```
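
If you prefer to script this change rather than edit the file by hand, a regular expression is enough for a simple case. The following is a minimal sketch, assuming the file contains a single `instance_group` block shaped like the snippet above; `set_instance_count` is a hypothetical helper, not part of Triton or the deployer.

```python
import re

# Matches everything from "instance_group [" up to the digits of the first count field.
COUNT_PATTERN = re.compile(r"(instance_group\s*\[.*?count\s*:\s*)\d+", re.DOTALL)

def set_instance_count(config_text, count):
    """Return config_text with the first `count` field of instance_group replaced."""
    new_text, replaced = COUNT_PATTERN.subn(
        lambda match: match.group(1) + str(count), config_text, count=1
    )
    if replaced == 0:
        raise ValueError("no instance_group count field found")
    return new_text

# Sample text modeled on the snippet above (not read from the real file).
sample = """instance_group [
    {
        count: 1
        kind: KIND_GPU
        gpus: [ 0 ]
    }
]"""
print(set_instance_count(sample, 2))
```

To apply it to the real file you would read `config.pbtxt`, pass its contents through the function, and write the result back. Check the file afterwards, since a text substitution does not validate the configuration.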

#### Exercise Steps:
1. Modify [config.pbtxt](candidatemodels/bertQA-onnx-conexec/config.pbtxt) in the `bertQA-onnx-conexec` deployment just created to specify two instances of our BERT-based question answering model. You should find the default `instance_group` block at the end of the file. Change the `count` value from 1 to 2 (see the [solution](solutions/ex-2-1-3_config.pbtxt) if needed).
2. To make the comparison fair, also enable TensorRT with the addition of an `execution_accelerators` block inside the `optimization` block:

```text
optimization {
   execution_accelerators {
      gpu_execution_accelerator : [ {
         name : "tensorrt"
         parameters { key: "precision_mode" value: "FP16" }
      }]
   }
   cuda { graphs: 0 }
}
```

3. Once you have saved your changes (Main menu: File -> Save File), move the model across to Triton by executing the following command.

In [9]:
!mv ./candidatemodels/bertQA-onnx-conexec model_repository/

mv: inter-device move failed: './candidatemodels/bertQA-onnx-conexec' to 'model_repository/bertQA-onnx-conexec'; unable to remove target: Directory not empty


4. Run our standard stress test against the model. Please compare it to the single instance execution.<br>
   Did the throughput change?<br>
   Did the latency change?

In [10]:
maxConcurrency= "10"
batchSize="1"
print("Running: " + modelName)
!bash ./utilities/run_perf_client_local.sh \
                    {modelName} \
                    {modelVersion} \
                    {precision} \
                    {batchSize} \
                    {maxLatency} \
                    {maxClientThreads} \
                    {maxConcurrency} \
                    {tritonServerHostName} \
                    {dockerBridge} \
                    {resultsFolderName} \
                    {profilingData}

Running: bertQA-onnx-conexec
Waiting for Triton Server to be ready at triton:8000...
200
Triton Server is ready!
*** Measurement Settings ***
  Batch size: 1
  Measurement window: 3000 msec
  Latency limit: 500 msec
  Concurrency limit: 10 concurrent requests
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Pass [1] throughput: 36 infer/sec. Avg latency: 27690 usec (std 325 usec)
  Pass [2] throughput: 36 infer/sec. Avg latency: 27653 usec (std 116 usec)
  Pass [3] throughput: 36.3333 infer/sec. Avg latency: 27687 usec (std 334 usec)
  Client: 
    Request count: 109
    Throughput: 36.3333 infer/sec
    Avg latency: 27687 usec (standard deviation 334 usec)
    p50 latency: 27639 usec
    p90 latency: 27826 usec
    p95 latency: 27907 usec
    p99 latency: 28152 usec
    Avg HTTP time: 27658 usec (send 6 usec + response wait 27650 usec + receive 2 usec)
  Server: 
    Inference count: 130
    Execution count: 130
    Successful reque
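
To answer the two questions above with numbers, the summary lines printed by the performance client can be parsed and compared. This is a sketch: `parse_pass` and `percent_change` are hypothetical helpers, and the two input lines are the final `Pass [3]` lines at request concurrency 1 from the single-instance and two-instance outputs shown in this notebook. The outputs for higher concurrency levels are truncated here, so they are not included.

```python
import re

PASS_PATTERN = re.compile(r"throughput: ([\d.]+) infer/sec\. Avg latency: (\d+) usec")

def parse_pass(line):
    """Extract (throughput in infer/sec, average latency in usec) from a Pass line."""
    match = PASS_PATTERN.search(line)
    if match is None:
        raise ValueError("not a perf client Pass line: " + line)
    return float(match.group(1)), int(match.group(2))

def percent_change(old, new):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old

single = parse_pass("Pass [3] throughput: 36.6667 infer/sec. Avg latency: 27481 usec (std 59 usec)")
double = parse_pass("Pass [3] throughput: 36.3333 infer/sec. Avg latency: 27687 usec (std 334 usec)")

print(f"throughput change: {percent_change(single[0], double[0]):+.2f}%")
print(f"latency change:    {percent_change(single[1], double[1]):+.2f}%")
```

At concurrency 1 both values change by less than one percent, which is expected: with only one request in flight, a second instance has nothing to run in parallel.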

Before we continue, let's free up some GPU memory by moving some of the models out of the Triton model repository.  After removing the following three models, only the `bertQA-torchscript` model should remain.

In [11]:
# Remove models from the inference server by removing them from the model_repository
!mv /dli/task/model_repository/bertQA-onnx /dli/task/candidatemodels/
!mv /dli/task/model_repository/bertQA-onnx-conexec /dli/task/candidatemodels/
!mv /dli/task/model_repository/bertQA-onnx-trt-fp16 /dli/task/candidatemodels/

# List remaining models on the inference server
!ls /dli/task/model_repository

mv: inter-device move failed: '/dli/task/model_repository/bertQA-onnx-conexec' to '/dli/task/candidatemodels/bertQA-onnx-conexec'; unable to remove target: Directory not empty
bertQA-onnx-conexec  bertQA-torchscript


# 2.2 Scheduling Strategies
Triton supports batch inferencing by allowing individual inference requests to specify a batch of inputs. The inferencing for a batch of inputs is performed at the same time, which is especially important for GPUs since it can greatly increase inferencing throughput. In many use cases, however, the individual inference requests are not batched, so they do not gain the throughput benefits of batching. <br/>

The inference server contains multiple scheduling and batching algorithms that support many different model types and use cases. The choice of scheduler / batcher is driven by several factors, the key ones being:
- Stateful / stateless nature of your inference workload
- Whether your application is composed of models served in isolation or whether a more complex pipeline / ensemble is being used

## 2.2.1 Stateless Inference

When dealing with stateless inference (as we are in this class), we have two main options when it comes to scheduling. The first option is the default scheduler, which distributes requests to all instances assigned for inference. This is the preferred option when the structure of the inference workload is well understood and inference takes place at regular batch sizes and time intervals.

The second option is dynamic batching, which combines individual requests into larger batches and, like the default scheduler, distributes those batches across instances. We will discuss this option in the next section of the class.

## 2.2.2 Stateful Inference

A stateful model (or stateful custom backend) maintains state between inference requests. The model expects multiple inference requests that together form a sequence of inferences, and these must be routed to the same model instance so that the state maintained by the model is correctly updated. Moreover, the model may require that Triton provide control signals indicating, for example, the start of a sequence.

The sequence batcher can employ one of two scheduling strategies when deciding how to batch the sequences that are routed to the same model instance. These strategies are Direct and Oldest.

With the Direct scheduling strategy the sequence batcher ensures not only that all inference requests in a sequence are routed to the same model instance, but also that each sequence is routed to a dedicated batch slot within the model instance. This strategy is required when the model maintains state for each batch slot, and is expecting all inference requests for a given sequence to be routed to the same slot so that the state is correctly updated.

With the Oldest scheduling strategy, the sequence batcher ensures that all inference requests in a sequence are routed to the same model instance and then uses the dynamic batcher to combine inferences from different sequences into a single batch that is executed together.
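
The slot bookkeeping behind the Direct strategy can be pictured with a toy allocator. This is an illustrative sketch only, not Triton's implementation: the class `DirectSlotAllocator` and its `route` method are invented for this example, and where this sketch raises an error when no slot is free, a real server would have to hold the new sequence back until a slot is released.

```python
class DirectSlotAllocator:
    """Toy model of one model instance with a fixed number of batch slots."""

    def __init__(self, slot_count):
        self.free_slots = list(range(slot_count))
        self.slot_of = {}  # sequence id -> dedicated batch slot

    def route(self, sequence_id, start=False, end=False):
        """Return the batch slot for a request belonging to `sequence_id`."""
        if start:
            if not self.free_slots:
                raise RuntimeError("no free batch slot for a new sequence")
            self.slot_of[sequence_id] = self.free_slots.pop(0)
        slot = self.slot_of[sequence_id]  # every request of a sequence hits the same slot
        if end:
            del self.slot_of[sequence_id]
            self.free_slots.append(slot)  # slot becomes available for another sequence
        return slot

allocator = DirectSlotAllocator(slot_count=2)
print(allocator.route("seq-a", start=True))  # seq-a gets its own slot
print(allocator.route("seq-b", start=True))  # seq-b gets a different slot
print(allocator.route("seq-a"))              # later requests reuse seq-a's slot
print(allocator.route("seq-a", end=True))    # last request of seq-a frees the slot
print(allocator.route("seq-c", start=True))  # seq-c takes over the freed slot
```

The `start` and `end` flags stand in for the control signals mentioned above. The point of the sketch is that per-slot state stays consistent because a sequence never changes slots while it is active.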

## 2.2.3 Pipelines / Ensembles

An ensemble model represents a pipeline of one or more models and the connection of input and output tensors between those models. Ensemble models are intended to be used to encapsulate a procedure that involves multiple models, such as "data preprocessing -> inference -> data post-processing". Using ensemble models for this purpose can avoid the overhead of transferring intermediate tensors and minimize the number of requests that must be sent to Triton. An example of an ensemble pipeline is illustrated below: <br/>

<img src="images/ensemble_example0.png"/>

The ensemble scheduler must be used for ensemble models, regardless of the scheduler used by the models within the ensemble. With respect to the ensemble scheduler, an ensemble model is not an actual model. Instead, it specifies the data flow between models within the ensemble as a series of steps. The scheduler collects the output tensors of each step and provides them as input tensors for other steps according to the specification. Even so, the ensemble model is still viewed as a single model from the outside.
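
The way steps are wired together by tensor names can be sketched in a few lines of Python. This is a toy illustration, not Triton code: `run_ensemble`, the three stand-in "models", and all tensor names are made up, and the steps simply run in list order, whereas a real scheduler may start a step as soon as its inputs are available.

```python
def run_ensemble(steps, request_inputs):
    """Run each step in order, passing tensors between steps by name."""
    tensors = dict(request_inputs)  # pool of named tensors known so far
    for step in steps:
        # Map ensemble tensor names onto the input names this model expects.
        step_inputs = {model_name: tensors[ensemble_name]
                       for model_name, ensemble_name in step["input_map"].items()}
        step_outputs = step["model"](step_inputs)
        # Publish the model's outputs under their ensemble tensor names.
        for model_name, ensemble_name in step["output_map"].items():
            tensors[ensemble_name] = step_outputs[model_name]
    return tensors

# Stand-ins for "data preprocessing -> inference -> data post-processing".
def preprocess(inputs):
    return {"TOKENS": inputs["RAW_TEXT"].lower().split()}

def infer(inputs):
    return {"SCORE": len(inputs["TOKENS"])}

def postprocess(inputs):
    return {"LABEL": "long" if inputs["SCORE"] > 3 else "short"}

steps = [
    {"model": preprocess, "input_map": {"RAW_TEXT": "text"}, "output_map": {"TOKENS": "tokens"}},
    {"model": infer, "input_map": {"TOKENS": "tokens"}, "output_map": {"SCORE": "score"}},
    {"model": postprocess, "input_map": {"SCORE": "score"}, "output_map": {"LABEL": "label"}},
]

result = run_ensemble(steps, {"text": "What is Triton Inference Server"})
print(result["label"])
```

The caller sends one request and reads one result, while the intermediate tensors (`tokens`, `score`) never leave the pipeline. That is the property that lets an ensemble avoid extra round trips to the server.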

More information on Triton scheduling can be found in the <a href="https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/models_and_schedulers.html#stateless-models">following section of the documentation</a>. In this class, we will focus further on one of the most powerful features of Triton, *dynamic batching*.

# 2.3 Dynamic Batching
Dynamic batching is a feature of Triton that allows inference requests to be combined by the server, so that a batch is created dynamically, resulting in increased throughput.

When a model instance becomes available for inferencing, the dynamic batcher will attempt to create batches from the requests that are available in the scheduler. Requests are added to the batch in the order in which they were received. If the dynamic batcher can form a batch of one of the preferred sizes, it will create a batch of the largest possible preferred size and send it for inferencing. If it cannot form a batch of a preferred size, it will send a batch of the largest size possible that is less than the maximum batch size allowed by the model. 

The dynamic batcher can be configured to allow requests to be delayed for a limited time in the scheduler to allow other requests to join the dynamic batch. For example, the following configuration sets the maximum delay time of 100 microseconds for a request:

```
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```
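
The batch-size rule described above can be written down as a small function. This is a sketch of the rule as stated in the text, not Triton's source code: `choose_batch_size` is a hypothetical name, each queued request is assumed to carry a single input, and the queue delay is left out (the function describes the decision at the moment the batcher stops waiting).

```python
def choose_batch_size(queued, preferred_sizes, max_batch_size):
    """Return how many queued requests to combine into the next batch."""
    if queued <= 0:
        return 0
    limit = min(queued, max_batch_size)
    # Preferred sizes that can actually be formed from the queue.
    candidates = [size for size in preferred_sizes if size <= limit]
    if candidates:
        return max(candidates)  # largest possible preferred size
    return limit                # otherwise the largest batch available

# Using preferred_batch_size: [ 4, 8 ] and a max batch size of 8, as in this notebook.
for queued in (3, 5, 9):
    print(queued, "queued ->", choose_batch_size(queued, [4, 8], 8))
```

With three queued requests no preferred size can be formed, so all three are sent. With five, a batch of 4 is formed and one request stays in the queue. With nine, a batch of 8 is formed. A non-zero `max_queue_delay_microseconds` gives the smaller cases a chance to grow to a preferred size before they are sent.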


## 2.3.1 Exercise: Implement Dynamic Batching


Let's begin again by exporting an ONNX model.

In [12]:
modelName = "bertQA-onnx-trt-dynbatch"
exportFormat = "onnx"

In [13]:
!python ./deployer/deployer.py \
    --{exportFormat} \
    --save-dir ./candidatemodels \
    --triton-model-name {modelName} \
    --triton-model-version 1 \
    --triton-max-batch-size 8 \
    --triton-dyn-batching-delay 0 \
    --triton-engine-count 1 \
    -- --checkpoint ./data/bert_qa.pt \
    --config_file ./bert_config.json \
    --vocab_file ./vocab \
    --predict_file ./squad/v1.1/dev-v1.1.json \
    --do_lower_case \
    --batch_size=8

deploying model bertQA-onnx-trt-dynbatch in format onnxruntime_onnx
  'Automatically generated names will be applied to each dynamic axes of input {}'.format(key))
  'Automatically generated names will be applied to each dynamic axes of input {}'.format(key))
  'Automatically generated names will be applied to each dynamic axes of input {}'.format(key))
  'Automatically generated names will be applied to each dynamic axes of input {}'.format(key))
  'Automatically generated names will be applied to each dynamic axes of input {}'.format(key))

conversion correctness test results
-----------------------------------
maximal absolute error over dataset (L_inf):  0.00022935867309570312

average L_inf error over output tensors:  0.0001423954963684082
variance of L_inf error over output tensors:  6.657553323445124e-09
stddev of L_inf error over output tensors:  8.159383140559784e-05

time of error check of native model:  0.4185009002685547 seconds
time of error check of onnx model:  18.951934

#### Exercise Steps
1. Modify [config.pbtxt](candidatemodels/bertQA-onnx-trt-dynbatch/config.pbtxt) for dynamic batching using the example snippet. 

    ```
    dynamic_batching {
      preferred_batch_size: [ 4, 8 ]
      max_queue_delay_microseconds: 100
    }
    ```
    
2. Enable TensorRT in the optimization block.

    ```
    optimization {
       execution_accelerators {
          gpu_execution_accelerator : [ {
             name : "tensorrt"
             parameters { key: "precision_mode" value: "FP16" }
          }]
       }
       cuda { graphs: 0 }
    }
    ```
3. Once saved, move the model to the Triton model repository and run the performance utility by executing the following cells. ([solution](solutions/ex-2-3-1_config.pbtxt) if needed)

In [14]:
!mv ./candidatemodels/bertQA-onnx-trt-dynbatch model_repository/

In [None]:
modelName = "bertQA-onnx-trt-dynbatch"
maxConcurrency= "10"
batchSize="1"
print("Running: "+modelName)
!bash ./utilities/run_perf_client_local.sh \
                    {modelName} \
                    {modelVersion} \
                    {precision} \
                    {batchSize} \
                    {maxLatency} \
                    {maxClientThreads} \
                    {maxConcurrency} \
                    {tritonServerHostName} \
                    {dockerBridge} \
                    {resultsFolderName} \
                    {profilingData}

Running: bertQA-onnx-trt-dynbatch
Waiting for Triton Server to be ready at triton:8000...
200
..........................Triton Server is ready!
*** Measurement Settings ***
  Batch size: 1
  Measurement window: 3000 msec
  Latency limit: 500 msec
  Concurrency limit: 10 concurrent requests
  Using synchronous calls for inference
  Stabilizing using average latency

Request concurrency: 1
  Pass [1] throughput: 35.3333 infer/sec. Avg latency: 28269 usec (std 465 usec)
  Pass [2] throughput: 35.3333 infer/sec. Avg latency: 28351 usec (std 358 usec)
  Pass [3] throughput: 35.3333 infer/sec. Avg latency: 28315 usec (std 354 usec)
  Client: 
    Request count: 106
    Throughput: 35.3333 infer/sec
    Avg latency: 28315 usec (standard deviation 354 usec)
    p50 latency: 28464 usec
    p90 latency: 28710 usec
    p95 latency: 28806 usec
    p99 latency: 28924 usec
    Avg HTTP time: 28280 usec (send 10 usec + response wait 28268 usec + receive 2 usec)
  Server: 
    Inference count: 127
   

At the higher request concurrency levels, where there are several requests to combine, you should have observed a fairly dramatic improvement in both latency and throughput. 
* How big is the impact in comparison to the vanilla ONNX configuration or vanilla TorchScript? 
* What do you think was bottlenecking the multiple instance implementation?

Discuss the results with the instructor.

<h3 style="color:green;">Congratulations!</h3><br>
You've learned some strategies to improve GPU utilization and reduce latency using:

* Concurrent model execution
* Scheduling
* Dynamic batching

In the next segment of the class, we will assess inference performance more formally across multiple concurrency levels and learn how to analyze it in a structured way. Please proceed to the next notebook:<br>
[3.0 Server Performance](030_ServerPerformance.ipynb)
