# 2.0 Hosting the Model

**[2.1 Concurrent Model Execution](#2.1-Concurrent-Model-Execution)**<br>
&nbsp; &nbsp; &nbsp; &nbsp; [2.1.1 Usage Considerations](#2.1.1-Usage-Considerations)<br>
&nbsp; &nbsp; &nbsp; &nbsp; [2.1.2 Implementation](#2.1.2-Implementation)<br>
&nbsp; &nbsp; &nbsp; &nbsp; [2.1.3 Configure Multiple Instance Groups](#2.1.3-Configure-Multiple-Instance-Groups)<br>
**[2.2 Scheduling Strategies](#2.2-Scheduling-Strategies)**<br>
&nbsp; &nbsp; &nbsp; &nbsp; [2.2.1 Stateless Inference](#2.2.1-Stateless-Inference)<br>
&nbsp; &nbsp; &nbsp; &nbsp; [2.2.2 Stateful Inference](#2.2.2-Stateful-Inference)<br>
&nbsp; &nbsp; &nbsp; &nbsp; [2.2.3 Pipelines / Ensembles](#2.2.3-Pipelines-/-Ensembles)<br>
**[2.3 Dynamic Batching](#2.3-Dynamic-Batching)**<br>
&nbsp; &nbsp; &nbsp; &nbsp; [2.3.1 Implement Dynamic Batching](#2.3.1-Implement-Dynamic-Batching)<br>

# 2.1 Concurrent Model Execution
The Triton architecture allows multiple models and/or multiple instances of the same model to execute in parallel on a single GPU. The following figure shows an example with two models: `model0` and `model1`. Assuming Triton is not currently processing any request, when two requests arrive simultaneously, one for each model, Triton immediately schedules both of them onto the GPU, and the GPU’s hardware scheduler begins working on both computations in parallel. <br>

#### Instance Groups
By using the *instance-group* setting, the number of execution instances for a model can be increased. The following figure shows model execution when `model1` is configured to allow three execution instances. As shown in the figure, the first three `model1` inference requests are immediately executed in parallel on the GPU. The fourth `model1` inference request must wait until one of the first three executions completes before beginning.
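
The queuing behavior described above can be illustrated with a small simulation. This is only a sketch and not Triton's scheduler: the function name `simulate_instances`, the assumption that all requests arrive at the same time, and the fixed per-request duration are all made up for illustration.

```python
import heapq

def simulate_instances(num_requests, instance_count, duration=1.0):
    """Return the start time of each request when all requests arrive at t=0.

    Illustrative model only: each instance runs one request at a time and
    every request takes the same (hypothetical) duration.
    """
    # Min-heap holding the time at which each instance becomes free.
    free_at = [0.0] * instance_count
    heapq.heapify(free_at)
    start_times = []
    for _ in range(num_requests):
        start = heapq.heappop(free_at)  # earliest available instance
        start_times.append(start)
        heapq.heappush(free_at, start + duration)
    return start_times

# Four requests, three instances: the fourth waits for a free instance.
print(simulate_instances(num_requests=4, instance_count=3))  # [0.0, 0.0, 0.0, 1.0]
```

The first three requests start immediately, while the fourth starts only when one of the three instances has finished.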


## 2.1.1 Usage Considerations

For most models, the Triton feature that provides the largest performance improvement is *dynamic batching*. The key advantages of dynamic batching over setting up multiple instance execution are:
- No overhead for model parameter storage
- No overhead related to model parameter fetch from the GPU memory
- Better utilization of the GPU resources

Before we look at the configuration for multiple model execution, let's execute our model again using a single instance, and observe the resource utilization of the GPU. <br>


1. Launch a terminal window from the JupyterLab launch page.  If you need to open a new launch page, click the '+' icon on the left sidebar menu. You can then use a drag-and-drop action to move the terminal to a sub-window configuration  for better viewing.
2. Execute the following command in the terminal before you run the performance tool:<br>

```
watch -n0.5 nvidia-smi
```
    You should see an output that resembles:
<img src="images/NVIDIASMI.png" style="position:relative; left:30px;" width=800/>

3. Execute the same benchmark we used in the previous notebook, but with the batch size reduced to 1, and observe the <code>nvidia-smi</code> output again.  Pay special attention to the memory consumption and GPU utilization.
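
If you prefer to record the numbers rather than watch them, `nvidia-smi` can also print selected fields as CSV. The sketch below is one possible way to read GPU utilization and memory from Python; the function names are hypothetical, and the sample line passed to the parser is invented, not a measurement from this lab.

```python
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total"

def parse_gpu_stats(csv_text):
    """Parse the output of nvidia-smi --query-gpu=... --format=csv,noheader,nounits."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (int(field) for field in line.split(","))
        stats.append({"util_pct": util, "mem_used_mib": used, "mem_total_mib": total})
    return stats

def read_gpu_stats():
    """Run nvidia-smi once and return one dict per GPU (needs an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_stats(out)

# Hypothetical sample line, used here so the parser can be tried without a GPU.
print(parse_gpu_stats("37, 2150, 16384\n"))
```

Calling `read_gpu_stats()` in a loop while the benchmark runs would give a simple log of utilization and memory use over time.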

In [1]:
# Set the server hostname and check it - you should get a message that "Triton Server is ready!"
tritonServerHostName = "triton"
!./utilities/wait_for_triton_server.sh {tritonServerHostName}

Waiting for Triton Server to be ready at triton:8000...
200
Triton Server is ready!
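
The contents of `wait_for_triton_server.sh` are not shown in this notebook. Assuming it polls Triton's HTTP readiness endpoint (`/v2/health/ready`) until it returns status 200, an equivalent check could be sketched in Python as follows. The function names, retry count, and delay are hypothetical.

```python
import time
import urllib.error
import urllib.request

def http_status(url, timeout=2.0):
    """Return the HTTP status code for url, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None      # connection refused, DNS failure, timeout, ...

def wait_for_triton(host, port=8000, attempts=30, delay=1.0, probe=http_status):
    """Poll the readiness endpoint until it returns 200 or attempts run out."""
    url = f"http://{host}:{port}/v2/health/ready"
    for _ in range(attempts):
        if probe(url) == 200:
            return True
        time.sleep(delay)
    return False
```

In this lab the call would be `wait_for_triton("triton")`; the `probe` parameter exists only so the loop can be exercised without a running server.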


In [2]:
# Load the previous configuration.
modelVersion = "1"
precision = "fp32"
batchSize = "8"
maxLatency = "500"
maxClientThreads = "10"
maxConcurrency = "2"
dockerBridge = "host"
resultsFolderName = "1"
profilingData = "utilities/profiling_data_int64"
measurement_request_count = 50
percentile_stability = 85
stability_percentage = 50

In [3]:
%%time
# Update configuration parameters and run profiler.
modelName = "bertQA-onnx-trt-fp16"
maxConcurrency= "10"
batchSize="1"
print("Running: "+modelName)

!./utilities/run_perf_analyzer_local.sh \
                    {modelName} \
                    {modelVersion} \
                    {precision} \
                    {batchSize} \
                    {maxLatency} \
                    {maxClientThreads} \
                    {maxConcurrency} \
                    {tritonServerHostName} \
                    {dockerBridge} \
                    {resultsFolderName} \
                    {profilingData} \
                    {measurement_request_count} \
                    {percentile_stability} \
                    {stability_percentage}

Running: bertQA-onnx-trt-fp16
Waiting for Triton Server to be ready at triton:8000...
200
Triton Server is ready!
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "count_windows" mode for stabilization
  Minimum number of samples in each window: 50
  Latency limit: 500 msec
  Concurrency limit: 10 concurrent requests
  Using synchronous calls for inference
  Stabilizing using p85 latency

Request concurrency: 1
  Pass [1] throughput: 1.81311 infer/sec. p85 latency: 8730 usec
  Pass [2] throughput: 117.945 infer/sec. p85 latency: 8644 usec
  Pass [3] throughput: 116.933 infer/sec. p85 latency: 8741 usec
  Pass [4] throughput: 118.937 infer/sec. p85 latency: 8492 usec
  Client: 
    Request count: 354
    Throughput: 117.938 infer/sec
    Avg client overhead: 0.02%
    Avg send request rate: 117.94 infer/sec
    p50 latency: 8435 usec
    p85 latency: 8660 usec
    p90 latency: 8698 usec
    p95 latency: 8874 usec
    p99 latency: 9007 usec
    Avg gRPC time: 8

You should have observed utilization similar to the following:<br/>
Do you expect to observe a major acceleration as a consequence of increasing the number of instance groups?<br>


## 2.1.2 Implementation
Let's look at how to enable concurrent execution and what impact it has on our model's performance.

In [4]:
modelName = "bertQA-onnx-conexec"
exportFormat = "onnx"

In [5]:
!python ./deployer/deployer.py \
    --{exportFormat} \
    --save-dir ./candidatemodels \
    --triton-model-name {modelName} \
    --triton-model-version 1 \
    --triton-max-batch-size 8 \
    --triton-dyn-batching-delay 0 \
    --triton-engine-count 1 \
    -- --checkpoint ./data/bert_qa.pt \
    --config_file ./bert_config.json \
    --vocab_file ./vocab \
    --predict_file ./squad/v1.1/dev-v1.1.json \
    --do_lower_case \
    --batch_size=8 

deploying model bertQA-onnx-conexec in format onnxruntime_onnx
verbose: False, log level: Level.ERROR


conversion correctness test results
-----------------------------------
maximal absolute error over dataset (L_inf):  0.0135725736618042

average L_inf error over output tensors:  0.008322536945343018
variance of L_inf error over output tensors:  1.9936597752234775e-05
stddev of L_inf error over output tensors:  0.0044650417413765325

time of error check of native model:  0.285872220993042 seconds
time of error check of onnx model:  6.443050384521484 seconds

done


In [6]:
!ls -alh ./candidatemodels/bertQA-onnx-conexec

total 16K
drwxr-xr-x 3 root root 4.0K Mar 19 15:33 .
drwxr-xr-x 3 root root 4.0K Mar 19 15:33 ..
drwxr-xr-x 2 root root 4.0K Mar 19 15:33 1
-rw-r--r-- 1 root root  569 Mar 19 15:33 config.pbtxt


## 2.1.3 Configure Multiple Instance Groups
In order to specify multiple instances, we need to change the "count" value from '1' to a larger number in the `instance_group` section of the "config.pbtxt" configuration file. 


```
instance_group [
    {
        count: 2
        kind: KIND_GPU
        gpus: [ 0 ]
    }
]
```
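
The exercise below asks you to make this change by hand in the editor. For repeated experiments, the same edit could be scripted; the following is a minimal sketch, assuming the file contains a single `instance_group` block with a numeric `count` field. The helper name `set_instance_count` and the sample text are hypothetical.

```python
import re

def set_instance_count(config_text, count):
    """Return config text with the count field of instance_group replaced."""
    # Match from "instance_group [" up to the digits of the first "count:" field.
    pattern = re.compile(r"(instance_group\s*\[.*?\bcount\s*:\s*)\d+", re.DOTALL)
    new_text, replaced = pattern.subn(
        lambda m: m.group(1) + str(count), config_text, count=1
    )
    if replaced == 0:
        raise ValueError("no instance_group count field found")
    return new_text

# Hypothetical fragment standing in for the generated config.pbtxt.
sample = "max_batch_size: 8\ninstance_group [\n  {\n    count: 1\n    kind: KIND_GPU\n  }\n]\n"
print(set_instance_count(sample, 2))
```

To apply it to a real file you would read the file's text, pass it through the helper, and write the result back.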

1. Modify [config.pbtxt](candidatemodels/bertQA-onnx-conexec/config.pbtxt) in the `bertQA-onnx-conexec` deployment just created to specify two instances of our BERT-based question answering model. You should find the default instance_group block at the end of the file. Change the count variable from 1 to 2.  (see the [solution](solutions/ex-2-1-3_config.pbtxt) as needed)
2. To make the comparison fair, also enable TensorRT with the addition of an `execution_accelerators` block inside the `optimization` block:

```text
optimization {
   execution_accelerators {
      gpu_execution_accelerator : [ {
         name : "tensorrt"
         parameters { key: "precision_mode" value: "FP16" }
      }]
   }
   cuda { graphs: 0 }
}
```

3. Once you have saved your changes (Main menu: File -> Save File), move the model across to Triton by executing the following command.

In [7]:
mv ./candidatemodels/bertQA-onnx-conexec model_repository/

4. Run our standard stress test against the model. Please compare it to the single instance execution.<br>
   Did the throughput change?<br>
   Did the latency change?

In [8]:
%%time
modelName = "bertQA-onnx-conexec"
maxConcurrency= "10"
batchSize="1"
print("Running: "+modelName)

!./utilities/run_perf_analyzer_local.sh \
                    {modelName} \
                    {modelVersion} \
                    {precision} \
                    {batchSize} \
                    {maxLatency} \
                    {maxClientThreads} \
                    {maxConcurrency} \
                    {tritonServerHostName} \
                    {dockerBridge} \
                    {resultsFolderName} \
                    {profilingData} \
                    {measurement_request_count} \
                    {percentile_stability} \
                    {stability_percentage}

Running: bertQA-onnx-conexec
Waiting for Triton Server to be ready at triton:8000...
200
Triton Server is ready!
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "count_windows" mode for stabilization
  Minimum number of samples in each window: 50
  Latency limit: 500 msec
  Concurrency limit: 10 concurrent requests
  Using synchronous calls for inference
  Stabilizing using p85 latency

Request concurrency: 1
  Pass [1] throughput: 1.52113 infer/sec. p85 latency: 5988 usec
  Pass [2] throughput: 167.848 infer/sec. p85 latency: 6165 usec
  Pass [3] throughput: 168.861 infer/sec. p85 latency: 6103 usec
  Pass [4] throughput: 167.841 infer/sec. p85 latency: 6141 usec
  Client: 
    Request count: 505
    Throughput: 168.183 infer/sec
    Avg client overhead: 0.06%
    Avg send request rate: 168.18 infer/sec
    p50 latency: 5895 usec
    p85 latency: 6142 usec
    p90 latency: 6192 usec
    p95 latency: 6286 usec
    p99 latency: 6472 usec
    Avg gRPC time: 59
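
To answer the two questions above with numbers, the client summaries of the two runs can be compared directly. The sketch below parses the `Throughput` and `p85 latency` lines; the values are copied from the concurrency 1 summaries shown in this notebook, and the helper name `parse_summary` is hypothetical.

```python
import re

def parse_summary(report):
    """Extract client throughput (infer/sec) and p85 latency (usec) from a report."""
    throughput = re.search(r"^\s*Throughput:\s*([\d.]+)\s*infer/sec", report, re.MULTILINE)
    p85 = re.search(r"^\s*p85 latency:\s*(\d+)\s*usec", report, re.MULTILINE)
    return float(throughput.group(1)), int(p85.group(1))

# Summary lines copied from the two concurrency 1 runs above.
single = """
    Throughput: 117.938 infer/sec
    p85 latency: 8660 usec
"""
two_instances = """
    Throughput: 168.183 infer/sec
    p85 latency: 6142 usec
"""

t1, l1 = parse_summary(single)
t2, l2 = parse_summary(two_instances)
print(f"throughput ratio: {t2 / t1:.2f}x")            # 1.43x
print(f"p85 latency change: {(l2 - l1) / l1:+.1%}")   # -29.1%
```

These figures cover only the concurrency level shown here; the same comparison can be repeated for each concurrency level reported by the tool.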

Before we continue, let's free up some GPU memory by moving some of the models out of the Triton model repository.  After removing the following three models, only the `bertQA-torchscript` model should remain.

In [9]:
# Remove models from the inference server by removing them from the model_repository
!mv -f /dli/task/model_repository/bertQA-onnx /dli/task/candidatemodels/
!mv -f /dli/task/model_repository/bertQA-onnx-conexec /dli/task/candidatemodels/
!mv -f /dli/task/model_repository/bertQA-onnx-trt-fp16 /dli/task/candidatemodels/

# List remaining models on the inference server
!ls /dli/task/model_repository

bertQA-torchscript


# 2.2 Scheduling Strategies
Triton supports batch inferencing by allowing individual inference requests to specify a batch of inputs. Inference for all inputs in a batch is performed at the same time, which is especially important on GPUs because it can greatly increase throughput. In many use cases, however, individual inference requests are not batched, so they do not gain the throughput benefits of batching. <br/>

The inference server contains multiple scheduling and batching algorithms that support many different model types and use cases. The choice of scheduler or batcher is driven by several factors, the key ones being:
- Stateful / stateless nature of your inference workload
- Whether your application is composed of models served in isolation or whether a more complex pipeline / ensemble is being used

## 2.2.1 Stateless Inference

When dealing with stateless inference (as we are in this class), we have two main scheduling options. The first is the default scheduler, which distributes requests across all instances assigned to the model. This is the preferred option when the structure of the inference workload is well understood and inference takes place at regular batch sizes and time intervals.

The second option is dynamic batching, which combines individual requests into larger batches and then, like the default scheduler, distributes those batches across instances.

## 2.2.2 Stateful Inference

A stateful model (or stateful custom backend) maintains state between inference requests. The model expects multiple inference requests that together form a sequence, and all requests in a sequence must be routed to the same model instance so that the state maintained by the model is updated correctly. Moreover, the model may require Triton to provide control signals indicating, for example, the start of a sequence.

With the Direct scheduling strategy the sequence batcher ensures not only that all inference requests in a sequence are routed to the same model instance, but also that each sequence is routed to a dedicated batch slot within the model instance. This strategy is required when the model maintains state for each batch slot, and is expecting all inference requests for a given sequence to be routed to the same slot so that the state is correctly updated.
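
The slot assignment described above can be pictured with a toy router. This is a simplified sketch, not Triton's sequence batcher: the class name `DirectSlotRouter`, its method, and the `start`/`end` flags (standing in for the control signals) are hypothetical.

```python
class DirectSlotRouter:
    """Toy model of the Direct strategy: one sequence per dedicated batch slot."""

    def __init__(self, instance_count, slots_per_instance):
        # Every (instance, slot) pair starts out free.
        self._free = [
            (i, s) for i in range(instance_count) for s in range(slots_per_instance)
        ]
        self._assigned = {}  # sequence id -> (instance, slot)

    def route(self, sequence_id, start=False, end=False):
        """Return the (instance, slot) that must handle this request."""
        if start:
            if sequence_id in self._assigned:
                raise ValueError(f"sequence {sequence_id} already started")
            if not self._free:
                raise RuntimeError("no free slot; a real scheduler would queue the sequence")
            self._assigned[sequence_id] = self._free.pop(0)
        slot = self._assigned[sequence_id]  # KeyError if the sequence never started
        if end:
            # The slot becomes available for a new sequence.
            del self._assigned[sequence_id]
            self._free.append(slot)
        return slot

router = DirectSlotRouter(instance_count=1, slots_per_instance=2)
print(router.route(7, start=True))  # (0, 0)
print(router.route(9, start=True))  # (0, 1)
print(router.route(7))              # (0, 0), same slot as before
```

Every request of sequence 7 lands in the same slot until the sequence ends, which is the property a model needs when it keeps state per batch slot.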

## 2.2.3 Pipelines / Ensembles

An ensemble model represents a pipeline of one or more models and the connection of input and output tensors between those models. Ensemble models are intended to be used to encapsulate a procedure that involves multiple models, such as "data preprocessing -> inference -> data post-processing". Using ensemble models for this purpose can avoid the overhead of transferring intermediate tensors and minimize the number of requests that must be sent to Triton. <br/>
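
As a rough illustration of such a pipeline, the sketch below chains three stand-in functions through named tensors. It is not Triton's ensemble scheduler: it runs the steps strictly in order, the three "models" are trivial placeholders, and all tensor names are hypothetical. The `input_map` and `output_map` keys loosely mirror the fields of the same name in an ensemble step configuration.

```python
def run_ensemble(steps, request):
    """Run steps in order, passing named tensors through a shared dictionary."""
    tensors = dict(request)
    for step in steps:
        # Map ensemble tensor names onto the step's own input names.
        inputs = {arg: tensors[name] for arg, name in step["input_map"].items()}
        outputs = step["model"](**inputs)
        # Publish the step's outputs under ensemble tensor names.
        for out_name, tensor_name in step["output_map"].items():
            tensors[tensor_name] = outputs[out_name]
    return tensors

# Placeholder "models" standing in for preprocessing, inference, post-processing.
def preprocess(text):
    return {"ids": [len(word) for word in text.split()]}

def infer(ids):
    return {"logits": [2 * i for i in ids]}

def postprocess(logits):
    return {"answer": max(logits)}

steps = [
    {"model": preprocess, "input_map": {"text": "RAW_TEXT"}, "output_map": {"ids": "TOKEN_IDS"}},
    {"model": infer, "input_map": {"ids": "TOKEN_IDS"}, "output_map": {"logits": "LOGITS"}},
    {"model": postprocess, "input_map": {"logits": "LOGITS"}, "output_map": {"answer": "ANSWER"}},
]

result = run_ensemble(steps, {"RAW_TEXT": "what is triton"})
print(result["ANSWER"])  # 12
```

The caller sends one request and receives one result, while the intermediate tensors (`TOKEN_IDS`, `LOGITS`) never leave the pipeline.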

The ensemble scheduler must be used for ensemble models, regardless of the scheduler used by the models within the ensemble. From the ensemble scheduler's point of view, an ensemble model is not an actual model. Instead, it specifies the data flow between the models within the ensemble as a series of steps. The scheduler collects the output tensors of each step and provides them as input tensors to other steps according to that specification. Even so, the ensemble model is still viewed as a single model from the outside.

Next, we will focus on one of the most powerful features of Triton, *dynamic batching*.

# 2.3 Dynamic Batching
Dynamic batching is a feature of Triton that allows inference requests to be combined by the server, so that a batch is created dynamically, resulting in increased throughput.

When a model instance becomes available for inferencing, the dynamic batcher attempts to create batches from the requests that are available in the scheduler. Requests are added to the batch in the order in which they were received. If the dynamic batcher can form a batch of one of the preferred sizes, it creates a batch of the largest possible preferred size and sends it for inferencing. If it cannot form a batch of a preferred size, it sends a batch of the largest size possible that does not exceed the maximum batch size allowed by the model.
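
The selection rule in the previous paragraph can be written down as a short function. This is a simplified sketch that assumes each queued request carries a single input and ignores the queue delay; the function name `choose_batch_size` is hypothetical and the code is not taken from Triton.

```python
def choose_batch_size(queued, preferred_sizes, max_batch_size):
    """Pick a batch size for the requests currently waiting in the queue."""
    if queued <= 0:
        return 0
    # Never take more requests than are queued or than the model allows.
    limit = min(queued, max_batch_size)
    # Prefer the largest preferred size that can be filled right now.
    candidates = [size for size in preferred_sizes if size <= limit]
    if candidates:
        return max(candidates)
    # Otherwise send everything available, up to the model's maximum.
    return limit

for queued in (3, 5, 10):
    print(queued, "->", choose_batch_size(queued, preferred_sizes=[4, 8], max_batch_size=8))
# 3 -> 3
# 5 -> 4
# 10 -> 8
```

With five requests queued, only four are sent because 4 is a preferred size; the fifth waits for the next batch.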

The dynamic batcher can be configured to allow requests to be delayed for a limited time in the scheduler so that other requests can join the dynamic batch. For example, the following configuration sets a maximum delay of 100 microseconds for a request:

```
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```


## 2.3.1 Implement Dynamic Batching


Let's begin again by exporting an ONNX model.

In [10]:
modelName = "bertQA-onnx-trt-dynbatch"
exportFormat = "onnx"

In [11]:
!python ./deployer/deployer.py \
    --{exportFormat} \
    --save-dir ./candidatemodels \
    --triton-model-name {modelName} \
    --triton-model-version 1 \
    --triton-max-batch-size 8 \
    --triton-dyn-batching-delay 0 \
    --triton-engine-count 1 \
    -- --checkpoint ./data/bert_qa.pt \
    --config_file ./bert_config.json \
    --vocab_file ./vocab \
    --predict_file ./squad/v1.1/dev-v1.1.json \
    --do_lower_case \
    --batch_size=8

deploying model bertQA-onnx-trt-dynbatch in format onnxruntime_onnx
verbose: False, log level: Level.ERROR


conversion correctness test results
-----------------------------------
maximal absolute error over dataset (L_inf):  0.0135725736618042

average L_inf error over output tensors:  0.008322536945343018
variance of L_inf error over output tensors:  1.9936597752234775e-05
stddev of L_inf error over output tensors:  0.0044650417413765325

time of error check of native model:  0.28313279151916504 seconds
time of error check of onnx model:  6.684637546539307 seconds

done



1. Modify [config.pbtxt](candidatemodels/bertQA-onnx-trt-dynbatch/config.pbtxt) for dynamic batching using the example snippet. 

    ```
    dynamic_batching {
      preferred_batch_size: [ 4, 8 ]
      max_queue_delay_microseconds: 100
    }
    ```
    
2. Enable TensorRT in the optimization block.

    ```
    optimization {
       execution_accelerators {
          gpu_execution_accelerator : [ {
             name : "tensorrt"
             parameters { key: "precision_mode" value: "FP16" }
          }]
       }
       cuda { graphs: 0 }
    }
    ```
3. Once saved, move the model to the Triton model repository and run the performance utility by executing the following cells. ([solution](solutions/ex-2-3-1_config.pbtxt) if needed)

In [13]:
mv ./candidatemodels/bertQA-onnx-trt-dynbatch model_repository/

In [14]:
%%time
# warm up model with some inferences for faster analysis  (takes about 5 minutes)
modelName = "bertQA-onnx-trt-dynbatch"
batchSize = 8
!./utilities/run_warmup.sh {modelName} {batchSize}
batchSize = 4
!./utilities/run_warmup.sh {modelName} {batchSize}
batchSize = 1
!./utilities/run_warmup.sh {modelName} {batchSize}

Waiting for Triton Server to be ready at triton:8000...
200
...Triton Server is ready!
*** Measurement Settings ***
  Batch size: 8
  Service Kind: Triton
  Using "count_windows" mode for stabilization
  Minimum number of samples in each window: 25
  Using synchronous calls for inference
  Stabilizing using p85 latency

Request concurrency: 1
  Client: 
    Request count: 78
    Throughput: 207.874 infer/sec
    p50 latency: 38105 usec
    p85 latency: 38614 usec
    p90 latency: 38739 usec
    p95 latency: 39322 usec
    p99 latency: 39916 usec
    Avg gRPC time: 38201 usec ((un)marshal request/response 6 usec + response wait 38195 usec)
  Server: 
    Inference count: 624
    Execution count: 78
    Successful request count: 78
    Avg request latency: 37768 usec (overhead 53 usec + queue 65 usec + compute input 26 usec + compute infer 37616 usec + compute output 8 usec)

Inferences/Second vs. Client p85 Batch Latency
Concurrency: 1, throughput: 207.874 infer/sec, latency 38614 usec
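
The server-side counters in this output give a quick way to check how many inputs were processed per model execution: the inference count includes every input in a batch, while the execution count is the number of batched executions. The sketch below uses the values from the warm-up output above; the function name is hypothetical.

```python
def average_batch_size(inference_count, execution_count):
    """Average number of inputs processed per model execution."""
    if execution_count == 0:
        raise ValueError("no executions recorded")
    return inference_count / execution_count

# Values copied from the "Server" section of the warm-up output above.
print(average_batch_size(inference_count=624, execution_count=78))  # 8.0

# The reported average request latency is the sum of its components (usec).
breakdown = {
    "overhead": 53,
    "queue": 65,
    "compute input": 26,
    "compute infer": 37616,
    "compute output": 8,
}
print(sum(breakdown.values()))  # 37768
```

Here the average is 8 because the client itself sent batches of 8. When the client sends batch size 1 and dynamic batching is active, an average above 1 indicates that the server is combining requests.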


In [15]:
%%time
modelName = "bertQA-onnx-trt-dynbatch"
maxConcurrency= "10"
batchSize="1"
print("Running: "+modelName)

!./utilities/run_perf_analyzer_local.sh \
                    {modelName} \
                    {modelVersion} \
                    {precision} \
                    {batchSize} \
                    {maxLatency} \
                    {maxClientThreads} \
                    {maxConcurrency} \
                    {tritonServerHostName} \
                    {dockerBridge} \
                    {resultsFolderName} \
                    {profilingData} \
                    {measurement_request_count} \
                    {percentile_stability} \
                    {stability_percentage}

Running: bertQA-onnx-trt-dynbatch
Waiting for Triton Server to be ready at triton:8000...
200
Triton Server is ready!
*** Measurement Settings ***
  Batch size: 1
  Service Kind: Triton
  Using "count_windows" mode for stabilization
  Minimum number of samples in each window: 50
  Latency limit: 500 msec
  Concurrency limit: 10 concurrent requests
  Using synchronous calls for inference
  Stabilizing using p85 latency

Request concurrency: 1
  Pass [1] throughput: 112.959 infer/sec. p85 latency: 8981 usec
  Pass [2] throughput: 114.925 infer/sec. p85 latency: 8881 usec
  Pass [3] throughput: 113.947 infer/sec. p85 latency: 8951 usec
  Client: 
    Request count: 342
    Throughput: 113.944 infer/sec
    Avg client overhead: 0.02%
    Avg send request rate: 114.28 infer/sec
    p50 latency: 8707 usec
    p85 latency: 8936 usec
    p90 latency: 8999 usec
    p95 latency: 9104 usec
    p99 latency: 9638 usec
    Avg gRPC time: 8762 usec (marshal 3 usec + response wait 8758 usec + unmarsha

You should have observed a fairly dramatic improvement in both latency and throughput.
* What do you think was bottlenecking the multiple instance implementation?
