# 3.0 Server Performance

**[3.1 Assessing the impact of Optimizations](#3.1-Assessing-the-impact-of-Optimizations)**<br>
&nbsp; &nbsp; &nbsp; &nbsp; [3.1.1 Profile the Model](#3.1.1-Profile-the-Model)<br>
**[3.2 Monitoring and Responding to Performance Fluctuations](#3.2-Monitoring-and-Responding-to-Performance-Fluctuations)**<br>
&nbsp; &nbsp; &nbsp; &nbsp; [3.2.1 Viewing Prometheus Metrics](#3.2.1-Viewing-Prometheus-Metrics)<br>
&nbsp; &nbsp; &nbsp; &nbsp; [3.2.2 Interpreting the Metrics](#3.2.2-Interpreting-the-Metrics)<br>

# 3.1 Assessing the impact of Optimizations
The performance tool that we've been using does more than display results on the screen: it also saves the data as a CSV table at the following location:

<code>"./results/${MODEL_NAME}/results${RESULTS_ID}_${TIMESTAMP}.csv"</code>

To assess the impact of the various optimizations, let's take advantage of the previously generated log files.

## 3.1.1 Profile the Model
We executed <code>bertQA-torchscript</code> as well as <code>bertQA-onnx-trt-dynbatch</code> earlier, so the logs from those runs should already be saved. Let's look at the contents of the corresponding results folders. If you have executed the performance tool more than once, you might see multiple log files with different timestamps.

In [1]:
!ls ./results/bertQA-torchscript/results_1*
!ls ./results/bertQA-onnx-trt-dynbatch/results_1*

./results/bertQA-torchscript/results_1_240319_1505.csv
./results/bertQA-onnx-trt-dynbatch/results_1_240319_1554.csv
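To compare the two runs from Python rather than by eye, the saved files can be loaded with the standard `csv` module. The following is a minimal sketch, not part of the performance tool: the helper names `latest_results` and `load_rows` are hypothetical, and it assumes the `results_1_<timestamp>.csv` naming seen in the listing above, with timestamps that sort chronologically as text. It prints the column names instead of assuming what they are.

```python
import csv
import glob
import os


def latest_results(model_name, results_dir="./results", results_id="1"):
    """Return the path of the newest results CSV for a model, or None.

    Assumes file names like results_1_240319_1505.csv, so that sorting the
    names as text also sorts them by time.
    """
    pattern = os.path.join(results_dir, model_name, f"results_{results_id}_*.csv")
    paths = sorted(glob.glob(pattern))
    return paths[-1] if paths else None


def load_rows(path):
    """Read a results CSV into a list of dicts keyed by the header row."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


for model in ("bertQA-torchscript", "bertQA-onnx-trt-dynbatch"):
    path = latest_results(model)
    if path is None:
        print(f"{model}: no results file found")
        continue
    rows = load_rows(path)
    print(f"{model}: {path} ({len(rows)} rows)")
    # Show the header so you can pick the columns you want to compare.
    print("  columns:", list(rows[0].keys()) if rows else "(empty file)")
```

Once you know the column names in your files, you can compare the same column across both models row by row.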


# 3.2 Monitoring and Responding to Performance Fluctuations

Understanding the performance of your inference server is critical not only at the initial planning stage but throughout the lifetime of the application. The ability to capture metrics describing server performance is central to responding to issues, and it is also the foundation of more advanced features such as automatic scaling. The diagram below shows a simplified view of the Triton deployment architecture. With [Kubernetes](https://kubernetes.io/), you can create a configuration that scales automatically with increased demand within your data center or, if necessary, bursts the excess workload to one or more clouds. <br/>

<img width=700 src="images/DeploymentArchitecture.png"/>

## 3.2.1 Viewing Prometheus Metrics
Triton exposes [Prometheus](https://prometheus.io/) performance metrics for monitoring on port 8002 by default. These include metrics on GPU power usage, GPU memory, request counts, and latency measures. Let's query the metrics captured throughout our performance runs:

In [2]:
# Set the server hostname and check it - you should get a message that "Triton Server is ready!"
tritonServerHostName = "triton"
!./utilities/wait_for_triton_server.sh {tritonServerHostName}

Waiting for Triton Server to be ready at triton:8000...
200
Triton Server is ready!


In [4]:
# Use a curl command to request the metrics
prometheus_url = tritonServerHostName + ":8002/metrics"
!curl -v {prometheus_url}

*   Trying 172.20.0.3:8002...
* Connected to triton (172.20.0.3) port 8002 (#0)
> GET /metrics HTTP/1.1
> Host: triton:8002
> User-Agent: curl/7.81.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Type: text/plain; charset=utf-8
< Content-Length: 5321
< 
# HELP nv_inference_request_success Number of successful inference requests, all batch sizes
# TYPE nv_inference_request_success counter
nv_inference_request_success{gpu_uuid="GPU-adf51400-e5c2-11ee-bd50-8f821a947b31",model="bertQA-onnx-trt-dynbatch",version="1"} 5460
nv_inference_request_success{gpu_uuid="GPU-adf51400-e5c2-11ee-bd50-8f821a947b31",model="bertQA-torchscript",version="1"} 323
# HELP nv_inference_request_failure Number of failed inference requests, all batch sizes
# TYPE nv_inference_request_failure counter
nv_inference_request_failure{gpu_uuid="GPU-adf51400-e5c2-11ee-bd50-8f821a947b31",model="bertQA-onnx-trt-dynbatch",version="1"} 0
nv_inference_request_failure{gpu_uuid="GPU-adf5140
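
The full `curl` output is long, so it can help to fetch the same endpoint from Python and keep only the metric you are interested in. This is a minimal sketch using only the standard library; the function names `fetch_metrics` and `select_metric` are hypothetical, and the hostname and port are the ones used in the cells above.

```python
import urllib.request


def fetch_metrics(url, timeout=5.0):
    """Return the body of a Prometheus metrics endpoint as text."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def select_metric(text, metric_name):
    """Keep the HELP/TYPE comments and the sample lines for one metric."""
    keep = []
    for line in text.splitlines():
        if line.startswith("#"):
            # Comment lines look like: "# HELP <name> <description>"
            parts = line.split(None, 3)
            if len(parts) >= 3 and parts[2] == metric_name:
                keep.append(line)
        elif line.startswith(metric_name + "{") or line.startswith(metric_name + " "):
            keep.append(line)
    return keep
```

For example, `select_metric(fetch_metrics("http://triton:8002/metrics"), "nv_inference_request_success")` should return just the lines for the successful request counter, assuming the server is reachable under that name.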

## 3.2.2 Interpreting the Metrics
The Prometheus metrics output is a list of metrics, each given in the form:

```
# HELP <metric_name and description>
# TYPE <metric_name and type>
metric_name{gpu_uuid="GPU-xxxxxx",...} <data>
```

For example, if the inference server hosts two models, the list should include some metrics that are specific to each model and others that describe the GPU they share.<br>
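
A sample line in this format can be split into its metric name, labels, and value with a small regular expression. The sketch below is a simplification for illustration, not a full parser for the Prometheus text format: `parse_sample` is a hypothetical helper, escaped characters inside label values are left as written, and an optional trailing timestamp is ignored.

```python
import re

# Simplified patterns for one sample line: name, optional {labels}, value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line):
    """Split a sample line into (metric_name, labels_dict, float_value)."""
    m = SAMPLE_RE.match(line)
    if m is None:
        raise ValueError(f"not a metric sample line: {line!r}")
    labels = dict(LABEL_RE.findall(m.group("labels") or ""))
    return m.group("name"), labels, float(m.group("value"))


line = ('nv_inference_count{gpu_uuid="GPU-640c6e00-43dd-9fae-9f9a-cb6af82df8e9",'
        'model="bertQA-torchscript",version="1"} 717.000000')
name, labels, value = parse_sample(line)
print(name, labels["model"], value)  # nv_inference_count bertQA-torchscript 717.0
```

Lines that start with `#` are comments (`HELP` and `TYPE`), so `parse_sample` rejects them with a `ValueError`; skip those lines before parsing.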

#### Count Example
The following example indicates that the inference count for the `bertQA-onnx-trt-dynbatch` model is 10,105 so far, while the inference count for the `bertQA-torchscript` model is 717.<br>What do your results show?
```
# HELP nv_inference_count Number of inferences performed
# TYPE nv_inference_count counter
nv_inference_count{gpu_uuid="GPU-640c6e00-43dd-9fae-9f9a-cb6af82df8e9",model="bertQA-onnx-trt-dynbatch",version="1"} 10105.000000
nv_inference_count{gpu_uuid="GPU-640c6e00-43dd-9fae-9f9a-cb6af82df8e9",model="bertQA-torchscript",version="1"} 717.000000
```
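
Because a counter is cumulative, a single reading only says how many inferences have run since the server started. To estimate throughput, read the counter twice and divide the difference by the time between the readings. The sketch below illustrates that arithmetic; `counter_rate` is a hypothetical helper, and the second reading (10,705 after 60 seconds) is an invented value for illustration, not a measured result.

```python
def counter_rate(previous, current, interval_seconds):
    """Average per-second increase of a counter between two readings."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # A counter only goes up. If it dropped, assume the server restarted and
    # the counter began again from zero, so the new value is the increase.
    delta = current - previous if current >= previous else current
    return delta / interval_seconds


first_reading = 10105    # nv_inference_count from the example above
second_reading = 10705   # hypothetical value read 60 seconds later
print(counter_rate(first_reading, second_reading, 60))  # 10.0
```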

#### GPU Power Example
The following example indicates that the current GPU power usage is about 40 watts.<br>What do your results show?
```
# HELP nv_gpu_power_usage GPU power usage in watts
# TYPE nv_gpu_power_usage gauge
nv_gpu_power_usage{gpu_uuid="GPU-640c6e00-43dd-9fae-9f9a-cb6af82df8e9"} 39.958000
```

#### What Do Your Results Indicate?

* Can you identify the current utilization rate? 
* Why is it zero? 
* How much memory are we using? 
* Why do you think we are using the GPU memory even though there are no requests executed against our server? 


You have successfully configured optimizations and profiled the model.<br>