# TensorRT Engine Report Card

Use this Jupyter worksheet to get a quick overview of the structure and characteristics of a TensorRT Engine plan.

## Load JSON Files

In [None]:
import IPython
from ipywidgets import widgets
from trex import *

# Choose an engine file to load (uncomment the one you want).
# engine_name = "../tests/inputs/mobilenet.qat.onnx.engine"
engine_name = "../tests/inputs/mobilenet_v2_residuals.qat.onnx.engine"

set_wide_display()

# Execute the cell, then press the Select button, choose an engine file, and move to the next cell.
rootdir = '../tests/inputs/'
fc = display_filechooser(rootdir)

In [None]:
if fc.selected is not None:
    engine_name = fc.selected

assert engine_name is not None
plan = EnginePlan(f"{engine_name}.graph.json", f"{engine_name}.profile.json", f"{engine_name}.profile.metadata.json")
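
The cell above expects three JSON files that share the engine file name as a prefix. As a minimal sketch (the helper `missing_plan_files` is hypothetical and not part of `trex`), the following check reports which of those files are absent before `EnginePlan` is constructed:

```python
from pathlib import Path

# Suffixes copied from the EnginePlan call in this notebook.
PLAN_SUFFIXES = (".graph.json", ".profile.json", ".profile.metadata.json")


def missing_plan_files(engine_name, suffixes=PLAN_SUFFIXES):
    """Return the expected JSON file paths that do not exist on disk.

    Hypothetical helper for illustration; it is not a trex API.
    """
    candidates = [f"{engine_name}{suffix}" for suffix in suffixes]
    return [path for path in candidates if not Path(path).is_file()]
```

Calling `missing_plan_files(engine_name)` before loading the plan returns an empty list when all three files are present.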

### Render Graph

In [None]:
report_card_draw_plan_graph(plan, engine_name)

<html><div style="text-align:center;background:#76b900;padding:20px;color:#ffffff;font-size:2em;">Plan Summary</div></html>

In [None]:
# "Average time" is the sum of the layer latencies, measured when profiling each layer separately.
# "Latency" is the [min, max, mean, median, 99th percentile] of the engine latency measurements, taken when timing the engine without layer profiling.
plan.summary()
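
To make the two terms in the comments above concrete, here is a small sketch using only the Python standard library. The function names are hypothetical, and the percentile interpolation method (`inclusive`) is an assumption; `trex` may compute its statistics differently.

```python
import statistics


def latency_summary(samples_ms):
    """Return [min, max, mean, median, 99th percentile] of latency samples."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two latency samples")
    # quantiles(n=100) yields 99 cut points; the last one is the 99th percentile.
    # The "inclusive" interpolation method is an assumption made for this sketch.
    p99 = statistics.quantiles(samples_ms, n=100, method="inclusive")[-1]
    return [
        min(samples_ms),
        max(samples_ms),
        statistics.mean(samples_ms),
        statistics.median(samples_ms),
        p99,
    ]


def average_time(layer_latencies_ms):
    """Sum the per-layer latencies, as described for "Average time"."""
    return sum(layer_latencies_ms)
```

The two figures usually differ because per-layer profiling adds overhead that is absent when the whole engine is timed.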

In [None]:
layer_latency_sunburst(plan.df, "Layers Latencies (%)")

In [None]:
report_card_table_view(plan)

### Timings

In [None]:
plot_engine_timings(timing_json_file=f"{engine_name}.timing.json")

## Performance

In [None]:
report_card_perf_overview(plan);

## Memory Footprint

In [None]:
report_card_memory_footprint(plan);

<html><div style="text-align:center;background:#76b900;padding:20px;color:#ffffff;font-size:2em;">Convolutions</div></html>

In [None]:
convs = plan.get_layers_by_type('Convolution')
report_card_convolutions_overview(convs)

## Tactics

In [None]:
from functools import partial

latency_vs_prec_per_conv = partial(
    plotly_bar2,
    convs,
    values_col='latency.pct_time',
    names_col='Name',
    color='tactic')

latency_vs_prec_per_conv("Latency per Layer (color=Tactics)")

tactic_cnt = group_count(plan.df, 'tactic')
display_df(tactic_cnt)

## Experimental

The data below are based on partial information and are provided here just for exploration and fun.

Examples of simplifying assumptions:
* Convolutions and matrix-multiplications are implemented using implicit-gemm. In practice, various algorithms might be used, which will affect the arithmetic-intensity.
* Input/output activations (feature-maps) and parameters (weights and other constants) are read once, and from device global memory. In practice, it is likely that activations will be resident in the L2 cache, which has a much higher bandwidth compared to global memory. Matrix multiplication is performed using activation tiles which are read (reused) multiple times. These reads are ignored in the calculation of the arithmetic-intensity.
* Compute-efficiency and memory-efficiency are calculated by dividing the MACs or bytes by the total time per layer, instead of dividing by the time spent only on compute or only on memory access.
* Some convolutions have fused operations (e.g. SiLU activation) which are currently ignored in the calculation of the number of operations.
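
The assumptions above can be sketched in a few lines of Python. This is an illustration only, not the `trex` implementation: the function names are hypothetical, the convolution is assumed to be dense (no groups), and arithmetic intensity is expressed here in MACs per byte (NVIDIA's performance guides count FLOPs, i.e. two per MAC).

```python
def conv_as_implicit_gemm(n, c, h, w, k, r, s, p, q, bytes_per_element=2):
    """Map a dense convolution to GEMM dimensions and count MACs and bytes.

    n: batch, c: input channels, h/w: input height/width,
    k: output channels, r/s: filter height/width, p/q: output height/width.
    """
    gemm_m = n * p * q  # one GEMM row per output pixel
    gemm_n = k          # one GEMM column per output channel
    gemm_k = c * r * s  # reduction over the filter window
    macs = gemm_m * gemm_n * gemm_k
    # Simplifying assumption: inputs, weights, and outputs each move once.
    elements = n * c * h * w + k * c * r * s + n * k * p * q
    bytes_moved = elements * bytes_per_element
    return {
        "M": gemm_m,
        "N": gemm_n,
        "K": gemm_k,
        "macs": macs,
        "bytes": bytes_moved,
        "arithmetic_intensity": macs / bytes_moved,
    }


def efficiencies(macs, bytes_moved, latency_ms):
    """Divide MACs and bytes by the total layer time (not compute-only time)."""
    seconds = latency_ms / 1000.0
    return macs / seconds, bytes_moved / seconds
```

For example, `conv_as_implicit_gemm(1, 64, 56, 56, 64, 3, 3, 56, 56)` describes a 3x3 FP16 convolution with 64 input and 64 output channels on a 56x56 feature map.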


Performance optimization references:
* [NVIDIA DL Performance Background: Understand Performance](https://docs.nvidia.com/deeplearning/performance/dl-performance-gpu-background/index.html#understand-perf)
* [NVIDIA DL Performance: GEMM](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#math-mem)
* [CUDA Programming: Instruction Throughput](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#maximize-instruction-throughput)
* [TensorRT Developer Guide: Performance Best Practices](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#performance)

In [None]:
report_card_gemm_MNK(plan);

In [None]:
report_card_gemm_MNK_scatter(plan);

In [None]:
report_card_efficiency_vs_latency_3d(plan);

In [None]:
report_card_perf_scatter(plan);

<html><div style="text-align:center;background:#76b900;padding:20px;color:#ffffff;font-size:2em;">Layer Lint Utility</div></html>

Linting functions perform static analysis of the plan to flag possible performance hazards.<br>
See TensorRT's [Performance Best Practices](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#performance) for more information.

<b>The linting is in an early experimental stage and may be incomplete or erroneous.</b>

## Convolution Lint

Ideally, all Float16 and INT8 convolutions are accelerated on Tensor Cores (TCs), so `ConvLinter` applies heuristics to the kernel name to determine whether the kernel runs on TCs. This method is probably incorrect in some cases and needs more work.

In [None]:
display_df(ConvLinter(plan).lint())

## Reformat Lint

Reformat layers copy their input to their output while making some changes to the data.<br>
A Reformat layer may change the data layout (e.g. from NCHW to NC32HW) or perform data-type conversion (e.g. from float32 to INT8).

A Reformat layer is added to the engine graph by TensorRT's graph optimizer for one of several reasons, which are indicated by the `attr.origin` field.
* REFORMAT: type or layout conversion.
* SLICE: slice layer output conversion.
* CONCAT: concat layer input conversion.

Reformat layers that perform data-type conversion from float32/float16 to INT8, or vice-versa, may indicate poorly placed Q/DQ layers in a QAT network. <br>
These are Q/DQ layers which could not be fused with another layer in the engine graph and may be quite costly in latency.

In [None]:
display_df(ReformatLinter(plan).lint())

## Slice Linter

Slice layers that perform data-type conversion may indicate an optimization opportunity.

In [None]:
display_df(SliceLinter(plan).lint())

## Q/DQ Linter

Quantize/Dequantize layers perform a copy with quantization/dequantization.<br>
Unfused Q/DQ layers ("dangling Q/DQ") are very wasteful and usually indicate poor placement of fake-quantization in the training model.
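
As a reminder of what a Q/DQ pair computes, here is a minimal sketch of symmetric per-tensor INT8 quantization. The function names are hypothetical, and the zero-point of 0 and round-half-to-even behavior are assumptions made for the illustration rather than a description of TensorRT's kernels.

```python
INT8_MIN, INT8_MAX = -128, 127


def quantize(x, scale):
    """Map a float to INT8: divide by the scale, round, then clamp."""
    # Python's round() rounds halves to the nearest even integer.
    q = round(x / scale)
    return max(INT8_MIN, min(INT8_MAX, q))


def dequantize(q, scale):
    """Map an INT8 value back to float by multiplying by the scale."""
    return q * scale
```

A dangling Q/DQ pair runs both steps as standalone copies over the whole tensor, which is why it shows up as extra latency in the plan.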

In [None]:
display_df(QDQLinter(plan).lint())

## Pointwise Layers

Pointwise and Elementwise layers can be fused to create larger kernels.<br>
Here you can explore how well these layers managed to fuse.

In [None]:
report_card_pointwise_lint(plan)