# ONNX Runtime Tutorial!

This notebook demonstrates the features of the ONNXRT passes integrated into MASE as part of the MASERT framework. The following demonstrations were run on an NVIDIA RTX A2000 GPU with an Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz.

## Section 1. ONNX Runtime Optimizations
First, we show how to use the ONNXRT optimizations. We expect a speed-up without a loss in model accuracy. We will use a simple model, `jsc-toy`, and compare the optimized model with the original model using the `Machop API`.

First, we load the machop requirements by running the cell below.

In [2]:
import sys
import os
from pathlib import Path
import toml

# Figure out the correct path
machop_path = Path(".").resolve().parent.parent.parent /"machop"
assert machop_path.exists(), "Failed to find machop at: {}".format(machop_path)
sys.path.append(str(machop_path))

# Add directory to the PATH so that chop can be called
new_path = "../../../machop"
full_path = os.path.abspath(new_path)
os.environ['PATH'] += os.pathsep + full_path

from chop.tools.utils import to_numpy_if_tensor
from chop.tools.logger import set_logging_verbosity
from chop.tools import get_cf_args, get_dummy_input
from chop.passes.graph.utils import deepcopy_mase_graph
from chop.tools.get_input import InputGenerator
from chop.tools.checkpoint_load import load_model
from chop.ir import MaseGraph
from chop.models import get_model_info, get_model, get_tokenizer
from chop.dataset import MaseDataModule, get_dataset_info
from chop.passes.graph.transforms import metadata_value_type_cast_transform_pass
from chop.passes.graph import (
    summarize_quantization_analysis_pass,
    add_common_metadata_analysis_pass,
    init_metadata_analysis_pass,
    add_software_metadata_analysis_pass,
    runtime_analysis_pass,
    onnx_runtime_interface_pass,
    )

set_logging_verbosity("info")

[2024-03-28 23:08:16,245] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO     Set logging level to info


We then load a demonstration TOML file and set the relevant pass arguments. This is all done automatically when using the command line; see [Section 2](#section-2-quantization).

In [7]:
JSC_TOML_PATH = "../../../machop/configs/onnx/jsc_gpu_ort.toml"

# Reading TOML file and converting it into a Python dictionary
with open(JSC_TOML_PATH, 'r') as toml_file:
    pass_args = toml.load(toml_file)

# Extract the 'passes.onnxruntime' section and its children
onnx_config = pass_args.get('passes', {}).get('onnxruntime', {})
# Extract the 'passes.runtime_analysis' section and its children
runtime_analysis_config = pass_args.get('passes', {}).get('runtime_analysis', {})

# Load the basics in
model_name = pass_args['model']
dataset_name = pass_args['dataset']
max_epochs = pass_args['max_epochs']
batch_size = pass_args['batch_size']
learning_rate = pass_args['learning_rate']
accelerator = pass_args['accelerator']

data_module = MaseDataModule(
    name=dataset_name,
    batch_size=batch_size,
    model_name=model_name,
    num_workers=0,
)

data_module.prepare_data()
data_module.setup()

# Add the data_module and other necessary information to the configs
configs = [onnx_config, runtime_analysis_config]
for config in configs:
    config['task'] = pass_args['task']
    config['batch_size'] = pass_args['batch_size']
    config['model'] = pass_args['model']
    config['data_module'] = data_module
    config['dataset'] = pass_args['dataset']
    config['accelerator'] = 'cuda' if pass_args['accelerator'] == 'gpu' else pass_args['accelerator']
    # 'gpu' has already been mapped to 'cuda' on the line above, so test for 'cuda' here
    if config['accelerator'] == 'cuda':
        os.environ['CUDA_MODULE_LOADING'] = 'LAZY'

model_info = get_model_info(model_name)
model = get_model(
    model_name,
    task="cls",
    dataset_info=data_module.dataset_info,
    pretrained=False)

input_generator = InputGenerator(
    data_module=data_module,
    model_info=model_info,
    task="cls",
    which_dataloader="train",
)

# generate the mase graph and initialize node metadata
mg = MaseGraph(model=model)
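The nested `.get(..., {})` lookups in the cell above can be illustrated in isolation. The following is a minimal sketch: the dictionary is a hypothetical stand-in for what `toml.load` returns (its values are invented for illustration), and `get_pass_config` is a helper name introduced here, not part of machop.

```python
# Hypothetical stand-in for the dictionary returned by toml.load().
# Section names follow the tutorial; the values are made up for illustration.
pass_args = {
    "model": "jsc-toy",
    "dataset": "jsc",
    "task": "cls",
    "accelerator": "gpu",
    "passes": {
        "onnxruntime": {"default": {"config": {"precision": "fp16"}}},
    },
}


def get_pass_config(args: dict, pass_name: str) -> dict:
    """Return the [passes.<pass_name>] table, or an empty dict if it is absent."""
    return args.get("passes", {}).get(pass_name, {})


onnx_config = get_pass_config(pass_args, "onnxruntime")
# This section is missing from the stand-in config, so we get {} instead of a KeyError.
runtime_analysis_config = get_pass_config(pass_args, "runtime_analysis")

# The notebook maps the TOML value 'gpu' to the device string 'cuda'.
accelerator = "cuda" if pass_args["accelerator"] == "gpu" else pass_args["accelerator"]

print(onnx_config["default"]["config"]["precision"])
print(runtime_analysis_config)
print(accelerator)
```

Returning an empty dictionary for a missing section is what allows the notebook to attach `task`, `batch_size` and the other shared entries to every config in the loop without first checking whether the section exists.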

Next, we train the `jsc-toy` model using the machop `train` action with the config from the TOML file. You may want to switch to GPU for this task; it will not affect the CPU optimizations later on.

In [3]:
# !ch train --config {JSC_TOML_PATH} --accelerator gpu

Then we load in the checkpoint. You will have to adjust this according to where it has been stored in the mase_output directory.
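If the checkpoint location is not obvious, a small helper along the following lines could list candidate files. This is only a sketch: `find_checkpoints` is a hypothetical helper, not part of machop, and it assumes checkpoints use the `.ckpt` extension seen in the path below.

```python
from pathlib import Path


def find_checkpoints(output_dir: str) -> list[Path]:
    """Return every *.ckpt file under output_dir, most recently modified first."""
    root = Path(output_dir)
    return sorted(
        root.rglob("*.ckpt"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
```

Calling `find_checkpoints("../../../mase_output")` from the notebook directory would then return the available checkpoints, from which the path for `JSC_CHECKPOINT_PATH` can be picked.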

In [8]:
# Load in the trained checkpoint - change this accordingly
JSC_CHECKPOINT_PATH = "../../../mase_output/jsc-toy_cls_jsc/software/training_ckpts/best.ckpt"
model = load_model(load_name=JSC_CHECKPOINT_PATH, load_type="pl", model=model)

# Initiate metadata
dummy_in = next(iter(input_generator))
_ = model(**dummy_in)
mg, _ = init_metadata_analysis_pass(mg, None)

# Copy original graph for analysis later
mg_original = deepcopy_mase_graph(mg)

mg, _ = add_common_metadata_analysis_pass(mg, {"dummy_in": dummy_in})
mg, _ = add_software_metadata_analysis_pass(mg, None)
mg, _ = metadata_value_type_cast_transform_pass(mg, pass_args={"fn": to_numpy_if_tensor})

INFO     Loaded pytorch lightning checkpoint from ../../../mase_output/jsc-toy_cls_jsc/software/training_ckpts/best.ckpt


We then run the `onnx_runtime_interface_pass`, which performs the optimizations using the dataloader and the `jsc-toy` model. It returns metadata containing the paths to the generated models:

- `onnx_path` (the optimized model)
- `onnx_dynamic_quantized_path` (the dynamically quantized model)

In this case, since we are not quantizing the model, only the `onnx_path` is available.

The models are also stored in the directory:
```
mase_output
└── onnxrt
    └── model_task_dataset_date
        ├── optimized
        ├── pre_processed
        ├── static_quantized
        └── dynamic_quantized
```
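The project directory name follows the `model_task_dataset_date` pattern shown in the tree. As a rough sketch, the path can be reconstructed as follows; `project_dir` and `existing_variants` are hypothetical helpers written for this tutorial, and the date format is inferred from the log output of the pass rather than taken from the machop source.

```python
from datetime import date
from pathlib import Path

# Sub-directories listed in the tree above.
VARIANT_DIRS = ("optimized", "pre_processed", "static_quantized", "dynamic_quantized")


def project_dir(model: str, task: str, dataset: str, day: date,
                root: str = "mase_output") -> Path:
    """Build <root>/onnxrt/<model>_<task>_<dataset>_<YYYY-MM-DD>."""
    return Path(root) / "onnxrt" / f"{model}_{task}_{dataset}_{day.isoformat()}"


def existing_variants(project: Path) -> list[str]:
    """Return the variant sub-directories that actually exist on disk."""
    return [name for name in VARIANT_DIRS if (project / name).is_dir()]


project = project_dir("jsc-toy", "cls", "jsc", date(2024, 3, 27))
print(project.as_posix())  # mase_output/onnxrt/jsc-toy_cls_jsc_2024-03-27
```

`existing_variants(project)` is a quick way to see which of the four model variants a given run produced, since the quantized directories are only populated when quantization is enabled.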

In [9]:
mg, onnx_meta = onnx_runtime_interface_pass(mg, pass_args=onnx_config)

INFO     Converting PyTorch model to ONNX...
INFO     Project will be created at /root/mase/mase_output/onnxrt/jsc-toy_cls_jsc_2024-03-27
INFO     ONNX Conversion Complete. Stored ONNX model to /root/mase/mase_output/onnxrt/jsc-toy_cls_jsc_2024-03-27/optimized/version_1/model.onnx
INFO     ONNX Model Summary: (summary table truncated)

We can view a summary of the ONNX model, whose structure is unchanged from the PyTorch one; however, it should now be optimized. Let's run an analysis pass on both the original `MaseGraph` and the optimized `.onnx` model.

In [10]:
_, _ = runtime_analysis_pass(mg_original, pass_args=runtime_analysis_config)

INFO     Starting transformation analysis on jsc-toy
INFO     Results jsc-toy:
+------------------------------+---------------+
|      Metric (Per Batch)      |     Value     |
+------------------------------+---------------+
|    Average Test Accuracy     |    0.73159    |
|      Average Precision       |    0.74429    |
|        Average Recall        |    0.73023    |
|       Average F1 Score       |    0.73347    |
|         Average Loss         |    0.76373    |
|       Average Latency        |  0.79688 ms   |
|   Average GPU Power Usage    |   21.816 W    |
| Inference Energy Consumption | 0.0048292 mWh |
+------------------------------+---------------+

In [11]:
_, _ = runtime_analysis_pass(onnx_meta['onnx_path'], pass_args=runtime_analysis_config)

INFO     Starting transformation analysis on jsc-toy-onnx
INFO     Results jsc-toy-onnx:
+------------------------------+---------------+
|      Metric (Per Batch)      |     Value     |
+------------------------------+---------------+
|    Average Test Accuracy     |    0.73412    |
|      Average Precision       |    0.74875    |
|        Average Recall        |    0.73435    |
|       Average F1 Score       |    0.73761    |
|         Average Loss         |    0.74954    |
|       Average Latency        |   0.2215 ms   |
|   Average GPU Power Usage    |   21.575 W    |
| Inference Energy Consumption | 0.0013275 mWh |
+------------------------------+---------------+

As shown above, the latency of the CPU inference is around 3.6x lower (0.797 ms versus 0.222 ms per batch) for the `jsc-toy` model, without compromising accuracy, simply by using the ONNXRT optimizations.
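The comparison can be made explicit with a few lines of arithmetic on the values reported in the two result tables. The sketch below also checks that the reported energy figures are consistent with power multiplied by latency; that this is how the analysis pass computes them is an inference from the numbers, not something confirmed from the machop source.

```python
def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Ratio of baseline latency to optimized latency."""
    return baseline_ms / optimized_ms


def energy_mwh(power_w: float, latency_ms: float) -> float:
    """Energy of one batch in mWh: W * s gives joules, and 1 mWh = 3.6 J."""
    joules = power_w * latency_ms / 1000.0
    return joules / 3.6


# Values copied from the result tables above.
original = {"latency_ms": 0.79688, "power_w": 21.816}
optimized = {"latency_ms": 0.2215, "power_w": 21.575}

ratio = speedup(original["latency_ms"], optimized["latency_ms"])
print(f"speed-up: {ratio:.2f}x")  # speed-up: 3.60x
print(f"latency reduction: {100 * (1 - 1 / ratio):.0f}%")
print(f"original energy:  {energy_mwh(original['power_w'], original['latency_ms']):.7f} mWh")
print(f"optimized energy: {energy_mwh(optimized['power_w'], optimized['latency_ms']):.7f} mWh")
```

Both computed energies agree with the tables to within rounding, so the drop in energy per batch comes almost entirely from the shorter latency; the average power draw is nearly identical in the two runs.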

Let's now run the same optimizations, this time using a GPU and a larger model, `vgg7`. We will also use the chop action from the terminal, which runs the same `onnx_runtime_interface_pass`.

First, let's train the `vgg7` model using the machop `train` action with the config from the new TOML file, and then load the trained checkpoint into the `transform` action.

In [4]:
VGG_TOML_PATH = "../../../machop/configs/onnx/vgg7_gpu_quant.toml"

# !ch train --config {VGG_TOML_PATH}

# Load in the checkpoint from the previous train - modify accordingly
VGG_CHECKPOINT_PATH = "../../../mase_output/vgg7-pre-trained/test-accu-0.9332.ckpt"

!ch transform --config {VGG_TOML_PATH} --load {VGG_CHECKPOINT_PATH} --load-type pl 

[2024-03-28 23:09:44,122] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO: Seed set to 0
+-------------------------+------------------------+--------------------------+--------------------------+--------------------------+
| Name                    |        Default         |       Config. File       |     Manual Override      |        Effective         |
+-------------------------+------------------------+--------------------------+--------------------------+--------------------------+
| task                    |     classification     |           cls            |                          |           cls            |
| load_name               |          None          | ../mase_output/vgg7-pre- | /root/mase/mase_output/v | /root/mase/mase_output/v |
|                         |                        |      trained/test-   

As shown above, the latency of the GPU inference is 30% lower with the `vgg7` model, without compromising accuracy, simply by using the ONNXRT optimizations.

We will now look at quantization to further speed up the model. 

## Section 2. Quantization

We can quantize to either FP16 or INT8 by setting the `precision` parameter in `passes.onnxruntime.default.config` to `'fp16'` or `'int8'` respectively. INT8 quantization typically gives the largest latency improvement but is more likely to reduce accuracy.

ONNXRT supports three types of quantization, selected via `quantization_types` in `passes.onnxruntime.default.config`. The first two differ in how they calibrate, i.e. how they set the scale and zero point, which are only relevant for integer-based quantization:
- **Static Quantization**:
    - The scale and zero point of activations are calculated in advance (offline) using a calibration data set.
    - The activations have the same scale and zero point during each forward pass.
    - The `num_calibration_batches` parameter must also be set so that calibration is run on a subset of the training dataset. A larger subset gives better estimates of the activation ranges and may improve accuracy, at the cost of a longer calibration time.
- **Dynamic Quantization**:
    - The scale and zero point of activations are calculated on-the-fly (online) and are specific to each forward pass.
    - This approach is more accurate but introduces extra computational overhead.
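The difference between the two calibration strategies can be sketched with plain affine (scale and zero point) quantization to unsigned 8-bit integers. This is a generic, pure-Python illustration under simple min/max calibration; it is not ONNXRT's implementation, and all function names and sample values are invented for the example.

```python
def affine_params(values, qmin=0, qmax=255):
    """Derive (scale, zero_point) from the min/max of values; the range always includes 0."""
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # avoid a zero scale for all-zero input
    zero_point = int(round(qmin - lo / scale))
    return scale, max(qmin, min(qmax, zero_point))


def quantize(values, scale, zero_point, qmin=0, qmax=255):
    """Map floats to integers in [qmin, qmax]."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map quantized integers back to floats."""
    return [(q - zero_point) * scale for q in q_values]


def max_error(values, scale, zero_point):
    """Largest absolute round-trip error for the given parameters."""
    restored = dequantize(quantize(values, scale, zero_point), scale, zero_point)
    return max(abs(a - b) for a, b in zip(values, restored))


calibration_data = [-1.0, 0.5, 3.0]  # seen offline, before deployment
batch = [-0.1, 0.0, 0.2]             # activations of one forward pass

# Static: parameters are fixed once from the calibration data and reused.
static_scale, static_zp = affine_params(calibration_data)
static_error = max_error(batch, static_scale, static_zp)

# Dynamic: parameters are recomputed from the batch itself.
dynamic_scale, dynamic_zp = affine_params(batch)
dynamic_error = max_error(batch, dynamic_scale, dynamic_zp)

print(static_zp, dynamic_zp)
print(static_error > dynamic_error)
```

Because the batch occupies only a small part of the calibrated range, the static parameters spend most of the 256 levels on values that never occur, whereas the dynamic parameters fit the batch exactly. The price is that `affine_params` has to run on every forward pass, which is the overhead mentioned above.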

The `onnx_runtime_interface_pass` also supports mixed precision. This procedure is fully automatic: ONNXRT finds a minimal set of ops to skip while retaining a certain level of accuracy, converting most of the ops to float16 and leaving the rest in float32.
- **Auto Mixed Precision Quantization**:
    - Automatically chooses between FP16 and FP32 precision to retain a certain level of accuracy.
    - The `precision` parameter does not need to be set in the config since the whole process is automatic.
    - This process is currently only supported on GPU.
    - This approach is most beneficial when INT8-only or FP16-only quantization (static or dynamic) gives poor results.
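The idea of keeping only the problematic ops in float32 can be illustrated with a toy example. The sketch below is a deliberately simplified, per-op check on scalar "ops" and is not the search procedure ONNXRT actually uses; the op list, tolerances and helper names are all hypothetical. Half precision is emulated with the standard-library `struct` module.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE half-precision value (infinity on overflow)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def is_close(actual: float, expected: float, rtol: float, atol: float) -> bool:
    """Tolerance test of the form |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


def choose_fp32_ops(ops, x, rtol=1e-3, atol=1e-5):
    """Return the names of ops whose half-precision result misses the tolerance."""
    keep_fp32 = []
    reference = x
    for name, fn in ops:
        expected = fn(reference)                      # full-precision result
        half = to_fp16(fn(to_fp16(reference)))        # same op with fp16 input/output
        if not is_close(half, expected, rtol, atol):
            keep_fp32.append(name)
        reference = expected                          # feed the reference to the next op
    return keep_fp32


# A toy chain of scalar operations standing in for graph nodes.
toy_ops = [
    ("scale", lambda v: v * 0.5),
    ("add_small", lambda v: v + 1e-4),
    ("square", lambda v: v * v),  # 500**2 exceeds the fp16 maximum of 65504
]

print(choose_fp32_ops(toy_ops, 1000.0))  # ['square']
```

Here only `square` is kept in float32, because its output overflows the half-precision range, while the other two ops stay within tolerance and could safely run in float16.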

All three methods first pre-process the model before quantization, which adds further optimizations. This intermediate model is stored in the `pre_processed` directory.

For this example, we will set the `precision` to `'uint8'` (since the `ConvInteger` node is not currently supported for `'int8'` on the ONNXRT GPU execution provider).

We will also set `quantization_types` to `['static', 'dynamic', 'auto']` to compare all three quantization methods, whilst keeping the other settings exactly the same for a fair comparison against the optimized `vgg7` model used in the previous section.
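The constraints described in this section can be summarised as a small sanity check on the quantization settings. This is a sketch only: `check_quant_config` is a hypothetical helper, and the key names (`precision`, `quantization_types`, `num_calibration_batches`) follow the prose of this section, so they may differ from the keys in the actual TOML files.

```python
ALLOWED_PRECISIONS = {"fp16", "int8", "uint8"}
ALLOWED_TYPES = {"static", "dynamic", "auto"}


def check_quant_config(cfg: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the config looks sane."""
    problems = []
    types = cfg.get("quantization_types", [])

    unknown = sorted(set(types) - ALLOWED_TYPES)
    if unknown:
        problems.append(f"unknown quantization types: {unknown}")

    # 'auto' mixed precision chooses precisions itself; the other two need `precision`.
    needs_precision = any(t in ("static", "dynamic") for t in types)
    if needs_precision and cfg.get("precision") not in ALLOWED_PRECISIONS:
        problems.append("precision must be one of 'fp16', 'int8' or 'uint8'")

    # Static quantization calibrates offline, so it needs a calibration budget.
    if "static" in types and "num_calibration_batches" not in cfg:
        problems.append("static quantization requires num_calibration_batches")

    return problems


example = {
    "precision": "uint8",
    "quantization_types": ["static", "dynamic", "auto"],
    "num_calibration_batches": 10,  # illustrative value, not taken from the tutorial configs
}
print(check_quant_config(example))  # []
```

Running such a check before invoking `ch transform` is one way to catch a missing `num_calibration_batches` early, rather than after the model has already been exported.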

In [2]:
JSC_TOML_PATH = "../../../machop/configs/onnx/jsc_gpu_quant.toml"
JSC_CHECKPOINT_PATH = "../../../mase_output/jsc-toy_cls_jsc/software/training_ckpts/best.ckpt"
!ch transform --config {JSC_TOML_PATH} --load {JSC_CHECKPOINT_PATH} --load-type pl

[2024-03-27 20:47:18,736] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO: Seed set to 0
+-------------------------+------------------------+--------------+--------------------------+--------------------------+
| Name                    |        Default         | Config. File |     Manual Override      |        Effective         |
+-------------------------+------------------------+--------------+--------------------------+--------------------------+
| task                    |     classification     |     cls      |                          |           cls            |
| load_name               |          None          |              | /root/mase/mase_output/j | /root/mase/mase_output/j |
|                         |                        |              | sc-toy_cls_jsc/software/ | sc-toy_cls_jsc/software/ |
|                         |           

In [2]:
VGG_TOML_PATH = "../../../machop/configs/onnx/vgg7_gpu_quant.toml"
VGG_CHECKPOINT_PATH = "../../../mase_output/vgg7-pre-trained/test-accu-0.9332.ckpt"
!ch transform --config {VGG_TOML_PATH} --load {VGG_CHECKPOINT_PATH} --load-type pl

[2024-03-27 17:21:25,864] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO: Seed set to 0
+-------------------------+------------------------+--------------+--------------------------+--------------------------+
| Name                    |        Default         | Config. File |     Manual Override      |        Effective         |
+-------------------------+------------------------+--------------+--------------------------+--------------------------+
| task                    |     classification     |     cls      |                          |           cls            |
| load_name               |          None          |              | /root/mase/mase_output/v | /root/mase/mase_output/v |
|                         |                        |              |  gg7-pre-trained/test-   |  gg7-pre-trained/test-   |
|                         |           

In [2]:
VGG_TOML_PATH = "../../../machop/configs/onnx/vgg7_cpu_quant.toml"
VGG_CHECKPOINT_PATH = "../../../mase_output/vgg7-pre-trained/test-accu-0.9332.ckpt"
!ch transform --config {VGG_TOML_PATH} --load {VGG_CHECKPOINT_PATH} --load-type pl 

[2024-03-27 17:15:55,643] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO: Seed set to 0
+-------------------------+------------------------+--------------+--------------------------+--------------------------+
| Name                    |        Default         | Config. File |     Manual Override      |        Effective         |
+-------------------------+------------------------+--------------+--------------------------+--------------------------+
| task                    |     classification     |     cls      |                          |           cls            |
| load_name               |          None          |              | /root/mase/mase_output/v | /root/mase/mase_output/v |
|                         |                        |              |  gg7-pre-trained/test-   |  gg7-pre-trained/test-   |
|                         |           

In [2]:
MOBILENET_SMALL_TOML_PATH = "../../../machop/configs/onnx/mobilenetv3_small_gpu_quant.toml"
MOBILENET_SMALL_CHECKPOINT_PATH = "../../../mase_output/mobilenetv3-small_pre-trained_mnist/mobilenetv3_small_mnist_0.997.ckpt"

!ch transform --config {MOBILENET_SMALL_TOML_PATH} --load {MOBILENET_SMALL_CHECKPOINT_PATH} --load-type pl

[2024-03-28 01:57:07,259] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO: Seed set to 0
+-------------------------+------------------------+-------------------+--------------------------+--------------------------+
| Name                    |        Default         |   Config. File    |     Manual Override      |        Effective         |
+-------------------------+------------------------+-------------------+--------------------------+--------------------------+
| task                    |     classification     |        cls        |                          |           cls            |
| load_name               |          None          |                   | /root/mase/mase_output/m | /root/mase/mase_output/m |
|                         |                        |                   | obilenetv3-small_pre-tra | obilenetv3-small_pre-tra |
|       

In [8]:
MOBILENET_TOML_PATH = "../../../machop/configs/onnx/mobilenetv3_large_gpu_quant.toml"
!ch transform --config {MOBILENET_TOML_PATH}

[2024-03-27 21:02:08,767] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO: Seed set to 0
+-------------------------+------------------------+-------------------+-----------------+------------------------+
| Name                    |        Default         |   Config. File    | Manual Override |       Effective        |
+-------------------------+------------------------+-------------------+-----------------+------------------------+
| task                    |     classification     |        cls        |                 |          cls           |
| load_name               |          None          |                   |                 |          None          |
| load_type               |           mz           |                   |                 |           mz           |
| batch_size              |          128           |        64         |  
