# Welcome to the ONNX Runtime Tutorial!

This notebook is designed to demonstrate the features of the ONNXRT passes integrated into MASE as part of the MASERT framework.

## Section 1. ONNX Runtime Optimizations
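First, we show how to use the ONNX Runtime (ONNXRT) optimizations. We expect a speed-up without a loss in model accuracy. We will use a simple model, `jsc-toy`, and compare the optimized model to the original model using the Machop API. Note that the outputs recorded in this notebook come from a run with the `vgg7` configuration, which the code cells below select; the `jsc-toy` lines are left commented out.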

First, we load the machop requirements by running the cell below.

In [1]:
import sys
import os
from pathlib import Path
import toml

# Figure out the correct path
machop_path = Path(".").resolve().parent.parent.parent / "machop"
assert machop_path.exists(), "Failed to find machop at: {}".format(machop_path)
sys.path.append(str(machop_path))

# Add directory to the PATH so that chop can be called
new_path = "../../../machop"
full_path = os.path.abspath(new_path)
os.environ['PATH'] += os.pathsep + full_path

from chop.tools.utils import to_numpy_if_tensor
from chop.tools.logger import set_logging_verbosity
from chop.tools import get_cf_args, get_dummy_input
from chop.passes.graph.utils import deepcopy_mase_graph
from chop.tools.get_input import InputGenerator
from chop.tools.checkpoint_load import load_model
from chop.ir import MaseGraph
from chop.models import get_model_info, get_model, get_tokenizer
from chop.dataset import MaseDataModule, get_dataset_info
from chop.passes.graph.transforms import metadata_value_type_cast_transform_pass
from chop.passes.graph import (
    summarize_quantization_analysis_pass,
    add_common_metadata_analysis_pass,
    init_metadata_analysis_pass,
    add_software_metadata_analysis_pass,
    onnx_runtime_transform_pass,
    runtime_analysis_pass,
    )

set_logging_verbosity("info")



[2024-03-23 17:56:50,031] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)


I0323 17:56:51.653448 140222021797696 logger.py:44] Set logging level to info


We then load a demonstration TOML file and set the relevant pass arguments (this is all done automatically when using the command line; see [Section X]()).
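As a rough sketch of the lookup pattern used in the next cell, the snippet below parses a small, made-up TOML string and pulls out the nested pass sections. The TOML values and the `normalise_accelerator` helper are illustrative assumptions, not the contents of the real config files or part of the MASE API; only the `passes.onnxruntime` and `passes.runtime_analysis` section names come from this notebook.

```python
try:
    import tomllib as toml_parser  # standard library from Python 3.11
except ModuleNotFoundError:
    import toml as toml_parser  # third-party package imported earlier in this notebook

# Hypothetical config: every value here is made up for illustration.
EXAMPLE_TOML = """
model = "vgg7"
dataset = "cifar10"
task = "cls"
accelerator = "gpu"
batch_size = 64

[passes.onnxruntime.default.config]
precision = "int8"
"""


def normalise_accelerator(name):
    """Map the config value 'gpu' to 'cuda'; leave anything else unchanged."""
    return "cuda" if name == "gpu" else name


pass_args = toml_parser.loads(EXAMPLE_TOML)

# Chained .get() calls fall back to an empty dict when a section is missing,
# so a config without 'runtime_analysis' does not raise KeyError.
onnx_config = pass_args.get("passes", {}).get("onnxruntime", {})
runtime_analysis_config = pass_args.get("passes", {}).get("runtime_analysis", {})

onnx_config["accelerator"] = normalise_accelerator(pass_args["accelerator"])

print(onnx_config)
print(runtime_analysis_config)
```

The empty-dict fallback keeps the later loop over `configs` working even when a section is absent, at the cost of silently accepting a misspelled section name.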

In [2]:
# JSC_TOML_PATH = "../../../machop/configs/onnx/jsc_gpu_ort.toml"

# # Reading TOML file and converting it into a Python dictionary
# with open(JSC_TOML_PATH, 'r') as toml_file:
#     pass_args = toml.load(toml_file)

VGG_TOML_PATH = "../../../machop/configs/onnx/vgg7_gpu_quant.toml"

with open(VGG_TOML_PATH, 'r') as toml_file:
    pass_args = toml.load(toml_file)

# Extract the 'passes.onnxruntime' section and its children
onnx_config = pass_args.get('passes', {}).get('onnxruntime', {})
# Extract the 'passes.runtime_analysis' section and its children
runtime_analysis_config = pass_args.get('passes', {}).get('runtime_analysis', {})

# Load the basics in
model_name = pass_args['model']
dataset_name = pass_args['dataset']
max_epochs = pass_args['max_epochs']
batch_size = pass_args['batch_size']
learning_rate = pass_args['learning_rate']
accelerator = pass_args['accelerator']

data_module = MaseDataModule(
    name=dataset_name,
    batch_size=batch_size,
    model_name=model_name,
    num_workers=0,
)

data_module.prepare_data()
data_module.setup()

# Add the data_module and other necessary information to the configs
configs = [onnx_config, runtime_analysis_config]
for config in configs:
    config['task'] = pass_args['task']
    config['batch_size'] = pass_args['batch_size']
    config['model'] = pass_args['model']
    config['data_module'] = data_module
    config['dataset'] = pass_args['dataset']
    config['accelerator'] = 'cuda' if pass_args['accelerator'] == 'gpu' else pass_args['accelerator']
    if config['accelerator'] == 'cuda':  # 'gpu' has already been mapped to 'cuda' above
        os.environ['CUDA_MODULE_LOADING'] = 'LAZY'

model_info = get_model_info(model_name)
model = get_model(
    model_name,
    task="cls",
    dataset_info=data_module.dataset_info,
    pretrained=False)

input_generator = InputGenerator(
    data_module=data_module,
    model_info=model_info,
    task="cls",
    which_dataloader="train",
)

# generate the mase graph and initialize node metadata
mg = MaseGraph(model=model)

Files already downloaded and verified


Next, we train the `jsc-toy` model using the machop `train` action with the config from the TOML file. You may want to switch to GPU for this task; it will not affect the CPU optimizations later on.

In [3]:
# !ch train --config {JSC_TOML_PATH} --accelerator gpu

Then we load in the checkpoint. You will have to adjust this according to where it has been stored in the mase_output directory.

In [3]:
# Load in the trained checkpoint - change this accordingly
# JSC_CHECKPOINT_PATH = "../../../mase_output/jsc-toy_cls_jsc/software/training_ckpts/best.ckpt"
# model = load_model(load_name=JSC_CHECKPOINT_PATH, load_type="pl", model=model)

VGG_CHECKPOINT_PATH = "../../../mase_output/vgg7-pre-trained/test-accu-0.9332.ckpt"
model = load_model(load_name=VGG_CHECKPOINT_PATH, load_type="pl", model=model)

# Initiate metadata
dummy_in = next(iter(input_generator))
_ = model(**dummy_in)
mg, _ = init_metadata_analysis_pass(mg, None)

# Copy original graph for analysis later
mg_original = deepcopy_mase_graph(mg)

mg, _ = add_common_metadata_analysis_pass(mg, {"dummy_in": dummy_in})
mg, _ = add_software_metadata_analysis_pass(mg, None)
mg, _ = metadata_value_type_cast_transform_pass(mg, pass_args={"fn": to_numpy_if_tensor})

I0323 17:57:22.510873 140222021797696 checkpoint_load.py:85] Loaded pytorch lightning checkpoint from ../../../mase_output/vgg7-pre-trained/test-accu-0.9332.ckpt


We then run the `onnx_runtime_transform_pass`, which performs the optimizations using the dataloader and the loaded model. It returns metadata containing the paths to the generated models:

- `onnx_path` (the optimized model)
- `onnx_static_quantized_path` (the statically quantized model)
- `onnx_dynamic_quantized_path` (the dynamically quantized model)

The quantized paths are only present when quantization is enabled in the config; without quantization, only `onnx_path` is available.

The models are also stored in the directory:
```
mase_output
└── onnxrt
    └── model_task_dataset_date
        ├── optimized
        ├── pre_processed
        ├── static_quantized
        └── dynamic_quantized
```
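To check which models a run actually produced, one option is to walk the project directory. The sketch below assumes the layout shown above, with `.onnx` files nested under versioned subdirectories such as `optimized/version_1/`; `list_onnx_models` is a hypothetical helper, not part of MASE, and the demo runs against a throwaway directory rather than a real `mase_output`.

```python
import tempfile
from pathlib import Path

# Subdirectory names taken from the tree above.
SUBDIRS = ("optimized", "pre_processed", "static_quantized", "dynamic_quantized")


def list_onnx_models(project_dir):
    """Return {subdirectory name: sorted list of .onnx paths found below it}."""
    project_dir = Path(project_dir)
    found = {}
    for name in SUBDIRS:
        subdir = project_dir / name
        # A subdirectory may be missing if that stage was not run.
        found[name] = sorted(subdir.rglob("*.onnx")) if subdir.is_dir() else []
    return found


# Demo on a temporary directory that mimics the layout.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "mase_output" / "onnxrt" / "model_task_dataset_date"
    model_file = project / "optimized" / "version_1" / "model.onnx"
    model_file.parent.mkdir(parents=True)
    model_file.touch()

    found = list_onnx_models(project)
    summary = {name: [p.name for p in paths] for name, paths in found.items()}

print(summary)
```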

In [4]:
mg, onnx_meta = onnx_runtime_transform_pass(mg, pass_args=onnx_config)

I0323 17:57:30.319553 140222021797696 onnx_runtime.py:51] Converting PyTorch model to ONNX...
I0323 17:57:30.322780 140222021797696 onnx_runtime.py:53] Project will be created at /root/mase/mase_output/onnxrt/vgg7_cls_cifar10_2024-03-23
I0323 17:57:31.196530 140222021797696 onnx_runtime.py:71] ONNX Conversion Complete. Stored ONNX model to /root/mase/mase_output/onnxrt/vgg7_cls_cifar10_2024-03-23/optimized/version_1/model.onnx
INFO     ONNX Model Summary: 
+-------+----------------------------+----------+---------------------------------------------------------------------+-------------------------------------+------------------------------

The summary above describes the ONNX model, whose structure is unchanged from the PyTorch one, although it should now be optimized. Let's run an analysis pass on both the original `MaseGraph` and the optimized `.onnx` model.

In [6]:
_, _ = runtime_analysis_pass(mg_original, pass_args=runtime_analysis_config)

I0323 17:59:20.901860 140222021797696 analysis.py:270] Starting transformation analysis on vgg7
I0323 17:59:28.030013 140222021797696 analysis.py:393] 
Results vgg7:
+------------------------------+--------------+
|      Metric (Per Batch)      |    Value     |
+------------------------------+--------------+
|    Average Test Accuracy     |   0.88664    |
|      Average Precision       |   0.91334    |
|        Average Recall        |   0.91351    |
|       Average F1 Score       |   0.91325    |
|         Average Loss         |   0.26564    |
|       Average Latency        |   3.65 ms    |
|   Average GPU Power Usage    |   31.981 W   |
| Inference Energy Consumption | 0.032425 mWh |
+------------------------------+--------------+

In [7]:
_, _ = runtime_analysis_pass(onnx_meta['onnx_path'], pass_args=runtime_analysis_config)

I0323 17:59:35.753415 140222021797696 analysis.py:270] Starting transformation analysis on vgg7-onnx
I0323 17:59:40.753919 140222021797696 analysis.py:393] 
Results vgg7-onnx:
+------------------------------+--------------+
|      Metric (Per Batch)      |    Value     |
+------------------------------+--------------+
|    Average Test Accuracy     |   0.90009    |
|      Average Precision       |   0.92233    |
|        Average Recall        |   0.92172    |
|       Average F1 Score       |   0.92141    |
|         Average Loss         |   0.24462    |
|       Average Latency        |  2.3949 ms   |
|   Average GPU Power Usage    |   39.176 W   |
| Inference Energy Consumption | 0.026062 mWh |
+------------------------------+--------------+

In [8]:
_, _ = runtime_analysis_pass(onnx_meta['onnx_static_quantized_path'], pass_args=runtime_analysis_config)

2024-03-23 18:00:23.525246942 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 9 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.
2024-03-23 18:00:23.526694728 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-03-23 18:00:23.526715163 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
I0323 18:00:23.547759 140222021797696 analysis.py:270] Starting transformation analysis on vgg7-onnx

In [10]:
_, _ = runtime_analysis_pass(onnx_meta['onnx_dynamic_quantized_path'], pass_args=runtime_analysis_config)

2024-03-23 18:01:34.334436249 [W:onnxruntime:, transformer_memcpy.cc:74 ApplyImpl] 26 Memcpy nodes are added to the graph main_graph for CUDAExecutionProvider. It might have negative impact on performance (including unable to run CUDA graph). Set session_options.log_severity_level=1 to see the detail logs before this message.


NotImplemented: [ONNXRuntimeError] : 9 : NOT_IMPLEMENTED : Could not find an implementation for ConvInteger(10) node with name '/feature_layers.0/Conv_quant'
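Because one variant can fail to load, as with the `ConvInteger` error above, it can be convenient to run the analysis over every path in the metadata and record failures instead of stopping at the first one. The sketch below is one way to do that; `analyse_available` is a hypothetical helper, and `fake_analyse` and `fake_meta` are stand-ins so the example runs without a GPU or ONNX Runtime.

```python
# Metadata keys used in this notebook.
META_KEYS = (
    "onnx_path",
    "onnx_static_quantized_path",
    "onnx_dynamic_quantized_path",
)


def analyse_available(meta, analyse, keys=META_KEYS):
    """Call analyse(path) for each key present in meta, collecting errors."""
    results = {}
    for key in keys:
        path = meta.get(key)
        if path is None:
            results[key] = "skipped (path not in metadata)"
            continue
        try:
            results[key] = analyse(path)
        except Exception as exc:  # keep going so the other models still run
            results[key] = f"failed: {type(exc).__name__}: {exc}"
    return results


def fake_analyse(path):
    """Stand-in for the real analysis; fails for the dynamic model."""
    if "dynamic" in path:
        raise NotImplementedError("no implementation for ConvInteger")
    return f"analysed {path}"


fake_meta = {
    "onnx_path": "optimized/model.onnx",
    "onnx_dynamic_quantized_path": "dynamic_quantized/model.onnx",
}
results = analyse_available(fake_meta, fake_analyse)
for key, outcome in results.items():
    print(f"{key}: {outcome}")
```

In this notebook the stand-in could be replaced with `lambda p: runtime_analysis_pass(p, pass_args=runtime_analysis_config)` and `fake_meta` with `onnx_meta`.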

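As shown above, for the `vgg7` run recorded here the optimized ONNX model cuts average latency from 3.65 ms to about 2.39 ms per batch (roughly 1.5x faster) without compromising accuracy (0.900 versus 0.887 for the original model). Although average GPU power usage rises, inference energy consumption still drops from 0.032425 mWh to 0.026062 mWh per batch, about 20% lower. The dynamically quantized model could not be analysed because ONNX Runtime found no implementation for its `ConvInteger` node. Savings of this kind illustrate why optimization and quantization matter in industry, especially for large models such as LLMs, where reducing energy usage is a major concern.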
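The comparison can be made concrete with a little arithmetic on the two results tables above (the original `vgg7` `MaseGraph` and the optimized ONNX model). The relation energy = power × latency used in `energy_mwh` is an assumption: it reproduces the reported figures, but the source does not state how the analysis pass computes energy.

```python
def speedup(baseline_ms, optimized_ms):
    """Latency speed-up factor of the optimized model over the baseline."""
    return baseline_ms / optimized_ms


def energy_mwh(power_w, latency_ms):
    """Energy per batch in mWh, assuming energy = average power * latency."""
    # W * ms = mJ, and 1 mWh = 3600 mJ.
    return power_w * latency_ms / 3600.0


# Values copied from the results tables above.
original = {"latency_ms": 3.65, "power_w": 31.981, "energy_mwh": 0.032425}
optimized = {"latency_ms": 2.3949, "power_w": 39.176, "energy_mwh": 0.026062}

latency_speedup = speedup(original["latency_ms"], optimized["latency_ms"])
energy_saving = 1.0 - optimized["energy_mwh"] / original["energy_mwh"]

print(f"latency speed-up: {latency_speedup:.2f}x")
print(f"energy saving:    {energy_saving:.1%}")

# Cross-check the reported energy against power * latency.
for name, run in (("original", original), ("optimized", optimized)):
    estimate = energy_mwh(run["power_w"], run["latency_ms"])
    print(f"{name}: reported {run['energy_mwh']} mWh, estimated {estimate:.6f} mWh")
```

The higher power draw of the optimized model is more than offset by its shorter latency, which is why energy per batch still falls.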

## Section 2. INT8 Quantization

We may quantize to either FP16 or INT8 by setting the `precision` parameter in `passes.onnxruntime.default.config` to `'fp16'` or `'int8'` respectively. INT8 quantization will show the most notable latency improvements but will likely lower accuracy.

ONNXRT supports two types of quantization, which can be set in `onnxruntime.default.config` under `quantization_types`. They differ in how they calibrate, i.e. how they set the scale and zero point, which are only relevant for integer-based quantization:
- **Static Quantization**:
    - The scale and zero point of activations are calculated in advance (offline) using a calibration data set.
    - The activations have the same scale and zero point during each forward pass.
- **Dynamic Quantization**:
    - The scale and zero point of activations are calculated on-the-fly (online) and are specific for each forward pass.
    - This approach is more accurate but introduces extra computational overhead.
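The difference between the two schemes can be sketched in plain Python. The example below uses a common asymmetric 8-bit mapping (an assumption; it is not necessarily the exact scheme ONNX Runtime applies): static quantization derives one scale and zero point from calibration data and reuses them, while dynamic quantization derives them from each input. All function names and sample values are illustrative.

```python
def qparams(values, qmin=0, qmax=255):
    """Scale and zero point for an asymmetric range that always contains 0."""
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin)
    if scale == 0.0:  # all values are zero; any positive scale works
        scale = 1.0
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=0, qmax=255):
    """Map floats to integers in [qmin, qmax], clamping out-of-range values."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(quantized, scale, zero_point):
    """Map integers back to approximate floats."""
    return [(q - zero_point) * scale for q in quantized]


def max_error(values, scale, zero_point):
    """Largest absolute round-trip error for the given parameters."""
    restored = dequantize(quantize(values, scale, zero_point), scale, zero_point)
    return max(abs(v - r) for v, r in zip(values, restored))


# Static: parameters are fixed offline from a calibration set.
calibration = [-1.0, 0.5, 2.0, 4.0]
static_scale, static_zp = qparams(calibration)

# Dynamic: parameters are recomputed from each incoming batch.
batch = [-0.2, 0.1, 0.3]
dynamic_scale, dynamic_zp = qparams(batch)

static_err = max_error(batch, static_scale, static_zp)
dynamic_err = max_error(batch, dynamic_scale, dynamic_zp)
print(f"static : scale={static_scale:.5f} zero_point={static_zp} max error={static_err:.5f}")
print(f"dynamic: scale={dynamic_scale:.5f} zero_point={dynamic_zp} max error={dynamic_err:.5f}")
```

Here the batch occupies only a small part of the calibrated range, so the per-batch parameters give a finer grid and a smaller error, at the price of computing the range on every forward pass.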

Both methods first pre-process the model before quantization, which adds further optimizations. This intermediate model is stored in the `pre_processed` directory.

For this example, we will set `precision` to `'int8'` and `quantization_types` to `['static', 'dynamic']` to compare both quantization methods, keeping all other settings exactly the same for a fair comparison against the optimized model. This time, however, we will use chop from the terminal, which runs the same pass.
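When driving the pass from Python rather than the terminal, the same settings could be written into the loaded config dictionary. The sketch below assumes the nested key path `passes.onnxruntime.default.config` and the `quantization_types` key named earlier in this section; `set_quantization` is a hypothetical helper, and the allowed values are only those mentioned in this notebook (the real pass may accept others).

```python
# Values mentioned in this section; treat both sets as assumptions.
ALLOWED_PRECISIONS = {"fp16", "int8"}
ALLOWED_QUANTIZATION_TYPES = {"static", "dynamic"}


def set_quantization(pass_args, precision, quantization_types):
    """Write precision and quantization types into the nested pass config."""
    if precision not in ALLOWED_PRECISIONS:
        raise ValueError(f"unsupported precision: {precision!r}")
    unknown = set(quantization_types) - ALLOWED_QUANTIZATION_TYPES
    if unknown:
        raise ValueError(f"unsupported quantization types: {sorted(unknown)}")

    # setdefault creates any missing level of the nested dictionary.
    config = (
        pass_args.setdefault("passes", {})
        .setdefault("onnxruntime", {})
        .setdefault("default", {})
        .setdefault("config", {})
    )
    config["precision"] = precision
    config["quantization_types"] = list(quantization_types)
    return config


pass_args = {}
config = set_quantization(pass_args, "int8", ["static", "dynamic"])
print(pass_args)
```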

In [1]:
JSC_TOML_PATH = "../../../machop/configs/onnx/jsc_gpu_quant.toml"
# VGG_TOML_PATH = "../../../machop/configs/onnx/vgg7_gpu_quant.toml"

In [4]:
!ch transform --config {JSC_TOML_PATH}

[2024-03-23 17:30:06,279] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO: Seed set to 0
I0323 17:30:08.548976 139671930701632 seed.py:54] Seed set to 0
+-------------------------+------------------------+--------------+-----------------+------------------------+
| Name                    |        Default         | Config. File | Manual Override |       Effective        |
+-------------------------+------------------------+--------------+-----------------+------------------------+
| task                    |     classification     |     cls      |                 |          cls           |
| load_name               |          None          |              |                 |          None          |
| load_type               |           mz           |              |                 |           mz           |
| batch_size              |          128           |     256      |                 |          256      