# TensorRT Quantization Tutorial

This notebook demonstrates the TensorRT passes integrated into MASE as part of the MASERT framework. The following demonstrations were run on an NVIDIA RTX A2000 GPU with an Intel(R) Xeon(R) E5-2690 v4 @ 2.60GHz CPU.

## Section 1. INT8 Quantization
First, we show how to perform INT8 quantization of a model, `resnet18` trained on CIFAR-10, and compare the quantized model to the original using the `Machop API`. The quantization process is split into the following stages, each with its own pass and explained in depth in its subsection:

1. [Fake quantization](#section-11-fake-quantization): `tensorrt_fake_quantize_transform_pass`
2. [Calibration](#section-12-calibration): `tensorrt_calibrate_transform_pass`
3. [Quantization Aware Training](#section-13-quantized-aware-training-qat): `tensorrt_fine_tune_transform_pass`
4. [Quantization](#section-14-tensorrt-quantization): `tensorrt_engine_interface_pass`
5. [Analysis](#section-15-performance-analysis): `runtime_analysis_pass`

We start by importing the libraries and passes required for the notebook and ensuring the correct path is set so that machop can be used.

In [1]:
import sys
import os
from pathlib import Path
import toml

# Figure out the correct path
machop_path = Path(".").resolve().parent.parent.parent /"src"
assert machop_path.exists(), "Failed to find machop at: {}".format(machop_path)
sys.path.append(str(machop_path))

# Add directory to the PATH so that chop can be called
new_path = "../../../machop"
full_path = os.path.abspath(new_path)
os.environ['PATH'] += os.pathsep + full_path

from chop.tools.utils import to_numpy_if_tensor
from chop.tools.logger import set_logging_verbosity
from chop.tools import get_cf_args, get_dummy_input
from chop.passes.graph.utils import deepcopy_mase_graph
from chop.tools.get_input import InputGenerator
from chop.tools.checkpoint_load import load_model
from chop.ir import MaseGraph
from chop.models import get_model_info, get_model, get_tokenizer
from chop.dataset import MaseDataModule, get_dataset_info
from chop.passes.graph.transforms import metadata_value_type_cast_transform_pass
from chop.passes.graph import (
    summarize_quantization_analysis_pass,
    add_common_metadata_analysis_pass,
    init_metadata_analysis_pass,
    add_software_metadata_analysis_pass,
    tensorrt_calibrate_transform_pass,
    tensorrt_fake_quantize_transform_pass,
    tensorrt_fine_tune_transform_pass,
    tensorrt_engine_interface_pass,
    runtime_analysis_pass,
    )

set_logging_verbosity("info")

INFO     Set logging level to info


Check the dependencies (the package reported as "cuda" is provided by "cuda-python").

In [2]:
from chop.tools.check_dependency import check_deps_tensorRT_pass
check_deps_tensorRT_pass(silent=False)

INFO     Extension: All dependencies for TensorRT pass are available.


True

Next, we load in the toml file used for quantization. To view the configuration, click [here](../../../machop/configs/tensorrt/jsc_toy_INT8_quantization_by_type.toml).

In [3]:
import toml
# Path to your TOML file
# JSC_TOML_PATH = 'toy_INT8_quantization_by_type.toml'
JSC_TOML_PATH = 'resnet18_INT8_quant.toml'

# Reading TOML file and converting it into a Python dictionary
with open(JSC_TOML_PATH, 'r') as toml_file:
    pass_args = toml.load(toml_file)

# Extract the 'passes.tensorrt' section and its children
tensorrt_config = pass_args.get('passes', {}).get('tensorrt', {})
print(tensorrt_config)
# Extract the 'passes.runtime_analysis' section and its children
runtime_analysis_config = pass_args.get('passes', {}).get('tensorrt', {}).get('runtime_analysis', {})
print(runtime_analysis_config)

{'by': 'type', 'num_calibration_batches': 10, 'post_calibration_analysis': True, 'default': {'config': {'quantize': True, 'calibrators': ['percentile', 'mse', 'entropy'], 'percentiles': [99.0, 99.9, 99.99], 'precision': 'int8'}, 'input': {'calibrator': 'histogram', 'quantize_axis': False}, 'weight': {'calibrator': 'histogram', 'quantize_axis': False}}, 'fine_tune': {'fine_tune': True}, 'runtime_analysis': {'num_batches': 500, 'num_GPU_warmup_batches': 5, 'test': True}}
{'num_batches': 500, 'num_GPU_warmup_batches': 5, 'test': True}
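
The printed dictionaries show that `runtime_analysis` is nested inside the `tensorrt` section, which is why the cell above chains several `.get(..., {})` calls. As a minimal sketch of that lookup, the snippet below walks a plain dictionary that copies a few values from the printed output. The `get_section` helper is hypothetical and is not part of MASE.

```python
# Sketch only: `get_section` is a hypothetical helper, not a MASE API.
# The dictionary copies a subset of the values printed by the cell above.
pass_args = {
    "passes": {
        "tensorrt": {
            "by": "type",
            "num_calibration_batches": 10,
            "default": {"config": {"quantize": True, "precision": "int8"}},
            "runtime_analysis": {
                "num_batches": 500,
                "num_GPU_warmup_batches": 5,
                "test": True,
            },
        }
    }
}


def get_section(config, *keys):
    """Walk nested dictionaries, returning {} as soon as a key is missing."""
    node = config
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return {}
        node = node[key]
    return node


tensorrt_config = get_section(pass_args, "passes", "tensorrt")
runtime_analysis_config = get_section(pass_args, "passes", "tensorrt", "runtime_analysis")
missing = get_section(pass_args, "passes", "onnxruntime")  # section not present

print(runtime_analysis_config["num_batches"])
print(missing)
```

Returning an empty dictionary for a missing section keeps later lookups from raising, at the cost of hiding typos in section names, so an explicit check may be preferable in stricter settings.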


We then create a `MaseGraph` by instantiating the model and data module from the arguments in the toml configuration.

In [4]:
from chop.dataset import MaseDataModule
from chop.models import get_model_info
from chop.models import get_model
from chop.tools.get_input import InputGenerator

# Load the basics in
model_name = pass_args['model']
dataset_name = pass_args['dataset']
max_epochs = pass_args['max_epochs']
batch_size = pass_args['batch_size']
learning_rate = pass_args['learning_rate']
accelerator = pass_args['accelerator']

data_module = MaseDataModule(
    name=dataset_name,
    batch_size=batch_size,
    model_name=model_name,
    num_workers=0,
)
data_module.prepare_data()
data_module.setup()

# Add the data_module and other necessary information to the configs
configs = [tensorrt_config, runtime_analysis_config]
for config in configs:
    config['task'] = pass_args['task']
    config['dataset'] = pass_args['dataset']
    config['batch_size'] = pass_args['batch_size']
    config['model'] = pass_args['model']
    config['data_module'] = data_module
    config['accelerator'] = 'cuda' if pass_args['accelerator'] == 'gpu' else pass_args['accelerator']
    # 'gpu' has just been mapped to 'cuda', so test for 'cuda' here
    if config['accelerator'] == 'cuda':
        os.environ['CUDA_MODULE_LOADING'] = 'LAZY'

model_info = get_model_info(model_name)
# quant_modules.initialize()
model = get_model(
    model_name,
    # task="cls",
    dataset_info=data_module.dataset_info,
    pretrained=False)


input_generator = InputGenerator(
    data_module=data_module,
    model_info=model_info,
    task="cls",
    which_dataloader="train",
)

# generate the mase graph and initialize node metadata
mg = MaseGraph(model=model)

model_info is MaseModelInfo(name='resnet', model_source=<ModelSource.TORCHVISION: 'torchvision'>, task_type=<ModelTaskType.VISION: 'vision'>, image_classification=True, physical_data_point_classification=False, sequence_classification=False, seq2seqLM=False, causal_LM=False, is_quantized=False, is_lora=False, is_sparse=False, is_fx_traceable=True)


Then we load the checkpoint. Adjust the path below to wherever the checkpoint is stored in your `mase_output` directory.
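
If you are unsure where the checkpoint ended up, a small search can help. The sketch below assumes the directory layout seen in the path used in this notebook (`<run>/software/training_ckpts/best.ckpt`). The `find_latest_checkpoint` helper is hypothetical and is not provided by MASE.

```python
from pathlib import Path


def find_latest_checkpoint(output_root, pattern="*/software/training_ckpts/best.ckpt"):
    """Return the most recently modified checkpoint matching `pattern`, or None.

    The default pattern is an assumption based on the checkpoint path used
    in this notebook; change it if your output directory is laid out differently.
    """
    candidates = sorted(
        Path(output_root).glob(pattern),
        key=lambda path: path.stat().st_mtime,
    )
    return candidates[-1] if candidates else None


# Prints None when no matching checkpoint exists under ./mase_output
print(find_latest_checkpoint("mase_output"))
```

The result can then be assigned to `JSC_CHECKPOINT_PATH` after checking that it is not `None`.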

In [5]:
# Load in the trained checkpoint - change this accordingly
JSC_CHECKPOINT_PATH = "/workspace/ADLS_Proj/mase_output/resnet18_cls_cifar10_2025-03-08/software/training_ckpts/best.ckpt"


model = load_model(load_name=JSC_CHECKPOINT_PATH, load_type="pl", model=model)
print("load model done!")

# Initiate metadata
dummy_in = next(iter(input_generator))
print("dummy in done")

_ = model(**dummy_in)
print("_ done")

mg, _ = init_metadata_analysis_pass(mg, None)
print("init_metadata_analysis_pass done")

mg, _ = add_common_metadata_analysis_pass(mg, {"dummy_in": dummy_in})
print("add_common_metadata_analysis_pass done")

mg, _ = add_software_metadata_analysis_pass(mg, None)
print("add_software_metadata_analysis_pass done")

mg, _ = metadata_value_type_cast_transform_pass(mg, pass_args={"fn": to_numpy_if_tensor})
print("metadata_value_type_cast_transform_pass done")

# Before we begin, we will copy the original MaseGraph model to use for comparison during quantization analysis
mg_original = deepcopy_mase_graph(mg)
print("deep copy done")

INFO     Loaded pytorch lightning checkpoint from /workspace/ADLS_Proj/mase_output/resnet18_cls_cifar10_2025-03-08/software/training_ckpts/best.ckpt


load model done!
dummy in done
_ done
init_metadata_analysis_pass done
add_common_metadata_analysis_pass done
add_software_metadata_analysis_pass done
metadata_value_type_cast_transform_pass done
using safe deepcopy
deep copy done


We now apply fake quantization, calibration and quantization-aware fine-tuning in a single cell. Fake quantization lets us simulate a quantized model so that it can be calibrated and fine-tuned. Here it is applied type-wise: as the summary in the output shows, the conv2d and linear layers of `resnet18` are replaced by `QuantConv2d` and `QuantLinear`, while all other layer types are left unchanged. See [Section 4](#section-4-layer-wise-mixed-precision) for how to apply quantization layer-wise, for example to only the first and second layers.
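
To make the idea concrete, here is a minimal sketch of what fake quantization and percentile calibration do numerically, written in plain Python. It assumes symmetric quantization with a single range value (`amax`) per tensor. The function names are illustrative, and the actual `pytorch-quantization` implementation may differ in rounding, range and calibration details.

```python
import math


def percentile_amax(values, percentile=99.9):
    """Pick the calibration range (amax) as a percentile of |values|."""
    magnitudes = sorted(abs(v) for v in values)
    # Nearest-rank percentile: smallest magnitude covering `percentile` % of samples.
    rank = max(1, math.ceil(percentile / 100.0 * len(magnitudes)))
    return magnitudes[rank - 1]


def fake_quantize(values, amax, num_bits=8):
    """Quantize to signed integers and immediately dequantize back to floats."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8 (symmetric range)
    scale = qmax / amax
    result = []
    for v in values:
        q = round(v * scale)
        q = max(-qmax, min(qmax, q))  # clamp anything outside [-amax, amax]
        result.append(q / scale)
    return result


# Illustrative activations with one large outlier (made-up numbers).
activations = [0.02, -0.5, 0.31, 1.0, -0.75, 8.0]

amax = percentile_amax(activations, percentile=80)
simulated = fake_quantize(activations, amax)

print("amax:", amax)
for original, quantized in zip(activations, simulated):
    print(f"{original:>6} -> {quantized:.5f}")
```

The outlier is clipped to `amax` while the remaining values keep a fine resolution, which is the trade-off that the choice of calibrator and percentile controls.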

In [6]:
mg, _ = tensorrt_fake_quantize_transform_pass(mg, pass_args=tensorrt_config)
# summarize_quantization_analysis_pass(mg_original, mg)
summarize_quantization_analysis_pass(mg, pass_args={"save_dir": "quantize_summary_res18", "original_graph": mg_original})
mg, _ = tensorrt_calibrate_transform_pass(mg, pass_args=tensorrt_config)
mg, _ = tensorrt_fine_tune_transform_pass(mg, pass_args=tensorrt_config)

INFO     Applying fake quantization to PyTorch model...


op is placeholder
placeholder not in QUANTIZEABLE_OP
op is conv2d
node.op == call_module
op is batch_norm2d
batch_norm2d not in QUANTIZEABLE_OP
op is relu
relu not in QUANTIZEABLE_OP
op is max_pool2d
max_pool2d not in QUANTIZEABLE_OP
op is conv2d
node.op == call_module
op is batch_norm2d
batch_norm2d not in QUANTIZEABLE_OP
op is relu
relu not in QUANTIZEABLE_OP
op is conv2d
node.op == call_module
op is batch_norm2d
batch_norm2d not in QUANTIZEABLE_OP
op is add
add not in QUANTIZEABLE_OP
op is relu
relu not in QUANTIZEABLE_OP
op is conv2d
node.op == call_module
op is batch_norm2d
batch_norm2d not in QUANTIZEABLE_OP
op is relu
relu not in QUANTIZEABLE_OP
op is conv2d
node.op == call_module
op is batch_norm2d
batch_norm2d not in QUANTIZEABLE_OP
op is add
add not in QUANTIZEABLE_OP
op is relu
relu not in QUANTIZEABLE_OP
op is conv2d
node.op == call_module
op is batch_norm2d
batch_norm2d not in QUANTIZEABLE_OP
op is relu
relu not in QUANTIZEABLE_OP
op is conv2d
node.op == call_module
op is 

INFO     Fake quantization applied to PyTorch model.
INFO     Quantized graph histogram:
INFO
| Original type     | OP                  |   Total |   Changed |   Unchanged |
|-------------------+---------------------+---------+-----------+-------------|
| AdaptiveAvgPool2d | adaptive_avg_pool2d |       1 |         0 |           1 |
| BatchNorm2d       | batch_norm2d        |      20 |         0 |          20 |
| MaxPool2d         | max_pool2d          |       1 |         0 |           1 |
| QuantConv2d       | conv2d              |      20 |         0 |          20 |
| QuantLinear       | linear              |       1 |         0 |           1 |
| ReLU              | relu                |      17 |         0 |          17 |
| add               | add                 |       8 |         0 |           8 |
| flatten           | flatten             |       1 |         0 |           1 |
| output            | output              |       1

op is batch_norm2d
batch_norm2d not in QUANTIZEABLE_OP
op is add
add not in QUANTIZEABLE_OP
op is relu
relu not in QUANTIZEABLE_OP
op is conv2d
node.op == call_module
op is batch_norm2d
batch_norm2d not in QUANTIZEABLE_OP
op is relu
relu not in QUANTIZEABLE_OP
op is conv2d
node.op == call_module
op is batch_norm2d
batch_norm2d not in QUANTIZEABLE_OP
op is add
add not in QUANTIZEABLE_OP
op is relu
relu not in QUANTIZEABLE_OP
op is adaptive_avg_pool2d
adaptive_avg_pool2d not in QUANTIZEABLE_OP
op is flatten
flatten not in QUANTIZEABLE_OP
op is linear
node.op == call_module
op is output
output not in QUANTIZEABLE_OP


INFO     Disabling Quantization and Enabling Calibration
INFO     Disabling Quantization and Enabling Calibration
INFO     Disabling Quantization and Enabling Calibration
INFO     Disabling Quantization and Enabling Calibration
INFO     Disabling Quantization and Enabling Calibration
INFO     Disabling Quantization and Enabling Calibration
INFO     Disabling Quantization and Enabling Calibration
INFO     Disabling Quantization and Enabling Calibration
INFO     Disabling Quantization and Enabling Calibration
INFO     Disabling Quantization and Enabling Calibration
INFO     Disabling Quantization and Enabling Calibration
INFO     Disabling Quantization and Enabling Calibration
INFO     Disabling Quantization and Enabling Calibration
INFO     Di

Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]

I0316 17:05:57.926141 140148613170240 rank_zero.py:63] `Trainer.fit` stopped: `max_epochs=2` reached.
INFO     Fine Tuning Complete


### Section 1.4 TensorRT Quantization

After QAT, we are ready to convert the model to a TensorRT engine so that it can run with faster inference speeds. To do so, we use the `tensorrt_engine_interface_pass`, which first exports the `MaseGraph`'s PyTorch model to ONNX as an intermediate stage of the conversion.

During the conversion process, the `.onnx` and `.trt` files are stored in their respective folders, as shown in [Section 1.3](#section-13-quantized-aware-training-qat).

This interface pass returns a dictionary containing the `onnx_path` and `trt_engine_path`.
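
Before passing these paths on, it can be useful to confirm that the files were actually written, since a failed engine build may leave only the ONNX file behind. The following is a sketch under the assumption that the returned dictionary maps `onnx_path` and `trt_engine_path` to filesystem paths. The `summarize_artifacts` helper is hypothetical, and the demo uses temporary placeholder files rather than real pass output.

```python
import tempfile
from pathlib import Path


def summarize_artifacts(meta, keys=("onnx_path", "trt_engine_path")):
    """Map each expected key to (path, exists, size_in_bytes)."""
    summary = {}
    for key in keys:
        path = Path(meta[key])
        exists = path.is_file()
        summary[key] = (str(path), exists, path.stat().st_size if exists else 0)
    return summary


# Demo with placeholder files: only the ONNX file exists here, which mimics
# a run where the TensorRT build did not complete.
with tempfile.TemporaryDirectory() as tmp:
    onnx_file = Path(tmp) / "model.onnx"
    onnx_file.write_bytes(b"\x00" * 16)
    demo_meta = {"onnx_path": onnx_file, "trt_engine_path": Path(tmp) / "model.trt"}
    demo_summary = summarize_artifacts(demo_meta)

for key, (path, exists, size) in demo_summary.items():
    print(f"{key}: exists={exists}, size={size} bytes")
```

With the real pass output, the call would be `summarize_artifacts(meta)` using the `meta` dictionary returned in the next cell.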

In [7]:
mg, meta = tensorrt_engine_interface_pass(mg, pass_args=tensorrt_config)

INFO     Starting reference convert for QAT model...
INFO     Reference model convert done. Now exporting to ONNX...
INFO     ONNX Conversion Complete. Stored ONNX model to /workspace/ADLS_Proj/mase_output/tensorrt/quantization/resnet18_cls_cifar10_2025-03-16/2025-03-16/version_9/model.onnx
INFO     Converting PyTorch model to TensorRT...


[03/16/2025-17:06:01] [TRT] [W] Calibrator is not being used. Users must provide dynamic range for all tensors that are not Int32 or Bool.
[03/16/2025-17:06:01] [TRT] [E] IBuilder::buildSerializedNetwork: Error Code 4: Internal Error (Calibration failure occurred with no scaling factors detected. This could be due to no int8 calibrator or insufficient custom scales for network layers. Please see int8 sample to setup calibration correctly.)


Exception: Failed to build serialized network. A builderflag or config parameter may be incorrect or the ONNX model is unsupported.

In [7]:
mg, meta = tensorrt_engine_interface_pass(mg, pass_args=tensorrt_config)

INFO     Starting reference convert for QAT model...
INFO     Reference model convert done. Now exporting to ONNX...
INFO     ONNX Conversion Complete. Stored ONNX model to /workspace/ADLS_Proj/mase_output/tensorrt/quantization/resnet18_cls_cifar10_2025-03-16/2025-03-16/version_5/model.onnx
INFO     Converting PyTorch model to TensorRT...
INFO     TensorRT Conversion Complete. Stored trt model to /workspace/ADLS_Proj/mase_output/tensorrt/quantization/resnet18_cls_cifar10_2025-03-16/2025-03-16/version_6/model.trt
INFO     TensorRT Model Summary Exported to /workspace/ADLS_Proj/mase_output/tensorrt/quantization/resnet18_cls_cifar10_2025-03-16/2025-03-16/version_7/model.json


In [7]:
mg, meta = tensorrt_engine_interface_pass(mg, pass_args=tensorrt_config)

INFO     Converting PyTorch model to ONNX...
INFO     ONNX Conversion Complete. Stored ONNX model to /workspace/ADLS_Proj/mase_output/tensorrt/quantization/resnet18_cls_cifar10_2025-03-16/2025-03-16/version_1/model.onnx
INFO     Converting PyTorch model to TensorRT...
INFO     TensorRT Conversion Complete. Stored trt model to /workspace/ADLS_Proj/mase_output/tensorrt/quantization/resnet18_cls_cifar10_2025-03-16/2025-03-16/version_2/model.trt
INFO     TensorRT Model Summary Exported to /workspace/ADLS_Proj/mase_output/tensorrt/quantization/resnet18_cls_cifar10_2025-03-16/2025-03-16/version_3/model.json


### Section 1.5 Performance Analysis

To showcase the improved inference speed and to evaluate accuracy and other performance metrics, the `runtime_analysis_pass` can be used.

The TensorRT engine path obtained from the previous interface pass is now passed to the analysis pass. The same pass can also take a `MaseGraph` or an ONNX graph as input. For this comparison, we first run the analysis pass on the original unquantized model and then on the int8 quantized model.

In [8]:
_, _ = runtime_analysis_pass(mg_original, pass_args=runtime_analysis_config)
_, _ = runtime_analysis_pass(meta['trt_engine_path'], pass_args=runtime_analysis_config)

INFO     Starting transformation analysis on resnet18
INFO
Results resnet18:
+------------------------------+-------------+
|      Metric (Per Batch)      |    Value    |
+------------------------------+-------------+
|    Average Test Accuracy     |   0.73905   |
|      Average Precision       |   0.74136   |
|        Average Recall        |   0.74307   |
|       Average F1 Score       |   0.74105   |
|         Average Loss         |   0.73818   |
|       Average Latency        |  46.81 ms   |
|   Average GPU Power Usage    |  13.424 W   |
| Inference Energy Consumption | 0.17455 mWh |
+------------------------------+-------------+
INFO     Runtime analysis results saved to /workspace/mase_output/tensorrt/quantization/resnet18_cls_cifar10_2025-03-16/mase_graph/version_72/model.json
INFO     Starting transformation analysis on resnet18-trt_quantized


[03/16/2025-16:58:46] [TRT] [W] Using default stream in enqueueV3() may lead to performance issues due to additional calls to cudaStreamSynchronize() by TensorRT to ensure correct synchronization. Please use non-default stream instead.


INFO
Results resnet18-trt_quantized:
+------------------------------+--------------+
|      Metric (Per Batch)      |    Value     |
+------------------------------+--------------+
|    Average Test Accuracy     |   0.73933    |
|      Average Precision       |   0.74141    |
|        Average Recall        |   0.74327    |
|       Average F1 Score       |   0.74114    |
|         Average Loss         |   0.73806    |
|       Average Latency        |  19.354 ms   |
|   Average GPU Power Usage    |   13.729 W   |
| Inference Energy Consumption | 0.073811 mWh |
+------------------------------+--------------+
INFO     Runtime analysis results saved to /workspace/mase_output/tensorrt/quantization/resnet18_cls_cifar10_2025-03-16/tensorrt/version_3/model.json


As shown above, the average latency per batch has dropped from 46.81 ms to 19.354 ms (roughly 2.4x) for the `resnet18` model without compromising accuracy, thanks to the well-calibrated amax values, the quantization-aware fine-tuning and the additional runtime optimizations from TensorRT. The inference energy consumption has dropped by a similar factor, from 0.17455 mWh to 0.073811 mWh. This demonstrates why quantization matters in industry, especially for LLMs, as a way to reduce energy usage.
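
The figures in the two result tables can be cross-checked with a few lines of arithmetic. The sketch below copies the reported values and assumes that the energy per batch is simply average power multiplied by average latency. How the analysis pass computes it internally is not shown in this notebook, so treat that relation as an assumption that happens to match the reported numbers closely.

```python
def energy_mwh(power_w, latency_ms):
    """Energy per batch in mWh, from average power (W) and latency (ms)."""
    joules = power_w * latency_ms / 1000.0
    return joules / 3.6  # 1 mWh = 3.6 J


# Values copied from the two result tables above.
baseline = {"latency_ms": 46.81, "power_w": 13.424, "energy_mwh": 0.17455}
quantized = {"latency_ms": 19.354, "power_w": 13.729, "energy_mwh": 0.073811}

speedup = baseline["latency_ms"] / quantized["latency_ms"]
energy_ratio = baseline["energy_mwh"] / quantized["energy_mwh"]

print(f"latency speedup:  {speedup:.2f}x")
print(f"energy reduction: {energy_ratio:.2f}x")

for name, run in (("baseline", baseline), ("quantized", quantized)):
    estimate = energy_mwh(run["power_w"], run["latency_ms"])
    print(f"{name}: reported {run['energy_mwh']} mWh, power x latency gives {estimate:.5f} mWh")
```

Because the average power draw is almost unchanged between the two runs, the energy saving comes almost entirely from the shorter latency.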

## Section 2. FP16 Quantization

We will now load a new toml configuration that uses fp16 instead of int8, whilst keeping the other settings exactly the same for a fair comparison. This time, however, we will use chop from the terminal, which runs all the passes showcased in [Section 1](#section-1---int8-quantization).

Float quantization does not require calibration, nor is it supported by `pytorch-quantization`, so the model will not undergo fake quantization. For the time being, this unfortunately means that QAT is unavailable and the model only undergoes Post Training Quantization (PTQ).
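
One way to see why fp16 needs no calibration is that a half-precision float keeps its own exponent, so values of very different magnitudes are each stored with a small relative error and no per-tensor range has to be chosen. The sketch below illustrates this with Python's `struct` module, whose `e` format is IEEE 754 half precision. It illustrates the number format only, not how TensorRT performs the conversion.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


samples = (1.0, 0.1, 1e-3, 1234.5678)
rounded = [to_fp16(x) for x in samples]

for x, y in zip(samples, rounded):
    relative_error = abs(y - x) / abs(x)
    print(f"{x!r:>12} -> {y!r:<22} relative error {relative_error:.2e}")
```

Values beyond the half-precision range (about 65504 in magnitude) cannot be represented, and `struct.pack` raises `OverflowError` for them, so very large activations remain a concern even without calibration.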

In [7]:
JSC_INT8_BY_TYPE_TOML = "/workspace/ADLS_Proj/docs/tutorials/tensorrt/resnet18_INT8_quant.toml"
JSC_CHECKPOINT_PATH = "/workspace/ADLS_Proj/mase_output/resnet18_cls_cifar10_2025-03-08/software/training_ckpts/best.ckpt"
!python ch transform --config {JSC_INT8_BY_TYPE_TOML} --load {JSC_CHECKPOINT_PATH} --load-type pl

INFO: Seed set to 0
I0315 02:21:21.056977 140494406755392 seed.py:57] Seed set to 0
+-------------------------+--------------------------+--------------+--------------------------+--------------------------+
| Name                    |         Default          | Config. File |     Manual Override      |        Effective         |
+-------------------------+--------------------------+--------------+--------------------------+--------------------------+
| task                    |      classification      |     cls      |                          |           cls            |
| load_name               |           None           |              | /workspace/ADLS_Proj/mas | /workspace/ADLS_Proj/mas |
|                         |                          |              | e_output/resnet18_cls_ci | e_output/resnet18_cls_ci |
|                         |                          |              | far10_2025-03-08/softwar | far10_2025-03-08/softwar |
|                     

In [4]:
JSC_FP16_BY_TYPE_TOML = "/workspace/ADLS_Proj/docs/tutorials/tensorrt/resnet18_FP16_quant.toml"
JSC_CHECKPOINT_PATH = "/workspace/ADLS_Proj/mase_output/resnet18_cls_cifar10_2025-03-08/software/training_ckpts/best.ckpt"
!python ch transform --config {JSC_FP16_BY_TYPE_TOML} --load {JSC_CHECKPOINT_PATH} --load-type pl

INFO: Seed set to 0
I0315 01:36:29.907291 139755539158080 seed.py:57] Seed set to 0
+-------------------------+--------------------------+--------------+--------------------------+--------------------------+
| Name                    |         Default          | Config. File |     Manual Override      |        Effective         |
+-------------------------+--------------------------+--------------+--------------------------+--------------------------+
| task                    |      classification      |     cls      |                          |           cls            |
| load_name               |           None           |              | /workspace/ADLS_Proj/mas | /workspace/ADLS_Proj/mas |
|                         |                          |              | e_output/resnet18_cls_ci | e_output/resnet18_cls_ci |
|                         |                          |              | far10_2025-03-08/softwar | far10_2025-03-08/softwar |
|                     

In [6]:
JSC_FP32_BY_TYPE_TOML = "/workspace/ADLS_Proj/docs/tutorials/tensorrt/resnet18_FP32_quant.toml"
JSC_CHECKPOINT_PATH = "/workspace/ADLS_Proj/mase_output/resnet18_cls_cifar10_2025-03-08/software/training_ckpts/best.ckpt"
!python ch transform --config {JSC_FP32_BY_TYPE_TOML} --load {JSC_CHECKPOINT_PATH} --load-type pl

INFO: Seed set to 0
I0315 01:43:15.535419 140335826768960 seed.py:57] Seed set to 0
+-------------------------+--------------------------+--------------+--------------------------+--------------------------+
| Name                    |         Default          | Config. File |     Manual Override      |        Effective         |
+-------------------------+--------------------------+--------------+--------------------------+--------------------------+
| task                    |      classification      |     cls      |                          |           cls            |
| load_name               |           None           |              | /workspace/ADLS_Proj/mas | /workspace/ADLS_Proj/mas |
|                         |                          |              | e_output/resnet18_cls_ci | e_output/resnet18_cls_ci |
|                         |                          |              | far10_2025-03-08/softwar | far10_2025-03-08/softwar |
|                     

As you can see, `fp16` achieves a slightly higher test accuracy than int8 quantization but a slightly lower latency (~30%); it is still ~2.5x faster than the unquantized model. Now let's apply quantization to a more complicated model.

## Section 3. Type-wise Mixed Precision on Larger Model
We will now quantize `vgg7`, which includes both convolutional and linear layers. For this demonstration, however, we want to quantize all layer types except the linear layers.

In this case, we set:

- The `by` parameter to `type`
- The `quantize` parameter to true and the `precision` parameter to 'int8' for `passes.tensorrt.conv2d.config`.
- The `input` and `weight` quantize axis for the conv2d layers.
- The default `passes.tensorrt.default.config` `precision` to 'fp16'.

During the TensorRT quantization, the model's conv2d layers will be converted to an int8 fake quantized form, whilst the linear layers are kept at their default 'fp16'. The conv2d layers are then calibrated and fine-tuned before quantization and inference.
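
The effect of a type-wise configuration can be pictured as a lookup with a fallback: a layer type with its own section uses it, and every other type falls back to `default`. The sketch below is illustrative only. The dictionary mirrors the settings described above, the `quantize: False` value in the default section is an assumption, and `resolve_config` is a hypothetical helper rather than the logic MASE uses internally.

```python
# Illustrative settings; the `default` quantize value is an assumption.
typewise_config = {
    "by": "type",
    "default": {"config": {"quantize": False, "precision": "fp16"}},
    "conv2d": {"config": {"quantize": True, "precision": "int8"}},
}


def resolve_config(config, key):
    """Return the settings for `key` if it has a section, else the default ones."""
    return config.get(key, config["default"])["config"]


for op_type in ("conv2d", "linear"):
    settings = resolve_config(typewise_config, op_type)
    print(f"{op_type:<8} quantize={settings['quantize']!s:<5} precision={settings['precision']}")
```

Only `conv2d` has its own section here, so the linear layers pick up the default precision.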

You may either download a pretrained model [here](https://imperiallondon-my.sharepoint.com/:f:/g/personal/zz7522_ic_ac_uk/Emh3VT7Q_qRFmnp8kDrcgDoBwGUuzLwwKNtX8ZAt368jJQ?e=gsKONa) or train it yourself as shown below.

In [2]:
VGG_TYPEWISE_TOML = "vgg7_typewise_mixed_precision.toml"

!python /root/local/mase-gp/src/ch train --config {VGG_TYPEWISE_TOML}

INFO: Seed set to 0
I0306 15:06:31.029283 140282827188032 seed.py:54] Seed set to 0
+-------------------------+----------------------+--------------+-----------------+----------------------+
| Name                    |       Default        | Config. File | Manual Override |      Effective       |
+-------------------------+----------------------+--------------+-----------------+----------------------+
| task                    |    classification    |     cls      |                 |         cls          |
| load_name               |         None         |              |                 |         None         |
| load_type               |          mz          |              |                 |          mz          |
| batch_size              |         128          |      64      |                 |          64          |
| to_debug                |        False         |              |                 |        False         |
| log_level               |       

We will now load the checkpoint, quantize the model and compare it to the unquantized version as we did in [Section 1.5](#section-15-performance-analysis).

In [1]:
VGG_TYPEWISE_TOML = "vgg7_typewise_mixed_precision.toml"
# Change this checkpoint path accordingly
VGG_CHECKPOINT_PATH = "/root/local/mase-gp/mase_output/vgg7_cifar_cls_cifar10_2025-03-06/software/training_ckpts/best.ckpt"

In [3]:
!python /root/local/mase-gp/src/ch transform --config {VGG_TYPEWISE_TOML} --load {VGG_CHECKPOINT_PATH} --load-type pl

INFO: Seed set to 0
I0306 15:58:00.578649 140271219853120 seed.py:54] Seed set to 0
+-------------------------+----------------------+--------------+--------------------------+--------------------------+
| Name                    |       Default        | Config. File |     Manual Override      |        Effective         |
+-------------------------+----------------------+--------------+--------------------------+--------------------------+
| task                    |    classification    |     cls      |                          |           cls            |
| load_name               |         None         |              | /root/local/mase-gp/mase | /root/local/mase-gp/mase |
|                         |                      |              | _output/vgg7_cifar_cls_c | _output/vgg7_cifar_cls_c |
|                         |                      |              | ifar10_2025-03-06/softwa | ifar10_2025-03-06/softwa |
|                         |                      |

By quantizing all convolutional layers to int8 and keeping the linear layers at fp16, we see a marginal decrease in latency whilst maintaining comparable accuracy. By experimenting with precisions on a per-type basis, you may find the settings that work best for your model.

## Section 4. Layer-wise Mixed Precision

So far we have quantized strictly in either int8 or fp16. Now we will show how to apply layer-wise mixed precision using the same `vgg7` model. For instance, layers 0 and 1 can be set to fp16 while the remaining layers are int8 quantized.

For this, we set:
- The `by` parameter to `name`
- The `precision` to 'int8' for `passes.tensorrt.default.config`
- The `precision` to 'fp16' for `passes.tensorrt.feature_layers_0.config` and `passes.tensorrt.feature_layers_1.config`
- The `precision` to 'int8' for `passes.tensorrt.feature_layers_2.config` and `passes.tensorrt.feature_layers_3.config` (although this is not necessary since the default is already set to 'int8')
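
The settings above amount to a per-name precision plan with a default. As a rough sketch, the snippet below builds that plan for four layer names. The dictionary mirrors the list above, the node names are a hypothetical subset of the `vgg7` graph, and `precision_plan` is an illustrative helper, not part of MASE.

```python
# Mirrors the settings listed above; layers 2 and 3 rely on the default.
layerwise_config = {
    "by": "name",
    "default": {"config": {"precision": "int8"}},
    "feature_layers_0": {"config": {"precision": "fp16"}},
    "feature_layers_1": {"config": {"precision": "fp16"}},
}


def precision_plan(config, node_names):
    """Map each node name to its configured precision, falling back to the default."""
    default = config["default"]["config"]["precision"]
    return {
        name: config.get(name, {}).get("config", {}).get("precision", default)
        for name in node_names
    }


# Hypothetical subset of node names, for illustration only.
node_names = [f"feature_layers_{i}" for i in range(4)]
plan = precision_plan(layerwise_config, node_names)

for name, precision in plan.items():
    print(f"{name}: {precision}")
```

Listing `feature_layers_2` and `feature_layers_3` explicitly in the toml file gives the same plan, which is why those entries are optional.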

In [None]:
VGG_LAYERWISE_TOML = "../../../machop/configs/tensorrt/vgg7_layerwise_mixed_precision.toml"

!python /root/local/mase-gp/src/ch transform --config {VGG_LAYERWISE_TOML} --load {VGG_CHECKPOINT_PATH} --load-type pl

[2024-03-28 23:25:51,157] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO: Seed set to 0
I0328 23:25:54.303634 140449214740288 seed.py:54] Seed set to 0
+-------------------------+------------------------+--------------+--------------------------+--------------------------+
| Name                    |        Default         | Config. File |     Manual Override      |        Effective         |
+-------------------------+------------------------+--------------+--------------------------+--------------------------+
| task                    |     classification     |     cls      |                          |           cls            |
| load_name               |          None          |              | /root/mase/mase_output/v | /root/mase/mase_output/v |
|                         |                        |              |  gg7-pre-trained/test-   |  gg7-pre-trained/test-   |
|                         |           

In this case, the quantization summary shows that one convolutional layer (feature_layers_1) has not been quantized, because its precision will be configured to 'fp16' at the TensorRT engine conversion stage, whilst the remaining convolutional and linear layers have been quantized.