# Welcome To the TensorRT Quantization Tutorial!

This notebook is designed to show the features of the TensorRT passes integrated into MASE.

## Section 1 - INT8 Quantization
First, we show how to perform INT8 quantization of a simple model, `jsc-toy`, and compare the quantized model to the original using the Machop API. The quantization process is split into the following stages. Each stage has its own pass and is explained in depth in its own subsection:

1. [Fake quantization](#section-11-fake-quantization): `tensorrt_fake_quantize_transform_pass`
2. [Calibration](#sect): `tensorrt_calibrate_transform_pass`
3. [Quantization-aware training](#quantized-aware-training): `tensorrt_fine_tune_transform_pass`
4. [Quantization](#quantization): `tensorrt_engine_interface_pass`
5. [Analysis](#analysis): `tensorrt_analysis_pass`


We start by loading the libraries and passes required for the notebook, and by ensuring that the correct path is set so that machop can be used.

In [None]:
import sys
import os
from pathlib import Path
import toml
from copy import copy, deepcopy

# Figure out the correct path
machop_path = Path(".").resolve().parent.parent.parent / "machop"
assert machop_path.exists(), "Failed to find machop at: {}".format(machop_path)
sys.path.append(str(machop_path))

# Add directory to the PATH so that chop can be called
new_path = "../../../machop"
full_path = os.path.abspath(new_path)
os.environ['PATH'] += os.pathsep + full_path

from chop.tools.logger import set_logging_verbosity
from chop.passes.graph.utils import deepcopy_mase_graph
from chop.tools.get_input import InputGenerator
from chop.tools.checkpoint_load import load_model
from chop.ir import MaseGraph
from chop.models import get_model_info, get_model
from chop.dataset import MaseDataModule, get_dataset_info
from chop.passes.graph import (
    save_node_meta_param_interface_pass,
    report_node_meta_param_analysis_pass,
    profile_statistics_analysis_pass,
    add_common_metadata_analysis_pass,
    init_metadata_analysis_pass,
    add_software_metadata_analysis_pass,
    tensorrt_calibrate_transform_pass,
    tensorrt_fake_quantize_transform_pass,
    tensorrt_fine_tune_transform_pass,
    tensorrt_engine_interface_pass,
    tensorrt_analysis_pass,
    )

set_logging_verbosity("info")

Next, we load the TOML file used for quantization. To view the configuration, click [here](../../machop/configs/tensorrt/jsc_toy_INT8_quantization_by_type.toml), or read the documentation on MASE [here]().

In [None]:
# Path to your TOML file
toml_file_path = '../../../machop/configs/tensorRT/jsc_toy_INT8_quantization_by_type.toml'

# Reading TOML file and converting it into a Python dictionary
with open(toml_file_path, 'r') as toml_file:
    pass_args = toml.load(toml_file)

# Extract the 'passes.tensorrt_quantize' section and its children
tensorrt_quantize_config = pass_args.get('passes', {}).get('tensorrt_quantize', {})
# Extract the 'passes.tensorrt_fine_tune' section and its children
tensorrt_train_config = pass_args.get('passes', {}).get('tensorrt_fine_tune', {})
# Extract the 'passes.tensorrt_analysis' section and its children
tensorrt_analysis_config = pass_args.get('passes', {}).get('tensorrt_analysis', {})

We then create a `MaseGraph` by loading a pre-trained model from the provided checkpoint, using the model arguments from the TOML configuration.

In [None]:
CHECKPOINT_PATH = "checkpoints/jsc-toy_classification_jsc/software/training_ckpts/best.ckpt"

# Load the basic settings from the TOML configuration
model_name = pass_args['model']
dataset_name = pass_args['dataset']
max_epochs = pass_args['max_epochs']
batch_size = pass_args['batch_size']
learning_rate = pass_args['learning_rate']
accelerator = pass_args['accelerator']

data_module = MaseDataModule(
    name=dataset_name,
    batch_size=batch_size,
    model_name=model_name,
    num_workers=0,
)
data_module.prepare_data()
data_module.setup()


model_info = get_model_info(model_name)
# quant_modules.initialize()
model = get_model(
    model_name,
    task="cls",
    dataset_info=data_module.dataset_info,
    pretrained=False)

model = load_model(load_name=CHECKPOINT_PATH, load_type="pl", model=model)

input_generator = InputGenerator(
    data_module=data_module,
    model_info=model_info,
    task="cls",
    which_dataloader="train",
)

dummy_in = next(iter(input_generator))
_ = model(**dummy_in)

# generate the mase graph and initialize node metadata
mg = MaseGraph(model=model)

mg, _ = init_metadata_analysis_pass(mg, None)
mg, _ = add_common_metadata_analysis_pass(mg, {"dummy_in": dummy_in})
mg, _ = add_software_metadata_analysis_pass(mg, None)

Before we begin, we copy the original `MaseGraph` so that it can be used for comparison during the quantization analysis.

In [None]:
mg_original = deepcopy_mase_graph(mg)

### Section 1.1 Fake Quantization

First, we fake quantize the model so that calibration and fine-tuning can be performed before the actual quantization. This step is only used for INT8 calibration, as other precisions are not currently supported by the [pytorch-quantization](https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/index.html#) library.
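
To make the term "fake quantization" concrete, the following minimal sketch shows the quantize-then-dequantize round trip for a single value. It assumes symmetric INT8 quantization over the integer range [-127, 127], with the scale derived from `amax`. The helper name `fake_quantize` is hypothetical and is not part of MASE or pytorch-quantization.

```python
def fake_quantize(x: float, amax: float, num_bits: int = 8) -> float:
    """Quantize x onto a signed integer grid and map it straight back to float.

    Assumption: symmetric quantization, where [-amax, amax] is mapped onto
    the integers [-(2**(num_bits - 1) - 1), 2**(num_bits - 1) - 1].
    """
    max_int = 2 ** (num_bits - 1) - 1       # 127 for INT8
    scale = max_int / amax                  # integer steps per unit of x
    q = round(x * scale)                    # snap to the nearest integer level
    q = max(-max_int, min(max_int, q))      # clamp values outside [-amax, amax]
    return q / scale                        # dequantize back to float


print(fake_quantize(0.3, amax=1.0))   # close to 0.3, but snapped to the INT8 grid
print(fake_quantize(2.0, amax=1.0))   # saturates at amax
```

The output stays in floating point, so the model can still be calibrated and fine-tuned, while the values it sees already carry the rounding and clipping error of INT8.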

This is achieved through the `tensorrt_fake_quantize_transform_pass`, which goes through the model, either by type or by name, and replaces each layer with its fake quantized form if the `quantize` parameter is set in the default config (`passes.tensorrt_quantize.default.config`) or on a per-name or per-type basis. For example, to quantize only the linear layers and not the convolutional layers, set the `quantize` parameter to true in `passes.tensorrt_quantize.linear.config` and to false in the default config.
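
The default-plus-override behaviour described above can be sketched as a dictionary lookup. This is an illustration only: the helper `should_quantize` and the example dictionary are hypothetical, and the sketch assumes that a per-type `config` entry takes precedence over the default one. It is not MASE's actual lookup code.

```python
def should_quantize(layer_type: str, quantize_config: dict) -> bool:
    """Return True if a layer of the given type would be fake quantized.

    Assumption: per-type settings override the 'default' settings.
    """
    default = quantize_config.get("default", {}).get("config", {})
    override = quantize_config.get(layer_type, {}).get("config", {})
    merged = {**default, **override}        # per-type keys win over defaults
    return bool(merged.get("quantize", False))


# Hypothetical config: quantize linear layers only
example_config = {
    "default": {"config": {"quantize": False}},
    "linear": {"config": {"quantize": True}},
}

print(should_quantize("linear", example_config))
print(should_quantize("conv2d", example_config))
```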


In [None]:
configs = [tensorrt_quantize_config, tensorrt_train_config, tensorrt_analysis_config]
for config in configs:
    config['batch_size'] = pass_args['batch_size']
    config['model'] = pass_args['model']
    config['data_module'] = data_module
    config['accelerator'] = 'cuda' if pass_args['accelerator'] == 'gpu' else pass_args['accelerator']
    # 'gpu' has already been mapped to 'cuda' above, so check for 'cuda' here
    if config['accelerator'] == 'cuda':
        os.environ['CUDA_MODULE_LOADING'] = 'LAZY'

mg, _ = tensorrt_fake_quantize_transform_pass(mg, pass_args=tensorrt_quantize_config)

### Section 1.2 Calibration

Next, we perform calibration using the `tensorrt_calibrate_transform_pass`. Calibration is achieved by passing data samples to the quantizer and deciding the best amax (the maximum absolute value that defines the quantization range) for activations.

Calibrators can be added as a search space parameter to find the best-performing calibrator. They are listed in the TOML file, for example: `calibrators = ["percentile", "mse", "entropy"]`

Note: 
- To use `percentile` calibration, a list of percentiles must be given
- To use `max` calibration, the `histogram` weight and input calibrators must be removed and replaced with `max`. This uses the global maximum absolute value to calibrate the model.
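
The difference between `max` and `percentile` calibration can be illustrated with a small sketch that picks an amax from a list of observed activations. This is a simplification: the function names are hypothetical, and the sketch works on raw samples with a nearest-rank percentile, whereas the library's histogram-based calibrators operate on binned statistics.

```python
import math


def amax_max(samples):
    """Max calibration: use the global maximum absolute value."""
    return max(abs(x) for x in samples)


def amax_percentile(samples, percentile):
    """Percentile calibration (nearest-rank): ignore the largest outliers."""
    ordered = sorted(abs(x) for x in samples)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical activations with one large outlier
activations = [0.1, -0.2, 0.3, -0.4, 10.0]

print(amax_max(activations))             # dominated by the outlier
print(amax_percentile(activations, 80))  # the outlier is clipped instead
```

A smaller amax gives finer resolution for typical values at the cost of clipping outliers, which is why it can be worth trying several calibrators.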

In [None]:
mg, _ = tensorrt_calibrate_transform_pass(mg, pass_args=tensorrt_quantize_config)

### Section 1.3 Quantization-Aware Training (QAT)

The `tensorrt_fine_tune_transform_pass` is used to fine-tune the quantized model. For QAT, it is typical to train for 10% of the original training epochs, starting at 1% of the initial training learning rate, with a cosine annealing schedule that follows the decreasing half of a cosine period down to 1% of the initial fine-tuning learning rate (0.01% of the initial training learning rate). However, this default can be overridden by setting `epochs`, `initial_learning_rate` and `final_learning_rate` in `passes.tensorrt_quantize.fine_tune`.
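
The default schedule described above can be worked through numerically. The sketch below is an illustration and not the pass's implementation. The original training settings (`training_lr`, `training_epochs`) are hypothetical values chosen for the example, and `qat_learning_rate` is a hypothetical helper.

```python
import math


def qat_learning_rate(epoch, total_epochs, initial_lr, final_lr):
    """Cosine annealing over the decreasing half of a cosine period."""
    progress = epoch / total_epochs
    return final_lr + 0.5 * (initial_lr - final_lr) * (1 + math.cos(math.pi * progress))


# Hypothetical original training settings
training_lr = 1e-3
training_epochs = 100

qat_epochs = max(1, training_epochs // 10)   # 10% of the original epochs
qat_initial_lr = 0.01 * training_lr          # 1% of the training learning rate
qat_final_lr = 0.01 * qat_initial_lr         # 1% of the fine-tuning learning rate

schedule = [
    qat_learning_rate(epoch, qat_epochs, qat_initial_lr, qat_final_lr)
    for epoch in range(qat_epochs + 1)
]

for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:2d}: lr = {lr:.3e}")
```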

In [None]:
mg, _ = tensorrt_fine_tune_transform_pass(mg, pass_args=tensorrt_quantize_config)

### Section 1.4 TensorRT Quantization

After QAT, we are ready to convert the model to a TensorRT engine so that it can run with faster inference speeds. To do so, we use the `tensorrt_engine_interface_pass`, which first converts the `MaseGraph`'s model from PyTorch to ONNX format as an intermediate stage of the conversion.

In [None]:
mg, trt_meta = tensorrt_engine_interface_pass(mg, pass_args=tensorrt_quantize_config)

### Section 1.5 Performance Analysis

To showcase the improved inference speeds and to evaluate accuracy and other performance metrics, the `tensorrt_analysis_pass` can be used.

The TensorRT engine path obtained from the previous interface pass is now passed to the analysis pass. The same pass can also take a `MaseGraph` or an ONNX graph as input. For this comparison, we run the analysis pass on the INT8 quantized engine and then on the original unquantized model.

In [None]:
_, _ = tensorrt_analysis_pass(trt_meta['graph_path'], pass_args=tensorrt_analysis_config)

In [None]:
_, _ = tensorrt_analysis_pass(mg_original, pass_args=tensorrt_analysis_config)

In [None]:
!ch transform --config ../../../machop/configs/tensorRT/jsc_toy_INT8_quantization_by_type.toml --load checkpoints/jsc-toy_classification_jsc/software/training_ckpts/best.ckpt --load-type pl

In [None]:
# Compare original tensorrt with quantized graph
trt_meta = {}
from pathlib import PosixPath

# JSC-Toy INT8 only quantization 
trt_meta['graph_path'] = PosixPath('/root/mase/mase_output/TensorRT/Quantization/TRT/2024_03_16/version_0/model.trt')

_, _ = tensorrt_analysis_pass(trt_meta['graph_path'], pass_args=tensorrt_analysis_config)

_, _ = tensorrt_analysis_pass(mg, pass_args=tensorrt_analysis_config)

In [None]:
# Compare original tensorrt with quantized graph
trt_meta = {}
from pathlib import PosixPath

# JSC-Toy FP16 only quantization 
trt_meta['graph_path'] = PosixPath('/root/mase/mase_output/TensorRT/Quantization/TRT/2024_03_16/version_1/model.trt')

_, _ = tensorrt_analysis_pass(trt_meta['graph_path'], pass_args=tensorrt_analysis_config)

_, _ = tensorrt_analysis_pass(mg, pass_args=tensorrt_analysis_config)