# Welcome To the TensorRT Quantization Tutorial!

This notebook is designed to show the features of the TensorRT passes integrated into MASE.

## Section 1 - INT8 Quantization
First, we will show you how to perform INT8 quantization of a simple model, `jsc-toy`, and compare the quantized model to the original model using the `Machop API`. The quantization process is split into the following stages, each implemented as its own pass:

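1. Fake quantization - `tensorrt_fake_quantize_transform_pass`
2. Calibration - `tensorrt_calibrate_transform_pass`
3. Quantization-Aware Training - `tensorrt_fine_tune_transform_pass`
4. Quantization - `tensorrt_engine_interface_pass`
5. Analysis - `tensorrt_analysis_pass`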
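
Throughout this notebook each pass is called as `graph, info = some_pass(graph, pass_args)`. The sketch below only illustrates that calling convention with toy stand-ins; the stage functions and the dictionary used as a "graph" are hypothetical and are not the MASE implementations.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Graph = Dict[str, Any]
Pass = Callable[[Graph, Optional[dict]], Tuple[Graph, dict]]


def make_stage(name: str) -> Pass:
    """Build a toy pass that records its name (hypothetical stand-in)."""
    def stage(graph: Graph, pass_args: Optional[dict] = None) -> Tuple[Graph, dict]:
        new_graph = dict(graph)
        new_graph["history"] = graph.get("history", []) + [name]
        return new_graph, {"stage": name}
    return stage


# One toy stage per step in the list above
STAGE_NAMES = ["fake_quantize", "calibrate", "fine_tune", "quantize", "analysis"]
STAGES: List[Pass] = [make_stage(n) for n in STAGE_NAMES]


def run_pipeline(graph: Graph, pass_args: Optional[dict] = None):
    """Thread the graph through every stage, collecting each stage's info."""
    infos = []
    for stage in STAGES:
        graph, info = stage(graph, pass_args)
        infos.append(info)
    return graph, infos


graph, infos = run_pipeline({"model": "jsc-toy"})
print(graph["history"])
```

Because every stage has the same signature, stages can be skipped or reordered without changing the surrounding code.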


We start by loading the libraries and passes required for the notebook, and by ensuring that the correct path is set so that machop can be used.

In [2]:
import sys
import os
from pathlib import Path
import toml
from copy import copy, deepcopy

# Figure out the correct path
machop_path = Path(".").resolve().parent.parent.parent /"machop"
assert machop_path.exists(), "Failed to find machop at: {}".format(machop_path)
sys.path.append(str(machop_path))

# Add directory to the PATH so that chop can be called
new_path = "../../../machop"
full_path = os.path.abspath(new_path)
os.environ['PATH'] += os.pathsep + full_path

from chop.tools.logger import set_logging_verbosity
from chop.passes.graph.utils import deepcopy_mase_graph
from chop.tools.get_input import InputGenerator
from chop.tools.checkpoint_load import load_model
from chop.ir import MaseGraph
from chop.models import get_model_info, get_model
from chop.dataset import MaseDataModule, get_dataset_info
from chop.passes.graph import (
    save_node_meta_param_interface_pass,
    report_node_meta_param_analysis_pass,
    profile_statistics_analysis_pass,
    add_common_metadata_analysis_pass,
    init_metadata_analysis_pass,
    add_software_metadata_analysis_pass,
    tensorrt_calibrate_transform_pass,
    tensorrt_fake_quantize_transform_pass,
    tensorrt_fine_tune_transform_pass,
    tensorrt_engine_interface_pass,
    tensorrt_analysis_pass,
    )

set_logging_verbosity("info")

[2024-03-16 15:25:08,775] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO     Set logging level to info


Next, we load the TOML file used for quantization. To view the configuration, click [here](../../machop/configs/tensorrt/jsc_toy_INT8_quantization_by_type.toml), or read the documentation on Mase [here]().

In [2]:
# Path to your TOML file
toml_file_path = '../../../machop/configs/tensorRT/jsc_toy_INT8_quantization_by_type.toml'

# Reading TOML file and converting it into a Python dictionary
with open(toml_file_path, 'r') as toml_file:
    pass_args = toml.load(toml_file)

# Extract the 'passes.tensorrt_quantize' section and its children
tensorrt_quantize_config = pass_args.get('passes', {}).get('tensorrt_quantize', {})
# Extract the 'passes.tensorrt_fine_tune' section and its children
tensorrt_train_config = pass_args.get('passes', {}).get('tensorrt_fine_tune', {})
# Extract the 'passes.tensorrt_analysis' section and its children
tensorrt_analysis_config = pass_args.get('passes', {}).get('tensorrt_analysis', {})
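
The chained `.get(..., {})` lookups above return an empty dictionary when a section is missing instead of raising a `KeyError`. A minimal sketch of that behaviour, using a hand-written dictionary in place of the parsed TOML file (the `by` key and its value are hypothetical placeholders, not taken from the real configuration):

```python
# Hypothetical stand-in for the dictionary returned by toml.load
example_args = {
    "model": "jsc-toy",
    "passes": {
        "tensorrt_quantize": {"by": "type"},  # placeholder content
    },
}


def get_pass_config(args: dict, pass_name: str) -> dict:
    """Return the sub-table for `pass_name`, or {} if it is absent."""
    return args.get("passes", {}).get(pass_name, {})


quantize_cfg = get_pass_config(example_args, "tensorrt_quantize")
fine_tune_cfg = get_pass_config(example_args, "tensorrt_fine_tune")

print(quantize_cfg)   # {'by': 'type'}
print(fine_tune_cfg)  # {}
```

The trade-off is that a misspelled section name silently yields `{}`, so it can be worth checking that each extracted configuration is non-empty before passing it on.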

We then create a `MaseGraph` by loading a pre-trained model from the provided checkpoint, using the model arguments from the TOML configuration.

In [3]:
CHECKPOINT_PATH = "checkpoints/jsc-toy_classification_jsc/software/training_ckpts/best.ckpt"

# Load the basic settings from the TOML configuration
model_name = pass_args['model']
dataset_name = pass_args['dataset']
max_epochs = pass_args['max_epochs']
batch_size = pass_args['batch_size']
learning_rate = pass_args['learning_rate']
accelerator = pass_args['accelerator']

data_module = MaseDataModule(
    name=dataset_name,
    batch_size=batch_size,
    model_name=model_name,
    num_workers=0,
)
data_module.prepare_data()
data_module.setup()


model_info = get_model_info(model_name)
# quant_modules.initialize()
model = get_model(
    model_name,
    task="cls",
    dataset_info=data_module.dataset_info,
    pretrained=False)

model = load_model(load_name=CHECKPOINT_PATH, load_type="pl", model=model)

input_generator = InputGenerator(
    data_module=data_module,
    model_info=model_info,
    task="cls",
    which_dataloader="train",
)

dummy_in = next(iter(input_generator))
_ = model(**dummy_in)

# generate the mase graph and initialize node metadata
mg = MaseGraph(model=model)

mg, _ = init_metadata_analysis_pass(mg, None)
mg, _ = add_common_metadata_analysis_pass(mg, {"dummy_in": dummy_in})
mg, _ = add_software_metadata_analysis_pass(mg, None)

INFO     Loaded pytorch lightning checkpoint from checkpoints/jsc-toy_classification_jsc/software/training_ckpts/best.ckpt


Before we begin, we copy the original `MaseGraph` so that it can be used for comparison during quantization analysis.

In [None]:
mg_original = deepcopy_mase_graph(mg)

### Section 1.1 Fake Quantization

First we fake quantize the model so that calibration and fine-tuning can be performed before the actual quantization. This is only required for INT8 quantization, as other precisions are not currently supported by the [pytorch-quantization](https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/index.html#) library.

This is achieved through the `tensorrt_fake_quantize_transform_pass`, which goes through the model, either by type or by name, and replaces each layer with its quantized form. Data samples are then passed to the quantizers to decide the best amax for activations.
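
The idea behind fake quantization is to quantize a value onto an integer grid and immediately map it back to floating point, so the model still runs in floating point but sees the rounding and clipping error. A minimal sketch, assuming a symmetric signed range and a single per-tensor `amax`; this is a simplification for illustration, not the pytorch-quantization implementation:

```python
def fake_quantize(x: float, amax: float, num_bits: int = 8) -> float:
    """Quantize x onto a signed integer grid, then map it back to float."""
    bound = 2 ** (num_bits - 1) - 1    # 127 for INT8
    scale = amax / bound               # size of one quantization step
    q = round(x / scale)               # nearest integer level
    q = max(-bound, min(bound, q))     # clip values beyond +/- amax
    return q * scale


amax = 1.27
for value in [0.004, 0.03, -0.5, 2.7]:
    print(value, "->", fake_quantize(value, amax))
```

Values much smaller than one step collapse to zero and values larger than `amax` are clipped, which is why the choice of `amax` during calibration matters.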

Calibrators can be added as a search space parameter to find the best-performing calibrator. The calibrators are specified in the TOML file, for example: `calibrators = ["percentile", "mse", "entropy"]`

Note: 
- To use `percentile` calibration, a list of percentiles must be given
- To use `max` calibration, the `histogram` weight and input calibrators must be removed and replaced with `max`. This will use the global maximum absolute value to calibrate the model.
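
To illustrate why the choice of calibrator matters, the sketch below compares a `max`-style amax with a percentile-style amax on data containing one outlier. It works directly on a list of samples with a nearest-rank percentile, which is an assumption made for brevity; the library's histogram-based calibrators are more involved.

```python
import math


def max_amax(samples):
    """Global maximum absolute value, as used by `max` calibration."""
    return max(abs(s) for s in samples)


def percentile_amax(samples, percentile):
    """Nearest-rank percentile of the absolute values (simplified)."""
    ordered = sorted(abs(s) for s in samples)
    rank = max(1, math.ceil(percentile * len(ordered) / 100.0))
    return ordered[rank - 1]


# 99 well-behaved activations plus a single large outlier
activations = [i / 10 for i in range(1, 100)] + [50.0]

print(max_amax(activations))               # 50.0
print(percentile_amax(activations, 99.0))  # 9.9
```

With `max`, the single outlier stretches the quantization range to 50.0, so most values share only a few integer levels. The percentile calibrator clips the outlier and keeps finer resolution for the remaining values.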


In [5]:
configs = [tensorrt_quantize_config, tensorrt_train_config, tensorrt_analysis_config]
for config in configs:
    config['batch_size'] = pass_args['batch_size']
    config['model'] = pass_args['model']
    config['data_module'] = data_module
    config['accelerator'] = 'cuda' if pass_args['accelerator'] == 'gpu' else pass_args['accelerator']
    # 'gpu' has already been mapped to 'cuda' above, so test for 'cuda' here
    if config['accelerator'] == 'cuda':
        os.environ['CUDA_MODULE_LOADING'] = 'LAZY'

mg, _ = tensorrt_fake_quantize_transform_pass(mg, pass_args=tensorrt_quantize_config)

INFO     Applying fake quantization to PyTorch model...
INFO     Fake quantization applied to PyTorch model.
INFO     Starting calibration of the model in PyTorch...
INFO     Disabling Quantization and Enabling Calibration
INFO     Disabling Quantization and Enabling Calibration
INFO     Disabling Quantization and Enabling Calibration
INFO     Disabling Quantization and Enabling Calibration

/opt/conda/envs/mase/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=47` in the `DataLoader` to improve performance.
/opt/conda/envs/mase/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=47` in the `DataLoader` to improve performance.

Epoch 0: 100%|██████████| 3084/3084 [02:12<00:00, 23.33it/s, v_num=2, train_acc_step=0.679, val_acc_epoch=0.733, val_loss_epoch=0.751]

I0316 15:23:03.750215 140527413331776 rank_zero.py:64] `Trainer.fit` stopped: `max_epochs=1` reached.

INFO     Converting PyTorch model to ONNX...
INFO     ONNX Conversion Complete. Stored ONNX model to /root/mase/mase_output/TensorRT/Quantization/ONNX/2024_03_16/version_6/model.onnx
INFO     Converting PyTorch model to TensorRT...

[03/16/2024-15:23:13] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading


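### Section 1.2 Calibration

The `tensorrt_calibrate_transform_pass` passes data samples through the fake-quantized model and decides the best amax for activations.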

In [None]:
mg, _ = tensorrt_calibrate_transform_pass(mg, pass_args=tensorrt_quantize_config)

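### Section 1.3 Quantization-Aware Training

The `tensorrt_fine_tune_transform_pass` fine-tunes the fake-quantized model so that its weights can adapt to the quantization error before the TensorRT engine is built.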

In [None]:
# Use the fine-tuning configuration extracted from 'passes.tensorrt_fine_tune'
mg, _ = tensorrt_fine_tune_transform_pass(mg, pass_args=tensorrt_train_config)

# Convert and store to ONNX and then TensorRT
mg, trt_meta = tensorrt_engine_interface_pass(mg, pass_args=tensorrt_quantize_config)

In [4]:
!ch transform --config ../../../machop/configs/tensorRT/jsc_toy_INT8_quantization_by_type.toml --load checkpoints/jsc-toy_classification_jsc/software/training_ckpts/best.ckpt --load-type pl

[2024-03-16 15:32:40,047] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO: Seed set to 0
+-------------------------+------------------------+--------------+--------------------------+--------------------------+
| Name                    |        Default         | Config. File |     Manual Override      |        Effective         |
+-------------------------+------------------------+--------------+--------------------------+--------------------------+
| task                    |     classification     |              |                          |      classification      |
| load_name               |          None          |              | /root/mase/docs/tutorial | /root/mase/docs/tutorial |
|                         |                        |              | s/tensorRT/checkpoints/j | s/tensorRT/checkpoints/j |

In [6]:
# Compare original tensorrt with quantized graph
trt_meta = {}
from pathlib import PosixPath

# JSC-Toy INT8 only quantization 
trt_meta['graph_path'] = PosixPath('/root/mase/mase_output/TensorRT/Quantization/TRT/2024_03_16/version_0/model.trt')

_, _ = tensorrt_analysis_pass(trt_meta['graph_path'], pass_args=tensorrt_analysis_config)

_, _ = tensorrt_analysis_pass(mg, pass_args=tensorrt_analysis_config)

[32mINFO    [0m [34m
TensorRT Engine Input/Output Information:
Index | Type    | DataType | Static Shape         | Dynamic Shape        | Name
------|---------|----------|----------------------|----------------------|-----------------------
0     | Input   | FLOAT    | (256, 16)              | (256, 16)              | input
1     | Output  | FLOAT    | (256, 5)               | (256, 5)               | 109[0m
I0316 15:12:06.617402 140528819410752 analysis.py:128] 
TensorRT Engine Input/Output Information:
Index | Type    | DataType | Static Shape         | Dynamic Shape        | Name
------|---------|----------|----------------------|----------------------|-----------------------
0     | Input   | FLOAT    | (256, 16)              | (256, 16)              | input
1     | Output  | FLOAT    | (256, 5)               | (256, 5)               | 109
[32mINFO    [0m [34mStarting TensorRT transformation analysis[0m
I0316 15:12:06.620040 140528819410752 analysis.py:202] Starting TensorR

[03/16/2024-15:12:06] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading


[32mINFO    [0m [34m
Results jsc-toy-quantized:
+------------------------------+--------------+
|            Metric            |    Value     |
+------------------------------+--------------+
|    Average Test Accuracy     |   0.73309    |
|      Average Precision       |   0.74935    |
|        Average Recall        |   0.73329    |
|       Average F1 Score       |   0.73688    |
|         Average Loss         |    0.7531    |
|       Average Latency        |  0.2472 ms   |
|   Average GPU Power Usage    |   67.63 W    |
| Inference Energy Consumption | 0.004644 mWh |
+------------------------------+--------------+[0m
I0316 15:12:13.257380 140528819410752 analysis.py:303] 
Results jsc-toy-quantized:
+------------------------------+--------------+
|            Metric            |    Value     |
+------------------------------+--------------+
|    Average Test Accuracy     |   0.73309    |
|      Average Precision       |   0.74935    |
|        Average Recall        |   0.73329    

In [6]:
# Compare the original graph with the quantized TensorRT engine
from pathlib import PosixPath

trt_meta = {}

# JSC-Toy FP16 only quantization
trt_meta['graph_path'] = PosixPath('/root/mase/mase_output/TensorRT/Quantization/TRT/2024_03_16/version_1/model.trt')

# Analyse the quantized TensorRT engine
_, _ = tensorrt_analysis_pass(trt_meta['graph_path'], pass_args=tensorrt_analysis_config)

# Analyse the original graph copied earlier for comparison
_, _ = tensorrt_analysis_pass(mg_original, pass_args=tensorrt_analysis_config)

INFO
TensorRT Engine Input/Output Information:
Index | Type    | DataType | Static Shape         | Dynamic Shape        | Name
------|---------|----------|----------------------|----------------------|-----------------------
0     | Input   | FLOAT    | (256, 16)            | (256, 16)            | input
1     | Output  | FLOAT    | (256, 5)             | (256, 5)             | 109
INFO     Starting TensorRT transformation analysis

[03/16/2024-11:24:17] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading

INFO
Results jsc-toy-quantized:
+------------------------------+---------------+
|            Metric            |     Value     |
+------------------------------+---------------+
|    Average Test Accuracy     |    0.73325    |
|      Average Precision       |    0.74924    |
|        Average Recall        |    0.73356    |
|       Average F1 Score       |    0.73732    |
|         Average Loss         |    0.75403    |
|       Average Latency        |   0.2609 ms   |
|   Average GPU Power Usage    |   68.244 W    |
| Inference Energy Consumption | 0.0049457 mWh |
+------------------------------+---------------+
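
The energy figures reported for the INT8 and FP16 runs appear consistent with average power multiplied by average latency, converted to milliwatt-hours. The check below is an inference from the reported numbers, not a description of how `tensorrt_analysis_pass` computes the value; `energy_mwh` is a hypothetical helper.

```python
def energy_mwh(power_w: float, latency_ms: float) -> float:
    """Energy of one inference in mWh, assuming energy = power x latency."""
    joules = power_w * (latency_ms / 1000.0)  # W x s = J
    watt_hours = joules / 3600.0              # 1 Wh = 3600 J
    return watt_hours * 1000.0                # Wh -> mWh


# Power and latency values copied from the INT8 and FP16 results tables
int8 = energy_mwh(67.63, 0.2472)
fp16 = energy_mwh(68.244, 0.2609)

print(f"INT8: {int8:.6f} mWh")
print(f"FP16: {fp16:.6f} mWh")
```

Both results agree with the reported values to within the rounding of the inputs, and the INT8 engine uses slightly less energy per inference mainly because of its lower latency.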