# Welcome To the TensorRT Quantization Tutorial!

This notebook is designed to show the features of the TensorRT passes integrated into MASE.

## Section 1. INT8 Quantization
Firstly, we will show you how to do a INT8 quantization of a simple model, `jsc-toy`, and compare the quantized model to the original model using the `Machop API`. The quantization process is split into the following stages, each using their own individual pass, and are explained in depth at each subsection:

1. [Fake quantization](#section-11-fake-quantization): `tensorrt_fake_quantize_transform_pass`
2. [Calibration](#sect): `tensorrt_calibrate_transform_pass`
3. [Quantized Aware Training](#quantized-aware-training): `tensorrt_fine_tune_transform_pass`
4. [Quantization](#quantization): `tensorrt_engine_interface_pass`
5. [Analysis](#analysis): `tensorrt_analysis_pass`

We start by loading in the required libraries and passes required for the notebook as well as ensuring the correct path is set for machop to be used.

In [3]:
import sys
import os
from pathlib import Path
import toml
from copy import copy, deepcopy

# Figure out the correct path
machop_path = Path(".").resolve().parent.parent.parent /"machop"
assert machop_path.exists(), "Failed to find machop at: {}".format(machop_path)
sys.path.append(str(machop_path))

# Add directory to the PATH so that chop can be called
new_path = "../../../machop"
full_path = os.path.abspath(new_path)
os.environ['PATH'] += os.pathsep + full_path

from chop.tools.utils import to_numpy_if_tensor
from chop.tools.logger import set_logging_verbosity
from chop.tools import get_cf_args, get_dummy_input
from chop.passes.graph.utils import deepcopy_mase_graph
from chop.tools.get_input import InputGenerator
from chop.tools.checkpoint_load import load_model
from chop.ir import MaseGraph
from chop.models import get_model_info, get_model, get_tokenizer
from chop.dataset import MaseDataModule, get_dataset_info
from chop.passes.graph.transforms import metadata_value_type_cast_transform_pass
from chop.passes.graph import (
    summarize_quantization_analysis_pass,
    add_common_metadata_analysis_pass,
    init_metadata_analysis_pass,
    add_software_metadata_analysis_pass,
    tensorrt_calibrate_transform_pass,
    tensorrt_fake_quantize_transform_pass,
    tensorrt_fine_tune_transform_pass,
    tensorrt_engine_interface_pass,
    tensorrt_analysis_pass,
    )

set_logging_verbosity("info")

[32mINFO    [0m [34mSet logging level to info[0m
I0318 00:20:18.130093 139689252038464 logger.py:44] Set logging level to info


Next, we load in the toml file used for quantization. To view the configuration, click [here](../../machop/configs/tensorrt/jsc_toy_INT8_quantization_by_type.toml), or read the documentation on Mase [here]().

In [2]:
# Path to your TOML file
toml_file_path = '../../../machop/configs/tensorrt/jsc_toy_INT8_quantization_by_type.toml'

# Reading TOML file and converting it into a Python dictionary
with open(toml_file_path, 'r') as toml_file:
    pass_args = toml.load(toml_file)

# Extract the 'passes.tensorrt_quantize' section and its children
tensorrt_quantize_config = pass_args.get('passes', {}).get('tensorrt_quantize', {})
# Extract the 'passes.tensorrt_analysis' section and its children
tensorrt_analysis_config = pass_args.get('passes', {}).get('tensorrt_analysis', {})

We then create a `MaseGraph` by loading in a model and training it using the toml configuration model arguments.

In [3]:
# Load the basics in
model_name = pass_args['model']
dataset_name = pass_args['dataset']
max_epochs = pass_args['max_epochs']
batch_size = pass_args['batch_size']
learning_rate = pass_args['learning_rate']
accelerator = pass_args['accelerator']

data_module = MaseDataModule(
    name=dataset_name,
    batch_size=batch_size,
    model_name=model_name,
    num_workers=0,
)
data_module.prepare_data()
data_module.setup()

# Add the data_module and other necessary information to the configs
configs = [tensorrt_quantize_config, tensorrt_analysis_config]
for config in configs:
    config['batch_size'] = pass_args['batch_size']
    config['model'] = pass_args['model']
    config['data_module'] = data_module
    config['accelerator'] = 'cuda' if pass_args['accelerator'] == 'gpu' else pass_args['accelerator']
    if config['accelerator'] == 'gpu':
        os.environ['CUDA_MODULE_LOADING'] = 'LAZY'

model_info = get_model_info(model_name)
# quant_modules.initialize()
model = get_model(
    model_name,
    task="cls",
    dataset_info=data_module.dataset_info,
    pretrained=False)


input_generator = InputGenerator(
    data_module=data_module,
    model_info=model_info,
    task="cls",
    which_dataloader="train",
)

# generate the mase graph and initialize node metadata
mg = MaseGraph(model=model)

--2024-03-17 22:59:54--  https://cernbox.cern.ch/index.php/s/jvFd5MoWhGs1l5v/download
Resolving cernbox.cern.ch (cernbox.cern.ch)... 128.142.53.35, 128.142.170.17, 137.138.120.151, ...
Connecting to cernbox.cern.ch (cernbox.cern.ch)|128.142.53.35|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/jvFd5MoWhGs1l5v/download [following]
--2024-03-17 22:59:55--  https://cernbox.cern.ch/s/jvFd5MoWhGs1l5v/download
Reusing existing connection to cernbox.cern.ch:443.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [application/octet-stream]
Saving to: ‘/root/mase/machop/.machop_cache/dataset/jsc/processed-pythia82-lhc13-all-pt1-50k-r1_h022_e0175_t220_nonu_truth.z’

     0K .......... .......... .......... .......... ..........  185K
    50K .......... .......... .......... .......... ..........  370K
   100K .......... .......... .......... .......... .......... 51.5M
   150K .......... .......... .......... .......... ..........  37

FileNotFoundError: [Errno 2] Unable to synchronously open file (unable to open file: name = '/root/mase/machop/.machop_cache/dataset/jsc/processed-pythia82-lhc13-all-pt1-50k-r1_h022_e0175_t220_nonu_truth.z', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

Next, we train the `jsc-toy` model using the machop `train` action with the config from the toml file.

In [8]:
# !ch train --config ../../../machop/configs/tensorrt/jsc_toy_INT8_quantization_by_type.toml

Then we load in the checkpoint. You will have to adjust this according to where it has been stored in the mase_output directory.

In [10]:
# Load in the trained checkpoint - change this accordingly
JSC_CHECKPOINT_PATH = "../../../mase_output/jsc-toy_classification_jsc_2024-03-17/software/training_ckpts/best.ckpt"
model = load_model(load_name=JSC_CHECKPOINT_PATH, load_type="pl", model=model)

# Initiate metadata
dummy_in = next(iter(input_generator))
_ = model(**dummy_in)
mg, _ = init_metadata_analysis_pass(mg, None)
mg, _ = add_common_metadata_analysis_pass(mg, {"dummy_in": dummy_in})
mg, _ = add_software_metadata_analysis_pass(mg, None)
mg, _ = metadata_value_type_cast_transform_pass(mg, pass_args={"fn": to_numpy_if_tensor})

# Before we begin, we will copy the original MaseGraph model to use for comparison during quantization analysis
mg_original = deepcopy_mase_graph(mg)

[32mINFO    [0m [34mLoaded pytorch lightning checkpoint from ../../../mase_output/jsc-toy_classification_jsc_2024-03-17/software/training_ckpts/best.ckpt[0m
I0317 21:10:00.179528 139906558297920 checkpoint_load.py:85] Loaded pytorch lightning checkpoint from ../../../mase_output/jsc-toy_classification_jsc_2024-03-17/software/training_ckpts/best.ckpt


### Section 1.1 Fake Quantization

Firstly, we fake quantize the module in order to perform calibration and fine tuning before actually quantizing - this is only used if we have INT8 calibration as other precisions are not currently supported within [pytorch-quantization](https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/index.html#) library.

This is acheived through the `tensorrt_fake_quantize_transform_pass` which goes through the model, either by type or by name, replaces each layer appropriately to a fake quantized form if the `quantize` parameter is set in the default config (`passes.tensorrt_quantize.default.config`) or on a per name or type basis. I.e. if we would like to only quantize the linear layers but not the convolutional layers, we could set the `quantize` parameter to true for `passes.tensorrt_quantize.linear.config` and the default to false.

Currently the quantizable layers are:
- Linear
- Conv1d, Conv2d, ConvNd 
- ConvTranspose1d, ConvTranspose2d, ConvTransposeNd 
- MaxPool1d, MaxPool2d, MaxPool3d
- AvgPool1d, AvgPool2d, AvgPool3d
- Clip (Tensor)
- LSTM, LSTMCell

To create a custom quantized module, click [here](https://docs.nvidia.com/deeplearning/tensorrt/pytorch-quantization-toolkit/docs/index.html#document-tutorials/creating_custom_quantized_modules).


In [4]:
mg, _ = tensorrt_fake_quantize_transform_pass(mg, pass_args=tensorrt_quantize_config)
summarize_quantization_analysis_pass(mg_original, mg)

[32mINFO    [0m [34mApplying fake quantization to PyTorch model...[0m
I0317 15:03:21.413895 140341721720640 utils.py:166] Applying fake quantization to PyTorch model...


[32mINFO    [0m [34mFake quantization applied to PyTorch model.[0m
I0317 15:03:22.608005 140341721720640 utils.py:182] Fake quantization applied to PyTorch model.
[32mINFO    [0m [34mQuantized graph histogram:[0m
I0317 15:03:22.673146 140341721720640 summary.py:84] Quantized graph histogram:
[32mINFO    [0m [34m
| Original type   | OP           |   Total |   Changed |   Unchanged |
|-----------------+--------------+---------+-----------+-------------|
| BatchNorm1d     | batch_norm1d |       4 |         0 |           4 |
| Linear          | linear       |       3 |         3 |           0 |
| ReLU            | relu         |       4 |         0 |           4 |
| output          | output       |       1 |         0 |           1 |
| x               | placeholder  |       1 |         0 |           1 |[0m
I0317 15:03:22.679919 140341721720640 summary.py:85] 
| Original type   | OP           |   Total |   Changed |   Unchanged |
|-----------------+--------------+---------+-----

As you can see we have succesfully quantized all linear layers inside `jsc-toy`. See Section X for how to apply quantization layerwise.

### Section 1.2 Calibration

Next, we perform calibration using the `tensorrt_calibrate_transform_pass`. Calibration is achieved by passing data samples to the quantizer and deciding the best amax for activations. 

Calibrators can be added as a search space parameter to examine the best performing calibrator. The calibrators have been included in the toml as follows.
For example: `calibrators = ["percentile", "mse", "entropy"]`

Note: 
- To use `percentile` calibration, a list of percentiles must be given
- To use `max` calibration, the `histogram` weight and input calibrators must be removed and replaced with `max`. This will use global maximum absolute value to calibrate the model.
- If `post_calibration_analysis` is set true the `tensorrt_analysis_pass` will be run for each calibrator tested to evaluate the most suitable calibrator for the model.

In [5]:
mg, _ = tensorrt_calibrate_transform_pass(mg, pass_args=tensorrt_quantize_config)

[32mINFO    [0m [34mStarting calibration of the model in PyTorch...[0m
I0317 15:03:22.733889 140341721720640 calibrate.py:84] Starting calibration of the model in PyTorch...
[32mINFO    [0m [34mDisabling Quantization and Enabling Calibration[0m
I0317 15:03:22.754286 140341721720640 calibrate.py:93] Disabling Quantization and Enabling Calibration
[32mINFO    [0m [34mDisabling Quantization and Enabling Calibration[0m
I0317 15:03:22.757311 140341721720640 calibrate.py:93] Disabling Quantization and Enabling Calibration
[32mINFO    [0m [34mDisabling Quantization and Enabling Calibration[0m
I0317 15:03:22.759768 140341721720640 calibrate.py:93] Disabling Quantization and Enabling Calibration
[32mINFO    [0m [34mDisabling Quantization and Enabling Calibration[0m
I0317 15:03:22.762053 140341721720640 calibrate.py:93] Disabling Quantization and Enabling Calibration
[32mINFO    [0m [34mDisabling Quantization and Enabling Calibration[0m
I0317 15:03:22.764362 1403417217206

[32mINFO    [0m [34mEnabling Quantization and Disabling Calibration[0m
I0317 15:03:24.101662 140341721720640 calibrate.py:114] Enabling Quantization and Disabling Calibration
W0317 15:03:24.103986 140341721720640 tensor_quantizer.py:174] Disable HistogramCalibrator
[32mINFO    [0m [34mEnabling Quantization and Disabling Calibration[0m
I0317 15:03:24.104778 140341721720640 calibrate.py:114] Enabling Quantization and Disabling Calibration
W0317 15:03:24.106118 140341721720640 tensor_quantizer.py:174] Disable HistogramCalibrator
[32mINFO    [0m [34mEnabling Quantization and Disabling Calibration[0m
I0317 15:03:24.106809 140341721720640 calibrate.py:114] Enabling Quantization and Disabling Calibration
W0317 15:03:24.108095 140341721720640 tensor_quantizer.py:174] Disable HistogramCalibrator
[32mINFO    [0m [34mEnabling Quantization and Disabling Calibration[0m
I0317 15:03:24.108847 140341721720640 calibrate.py:114] Enabling Quantization and Disabling Calibration
W0317 15:03

From the results, the 99% `percentile` clips too many values during the amax calibration, comprimising the loss. However 99.99% demonstrates higher validation accuracy alongside `mse` and `entropy` for `jsc-toy`. For such a small model, the methods are not highly distinguished, however for larger models this calibration process will be important for ensuring the quantized model still performs well. 

### Section 1.3 Quantized Aware Training (QAT)

The `tensorrt_fine_tune_transform_pass` is used to fine tune the quantized model. 

For QAT it is typical to employ 10% of the original training epochs, starting at 1% of the initial training learning rate, and a cosine annealing learning rate schedule that follows the decreasing half of a cosine period, down to 1% of the initial fine tuning learning rate (0.01% of the initial training learning rate). However this default can be overidden by setting the `epochs`, `initial_learning_rate` and `final_learning_rate` in `passes.tensorrt_quantize.fine_tune`.

The fine tuned checkpoints are stored in the ckpts/fine_tuning folder:

```
mase_output
└── tensorrt
    └── quantization
        ├── cache
        ├── ckpts
        │   └── fine_tuning
        ├── json
        ├── onnx
        └── trt

```

In [11]:
mg, _ = tensorrt_fine_tune_transform_pass(mg, pass_args=tensorrt_quantize_config)

I0317 21:12:15.631553 139906558297920 rank_zero.py:64] GPU available: True (cuda), used: True
I0317 21:12:15.654162 139906558297920 rank_zero.py:64] TPU available: False, using: 0 TPU cores
I0317 21:12:15.654832 139906558297920 rank_zero.py:64] IPU available: False, using: 0 IPUs
I0317 21:12:15.655537 139906558297920 rank_zero.py:64] HPU available: False, using: 0 HPUs
I0317 21:12:15.663746 139906558297920 rank_zero.py:64] You are using a CUDA device ('NVIDIA RTX A2000 12GB') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
W0317 21:12:15.664477 139906558297920 tensorboard.py:248] Missing logger folder: /root/mase/docs/tutorials/tensorrt/lightning_logs
I0317 21:12:19.127117 139906558297920 cuda.py:61] LOCAL_RANK: 0 - CUDA_VISIBLE_D

Sanity Checking DataLoader 0:   0%|          | 0/2 [00:00<?, ?it/s]

/root/anaconda3/envs/mase/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'val_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=27` in the `DataLoader` to improve performance.


                                                                           

/root/anaconda3/envs/mase/lib/python3.11/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:441: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` to `num_workers=27` in the `DataLoader` to improve performance.


Epoch 1: 100%|██████████| 3084/3084 [00:49<00:00, 61.96it/s, v_num=0, train_acc_step=0.786, val_acc_epoch=0.734, val_loss_epoch=0.746] 

I0317 21:13:58.556372 139906558297920 rank_zero.py:64] `Trainer.fit` stopped: `max_epochs=2` reached.


Epoch 1: 100%|██████████| 3084/3084 [00:49<00:00, 61.94it/s, v_num=0, train_acc_step=0.786, val_acc_epoch=0.734, val_loss_epoch=0.746]


### Section 1.4 TensorRT Quantization

After QAT, we are now ready to convert the model to a tensorRT engine so that it can be run with the superior inference speeds. To do so, we use the `tensorrt_engine_interface_pass` which converts the `MaseGraph`'s model from a Pytorch one to an ONNX format as an intermediate stage of the conversion.

During the conversion process, the `.onnx` and `.trt` files are stored to their respective folders shown in [Section 1.3](#section-13-quantized-aware-training-qat).

This interface pass returns a dictionary containing the `onnx_path` and `trt_engine_path`.

In [22]:
mg, meta = tensorrt_engine_interface_pass(mg, pass_args=tensorrt_quantize_config)

[32mINFO    [0m [34mConverting PyTorch model to ONNX...[0m
I0317 21:31:46.986169 139906558297920 quantize.py:129] Converting PyTorch model to ONNX...
[32mINFO    [0m [34mONNX Conversion Complete. Stored ONNX model to /root/mase/mase_output/tensorrt/quantization/onnx/2024_03_17/version_5/model.onnx[0m
I0317 21:31:47.122666 139906558297920 quantize.py:152] ONNX Conversion Complete. Stored ONNX model to /root/mase/mase_output/tensorrt/quantization/onnx/2024_03_17/version_5/model.onnx
[32mINFO    [0m [34mConverting PyTorch model to TensorRT...[0m
I0317 21:31:47.124434 139906558297920 quantize.py:55] Converting PyTorch model to TensorRT...
[32mINFO    [0m [34mTensorRT Conversion Complete. Stored trt model to /root/mase/mase_output/tensorrt/quantization/trt/2024_03_17/version_5/model.trt[0m
I0317 21:31:56.180571 139906558297920 quantize.py:124] TensorRT Conversion Complete. Stored trt model to /root/mase/mase_output/tensorrt/quantization/trt/2024_03_17/version_5/model.trt
[3

### Section 1.5 Performance Analysis

To showcase the improved inference speeds and to evaluate accuracy and other performance metrics, the `tensorrt_analysis_pass` can be used.

The tensorRT engine path obtained the previous interface pass is now inputted into the the analysis pass. The same pass can take a MaseGraph as an input, as well as an ONNX graph. For this comparison, we will first run the anaylsis pass on the original unquantized model and then on the INT8 quantized model.

In [14]:
_, _ = tensorrt_analysis_pass(mg_original, pass_args=tensorrt_analysis_config)
_, _ = tensorrt_analysis_pass(meta['trt_engine_path'], pass_args=tensorrt_analysis_config)

[32mINFO    [0m [34mStarting transformation analysis[0m
I0317 21:15:14.776531 139906558297920 analysis.py:214] Starting transformation analysis
[32mINFO    [0m [34m
Results jsc-toy:
+------------------------------+---------------+
|            Metric            |     Value     |
+------------------------------+---------------+
|    Average Test Accuracy     |    0.73168    |
|      Average Precision       |    0.74472    |
|        Average Recall        |    0.73037    |
|       Average F1 Score       |    0.73369    |
|         Average Loss         |    0.76315    |
|       Average Latency        |  0.77245 ms   |
|   Average GPU Power Usage    |   20.072 W    |
| Inference Energy Consumption | 0.0043069 mWh |
+------------------------------+---------------+[0m
I0317 21:15:17.371860 139906558297920 analysis.py:317] 
Results jsc-toy:
+------------------------------+---------------+
|            Metric            |     Value     |
+------------------------------+---------------+

As shown above, the latency has decreased around 4x with the `jsc-toy` model without compromising accuracy due to the well calibrated amax and quantization-aware fine tuning. The inference energy consumption has thus also dropped tremendously and this is an excellent demonstration for the need to quantize in industry especially for LLMs in order to reduce energy usage. 

## Section 2. FP16 Quantization

We will now load in a new toml configuration that uses FP16 instead of INT8, whilst keeping the other settings the exact same for a fair comparison. This time however, we will use chop from the terminal which runs all the passes showcased in [Section 1](#section-1---int8-quantization).

Since float quantization does not require calibration, nor is it supported by `pytorch-quantization`, the model will not undergo fake quantization, unfortunately, for the time being this means QAT is unavailable.

In [6]:
!ch transform --config ../../../machop/configs/tensorrt/jsc_toy_FP16_quantization_by_type.toml --load {CHECKPOINT_PATH} --load-type pl

[2024-03-17 21:35:57,836] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO: Seed set to 0
I0317 21:35:59.862919 139643205699392 seed.py:54] Seed set to 0
+-------------------------+------------------------+--------------+--------------------------+--------------------------+
| Name                    |        Default         | Config. File |     Manual Override      |        Effective         |
+-------------------------+------------------------+--------------+--------------------------+--------------------------+
| task                    |     classification     |              |                          |      classification      |
| load_name               |          [38;5;8mNone[0m          |              | /root/mase/mase_output/j | /root/mase/mase_output/j |
|                         |                        |              | sc-toy_classification_js | sc-toy_classification_js |
|                         |                        

As you can see, `FP16` acheives a slighty higher test accuracy and a slightly lower latency (~10%) from that of INT8 quantization as expected. Now lets try a more complicated model - `vgg7`.

## Section 3. Larger Model Quantization
We will now quantize `vgg7` which includes both convolutional and linear layers. You may either download a pretrained model [here](https://imperiallondon-my.sharepoint.com/:f:/g/personal/zz7522_ic_ac_uk/Emh3VT7Q_qRFmnp8kDrcgDoBwGUuzLwwKNtX8ZAt368jJQ?e=gsKONa), otherwise train it yourself as shown below. 

In [None]:
!ch train --config ../../../machop/configs/tensorrt/vgg7_FP16_quantization_by_type.toml

We will now load the checkpoint in and quantize the model and compare it to the unquantized version as we did in [Section 1.5](#section-15-performance-analysis)

In [5]:
VGG_CHECKPOINT_PATH = "../../../mase_output/vgg7-pre-trained/test-accu-0.9332.ckpt"

In [6]:
!ch transform --config ../../../machop/configs/tensorrt/vgg7_INT8_quantization_by_type.toml --load {VGG_CHECKPOINT_PATH} --load-type pl

[2024-03-17 22:32:56,319] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO: Seed set to 0
I0317 22:33:01.059036 140316672829248 seed.py:54] Seed set to 0
+-------------------------+------------------------+--------------+--------------------------+--------------------------+
| Name                    |        Default         | Config. File |     Manual Override      |        Effective         |
+-------------------------+------------------------+--------------+--------------------------+--------------------------+
| task                    |     classification     |              |                          |      classification      |
| load_name               |          [38;5;8mNone[0m          |              | /root/mase/mase_output/v | /root/mase/mase_output/v |
|                         |                        |              |  gg7-pre-trained/test-   |  gg7-pre-trained/test-   |
|                         |                        

# Section 4. Layerwise Mixed Precision

So far we have strictly quantized either in INT8 or FP16. Now, we will show how to conduct layerwise mixed precision using the `OPT-125M`. In this case we will show how for instance, layer 0 and 1 can be set to FP16, while layer 2 and 3 can be INT8 quantized. 

In [4]:
# Path to your TOML file
toml_file_path = '../../../machop/configs/tensorrt/opt-125M_layerwise_mixed_precision_by_name.toml'

# Reading TOML file and converting it into a Python dictionary
with open(toml_file_path, 'r') as toml_file:
    pass_args = toml.load(toml_file)

# Extract the 'passes.tensorrt_quantize' section and its children
tensorrt_quantize_config = pass_args.get('passes', {}).get('tensorrt_quantize', {})
# Extract the 'passes.tensorrt_analysis' section and its children
tensorrt_analysis_config = pass_args.get('passes', {}).get('tensorrt_analysis', {})

# Load the basics in
model_name = pass_args['model']
dataset_name = pass_args['dataset']
max_epochs = pass_args['max_epochs']
batch_size = pass_args['batch_size']
learning_rate = pass_args['learning_rate']
accelerator = pass_args['accelerator']

opt_tokenizer = get_tokenizer("facebook/opt-125m:patched")

data_module = MaseDataModule(
    name=dataset_name,
    batch_size=batch_size,
    model_name=model_name,
    num_workers=os.cpu_count(),
    max_token_len=128,
    tokenizer=opt_tokenizer,
    load_from_cache_file=True,
)
data_module.prepare_data()
data_module.setup()

# Add the data_module and other necessary information to the configs
configs = [tensorrt_quantize_config, tensorrt_analysis_config]
for config in configs:
    config['batch_size'] = pass_args['batch_size']
    config['model'] = pass_args['model']
    config['data_module'] = data_module
    config['accelerator'] = 'cuda' if pass_args['accelerator'] == 'gpu' else pass_args['accelerator']
    if config['accelerator'] == 'gpu':
        os.environ['CUDA_MODULE_LOADING'] = 'LAZY'

model_info = get_model_info(model_name)
model = get_model(
    "facebook/opt-125m:patched",
    task="lm",
    dataset_info=get_dataset_info("wikitext2"),
    pretrained=True,
)

# Load in the trained checkpoint - change this accordingly
# OPT125M_CHECKPOINT_PATH = "../../../mase_output/jsc-toy_classification_jsc_2024-03-17/software/training_ckpts/best.ckpt"
# model = load_model(load_name=OPT125M_CHECKPOINT_PATH, load_type="pl", model=model)

model_info = get_model_info("facebook/opt-125m:patched")
cf_args = get_cf_args(model_info=model_info, task="lm", model=model)

mg = MaseGraph(model=model, cf_args=cf_args)

# dummy_in = get_dummy_input(model_info, data_module=data_module, task="lm")
# if len(mg.model.additional_inputs) > 0:
#     dummy_in = dummy_in | mg.model.additional_inputs

# Initiate metadata
mg, _ = init_metadata_analysis_pass(mg, pass_args=None)

# # Before we begin, we will copy the original MaseGraph model to use for comparison during quantization analysis
# mg_original = deepcopy_mase_graph(mg)

In [5]:
mg, _ = tensorrt_fake_quantize_transform_pass(mg, pass_args=tensorrt_quantize_config)
summarize_quantization_analysis_pass(mg_original, mg)

mg, _ = tensorrt_calibrate_transform_pass(mg, pass_args=tensorrt_quantize_config)

mg, _ = tensorrt_fine_tune_transform_pass(mg, pass_args=tensorrt_quantize_config)

mg, meta = tensorrt_engine_interface_pass(mg, pass_args=tensorrt_quantize_config)

_, _ = tensorrt_analysis_pass(mg_original, pass_args=tensorrt_analysis_config)
_, _ = tensorrt_analysis_pass(meta['trt_engine_path'], pass_args=tensorrt_analysis_config)

[32mINFO    [0m [34mApplying fake quantization to PyTorch model...[0m
I0318 00:21:33.458112 139689252038464 utils.py:166] Applying fake quantization to PyTorch model...
W0318 00:24:28.679456 139689252038464 utils.py:169] INT8 precision not found in config. Skipping fake quantization.


NameError: name 'mg_original' is not defined