# Mixed-Precision Compression Post-Training Quantization + Conversion to IMX500 of a ShuffleNetV2 PyTorch Model

## Overview
This tutorial demonstrates how to use the [**Model Compression Toolkit (MCT)**](https://github.com/sony/model_optimization) for model quantization and compression, specifically mixed-precision post-training quantization (PTQ) of a pretrained PyTorch model. After the model is quantized to a target size, it is converted to a binary format that can be loaded onto the IMX500 using the [**IMX500-converter**](https://developer.aitrios.sony-semicon.com/en/raspberrypi-ai-camera/documentation/imx500-converter?version=3.14.3&progLang=).

This example is not intended to demonstrate evaluating MCT PTQ performance and as such intentionally uses generated random data to speed up the process.
 
For the full tutorial on MCT's mixed-precision PTQ, see the [*MCT Mixed-Precision PTQ PyTorch Tutorial*](https://github.com/sony/model_optimization/blob/main/tutorials/notebooks/mct_features_notebooks/pytorch/example_pytorch_mixed_precision_ptq.ipynb).

For tutorials on other quantization features of MCT see [*MCT Features Tutorials*](https://github.com/sony/model_optimization/blob/main/tutorials/notebooks/mct_features_notebooks/README.md)

## Summary
In this tutorial we cover the following steps:

1. Post-Training Quantization using MCT's mixed-precision API:
   1. Use MCT to estimate the quantized model size with MCT's PTQ (i.e. quantize all weights and activations to 8 bits).
   2. Set quantized model target size (AKA resource utilization constraints).
   3. Quantize the model with MCT.
2. Converting the model to an IMX500-suitable representation using the IMX500-Converter.

## Setup
Install the relevant packages:

In [None]:
from importlib import util

if not util.find_spec('edge_mdt') or not util.find_spec("uni.pytorch"):
    print("Installing edge-mdt")
    !pip install edge-mdt[pt]

if not util.find_spec('torch') or not util.find_spec("torchvision"):
    !pip install -q torch torchvision

!pip install -q onnx

Load a pre-trained ShuffleNetV2 model from torchvision in 32-bit floating-point precision.

In [None]:
import torch
from torchvision.models import shufflenet_v2_x2_0, ShuffleNet_V2_X2_0_Weights

float_model = shufflenet_v2_x2_0(weights=ShuffleNet_V2_X2_0_Weights.IMAGENET1K_V1)

## Representative Dataset
We're all set to use MCT's post-training quantization. To begin, we define a representative dataset generator, which MCT uses to calibrate the quantization parameters. Please note that for demonstration purposes, we generate random data of the desired image shape instead of using real images. We will then apply PTQ to our model using this dataset generator. For more details on using MCT, refer to the MCT tutorials.

In [None]:
from typing import Iterator, List
 
NUM_ITERS = 20
BATCH_SIZE = 32
def get_representative_dataset(n_iter: int):
    """
    This function creates a representative dataset generator. The generator yields lists
        holding one torch tensor batch of shape: [Batch, C, H, W].
    Args:
        n_iter: number of iterations for MCT to calibrate on
    Returns:
        A representative dataset generator
    """
    def representative_dataset() -> Iterator[List]:
        for _ in range(n_iter):
            yield [torch.rand(BATCH_SIZE, 3, 224, 224)]
    return representative_dataset
representative_data_generator = get_representative_dataset(n_iter=NUM_ITERS)
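
When real images are available, the same generator contract (a callable that returns an iterator yielding a list of input batches) can wrap an existing collection of batches. The sketch below is a minimal, framework-free illustration of that pattern; `wrap_as_representative_dataset` and the placeholder batches are hypothetical names introduced here, not part of MCT.

```python
from itertools import islice
from typing import Any, Callable, Iterable, Iterator, List


def wrap_as_representative_dataset(
    batches: Iterable[Any], n_iter: int
) -> Callable[[], Iterator[List[Any]]]:
    """Wrap an iterable of batches so that at most n_iter batches are yielded,
    each as a single-element list (one entry per model input)."""
    def representative_dataset() -> Iterator[List[Any]]:
        for batch in islice(batches, n_iter):
            yield [batch]
    return representative_dataset


# Placeholder strings stand in for image tensors of shape [Batch, C, H, W].
placeholder_batches = ["batch0", "batch1", "batch2"]
dataset_fn = wrap_as_representative_dataset(placeholder_batches, n_iter=2)

first_pass = list(dataset_fn())
second_pass = list(dataset_fn())  # a list can be iterated again from the start
print(first_pass)
```

Passing a re-iterable collection (such as a list or a data loader) rather than a one-shot iterator matters here, because the callable may be invoked more than once.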

## Target Platform Capabilities (TPC)
In addition, MCT optimizes the model for dedicated hardware platforms. This is done using a TPC (for more details, please visit our [documentation](https://github.com/SonySemiconductorSolutions/aitrios-edge-mdt-tpc)). Specifically for this tutorial, the TPCv4 is used because it enables multiple bit-widths for weights (2, 4 & 8) and activations (8 & 16). The default bit-width for both is 8 bits.

In [None]:
from edgemdt_tpc import get_target_platform_capabilities

# Get a TPC object representing the imx500 hardware and use it for PyTorch model quantization in MCT.
# Note we're using version 4.0, that supports weights & activation mixed precision required for compressing the model.
tpc = get_target_platform_capabilities(tpc_version='4.0', device_type='imx500')

## Mixed-Precision Post-Training Quantization using MCT
Now for the exciting part! Let’s use MCT's mixed-precision API to run PTQ on the model and compress it for the IMX500.

**First step**: let MCT estimate the memory requirements of the model. These requirements include weights memory, activation memory and BOPs (bit-operations). The IMX500 has a total memory of 8 MB that must hold both the weights and the activations, so we want to make sure this model fits. We obtain the following values with this [API](https://github.com/sony/model_optimization/blob/main/model_compression_toolkit/core/pytorch/resource_utilization_data_facade.py):

 - **Weights memory** is the static memory of the model weights.
 - **Activation memory** is the dynamic memory required by the model's activations during inference. Each step of inference of the model requires a different size of memory, depending on the current operation's input and output sizes. The *activation memory* calculated by MCT is the **maximum** value of all operations activation memory.
 - **Total memory** is the sum of the *weights memory* and *activation memory*. 
 - **BOPs** (bit operations) is an estimate of the compute required for a single image inference, derived from the multiply-accumulate operations and the bit-widths of their operands. This is a common metric for estimating latency and power requirements.

Note<sup>1</sup> that MCT performs this estimation on the original model, and the numbers represent the **size** of the weights and the **size** of the activations in the model, independent of the data type in which they are stored (e.g. float32, which takes 4 bytes per value).

Note<sup>2</sup> that this is only an estimation and the actual memory required by the IMX500 is higher.
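
To make these definitions concrete, here is a minimal sketch in plain Python. It does not call MCT: `ToyOp` and all element counts are hypothetical, the per-operation activation footprint is simplified to input size plus output size, and BOPs are assumed to be the MAC count multiplied by the weight and activation bit-widths (a common definition that may differ in detail from MCT's internal estimate).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyOp:
    """Hypothetical per-operation summary; not an MCT type."""
    name: str
    weight_elems: int  # number of weight parameters
    in_elems: int      # number of input activation elements
    out_elems: int     # number of output activation elements
    macs: int          # multiply-accumulate operations


def weights_memory(ops: List[ToyOp], bits: int = 8) -> int:
    """Static memory: the sum of all weights, in bytes."""
    return sum(op.weight_elems for op in ops) * bits // 8


def activation_memory(ops: List[ToyOp], bits: int = 8) -> int:
    """Dynamic memory: the largest per-operation footprint, in bytes."""
    return max(op.in_elems + op.out_elems for op in ops) * bits // 8


def bit_operations(ops: List[ToyOp], weight_bits: int = 8, act_bits: int = 8) -> int:
    """Assumed BOPs definition: MACs scaled by both operand bit-widths."""
    return sum(op.macs for op in ops) * weight_bits * act_bits


toy_model = [
    ToyOp("conv1", weight_elems=1_000, in_elems=4_000, out_elems=8_000, macs=100_000),
    ToyOp("conv2", weight_elems=5_000, in_elems=8_000, out_elems=2_000, macs=200_000),
    ToyOp("fc", weight_elems=20_000, in_elems=2_000, out_elems=100, macs=20_000),
]

w_bytes = weights_memory(toy_model)
a_bytes = activation_memory(toy_model)
total_bytes = w_bytes + a_bytes
toy_bops = bit_operations(toy_model)
print(w_bytes, a_bytes, total_bytes, toy_bops)
```

In this toy model the activation memory is set by `conv1`, the operation with the largest combined input and output, even though `fc` holds most of the weights.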

In [None]:
import model_compression_toolkit as mct

ru_data = mct.core.pytorch_resource_utilization_data(float_model, representative_data_generator,
                                                     target_platform_capabilities=tpc)

print('Model utilization estimation:')
print(f' Weights:\t\t{int(ru_data.weights_memory)}')
print(f' Activations:\t{int(ru_data.activation_memory)}')

# Assuming MCT quantizes all weights and activations to 8 bits (which is the TPCv4 default), the total memory is weights_memory+activation_memory bytes.
print(f' Total memory :\t{int(ru_data.weights_memory + ru_data.activation_memory)} bytes')

The total memory requirement is 7.86 MB, which is rather close to the maximum memory of the IMX500, so we might need to compress the model to less than 8 bits per weight. The IMX500 supports various bit-widths for weights (e.g. 2, 4 & 8 in TPCv4), and MCT performs an optimization process (known as mixed precision) that selects a bit-width per operation so that the total weights memory is reduced to a predefined value.
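
One way to pick that predefined value is to work backwards from a total memory budget. The sketch below is plain arithmetic with hypothetical byte counts (they are not the ShuffleNetV2 measurements), and `weights_target_fraction` is a helper name introduced here for illustration. It assumes the activation memory stays at its 8-bit estimate.

```python
def weights_target_fraction(
    weights_bytes: int, activation_bytes: int, total_budget_bytes: int
) -> float:
    """Fraction of the 8-bit weights memory that fits the total budget,
    assuming activation memory is left unchanged."""
    available = total_budget_bytes - activation_bytes
    if available <= 0:
        raise ValueError("Activations alone exceed the total budget")
    return min(1.0, available / weights_bytes)


# Hypothetical estimates, in bytes.
example_weights = 7_000_000
example_activations = 1_000_000
example_budget = 6_600_000

fraction = weights_target_fraction(example_weights, example_activations, example_budget)
print(f"Target weights memory: {fraction:.0%} of the 8-bit estimate")
```

With weight bit-widths limited to 2, 4 and 8, a fraction below 0.25 cannot be reached by lowering weight bit-widths alone, since 2 bits is a quarter of 8 bits.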

Triggering the mixed-precision in MCT requires setting a `ResourceUtilization` object with the required weights memory. For example, compressing the weights memory to 80% is done like this:

In [None]:
ru = mct.core.ResourceUtilization(weights_memory=ru_data.weights_memory*0.80)

quantized_model, quantization_info = mct.ptq.pytorch_post_training_quantization(
        in_module=float_model,
        representative_data_gen=representative_data_generator,
        target_resource_utilization=ru,
        target_platform_capabilities=tpc
)

Note: The model weights are compressed to 80% of the size of a quantized model with all 8-bit weights, which is 20% of the size of the original model, whose weights are float32.
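
The relation between the two percentages follows directly from the bit-widths. A minimal sketch with hypothetical layer sizes and a hand-picked bit-width assignment (MCT's actual per-layer choice comes from its own optimization and will differ):

```python
from typing import Sequence


def weights_bytes(elems: Sequence[int], bits: Sequence[int]) -> float:
    """Total weights memory in bytes for per-layer element counts and bit-widths."""
    return sum(n * b for n, b in zip(elems, bits)) / 8


# Hypothetical model with three layers.
layer_elems = [2_000, 4_000, 4_000]
mixed_bits = [8, 4, 8]  # one layer lowered from 8 to 4 bits

float32_bytes = weights_bytes(layer_elems, [32, 32, 32])
all_8bit_bytes = weights_bytes(layer_elems, [8, 8, 8])
mixed_bytes = weights_bytes(layer_elems, mixed_bits)

ratio_vs_8bit = mixed_bytes / all_8bit_bytes
ratio_vs_float32 = mixed_bytes / float32_bytes
print(ratio_vs_8bit, ratio_vs_float32)
```

Because 8-bit weights are a quarter of the float32 size, 80% of the 8-bit size equals 0.8 × 0.25 = 20% of the float32 size.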

The accuracy of the compressed quantized model may degrade compared to a non-compressed quantized model. To improve accuracy, MCT can exploit the fact that the activation memory requirement is the maximum of the per-operation activation memory during inference: it can raise the activation bit-width of certain operations from the default 8 bits to 16 bits, as long as that increase doesn't change the maximum activation size. Triggering activation memory optimization in MCT requires either adding `activation_memory` or specifying `total_memory` in the `ResourceUtilization` object.
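
The idea can be illustrated with a simplified sketch that treats each operation in isolation. The function name and the element counts are hypothetical, and MCT's real search considers more than this single rule.

```python
from typing import List


def promotable_to_16_bits(act_elems: List[int]) -> List[int]:
    """Indices of operations whose activations could move from 8 to 16 bits
    without raising the peak activation memory of the all-8-bit model."""
    peak_bytes = max(act_elems)  # at 8 bits, one element takes one byte
    # At 16 bits an operation needs two bytes per element.
    return [i for i, n in enumerate(act_elems) if 2 * n <= peak_bytes]


# Hypothetical per-operation activation sizes, in elements.
example_act_elems = [12_000, 10_000, 2_100, 6_000]
candidates = promotable_to_16_bits(example_act_elems)
print(candidates)
```

Here the two smallest operations can use 16 bits because their doubled footprint still fits under the 12,000-byte peak, while the two largest cannot.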

In the example below, we ask MCT to compress the weights to 80% and keep the activation memory the same, while allowing it to set some activation bit-widths to 16 bits.

In [None]:
# Keep the activation memory at its estimated value; only the weights are compressed.
ru = mct.core.ResourceUtilization(weights_memory=ru_data.weights_memory*0.80,
                                  activation_memory=ru_data.activation_memory)

quantized_model, quantization_info = mct.ptq.pytorch_post_training_quantization(
        in_module=float_model,
        representative_data_gen=representative_data_generator,
        target_resource_utilization=ru,
        target_platform_capabilities=tpc
)

In the next example, we ask MCT to compress the total memory to 80%, again allowing it to set some activation bit-widths to 16 bits.

In [None]:
ru = mct.core.ResourceUtilization(total_memory=ru_data.total_memory*0.80)

quantized_model, quantization_info = mct.ptq.pytorch_post_training_quantization(
        in_module=float_model,
        representative_data_gen=representative_data_generator,
        target_resource_utilization=ru,
        target_platform_capabilities=tpc
)

Our model is now quantized. MCT has created a simulated quantized model within the original PyTorch framework by inserting [quantization representation modules](https://github.com/sony/mct_quantizers). These modules, such as `PytorchQuantizationWrapper` and `PytorchActivationQuantizationHolder`, wrap PyTorch layers to simulate the quantization of weights and activations, respectively. While the size of the saved model remains unchanged, all the quantization parameters are stored within these modules and are ready for deployment on the target hardware. Quantizing every weight from 32 bits to the default 8 bits would by itself give a 4x compression ratio; the mixed-precision constraint used in this example reduces the memory footprint further, to 80% of the all-8-bit estimate.

## Model Conversion

### Exporting to ONNX serialization 
To convert our model to a binary that can be loaded onto the IMX500, we first need to serialize it to ONNX format. Please ensure that `save_model_path` has been set correctly.

In [None]:
import os
import model_compression_toolkit as mct
save_folder = './shufflenet_pt'
os.makedirs(save_folder, exist_ok=True)
onnx_path = os.path.join(save_folder, 'qmodel.onnx')
mct.exporter.pytorch_export_model(quantized_model, save_model_path=onnx_path, repr_dataset=representative_data_generator)

Before converting the model, make sure Java 17 or later is installed. On Colab, you can install this distribution:

In [None]:
!sudo apt install -y openjdk-17-jre

### Running the IMX500 Converter
Now we can convert the model to create the PackerOut, which can be loaded onto the IMX500.

In [None]:
!imxconv-pt -i {onnx_path} -o {save_folder} --overwrite-output
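
Outside a notebook, the same command can be assembled in Python before it is run. The sketch below only builds the argument list and checks that the required executables are on the `PATH`; it does not invoke the converter. The helper names are hypothetical, the flags are the ones used in the command above, and `java` as the executable name is an assumption.

```python
import shutil
from typing import List, Sequence


def build_imxconv_command(onnx_path: str, output_dir: str) -> List[str]:
    """Argument list mirroring the notebook command shown above."""
    return ["imxconv-pt", "-i", onnx_path, "-o", output_dir, "--overwrite-output"]


def missing_tools(tools: Sequence[str] = ("java", "imxconv-pt")) -> List[str]:
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


# Example paths; replace them with the ONNX file and folder used earlier.
command = build_imxconv_command("qmodel.onnx", "converter_output")
print(" ".join(command))
print("Missing tools:", missing_tools())
```

The resulting list can be passed to `subprocess.run(command, check=True)` once `missing_tools()` returns an empty list.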

## Copyrights

Copyright 2025 Sony Semiconductor Israel, Inc. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
