# Mixed-Precision Compression Post-Training Quantization + Conversion to IMX500 of a ShuffleNetV2 PyTorch Model

## Overview
This tutorial demonstrates how to use the [**Model Compression Toolkit (MCT)**](https://github.com/sony/model_optimization) to quantize and compress a model under hardware constraints (e.g. memory). This is done using MCT's mixed-precision API, which assigns different bit-widths to different weights and activations in the quantized model. After the model is quantized to a specific size, it is converted to a binary format suitable for loading onto the IMX500 using the [**IMX500-converter**](https://developer.aitrios.sony-semicon.com/en/raspberrypi-ai-camera/documentation/imx500-converter?version=3.14.3&progLang=).

This example is not intended to demonstrate evaluating MCT PTQ performance and as such intentionally uses generated random data to speed up the process.
 
For the full tutorial on MCT's mixed-precision PTQ, see [*MCT Mixed-Precision PTQ PyTorch Tutorial*](https://github.com/sony/model_optimization/blob/main/tutorials/notebooks/mct_features_notebooks/pytorch/example_pytorch_mixed_precision_ptq.ipynb)

For tutorials on other quantization features of MCT see [*MCT Features Tutorials*](https://github.com/sony/model_optimization/blob/main/tutorials/notebooks/mct_features_notebooks/README.md)

## Summary
In this tutorial we cover the following steps:

1. Post-Training Quantization using MCT's mixed-precision API:
   1. Use MCT to estimate the quantized model size when quantized to default single bit-width precision.
   2. Set quantized model target size (a.k.a. Target Resource Utilization).
   3. Quantize the model with MCT.
2. Converting the model to an IMX500-suitable representation using the IMX500-Converter.

## Setup
Install the relevant packages:

In [None]:
from importlib import util, metadata
import os

from packaging.version import Version

if not util.find_spec('edge_mdt') or not util.find_spec("uni.pytorch"):
    print("Installing edge-mdt")
    !pip install edge-mdt[pt]

if not util.find_spec('torch') or not util.find_spec("torchvision"):
    !pip install -q torch torchvision

# Compare parsed versions; plain string comparison orders versions lexicographically.
if not util.find_spec("edgemdt_tpc") or Version(metadata.version("edge-mdt-tpc")) < Version('1.1.0'):
    raise Exception("Need edge-mdt-tpc>=1.1.0 for TPCv4")
    
!pip install -q onnx

Install Java 17 or later so we can run the imx500-converter:

In [None]:
!sudo apt install -y openjdk-17-jre

Load a pre-trained **ShuffleNetV2** model from torchvision, in 32-bit floating-point precision.

In [None]:
import torch
from torchvision.models import shufflenet_v2_x2_0, ShuffleNet_V2_X2_0_Weights

float_model = shufflenet_v2_x2_0(weights=ShuffleNet_V2_X2_0_Weights.IMAGENET1K_V1)

## Representative Dataset
We're all set to use MCT's post-training quantization. To begin, we'll define a representative dataset generator. Note that, for demonstration purposes, we generate random data of the desired image shape instead of using real images. We will then apply PTQ to our model using this dataset generator. For more details on using MCT, refer to the MCT tutorials.

In [None]:
from typing import Iterator, List
 
NUM_ITERS = 3
BATCH_SIZE = 2
def get_representative_dataset(n_iter: int):
    """
    This function creates a representative dataset generator. The generator yields lists
        containing a single torch tensor batch of shape: [Batch, C, H, W].
    Args:
        n_iter: number of iterations for MCT to calibrate on
    Returns:
        A representative dataset generator
    """
    def representative_dataset() -> Iterator[List]:
        for _ in range(n_iter):
            yield [torch.rand(BATCH_SIZE, 3, 224, 224)]
    return representative_dataset
representative_data_generator = get_representative_dataset(n_iter=NUM_ITERS)
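
The factory above returns a function rather than a generator object, and that function is what gets passed to MCT. One likely reason is that each call then produces a fresh iterator, so the data can be traversed more than once. The following is a minimal, stdlib-only sketch of that behavior; `make_dataset` and `fake_batch` are illustrative names, and plain Python lists stand in for tensors.

```python
import random
from typing import Callable, Iterator, List


def fake_batch(batch_size: int, n_values: int) -> List[List[float]]:
    """Stand-in for torch.rand: a batch of random values in [0, 1)."""
    return [[random.random() for _ in range(n_values)] for _ in range(batch_size)]


def make_dataset(n_iter: int, batch_size: int = 2) -> Callable[[], Iterator[list]]:
    """Return a zero-argument callable that creates a new generator on each call."""
    def dataset() -> Iterator[list]:
        for _ in range(n_iter):
            # One list entry per model input; here there is a single input.
            yield [fake_batch(batch_size, n_values=4)]
    return dataset


dataset_fn = make_dataset(n_iter=3)
first_pass = list(dataset_fn())
second_pass = list(dataset_fn())  # a fresh generator, not an exhausted one
print(len(first_pass), len(second_pass))
```

Had the factory returned a generator object directly, the second pass would have been empty.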

## Target Platform Capabilities (TPC)
MCT optimizes the model for dedicated hardware platforms. This is done using a TPC (for more details, please visit our [documentation](https://github.com/SonySemiconductorSolutions/aitrios-edge-mdt-tpc)). Specifically for this tutorial, the IMX500 TPCv4 is used because it enables multiple bit-widths for weights (2, 4 & 8) and activations (8 & 16). **The default bit-width for single-precision is 8 bits for both weights and activations in all operations**.

In [None]:
from edgemdt_tpc import get_target_platform_capabilities

# Get a TPC object representing the imx500 hardware and use it for PyTorch model quantization in MCT.
# Note we're using version 4.0, that supports weights & activation mixed precision required for compressing the model.
tpc = get_target_platform_capabilities(tpc_version='4.0', device_type='imx500')

## Mixed-Precision Post-Training Quantization using MCT
Let’s use MCT's resource utilization API to estimate the memory required to run the model on the IMX500. The requirement includes both weights memory and activation memory.

MCT estimates weights memory, activation memory and BOPs (bit-operations). We'll get the following values with the [Resource Utilization Data API](https://sony.github.io/model_optimization/api/api_docs/methods/pytorch_kpi_data.html#ug-pytorch-resource-utilization-data).

 - **Weights memory** is the static memory of the model weights.
 - **Activation memory** is the dynamic memory required by the model's activations during inference. Each step of inference of the model requires a different size of memory, depending on the current operation's input and output sizes. The *activation memory* calculated by MCT is an estimation of the **maximal activation memory** during inference.
 - **Total memory** is the sum of the *weights memory* and *activation memory*. 
 - **Bit-Operations (BOPs)** is an estimate of the total number of bit operations (multiply-accumulate operations scaled by the operand bit-widths) required for a single image inference. This is a common metric for estimating latency and power requirements.

The memory values above represent the estimated memory required by the model when quantized to the **default single-precision bit-width**.

Note<sup>1</sup> that MCT performs this estimation on the original model and the numbers represent the **size** of weights and the **size** of the activations in the model, regardless of weights and activation types (e.g. float32 which is 4 bytes).

Note<sup>2</sup> that this is only an estimation and the actual memory required by the IMX500 is larger.
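
To make the definitions above concrete, here is a rough sketch of how such estimates could be computed for a purely sequential model. It is not MCT's implementation: the layer sizes are made up, the helper names are hypothetical, and the peak activation is simplified to the largest input-plus-output footprint of any single layer.

```python
from typing import List, Tuple

# Hypothetical layers: (name, number of weights, output activation elements).
LAYERS: List[Tuple[str, int, int]] = [
    ("conv1", 648, 301_056),
    ("stage2", 50_000, 191_296),
    ("fc", 2_048_000, 1_000),
]


def weights_memory_bytes(layers, n_bits: int = 8) -> float:
    """Static memory: every weight stored at n_bits."""
    return sum(n_weights * n_bits / 8 for _, n_weights, _ in layers)


def peak_activation_bytes(layers, input_elements: int, n_bits: int = 8) -> float:
    """Simplified peak: largest (input + output) footprint over a sequential chain."""
    peak = 0.0
    current_in = input_elements
    for _, _, out_elements in layers:
        peak = max(peak, (current_in + out_elements) * n_bits / 8)
        current_in = out_elements
    return peak


w = weights_memory_bytes(LAYERS)
a = peak_activation_bytes(LAYERS, input_elements=3 * 224 * 224)
print(f"weights={w:.0f} B, activations={a:.0f} B, total={w + a:.0f} B")
```

The weights term grows with every layer, while the activation term depends only on the single most demanding step.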

In [None]:
import model_compression_toolkit as mct

ru_data = mct.core.pytorch_resource_utilization_data(float_model, representative_data_generator,
                                                     target_platform_capabilities=tpc)

print('Model utilization estimation:')
print(f' Weights:\t\t{int(ru_data.weights_memory)}')
print(f' Activations:\t{int(ru_data.activation_memory)}')
print(f' Total memory :\t{int(ru_data.weights_memory + ru_data.activation_memory)} bytes')

MCT estimates the total memory at 7.86MB, which should fit within the IMX500's 8MB, but only narrowly. MCT's estimate is usually lower than the memory the IMX500 actually uses, because it doesn't include all the memory allocations required for IMX500 operation.

Let's try quantizing the model in single-precision and check whether it can be converted successfully.

## Run Converter in single precision

The following section quantizes the model, exports it to ONNX, and runs the Converter.

In [None]:
# Quantize model:
quantized_model, quantization_info = mct.ptq.pytorch_post_training_quantization(
        in_module=float_model,
        representative_data_gen=representative_data_generator,
        target_platform_capabilities=tpc
)

# Export quantized model to an onnx file.
save_folder = './quant_models'
os.makedirs(save_folder, exist_ok=True)
onnx_path = os.path.join(save_folder, 'qmodel.onnx')
mct.exporter.pytorch_export_model(quantized_model, save_model_path=onnx_path, repr_dataset=representative_data_generator)

# Run imx500-converter:
!imxconv-pt -i ./quant_models/qmodel.onnx -o ./quant_models/output --overwrite-output

In [None]:
# Print the imx500-converter memory report:
!cat ./quant_models/output/qmodel_MemoryReport.json

The `imxconv-pt` run failed because there is not enough memory for the ShuffleNetV2 model quantized in single-precision (*ConvFe error (ISM) on 'qmodel': Not enough memory, available: 8.00MB , required: 8.34MB*). In the following sections, we'll compress the quantized model using mixed-precision so that it fits within the IMX500's memory limit.

MCT's memory estimate (7.86MB) is lower than the actual memory required by the IMX500 (8.34MB). To derive the target Resource Utilization to pass to MCT, we calculate the required compression ratio, $8.00\text{MB}/8.34\text{MB}\approx0.96$, and multiply the MCT estimate by it.
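
The arithmetic can be reproduced directly from the numbers quoted above. The variable names are illustrative.

```python
available_mb = 8.00      # IMX500 memory reported by the converter
required_mb = 8.34       # memory the converter needed for the 8-bit model
mct_estimate_mb = 7.86   # MCT's total memory estimate for the same model

compression_ratio = available_mb / required_mb
target_mb = mct_estimate_mb * compression_ratio

print(f"compression ratio: {compression_ratio:.2f}")      # 0.96
print(f"MCT target total memory: {target_mb:.2f} MB")     # 7.54 MB
```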

Triggering the mixed-precision in MCT requires setting a `ResourceUtilization` object with the required target memory constraints.

### Weights memory constraint

Providing only the weights constraint activates mixed-precision on weights while keeping all activations at the default bit-width. In the following code, we set the compression ratio to 96% so that the model fits the IMX500 memory.

In [None]:
compression_ratio = 0.96

ru = mct.core.ResourceUtilization(weights_memory=ru_data.weights_memory*compression_ratio)

quantized_model, quantization_info = mct.ptq.pytorch_post_training_quantization(
        in_module=float_model,
        representative_data_gen=representative_data_generator,
        target_resource_utilization=ru,
        target_platform_capabilities=tpc
)

# Export quantized model to an onnx file.
onnx_path = os.path.join(save_folder, 'qmodel_weights_mp.onnx')
mct.exporter.pytorch_export_model(quantized_model, save_model_path=onnx_path, repr_dataset=representative_data_generator)

# Run imx500-converter:
!imxconv-pt -i ./quant_models/qmodel_weights_mp.onnx -o ./quant_models/output --overwrite-output

# Print the imx500-converter memory report:
!cat ./quant_models/output/qmodel_weights_mp_MemoryReport.json
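
As a toy illustration of what a weights-only mixed-precision search has to decide, the sketch below brute-forces a bit-width per layer so that the total weights memory meets a 96% budget at the lowest penalty. This is not MCT's search algorithm. The layer sizes, the sensitivity scores and the penalty function are invented for the example; only the candidate bit-widths (2, 4 and 8) come from the TPC description above.

```python
from itertools import product

BIT_WIDTHS = (2, 4, 8)  # weight bit-widths listed for TPCv4 in the text

# Hypothetical layers: (number of weights, sensitivity). Higher sensitivity
# means lowering the bit-width is assumed to hurt accuracy more.
LAYERS = [(1_000, 5.0), (20_000, 1.0), (50_000, 0.5), (4_000, 3.0)]


def memory_bytes(bits):
    """Weights memory for a given per-layer bit-width assignment."""
    return sum(n * b / 8 for (n, _), b in zip(LAYERS, bits))


def cost(bits):
    """Toy penalty: sensitivity scaled by how many bits were removed from 8."""
    return sum(s * (8 - b) for (_, s), b in zip(LAYERS, bits))


def search(budget_bytes):
    """Return the cheapest assignment that fits the budget, or None."""
    feasible = [c for c in product(BIT_WIDTHS, repeat=len(LAYERS))
                if memory_bytes(c) <= budget_bytes]
    return min(feasible, key=cost) if feasible else None


baseline = memory_bytes((8,) * len(LAYERS))
best = search(budget_bytes=0.96 * baseline)
print(best, memory_bytes(best), baseline)
```

In this made-up setup the search lowers only the largest, least sensitive layer to 4 bits and leaves the others at 8 bits.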

### Weights and activation memory constraint

The accuracy of a compressed quantized model may degrade compared to a non-compressed quantized model. To improve accuracy, MCT can exploit the fact that the activation memory requirement is the maximal activation memory during inference: it can raise the activation quantization of certain operations to 16 bits instead of the default 8 bits, provided the increase doesn't change the maximal activation memory. Triggering activation memory optimization in MCT requires adding `activation_memory` to the `ResourceUtilization`.

In the example below, we ask MCT to compress the weights to 96% and keep the activation memory the same, while allowing it to set some activation bit-widths to 16 bits.

In [None]:
ru = mct.core.ResourceUtilization(weights_memory=ru_data.weights_memory*compression_ratio,
                                  activation_memory=ru_data.activation_memory)

quantized_model, quantization_info = mct.ptq.pytorch_post_training_quantization(
        in_module=float_model,
        representative_data_gen=representative_data_generator,
        target_resource_utilization=ru,
        target_platform_capabilities=tpc
)

# Export quantized model to an onnx file.
onnx_path = os.path.join(save_folder, 'qmodel_weights_activation_mp.onnx')
mct.exporter.pytorch_export_model(quantized_model, save_model_path=onnx_path, repr_dataset=representative_data_generator)

# Run imx500-converter:
!imxconv-pt -i ./quant_models/qmodel_weights_activation_mp.onnx -o ./quant_models/output --overwrite-output

# Print the imx500-converter memory report:
!cat ./quant_models/output/qmodel_weights_activation_mp_MemoryReport.json
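
The idea of raising some activations to 16 bits without changing the peak can be sketched with a simplified model in which each activation tensor is considered on its own. The tensor names and sizes are hypothetical, and MCT's real peak-memory analysis is more involved than this.

```python
# Hypothetical activation tensors: name -> number of elements (made-up values).
ACTIVATIONS = {"stem": 300_000, "stage2": 190_000, "stage3": 95_000, "head": 1_000}


def tensor_bytes(n_elements: int, n_bits: int) -> float:
    return n_elements * n_bits / 8


# Peak activation memory when every tensor uses the default 8 bits.
peak_8bit = max(tensor_bytes(n, 8) for n in ACTIVATIONS.values())

# A tensor can move to 16 bits "for free" if its doubled size stays within the peak.
promotable = [name for name, n in ACTIVATIONS.items()
              if tensor_bytes(n, 16) <= peak_8bit]

bits = {name: (16 if name in promotable else 8) for name in ACTIVATIONS}
new_peak = max(tensor_bytes(n, bits[name]) for name, n in ACTIVATIONS.items())
print(promotable, peak_8bit, new_peak)
```

Here the two smallest tensors move to 16 bits while the peak, set by the largest tensor, stays unchanged.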

### Total memory constraint

Another way to activate both weights and activation mixed-precision optimization is to provide a target `total_memory` to the `ResourceUtilization`. A single total-memory constraint bounds only the sum of weights memory and activation memory, which lets MCT decide how to divide the budget between the two.

In [None]:
ru = mct.core.ResourceUtilization(total_memory=ru_data.total_memory*compression_ratio)

quantized_model, quantization_info = mct.ptq.pytorch_post_training_quantization(
        in_module=float_model,
        representative_data_gen=representative_data_generator,
        target_resource_utilization=ru,
        target_platform_capabilities=tpc
)

# Export quantized model to an onnx file.
onnx_path = os.path.join(save_folder, 'qmodel_total_memory_mp.onnx')
mct.exporter.pytorch_export_model(quantized_model, save_model_path=onnx_path, repr_dataset=representative_data_generator)

# Run imx500-converter:
!imxconv-pt -i ./quant_models/qmodel_total_memory_mp.onnx -o ./quant_models/output --overwrite-output

# Print the imx500-converter memory report:
!cat ./quant_models/output/qmodel_total_memory_mp_MemoryReport.json
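
The difference between separate constraints and a single total constraint can be illustrated with a small feasibility check. `Budget` and `fits` are hypothetical stand-ins rather than MCT classes, and the memory figures are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Budget:
    """Hypothetical stand-in for a resource target; None means unconstrained."""
    weights: Optional[float] = None
    activation: Optional[float] = None
    total: Optional[float] = None


def fits(budget: Budget, weights: float, activation: float) -> bool:
    """Check a candidate configuration against every constraint that is set."""
    if budget.weights is not None and weights > budget.weights:
        return False
    if budget.activation is not None and activation > budget.activation:
        return False
    if budget.total is not None and weights + activation > budget.total:
        return False
    return True


# Made-up baseline of 6 MB weights + 2 MB activations, compressed to 96%.
separate = Budget(weights=6.0 * 0.96, activation=2.0)
combined = Budget(total=8.0 * 0.96)

# Candidate: larger activations paid for by smaller weights.
candidate = dict(weights=5.0, activation=2.5)
print(fits(separate, **candidate), fits(combined, **candidate))
```

The candidate violates the separate activation limit but fits the combined budget, which is the extra freedom a total-memory target provides.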

## Summary

We quantized, exported and converted the ShuffleNetV2 model in 3 different mixed-precision variations, each constrained to fit the IMX500 memory limit. Each variation has its benefits, so we can choose whichever best fits our goals.

## Copyrights

Copyright 2025 Sony Semiconductor Israel, Inc. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.