# Model Compression Toolkit (MCT) Wrapper API: Comprehensive Quantization Comparison

[Run this tutorial in Google Colab](https://colab.research.google.com/github/sony/model_optimization/blob/main/tutorials/notebooks/mct_features_notebooks/keras/example_keras_wrapper.ipynb)

## Overview 
This notebook provides a comprehensive demonstration of the MCT (Model Compression Toolkit) Wrapper API functionality, showcasing five different quantization methods on a MobileNetV2 model. The tutorial systematically compares the implementation, performance characteristics, and accuracy trade-offs of each quantization approach: PTQ (Post-Training Quantization), PTQ with Mixed Precision, GPTQ (Gradient-based PTQ), GPTQ with Mixed Precision, and LQ-PTQ (Low-bit Quantizer PTQ). Each method utilizes the unified MCTWrapper interface for consistent implementation and comparison.

## Summary
1. **Environment Setup**: Import required libraries and configure MCT with MobileNetV2 model
2. **Dataset Preparation**: Load and prepare ImageNet validation dataset with representative data generation
3. **PTQ Implementation**: Execute basic Post-Training Quantization with 8-bit precision and bias correction
4. **PTQ + Mixed Precision**: Apply intelligent bit-width allocation based on layer sensitivity analysis (weight-memory target set by `weights_compression_ratio = 0.75`)
5. **GPTQ Implementation**: Perform gradient-based optimization with 5-epoch fine-tuning for enhanced accuracy
6. **GPTQ + Mixed Precision**: Combine gradient optimization with mixed precision for optimal accuracy-compression trade-off
7. **LQ-PTQ Implementation**: Execute ultra-low bit quantization (2-4 bits) with specialized converter requirements
8. **Performance Evaluation**: Comprehensive accuracy assessment and comparison across all quantization methods
9. **Results Analysis**: Compare model sizes, inference accuracy, and quantization trade-offs

## Setup

In [None]:
# Install TensorFlow if needed (IPython shell command; run inside Jupyter or Colab)
!pip install -q tensorflow

# Import required libraries for deep learning and file handling
import os
import shutil
import tensorflow as tf
import keras
from keras.applications.mobilenet_v2 import MobileNetV2
from pathlib import Path
from typing import List, Tuple, Generator, Any, Callable

In [None]:
# Install MCT if it is not already available
import importlib.util
if not importlib.util.find_spec('model_compression_toolkit'):
    !pip install -q model_compression_toolkit

# Import MCT core
import model_compression_toolkit as mct
from model_compression_toolkit.core import QuantizationErrorMethod

## Dataset preparation
Download only the validation split of the ImageNet dataset.

**Note** that for demonstration purposes we use the validation set for the model quantization routines. Usually, a subset of the training dataset is used, but loading it is a heavy procedure that is unnecessary for the sake of this demonstration.

This step may take several minutes...

In [None]:
# Download and extract ImageNet validation dataset if not already present
# This setup is required for model quantization and evaluation
if not os.path.isdir('imagenet'):
    # Create directory and download dataset files
    os.system('mkdir imagenet')
    os.system('wget -P imagenet https://image-net.org/data/ILSVRC/2012/ILSVRC2012_devkit_t12.tar.gz')
    os.system('wget -P imagenet https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar')
    # Extract downloaded archives
    os.system('cd imagenet && tar -xzf ILSVRC2012_devkit_t12.tar.gz && mkdir ILSVRC2012_img_val && tar -xf ILSVRC2012_img_val.tar -C ILSVRC2012_img_val')

# Define paths for dataset organization
root = Path('./imagenet')
imgs_dir = root / 'ILSVRC2012_img_val'
target_dir = root / 'val'

def extract_labels() -> List[str]:
    """Extract ground truth labels from ImageNet metadata"""
    os.system('pip install -q scipy')
    import scipy.io
    mat = scipy.io.loadmat(root / 'ILSVRC2012_devkit_t12/data/meta.mat', squeeze_me=True)
    # Map ILSVRC class IDs to WordNet IDs for the 1000 leaf classes (num_children == 0)
    cls_to_nid = {s[0]: s[1] for s in mat['synsets'] if s[4] == 0}
    with open(root / 'ILSVRC2012_devkit_t12/data/ILSVRC2012_validation_ground_truth.txt', 'r') as f:
        return [cls_to_nid[int(cls)] for cls in f.readlines()]

# Organize images into class-specific directories for tf.data.Dataset compatibility
if not target_dir.exists():
    labels = extract_labels()
    for lbl in set(labels):
        os.makedirs(target_dir / lbl)
    for img_file, lbl in zip(sorted(os.listdir(imgs_dir)), labels):
        shutil.move(imgs_dir / img_file, target_dir / lbl)

# Preprocessing function for MobileNetV2
def imagenet_preprocess_input(images: tf.Tensor, labels: tf.Tensor) -> Tuple[tf.Tensor, tf.Tensor]:
    """Apply MobileNetV2-specific preprocessing to input images"""
    return tf.keras.applications.mobilenet_v2.preprocess_input(images), labels

def get_dataset(batch_size: int, shuffle: bool) -> tf.data.Dataset:
    """
    Create a tf.data.Dataset from ImageNet validation images
    
    Args:
        batch_size: Number of images per batch
        shuffle: Whether to shuffle the dataset
    
    Returns:
        Preprocessed and optimized tf.data.Dataset
    """
    # Load images from directory structure with automatic labeling
    dataset = tf.keras.utils.image_dataset_from_directory(
        directory='./imagenet/val',
        batch_size=batch_size,
        image_size=[224, 224],
        shuffle=shuffle,
        crop_to_aspect_ratio=True,
        interpolation='bilinear')
    dataset = dataset.map(lambda x, y: (imagenet_preprocess_input(x, y)), num_parallel_calls=tf.data.AUTOTUNE)
    dataset = dataset.prefetch(buffer_size=tf.data.AUTOTUNE)
    return dataset
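`tf.keras.applications.mobilenet_v2.preprocess_input` rescales pixel values from [0, 255] to [-1, 1] by computing `x / 127.5 - 1`. The pure-Python sketch below reproduces that mapping for a few scalar pixel values, which makes the input range MobileNetV2 expects explicit. `scale_pixel` is a hypothetical helper used only for illustration; it is not part of Keras or MCT.

```python
def scale_pixel(value: float) -> float:
    """Map a pixel value in [0, 255] to [-1, 1], mirroring MobileNetV2's 'tf'-mode preprocessing."""
    return value / 127.5 - 1.0

# Black, mid-gray and white pixels map to the bottom, center and top of the range
for v in (0.0, 127.5, 255.0):
    print(v, "->", scale_pixel(v))
```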

### Representative dataset construction
We show how to create a generator for the representative dataset, which is required for post-training quantization.

The representative dataset is used for collecting statistics on the inference outputs of all layers in the model.
 
To decide on the size of the representative dataset, we configure the batch size and the number of calibration iterations.
Together they give the total number of samples used during PTQ (batch_size x n_iter).
In this example we set `batch_size = 5` and `n_iter = 2`, resulting in a total of 10 representative images, which keeps the demo fast. A larger set, such as `batch_size = 50` and `n_iter = 10` (500 images), usually yields more reliable calibration statistics.

Please ensure that the dataset path has been set correctly.

In [None]:
# Configuration parameters for representative dataset generation
batch_size: int = 5  # Number of images per batch for quantization calibration
n_iter: int = 2      # Number of iterations to generate representative data

# Create dataset instance for representative data generation
dataset = get_dataset(batch_size, shuffle=True)

# Generator for representative dataset used in quantization
def representative_dataset_gen() -> Generator[List[Any], None, None]:
    """
    Generator function for representative dataset used in quantization calibration.
    
    This function provides a small subset of data that MCT uses to:
    - Calibrate quantization parameters
    - Determine optimal activation ranges
    - Configure quantization thresholds
    
    Yields:
        List containing numpy arrays of image batches
    """
    for _ in range(n_iter):
        # Extract one batch and convert to numpy format required by MCT
        yield [dataset.take(1).get_single_element()[0].numpy()]
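The generator contract used by `representative_dataset_gen` is: each step yields a list with one batch per model input, and iterating it `n_iter` times covers `batch_size x n_iter` samples. The stand-in below swaps the ImageNet pipeline for a hypothetical `fake_batch` function that returns lists of sample indices, so the counting can be checked without downloading any data. The `toy_` names are illustrative and chosen so they do not overwrite the notebook's own variables.

```python
from typing import Generator, List

toy_batch_size = 5  # images per calibration batch (same value as batch_size above)
toy_n_iter = 2      # number of calibration batches (same value as n_iter above)

def fake_batch(start: int, size: int) -> List[int]:
    """Hypothetical stand-in for one image batch: only the sample indices."""
    return list(range(start, start + size))

def toy_representative_gen() -> Generator[List[List[int]], None, None]:
    """Yield [batch] toy_n_iter times, mirroring representative_dataset_gen for a single-input model."""
    for i in range(toy_n_iter):
        yield [fake_batch(i * toy_batch_size, toy_batch_size)]

# Each yielded item is a list with one entry per model input; count samples across all batches
total = sum(len(inputs[0]) for inputs in toy_representative_gen())
print("total representative samples:", total)
```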

## Model Post-Training Quantization using MCTWrapper

In [None]:
# Decorator to provide consistent logging and error handling for quantization functions
import functools

def decorator(func: Callable[..., Tuple[bool, keras.Model]]) -> Callable[..., Tuple[bool, keras.Model]]:
    """
    Wrapper decorator that:
    - Logs the start and end of the decorated function
    - Checks the success flag returned by the quantization function
    - Raises RuntimeError if quantization fails, so the notebook stops with a clear message
    
    Args:
        func: Function to be decorated (quantization function)
    
    Returns:
        Wrapped function with logging and error handling
    """
    @functools.wraps(func)
    def wrapper(*args: Any, **kwargs: Any) -> Tuple[bool, keras.Model]:
        print(f"----------------- {func.__name__} Start ---------------")
        flag, result = func(*args, **kwargs)
        print(f"----------------- {func.__name__} End -----------------")
        if not flag:
            raise RuntimeError(f"{func.__name__} failed: MCTWrapper reported an unsuccessful quantization")
        return flag, result
    return wrapper

Run PTQ (Post-Training Quantization) with Keras

In [None]:
@decorator
def PTQ_Keras(float_model: keras.Model) -> Tuple[bool, keras.Model]:
    """
    Perform Post-Training Quantization (PTQ) using MCT on Keras model.
    
    PTQ is a quantization method that:
    - Does not require model retraining
    - Uses representative data for calibration
    - Provides good accuracy with minimal computational overhead
    
    Args:
        float_model: Original floating-point Keras model
    
    Returns:
        tuple: (success_flag, quantized_model)
    """
    # Configuration for basic PTQ quantization
    method = 'PTQ'                    # Post-Training Quantization method
    framework = 'tensorflow'          # Target framework (Keras/TensorFlow)
    use_MCT_TPC = True                # Use MCT's built-in Target Platform Capabilities
    use_MixP = False                  # Disable mixed-precision quantization

    # Parameter configuration for PTQ
    param_items = [
        ['tpc_version', '1.0', 'The version of the TPC to use.'],
        
        # Quantization configuration parameters
        ['activation_error_method', QuantizationErrorMethod.MSE, 'Error metric for activation quantization'],
        ['weights_bias_correction', True, 'Enable bias correction for weights'],
        ['z_threshold', float('inf'), 'Z-score threshold for outlier removal (inf disables it)'],
        ['linear_collapsing', True, 'Enable linear layer collapsing optimization'],
        ['residual_collapsing', True, 'Enable residual connection collapsing'],
        
        # Output configuration
        ['save_model_path', './qmodel_PTQ_Keras.tflite', 'Path to save the quantized model']
    ]

    # Execute quantization using MCTWrapper
    wrapper = mct.wrapper.mctwrapper.MCTWrapper()
    flag, quantized_model = wrapper.quantize_and_export(
        float_model, method, framework, use_MCT_TPC, use_MixP, 
        representative_dataset_gen, param_items)
    return flag, quantized_model
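Each entry in `param_items` is a `[name, value, description]` triple. When comparing methods it can help to view a configuration as a plain dictionary. `params_to_dict` below is a hypothetical helper, not part of MCT or the wrapper: it checks the triple shape, rejects duplicate names, and drops the descriptions.

```python
from typing import Any, Dict, List

def params_to_dict(param_items: List[List[Any]]) -> Dict[str, Any]:
    """Hypothetical helper: turn [name, value, description] triples into {name: value}."""
    params: Dict[str, Any] = {}
    for item in param_items:
        if len(item) != 3:
            raise ValueError(f"expected [name, value, description], got {item!r}")
        name, value, _description = item
        if name in params:
            raise ValueError(f"duplicate parameter: {name}")
        params[name] = value
    return params

# A subset of the PTQ configuration used above
example = [
    ['tpc_version', '1.0', 'The version of the TPC to use.'],
    ['weights_bias_correction', True, 'Enable bias correction for weights'],
    ['linear_collapsing', True, 'Enable linear layer collapsing optimization'],
]
print(params_to_dict(example))
```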

Run PTQ + Mixed Precision Quantization (MixP) with Keras

In [None]:
@decorator
def PTQ_Keras_MixP(float_model: keras.Model) -> Tuple[bool, keras.Model]:
    """
    Perform Post-Training Quantization with Mixed Precision (PTQ + MixP) on Keras model.
    
    Mixed Precision Quantization:
    - Uses different bit-widths for different layers
    - Optimizes model size while maintaining accuracy
    - Automatically selects optimal precision for each layer
    - Uses resource constraints to guide precision allocation
    
    Args:
        float_model: Original floating-point Keras model
    
    Returns:
        tuple: (success_flag, quantized_model)
    """
    # Configuration for PTQ with mixed precision
    method = 'PTQ'                    # Post-Training Quantization method
    framework = 'tensorflow'          # Target framework (Keras/TensorFlow)
    use_MCT_TPC = True                # Use MCT's built-in Target Platform Capabilities
    use_MixP = True                   # Enable mixed-precision quantization

    # Parameter configuration for PTQ with Mixed Precision
    param_items = [
        ['tpc_version', '1.0', 'The version of the TPC to use.'],
        
        # Mixed precision configuration
        ['num_of_images', 5, 'Number of images for mixed precision analysis'],
        ['use_hessian_based_scores', False, 'Use Hessian-based sensitivity scores for layer importance'],
        
        # Resource constraint configuration
        ['weights_compression_ratio', 0.75, 'Target compression ratio for model weights (75% of original size)'],
        
        # Output configuration
        ['save_model_path', './qmodel_PTQ_Keras_MixP.tflite', 'Path to save the mixed precision quantized model']
    ]

    # Execute mixed precision quantization using MCTWrapper
    wrapper = mct.wrapper.mctwrapper.MCTWrapper()
    flag, quantized_model = wrapper.quantize_and_export(
        float_model, method, framework, use_MCT_TPC, use_MixP, 
        representative_dataset_gen, param_items)
    return flag, quantized_model
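One way to read `weights_compression_ratio = 0.75`, assuming (as the parameter description suggests) that the target weight memory is 75% of the memory of the all-8-bit model: the budget corresponds to an average of 6 bits per weight, which could be met, for example, by keeping half of the weights at 8 bits and moving the other half to 4 bits. The sketch below does that arithmetic. `num_weights` is illustrative (Keras reports about 3.5 million parameters for MobileNetV2, and batch-norm folding changes the count MCT actually quantizes), and the 8/4-bit split is only one of many allocations the mixed precision search might choose.

```python
def weight_budget_bytes(num_weights: int, ratio: float, base_bits: int = 8) -> float:
    """Target weight memory in bytes: ratio x memory of all weights stored at base_bits."""
    return num_weights * base_bits / 8 * ratio

def fraction_at_low_bits(ratio: float, high_bits: int = 8, low_bits: int = 4) -> float:
    """Fraction of weights at low_bits so that the average bit-width equals ratio * high_bits."""
    avg_bits = ratio * high_bits
    return (high_bits - avg_bits) / (high_bits - low_bits)

num_weights = 3_500_000  # illustrative, roughly MobileNetV2's parameter count
ratio = 0.75             # same value as weights_compression_ratio above
print("budget (MB):", weight_budget_bytes(num_weights, ratio) / 1e6)
print("average bits per weight:", ratio * 8)
print("fraction of weights at 4 bits:", fraction_at_low_bits(ratio))
```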

Run GPTQ (Gradient-based PTQ) with Keras

In [None]:
@decorator
def GPTQ_Keras(float_model: keras.Model) -> Tuple[bool, keras.Model]:
    """
    Perform Gradient-based Post-Training Quantization (GPTQ) on Keras model.
    
    GPTQ is an advanced quantization method that:
    - Uses gradient information to optimize quantization parameters
    - Fine-tunes the model during quantization process
    - Generally provides better accuracy than standard PTQ
    - Requires more computation time and memory than PTQ, because it backpropagates through the model for several epochs
    
    Args:
        float_model: Original floating-point Keras model
    
    Returns:
        tuple: (success_flag, quantized_model)
    """
    # Configuration for GPTQ quantization
    method = 'GPTQ'                   # Gradient-based Post-Training Quantization
    framework = 'tensorflow'          # Target framework (Keras/TensorFlow)
    use_MCT_TPC = False               # Use external EdgeMDT Target Platform Capabilities
    use_MixP = False                  # Disable mixed-precision quantization

    # Parameter configuration for GPTQ
    param_items = [
        # Platform configuration
        ['target_platform_version', 'v1', 'Target platform capabilities version'],
        
        # GPTQ-specific training parameters
        ['n_epochs', 5, 'Number of epochs for gradient-based fine-tuning'],
        ['optimizer', None, 'Optimizer for fine-tuning (None = use default)'],
        
        # Output configuration
        ['save_model_path', './qmodel_GPTQ_Keras.tflite', 'Path to save the GPTQ quantized model']
    ]

    # Execute GPTQ quantization using MCTWrapper
    wrapper = mct.wrapper.mctwrapper.MCTWrapper()
    flag, quantized_model = wrapper.quantize_and_export(
        float_model, method, framework, use_MCT_TPC, use_MixP, 
        representative_dataset_gen, param_items)
    return flag, quantized_model
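GPTQ uses calibration data and gradients to refine how weights are quantized, instead of simply rounding each weight to the nearest grid point. The toy below is not MCT's algorithm: it replaces gradient descent with an exhaustive search over round-down/round-up choices for a 4-weight linear layer and keeps the choice that minimizes the layer's output error on a few calibration inputs. Because nearest rounding is one of the searched candidates, the optimized error can never be worse, and it can be better. All names and values are illustrative.

```python
import itertools
import math
from typing import List, Sequence

def output_mse(w_q: Sequence[float], w: Sequence[float], xs: List[Sequence[float]]) -> float:
    """Mean squared difference between float and quantized layer outputs y = w . x."""
    err = 0.0
    for x in xs:
        y = sum(wi * xi for wi, xi in zip(w, x))
        y_q = sum(qi * xi for qi, xi in zip(w_q, x))
        err += (y - y_q) ** 2
    return err / len(xs)

step = 0.25                              # quantization step (illustrative)
w = [0.31, -0.62, 0.14, 0.88]            # float weights
xs = [[1.0, 0.5, -1.0, 0.2],             # calibration inputs
      [0.3, -0.8, 0.9, 1.0],
      [-0.5, 0.4, 0.7, -0.6]]

# Baseline: round every weight to the nearest grid point
nearest = [step * round(wi / step) for wi in w]

# Data-driven: try floor/ceil for each weight and keep the combination with the lowest output error
candidates = [(step * math.floor(wi / step), step * math.ceil(wi / step)) for wi in w]
best = min(itertools.product(*candidates), key=lambda q: output_mse(q, w, xs))

print("nearest-rounding error:", output_mse(nearest, w, xs))
print("optimized-rounding error:", output_mse(best, w, xs))
```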

Run GPTQ + Mixed Precision Quantization (MixP) with Keras

In [None]:
@decorator
def GPTQ_Keras_MixP(float_model: keras.Model) -> Tuple[bool, keras.Model]:
    """
    Perform Gradient-based Post-Training Quantization with Mixed Precision (GPTQ + MixP).
    
    This combines the benefits of both techniques:
    - GPTQ: Gradient-based optimization for better quantization accuracy
    - Mixed Precision: Optimal bit-width allocation for size/accuracy trade-off
    
    Among the methods in this notebook, this combination typically provides:
    - The best accuracy at a given weight-memory budget
    - Model size reduction guided by the resource constraint
    - Automatic precision selection per layer, at the highest computational cost
    
    Args:
        float_model: Original floating-point Keras model
    
    Returns:
        tuple: (success_flag, quantized_model)
    """
    # Configuration for GPTQ with mixed precision
    method = 'GPTQ'                   # Gradient-based Post-Training Quantization
    framework = 'tensorflow'          # Target framework (Keras/TensorFlow)
    use_MCT_TPC = False               # Use external EdgeMDT Target Platform Capabilities
    use_MixP = True                   # Enable mixed-precision quantization

    # Parameter configuration for GPTQ with Mixed Precision
    param_items = [
        # Platform configuration
        ['target_platform_version', 'v1', 'Target platform capabilities version'],
        
        # GPTQ-specific training parameters
        ['n_epochs', 5, 'Number of epochs for gradient-based fine-tuning'],
        ['optimizer', None, 'Optimizer for fine-tuning (None = use default)'],
        
        # Mixed precision configuration
        ['num_of_images', 5, 'Number of images for mixed precision sensitivity analysis'],
        ['use_hessian_based_scores', False, 'Use Hessian-based scores for layer importance ranking'],
        
        # Resource constraint configuration
        ['weights_compression_ratio', 0.75, 'Target compression ratio for model weights (75% of original size)'],
        
        # Output configuration
        ['save_model_path', './qmodel_GPTQ_Keras_MixP.tflite', 'Path to save the GPTQ+MixP quantized model']
    ]

    # Execute advanced GPTQ+MixP quantization using MCTWrapper
    wrapper = mct.wrapper.mctwrapper.MCTWrapper()
    flag, quantized_model = wrapper.quantize_and_export(
        float_model, method, framework, use_MCT_TPC, use_MixP, 
        representative_dataset_gen, param_items)
    return flag, quantized_model

Run LQPTQ (Low-bit Quantizer PTQ) with Keras

In [None]:
@decorator
def LQPTQ_Keras(float_model: keras.Model) -> Tuple[bool, keras.Model]:
    """
    Perform Low-bit Quantizer Post-Training Quantization (LQ-PTQ) on Keras model.
    
    LQ-PTQ is a specialized quantization method that:
    - Targets very low bit-width quantization (e.g., 2-4 bits)
    - Uses advanced techniques for ultra-low precision
    - Requires specific converter versions for deployment
    - Currently only supports TensorFlow/Keras framework
    
    Args:
        float_model: Original floating-point Keras model
    
    Returns:
        tuple: (success_flag, quantized_model)
    """
    # Configuration for LQ-PTQ quantization
    method = 'LQPTQ'                  # Low-bit Quantizer Post-Training Quantization
    framework = 'tensorflow'          # Target framework (TensorFlow only for LQ-PTQ)
    use_MCT_TPC = False               # Use external Target Platform Capabilities
    use_MixP = False                  # Mixed precision not applicable for LQ-PTQ

    # Parameter configuration for LQ-PTQ
    param_items = [
        # LQ-PTQ specific training parameters
        ['learning_rate', 0.0001, 'Learning rate for low-bit quantization optimization'],
        ['converter_ver', 'v3.14', 'Converter version for deployment compatibility'],
        
        # Output configuration
        ['save_model_path', './qmodel_LQPTQ_Keras.tflite', 'Path to save the LQ-PTQ quantized model']
    ]

    # LQ-PTQ requires a different representative dataset format (single batch, not generator)
    representative_dataset = dataset.take(1).get_single_element()[0].numpy()
    
    # Execute LQ-PTQ quantization using MCTWrapper
    wrapper = mct.wrapper.mctwrapper.MCTWrapper()
    flag, quantized_model = wrapper.quantize_and_export(
        float_model, method, framework, use_MCT_TPC, use_MixP, 
        representative_dataset, param_items)
    return flag, quantized_model
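Lower bit-widths leave fewer quantization levels, so rounding error grows as the bit-width drops; this is why ultra-low-bit methods such as LQ-PTQ need extra optimization. The sketch below applies a plain symmetric uniform quantizer with a max-abs scale (not LQ-PTQ's own quantizer) to an illustrative list of weights at 2, 3, 4 and 8 bits and reports the mean squared error for each.

```python
from typing import Dict, List

def quantize_symmetric(values: List[float], bits: int) -> List[float]:
    """Symmetric uniform quantization with a max-abs scale and signed integer levels."""
    qmax = 2 ** (bits - 1) - 1          # e.g. 1 for 2 bits, 7 for 4 bits
    qmin = -(2 ** (bits - 1))
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return list(values)             # all zeros: nothing to quantize
    scale = max_abs / qmax
    return [scale * min(max(round(v / scale), qmin), qmax) for v in values]

def mse(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

weights = [0.9, -0.35, 0.12, -0.77, 0.48, 0.05, -0.6, 0.27]  # illustrative values
errors: Dict[int, float] = {b: mse(weights, quantize_symmetric(weights, b)) for b in (2, 3, 4, 8)}
for b, e in errors.items():
    print(f"{b}-bit MSE: {e:.5f}")
```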

### Run model Post-Training Quantization
Next, we quantize the model with each method through the MCTWrapper API.

In [None]:
# Load pre-trained MobileNetV2 model as the base model for quantization experiments
float_model = MobileNetV2()

# Execute comprehensive quantization method comparison using MCT Wrapper functionality
print("Starting quantization experiments with different methods...")

# Method 1: Basic Post-Training Quantization (PTQ)
# - Standard 8-bit quantization without gradient-based fine-tuning
# - Fastest method with good baseline performance
flag, quantized_model = PTQ_Keras(float_model)

# Method 2: PTQ with Mixed Precision Quantization
# - Uses different bit-widths for different layers based on sensitivity analysis
# - Optimizes model size while maintaining accuracy through intelligent bit allocation
flag, quantized_model2 = PTQ_Keras_MixP(float_model)

# Method 3: Gradient-based Post-Training Quantization (GPTQ)
# - Uses gradient information to fine-tune quantization parameters
# - Provides better accuracy than basic PTQ through learned optimization
flag, quantized_model3 = GPTQ_Keras(float_model)

# Method 4: GPTQ with Mixed Precision Quantization
# - Combines gradient-based optimization with mixed precision
# - Typically gives the best accuracy under the weight-memory budget, at the highest computational cost
flag, quantized_model4 = GPTQ_Keras_MixP(float_model)

# Method 5: Low-bit Quantizer Post-Training Quantization (LQ-PTQ)
# - Experimental ultra-low precision quantization (commented out - requires specific setup)
#flag, quantized_model5 = LQPTQ_Keras(float_model)

print("All enabled quantization methods completed successfully!")

## Models evaluation
In order to evaluate our models, we first need to load the validation dataset. As before, please ensure that the dataset path has been set correctly.
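With integer labels (the default `label_mode='int'` of `image_dataset_from_directory`) and a sparse loss, `metrics="accuracy"` makes Keras report top-1 accuracy: the fraction of images whose highest-scoring class equals the ground-truth label. The pure-Python sketch below computes the same quantity from a few made-up probability vectors; `top1_accuracy` is a hypothetical helper, not a Keras function.

```python
from typing import List, Sequence

def top1_accuracy(probs: List[Sequence[float]], labels: List[int]) -> float:
    """Fraction of samples whose arg-max class equals the integer label (ties go to the lowest index)."""
    correct = 0
    for p, y in zip(probs, labels):
        predicted = max(range(len(p)), key=lambda k: p[k])
        correct += int(predicted == y)
    return correct / len(labels)

# Made-up softmax outputs for 4 images over 3 classes, with integer ground-truth labels
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.3, 0.6],
         [0.2, 0.5, 0.3],
         [0.4, 0.4, 0.2]]
labels = [0, 2, 0, 1]
print(f"Top 1 accuracy: {top1_accuracy(probs, labels) * 100:.2f}%")
```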

In [None]:
# Model Evaluation and Accuracy Comparison
print("Starting model evaluation phase...")

# Prepare validation dataset for accuracy assessment
val_dataset = get_dataset(batch_size=50, shuffle=False)

# Evaluate original floating-point model accuracy
print("\n=== Original Model Evaluation ===")
float_model.compile(loss=keras.losses.SparseCategoricalCrossentropy(), metrics="accuracy")
float_accuracy = float_model.evaluate(val_dataset)
print(f"Float model's Top 1 accuracy on the Imagenet validation set: {(float_accuracy[1] * 100):.2f}%")

# Evaluate PTQ quantized model accuracy
print("\n=== PTQ Model Evaluation ===")
quantized_model.compile(loss=keras.losses.SparseCategoricalCrossentropy(), metrics="accuracy")
quantized_accuracy = quantized_model.evaluate(val_dataset)
print(f"PTQ_Keras Quantized model's Top 1 accuracy on the Imagenet validation set: {(quantized_accuracy[1] * 100):.2f}%")

# Evaluate PTQ + Mixed Precision model accuracy
print("\n=== PTQ + Mixed Precision Model Evaluation ===")
quantized_model2.compile(loss=keras.losses.SparseCategoricalCrossentropy(), metrics="accuracy")
quantized_accuracy = quantized_model2.evaluate(val_dataset)
print(f"PTQ_Keras_MixP Quantized model's Top 1 accuracy on the Imagenet validation set: {(quantized_accuracy[1] * 100):.2f}%")

# Evaluate GPTQ quantized model accuracy
print("\n=== GPTQ Model Evaluation ===")
quantized_model3.compile(loss=keras.losses.SparseCategoricalCrossentropy(), metrics="accuracy")
quantized_accuracy = quantized_model3.evaluate(val_dataset)
print(f"GPTQ_Keras Quantized model's Top 1 accuracy on the Imagenet validation set: {(quantized_accuracy[1] * 100):.2f}%")

# Evaluate GPTQ + Mixed Precision model accuracy
print("\n=== GPTQ + Mixed Precision Model Evaluation ===")
quantized_model4.compile(loss=keras.losses.SparseCategoricalCrossentropy(), metrics="accuracy")
quantized_accuracy = quantized_model4.evaluate(val_dataset)
print(f"GPTQ_Keras_MixP Quantized model's Top 1 accuracy on the Imagenet validation set: {(quantized_accuracy[1] * 100):.2f}%")

# LQ-PTQ model evaluation (commented out)
#print("\n=== LQ-PTQ Model Evaluation ===")
#quantized_model5.compile(loss=keras.losses.SparseCategoricalCrossentropy(), metrics="accuracy")
#quantized_accuracy = quantized_model5.evaluate(val_dataset)
#print(f"LQPTQ_Keras Quantized model's Top 1 accuracy on the Imagenet validation set: {(quantized_accuracy[1] * 100):.2f}%")

print("Finished")
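To compare model sizes as well as accuracy, the exported files can be inspected directly. This sketch assumes that `quantize_and_export` writes each model to exactly the `save_model_path` given in its `param_items` (the file names below are copied from those parameters); files that were not produced, such as the LQ-PTQ model whose run is commented out, are reported as missing instead of raising an error. File size also depends on the export format, so treat it as a rough proxy for weight memory. `report_file_sizes` is a hypothetical helper.

```python
import os
from typing import Dict, Optional

def report_file_sizes(paths: Dict[str, str]) -> Dict[str, Optional[int]]:
    """Return the size in bytes of each exported model file, or None if it does not exist."""
    sizes: Dict[str, Optional[int]] = {}
    for name, path in paths.items():
        sizes[name] = os.path.getsize(path) if os.path.isfile(path) else None
    return sizes

exported = {
    'PTQ': './qmodel_PTQ_Keras.tflite',
    'PTQ + MixP': './qmodel_PTQ_Keras_MixP.tflite',
    'GPTQ': './qmodel_GPTQ_Keras.tflite',
    'GPTQ + MixP': './qmodel_GPTQ_Keras_MixP.tflite',
    'LQ-PTQ': './qmodel_LQPTQ_Keras.tflite',
}

for name, size in report_file_sizes(exported).items():
    label = 'missing' if size is None else f'{size / 1e6:.2f} MB'
    print(f"{name:12s}: {label}")
```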

## Conclusion
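In this tutorial, we demonstrated how to quantize a pre-trained MobileNetV2 model with PTQ, GPTQ, and their mixed precision variants (plus the optional LQ-PTQ) through a single `MCTWrapper.quantize_and_export` call per method, using only a few lines of configuration each, and compared the resulting ImageNet Top-1 accuracies.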

Copyright 2024 Sony Semiconductor Solutions, Inc. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.