# Model Compression Toolkit (MCT) Wrapper API Comprehensive Quantization Comparison (TensorFlow)

[Run this tutorial in Google Colab](https://colab.research.google.com/github/SonySemiconductorSolutions/mct-model-optimization/blob/main/tutorials/notebooks/mct_features_notebooks/keras/example_keras_mct_wrapper.ipynb)

## Overview 
This notebook demonstrates the MCT (Model Compression Toolkit) Wrapper API by applying several quantization methods to a MobileNetV2 model and comparing their implementation and accuracy. Four methods are run below: PTQ (Post-Training Quantization), PTQ with Mixed Precision, GPTQ (Gradient-based PTQ), and GPTQ with Mixed Precision. A fifth method, LQ-PTQ (Low-bit Quantizer PTQ), is not run in this notebook. Each method goes through the same MCTWrapper interface, so the configurations and results can be compared directly.

## Summary
1. **Environment Setup**: Import required libraries and configure MCT with MobileNetV2 model
2. **Dataset Preparation**: Load and prepare ImageNet validation dataset with representative data generation
3. **PTQ Implementation**: Execute basic Post-Training Quantization with 8-bit precision and bias correction
4. **PTQ + Mixed Precision**: Allocate bit-widths per layer based on sensitivity analysis, with `weights_compression_ratio` set to 0.75
5. **GPTQ Implementation**: Perform gradient-based optimization with 5 epochs of fine-tuning, aimed at improving accuracy
6. **GPTQ + Mixed Precision**: Combine gradient-based optimization with mixed precision to balance accuracy and compression
7. **Performance Evaluation**: Comprehensive accuracy assessment and comparison across all quantization methods
8. **Results Analysis**: Compare model sizes, inference accuracy, and quantization trade-offs

## Setup

In [1]:
# Install TensorFlow if needed (notebook shell command; skip if it is already installed)
!pip install -q tensorflow

# Import required libraries for deep learning and file handling
import os
import tensorflow as tf
import keras
from keras.applications.mobilenet_v2 import MobileNetV2
from pathlib import Path
from typing import Callable, Generator, List, Tuple, Any


In [2]:
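# Import MCT core
# To install MCT from PyPI instead of using a local checkout, uncomment:
#import importlib.util
#if not importlib.util.find_spec('model_compression_toolkit'):
#    !pip install model_compression_toolkit

# Local development checkout of MCT; change this path to your own clone
import sys
sys.path.append('/home/ubuntu/wrapper/sonyfork/mct-model-optimization')

import model_compression_toolkit as mct
from model_compression_toolkit.core import QuantizationErrorMethod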
from model_compression_toolkit.core import QuantizationErrorMethod

## Dataset preparation
Download only the validation split of the ImageNet dataset.

**Note** that for demonstration purposes we use the validation set for the model quantization routines. Usually, a subset of the training dataset is used, but loading it is a heavy procedure that is unnecessary for the sake of this demonstration.

This step may take several minutes...

In [3]:
# Download the ImageNet validation files if they are not already present
if not os.path.isdir('imagenet'):
    os.makedirs('imagenet', exist_ok=True)

    # wget -P saves each file directly into the imagenet directory,
    # so no separate move step is needed
    os.system('wget -P imagenet https://image-net.org/data/ILSVRC/2012/ILSVRC2012_devkit_t12.tar.gz')
    os.system('wget -P imagenet https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar')

In [4]:
# Build the imagenet/val directory structure expected by
# tf.keras.utils.image_dataset_from_directory, if it does not exist yet
if not os.path.isdir('imagenet/val'):
    import subprocess

    # Clone the MCT repository into temp_mct
    subprocess.run(['git', 'clone', 'https://github.com/sony/model_optimization.git', 'temp_mct'])

    # Make the ImageNet preparation script executable.
    # Note: this relative path assumes the notebook runs from its own location inside the repository.
    os.system('chmod +x ../../../resources/scripts/prepare_imagenet.sh')

    # Run the script to extract the validation images and organize them by class
    subprocess.run(['../../../resources/scripts/prepare_imagenet.sh'])

def imagenet_preprocess_input(images: tf.Tensor, labels: tf.Tensor) -> Tuple[tf.Tensor, tf.Tensor]:
    """
    Apply MobileNetV2-specific preprocessing to input images.
    
    This function normalizes pixel values according to MobileNetV2 requirements,
    ensuring consistent input format for the model.
    
    Args:
        images: Input image tensor
        labels: Corresponding label tensor
        
    Returns:
        Tuple of preprocessed images and unchanged labels
    """
    return tf.keras.applications.mobilenet_v2.preprocess_input(images), labels

def get_dataset(batch_size: int, shuffle: bool):
    dataset = tf.keras.utils.image_dataset_from_directory(
        directory='./imagenet/val',
        batch_size=batch_size,
        image_size=[224, 224],
        shuffle=shuffle,
        crop_to_aspect_ratio=True,
        interpolation='bilinear'
    )
    dataset = dataset.map(lambda x, y: imagenet_preprocess_input(x, y), num_parallel_calls=tf.data.AUTOTUNE)
    return dataset.prefetch(buffer_size=tf.data.AUTOTUNE)
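
For reference, `mobilenet_v2.preprocess_input` rescales pixel values from [0, 255] to [-1, 1]. The sketch below reproduces that arithmetic in NumPy so the expected input range is easy to check. The helper name `scale_to_minus_one_one` is made up for illustration and is not meant to replace the Keras function.

```python
import numpy as np

def scale_to_minus_one_one(images: np.ndarray) -> np.ndarray:
    """Map pixel values in [0, 255] to [-1, 1] (same arithmetic as MobileNetV2 preprocessing)."""
    return images.astype(np.float32) / 127.5 - 1.0

# Three probe values: darkest pixel, mid-gray, brightest pixel
pixels = np.array([0.0, 127.5, 255.0])
scaled = scale_to_minus_one_one(pixels)
print(scaled)
```

A quick range check like this can help confirm that calibration and evaluation data pass through the same preprocessing.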


In [5]:
# Configuration parameters for representative dataset generation
# These parameters control the calibration dataset used for quantization
batch_size = 5  # Number of images per batch for quantization calibration
n_iter = 2      # Number of iterations to generate representative data
                # Total calibration samples = batch_size * n_iter = 10 images

# Create dataset instance for representative data generation
# Use shuffled data to ensure diverse representative samples
dataset = get_dataset(batch_size, shuffle=True)

# Generator function for representative dataset used in quantization calibration
def representative_dataset_gen():
    """
    Generator function for representative dataset used in quantization calibration.

    This function provides a small subset of data that MCT uses for:
    - Calibrating quantization parameters across all model layers
    - Determining optimal activation value ranges for each layer
    - Computing quantization thresholds based on actual data distribution
    - Minimizing quantization error through data-driven parameter selection
    
    Yields:
        List containing numpy arrays of image batches in MCT-expected format
    """
    for _ in range(n_iter):
        # Extract one batch from the dataset and convert to numpy format
        yield [dataset.take(1).get_single_element()[0].numpy()]

Found 50000 files belonging to 1000 classes.
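
Before starting a long quantization run, it can be useful to check that a representative-data generator yields what the docstring above describes: one list per iteration, holding one array per model input. The sketch below does this with random data standing in for ImageNet. `make_representative_gen` is a hypothetical helper, and the value range assumes MobileNetV2-style preprocessing.

```python
import numpy as np

def make_representative_gen(batch_size: int, n_iter: int, image_shape=(224, 224, 3)):
    """Return a generator function that yields n_iter batches of random images."""
    rng = np.random.default_rng(0)

    def gen():
        for _ in range(n_iter):
            # One list entry per model input; MobileNetV2 has a single input
            batch = rng.uniform(-1.0, 1.0, size=(batch_size, *image_shape))
            yield [batch.astype(np.float32)]

    return gen

gen = make_representative_gen(batch_size=5, n_iter=2)
batches = list(gen())
print(len(batches), batches[0][0].shape)
```

With `batch_size = 5` and `n_iter = 2`, this yields 10 images in total, the same calibration budget as the cell above.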




## Model Post-Training Quantization Using MCTWrapper

In [6]:
# Decorator to provide consistent logging and error handling for quantization functions
def decorator(func):
    """
    Wrapper decorator that provides standardized execution logging and error handling.
    
    This decorator enhances quantization functions by:
    - Providing clear start/end execution markers for debugging
    - Handling success/failure status from quantization operations
    - Implementing fail-fast behavior on quantization errors
    - Ensuring consistent logging format across all quantization methods
    
    Usage:
        @decorator
        def quantization_function(model):
            # quantization implementation
            return flag, quantized_model
    
    Args:
        func: Function to be decorated (typically a quantization function)
    
    Returns:
        Wrapped function with enhanced logging and error handling capabilities
    """
    def wrapper(*args, **kwargs):
        # Log function execution start with clear delimiter
        print(f"----------------- {func.__name__} Start ---------------")
        
        # Execute the quantization function and capture return values
        # Expected return format: (success_flag, quantized_model)
        flag, result = func(*args, **kwargs)
        
        # Log function execution completion
        print(f"----------------- {func.__name__} End -----------------")
        
        # Implement fail-fast behavior: exit immediately on quantization failure
        # This ensures early detection of quantization issues
        if not flag:
            exit()
        
        # Return original function results if successful
        return flag, result
    
    return wrapper
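
The sketch below shows the decorator's behavior on a stand-in function, without running any quantization. It is a variant, not the cell above: it raises `RuntimeError` instead of calling `exit()`, and it uses `functools.wraps` to preserve the wrapped function's name. The names `log_and_check` and `fake_quantize` are hypothetical.

```python
import functools

def log_and_check(func):
    """Variant of the logging decorator that raises instead of exiting."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        print(f"----------------- {func.__name__} Start ---------------")
        flag, result = func(*args, **kwargs)
        print(f"----------------- {func.__name__} End -----------------")
        if not flag:
            raise RuntimeError(f"{func.__name__} reported failure")
        return flag, result
    return wrapper

@log_and_check
def fake_quantize(model, succeed=True):
    """Stand-in for a quantization function: returns (success_flag, model)."""
    return succeed, model

# Successful call: the flag and result pass through unchanged
flag, result = fake_quantize("float_model")

# Failing call: the decorator raises after printing the End marker
try:
    fake_quantize("float_model", succeed=False)
    failure_raised = False
except RuntimeError:
    failure_raised = True
```

Raising an exception keeps the notebook kernel alive and leaves a traceback, which may be more convenient than `exit()` when experimenting interactively.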

Run PTQ (Post-Training Quantization) with Keras

In [7]:
@decorator
def PTQ_Keras(float_model):
    """
    Perform Post-Training Quantization (PTQ) using MCT on Keras model.
    
    PTQ is a quantization method that:
    - Does not require model retraining
    - Uses representative data for calibration
    - Provides good accuracy with minimal computational overhead
    
    Args:
        float_model: Original floating-point Keras model
    
    Returns:
        tuple: (success_flag, quantized_model)
    """
    # Configuration for basic PTQ quantization
    method = 'PTQ'                    # Post-Training Quantization method
    framework = 'tensorflow'          # Target framework (Keras/TensorFlow)
    use_internal_tpc = True           # Use MCT's built-in Target Platform Capabilities
    use_mixed_precision = False       # Disable mixed-precision quantization

    # Parameter configuration for PTQ
    param_items = [
        ['tpc_version', '1.0', 'The version of the TPC to use.'],
        
        # Quantization configuration parameters
        ['activation_error_method', QuantizationErrorMethod.MSE, 'Error metric for activation quantization'],
        ['weights_bias_correction', True, 'Enable bias correction for weights'],
        ['z_threshold', float('inf'), 'Z-score threshold for outlier removal in activation statistics'],
        ['linear_collapsing', True, 'Enable linear layer collapsing optimization'],
        ['residual_collapsing', True, 'Enable residual connection collapsing'],
        
        # Output configuration
        ['save_model_path', './qmodel_PTQ_Keras.tflite', 'Path to save the quantized model']
    ]

    # Execute quantization using MCTWrapper
    wrapper = mct.wrapper.mct_wrapper.MCTWrapper()
    flag, quantized_model = wrapper.quantize_and_export(
        float_model, method, framework, use_internal_tpc, use_mixed_precision, 
        representative_dataset_gen, param_items)
    return flag, quantized_model
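
The `activation_error_method` entry above selects MSE as the error metric. As a rough illustration of the idea, and not MCT's implementation, the sketch below picks a clipping threshold for symmetric 8-bit uniform quantization by minimizing the mean squared error over a few candidates. The function names, the synthetic Gaussian activations, and the candidate grid are all assumptions made for this example.

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, threshold: float, n_bits: int = 8) -> np.ndarray:
    """Uniform symmetric quantization of x to n_bits, clipping at +/- threshold."""
    levels = 2 ** (n_bits - 1)
    scale = threshold / levels
    q = np.clip(np.round(x / scale), -levels, levels - 1)
    return q * scale

def select_threshold_mse(x: np.ndarray, candidates, n_bits: int = 8):
    """Return the candidate threshold with the lowest quantization MSE, and that MSE."""
    errors = [float(np.mean((x - quantize_symmetric(x, t, n_bits)) ** 2)) for t in candidates]
    best = int(np.argmin(errors))
    return candidates[best], errors[best]

# Synthetic activations stand in for statistics collected from representative data
rng = np.random.default_rng(0)
activations = rng.normal(0.0, 1.0, size=10_000)
max_abs = float(np.max(np.abs(activations)))
candidates = [max_abs * f for f in (0.25, 0.5, 0.75, 1.0)]

best_threshold, best_mse = select_threshold_mse(activations, candidates)
print(best_threshold, best_mse)
```

A smaller threshold gives finer resolution but clips more values; the MSE criterion trades these two error sources against each other.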

Run PTQ + Mixed Precision Quantization with Keras

In [8]:
@decorator
def PTQ_Keras_mixed_precision(float_model):
    """
    Perform Post-Training Quantization with Mixed Precision (PTQ + mixed_precision) on Keras model.
    
    Mixed Precision Quantization:
    - Uses different bit-widths for different layers
    - Optimizes model size while maintaining accuracy
    - Automatically selects optimal precision for each layer
    - Uses resource constraints to guide precision allocation
    
    Args:
        float_model: Original floating-point Keras model
    
    Returns:
        tuple: (success_flag, quantized_model)
    """
    # Configuration for PTQ with mixed precision
    method = 'PTQ'                    # Post-Training Quantization method
    framework = 'tensorflow'          # Target framework (Keras/TensorFlow)
    use_internal_tpc = True           # Use MCT's built-in Target Platform Capabilities
    use_mixed_precision = True        # Enable mixed-precision quantization

    # Parameter configuration for PTQ with Mixed Precision
    param_items = [
        ['tpc_version', '1.0', 'The version of the TPC to use.'],
        
        # Mixed precision configuration
        ['num_of_images', 5, 'Number of images for mixed precision analysis'],
        ['use_hessian_based_scores', False, 'Use Hessian-based sensitivity scores for layer importance'],
        
        # Resource constraint configuration
        ['weights_compression_ratio', 0.75, 'Target compression ratio for model weights (75% of original size)'],
        
        # Output configuration
        ['save_model_path', './qmodel_PTQ_Keras_mixed_precision.tflite', 'Path to save the mixed precision quantized model']
    ]

    # Execute mixed precision quantization using MCTWrapper
    wrapper = mct.wrapper.mct_wrapper.MCTWrapper()
    flag, quantized_model = wrapper.quantize_and_export(
        float_model, method, framework, use_internal_tpc, use_mixed_precision, 
        representative_dataset_gen, param_items)
    return flag, quantized_model
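
To make the `weights_compression_ratio` setting concrete, the sketch below computes the weights-memory budget it would imply if the ratio simply multiplies a baseline size, as the description "75% of original size" suggests. Which baseline MCT uses internally is not stated in this notebook, so the formula, the helper name `target_weights_bytes`, and the baseline figure are assumptions.

```python
def target_weights_bytes(baseline_bytes: int, compression_ratio: float) -> int:
    """Weights-memory budget implied by a ratio (assumed: budget = ratio * baseline)."""
    if not 0.0 < compression_ratio <= 1.0:
        raise ValueError("compression_ratio is expected to be in (0, 1]")
    return int(baseline_bytes * compression_ratio)

# Hypothetical baseline: 3,500,000 weights stored at 8 bits (1 byte) each
baseline = 3_500_000
budget = target_weights_bytes(baseline, 0.75)
print(budget)  # 2625000
```

Under this reading, the mixed precision search has to move enough layers below the baseline bit-width to fit the remaining 75% budget.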

Run GPTQ (Gradient-based PTQ) with Keras

In [9]:
@decorator
def GPTQ_Keras(float_model):
    """
    Perform Gradient-based Post-Training Quantization (GPTQ) on Keras model.
    
    GPTQ is an advanced quantization method that:
    - Uses gradient information to optimize quantization parameters
    - Fine-tunes the model during quantization process
    - Generally provides better accuracy than standard PTQ
    - Requires more computation time than PTQ because of the fine-tuning step
    
    Args:
        float_model: Original floating-point Keras model
    
    Returns:
        tuple: (success_flag, quantized_model)
    """
    # Configuration for GPTQ quantization
    method = 'GPTQ'                   # Gradient-based Post-Training Quantization
    framework = 'tensorflow'          # Target framework (Keras/TensorFlow)
    use_internal_tpc = True           # Use MCT's built-in Target Platform Capabilities
    use_mixed_precision = False       # Disable mixed-precision quantization

    # Parameter configuration for GPTQ
    param_items = [
        # Platform configuration
        ['target_platform_version', 'v1', 'Target platform capabilities version'],
        
        # GPTQ-specific training parameters
        ['n_epochs', 5, 'Number of epochs for gradient-based fine-tuning'],
        ['optimizer', None, 'Optimizer for fine-tuning (None = use default)'],
        
        # Output configuration
        ['save_model_path', './qmodel_GPTQ_Keras.tflite', 'Path to save the GPTQ quantized model']
    ]

    # Execute GPTQ quantization using MCTWrapper
    wrapper = mct.wrapper.mct_wrapper.MCTWrapper()
    flag, quantized_model = wrapper.quantize_and_export(
        float_model, method, framework, use_internal_tpc, use_mixed_precision, 
        representative_dataset_gen, param_items)
    return flag, quantized_model

Run GPTQ + Mixed Precision Quantization with Keras

In [10]:
@decorator
def GPTQ_Keras_mixed_precision(float_model):
    """
    Perform Gradient-based Post-Training Quantization with Mixed Precision (GPTQ + mixed_precision).
    
    This combines the benefits of both techniques:
    - GPTQ: Gradient-based optimization for better quantization accuracy
    - Mixed Precision: Optimal bit-width allocation for size/accuracy trade-off
    
    The combination aims for:
    - Accuracy preservation through gradient-based fine-tuning
    - Model size reduction through lower bit-widths where layers tolerate them
    - Automatic precision selection per layer
    
    Args:
        float_model: Original floating-point Keras model
    
    Returns:
        tuple: (success_flag, quantized_model)
    """
    # Configuration for GPTQ with mixed precision
    method = 'GPTQ'                   # Gradient-based Post-Training Quantization
    framework = 'tensorflow'          # Target framework (Keras/TensorFlow)
    use_internal_tpc = True           # Use MCT's built-in Target Platform Capabilities
    use_mixed_precision = True        # Enable mixed-precision quantization

    # Parameter configuration for GPTQ with Mixed Precision
    param_items = [
        # Platform configuration
        ['target_platform_version', 'v1', 'Target platform capabilities version'],
        
        # GPTQ-specific training parameters
        ['n_epochs', 5, 'Number of epochs for gradient-based fine-tuning'],
        ['optimizer', None, 'Optimizer for fine-tuning (None = use default)'],
        
        # Mixed precision configuration
        ['num_of_images', 5, 'Number of images for mixed precision sensitivity analysis'],
        ['use_hessian_based_scores', False, 'Use Hessian-based scores for layer importance ranking'],
        
        # Resource constraint configuration
        ['weights_compression_ratio', 0.75, 'Target compression ratio for model weights (75% of original size)'],
        
        # Output configuration
        ['save_model_path', './qmodel_GPTQ_Keras_mixed_precision.tflite', 'Path to save the GPTQ+mixed_precision quantized model']
    ]

    # Execute advanced GPTQ+mixed_precision quantization using MCTWrapper
    wrapper = mct.wrapper.mct_wrapper.MCTWrapper()
    flag, quantized_model = wrapper.quantize_and_export(
        float_model, method, framework, use_internal_tpc, use_mixed_precision, 
        representative_dataset_gen, param_items)
    return flag, quantized_model
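
All four functions pass their settings as a list of `[name, value, description]` triples. When comparing configurations, it can help to collapse such a list into a plain mapping and catch accidental duplicates. The helper `params_to_dict` below is hypothetical; how MCTWrapper consumes `param_items` internally is not shown in this notebook.

```python
from typing import Any, Dict, List

ParamItem = List[Any]  # [name, value, description]

def params_to_dict(param_items: List[ParamItem]) -> Dict[str, Any]:
    """Collapse [name, value, description] triples into a name -> value mapping."""
    params: Dict[str, Any] = {}
    for name, value, _description in param_items:
        if name in params:
            raise ValueError(f"duplicate parameter: {name}")
        params[name] = value
    return params

# A shortened version of the PTQ + mixed precision configuration
ptq_mp_items = [
    ['tpc_version', '1.0', 'The version of the TPC to use.'],
    ['num_of_images', 5, 'Number of images for mixed precision analysis'],
    ['weights_compression_ratio', 0.75, 'Target compression ratio for model weights'],
]
params = params_to_dict(ptq_mp_items)
print(params)
```

Comparing two such mappings makes it easy to see that the PTQ and GPTQ variants differ mainly in the `method` string, the mixed precision flag, and a handful of parameters.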

### Run model Post-Training Quantization
Next, we quantize the model with each of the four methods using the MCTWrapper API.

In [11]:
# Load pre-trained MobileNetV2 model as the base model for quantization experiments
# This model serves as the reference floating-point model for all quantization methods
float_model = MobileNetV2()

# Execute comprehensive quantization method comparison using MCT Wrapper functionality
# Each method represents different trade-offs between accuracy, model size, and computation time
print("Starting quantization experiments with different methods...")

Starting quantization experiments with different methods...


In [12]:
# Method 1: Basic Post-Training Quantization (PTQ)
# - Standard 8-bit quantization without advanced optimization techniques
flag, quantized_model = PTQ_Keras(float_model)

----------------- PTQ_Keras Start ---------------


Statistics Collection: 2it [00:01,  1.18it/s]
Calculating quantization parameters: 100%|████████████████████████████████| 105/105 [00:09<00:00, 11.52it/s]


INFO:tensorflow:Assets written to: /tmp/tmp7xdcy5yr/assets


----------------- PTQ_Keras End -----------------



In [13]:
# Method 2: PTQ with Mixed Precision Quantization
# - Uses different bit-widths for different layers based on sensitivity analysis
flag, quantized_model2 = PTQ_Keras_mixed_precision(float_model)

----------------- PTQ_Keras_mixed_precision Start ---------------


Statistics Collection: 2it [00:01,  1.20it/s]
Calculating quantization parameters: 100%|████████████████████████████████| 105/105 [00:25<00:00,  4.13it/s]
53it [00:22,  2.40it/s]



INFO:tensorflow:Assets written to: /tmp/tmp8emk5ahg/assets


----------------- PTQ_Keras_mixed_precision End -----------------



In [14]:
# Method 3: Gradient-based Post-Training Quantization (GPTQ)
# - Uses gradient information to fine-tune quantization parameters during conversion
flag, quantized_model3 = GPTQ_Keras(float_model)

----------------- GPTQ_Keras Start ---------------


Statistics Collection: 2it [00:01,  1.14it/s]
Calculating quantization parameters: 100%|████████████████████████████████| 105/105 [00:09<00:00, 11.65it/s]
Estimating representative dataset size: 2it [00:00, 75.36it/s]
100%|█████████████████████████████████████████████████████████████████████| 100/100 [09:25<00:00,  5.66s/it]

INFO:tensorflow:Assets written to: /tmp/tmphpy5r3ow/assets


----------------- GPTQ_Keras End -----------------



In [15]:
# Method 4: GPTQ with Mixed Precision Quantization
# - Combines gradient-based optimization with mixed precision techniques
flag, quantized_model4 = GPTQ_Keras_mixed_precision(float_model)

----------------- GPTQ_Keras_mixed_precision Start ---------------


Statistics Collection: 2it [00:01,  1.21it/s]
Calculating quantization parameters: 100%|████████████████████████████████| 105/105 [00:26<00:00,  3.91it/s]
53it [00:22,  2.38it/s]



100%|█████████████████████████████████████████████████████████████████████| 100/100 [09:43<00:00,  5.84s/it]

INFO:tensorflow:Assets written to: /tmp/tmph7rh4ek7/assets


----------------- GPTQ_Keras_mixed_precision End -----------------



In [16]:
print("All quantization methods completed successfully!")

All quantization methods completed successfully!
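
Each function above writes a `.tflite` file to its `save_model_path`. A simple way to compare the exported models is to look at their sizes on disk, as sketched below. The helper `tflite_sizes_mb` is hypothetical, and whether the file size reflects the selected bit-widths depends on the export format, which this notebook does not examine.

```python
import os
from typing import Dict, Iterable

def tflite_sizes_mb(paths: Iterable[str]) -> Dict[str, float]:
    """Return the size in MB of each path that exists; missing files are skipped."""
    return {p: os.path.getsize(p) / 1e6 for p in paths if os.path.isfile(p)}

# Paths taken from the save_model_path entries used in this notebook
exported = [
    './qmodel_PTQ_Keras.tflite',
    './qmodel_PTQ_Keras_mixed_precision.tflite',
    './qmodel_GPTQ_Keras.tflite',
    './qmodel_GPTQ_Keras_mixed_precision.tflite',
]
for path, size_mb in tflite_sizes_mb(exported).items():
    print(f"{path}: {size_mb:.2f} MB")
```

Files that were not produced, for example because a method was skipped, are left out of the report rather than raising an error.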


## Model Evaluation
In order to evaluate our models, we first need to load the validation dataset. As before, please ensure that the dataset path has been set correctly.

In [17]:
# Model Evaluation and Accuracy Comparison
print("Starting model evaluation phase...")

# Prepare validation dataset for accuracy assessment
val_dataset = get_dataset(batch_size=50, shuffle=False)

# Evaluate original floating-point model accuracy
# (metrics is passed as a list, which newer Keras versions require)
print("\n=== Original Model Evaluation ===")
float_model.compile(loss=keras.losses.SparseCategoricalCrossentropy(), metrics=["accuracy"])
float_accuracy = float_model.evaluate(val_dataset)
print(f"Float model's Top 1 accuracy on the Imagenet validation set: {(float_accuracy[1] * 100):.2f}%")

# Evaluate PTQ quantized model accuracy
print("\n=== PTQ Model Evaluation ===")
quantized_model.compile(loss=keras.losses.SparseCategoricalCrossentropy(), metrics=["accuracy"])
quantized_accuracy = quantized_model.evaluate(val_dataset)
print(f"PTQ_Keras Quantized model's Top 1 accuracy on the Imagenet validation set: {(quantized_accuracy[1] * 100):.2f}%")

# Evaluate PTQ + Mixed Precision model accuracy
print("\n=== PTQ + Mixed Precision Model Evaluation ===")
quantized_model2.compile(loss=keras.losses.SparseCategoricalCrossentropy(), metrics=["accuracy"])
quantized_accuracy = quantized_model2.evaluate(val_dataset)
print(f"PTQ_Keras_mixed_precision Quantized model's Top 1 accuracy on the Imagenet validation set: {(quantized_accuracy[1] * 100):.2f}%")

# Evaluate GPTQ quantized model accuracy
print("\n=== GPTQ Model Evaluation ===")
quantized_model3.compile(loss=keras.losses.SparseCategoricalCrossentropy(), metrics=["accuracy"])
quantized_accuracy = quantized_model3.evaluate(val_dataset)
print(f"GPTQ_Keras Quantized model's Top 1 accuracy on the Imagenet validation set: {(quantized_accuracy[1] * 100):.2f}%")

# Evaluate GPTQ + Mixed Precision model accuracy
print("\n=== GPTQ + Mixed Precision Model Evaluation ===")
quantized_model4.compile(loss=keras.losses.SparseCategoricalCrossentropy(), metrics=["accuracy"])
quantized_accuracy = quantized_model4.evaluate(val_dataset)
print(f"GPTQ_Keras_mixed_precision Quantized model's Top 1 accuracy on the Imagenet validation set: {(quantized_accuracy[1] * 100):.2f}%")

print("Finish")

Starting model evaluation phase...
Found 50000 files belonging to 1000 classes.

=== Original Model Evaluation ===
Float model's Top 1 accuracy on the Imagenet validation set: 71.85%

=== PTQ Model Evaluation ===
PTQ_Keras Quantized model's Top 1 accuracy on the Imagenet validation set: 71.67%

=== PTQ + Mixed Precision Model Evaluation ===
PTQ_Keras_mixed_precision Quantized model's Top 1 accuracy on the Imagenet validation set: 71.53%

=== GPTQ Model Evaluation ===
GPTQ_Keras Quantized model's Top 1 accuracy on the Imagenet validation set: 71.37%

=== GPTQ + Mixed Precision Model Evaluation ===
GPTQ_Keras_mixed_precision Quantized model's Top 1 accuracy on the Imagenet validation set: 71.08%
Finish
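
The accuracies printed above can be summarized as the drop relative to the float model. The sketch below hard-codes the reported Top-1 values and computes each difference in percentage points. Keep in mind that calibration used only 10 images (`batch_size = 5`, `n_iter = 2`), so small differences between methods should be read with caution.

```python
# Top-1 accuracies (%) copied from the evaluation output above
float_top1 = 71.85
quantized_top1 = {
    'PTQ': 71.67,
    'PTQ + mixed precision': 71.53,
    'GPTQ': 71.37,
    'GPTQ + mixed precision': 71.08,
}

# Accuracy drop in percentage points relative to the float model
drops = {name: round(float_top1 - acc, 2) for name, acc in quantized_top1.items()}
for name, drop in drops.items():
    print(f"{name}: -{drop:.2f} points vs. float")
```

In this run, every quantized model stays within 0.77 points of the float baseline.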


## Conclusion

In this tutorial, we demonstrated how to quantize a pre-trained model using MCTWrapper with a few lines of code.

MCT can deliver competitive results across a wide range of tasks and network architectures. For more details, see the [paper](https://arxiv.org/abs/2109.09113).

## Copyrights

Copyright 2025 Sony Semiconductor Solutions, Inc. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
