# Neural network quantization

Today we will deal with neural network quantization!

Remember that our goal is to reduce network size while keeping the accuracy high!

For this purpose we will use Intel's OpenVINO and the Neural Network Compression Framework (NNCF). Be aware that there are other frameworks to choose from: built-in PyTorch quantization, Brevitas from Xilinx, TensorRT, and others.

Use this link for the OpenVINO reference and documentation: https://docs.openvino.ai/2023.0/home.html

First, install and import the necessary libraries.

In [1]:
!pip3 install openvino
!pip3 install nncf
!pip3 install torch
!pip3 install torchvision
!pip3 install matplotlib
!pip3 install tqdm



In [2]:
import torch
import nncf
import openvino as ov
import time
import numpy as np
import tqdm
import os

from nncf import NNCFConfig
from nncf.torch import create_compressed_model, register_default_init_args
from openvino.runtime.ie_api import CompiledModel
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor, RandomRotation
from typing import Union, List, Tuple, Any
from abc import ABC, abstractmethod


INFO:nncf:NNCF initialized successfully. Supported frameworks detected: torch, tensorflow, openvino


No CUDA runtime is found, using CUDA_HOME='C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1'


In [3]:
# Reduce NNCF log noise: keep warnings and errors, hide INFO and DEBUG messages
import logging
# Get the logger for the specific library or use 'root' to set it globally
logger = logging.getLogger('nncf')  # Replace 'nncf' with the specific name if it's different
# Set the logging level to WARNING to suppress INFO and DEBUG messages
logger.setLevel(logging.WARNING)

Let's start with...

## Quantizing Models Post-training

Post-training model optimization is the process of applying special methods that transform the model into a more hardware-friendly representation without retraining or fine-tuning. The most popular and widespread method here is 8-bit post-training quantization because:

- It is easy to use.
- It does not hurt accuracy much.
- It provides a significant performance improvement.
- It suits most off-the-shelf hardware, since most of it supports 8-bit computation natively.

8-bit integer quantization lowers the precision of weights and activations to 8 bits, which leads to a significant reduction in the model footprint and significant improvements in inference speed.

Source: https://docs.openvino.ai/2023.0/ptq_introduction.html
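To make the idea concrete, here is a minimal pure-Python sketch of 8-bit affine quantization, assuming a simple min/max range and unsigned 8-bit codes. The helper names `quantize_uint8` and `dequantize` are illustrative and not part of OpenVINO or NNCF, whose actual schemes (for example per-channel scales or symmetric ranges) may differ.

```python
def quantize_uint8(values):
    """Map floats to 8-bit codes in [0, 255] using a scale and a zero point."""
    lo = min(min(values), 0.0)  # extend the range so that 0.0 is representable
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 if all values are zero
    zero_point = round(-lo / scale)
    codes = [max(0, min(255, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize(codes, scale, zero_point):
    """Recover approximate float values from the 8-bit codes."""
    return [(c - zero_point) * scale for c in codes]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
codes, scale, zero_point = quantize_uint8(weights)
restored = dequantize(codes, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, round(scale, 5), zero_point)
print(f"max round-trip error: {max_error:.5f}")
```

Each value now needs one byte instead of the four bytes of an FP32 number, at the cost of a round-trip error that stays within one quantization step (`scale`).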

First, we need a model to quantize.
Reuse the CNN model from Laboratory 1 (along with the training loops, metrics, optimizers, and loss function).

Train it for 5 epochs on the MNIST dataset. You should get roughly 90% accuracy or better.

Name the final trained model `CNN_MNIST`

In [4]:
import os
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam
from torchvision import datasets, transforms

# Define the CNN model
class CNN(nn.Module):
    def __init__(self, input_channels=1):
        super(CNN, self).__init__()
        self.conv1 = nn.Conv2d(input_channels, 32, kernel_size=5, padding=2)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=5, padding=2)
        self.fc1 = nn.Linear(64*7*7 if input_channels==1 else 64*8*8, 128)  # MNIST is 28x28, CIFAR10 is 32x32
        self.fc2 = nn.Linear(128, 10)
        
    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)

# Choose Dataset: 'MNIST' or 'CIFAR10'
dataset_choice = 'MNIST'  # You can change this to 'CIFAR10' if needed

# Set device
torch_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Prepare datasets and dataloaders
transform = transforms.Compose([transforms.ToTensor()])

if dataset_choice == 'MNIST':
    train_dataset = datasets.MNIST(root="./data", train=True, transform=transform, download=True)
    test_dataset = datasets.MNIST(root="./data", train=False, transform=transform)
    input_channels = 1
else:
    transform.transforms.append(transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)))
    train_dataset = datasets.CIFAR10(root="./data", train=True, transform=transform, download=True)
    test_dataset = datasets.CIFAR10(root="./data", train=False, transform=transform)
    input_channels = 3

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=32, shuffle=False)

# Model, loss, and optimizer (the accuracy metric is defined in a later cell)
CNN_Model = CNN(input_channels=input_channels).to(torch_device)
loss_fcn = nn.CrossEntropyLoss().to(torch_device)
optimizer = Adam(CNN_Model.parameters(), lr=0.001)

# Train and test loops
def train_test_pass(model, data_loader, loss_function, opt=None):
    is_train = model.training
    total_loss = 0.0
    correct_preds = 0
    
    for inputs, targets in data_loader:
        inputs, targets = inputs.to(torch_device), targets.to(torch_device)
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        total_loss += loss.item() * inputs.size(0)
        
        if is_train:
            opt.zero_grad()
            loss.backward()
            opt.step()
        
        _, preds = torch.max(outputs, 1)
        correct_preds += (preds == targets).sum().item()
    
    avg_loss = total_loss / len(data_loader.dataset)
    accuracy = 100.0 * correct_preds / len(data_loader.dataset)
    
    return avg_loss, accuracy

def training(model, train_dl, test_dl, loss_fcn, optimizer, epochs=5):
    history = {"train_loss": [], "train_accuracy": [], "test_loss": [], "test_accuracy": []}
    for epoch in range(epochs):
        model.train()
        train_loss, train_accuracy = train_test_pass(model, train_dl, loss_fcn, optimizer)
        history["train_loss"].append(train_loss)
        history["train_accuracy"].append(train_accuracy)
        
        model.eval()
        with torch.no_grad():
            test_loss, test_accuracy = train_test_pass(model, test_dl, loss_fcn)
            history["test_loss"].append(test_loss)
            history["test_accuracy"].append(test_accuracy)
        
        print(f"Epoch {epoch+1}/{epochs} => "
              f"Train loss: {train_loss:.4f}, Train accuracy: {train_accuracy:.2f}% | "
              f"Test loss: {test_loss:.4f}, Test accuracy: {test_accuracy:.2f}%")
    return model, history

# File path to save and load the model
MODEL_PATH = "CNN_MNIST_model.pth"

# # Check if the model file exists
# if os.path.exists(MODEL_PATH):
#     # Load the model from the saved file
#     CNN_Model.load_state_dict(torch.load(MODEL_PATH))
#     print("Model loaded from file.")
# else:
#     pass
#     # Train the model
#     CNN_Model, history = training(CNN_Model, train_loader, test_loader, loss_fcn, optimizer, epochs=5)
#     # Save the trained model
#     torch.save(CNN_Model.state_dict(), MODEL_PATH)
#     print(f"Model saved to {MODEL_PATH}.")
CNN_Model, history = training(CNN_Model, train_loader, test_loader, loss_fcn, optimizer, epochs=5)
CNN_MNIST = CNN_Model  # alias matching the model name used in the task description

Epoch 1/5 => Train loss: 0.1213, Train accuracy: 96.28% | Test loss: 0.0349, Test accuracy: 98.80%
Epoch 2/5 => Train loss: 0.0394, Train accuracy: 98.79% | Test loss: 0.0250, Test accuracy: 99.15%
Epoch 3/5 => Train loss: 0.0271, Train accuracy: 99.15% | Test loss: 0.0390, Test accuracy: 98.74%
Epoch 4/5 => Train loss: 0.0192, Train accuracy: 99.39% | Test loss: 0.0444, Test accuracy: 98.79%
Epoch 5/5 => Train loss: 0.0142, Train accuracy: 99.54% | Test loss: 0.0286, Test accuracy: 99.16%


Now we will quantize this model to INT8.

NNCF enables post-training quantization (PTQ) by adding the quantization layers into the model graph and then using a subset of the training dataset to initialize the parameters of these additional quantization layers.

By default PTQ uses an unannotated dataset to perform quantization. It uses representative dataset items to estimate the range of activation values in a network and then quantizes the network.
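The range estimation described above can be sketched as follows. This is a simplified illustration in plain Python, assuming a min/max statistic over a handful of calibration batches; `RangeObserver` is a hypothetical helper, and the statistic collectors NNCF actually uses are more elaborate.

```python
class RangeObserver:
    """Tracks the running min/max of activation values seen during calibration."""

    def __init__(self):
        self.low = float("inf")
        self.high = float("-inf")

    def update(self, activations):
        self.low = min(self.low, min(activations))
        self.high = max(self.high, max(activations))

    def scale(self, levels=256):
        """Step size that spreads `levels` codes over the observed range."""
        return (self.high - self.low) / (levels - 1)


# Stand-in for activations recorded on three unlabeled calibration batches.
calibration_batches = [
    [0.0, 0.8, 2.1],
    [0.1, 3.4, 1.2],
    [0.0, 5.1, 0.7],
]

observer = RangeObserver()
for batch in calibration_batches:
    observer.update(batch)

print(f"range: [{observer.low}, {observer.high}], scale: {observer.scale():.4f}")
```

Note that no labels are involved: only the input samples are needed to observe the activation ranges, which is why an unannotated dataset is sufficient.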

Create an instance of the `nncf.Dataset` class by passing two parameters:
- `data_source` (a PyTorch loader containing training samples)
- `transform_fn` (a function that makes each data item suitable for the API).

Call this instance `calibration_dataset`.

Then quantize the `CNN_MNIST` model with the `nncf.quantize()` function, which takes two parameters: the model and `calibration_dataset`. Call the result `quantized_model`.

In [5]:
# Define the transformation function for the calibration dataset
def transform_fn(data_item):
    images, _ = data_item
    return images

# Create the calibration dataset
calibration_dataset = nncf.Dataset(train_loader, transform_fn)

# Quantize the CNN_MNIST model
quantized_model = nncf.quantize(model=CNN_Model, calibration_dataset=calibration_dataset)


Reason: Command '['where', 'cl']' returned non-zero exit status 1.




In [6]:
from abc import ABC, abstractmethod
from typing import Any

class BaseMetic(ABC):

    @abstractmethod
    def __call__(self, y_pred, y_ref) -> Any:
        raise NotImplementedError()

class AccuracyMetic(BaseMetic):

    def __init__(self) -> None:
        pass

    @torch.no_grad()
    def __call__(self, y_pred: torch.Tensor, y_ref: torch.Tensor) -> torch.Tensor:
        """
        :param y_pred: tensor of shape (batch_size, num_of_classes) type float
        :param y_ref: tensor with shape (batch_size,) and type Long
        :return: scalar tensor with accuracy metric for batch
        """
        # Get the predicted class indices by taking argmax along dimension 1
        predicted_classes = torch.argmax(y_pred, dim=1)

        # Calculate the accuracy by comparing predicted and reference labels
        correct_predictions = (predicted_classes == y_ref).sum().item()
        total_samples = y_ref.size(0)

        # Compute the accuracy as a scalar tensor
        accuracy = torch.tensor(correct_predictions / total_samples, dtype=torch.float32)

        return accuracy


metric = AccuracyMetic()

Finally, we will convert the models to the OpenVINO Intermediate Representation (IR) format.

OpenVINO IR is the proprietary model format of OpenVINO. It is produced by converting a model with the model conversion API. The model conversion API translates frequently used deep learning operations to their respective similar representations in OpenVINO and tunes them with the associated weights and biases from the trained model. The resulting IR consists of two files:
- `xml` - Describes the model topology.
- `bin` - Contains the weights and binary data.
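Since the two files always travel together, a small helper can derive the `.bin` path from the `.xml` path. This is a sketch using only `pathlib`; the `ir_files` name is hypothetical, and it assumes the weights file sits next to the `.xml` file with the same base name, as is the case for the files saved in this notebook.

```python
from pathlib import Path


def ir_files(xml_path):
    """Return the (.xml, .bin) pair that makes up one OpenVINO IR model."""
    xml_file = Path(xml_path)
    if xml_file.suffix != ".xml":
        raise ValueError(f"expected an .xml file, got {xml_file.name}")
    return xml_file, xml_file.with_suffix(".bin")


xml_file, bin_file = ir_files("MNIST_fp32_ir.xml")
print(xml_file.name, bin_file.name)
```

Both files are needed to load the model again, so copy or archive them as a pair.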

To convert a model, we'll need a `dummy_input` filled with random values, of size:

`[batch_size, channel_number, image_shape[0], image_shape[1]]`

Create the `MNIST_fp32_ir` model with `ov.convert_model`, which takes three parameters: the model, the dummy input, and the input size. Use the `CNN_MNIST` model.

Then create the `MNIST_int8_ir` model in the same way using `quantized_model`.

Save both models to files (named `MNIST_fp32_ir.xml` and `MNIST_int8_ir.xml`, respectively) with the `ov.save_model()` function.

Finally, compile both models with the `core.compile_model` function and use the `validate` function to calculate the accuracy of both models.

In [7]:
# 1. Create a dummy input for the model.
# Assuming MNIST images are 1x28x28 and a batch size of 32
dummy_input = torch.randn(32, 1, 28, 28)

# 2. Convert models to OpenVINO IR format
MNIST_fp32_ir = ov.convert_model(CNN_Model, example_input=dummy_input, input=[-1, 1, 28, 28])
MNIST_int8_ir = ov.convert_model(quantized_model, example_input=dummy_input, input=[-1, 1, 28, 28])

# 3. Save models to XML files
ov.save_model(MNIST_fp32_ir, "MNIST_fp32_ir.xml")
ov.save_model(MNIST_int8_ir, "MNIST_int8_ir.xml")

# 4. Compile models
core = ov.Core()
devices = core.available_devices
fp32_compiled_model = core.compile_model(MNIST_fp32_ir, devices[0])
int8_compiled_model = core.compile_model(MNIST_int8_ir, devices[0])


def validate(val_loader: torch.utils.data.DataLoader, model: Union[torch.nn.Module, CompiledModel], metric: AccuracyMetic):

    # Switch to evaluate mode.
    if not isinstance(model, CompiledModel):
        model.eval()
        model.to(torch_device)
    total_accuracy = 0
    samples_num = 0
    with torch.no_grad():
        for i, (images, target) in tqdm.tqdm(enumerate(val_loader)):
            # Compute the output.
            if isinstance(model, CompiledModel):
                # OpenVINO works on host memory, so keep the batch on the CPU.
                output_layer = model.output(0)
                output = model(images.numpy())[output_layer]
                output = torch.from_numpy(output)
            else:
                # Only the PyTorch model needs the data on torch_device.
                images = images.to(torch_device)
                target = target.to(torch_device)
                output = model(images)

            # Measure accuracy and record loss.
            accuracy = metric(output, target)
            total_accuracy += accuracy.item() * target.shape[0]
            samples_num += target.shape[0]

    return total_accuracy / samples_num


# 5. Calculate and print accuracy for both models using the validate function
acc1 = validate(test_loader, fp32_compiled_model, metric)
print(f'FP 32 model acc={acc1:.4f}')

acc2 = validate(test_loader, int8_compiled_model, metric)
print(f'INT 8 model acc={acc2:.4f}')





313it [00:01, 179.12it/s]


FP 32 model acc=0.9916


313it [00:01, 234.12it/s]

INT 8 model acc=0.9917





Is the INT8 model accuracy similar to the FP32 model accuracy? We should hope so!

Now let's verify what we have gained in terms of memory footprint and network throughput!

First, check the size of the OpenVINO IR binary files. You saved both of them on your drive. Is the INT8 model smaller?

In [8]:
# !ls -lh MNIST_fp32_ir.bin
# !ls -lh MNIST_int8_ir.bin
import os

# Get file sizes
size_fp32 = os.path.getsize("MNIST_fp32_ir.bin")
size_int8 = os.path.getsize("MNIST_int8_ir.bin")

# Calculate the compression ratio
compression_ratio = size_fp32 / size_int8

print(f"Size of FP32 model: {size_fp32 / (1024 * 1024):.2f} MB")
print(f"Size of INT8 model: {size_int8 / (1024 * 1024):.2f} MB")
print(f"Compression ratio: {compression_ratio:.2f}")


Size of FP32 model: 0.87 MB
Size of INT8 model: 0.43 MB
Compression ratio: 2.00
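The reported sizes can be sanity-checked by counting parameters. The sketch below recomputes the parameter count of the `CNN` class defined earlier and estimates the weight storage at different precisions. The 2.00 ratio (rather than 4.00) is consistent with the FP32 model having been stored with 16-bit weights, which is the default behaviour of `ov.save_model` (`compress_to_fp16=True`) in recent OpenVINO releases; treat that explanation as an assumption to verify against your installed version.

```python
def conv_params(in_ch, out_ch, kernel):
    """Weights plus biases of a square-kernel Conv2d layer."""
    return in_ch * out_ch * kernel * kernel + out_ch


def linear_params(in_features, out_features):
    """Weights plus biases of a Linear layer."""
    return in_features * out_features + out_features


def cnn_params(input_channels, image_size):
    """Parameter count of the CNN used in this notebook."""
    pooled = image_size // 4  # two 2x2 max-pool layers halve the size twice
    return (
        conv_params(input_channels, 32, 5)
        + conv_params(32, 64, 5)
        + linear_params(64 * pooled * pooled, 128)
        + linear_params(128, 10)
    )


def size_mb(num_params, bytes_per_param):
    """Storage needed for the parameters alone, in MB."""
    return num_params * bytes_per_param / (1024 * 1024)


mnist_params = cnn_params(input_channels=1, image_size=28)
print(f"parameters: {mnist_params}")
for name, width in [("FP32", 4), ("FP16", 2), ("INT8", 1)]:
    print(f"{name}: {size_mb(mnist_params, width):.2f} MB")
```

For the MNIST model this gives roughly 0.87 MB at 16 bits and 0.43 MB at 8 bits, in line with the file sizes above. The estimate is approximate, because the real INT8 file also stores quantization parameters.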


Then, use the following code to benchmark both models. Is the INT8 model faster?

In [9]:
# def parse_benchmark_output(benchmark_output: str):
#     """Prints the output from benchmark_app in human-readable format"""
#     parsed_output = [line for line in benchmark_output if 'FPS' in line]
#     print(*parsed_output, sep='\n')


# print('Benchmark FP32 model on CPU')
# benchmark_output = ! benchmark_app -m MNIST_fp32_ir.xml -d CPU -api async -t 15 -shape "[1, 1, 28, 28]"
# parse_benchmark_output(benchmark_output)

# print('Benchmark INT8 model on CPU')
# benchmark_output = ! benchmark_app -m MNIST_int8_ir.xml -d CPU -api async -t 15 -shape "[1, 1, 28, 28]"
# parse_benchmark_output(benchmark_output)

import subprocess

def parse_benchmark_output(benchmark_output: bytes):
    """Prints the throughput (FPS) lines from the raw benchmark_app stdout"""
    parsed_output = [line for line in benchmark_output.decode('utf-8').split('\n') if 'FPS' in line]
    print(*parsed_output, sep='\n')

def run_benchmark(model_file: str, device: str = "CPU"):
    """Runs benchmark_app for a given model on the specified device and returns its output"""
    command = [
        "benchmark_app",
        "-m", model_file,
        "-d", device,
        "-api", "async",
        "-t", "5",
        "-shape", "[1, 1, 28, 28]"
    ]
    result = subprocess.run(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return result.stdout


print('Benchmark FP32 model on CPU')
benchmark_output = run_benchmark("MNIST_fp32_ir.xml")
parse_benchmark_output(benchmark_output)

print('Benchmark INT8 model on CPU')
benchmark_output = run_benchmark("MNIST_int8_ir.xml")
parse_benchmark_output(benchmark_output)

print('Benchmark FP32 model on GPU')
benchmark_output = run_benchmark("MNIST_fp32_ir.xml", device="GPU")
parse_benchmark_output(benchmark_output)

print('Benchmark INT8 model on GPU')
benchmark_output = run_benchmark("MNIST_int8_ir.xml", device="GPU")
parse_benchmark_output(benchmark_output)

Benchmark FP32 model on CPU
[ INFO ] Throughput:   16892.43 FPS
Benchmark INT8 model on CPU
[ INFO ] Throughput:   22529.36 FPS
Benchmark FP32 model on GPU
[ INFO ] Throughput:   4872.17 FPS
Benchmark INT8 model on GPU
[ INFO ] Throughput:   4316.03 FPS
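To compare the runs numerically rather than by eye, the throughput lines can be parsed and turned into a speed-up factor. The sketch below works on the output lines shown above; the regular expression assumes the `Throughput: <number> FPS` layout printed by this run of `benchmark_app`, which may differ between versions, and `parse_fps` is a hypothetical helper name.

```python
import re


def parse_fps(line):
    """Extract the throughput value from a benchmark_app 'Throughput' line."""
    match = re.search(r"Throughput:\s*([0-9.]+)\s*FPS", line)
    if match is None:
        raise ValueError(f"no throughput found in: {line!r}")
    return float(match.group(1))


# Lines copied from the benchmark output above.
results = {
    ("CPU", "FP32"): "[ INFO ] Throughput:   16892.43 FPS",
    ("CPU", "INT8"): "[ INFO ] Throughput:   22529.36 FPS",
    ("GPU", "FP32"): "[ INFO ] Throughput:   4872.17 FPS",
    ("GPU", "INT8"): "[ INFO ] Throughput:   4316.03 FPS",
}

for device in ("CPU", "GPU"):
    fp32 = parse_fps(results[(device, "FP32")])
    int8 = parse_fps(results[(device, "INT8")])
    print(f"{device}: INT8 / FP32 throughput = {int8 / fp32:.2f}x")
```

In this run INT8 is about 1.33x faster on the CPU but slightly slower (about 0.89x) on the GPU, so the gain depends on the target device.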


Note that we used a very small network and a very simple task. For bigger models and harder tasks, the performance and size differences can be even more significant!

***Extension exercises***

Read about `Quantizing with Accuracy Control` and try to use it for some pretrained network. Use `nncf.quantize_with_accuracy_control`. You can find pretrained networks with `torchvision.models`
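The core idea of accuracy control is to accept a quantized model only if its metric stays within a tolerated drop from the original. The sketch below shows just that acceptance rule in plain Python; `within_accuracy_budget` and the `max_drop` value are illustrative assumptions, and the actual `nncf.quantize_with_accuracy_control` routine does more (for example, it can revert some layers to higher precision).

```python
def within_accuracy_budget(reference_acc, quantized_acc, max_drop=0.01):
    """True if the quantized model loses at most `max_drop` (absolute) accuracy."""
    return (reference_acc - quantized_acc) <= max_drop


# Accuracies reported earlier in this notebook for the MNIST models.
fp32_acc, int8_acc = 0.9916, 0.9917
print(within_accuracy_budget(fp32_acc, int8_acc))

# A hypothetical model that drops from 0.90 to 0.85 would be rejected.
print(within_accuracy_budget(0.90, 0.85))
```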

## Quantization-aware Training (QAT)

Training-time model compression improves model performance by applying optimizations (such as quantization) during the training. The training process minimizes the loss associated with the lower-precision optimizations, so it is able to maintain the model’s accuracy while reducing its latency and memory footprint. Generally, training-time model optimization results in better model performance and accuracy than post-training optimization, but it can require more effort to set up.

Quantization-aware Training is a popular method that allows quantizing a model and applying fine-tuning to restore accuracy degradation caused by quantization. In fact, this is the most accurate quantization method.
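During QAT the forward pass typically simulates quantization with a quantize-then-dequantize step ("fake quantization"), while gradients are commonly passed through the rounding with a straight-through estimator. The following is a minimal scalar sketch of that idea under those assumptions; `fake_quantize` and `ste_gradient` are illustrative names, not NNCF functions.

```python
def fake_quantize(x, scale, qmin=-128, qmax=127):
    """Round x to the nearest point of the INT8 grid, but keep it as a float."""
    q = max(qmin, min(qmax, round(x / scale)))
    return q * scale


def ste_gradient(x, scale, qmin=-128, qmax=127):
    """Straight-through estimator: gradient 1 inside the range, 0 where clipped."""
    return 1.0 if qmin * scale <= x <= qmax * scale else 0.0


scale = 0.1
for x in (0.26, -0.04, 100.0):
    print(x, "->", round(fake_quantize(x, scale), 4), "grad:", ste_gradient(x, scale))
```

Because the network sees the rounding error during training, its weights can adapt to it, which is why QAT can recover accuracy that plain post-training quantization loses.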

For this part, let's use a slightly harder dataset. For MNIST, the PTQ method was enough, right?

Train your CNN model on the CIFAR10 dataset for 10-20 epochs (google it!). Use the same training loops, metrics, optimizers, and loss function.

Name the final trained model `CNN_CIFAR`, convert it to OpenVINO IR, and save it to an xml file.

We start our QAT process by creating a compressed model. Just use the following code (fill in the gaps).

In [10]:
import torchvision.transforms as transforms

# Set up CIFAR10 data loading
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# Define the loss function and optimizer
loss_fcn = torch.nn.CrossEntropyLoss()
CNN_CIFAR= CNN(input_channels=3).to(torch_device)
optimizer = torch.optim.Adam(CNN_CIFAR.parameters(), lr=0.01)

# Train the CNN model using CIFAR10
CNN_CIFAR, history = training(CNN_CIFAR, train_loader, test_loader, loss_fcn, optimizer, epochs=5)

# Convert trained model to OpenVINO IR and save
dummy_input = torch.randn([1, 3, 32, 32])  # CIFAR10 images are 3x32x32
CIFAR_fp32_ir = ov.convert_model(CNN_CIFAR, example_input=dummy_input, input=[-1, 3, 32, 32])
ov.save_model(CIFAR_fp32_ir, "CIFAR_fp32_ir.xml")

# Compress model using NNCF for Quantization-aware Training
nncf_config_dict = {
    "input_info": {"sample_size": [1, 3, 32, 32]},
    "compression": {
        "algorithm": "quantization",
    },
}
nncf_config = NNCFConfig.from_dict(nncf_config_dict)
nncf_config = register_default_init_args(nncf_config, train_loader)
compression_ctrl, CNN_CIFAR_int8 = create_compressed_model(CNN_CIFAR, nncf_config)


Files already downloaded and verified
Files already downloaded and verified
Epoch 1/5 => Train loss: 1.8311, Train accuracy: 33.80% | Test loss: 1.6774, Test accuracy: 39.71%
Epoch 2/5 => Train loss: 1.6161, Train accuracy: 41.32% | Test loss: 1.5918, Test accuracy: 41.90%
Epoch 3/5 => Train loss: 1.5601, Train accuracy: 43.91% | Test loss: 1.5640, Test accuracy: 44.64%
Epoch 4/5 => Train loss: 1.5141, Train accuracy: 45.11% | Test loss: 1.5337, Test accuracy: 45.29%
Epoch 5/5 => Train loss: 1.4775, Train accuracy: 46.61% | Test loss: 1.4903, Test accuracy: 48.05%


We have our CIFAR CNN model ready for QAT. So... just train it!

Use your `training` function to train the `CNN_CIFAR_int8` model for one more epoch!

Thanks to the OpenVINO API, after creating the compressed model all we need to do is continue training the INT8 model :) We call this process fine-tuning. It is applied to further improve the quantized model's accuracy! Normally, several epochs of tuning are required with a small learning rate, the same as is usually used at the end of training the original model. No other changes in the training pipeline are required.

In [11]:
# Define the loss function
loss_fcn_finetune = torch.nn.CrossEntropyLoss()

# Define the optimizer with a smaller learning rate for fine-tuning
optimizer_finetune = torch.optim.Adam(CNN_CIFAR_int8.parameters(), lr=0.0001)

# Fine-tuning the quantized model
CNN_CIFAR_int8_finetuned, history = training(CNN_CIFAR_int8, train_loader, test_loader, loss_fcn_finetune, optimizer_finetune, epochs=5)

Epoch 1/5 => Train loss: 1.3666, Train accuracy: 50.73% | Test loss: 1.4184, Test accuracy: 49.76%
Epoch 2/5 => Train loss: 1.3199, Train accuracy: 52.19% | Test loss: 1.3999, Test accuracy: 50.62%
Epoch 3/5 => Train loss: 1.2985, Train accuracy: 52.82% | Test loss: 1.3913, Test accuracy: 50.93%
Epoch 4/5 => Train loss: 1.2843, Train accuracy: 53.30% | Test loss: 1.3839, Test accuracy: 51.17%
Epoch 5/5 => Train loss: 1.2732, Train accuracy: 53.69% | Test loss: 1.3777, Test accuracy: 51.05%


Convert the fine-tuned model to OpenVINO IR, save it to xml, and verify the sizes of both `CIFAR_fp32_ir` and `CIFAR_int8_ir`.

Is the INT8 network smaller?

In [12]:
import os

# Convert the fine-tuned model to OpenVINO IR
dummy_input = torch.randn([1, 3, 32, 32])  # CIFAR10 images are 3x32x32
CIFAR_int8_ir = ov.convert_model(CNN_CIFAR_int8_finetuned, example_input=dummy_input, input=[-1, 3, 32, 32])

# Save the INT8 model to XML
ov.save_model(CIFAR_int8_ir, "CIFAR_int8_ir.xml")

# Verify the sizes of the FP32 and INT8 IR models
fp32_size = os.path.getsize("CIFAR_fp32_ir.bin")
int8_size = os.path.getsize("CIFAR_int8_ir.bin")

print(f"Size of CIFAR_fp32_ir.bin: {fp32_size / (1024*1024):.2f} MB")
print(f"Size of CIFAR_int8_ir.bin: {int8_size / (1024*1024):.2f} MB")

# Compare the sizes
if int8_size < fp32_size:
    print("The INT8 network is smaller!")
else:
    print("The INT8 network is not smaller.")


Size of CIFAR_fp32_ir.bin: 1.11 MB
Size of CIFAR_int8_ir.bin: 0.55 MB
The INT8 network is smaller!


Finally, compile the models, then validate and benchmark them.

Did the accuracy decrease?
Is the INT8 model faster?

In [13]:
import time

# Ensure you have loaded the models in OpenVINO IR format
fp32_cifar_ir_model = core.read_model("CIFAR_fp32_ir.xml") # Load the FP32 model
int8_cifar_ir_model = core.read_model("CIFAR_int8_ir.xml") # Load the INT8 model

device = "CPU"  # Choose the device, can be "GPU", "MYRIAD", etc.

# Compile the models
fp32_cifar_compiled_model = core.compile_model(fp32_cifar_ir_model, device)
int8_cifar_compiled_model = core.compile_model(int8_cifar_ir_model, device)

# Validate and benchmark
start_time = time.time()
acc1 = validate(test_loader, fp32_cifar_compiled_model, metric)
fp32_time = time.time() - start_time

# validate() returns accuracy as a fraction in [0, 1], so it is printed without a percent sign
print(f"FP32 Model Accuracy: {acc1:.2f}")
print(f"FP32 Model Inference Time: {fp32_time:.4f} seconds")

start_time = time.time()
acc2 = validate(test_loader, int8_cifar_compiled_model, metric)
int8_time = time.time() - start_time

print(f"INT8 Model Accuracy: {acc2:.2f}")
print(f"INT8 Model Inference Time: {int8_time:.4f} seconds")

# Check if accuracy decreased for INT8
if acc2 < acc1:
    print("Accuracy decreased with INT8 quantization.")
else:
    print("Accuracy remained stable or increased with INT8 quantization.")

# Check if INT8 model is faster
if int8_time < fp32_time:
    print("INT8 model is faster!")
else:
    print("FP32 model is faster or there's no significant difference.")


157it [00:03, 49.28it/s]


FP32 Model Accuracy: 0.48
FP32 Model Inference Time: 3.1880 seconds


157it [00:02, 73.62it/s]

INT8 Model Accuracy: 0.51
INT8 Model Inference Time: 2.1356 seconds
Accuracy remained stable or increased with INT8 quantization.
INT8 model is faster!





***Extension exercise***

Compare PTQ and QAT. Create a CNN model and:
- train it for 20 epochs and save as `CNN_long.pth`
- train it for 15 epochs and save as `CNN_short.pth`

Then, apply PTQ on `CNN_long.pth` model and QAT (for 5 epochs) on `CNN_short.pth`. Compare the resulting models (in terms of accuracy, size and FPS).