# Deploying ViTs
## Chapter 6 Module 3

In this module, we will look at some common techniques for deploying transformers. Transformers can be very bulky, so it helps to use methods like quantization to shrink the model at deployment time. On hardware with fast low-precision kernels, this can also make the model run faster without sacrificing too much accuracy.


In [1]:
pip install transformers torchvision optimum onnx onnxruntime fiftyone #--quiet


Note: you may need to restart the kernel to use updated packages.


In [2]:
from transformers import ViTForImageClassification, AutoImageProcessor
from PIL import Image
import requests
import torch
import torchvision.transforms as T
from io import BytesIO
import fiftyone as fo 
import fiftyone.zoo as foz




Let's kick things off by loading in our ViT:

In [3]:
model_name = "google/vit-base-patch16-224"
model = ViTForImageClassification.from_pretrained(model_name)
processor = AutoImageProcessor.from_pretrained(model_name)

Fast image processor class <class 'transformers.models.vit.image_processing_vit_fast.ViTImageProcessorFast'> is available for this model. Using slow image processor class. To use the fast image processor class set `use_fast=True`.


Now let's grab our ImageNet sample to test with:

In [5]:
#fo.delete_dataset("Imagenet-Sample")
dataset = foz.load_zoo_dataset(
    "imagenet-sample",
    dataset_name="Imagenet-Sample",
    max_samples=50,
    shuffle=True,
    overwrite=True,
)
session = fo.launch_app(dataset)

Overwriting existing directory '/fiftyone/zoo/datasets/imagenet-sample'
Downloading dataset to '/fiftyone/zoo/datasets/imagenet-sample'
Downloading dataset...
 100% |████|  762.4Mb/762.4Mb [961.0ms elapsed, 0s remaining, 793.3Mb/s]      
Extracting dataset...
Parsing dataset metadata
Found 1000 samples
Dataset info written to '/fiftyone/zoo/datasets/imagenet-sample/info.json'
Loading 'imagenet-sample'
 100% |███████████████████| 50/50 [48.4ms elapsed, 0s remaining, 1.0K samples/s]   
Dataset 'Imagenet-Sample' created


## What Is Quantization (and Why Does It Matter)?
In deep learning, quantization is the process of converting a model’s weights and activations from high-precision floating-point numbers (like float32) into lower-precision formats, such as int8 or float16.
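To make the idea concrete, here is a minimal sketch of affine int8 quantization in plain Python. It is only an illustration of the general scheme, not what PyTorch runs internally; the function names `quantize_int8` and `dequantize_int8` and the sample weights are made up for this example.

```python
def quantize_int8(values):
    """Map floats to int8 codes using one scale and one zero point."""
    # Include 0.0 in the range so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255  # 255 steps between -128 and 127
    if scale == 0:
        scale = 1.0  # all-zero input: any scale works
    zero_point = round(-128 - lo / scale)
    codes = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_int8(codes, scale, zero_point):
    """Recover approximate floats from the int8 codes."""
    return [(c - zero_point) * scale for c in codes]


weights = [-0.82, -0.10, 0.0, 0.33, 1.27]  # illustrative values
codes, scale, zero_point = quantize_int8(weights)
restored = dequantize_int8(codes, scale, zero_point)
for w, c, r in zip(weights, codes, restored):
    print(f"{w:+.4f} -> {c:4d} -> {r:+.4f}")
```

Each value is stored in one byte instead of four, and the round trip loses at most about one quantization step (`scale`) per value.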

This sounds simple, but it has powerful benefits. Vision Transformers (ViTs) are large and memory-hungry. They often have tens to hundreds of millions of parameters (about 86 million for the ViT-Base used here), heavy matrix multiplications, and high VRAM or RAM usage. That makes them hard to deploy on mobile phones, on edge devices, and in low-latency environments.

Quantization gives you a smaller model, lower memory usage, and, on hardware with optimized low-precision kernels, faster inference. Together, these make real-world deployment practical.


To start, we need to check which quantization engines our PyTorch build supports. There are several forms of quantization, including post-training quantization (which we will use here) and quantization-aware training. Results can be a mixed bag: to get the best out of it, target your exact hardware and pick a quantization method that suits it.

We will be using a basic one from torch first:

In [6]:
import torch

torch.backends.quantized.supported_engines

['qnnpack', 'none', 'onednn', 'x86', 'fbgemm']

In [7]:
import torch

# Explicitly enable a compatible quantization engine
torch.backends.quantized.engine = "qnnpack"
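Which engine fits depends on the CPU. As a rough heuristic, `qnnpack` targets ARM CPUs, while `x86` and `fbgemm` target x86 CPUs. The helper below is a hypothetical sketch (the name `pick_quantized_engine` and the preference order are assumptions, not a PyTorch API) that picks a name from the supported list based on the machine architecture.

```python
import platform


def pick_quantized_engine(supported, machine=None):
    """Return an engine name from `supported` that suits the CPU architecture."""
    machine = (machine or platform.machine()).lower()
    if machine.startswith(("arm", "aarch64")):
        preferred = ["qnnpack"]
    else:
        preferred = ["x86", "fbgemm", "onednn"]
    for name in preferred:
        if name in supported:
            return name
    # Fall back to any real engine that the build reports.
    for name in supported:
        if name != "none":
            return name
    return "none"


supported = ["qnnpack", "none", "onednn", "x86", "fbgemm"]
print(pick_quantized_engine(supported, machine="x86_64"))  # x86
print(pick_quantized_engine(supported, machine="arm64"))   # qnnpack
```

In the notebook, the result could be assigned to `torch.backends.quantized.engine`, passing `torch.backends.quantized.supported_engines` as the `supported` list.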

Quantize the model's linear layers to `qint8`, making their weights roughly 4x smaller in memory:

In [8]:
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
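Where does the roughly 4x figure come from? Dynamic quantization here only touches the `torch.nn.Linear` weights, so a back-of-the-envelope estimate is possible. The sketch below assumes the standard ViT-Base dimensions (hidden size 768, MLP size 3072, 12 layers, 1000 classes) and ignores biases, embeddings, LayerNorm parameters, and the small overhead of storing scales.

```python
# ViT-Base dimensions (assumed from the google/vit-base-patch16-224 config).
HIDDEN = 768
MLP = 3072
LAYERS = 12
CLASSES = 1000

per_layer = (
    4 * HIDDEN * HIDDEN  # query, key, value, and attention output projections
    + 2 * HIDDEN * MLP   # the two MLP projections
)
linear_weights = LAYERS * per_layer + HIDDEN * CLASSES  # plus the classifier head

fp32_mib = linear_weights * 4 / 2**20  # 4 bytes per float32 weight
int8_mib = linear_weights * 1 / 2**20  # 1 byte per int8 weight
print(f"Linear weights: {linear_weights:,}")
print(f"float32: {fp32_mib:.1f} MiB, int8: {int8_mib:.1f} MiB")
```

Since almost all of the model's parameters sit in these linear layers, the whole model shrinks by close to the same factor.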

Now all that is left to do is to test it! Let's start with a single sample.

In [9]:
# Load the image

image = Image.open(dataset.first().filepath).convert("RGB")  # ensure 3 channels
inputs = processor(images=image, return_tensors="pt")

In [10]:
# Test the image

with torch.no_grad():
    outputs = quantized_model(**inputs)
    logits = outputs.logits
    predicted_class = logits.argmax(-1).item()
    print("Predicted class:", model.config.id2label[predicted_class])
    print("Actual class:", dataset.first().ground_truth.label)



Predicted class: Chihuahua
Actual class: Chihuahua


We also want to confirm that our model got faster from quantization! Let's benchmark both models next by timing repeated forward passes on the same input. `perf_counter` is great for this:

In [13]:
from time import perf_counter

def benchmark_model(model, inputs, num_iterations=10):  # raise to 100 for steadier numbers
    start_time = perf_counter()
    for _ in range(num_iterations):
        with torch.no_grad():
            _ = model(**inputs)
    end_time = perf_counter()
    elapsed_time = end_time - start_time
    avg_time = elapsed_time / num_iterations
    return avg_time

# Benchmark the normal model
avg_time = benchmark_model(model, inputs)

print(f"Average inference time for normal model: {avg_time:.6f} seconds")

# Benchmark the quantized model
avg_time = benchmark_model(quantized_model, inputs)

print(f"Average inference time for quantized model: {avg_time:.6f} seconds")

Average inference time for normal model: 0.459200 seconds
Average inference time for quantized model: 0.945831 seconds


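Uh oh! Did your quantized model not get faster? Don't worry. Chances are that your machine has no fast int8 kernels for the engine we selected, so the extra work of quantizing activations and dequantizing results on every call outweighs the savings. Note that `qnnpack` is tuned for ARM CPUs; on an x86 machine, the `x86` or `fbgemm` engine is usually a better match. To capture the full benefit of quantization, target a device with int8 support, like a Raspberry Pi, Nvidia GPUs, or Qualcomm Snapdragons, and use a runtime built for that hardware.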
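A single short timing loop can also be noisy, since the first calls often pay one-time setup costs. Below is a sketch of a slightly more careful timer that runs a few warm-up calls and reports the median. The name `time_callable` and its defaults are assumptions for illustration, and the workload is a stand-in so the snippet runs anywhere.

```python
from statistics import median
from time import perf_counter


def time_callable(fn, warmup=3, iterations=20):
    """Return the median wall-clock time of fn() in seconds, after warm-up calls."""
    for _ in range(warmup):
        fn()  # warm-up runs are not timed
    timings = []
    for _ in range(iterations):
        start = perf_counter()
        fn()
        timings.append(perf_counter() - start)
    return median(timings)


# Stand-in workload; swap in a model forward pass in the notebook.
elapsed = time_callable(lambda: sum(i * i for i in range(10_000)))
print(f"Median time: {elapsed:.6f} seconds")
```

In the notebook, the callable could be `lambda: quantized_model(**inputs)`, called inside a `torch.no_grad()` block.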