<a href="https://colab.research.google.com/github/BedinEduardo/Colab_Repositories/blob/master/PyTorch_Vision_Transformer_For_Deployment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Optimizing Vision Transformer Model for Deployment

Vision Transformer models apply the attention-based transformer architecture, which was introduced in Natural Language Processing and achieved state-of-the-art (SOTA) results there, to computer vision tasks.
Facebook's Data-efficient Image Transformers (DeiT) is a Vision Transformer model trained on ImageNet for image classification.

In this tutorial, we will first cover what DeiT is and how to use it, then go through the complete steps of scripting, quantizing, optimizing, and using the model in iOS and Android apps. We will also compare the performance of the quantized, optimized models with the non-quantized, non-optimized ones, and show the benefits of applying quantization and optimization along the way.

## What is DeiT

CNNs have been the main models for image classification since deep learning took off in 2012, but CNNs typically require hundreds of millions of images for training to achieve SOTA results. DeiT is a vision transformer model that requires far less data and computing resources for training to compete with the leading CNNs in image classification. This is made possible by two key components of DeiT:

* Data augmentation that simulates training on a much larger dataset.
* Native distillation that allows the transformer network to learn from a CNN's output.

DeiT shows that transformers can be successfully applied to computer vision tasks with limited access to data and resources.
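As a rough illustration of the distillation idea, the sketch below follows the hard-label variant described in the DeiT paper: the student exposes a class head and a distillation head, and the distillation head is trained against the class predicted by the teacher CNN. The helper name `hard_distillation_loss`, the tensor shapes, and the random inputs are assumptions made for illustration; this is not code from the DeiT repository.

```python
import torch
import torch.nn.functional as F


def hard_distillation_loss(cls_logits, dist_logits, teacher_logits, targets):
    """Hypothetical helper: mean of two cross-entropy terms.

    The class head is compared with the ground-truth labels, the
    distillation head with the teacher's hard (argmax) predictions.
    """
    teacher_labels = teacher_logits.argmax(dim=1)
    loss_cls = F.cross_entropy(cls_logits, targets)
    loss_dist = F.cross_entropy(dist_logits, teacher_labels)
    return 0.5 * loss_cls + 0.5 * loss_dist


# Random stand-ins for student and teacher outputs (assumed shapes).
torch.manual_seed(0)
batch, num_classes = 4, 10
cls_logits = torch.randn(batch, num_classes)
dist_logits = torch.randn(batch, num_classes)
teacher_logits = torch.randn(batch, num_classes)
targets = torch.randint(0, num_classes, (batch,))

loss = hard_distillation_loss(cls_logits, dist_logits, teacher_logits, targets)
print(loss.item())
```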

## Classifying Images with DeiT

In [None]:
!pip install torch torchvision timm pandas requests



In [None]:
from PIL import Image
import torch
import timm
import requests
import torchvision.transforms as transforms
from timm.data.constants import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD

In [None]:
print(torch.__version__)

2.6.0+cu124


In [None]:
model = torch.hub.load('facebookresearch/deit:main','deit_base_patch16_224', pretrained=True)
model.eval()

Using cache found in /root/.cache/torch/hub/facebookresearch_deit_main


VisionTransformer(
  (patch_embed): PatchEmbed(
    (proj): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
    (norm): Identity()
  )
  (pos_drop): Dropout(p=0.0, inplace=False)
  (patch_drop): Identity()
  (norm_pre): Identity()
  (blocks): Sequential(
    (0): Block(
      (norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
      (attn): Attention(
        (qkv): Linear(in_features=768, out_features=2304, bias=True)
        (q_norm): Identity()
        (k_norm): Identity()
        (attn_drop): Dropout(p=0.0, inplace=False)
        (proj): Linear(in_features=768, out_features=768, bias=True)
        (proj_drop): Dropout(p=0.0, inplace=False)
      )
      (ls1): Identity()
      (drop_path1): Identity()
      (norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
      (mlp): Mlp(
        (fc1): Linear(in_features=768, out_features=3072, bias=True)
        (act): GELU(approximate='none')
        (drop1): Dropout(p=0.0, inplace=False)
        (norm): Identity(

In [None]:
transform = transforms.Compose([
    transforms.Resize(256, interpolation=transforms.InterpolationMode.BICUBIC),  # same as the legacy integer code 3
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD)
])

In [None]:
img = Image.open(requests.get("https://raw.githubusercontent.com/pytorch/ios-demo-app/master/HelloWorld/HelloWorld/HelloWorld/image.png", stream=True).raw)

In [None]:
img = transform(img)[None]
out = model(img)
clsidx = torch.argmax(out)
print(clsidx.item())

269
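The printed value is an ImageNet class index; in the standard ImageNet-1k ordering, 269 corresponds to "timber wolf, grey wolf, gray wolf, Canis lupus". To look beyond the single best class, the logits can be turned into probabilities and the top few inspected. The following is a minimal sketch: `top_k_predictions` is a hypothetical helper, and the fake logits stand in for `out = model(img)` so the snippet runs without downloading the model.

```python
import torch


def top_k_predictions(logits, k=5):
    """Return (class_index, probability) pairs for the k most likely classes.

    Expects logits of shape (1, num_classes), like the model output above.
    """
    probs = torch.softmax(logits, dim=1)
    top_probs, top_idx = torch.topk(probs, k, dim=1)
    return [(int(i), float(p)) for i, p in zip(top_idx[0], top_probs[0])]


# Stand-in for the real model output: class 269 boosted, 270 second.
fake_logits = torch.zeros(1, 1000)
fake_logits[0, 269] = 5.0
fake_logits[0, 270] = 3.0

top5 = top_k_predictions(fake_logits, k=5)
print(top5[:2])
```

In the notebook, passing `out` instead of `fake_logits` should give the same kind of list for the real image.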


## Scripting DeiT

To use the model on mobile, we first need to script it.
Run the code below to convert the DeiT model used in the previous step to the TorchScript format, which can run on mobile.

In [None]:
model = torch.hub.load('facebookresearch/deit:main','deit_base_patch16_224',
                       pretrained=True)
model.eval()
scripted_model = torch.jit.script(model)
scripted_model.save("fbdeit_scripted.pt")

Using cache found in /root/.cache/torch/hub/facebookresearch_deit_main
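A quick way to gain confidence in a scripted model is to reload it and check that it reproduces the eager model's output. The sketch below does this for a tiny hypothetical module (`TinyNet`, not part of DeiT) and uses an in-memory buffer instead of a file, so it runs in a fraction of a second; the same pattern should apply to `scripted_model` above.

```python
import io

import torch


class TinyNet(torch.nn.Module):
    """Hypothetical stand-in for DeiT, small enough to script instantly."""

    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(8, 4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.fc(x))


tiny = TinyNet().eval()
scripted_tiny = torch.jit.script(tiny)

# Round-trip the scripted module through an in-memory buffer.
buffer = io.BytesIO()
torch.jit.save(scripted_tiny, buffer)
buffer.seek(0)
reloaded = torch.jit.load(buffer)

x = torch.randn(2, 8)
with torch.no_grad():
    same = torch.allclose(tiny(x), reloaded(x))
print("outputs match after save/load:", same)
```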


## Quantizing DeiT

To reduce the trained model size significantly while keeping the inference accuracy about the same, we can quantize the model. Because DeiT is built on a transformer, dynamic quantization is a natural fit: it works best for LSTM and transformer models.

Now run the code:

In [None]:
# Use 'x86' for server inference (the older 'fbgemm' is still available, but 'x86' is the recommended default) and 'qnnpack' for mobile inference.
backend = "x86"
model.qconfig = torch.quantization.get_default_qconfig(backend)
torch.backends.quantized.engine = backend
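Which quantized engines are available depends on how PyTorch was built, so it can help to check before hard-coding one. The sketch below reads `torch.backends.quantized.supported_engines` and picks the first match from a preference list; `pick_quantized_engine` and the preference order are assumptions for illustration.

```python
import torch


def pick_quantized_engine(preferred=("x86", "fbgemm", "qnnpack")):
    """Hypothetical helper: first preferred engine supported by this build."""
    supported = torch.backends.quantized.supported_engines
    for name in preferred:
        if name in supported:
            return name
    raise RuntimeError(f"no preferred engine available; supported: {supported}")


engine = pick_quantized_engine()
torch.backends.quantized.engine = engine
print("supported:", torch.backends.quantized.supported_engines)
print("using:", engine)
```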

In [None]:
quantized_model = torch.quantization.quantize_dynamic(model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8)
scripted_quantized_model = torch.jit.script(quantized_model)
scripted_quantized_model.save("fbdeit_scripted_quantized.pt")



In [None]:
out = scripted_quantized_model(img)
clsidx = torch.argmax(out)
print(clsidx.item())

269
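To get a feel for what dynamic quantization does to the `Linear` layers, the sketch below quantizes a small stand-in network and compares the serialized size of its weights before and after. The stand-in reuses the 768 and 3072 layer widths from the model printout, but it is a hypothetical example, not the DeiT model, and `serialized_size_bytes` is a helper introduced here.

```python
import io

import torch


def serialized_size_bytes(module):
    """Size of the module's state_dict when written with torch.save."""
    buffer = io.BytesIO()
    torch.save(module.state_dict(), buffer)
    return buffer.getbuffer().nbytes


# Hypothetical stand-in with the layer widths of one DeiT MLP block.
mlp = torch.nn.Sequential(
    torch.nn.Linear(768, 3072),
    torch.nn.GELU(),
    torch.nn.Linear(3072, 768),
).eval()

# Same call as in the notebook: only Linear layers are quantized to int8.
quantized_mlp = torch.quantization.quantize_dynamic(
    mlp, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8
)

fp32_size = serialized_size_bytes(mlp)
int8_size = serialized_size_bytes(quantized_mlp)
print(f"fp32: {fp32_size / 1e6:.2f} MB, int8: {int8_size / 1e6:.2f} MB")
```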


## Optimizing DeiT

The final step before using the quantized and scripted model on mobile is to optimize it:

In [None]:
from torch.utils.mobile_optimizer import optimize_for_mobile
optimized_scripted_quantized_model = optimize_for_mobile(scripted_quantized_model)
optimized_scripted_quantized_model.save("fbdeit_optimized_quantized.pt")

The generated `fbdeit_optimized_quantized.pt` file is about the same size as the quantized, scripted, but non-optimized model. The inference result remains the same.
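The size claim can be checked directly on disk. Here is a minimal sketch, assuming the three files saved in the earlier steps are in the current working directory; `report_file_sizes` is a hypothetical helper that silently skips files that do not exist.

```python
import os


def report_file_sizes(paths):
    """Map each existing path to its size in MiB; missing files are skipped."""
    sizes = {}
    for path in paths:
        if os.path.exists(path):
            sizes[path] = os.path.getsize(path) / (1024 * 1024)
    return sizes


# File names written by the scripting, quantizing, and optimizing steps.
model_files = [
    "fbdeit_scripted.pt",
    "fbdeit_scripted_quantized.pt",
    "fbdeit_optimized_quantized.pt",
]

for name, size_mib in report_file_sizes(model_files).items():
    print(f"{name}: {size_mib:.1f} MiB")
```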

In [None]:
out = optimized_scripted_quantized_model(img)
clsidx = torch.argmax(out)
print(clsidx.item())

269


## Using Lite Interpreter

To see how much model size reduction and inference speedup the Lite Interpreter provides, let's build the lite version of the model.

In [None]:
optimized_scripted_quantized_model._save_for_lite_interpreter("fbdeit_optmized_quantized_lite.ptl")
ptl = torch.jit.load("fbdeit_optmized_quantized_lite.ptl")

## Comparing Inference Speed

To see how inference speed differs across the five models (original, scripted, scripted and quantized, scripted and quantized and optimized, and lite), profile a single CPU forward pass through each:

In [None]:
with torch.autograd.profiler.profile(use_cuda=False) as prof1:
  out = model(img)

with torch.autograd.profiler.profile(use_cuda=False) as prof2:
  out = scripted_model(img)

with torch.autograd.profiler.profile(use_cuda=False) as prf3:
  out = scripted_quantized_model(img)

with torch.autograd.profiler.profile(use_cuda=False) as prf4:
  out = optimized_scripted_quantized_model(img)

with torch.autograd.profiler.profile(use_cuda=False) as prof5:
  out = ptl(img)

In [None]:
print("original model: {:.2f}ms".format(prof1.self_cpu_time_total/1000))
print("scripted model: {:.2f}ms".format(prof2.self_cpu_time_total/1000))
print("scripted & quantized model: {:.2f}ms".format(prf3.self_cpu_time_total/1000))
print("scripted & quantized & optimized model: {:.2f}ms".format(prf4.self_cpu_time_total/1000))
print("lite model: {:.2f}ms".format(prof5.self_cpu_time_total/1000))

original model: 689.31ms
scripted model: 599.72ms
scripted & quantized model: 494.93ms
scripted & quantized & optimized model: 548.21ms
lite model: 672.99ms
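Each number above comes from a single profiled forward pass, so it can include one-off costs and run-to-run noise; in this run the optimized and lite models measured slower than the scripted and quantized one. A more stable comparison might discard a few warm-up calls and take the median of several timed calls. The sketch below shows that pattern with a generic callable; `benchmark_ms` and its default `warmup` and `runs` values are assumptions, not part of the original notebook.

```python
import statistics
import time


def benchmark_ms(fn, warmup=3, runs=10):
    """Call fn() `warmup` times untimed, then return the median of `runs` timings in ms."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Stand-in workload; in the notebook this could be `lambda: scripted_model(img)`.
median_ms = benchmark_ms(lambda: sum(i * i for i in range(10_000)))
print(f"median: {median_ms:.3f} ms")
```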


The following results summarize the inference time taken by each model and the percentage reduction of each model relative to the original model.

In [None]:
import pandas as pd
import numpy as np

In [None]:
df = pd.DataFrame({"Model": ['original model', 'scripted model', 'scripted and quantized', 'scripted and quantized and optimized', 'lite model']})
df = pd.concat([df, pd.DataFrame([
    ["{:.2f}ms".format(prof1.self_cpu_time_total/1000), "0%"],
    ["{:.2f}ms".format(prof2.self_cpu_time_total/1000),
     "{:.2f}%".format((prof1.self_cpu_time_total-prof2.self_cpu_time_total)/prof1.self_cpu_time_total*100)],
    ["{:.2f}ms".format(prf3.self_cpu_time_total/1000),
     "{:.2f}%".format((prof1.self_cpu_time_total-prf3.self_cpu_time_total)/prof1.self_cpu_time_total*100)],
    ["{:.2f}ms".format(prf4.self_cpu_time_total/1000),
     "{:.2f}%".format((prof1.self_cpu_time_total-prf4.self_cpu_time_total)/prof1.self_cpu_time_total*100)],
    ["{:.2f}ms".format(prof5.self_cpu_time_total/1000),
     "{:.2f}%".format((prof1.self_cpu_time_total-prof5.self_cpu_time_total)/prof1.self_cpu_time_total*100)]],
    columns=['Inference Time', 'Reduction'])], axis=1)


In [None]:
print(df)

                                  Model Inference Time Reduction
0                        original model       689.31ms        0%
1                        scripted model       599.72ms    13.00%
2                scripted and quantized       494.93ms    28.20%
3  scripted and quantized and optimized       548.21ms    20.47%
4                            lite model       672.99ms     2.37%
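The Reduction column is the relative saving against the original model, (t_original - t_model) / t_original * 100. As a quick check, the sketch below recomputes it from the rounded timings printed above; `percent_reduction` is a hypothetical helper name.

```python
def percent_reduction(baseline_ms, candidate_ms):
    """Percentage of the baseline time saved by the candidate."""
    return (baseline_ms - candidate_ms) / baseline_ms * 100.0


# Rounded timings in milliseconds, copied from the table above.
baseline_ms = 689.31
timings_ms = {
    "scripted model": 599.72,
    "scripted and quantized": 494.93,
    "scripted and quantized and optimized": 548.21,
    "lite model": 672.99,
}

for name, t in timings_ms.items():
    print(f"{name}: {percent_reduction(baseline_ms, t):.2f}%")
```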
