
## Introduction to DeiT (Data-efficient image Transformers)

**Vision Transformer** -> requires an enormous amount of training data -> this limits how widely it can be used

**DeiT** -> keeps the Vision Transformer architecture as is, but uses **knowledge distillation** to reach performance close to SOTA with far less data
(knowledge distillation: a training technique that sets up a teacher model and a student model so that the student can learn from the knowledge the teacher has already acquired)
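
To make the teacher/student idea concrete, here is a minimal sketch of a generic soft-label distillation loss for a single sample. It is an illustration only and not necessarily the exact loss DeiT uses: the function names, the `alpha` weighting, and the `temperature` value are all assumptions introduced for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by `temperature`."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=3.0):
    """Hypothetical soft-distillation loss for one sample (names are illustrative).

    (1 - alpha) * cross-entropy(student, true label)
      + alpha * T^2 * KL(teacher_soft || student_soft)
    """
    # Supervised term: the student is still trained on the true label.
    student_probs = softmax(student_logits)
    ce = -math.log(student_probs[label])
    # Distillation term: the student matches the teacher's softened distribution.
    t_soft = softmax(teacher_logits, temperature)
    s_soft = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(t_soft, s_soft) if t > 0)
    return (1 - alpha) * ce + alpha * temperature ** 2 * kl

# Toy logits for a hypothetical 3-class problem.
student = [2.0, 0.5, -1.0]
teacher = [3.0, 1.0, -2.0]
print(distillation_loss(student, teacher, label=0))
```

When the student's logits equal the teacher's, the KL term is zero, so only the supervised term remains; the further the two distributions drift apart, the larger the penalty.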

In [1]:
!pip install timm pandas requests

Collecting timm
  Downloading timm-0.5.4-py3-none-any.whl (431 kB)

## Classifying Images with DeiT

In [2]:
from PIL import Image
import torch
import timm
import requests
import torchvision.transforms as transforms
from timm.data.constants import IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD

print(torch.__version__)
# written for 1.8.0; this run used 1.10.0+cu111


model = torch.hub.load('facebookresearch/deit:main', 'deit_base_patch16_224', pretrained=True)
model.eval()

transform = transforms.Compose([
    transforms.Resize(256, interpolation=transforms.InterpolationMode.BICUBIC),  # BICUBIC was the int code 3
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD),
])

img = Image.open(requests.get("https://raw.githubusercontent.com/pytorch/ios-demo-app/master/HelloWorld/HelloWorld/HelloWorld/image.png", stream=True).raw)
img = transform(img)[None,]
out = model(img)
clsidx = torch.argmax(out)
print(clsidx.item())

1.10.0+cu111


Downloading: "https://github.com/facebookresearch/deit/archive/main.zip" to /root/.cache/torch/hub/main.zip
Downloading: "https://dl.fbaipublicfiles.com/deit/deit_base_patch16_224-b5f2ef4d.pth" to /root/.cache/torch/hub/checkpoints/deit_base_patch16_224-b5f2ef4d.pth




269
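
The cell above reports only the single most likely class index. As a small framework-free sketch, the helper below turns raw logits into probabilities and returns the top `k` candidates, which can help judge how confident the prediction is. The function name and the toy logits are assumptions for illustration; with the notebook's tensor, something like `out[0].detach().tolist()` could be passed in instead.

```python
import math

def top_k_probs(logits, k=5):
    """Return the k most likely (class_index, probability) pairs from raw logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]

# Toy logits for a hypothetical 4-class model.
toy_logits = [0.1, 2.5, -1.0, 1.2]
for idx, p in top_k_probs(toy_logits, k=2):
    print(idx, round(p, 3))
```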


## Scripting DeiT

To use the model on mobile, we first need to script the model. See the Script and Optimize recipe for a quick overview. Run the code below to convert the DeiT model used in the previous step to the TorchScript format that can run on mobile.

In [4]:
model = torch.hub.load('facebookresearch/deit:main', 'deit_base_patch16_224', pretrained=True)
model.eval()
scripted_model = torch.jit.script(model)
scripted_model.save("fbdeit_scripted.pt")

Using cache found in /root/.cache/torch/hub/facebookresearch_deit_main


## Quantizing DeiT

**Quantization summary**: [see this link](https://velog.io/@jooh95/%EB%94%A5%EB%9F%AC%EB%8B%9D-Quantization%EC%96%91%EC%9E%90%ED%99%94-%EC%A0%95%EB%A6%AC)

-> In short, at inference time the trained floating-point model weights are converted to integers, which speeds up the model and reduces its size
 (accuracy may drop, though)
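
As a rough illustration of what "converting floats to integers" means, the sketch below applies a generic affine int8 scheme to a handful of weights and then maps them back. This is a simplified assumption-laden example, not what `fbgemm` or `qnnpack` do internally; the function names and sample weights are made up for this illustration.

```python
def quantize_int8(values):
    """Affine-quantize a list of floats to int8; returns (q_values, scale, zero_point)."""
    # The representable range must include 0 so that 0.0 maps to an exact integer.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 when every value is 0
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize(q_values, scale, zero_point):
    """Map int8 values back to approximate floats."""
    return [(q - zero_point) * scale for q in q_values]

weights = [-0.62, -0.10, 0.0, 0.33, 0.91]
q, scale, zp = quantize_int8(weights)
restored = dequantize(q, scale, zp)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, zp, max_err)
```

Each weight now fits in one byte instead of four, and the round-trip error stays within about half a quantization step (`scale / 2`), which is where the possible accuracy drop comes from.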

In [5]:
# Use 'fbgemm' for server inference and 'qnnpack' for mobile inference
backend = "fbgemm"  # switching to 'qnnpack' made the quantized model much slower in this notebook
model.qconfig = torch.quantization.get_default_qconfig(backend)
torch.backends.quantized.engine = backend

quantized_model = torch.quantization.quantize_dynamic(model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8)
scripted_quantized_model = torch.jit.script(quantized_model)
scripted_quantized_model.save("fbdeit_scripted_quantized.pt")



In [6]:
out = scripted_quantized_model(img)
clsidx = torch.argmax(out)
print(clsidx.item())
# The same output 269 should be printed

269


## Optimizing DeiT

The final step before using the quantized and scripted model on mobile is to optimize it with `optimize_for_mobile`.

In [7]:
from torch.utils.mobile_optimizer import optimize_for_mobile
optimized_scripted_quantized_model = optimize_for_mobile(scripted_quantized_model)
optimized_scripted_quantized_model.save("fbdeit_optimized_scripted_quantized.pt")

In [8]:
out = optimized_scripted_quantized_model(img)
clsidx = torch.argmax(out)
print(clsidx.item())
# Again, the same output 269 should be printed

269




## Using Lite Interpreter

The lite model is comparable in size to the non-lite version, but it is expected to run inference faster on mobile.

In [9]:
optimized_scripted_quantized_model._save_for_lite_interpreter("fbdeit_optimized_scripted_quantized_lite.ptl")
ptl = torch.jit.load("fbdeit_optimized_scripted_quantized_lite.ptl")
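
To check the size claims (quantization shrinks the model, and the lite file is about as large as the non-lite one), a small helper can list the size of each saved file. This is a sketch using only the standard library; `size_report` is a hypothetical helper name, and files that do not exist in the working directory are simply skipped.

```python
import os

def size_report(paths):
    """Map each existing file path to its size in MB; missing files are skipped."""
    return {p: os.path.getsize(p) / 1e6 for p in paths if os.path.isfile(p)}

# File names saved by the cells in this notebook.
saved = [
    "fbdeit_scripted.pt",
    "fbdeit_scripted_quantized.pt",
    "fbdeit_optimized_scripted_quantized.pt",
    "fbdeit_optimized_scripted_quantized_lite.ptl",
]
for path, mb in size_report(saved).items():
    print(f"{path}: {mb:.1f} MB")
```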

## Comparing Inference Speed

In [10]:
with torch.autograd.profiler.profile(use_cuda=False) as prof1:
    out = model(img)
with torch.autograd.profiler.profile(use_cuda=False) as prof2:
    out = scripted_model(img)
with torch.autograd.profiler.profile(use_cuda=False) as prof3:
    out = scripted_quantized_model(img)
with torch.autograd.profiler.profile(use_cuda=False) as prof4:
    out = optimized_scripted_quantized_model(img)
with torch.autograd.profiler.profile(use_cuda=False) as prof5:
    out = ptl(img)

print("original model: {:.2f}ms".format(prof1.self_cpu_time_total/1000))
print("scripted model: {:.2f}ms".format(prof2.self_cpu_time_total/1000))
print("scripted & quantized model: {:.2f}ms".format(prof3.self_cpu_time_total/1000))
print("scripted & quantized & optimized model: {:.2f}ms".format(prof4.self_cpu_time_total/1000))
print("lite model: {:.2f}ms".format(prof5.self_cpu_time_total/1000))

original model: 663.23ms
scripted model: 694.24ms
scripted & quantized model: 427.35ms
scripted & quantized & optimized model: 506.47ms
lite model: 474.93ms
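
Each number above comes from a single profiled forward pass, so run-to-run noise may account for part of the differences. One way to get steadier numbers, sketched here under the assumption that wall-clock time is an acceptable proxy (it is not the same metric as the profiler's `self_cpu_time_total`), is to warm up and then take the median over several runs. `benchmark_ms` is a hypothetical helper, not part of PyTorch.

```python
import statistics
import time

def benchmark_ms(fn, warmup=2, runs=10):
    """Median wall-clock time of fn() in milliseconds after a few warm-up calls."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)

# Stand-in workload; in the notebook this could be `lambda: scripted_model(img)`.
print(f"{benchmark_ms(lambda: sum(range(100_000))):.2f}ms")
```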


In [11]:
import pandas as pd

df = pd.DataFrame({'Model': ['original model','scripted model', 'scripted & quantized model', 'scripted & quantized & optimized model', 'lite model']})
df = pd.concat([df, pd.DataFrame([
    ["{:.2f}ms".format(prof1.self_cpu_time_total/1000), "0%"],
    ["{:.2f}ms".format(prof2.self_cpu_time_total/1000),
     "{:.2f}%".format((prof1.self_cpu_time_total-prof2.self_cpu_time_total)/prof1.self_cpu_time_total*100)],
    ["{:.2f}ms".format(prof3.self_cpu_time_total/1000),
     "{:.2f}%".format((prof1.self_cpu_time_total-prof3.self_cpu_time_total)/prof1.self_cpu_time_total*100)],
    ["{:.2f}ms".format(prof4.self_cpu_time_total/1000),
     "{:.2f}%".format((prof1.self_cpu_time_total-prof4.self_cpu_time_total)/prof1.self_cpu_time_total*100)],
    ["{:.2f}ms".format(prof5.self_cpu_time_total/1000),
     "{:.2f}%".format((prof1.self_cpu_time_total-prof5.self_cpu_time_total)/prof1.self_cpu_time_total*100)]],
    columns=['Inference Time', 'Reduction'])], axis=1)

print(df)

                                    Model Inference Time Reduction
0                          original model       663.23ms        0%
1                          scripted model       694.24ms    -4.68%
2              scripted & quantized model       427.35ms    35.57%
3  scripted & quantized & optimized model       506.47ms    23.64%
4                              lite model       474.93ms    28.39%
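
The `Reduction` column is the fraction of the original model's time that each variant saves, so a negative value means the variant was slower. The short sketch below recomputes the column from the rounded millisecond values printed above; `reduction_pct` is a helper name introduced here for illustration.

```python
def reduction_pct(baseline_ms, model_ms):
    """Percentage of inference time saved relative to the baseline (negative = slower)."""
    return (baseline_ms - model_ms) / baseline_ms * 100

# Timings in ms copied from the run above; the original model took 663.23 ms.
timings = {
    "scripted model": 694.24,
    "scripted & quantized model": 427.35,
    "scripted & quantized & optimized model": 506.47,
    "lite model": 474.93,
}
for name, ms in timings.items():
    print(f"{name}: {reduction_pct(663.23, ms):.2f}%")
```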
