# Notebook for experimenting with optimizing a model for inference

In [5]:
# import dependencies
import torch
import cv2
import numpy as np
from time import time
from torchvision.models import detection
import torch.utils.benchmark as benchmark

print(torch.__version__)

2.3.1+cu121


In [6]:
# load baseline model for optimization and benchmarking...
baseline_model = detection.ssdlite320_mobilenet_v3_large(weights=detection.SSDLite320_MobileNet_V3_Large_Weights.DEFAULT)
baseline_model.eval()

SSD(
  (backbone): SSDLiteFeatureExtractorMobileNet(
    (features): Sequential(
      (0): Sequential(
        (0): Conv2dNormActivation(
          (0): Conv2d(3, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
          (1): BatchNorm2d(16, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
          (2): Hardswish()
        )
        (1): InvertedResidual(
          (block): Sequential(
            (0): Conv2dNormActivation(
              (0): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=16, bias=False)
              (1): BatchNorm2d(16, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
              (2): ReLU(inplace=True)
            )
            (1): Conv2dNormActivation(
              (0): Conv2d(16, 16, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(16, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
            )
          )
        )
        (2): Invert

In [7]:
# benchmark basic model.
num_threads = torch.get_num_threads()
print(f"Benchmarking on {num_threads} threads")

Benchmarking on 2 threads


### Benchmarking Baseline Model
- Attempt: Benchmark the baseline model, then repeat the benchmark with a smaller input image size.

In [None]:
# benchmark model
setup = '''
import torch
from __main__ import baseline_model

x = torch.rand(1, 3, 640, 480)
'''

baseline_benchmark = benchmark.Timer(
    stmt = "baseline_model(x)",
    setup= setup,
    num_threads=num_threads,
    label="Baseline model")

print(baseline_benchmark.timeit(100))

In [None]:
reduced_resolution_setup = '''
import torch
from __main__ import baseline_model

x = torch.rand(1, 3, 320, 240)
'''

baseline_benchmark_reduced_resolution =  benchmark.Timer(
    stmt = "baseline_model(x)",
    setup= reduced_resolution_setup,
    num_threads=num_threads,
    label="Baseline model",
    sub_label="reduced resolution")

print(baseline_benchmark_reduced_resolution.timeit(100))

### Torch Scripting
General points to note: scripting and tracing both convert the model to TorchScript, which can apply optimizations that improve inference speed in a production environment. Tracing records the operations executed for the example input, so it effectively freezes the model's conditional logic to the path taken during tracing; scripting compiles the model code itself and keeps the control flow. There are more subtle differences between the two approaches.
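
The difference in how the two approaches treat conditional logic can be seen on a toy module. This is a minimal sketch, assuming a made-up `SignGate` module (not part of the detection model) with a data-dependent branch.

```python
import warnings

import torch


class SignGate(torch.nn.Module):
    """Hypothetical toy module whose output depends on a data-dependent branch."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if x.sum() > 0:
            return x * 2
        return -x


module = SignGate().eval()
positive = torch.ones(3)
negative = -torch.ones(3)

# Tracing with a positive input records only the `x * 2` branch.
# It emits a TracerWarning about the data-dependent condition, silenced here.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    traced = torch.jit.trace(module, positive)

# Scripting compiles the forward method, including the if statement.
scripted = torch.jit.script(module)

traced_out = traced(negative)      # still follows the recorded `x * 2` path
scripted_out = scripted(negative)  # takes the `-x` branch

print("traced:  ", traced_out)
print("scripted:", scripted_out)
```

For a model with input-dependent control flow, the traced version is only valid for inputs that follow the same path as the example input.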

In [None]:
# Apply torch scripting...
scripted_model = torch.jit.script(baseline_model.eval())
scripted_model.eval()

In [None]:
# benchmark the performance of the model
setup = '''
import torch
from __main__ import scripted_model

x = torch.rand(3, 640, 480)
'''
scripted_model_benchmark = benchmark.Timer(
    stmt = "scripted_model([x])", # note: the scripted model has to be passed a list of 3D image tensors
    setup= setup,
    num_threads=num_threads,
    label="Scripted baseline model")

print(scripted_model_benchmark.timeit(100))

Scripting the model results in some improvement in inference time, but the speedup is not consistent across multiple runs. As a side note, I also tried reducing the resolution of the input image to see if it improves inference time; it doesn't seem to result in any significant speedup, likely because the detection model's own transform resizes every input to a fixed size (320x320 for this model) before the backbone runs. The results reported above were obtained after only briefly going through notes on TorchScript, so there might be more details or approaches to scripting and tracing that would improve inference time.
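
Since the timings vary from run to run, it helps to look at the spread rather than a single number. The following is a library-agnostic sketch using only the standard library; `time_repeatedly` and `summarize` are hypothetical helper names, and the workload is a placeholder rather than the model. `torch.utils.benchmark` offers `blocked_autorange` for a similar purpose.

```python
import statistics
import time


def time_repeatedly(fn, repeats=20, warmup=3):
    """Call fn several times and return the wall-clock duration of each call."""
    for _ in range(warmup):
        fn()  # warm-up calls are not recorded
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Report the median and interquartile range, which are robust to outliers."""
    q1, median, q3 = statistics.quantiles(samples, n=4)
    return {"median": median, "iqr": q3 - q1, "min": min(samples), "max": max(samples)}


# Placeholder workload; in the notebook this would be a call to the model.
stats = summarize(time_repeatedly(lambda: sum(i * i for i in range(10_000))))
print(stats)
```

If the interquartile range is of the same order as the difference between two variants, the measured speedup is probably within the noise.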

I'll try other approaches:
- quantizations
- experimenting with onnx runtime

https://www.reddit.com/r/MachineLearning/comments/yg1mpz/d_how_to_get_the_fastest_pytorch_inference_and/

As a side note, or maybe a more concluding thought on scripting as an optimization approach:   
The main advantage of a scripted model is that it can run independently of a Python environment, so it is flexible and portable and can be deployed into a non-Python environment. Scripting is not necessarily a technique for speeding up inference; it is more focused on flexible and portable deployment.
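
The portability point can be illustrated by saving a scripted module to a single file and loading it back without the original class definition being needed. This is a minimal sketch, assuming a made-up `TinyClassifier` module and a temporary file path.

```python
import os
import tempfile

import torch


class TinyClassifier(torch.nn.Module):
    """Hypothetical small module standing in for a real model."""

    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(4, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.linear(x))


model = TinyClassifier().eval()
scripted = torch.jit.script(model)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tiny_classifier.pt")
    # The saved archive contains both the code and the weights.
    torch.jit.save(scripted, path)
    restored = torch.jit.load(path)

x = torch.rand(1, 4)
with torch.no_grad():
    same = torch.allclose(model(x), restored(x))
print("outputs match:", same)
```

The same saved file is what a non-Python runtime would load, which is where the deployment flexibility comes from.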


### Quantization
(More of a summary of notes in model optimization.md)

Quantization is an optimization technique that can be applied to a model to reduce its size and increase its inference speed (with the caveat that we keep *about* the same accuracy as the original model). At a glance, quantization converts high-precision numbers (e.g. 32-bit floats) into lower-precision ones (e.g. 8-bit integers); the result is a reduction in model size and computational cost, with some usually negligible loss in accuracy. Three [approaches](https://pytorch.org/tutorials/recipes/quantization.html) to quantization are:
- Post-training dynamic quantization (converts the model weights to 8-bit integers ahead of time, but converts activations only right before they are used in a computation. At the time of writing, this method only supports quantizing nn.Linear and nn.LSTM modules.)
- Post-training static quantization (converts both weights and activations to 8-bit integers; there is no on-the-fly conversion of the model activations during inference)
- Quantization-aware training (from what I can gather, this approach inserts quantize/dequantize steps for the inputs and activations of the model, so the conversion is baked into the model architecture and used as part of model training)
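
To make the "high precision to low precision" idea concrete, here is a small pure-Python sketch of 8-bit affine quantization. The helper names and the example weights are made up for illustration, and this is the textbook scale/zero-point scheme rather than PyTorch's exact implementation.

```python
def affine_qparams(x_min, x_max, qmin=0, qmax=255):
    """Compute the scale and zero point that map [x_min, x_max] onto [qmin, qmax]."""
    # The range must contain 0.0 so that zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0
    zero_point = int(round(qmin - x_min / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=0, qmax=255):
    """Map floats to integers in [qmin, qmax]."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map the integers back to approximate float values."""
    return [(q - zero_point) * scale for q in q_values]


weights = [-1.0, -0.25, 0.0, 0.4, 1.0]  # made-up example values
scale, zero_point = affine_qparams(min(weights), max(weights))
q_weights = quantize(weights, scale, zero_point)
restored = dequantize(q_weights, scale, zero_point)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("scale:", scale, "zero_point:", zero_point)
print("quantized:", q_weights)
print("max round-trip error:", max_error)
```

Each value is stored in one byte instead of four, and the round-trip error stays on the order of one quantization step, which is where the "about the same accuracy" caveat comes from.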


For my application, is it worth checking whether a quantized version of the detection models already exists? There are quantized versions of the backbone, but there are no quantized versions of the detection models.

In [None]:
print(torch.backends.quantized.supported_engines)

In [None]:
# Post training static quantization
backend = "qnnpack"

baseline_model = baseline_model.to("cpu")

baseline_model.qconfig = torch.quantization.get_default_qconfig(backend)
torch.backends.quantized.engine = backend
model_static_quantized = torch.quantization.prepare(baseline_model, inplace=False)
model_static_quantized = torch.quantization.convert(model_static_quantized, inplace=False)


In [None]:
# try running inference on quantized model
x = torch.rand(1, 3, 640, 480)

quant = torch.ao.quantization.QuantStub()
x_quantized = quant(x)
prediction = model_static_quantized(x_quantized)
print(prediction)

Understanding the error *Could not run 'quantized::conv2d.new' with arguments from the 'CPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build).*   
Looking at this [post](https://discuss.pytorch.org/t/error-in-running-quantised-model-runtimeerror-could-not-run-quantized-conv2d-new-with-arguments-from-the-cpu-backend/151718), it seems that you have to quantize the input passed to the quantized model, and the approach is to use the `QuantStub` and `DeQuantStub` modules to quantize the input and dequantize the output. It also seems that a different approach to quantizing specific layers in the model was introduced (it appears to be more related to quantization-aware training), which might be worth trying.


In [None]:
# wrap model input and output, quantizing input and de-quantizing the output
class QuantizedModel(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model_fp32 = model
        self.quant = torch.quantization.QuantStub()
        self.dequant = torch.quantization.DeQuantStub()
        
    def forward(self, x):
        x = self.quant(x)
        x = self.model_fp32(x)
        x = self.dequant(x)
        return x


In [None]:
# quantize model with inputs wrapped with quant & dequant function
quantized_model = QuantizedModel(baseline_model)

backend = "fbgemm"
baseline_model.qconfig = torch.quantization.get_default_qconfig(backend)
torch.backends.quantized.engine = backend
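model_static_quantized = torch.quantization.prepare(quantized_model, inplace=False)
# NOTE: convert() is given `quantized_model` (the unprepared wrapper), not the prepared
# copy returned by prepare() above, so the observers inserted by prepare() are discarded.
# No calibration pass is run either, so the layers are likely left in fp32.
model_static_quantized = torch.quantization.convert(quantized_model, inplace=False)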

In [None]:
x = torch.rand(1, 3, 640, 480)

# quant = torch.ao.quantization.QuantStub()
# x_quantized = quant(x)
prediction = model_static_quantized(x)
print(prediction)

# IT WORKS !!!!!

I first tried simply quantizing the inputs before calling the model's forward function, but that did not resolve the issue; wrapping the model in a class whose forward function applies the quantization stubs did the trick. Does that mean the quantization stubs need to be within the forward function of the model?
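
One way to explore that question is on a toy model where the whole prepare, calibrate, convert sequence can be inspected. This is a minimal sketch, assuming a made-up single-convolution `TinyNet` and random calibration data; it is not the SSD model, and the backend fallback is an assumption for machines without `fbgemm`.

```python
import torch
import torch.ao.nn.quantized as nnq
from torch import nn
from torch.ao.quantization import DeQuantStub, QuantStub, convert, get_default_qconfig, prepare


class TinyNet(nn.Module):
    """Hypothetical toy model with the stubs placed inside forward."""

    def __init__(self):
        super().__init__()
        self.quant = QuantStub()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.dequant = DeQuantStub()

    def forward(self, x):
        x = self.quant(x)       # float -> quantized once converted
        x = self.conv(x)
        return self.dequant(x)  # quantized -> float once converted


engines = torch.backends.quantized.supported_engines
backend = "fbgemm" if "fbgemm" in engines else "qnnpack"
torch.backends.quantized.engine = backend

model = TinyNet().eval()
model.qconfig = get_default_qconfig(backend)

prepared = prepare(model, inplace=False)  # inserts observers
with torch.no_grad():
    for _ in range(4):                    # calibration passes on random data
        prepared(torch.rand(1, 3, 32, 32))
quantized = convert(prepared, inplace=False)  # convert the *prepared* model

out = quantized(torch.rand(1, 3, 32, 32))
print(type(quantized.quant).__name__, type(quantized.conv).__module__)
print("output is quantized:", out.is_quantized)
```

Printing the module types after `convert` is a quick way to confirm that layers were really swapped for their quantized counterparts; calling a stub on a tensor outside the model before conversion leaves the tensor in floating point.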

In [None]:
# benchmark static quantized model
setup = '''
import torch
from __main__ import model_static_quantized

x = torch.rand(1, 3, 640, 480)
'''
# named *_benchmark so the Timer does not overwrite the `quantized_model` wrapper defined earlier
quantized_model_benchmark = benchmark.Timer(
    stmt = "model_static_quantized(x)",
    setup= setup,
    num_threads=num_threads,
    label="quantized model")

print(quantized_model_benchmark.timeit(100))


The resulting quantized model is not as fast as I would have thought. There are some improvements, but the inference time is not consistent across multiple runs.

In [None]:
# is there a difference in quantizing the model with different backends?
# available backends: ['qnnpack', 'none', 'onednn', 'x86', 'fbgemm']

def quantize_model_with_different_backend(model: torch.nn.Module, backend: str):

    quantized_model = QuantizedModel(model.to("cpu")) # not really quantized yet, but its inputs and outputs have been wrapped with quantization stubs

    quantized_model.qconfig = torch.quantization.get_default_qconfig(backend)
    torch.backends.quantized.engine = backend
    model_static_quantized = torch.quantization.prepare(quantized_model, inplace=False)
    # NOTE: convert() is given `quantized_model` (the unprepared wrapper), not the prepared
    # copy returned by prepare() above, and no calibration pass is run in between,
    # so the layers are likely left in fp32 regardless of the backend.
    model_static_quantized = torch.quantization.convert(quantized_model, inplace=False)
    

    return model_static_quantized



#### Quantizing with different backends

In [None]:
# qnnpack model
qnn_pack_backend_quantized_model = quantize_model_with_different_backend(baseline_model, "qnnpack")

# benchmark static quantized model
setup = '''
import torch
from __main__ import qnn_pack_backend_quantized_model

x = torch.rand(1, 3, 640, 480)
'''
quantized_qnnpack_model = benchmark.Timer(
    stmt = "qnn_pack_backend_quantized_model(x)",
    setup= setup,
    num_threads=num_threads,
    label="quantized qnnpack model")

print(quantized_qnnpack_model.timeit(100))

In [None]:
# onednn backend
onednn_backend_quantized_model = quantize_model_with_different_backend(baseline_model, "onednn")

# benchmark static quantized model
setup = '''
import torch
from __main__ import onednn_backend_quantized_model

x = torch.rand(1, 3, 640, 480)
'''
quantized_onednn_model = benchmark.Timer(
    stmt = "onednn_backend_quantized_model(x)",
    setup= setup,
    num_threads=num_threads,
    label="quantized onednn model")

print(quantized_onednn_model.timeit(100))

In [None]:
# x86 backend
x86_backend_quantized_model = quantize_model_with_different_backend(baseline_model, "x86")

# benchmark static quantized model
setup = '''
import torch
from __main__ import x86_backend_quantized_model

x = torch.rand(1, 3, 640, 480)
'''
quantized_x86_model = benchmark.Timer(
    stmt = "x86_backend_quantized_model(x)",
    setup= setup,
    num_threads=num_threads,
    label="quantized x86 model")

print(quantized_x86_model.timeit(100))

#### Try fusing model layers

**Quantization backend Significance**   
Why do the different backends matter, and which backend should I use?   
The different backends specify how quantized operations are implemented and optimized for a specific hardware platform.   
- The `fbgemm` backend is `optimized for x86 platforms such as Intel CPUs`. It uses efficient algorithms and instructions specific to x86 hardware to maximize performance. It's optimized for x86 server CPUs and is widely used in data center environments where Intel or AMD x86 processors are predominant.

- `qnnpack` is `optimized for mobile devices` with ARM CPUs and is designed to offer efficient performance on smartphones and tablets, making it suitable for mobile applications. qnnpack supports 8-bit integer quantization and provides optimizations for mobile inference.

- `onednn` is an `Intel-developed library that provides optimization for both CPU and GPU`. It's used for server-side applications where Intel hardware is predominant. It can be useful for both quantized and floating point computations. Also worth noting that `onednn` is `focused on x86 Intel platforms`; the key difference between it and `fbgemm` is that it's `more versatile in that it provides optimization for both CPU and GPU`.

Given that I've switched from running the model locally on the Raspberry Pi to hosting it on a FastAPI server on an old Linux server, the qnnpack backend is no longer a good fit, so that leaves `x86`, `onednn` and `fbgemm`. Looking at the benchmarking results, the model quantized with qnnpack also had the longest inference time, with fbgemm and x86 giving the *best reduction in inference time*.

Note:   
- Checking the Linux laptop: running `uname -m` gave the result x86-64, which indicates that the laptop is running a 64-bit version of the x86 architecture, also known as "AMD64" or "Intel 64".
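
The backend choice described above can be written down as a small helper. This is a sketch under assumptions: `pick_quantized_backend` is a hypothetical function, and the preference order (`x86`, then `fbgemm`, then `onednn` on x86 machines, `qnnpack` elsewhere) is my own reading of the notes, not an official recommendation.

```python
import platform


def pick_quantized_backend(supported_engines, machine=None):
    """Return the first preferred quantization backend available on this machine."""
    machine = (machine or platform.machine()).lower()
    if machine in ("x86_64", "amd64", "i386", "i686"):
        preferred = ["x86", "fbgemm", "onednn"]  # assumed order for x86 CPUs
    else:
        preferred = ["qnnpack"]                  # assumed choice for ARM and others
    for backend in preferred:
        if backend in supported_engines:
            return backend
    return None  # nothing suitable was found


# In the notebook the list would come from torch.backends.quantized.supported_engines.
engines = ["qnnpack", "none", "onednn", "x86", "fbgemm"]
print(pick_quantized_backend(engines, machine="x86_64"))
print(pick_quantized_backend(engines, machine="aarch64"))
```

Passing `machine` explicitly makes it possible to check what would be selected on the Raspberry Pi without running there.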


One last try, with **fusing model layers**.   
Notes from [quantization in practice](https://pytorch.org/blog/quantization-in-practice/#backend-engine)
- Post-training static quantization:
    - quantizes the model weights and activations ahead of time, as opposed to applying on-the-fly quantization to the activations. The activations stay in quantized precision between operations during inference.
    - Additional steps to apply in the post-training static quantization workflow: fuse modules, gather calibration data, and calibrate the model before applying quantization.

*"Module fusion combines multiple sequential modules into one. Fusing modules mean the compiler needs to only run one kernel instead of many; this speeds things up and improve accuracy by reducing quantization error"* 
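
Module fusion can be tried on a toy block before touching the detection model. This is a minimal sketch, assuming a made-up `ConvBlock` with a Conv2d, BatchNorm2d and ReLU in sequence; the model must be in eval mode for this kind of fusion.

```python
import torch
from torch import nn
from torch.ao.quantization import fuse_modules


class ConvBlock(nn.Module):
    """Hypothetical conv -> batch norm -> relu block."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(8)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))


torch.manual_seed(0)
block = ConvBlock()

# Run one batch in train mode so the batch norm statistics are not the defaults.
block.train()
with torch.no_grad():
    block(torch.rand(8, 3, 16, 16))
block.eval()

# Fold the batch norm into the conv weights and merge the relu into the same module.
fused = fuse_modules(block, [["conv", "bn", "relu"]], inplace=False)

x = torch.rand(1, 3, 16, 16)
with torch.no_grad():
    max_diff = (block(x) - fused(x)).abs().max().item()

print(type(fused.conv).__name__, type(fused.bn).__name__, type(fused.relu).__name__)
print("max difference after fusion:", max_diff)
```

After fusion the `bn` and `relu` slots are replaced by identity modules and the outputs should agree up to floating point error, so fusion on its own does not change what the model computes.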

In [None]:
from torch.ao.quantization import fuse_modules
import copy


baseline_model_copy = copy.deepcopy(baseline_model)
for name, module in baseline_model_copy.named_children():
    print(name)
    print("-------------")


In [None]:
for name, module in baseline_model_copy.named_children():
    print(name, module)
    print("-------------")

In [None]:
# walk the model and list every Conv2dNormActivation block (candidates for conv + batch norm fusion)
def find_all_conv2d_norm_activation_blocks_and_fuse_them(model):

    # model.modules() visits every submodule exactly once, so no manual recursion is needed
    for module in model.modules():
        if type(module).__name__ == 'Conv2dNormActivation':
            # fusion is still disabled here; for now the blocks are only printed
            # if type(module[1]) == torch.nn.modules.batchnorm.BatchNorm2d:
            #     fuse_modules(module, ["0", "1"], inplace=True)
            print(module)

find_all_conv2d_norm_activation_blocks_and_fuse_them(baseline_model_copy)

In [None]:
# fbgemm backend
fbgem_model_copy = copy.deepcopy(baseline_model_copy)
fbgemm_backend_quantized_model = quantize_model_with_different_backend(fbgem_model_copy, "fbgemm")

# benchmark static quantized model
setup = '''
import torch
from __main__ import fbgemm_backend_quantized_model

x = torch.rand(1, 3, 640, 480)
'''
quantized_fbgemm_model_benchmark = benchmark.Timer(
    stmt = "fbgemm_backend_quantized_model(x)",
    setup= setup,
    num_threads=num_threads,
    label="quantized fbgemm model")

print(quantized_fbgemm_model_benchmark.timeit(100))

In [None]:
x86_model_copy = copy.deepcopy(baseline_model_copy)
x86_backend_quantized_model = quantize_model_with_different_backend(x86_model_copy, "x86")

# benchmark static quantized model
setup = '''
import torch
from __main__ import x86_backend_quantized_model

x = torch.rand(1, 3, 640, 480)
'''
quantized_x86_model_benchmark = benchmark.Timer(
    stmt = "x86_backend_quantized_model(x)",
    setup= setup,
    num_threads=num_threads,
    label="quantized x86 model")

print(quantized_x86_model_benchmark.timeit(100))

There is some minor speedup in inference, but it does not seem to be consistent or significant. I could spend more effort looking into other optimization techniques available in PyTorch, and there is most likely something I'm missing, but I'll put quantization and the other PyTorch optimization techniques in the backlog for now and look into ONNX & OpenVINO.

Notes on what else I can try:   
- Look into quantization-aware training.
- Train a custom, more lightweight model for object detection (requires data collection and labelling effort).
- Try calibrating the model before applying quantization.
- Use classifiers with quantized weights and retrain for object detection.
- Look into other model sources or ML platforms (TensorFlow?).
- Research some more, or ask the PyTorch forum for further guidance.

### Onnx

ONNX (Open Neural Network Exchange) is an open standard format designed to represent machine learning models. It facilitates the sharing and deployment of models across various frameworks, platforms and tools. Key features of ONNX:

- It enables models to be trained in one framework and then deployed in another.
- It works with many existing model frameworks, and its own native runtime can optimize and execute models efficiently.
- It provides optimizers to improve performance on target hardware or specialised accelerators.

ONNX's ecosystem includes:
- a runtime inference engine,
- a model zoo consisting of a collection of pre-trained models in ONNX format,
- a model converter, which converts models into the ONNX format.

Typical use cases for ONNX include model deployment, optimization and portability.



In [9]:
# install onnxruntime and onnx (docs claims it should be installed with pytorch)
# convert the baseline model into onnx format
torch.onnx.export(baseline_model,
                  torch.rand(1, 3, 640, 480).to("cpu"),
                  "baseline_model.onnx")


In [10]:
# load the onnx model
import onnx

baseline_model_onnx = onnx.load("baseline_model.onnx")
onnx.checker.check_model(baseline_model_onnx)


In [12]:
# create an inference session and run the model using random input
import onnxruntime as ort

random_input = torch.rand(1, 3, 640, 480).to("cpu")
ort_sess = ort.InferenceSession("baseline_model.onnx")
output = ort_sess.run(None, {"images": random_input.numpy()})

print(output)

[array([[ 12.629723 ,  16.061035 , 470.73523  , 629.4593   ],
       [ 11.8641815,  14.589386 , 471.1045   , 630.06995  ],
       [ 13.892509 ,   3.1773682, 469.0436   , 638.76     ],
       ...,
       [435.1275   ,  64.93329  , 453.64447  ,  93.64925  ],
       [413.16055  ,  52.537483 , 450.42133  ,  98.12908  ],
       [374.31528  , 321.45984  , 391.1797   , 353.82904  ]],
      dtype=float32), array([0.05734642, 0.03076115, 0.02796767, 0.02418317, 0.02380787,
       0.02320855, 0.02276548, 0.02227207, 0.02121965, 0.02090928,
       0.02090814, 0.02069603, 0.02061966, 0.01978374, 0.01970458,
       0.01952676, 0.01912172, 0.01906435, 0.01887483, 0.01878311,
       0.01864178, 0.01831578, 0.0182779 , 0.01819321, 0.01763916,
       0.01762065, 0.01730298, 0.01702949, 0.01693076, 0.01687463,
       0.0168498 , 0.01670376, 0.01670239, 0.01667927, 0.0166126 ,
       0.01647821, 0.01642449, 0.01641004, 0.01631321, 0.01630311,
       0.0161071 , 0.01598804, 0.01597957, 0.01591544, 0.01588

In [None]:
# compare the onnx model against the original baseline model
#

# OpenVino