# Notebook to experiment with optimizing a model for inference

In [20]:
# import dependencies
import torch
import cv2
import numpy as np
from time import time
from torchvision.models import detection
import torch.utils.benchmark as benchmark

print(torch.__version__)

2.3.1+cu121


In [21]:
# load baseline model for optimization and benchmarking...
baseline_model = detection.ssdlite320_mobilenet_v3_large(weights=detection.SSDLite320_MobileNet_V3_Large_Weights.DEFAULT)
baseline_model.eval()

SSD(
  (backbone): SSDLiteFeatureExtractorMobileNet(
    (features): Sequential(
      (0): Sequential(
        (0): Conv2dNormActivation(
          (0): Conv2d(3, 16, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
          (1): BatchNorm2d(16, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
          (2): Hardswish()
        )
        (1): InvertedResidual(
          (block): Sequential(
            (0): Conv2dNormActivation(
              (0): Conv2d(16, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), groups=16, bias=False)
              (1): BatchNorm2d(16, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
              (2): ReLU(inplace=True)
            )
            (1): Conv2dNormActivation(
              (0): Conv2d(16, 16, kernel_size=(1, 1), stride=(1, 1), bias=False)
              (1): BatchNorm2d(16, eps=0.001, momentum=0.03, affine=True, track_running_stats=True)
            )
          )
        )
        (2): Invert

In [22]:
# number of threads to use when benchmarking
num_threads = torch.get_num_threads()
print(f"Benchmarking on {num_threads} threads")

Benchmarking on 2 threads


### Benchmarking Baseline Model
- Attempt: Benchmark the baseline model, then try reducing the size of the input image used for the benchmark

In [23]:
# benchmark model
setup = '''
import torch
from __main__ import baseline_model

x = torch.rand(1, 3, 640, 480)
'''

baseline_benchmark = benchmark.Timer(
    stmt = "baseline_model(x)",
    setup= setup,
    num_threads=num_threads,
    label="Baseline model")

print(baseline_benchmark.timeit(100))

<torch.utils.benchmark.utils.common.Measurement object at 0x7e2c35cc46d0>
Baseline model
setup:
  import torch
  from __main__ import baseline_model

  x = torch.rand(1, 3, 640, 480)

  554.66 ms
  1 measurement, 100 runs , 2 threads


In [24]:
reduced_resolution_setup = '''
import torch
from __main__ import baseline_model

x = torch.rand(1, 3, 320, 240)
'''

baseline_benchmark_reduced_resolution =  benchmark.Timer(
    stmt = "baseline_model(x)",
    setup= reduced_resolution_setup,
    num_threads=num_threads,
    label="Baseline model",
    sub_label="reduced resolution")

print(baseline_benchmark_reduced_resolution.timeit(100))

<torch.utils.benchmark.utils.common.Measurement object at 0x7e2c35a13be0>
Baseline model: reduced resolution
setup:
  import torch
  from __main__ import baseline_model

  x = torch.rand(1, 3, 320, 240)

  534.61 ms
  1 measurement, 100 runs , 2 threads
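Each figure above comes from a single `timeit(100)` measurement, so it is hard to judge run-to-run noise. A minimal sketch of a more repeatable comparison, assuming a small stand-in network (`toy_model` is hypothetical, not the SSD model) and using `blocked_autorange`, which keeps measuring until a minimum run time has been reached:

```python
import torch
import torch.utils.benchmark as benchmark

# Hypothetical stand-in for the detection model, kept tiny so the sketch runs quickly.
toy_model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3, padding=1),
    torch.nn.ReLU(),
).eval()

measurements = []
for height, width in [(64, 48), (32, 24)]:
    timer = benchmark.Timer(
        # no_grad avoids building autograd state during inference timing
        stmt="with torch.no_grad():\n    toy_model(x)",
        globals={
            "torch": torch,
            "toy_model": toy_model,
            "x": torch.rand(1, 3, height, width),
        },
        num_threads=torch.get_num_threads(),
        label="toy model",
        sub_label=f"{height}x{width}",
    )
    # Collect several timed blocks instead of one fixed-count measurement.
    measurements.append((f"{height}x{width}", timer.blocked_autorange(min_run_time=0.2)))

for name, m in measurements:
    print(f"{name}: median {m.median * 1e3:.3f} ms over {len(m.times)} blocks")
```

Reporting the median over several blocks makes it easier to tell a real difference from noise when two runs are only a few percent apart.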


### Torch Scripting
General points to note: scripting and tracing both apply optimizations to the model to improve inference speed in production environments. Tracing effectively freezes the model's conditional logic to match the data given during tracing, whereas scripting keeps the control flow. There are more subtle differences between the two approaches.
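To make the difference concrete, here is a small sketch with a hypothetical module (`DecisionGate` is not part of the detection model) whose output depends on a data-dependent branch. The traced version keeps whichever branch the example input took, while the scripted version re-evaluates the condition on every call:

```python
import warnings

import torch


class DecisionGate(torch.nn.Module):
    """Hypothetical module with a data-dependent branch."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if x.sum() > 0:
            return x + 1
        else:
            return x - 1


gate = DecisionGate().eval()
positive = torch.ones(3)
negative = -torch.ones(3)

with warnings.catch_warnings():
    # Tracing warns that the branch depends on tensor data; silenced for the demo.
    warnings.simplefilter("ignore")
    traced = torch.jit.trace(gate, positive)
scripted = torch.jit.script(gate)

traced_out = traced(negative)      # replays the `x + 1` branch recorded for `positive`
scripted_out = scripted(negative)  # evaluates the condition and takes `x - 1`
print(traced_out, scripted_out)
```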

In [None]:
# Apply torch scripting...
scripted_model = torch.jit.script(baseline_model.eval())
scripted_model.eval()

In [None]:
# benchmark the performance of the model
setup = '''
import torch
from __main__ import scripted_model

x = torch.rand(3, 640, 480)
'''
scripted_model_benchmark = benchmark.Timer(
    stmt = "scripted_model([x])", # note: I had to pass the input as a list of tensors
    setup= setup,
    num_threads=num_threads,
    label="Scripted baseline model")

print(scripted_model_benchmark.timeit(1000))

Scripting the model gives some improvement in inference time, but the speedup is not consistent across multiple runs. As a side note, I also tried reducing the resolution of the input image passed to the model to see if it improves inference time; it doesn't result in any significant speedup (554.66 ms for the `(1, 3, 640, 480)` input vs 534.61 ms for `(1, 3, 320, 240)` on the baseline). One likely reason, as far as I can tell, is that torchvision's SSDLite320 transform resizes every input to 320x320 before it reaches the backbone, so the input resolution mostly affects the resize step. The results reported above were obtained after only briefly going through notes on TorchScript, so there might be more details or approaches to scripting and tracing that improve inference time.

I'll try other approaches:
- quantization
- experimenting with ONNX Runtime

https://www.reddit.com/r/MachineLearning/comments/yg1mpz/d_how_to_get_the_fastest_pytorch_inference_and/

As a side note, or maybe more of a concluding thought on scripting as an optimization approach:   
The main advantage of a scripted model is that it can run independently of a Python environment. Scripted models are designed to be flexible and portable, enabling deployment into non-Python environments. Scripting is not necessarily a technique for speeding up inference; it's more focused on flexible and portable deployment.
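A minimal sketch of that portability, assuming a hypothetical `TinyNet` rather than the detection model: the scripted module is serialized with `torch.jit.save` and restored with `torch.jit.load` without needing the original Python class definition. An in-memory buffer stands in for a file on disk here.

```python
import io

import torch


class TinyNet(torch.nn.Module):
    """Hypothetical small network used only for illustration."""

    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.relu(self.fc(x))


scripted = torch.jit.script(TinyNet().eval())

# Serialize the scripted module (code and weights) into a buffer.
buffer = io.BytesIO()
torch.jit.save(scripted, buffer)
buffer.seek(0)

# The archive can be loaded back without access to the TinyNet class.
restored = torch.jit.load(buffer)

x = torch.rand(1, 4)
same_output = torch.allclose(scripted(x), restored(x))
print(same_output)
```

The same archive format is what a non-Python runtime such as LibTorch would load.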


### Quantization
(More of a summary of notes in model optimization.md)

Quantization is an optimization technique that can be applied to a model to reduce its size and increase its inference speed (with the caveat that we keep *about* the same accuracy as the original model). At a glance, quantization converts high-precision numbers (e.g. 32-bit floats) into lower-precision ones (e.g. 8-bit integers); the result is a reduction in model size and computational cost, ideally with only a negligible loss in accuracy. Three [approaches](https://pytorch.org/tutorials/recipes/quantization.html) to quantization are:
- Post-training dynamic quantization (converts the model weights to 8-bit integers ahead of time, but quantizes activations on the fly during inference, just before they are used in a computation. At the time of writing, this method only supports quantizing `nn.Linear` and `nn.LSTM` modules.)
- Post-training static quantization (converts both weights and activations to 8-bit integers; there is no on-the-fly conversion of the model activations during inference)
- Quantization-aware training (from what I can gather, this approach involves quantizing/dequantizing the inputs and activations of the model; the conversion is baked into the model architecture and used as part of model training)
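The float-to-integer conversion itself can be sketched in plain Python. This is an illustrative affine (scale plus zero point) scheme with hypothetical helper names `quantize` and `dequantize`, not PyTorch's internal implementation:

```python
def quantize(values, num_bits=8):
    """Map floats to unsigned integers using a per-tensor scale and zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # The representable range must include 0.0 so that zero maps to an exact integer.
    lo, hi = min(min(values), 0.0), max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero input
    zero_point = round(qmin - lo / scale)
    quantized = [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate floats from the integer representation."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
q_weights, scale, zero_point = quantize(weights)
restored = dequantize(q_weights, scale, zero_point)

print(q_weights, scale, zero_point)
print(restored)
```

Each restored value differs from the original by at most about one quantization step (`scale`), which is where the small accuracy loss comes from.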


For my application, is it worth checking whether quantized versions of the detection models already exist? There are quantized versions of the backbone, but there are no quantized versions of the detection models.

In [30]:
print(torch.backends.quantized.supported_engines)

['qnnpack', 'none', 'onednn', 'x86', 'fbgemm']


In [None]:
# Post training static quantization
backend = "qnnpack"

baseline_model = baseline_model.to("cpu")

baseline_model.qconfig = torch.quantization.get_default_qconfig(backend)
torch.backends.quantized.engine = backend
model_static_quantized = torch.quantization.prepare(baseline_model, inplace=False)
model_static_quantized = torch.quantization.convert(model_static_quantized, inplace=False)


In [None]:
x = torch.rand(1, 3, 640, 480)

quant = torch.ao.quantization.QuantStub()
x_quantized = quant(x)
prediction = model_static_quantized(x_quantized)
print(prediction)

Understanding the error *Could not run 'quantized::conv2d.new' with arguments from the 'CPU' backend. This could be because the operator doesn't exist for this backend, or was omitted during the selective/custom build process (if using custom build).*
Looking at this [post](https://discuss.pytorch.org/t/error-in-running-quantised-model-runtimeerror-could-not-run-quantized-conv2d-new-with-arguments-from-the-cpu-backend/151718), it seems that you have to quantize the input passed to the quantized model, and the approach is to use the `QuantStub` and `DeQuantStub` modules to quantize and dequantize the input. It also seems that a different approach to quantizing specific layers in the model has been introduced (it appears to be more related to quantization-aware training), which might be worth trying.
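One possible reason the standalone `QuantStub` in the cell above did not help: as far as I understand, a `QuantStub` is only a placeholder that passes its input through unchanged until `convert` replaces it, so calling a fresh stub still returns a float tensor. A small sketch (the scale and zero point are arbitrary values chosen for illustration):

```python
import torch
from torch.ao.quantization import QuantStub

x = torch.rand(1, 3, 4, 4)

# A stub that has not gone through prepare/convert acts as an identity.
stub = QuantStub()
y = stub(x)
still_float = (y.dtype == torch.float32) and not y.is_quantized

# An explicitly quantized tensor, for comparison (illustrative scale/zero_point).
q = torch.quantize_per_tensor(x, scale=1 / 255, zero_point=0, dtype=torch.quint8)

print(still_float, q.is_quantized, q.dtype)
```

Feeding the float `y` to a converted model would hand a float tensor to a quantized convolution, which matches the error message above.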


In [25]:
# wrap model input and output, quantizing input and de-quantizing the output
class QuantizedModel(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model_fp32 = model
        self.quant = torch.quantization.QuantStub()
        self.dequant = torch.quantization.DeQuantStub()
        
    def forward(self, x):
        x = self.quant(x)
        x = self.model_fp32(x)
        x = self.dequant(x)
        return x


In [26]:
quantized_model = QuantizedModel(baseline_model)

backend = "fbgemm"
baseline_model.qconfig = torch.quantization.get_default_qconfig(backend)
torch.backends.quantized.engine = backend
model_static_quantized = torch.quantization.prepare(quantized_model, inplace=False)
model_static_quantized = torch.quantization.convert(quantized_model, inplace=False)

In [27]:
x = torch.rand(1, 3, 640, 480)

# quant = torch.ao.quantization.QuantStub()
# x_quantized = quant(x)
prediction = model_static_quantized(x)
print(prediction)

# IT WORKS !!!!!

[{'boxes': tensor([[  9.5187,  14.9876, 474.6890, 627.7271],
        [ 14.2784,   8.9762, 469.1784, 633.8606],
        [293.3202, 223.3219, 311.1457, 263.6357],
        ...,
        [ 44.2903, 270.9238,  66.3061, 321.9006],
        [  5.9281, 400.8070,  14.5601, 433.3872],
        [171.9038, 226.4181, 186.8317, 263.3245]], grad_fn=<StackBackward0>), 'scores': tensor([0.0423, 0.0382, 0.0240, 0.0238, 0.0235, 0.0220, 0.0219, 0.0217, 0.0212,
        0.0211, 0.0209, 0.0196, 0.0196, 0.0190, 0.0190, 0.0188, 0.0187, 0.0187,
        0.0186, 0.0186, 0.0184, 0.0182, 0.0182, 0.0176, 0.0175, 0.0174, 0.0173,
        0.0171, 0.0171, 0.0170, 0.0170, 0.0169, 0.0168, 0.0167, 0.0167, 0.0166,
        0.0165, 0.0165, 0.0165, 0.0165, 0.0164, 0.0164, 0.0163, 0.0162, 0.0162,
        0.0162, 0.0161, 0.0161, 0.0161, 0.0161, 0.0160, 0.0160, 0.0159, 0.0158,
        0.0158, 0.0158, 0.0157, 0.0157, 0.0156, 0.0156, 0.0156, 0.0156, 0.0155,
        0.0154, 0.0154, 0.0154, 0.0153, 0.0153, 0.0153, 0.0152, 0.0152, 0.0152

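I tried simply quantizing the inputs before calling the model's forward function, but that did not resolve the issue; wrapping the model in the `QuantizedModel` class, with the stubs inside its forward function, did the trick. Does that mean the quantization stubs need to be within the forward function of the model? One thing worth double-checking: the output above still carries `grad_fn=<StackBackward0>`, and `convert` was called on `quantized_model` rather than on the prepared copy returned by `prepare`, so the layers may not actually have been swapped for quantized ones.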
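A minimal sketch of eager-mode post-training static quantization on a hypothetical toy network (`TinyConvNet` is not the SSD model), with the stubs inside `forward`, a short calibration pass on random data, and `convert` applied to the prepared model. The backend choice simply takes whichever of `fbgemm` or `qnnpack` is available, which is an assumption made so the sketch runs on different machines:

```python
import torch
from torch import nn
from torch.ao import quantization as tq


class TinyConvNet(nn.Module):
    """Hypothetical toy model; the stubs mark where tensors enter and leave int8."""

    def __init__(self):
        super().__init__()
        self.quant = tq.QuantStub()
        self.conv = nn.Conv2d(3, 8, kernel_size=3, padding=1)
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)       # float -> quantized (after convert)
        x = self.conv(x)
        return self.dequant(x)  # quantized -> float (after convert)


engines = torch.backends.quantized.supported_engines
backend = "fbgemm" if "fbgemm" in engines else "qnnpack"
torch.backends.quantized.engine = backend

model = TinyConvNet().eval()
model.qconfig = tq.get_default_qconfig(backend)

# prepare() returns a copy with observers attached.
prepared = tq.prepare(model, inplace=False)

# Calibration: run representative data so the observers record activation ranges.
with torch.no_grad():
    for _ in range(8):
        prepared(torch.rand(1, 3, 32, 32))

# convert() must receive the prepared (observed) model.
quantized = tq.convert(prepared, inplace=False)

out = quantized(torch.rand(1, 3, 32, 32))
print(type(quantized.conv), out.dtype)
```

In a real run the calibration loop would use representative images instead of random tensors.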

In [None]:
# benchmark static quantized model
setup = '''
import torch
from __main__ import model_static_quantized

x = torch.rand(1, 3, 640, 480)
'''
quantized_model_benchmark = benchmark.Timer(
    stmt = "model_static_quantized(x)",
    setup= setup,
    num_threads=num_threads,
    label="quantized model")

print(quantized_model_benchmark.timeit(100))


<torch.utils.benchmark.utils.common.Measurement object at 0x7e2c35a113c0>
quantized model
setup:
  import torch
  from __main__ import model_static_quantized

  x = torch.rand(1, 3, 640, 480)

  545.30 ms
  1 measurement, 100 runs , 2 threads


In [33]:
# is there a difference when quantizing the model with different backends?
# available backends: ['qnnpack', 'none', 'onednn', 'x86', 'fbgemm']

def quantize_model_with_different_backend(model: torch.nn.Module, backend: str):

    quantized_model = QuantizedModel(model.to("cpu")) # not really quantized yet, but its inputs and outputs have been wrapped with quantization stubs

    quantized_model.qconfig = torch.quantization.get_default_qconfig(backend)
    torch.backends.quantized.engine = backend
    model_static_quantized = torch.quantization.prepare(quantized_model, inplace=False)
    model_static_quantized = torch.quantization.convert(quantized_model, inplace=False)

    return model_static_quantized
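One way to check whether a conversion actually swapped any layers is to count the modules that come from PyTorch's quantized namespaces. This is a sketch under the assumption that quantized module classes live in a module path containing `quantized`, which I believe holds for torch 2.x; `Wrapped` and `count_quantized_modules` are hypothetical names:

```python
import torch
from torch import nn
from torch.ao import quantization as tq


class Wrapped(nn.Module):
    """Hypothetical wrapper mirroring the stub-in, stub-out pattern."""

    def __init__(self, model):
        super().__init__()
        self.quant = tq.QuantStub()
        self.model_fp32 = model
        self.dequant = tq.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.model_fp32(self.quant(x)))


def count_quantized_modules(model: nn.Module) -> int:
    # Assumption: converted layers are defined under a "...quantized..." module path.
    return sum("quantized" in type(m).__module__ for m in model.modules())


engines = torch.backends.quantized.supported_engines
backend = "fbgemm" if "fbgemm" in engines else "qnnpack"
torch.backends.quantized.engine = backend

wrapper = Wrapped(nn.Conv2d(3, 8, kernel_size=3)).eval()
wrapper.qconfig = tq.get_default_qconfig(backend)

prepared = tq.prepare(wrapper, inplace=False)
with torch.no_grad():
    prepared(torch.rand(1, 3, 16, 16))  # minimal calibration pass

from_prepared = tq.convert(prepared, inplace=False)  # converts the observed copy
from_original = tq.convert(wrapper, inplace=False)   # converts the unprepared wrapper

print(count_quantized_modules(from_prepared), count_quantized_modules(from_original))
```

If the second count is zero, the model is still running in float and its timings should be expected to match the baseline.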



In [None]:
# qnnpack model
qnn_pack_backend_quantized_model = quantize_model_with_different_backend(baseline_model, "qnnpack")

# benchmark static quantized model
setup = '''
import torch
from __main__ import qnn_pack_backend_quantized_model

x = torch.rand(1, 3, 640, 480)
'''
quantized_qnnpack_model = benchmark.Timer(
    stmt = "qnn_pack_backend_quantized_model(x)",
    setup= setup,
    num_threads=num_threads,
    label="quantized qnnpack model")

print(quantized_qnnpack_model.timeit(100))

<torch.utils.benchmark.utils.common.Measurement object at 0x7e2c340e20b0>
quantized qnnpack model
setup:
  import torch
  from __main__ import qnn_pack_backend_quantized_model

  x = torch.rand(1, 3, 640, 480)

  676.64 ms
  1 measurement, 100 runs , 2 threads


In [None]:
# onednn backend
onednn_backend_quantized_model = quantize_model_with_different_backend(baseline_model, "onednn")

# benchmark static quantized model
setup = '''
import torch
from __main__ import onednn_backend_quantized_model

x = torch.rand(1, 3, 640, 480)
'''
quantized_onednn_model = benchmark.Timer(
    stmt = "onednn_backend_quantized_model(x)",
    setup= setup,
    num_threads=num_threads,
    label="quantized onednn model")

print(quantized_onednn_model.timeit(100))



<torch.utils.benchmark.utils.common.Measurement object at 0x7e2c2f5c5540>
quantized onednn model
setup:
  import torch
  from __main__ import qnn_pack_backend_quantized_model

  x = torch.rand(1, 3, 640, 480)

  618.82 ms
  1 measurement, 100 runs , 2 threads


In [37]:
# x86 backend
x86_backend_quantized_model = quantize_model_with_different_backend(baseline_model, "x86")

# benchmark static quantized model
setup = '''
import torch
from __main__ import x86_backend_quantized_model

x = torch.rand(1, 3, 640, 480)
'''
quantized_x86_model = benchmark.Timer(
    stmt = "x86_backend_quantized_model(x)",
    setup= setup,
    num_threads=num_threads,
    label="quantized x86 model")

print(quantized_x86_model.timeit(100))



<torch.utils.benchmark.utils.common.Measurement object at 0x7e2c3420dc30>
quantized x86 model
setup:
  import torch
  from __main__ import x86_backend_quantized_model

  x = torch.rand(1, 3, 640, 480)

  536.71 ms
  1 measurement, 100 runs , 2 threads


**Quantization backend significance**
What is the significance of the different backends, and which one should I use?   
The different backends specify how quantized operations are implemented and optimized for a specific hardware platform.   
- `fbgemm` is optimized for x86 platforms such as Intel CPUs. It uses efficient algorithms and instructions specific to x86 hardware to maximize performance. It targets x86 server CPUs and is widely used in data center environments where Intel or AMD x86 processors are predominant.

- `qnnpack` is optimized for ARM CPUs, as found in mobile devices, and is designed to offer efficient performance on smartphones and tablets, making it suitable for mobile applications. qnnpack supports 8-bit integer quantization and provides optimizations for mobile inference.

- `onednn` is backed by oneDNN, an Intel-developed library that provides optimizations for both CPU and GPU. It's used for server-side applications where Intel hardware is predominant, and it can be useful for both quantized and floating-point computations. `onednn` is also focused on x86 Intel platforms; the key difference between it and `fbgemm` is that the underlying library is more versatile, in that it provides optimizations for both CPU and GPU.

Given that I've switched from running the model locally on the Raspberry Pi to hosting it on a FastAPI server on an old Linux server, the ARM-oriented qnnpack backend is no longer the right fit, which leaves `x86`, `onednn` and `fbgemm`. Looking at the benchmarking results, the model quantized with qnnpack also had the longest inference time (676.64 ms), while fbgemm (545.30 ms) and x86 (536.71 ms) gave the *lowest inference times*, only marginally below the 554.66 ms baseline.

Note:   
- Checking the Linux laptop: running `uname -m` gave `x86_64`, which indicates that the laptop is running a 64-bit version of the x86 architecture, also known as "AMD64" or "Intel 64".
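The architecture check and the backend notes above could be combined into a small helper. This is only a sketch: `pick_quantization_backend` is a hypothetical function, and the preference order (`x86`, then `fbgemm`, then `onednn` on x86 machines, `qnnpack` elsewhere) is my assumption based on the notes and timings in this notebook, not an official recommendation.

```python
import platform


def pick_quantization_backend(supported_engines, machine=None):
    """Return a backend name suited to the CPU architecture, or None if none fits."""
    machine = (machine or platform.machine()).lower()
    if machine in {"x86_64", "amd64", "i386", "i686"}:
        preferred = ["x86", "fbgemm", "onednn"]  # assumed order for x86 hosts
    else:
        preferred = ["qnnpack"]                  # assumed choice for ARM and others
    for engine in preferred:
        if engine in supported_engines:
            return engine
    return None


# Engine list as printed earlier in this notebook.
engines = ["qnnpack", "none", "onednn", "x86", "fbgemm"]
server_choice = pick_quantization_backend(engines, machine="x86_64")
pi_choice = pick_quantization_backend(engines, machine="aarch64")
print(server_choice, pi_choice)
```

In the notebook itself, the first argument would be `torch.backends.quantized.supported_engines` and `machine` could be left unset so that `platform.machine()` is used.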


One last try: **fusing model layers**
Notes from [quantization in practice](https://pytorch.org/blog/quantization-in-practice/#backend-engine)
- Post training static quantization.
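As a starting point for the fusion experiment, here is a minimal sketch on a hypothetical Conv-BatchNorm-ReLU block (not the SSD backbone) using `fuse_modules`. In eval mode the batch norm is folded into the convolution weights, so the fused block should produce nearly the same output as the original:

```python
import torch
from torch import nn
from torch.ao.quantization import fuse_modules

# Hypothetical block; eval mode is needed so batch norm can be folded into the conv.
block = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1, bias=False),
    nn.BatchNorm2d(8),
    nn.ReLU(),
).eval()

# Fuse children "0", "1", "2" of the Sequential into a single module.
fused = fuse_modules(block, [["0", "1", "2"]], inplace=False)

x = torch.rand(1, 3, 16, 16)
with torch.no_grad():
    max_diff = (block(x) - fused(x)).abs().max().item()

print(fused)
print(max_diff)
```

After fusion the first child holds the combined conv and ReLU, and the remaining positions are replaced by `nn.Identity`, which gives a quantizer fewer separate ops to handle.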


# Onnx