## CPU Optimized

In [1]:
%env CUDA_VISIBLE_DEVICES=-1

env: CUDA_VISIBLE_DEVICES=-1
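Outside a notebook, the same setting can be applied from plain Python. This is a minimal sketch, assuming it runs before any CUDA-aware library (such as `torch`) initializes CUDA, since the variable is read at that point:

```python
import os

# Hide all GPUs from CUDA-aware libraries.
# Assumption: this executes before CUDA is initialized in the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

print(os.environ["CUDA_VISIBLE_DEVICES"])
```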


### Scenario 1 - No accuracy drop

First we load the model and optimize it using the Speedster API:

In [5]:
import torch
import torchvision.models as models
from speedster import optimize_model, save_model, load_model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load a ResNet-50 as an example
model = models.resnet50().to(device)

# Provide sample input data for the model: a list of ((inputs,), label) tuples
input_data = [((torch.randn(1, 3, 256, 256), ), torch.tensor([0]))]

# Run Speedster optimization
optimized_model = optimize_model(
    model,
    input_data=input_data,
    optimization_time="unconstrained",
    device="cpu",
)

# Try the optimized model
x = torch.randn(1, 3, 256, 256).to(device)
model.eval()
res_optimized = optimized_model(x)
res_original = model(x)

2023-02-11 11:28:04 | INFO     | Running Speedster on CPU
2023-02-11 11:28:08 | INFO     | Benchmark performance of original model
2023-02-11 11:28:13 | INFO     | Original model latency: 0.044352409839630125 sec/iter
2023-02-11 11:28:14 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-02-11 11:28:15 | INFO     | Optimized model latency: 0.03699028491973877 sec/iter
2023-02-11 11:28:15 | INFO     | Optimizing with DeepSparseCompiler and q_type: None.
2023-02-11 11:28:17 | INFO     | Optimized model latency: 0.015450000762939453 sec/iter
2023-02-11 11:28:17 | INFO     | Optimizing with ONNXCompiler and q_type: None.
2023-02-11 11:28:19 | INFO     | Optimized model latency: 0.023233652114868164 sec/iter
2023-02-11 11:28:19 | INF



[ INFO ] The model was converted to IR v11, the latest model format that corresponds to the source DL framework input/output format. While IR v11 is backwards compatible with OpenVINO Inference Engine API v1.0, please use API v2.0 (as of 2022.1) to take advantage of the latest improvements in IR v11.
Find more information about API v2.0 and IR v11 at https://docs.openvino.ai/latest/openvino_2_0_transition_guide.html
[ SUCCESS ] Generated IR version 11 model.
[ SUCCESS ] XML file: /tmp/tmp__v474zc/fp32/temp.xml
[ SUCCESS ] BIN file: /tmp/tmp__v474zc/fp32/temp.bin

[Speedster results on 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz]
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Metric      ┃ Original Model   ┃ Optimized Model   ┃ Improvement   ┃
┣━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━┫
┃ backend     ┃ PYTORCH          ┃ DeepSparse        ┃               ┃
┃ latency     ┃ 0.0444 sec/batch ┃ 0.0155 sec/batch  ┃ 2.87x         ┃
┃ thr

We can inspect the optimized model to see which compiler produced the fastest result:

In [7]:
optimized_model

PytorchDeepSparseInferenceLearner(network_parameters=ModelParams(batch_size=1, input_infos=[<nebullvm.tools.base.InputInfo object at 0x7f2202bc3a60>], output_sizes=[(1000,)], dynamic_info=None), input_tfms=None, device=<Device.CPU: 'cpu'>)
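The selection can be cross-checked by hand from the latencies in the optimization log above. The sketch below is only an illustration of that comparison, not Speedster's internal logic, and it covers only the compilers whose latencies are visible in the (truncated) log:

```python
# Latencies in sec/iter, copied from the Scenario 1 optimization log.
latencies = {
    "PytorchBackendCompiler": 0.03699028491973877,
    "DeepSparseCompiler": 0.015450000762939453,
    "ONNXCompiler": 0.023233652114868164,
}
original_latency = 0.044352409839630125

# Pick the compiler with the lowest latency and compare it to the original model.
fastest = min(latencies, key=latencies.get)
speedup = original_latency / latencies[fastest]
print(f"{fastest}: {speedup:.2f}x faster than the original model")
```

The result agrees with the `DeepSparse` backend and the 2.87x improvement shown in the results table.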

Next, let's compare the performance of the original and optimized models:

In [8]:
from nebullvm.tools.benchmark import benchmark

In [10]:
# Set the model to eval mode and move it to the available device

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.eval()
model = model.to(device)

In [11]:
benchmark(model, input_data, device='cpu')

2023-02-11 11:30:44 | INFO     | Running benchmark on CPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:02<00:00, 21.38it/s]
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [00:48<00:00, 20.48it/s]

Batch size: 1
Average Throughput: 20.62 data/second
Average Latency: 0.0485 seconds/data
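The output shows a warm-up phase followed by a timed phase. As a rough illustration of that pattern (not the actual `nebullvm` implementation), a benchmark loop could look like the sketch below; `simple_benchmark` and the stand-in workload are hypothetical names introduced here:

```python
import time
from statistics import mean


def simple_benchmark(fn, n_warmup=50, n_runs=1000):
    """Return (average latency in seconds, throughput in calls/second) for fn."""
    # Warm-up calls are not timed: they absorb one-off costs such as
    # lazy initialization and cache population.
    for _ in range(n_warmup):
        fn()

    # Time each call individually and average the results.
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)

    avg_latency = mean(timings)
    return avg_latency, 1.0 / avg_latency


# Stand-in workload; in the notebook this would be a call to the model.
latency, throughput = simple_benchmark(
    lambda: sum(range(10_000)), n_warmup=5, n_runs=50
)
print(f"Average Latency: {latency:.6f} seconds/call")
print(f"Average Throughput: {throughput:.2f} calls/second")
```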





In [12]:
benchmark(optimized_model, input_data, device='cpu')

2023-02-11 11:31:36 | INFO     | Running benchmark on CPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:01<00:00, 45.70it/s]
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [00:20<00:00, 49.20it/s]

Batch size: 1
Average Throughput: 49.62 data/second
Average Latency: 0.0202 seconds/data
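At batch size 1, the reported throughput and latency are approximately reciprocals of each other. The quick check below uses the figures printed by the two benchmark runs; the small mismatch for the optimized model is consistent with rounding of the displayed latency, though that explanation is an assumption:

```python
batch_size = 1

# (average latency in seconds/data, reported throughput in data/second)
reported = {
    "original": (0.0485, 20.62),
    "optimized": (0.0202, 49.62),
}

# Throughput implied by the latency alone.
implied_throughput = {
    name: batch_size / latency for name, (latency, _) in reported.items()
}

for name, (_, throughput) in reported.items():
    print(
        f"{name}: implied {implied_throughput[name]:.2f} data/second, "
        f"reported {throughput:.2f}"
    )
```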





### Scenario 2 - Accuracy drop

In this scenario, we allow a maximum accuracy drop of 2% (`metric_drop_ths=0.02`), which lets Speedster also try quantization techniques.
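To make the threshold concrete, here is a minimal sketch of how a tolerance of this kind could be checked. It is an illustration only, not Speedster's internal logic: the `accuracy` and `within_drop_threshold` helpers and the toy predictions are hypothetical, and the sketch assumes the drop is measured as an absolute difference in accuracy.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the reference labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def within_drop_threshold(baseline_acc, optimized_acc, max_drop=0.02):
    """True if the optimized model loses at most max_drop accuracy."""
    return (baseline_acc - optimized_acc) <= max_drop


# Toy data: 100 labels; the baseline is always right,
# the optimized model gets one sample wrong.
labels = [i % 10 for i in range(100)]
baseline_preds = list(labels)
optimized_preds = list(labels)
optimized_preds[0] = (labels[0] + 1) % 10

base_acc = accuracy(baseline_preds, labels)
opt_acc = accuracy(optimized_preds, labels)
print(base_acc, opt_acc, within_drop_threshold(base_acc, opt_acc))
```

Under this assumption, a candidate that loses one sample out of 100 would be accepted, while one that loses five would be rejected.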

In [13]:
import torch
import torchvision.models as models
from speedster import optimize_model

# Load a ResNet-50 as an example
model = models.resnet50().to(device)

# Provide 100 random input samples for the model
input_data = [((torch.randn(1, 3, 256, 256), ), torch.tensor([0])) for _ in range(100)]

# Run Speedster optimization, allowing an accuracy drop of at most 2%
optimized_model = optimize_model(
    model,
    input_data=input_data,
    optimization_time="unconstrained",
    metric="accuracy",
    metric_drop_ths=0.02,
    device="cpu",
)

# Try the optimized model
x = torch.randn(1, 3, 256, 256).to(device)
res = optimized_model(x)

2023-02-11 11:46:09 | INFO     | Running Speedster on CPU
2023-02-11 11:46:17 | INFO     | Benchmark performance of original model
2023-02-11 11:46:22 | INFO     | Original model latency: 0.040916502475738525 sec/iter
2023-02-11 11:46:24 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-02-11 11:46:26 | INFO     | Optimized model latency: 0.036638498306274414 sec/iter
2023-02-11 11:46:26 | INFO     | Optimizing with PytorchBackendCompiler and q_type: QuantizationType.DYNAMIC.
2023-02-11 11:46:29 | INFO     | Optimized model latency: 0.03601813316345215 sec/iter
2023-02-11 11:46:29 | INFO     | Optimizing with PytorchBackendCompiler and q_type: QuantizationType.STATIC.
2023-02-11 11:46:52 | INFO     | Optimized model latency: 0.01111173629760742

2023-02-11 11:46:54 [INFO] Because both eval_dataloader_cfg and user-defined eval_func are None, automatically setting 'tuning.exit_policy.performance_only = True'.
2023-02-11 11:46:54 [INFO] Generate a fake evaluation function.
2023-02-11 11:46:54 [INFO] Pass query framework capability elapsed time: 137.24 ms
2023-02-11 11:46:54 [INFO] Get FP32 model baseline.
2023-02-11 11:46:54 [INFO] Save tuning history to /home/venom/repo/nebullvm/notebooks/speedster/pytorch/nc_workspace/2023-02-11_11-26-03/./history.snapshot.
2023-02-11 11:46:54 [INFO] FP32 baseline is: [Accuracy: 1.0000, Duration (seconds): 0.0000]
2023-02-11 11:46:55 [INFO] |******Mixed Precision Statistics******|
2023-02-11 11:46:55 [INFO] +---------------+-----------+----------+
2023-02-11 11:46:55 [INFO] |    Op Type    |   Total   |   INT8   |
2023-02-11 11:46:55 [INFO] +---------------+-----------+----------+
2023-02-11 11:46:55 [INFO] |     Linear    |     1     |    1     |
2023-02-11 11:46:55 [INFO] +---------------+---

2023-02-11 11:46:57 | INFO     | Optimized model latency: 0.04393506050109863 sec/iter
2023-02-11 11:46:57 | INFO     | Optimizing with IntelNeuralCompressorCompiler and q_type: QuantizationType.STATIC.


2023-02-11 11:46:57 [INFO] Pass query framework capability elapsed time: 134.41 ms
2023-02-11 11:46:57 [INFO] Get FP32 model baseline.
2023-02-11 11:47:00 [INFO] Save tuning history to /home/venom/repo/nebullvm/notebooks/speedster/pytorch/nc_workspace/2023-02-11_11-26-03/./history.snapshot.
2023-02-11 11:47:00 [INFO] FP32 baseline is: [Accuracy: 0.0000, Duration (seconds): 3.5599]
2023-02-11 11:47:12 [INFO] |******Mixed Precision Statistics******|
2023-02-11 11:47:12 [INFO] +----------------------+-------+-------+
2023-02-11 11:47:12 [INFO] |       Op Type        | Total |  INT8 |
2023-02-11 11:47:12 [INFO] +----------------------+-------+-------+
2023-02-11 11:47:12 [INFO] | quantize_per_tensor  |   1   |   1   |
2023-02-11 11:47:12 [INFO] |      ConvReLU2d      |   33  |   33  |
2023-02-11 11:47:12 [INFO] |      MaxPool2d       |   1   |   1   |
2023-02-11 11:47:12 [INFO] |        Conv2d        |   20  |   20  |
2023-02-11 11:47:12 [INFO] |       add_relu       |   16  |   16  |
2023

2023-02-11 11:47:14 | INFO     | Optimized model latency: 0.013910055160522461 sec/iter
2023-02-11 11:47:14 | INFO     | Optimizing with ONNXCompiler and q_type: None.
2023-02-11 11:47:17 | INFO     | Optimized model latency: 0.03476357460021973 sec/iter
2023-02-11 11:47:17 | INFO     | Optimizing with ONNXCompiler and q_type: QuantizationType.HALF.
2023-02-11 11:47:24 | INFO     | Optimized model latency: 0.08549785614013672 sec/iter
2023-02-11 11:47:24 | INFO     | Optimizing with ONNXCompiler and q_type: QuantizationType.DYNAMIC.
2023-02-11 11:47:30 | INFO     | Optimized model latency: 0.031424760818481445 sec/iter
2023-02-11 11:47:30 | INFO     | Optimizing with ONNXCompiler and q_type: QuantizationType.STATIC.
2023-02-11 11:47:38 | INFO     | Optimized model late



[ INFO ] The model was converted to IR v11, the latest model format that corresponds to the source DL framework input/output format. While IR v11 is backwards compatible with OpenVINO Inference Engine API v1.0, please use API v2.0 (as of 2022.1) to take advantage of the latest improvements in IR v11.
Find more information about API v2.0 and IR v11 at https://docs.openvino.ai/latest/openvino_2_0_transition_guide.html
[ SUCCESS ] Generated IR version 11 model.
[ SUCCESS ] XML file: /tmp/tmpygrxxiw9/fp32/temp.xml
[ SUCCESS ] BIN file: /tmp/tmpygrxxiw9/fp32/temp.bin
2023-02-11 11:47:46 | INFO     | Optimized model latency: 0.02113056182861328 sec/iter
2023-02-11 11:47:46 | INFO     | Optimizing with OpenVINOCompiler and q_type: QuantizationType.STATIC.
[ INFO ] The model was converted to IR v11, the latest model format that corresponds to the source DL framework input/output format. While IR v11 is backwards compatible with OpenVINO Inferen

In [17]:
# Set the model to eval mode and move it to the available device
model.eval()
model = model.to(device)

In [15]:
benchmark(model, input_data, device='cpu')

2023-02-11 11:48:16 | INFO     | Running benchmark on CPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:02<00:00, 18.63it/s]
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [00:42<00:00, 23.37it/s]

Batch size: 1
Average Throughput: 23.53 data/second
Average Latency: 0.0425 seconds/data





In [16]:
benchmark(optimized_model, input_data, device='cpu')

2023-02-11 11:49:01 | INFO     | Running benchmark on CPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:00<00:00, 145.15it/s]
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [00:07<00:00, 142.73it/s]

Batch size: 1
Average Throughput: 144.27 data/second
Average Latency: 0.0069 seconds/data
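The benchmark figures from both scenarios can be turned into speedup factors with a few lines of arithmetic. The latencies below are copied from the benchmark outputs above; the dictionary keys are labels chosen here for readability:

```python
# (original latency, optimized latency) in seconds/data
benchmarks = {
    "Scenario 1 (no accuracy drop)": (0.0485, 0.0202),
    "Scenario 2 (accuracy drop <= 2%)": (0.0425, 0.0069),
}

# Speedup = original latency / optimized latency.
speedups = {name: orig / opt for name, (orig, opt) in benchmarks.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x lower latency")
```

On this machine, allowing a 2% accuracy drop raises the measured speedup from roughly 2.4x to roughly 6.2x.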



