## GPU Optimized (Google Colab)

In [1]:
%env CUDA_VISIBLE_DEVICES=0

env: CUDA_VISIBLE_DEVICES=0


In [None]:
!pip install speedster

In [None]:
!python -m nebullvm.installers.auto_installer --compilers all

In [None]:
!pip install pillow==9.0.1

### Scenario 1 - No accuracy drop

First we load the model and optimize it using the Speedster API:

In [2]:
import torch
import torchvision.models as models
from speedster import optimize_model, save_model, load_model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load a ResNet-50 as an example
model = models.resnet50().to(device)

# Provide sample input data for the model: a list of ((inputs,), label) tuples
input_data = [((torch.randn(1, 3, 256, 256), ), torch.tensor([0]))]

# Run Speedster optimization
optimized_model = optimize_model(
  model, input_data=input_data, optimization_time="unconstrained")

# Try the optimized model
x = torch.randn(1, 3, 256, 256).to(device)
model.eval()
res_optimized = optimized_model(x)
res_original = model(x)

2023-02-11 06:42:05 | INFO     | Running Speedster on GPU
2023-02-11 06:42:15 | INFO     | Benchmark performance of original model
2023-02-11 06:42:17 | INFO     | Original model latency: 0.013446285724639892 sec/iter
2023-02-11 06:42:19 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-02-11 06:42:32 | INFO     | Optimized model latency: 0.014631271362304688 sec/iter
2023-02-11 06:42:32 | INFO     | Optimizing with PytorchBackendCompiler and q_type: QuantizationType.HALF.
2023-02-11 06:42:33 | INFO     | Optimizing with PyTorchTensorRTCompiler and q_type: None.
2023-02-11 06:42:54 | INFO     | Optimized model latency: 0.005392551422119141 sec/iter
2023-02-11 06:42:54 | INFO     | Optimizing with PyTorchTensorRTCompiler and q_type: Quantization

Displaying the optimized model shows which compiler Speedster selected as the fastest:

In [3]:
optimized_model

PytorchONNXTensorRTInferenceLearner(network_parameters=ModelParams(batch_size=1, input_infos=[<nebullvm.tools.base.InputInfo object at 0x7fdbe6e5bdf0>], output_sizes=[(1000,)], dynamic_info=None), input_tfms=<nebullvm.tools.transformations.MultiStageTransformation object at 0x7fdbe48af9a0>, device=<Device.GPU: 'gpu'>)

Next, let's compare the performance of the original and the optimized model:

In [4]:
from nebullvm.tools.benchmark import benchmark

In [5]:
# Set the model to eval mode and move it to the available device

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.eval()
model = model.to(device)

In [6]:
benchmark(model, input_data)

2023-02-11 06:44:40 | INFO     | Running benchmark on GPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:00<00:00, 103.99it/s]
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [00:13<00:00, 72.05it/s]

Batch size: 1
Average Throughput: 74.01 data/second
Average Latency: 0.0135 seconds/data
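The output above reports a warm-up phase followed by timed iterations, from which an average latency and throughput are derived. The idea can be sketched in plain Python as below. This is only an illustration under stated assumptions, not the nebullvm implementation: `simple_benchmark` and its parameters are hypothetical names, and the workload is a stand-in for a model call.

```python
import time

def simple_benchmark(fn, n_warmup=50, n_runs=1000):
    """Hypothetical helper: return (avg latency in s, throughput in calls/s) for fn()."""
    # Warm-up calls are executed but not timed, so one-off costs
    # (caches, lazy initialization) do not skew the measurement.
    for _ in range(n_warmup):
        fn()

    # Time all measured iterations together and average afterwards.
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    elapsed = time.perf_counter() - start

    avg_latency = elapsed / n_runs
    throughput = n_runs / elapsed
    return avg_latency, throughput

# Stand-in workload; with a real model this could be `lambda: model(x)`.
latency, throughput = simple_benchmark(lambda: sum(range(1000)), n_warmup=5, n_runs=100)
print(f"Average Latency: {latency:.6f} seconds/call")
print(f"Average Throughput: {throughput:.2f} calls/second")
```

When timing GPU code by hand, keep in mind that CUDA kernels are launched asynchronously, so a call such as `torch.cuda.synchronize()` is needed before reading the clock.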





In [7]:
benchmark(optimized_model, input_data)

2023-02-11 06:44:54 | INFO     | Running benchmark on GPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:00<00:00, 263.85it/s]
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [00:03<00:00, 252.06it/s]

Batch size: 1
Average Throughput: 255.90 data/second
Average Latency: 0.0039 seconds/data
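To turn the two benchmark outputs into a single speedup figure, the reported numbers can be divided directly. The snippet below is a small sketch that copies the values printed above; the dictionary names are chosen here for illustration only.

```python
# Values copied from the two benchmark outputs above.
original = {"throughput": 74.01, "latency": 0.0135}
optimized = {"throughput": 255.90, "latency": 0.0039}

# Higher throughput is better, lower latency is better,
# so the two ratios are taken in opposite directions.
throughput_gain = optimized["throughput"] / original["throughput"]
latency_gain = original["latency"] / optimized["latency"]

print(f"Throughput gain: {throughput_gain:.2f}x")  # 3.46x
print(f"Latency gain: {latency_gain:.2f}x")        # 3.46x
```

Both ratios agree at roughly 3.46x for this run, with no accuracy drop allowed.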





### Scenario 2 - Accuracy drop

In this scenario, we allow an accuracy drop of at most 2% by passing `metric="accuracy"` and `metric_drop_ths=0.02` to `optimize_model`.
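To make the idea of a drop threshold concrete, here is a minimal sketch of how a candidate model could be accepted or rejected against a 2% limit. It is an assumption-laden illustration, not Speedster's internal logic: the helper names `accuracy` and `within_drop_threshold` are hypothetical, and the drop is assumed to be an absolute difference in accuracy.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def within_drop_threshold(baseline_acc, optimized_acc, metric_drop_ths=0.02):
    """Hypothetical check: accept if the absolute accuracy drop stays within the limit."""
    return (baseline_acc - optimized_acc) <= metric_drop_ths

labels          = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
baseline_preds  = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]  # all 10 correct
optimized_preds = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]  # 9 of 10 correct

baseline_acc = accuracy(baseline_preds, labels)
optimized_acc = accuracy(optimized_preds, labels)

# A 10-point drop exceeds the 2% limit, so this candidate would be rejected.
print(within_drop_threshold(baseline_acc, optimized_acc))  # False
# A 1-point drop is within the limit and would be accepted.
print(within_drop_threshold(1.0, 0.99))  # True
```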

In [8]:
import torch
import torchvision.models as models
from speedster import optimize_model

# Load a ResNet-50 as an example
model = models.resnet50().to(device)

# Provide 100 random input samples for the model
input_data = [((torch.randn(1, 3, 256, 256), ), torch.tensor([0])) for _ in range(100)]

# Run Speedster optimization, allowing an accuracy drop of at most 2%
optimized_model = optimize_model(
  model,
  input_data=input_data,
  optimization_time="unconstrained",
  metric="accuracy",
  metric_drop_ths=0.02,
)

# Try the optimized model
x = torch.randn(1, 3, 256, 256).to(device)
res = optimized_model(x)

2023-02-11 06:44:59 | INFO     | Running Speedster on GPU
2023-02-11 06:45:03 | INFO     | Benchmark performance of original model
2023-02-11 06:45:05 | INFO     | Original model latency: 0.01355921745300293 sec/iter
2023-02-11 06:45:08 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-02-11 06:45:13 | INFO     | Optimized model latency: 0.00795292854309082 sec/iter
2023-02-11 06:45:13 | INFO     | Optimizing with PytorchBackendCompiler and q_type: QuantizationType.HALF.
2023-02-11 06:45:25 | INFO     | Optimized model latency: 0.014817476272583008 sec/iter
2023-02-11 06:45:25 | INFO     | Optimizing with PyTorchTensorRTCompiler and q_type: None.
2023-02-11 06:45:37 | INFO     | Optimized model latency: 0.00420689582824707 sec/iter

In [9]:
# Set the model to eval mode and move it to the available device
model.eval()
model = model.to(device)

In [10]:
benchmark(model, input_data)

2023-02-11 06:50:19 | INFO     | Running benchmark on GPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:00<00:00, 104.68it/s]
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [00:13<00:00, 74.53it/s]

Batch size: 1
Average Throughput: 76.19 data/second
Average Latency: 0.0131 seconds/data





In [11]:
benchmark(optimized_model, input_data)

2023-02-11 06:50:33 | INFO     | Running benchmark on GPU


Performing warm up on 50 iterations: 100%|██████████| 50/50 [00:00<00:00, 1117.78it/s]
Performing benchmark on 1000 iterations: 100%|██████████| 1000/1000 [00:00<00:00, 1011.71it/s]

Batch size: 1
Average Throughput: 1031.41 data/second
Average Latency: 0.0010 seconds/data



