## BERT GPU (Google Colab)

In [1]:
%env CUDA_VISIBLE_DEVICES=0

env: CUDA_VISIBLE_DEVICES=0


In [None]:
!pip install speedster

In [None]:
!python -m nebullvm.installers.auto_installer  --compilers all

In [None]:
!pip install pillow==9.0.1

In [None]:
!pip install protobuf==3.19.6

## Model and Dataset setup

We use BERT as the pre-trained model to optimize. Let's download both the pre-trained model and its tokenizer from the Hugging Face model hub.

In [None]:
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', torchscript=True)

# Move the model to gpu if available and set eval mode
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device).eval()

Let's create an example dataset of 100 texts, each built by joining between 1 and 30 randomly chosen sentences.

In [6]:
import random

sentences = [
    "Mars is the fourth planet from the Sun.",
    "has a crust primarily composed of elements",
    "However, it is unknown",
    "can be viewed from Earth",
    "It was the Romans",
]

len_dataset = 100

texts = []
for _ in range(len_dataset):
    n_times = random.randint(1, 30)
    texts.append(" ".join(random.choice(sentences) for _ in range(n_times)))
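Because the cell above draws from the global `random` state, each run produces different texts and therefore slightly different timings. As a minimal sketch, the same construction can be wrapped in a seeded function; the name `build_texts` and its parameters are hypothetical and not part of the original notebook.

```python
import random

def build_texts(sentences, n_texts=100, max_repeats=30, seed=0):
    """Build n_texts strings, each joining 1..max_repeats randomly chosen sentences."""
    rng = random.Random(seed)  # private generator, leaves the global random state alone
    texts = []
    for _ in range(n_texts):
        n_times = rng.randint(1, max_repeats)
        texts.append(" ".join(rng.choice(sentences) for _ in range(n_times)))
    return texts

sentences = [
    "Mars is the fourth planet from the Sun.",
    "has a crust primarily composed of elements",
    "However, it is unknown",
    "can be viewed from Earth",
    "It was the Romans",
]

texts = build_texts(sentences)
word_counts = [len(t.split()) for t in texts]
print(f"{len(texts)} texts, {min(word_counts)} to {max(word_counts)} words each")
```

Calling `build_texts` again with the same seed returns the same list, which makes benchmark runs easier to compare.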

In [7]:
encoded_inputs = [tokenizer(text, return_tensors="pt") for text in texts]

## Speed up inference with Speedster: no metric drop

Now let's improve the inference speed of the model using `Speedster`.

In [10]:
from speedster import optimize_model, save_model, load_model

Using Speedster is straightforward: call the `optimize_model` function with the model, some example input data, and the optimization time mode. Optionally, a `dynamic_info` dictionary can also be provided to support inputs with dynamic shapes.

In [11]:
dynamic_info = {
    "inputs": [
        {0: 'batch', 1: 'num_tokens'},
        {0: 'batch', 1: 'num_tokens'},
        {0: 'batch', 1: 'num_tokens'},
    ],
    "outputs": [
        {0: 'batch', 1: 'num_tokens'},
        {0: 'batch'},
    ]
}

optimized_model = optimize_model(
    model=model,
    input_data=encoded_inputs,
    optimization_time="constrained",
    ignore_compilers=["tensor_rt", "tvm"],  # TensorRT does not work for this model
    dynamic_info=dynamic_info,
)

2023-02-11 07:02:10 | INFO     | Running Speedster on GPU
2023-02-11 07:02:17 | INFO     | Benchmark performance of original model
2023-02-11 07:02:18 | INFO     | Original model latency: 0.008934900760650635 sec/iter
2023-02-11 07:02:26 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-02-11 07:03:31 | INFO     | Optimized model latency: 0.01056361198425293 sec/iter
2023-02-11 07:03:31 | INFO     | Optimizing with PytorchBackendCompiler and q_type: QuantizationType.HALF.
2023-02-11 07:03:32 | INFO     | Optimizing with ONNXCompiler and q_type: None.
2023-02-11 07:03:37 | INFO     | Optimized model latency: 0.00930929183959961 sec/iter
2023-02-11 07:03:37 | INFO     | Optimizing with ONNXCompiler and q_type: QuantizationType.HALF.

[Speedst
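The `dynamic_info` dictionary passed to `optimize_model` names the axes that may change size between calls. Its three input entries line up with the three tensors the BERT tokenizer returns (`input_ids`, `token_type_ids`, `attention_mask`), and its two output entries with the model's last hidden state and pooled output. The sketch below rebuilds the same dictionary with a small helper; `make_dynamic_info` is a hypothetical name used only for illustration, not a Speedster function.

```python
def make_dynamic_info(n_inputs, output_axes):
    """Return a dynamic_info dict laid out like the one used in this notebook."""
    token_axes = {0: "batch", 1: "num_tokens"}
    return {
        # one fresh dict per input so entries can be edited independently
        "inputs": [dict(token_axes) for _ in range(n_inputs)],
        "outputs": [dict(axes) for axes in output_axes],
    }

dynamic_info = make_dynamic_info(
    n_inputs=3,  # input_ids, token_type_ids, attention_mask
    output_axes=[
        {0: "batch", 1: "num_tokens"},  # last hidden state: batch x tokens x hidden
        {0: "batch"},                   # pooled output: batch x hidden
    ],
)
print(dynamic_info)
```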

In [12]:
import time

# Move inputs to gpu if available
encoded_inputs = [tokenizer(text, return_tensors="pt").to(device) for text in texts]

Let's run the prediction 100 times to calculate the average response time of the original model.

In [13]:
times = []

# Warmup for 30 iterations
for encoded_input in encoded_inputs[:30]:
    with torch.no_grad():
        final_out = model(**encoded_input)

# Benchmark
for encoded_input in encoded_inputs:
    st = time.time()
    with torch.no_grad():
        final_out = model(**encoded_input)
    times.append(time.time()-st)
original_model_time = sum(times)/len(times)*1000
print(f"Average response time for original BERT: {original_model_time} ms")

Average response time for original BERT: 9.37291145324707 ms
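The warm-up-then-measure pattern used in these cells can be factored into a reusable function. The sketch below is one possible shape for it; `benchmark` and its `sync` parameter are hypothetical names. The `sync` hook exists because CUDA work is launched asynchronously, so wall-clock timing on a GPU may under-report latency unless the device is synchronized before the clock is read. The stand-in workload lets the sketch run without a model or GPU.

```python
import time

def benchmark(fn, inputs, warmup=30, sync=None):
    """Return the mean latency of fn over inputs, in milliseconds."""
    def wait():
        # Optional hook that blocks until pending device work has finished
        if sync is not None:
            sync()

    # Warm-up calls are executed but not timed
    for x in inputs[:warmup]:
        fn(x)
    wait()

    times = []
    for x in inputs:
        wait()  # drain pending work before starting the clock
        start = time.perf_counter()
        fn(x)
        wait()  # make sure the call has finished before stopping the clock
        times.append(time.perf_counter() - start)
    return sum(times) / len(times) * 1000

# Stand-in workload, used only so the example is self-contained
mean_ms = benchmark(lambda n: sum(range(n)), [10_000] * 100)
print(f"Average response time: {mean_ms:.3f} ms")
```

With the notebook's objects, and assuming a CUDA device is present, a call might look like `benchmark(lambda enc: model(**enc), encoded_inputs, sync=torch.cuda.synchronize)` inside a `torch.no_grad()` block.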


Let's run the prediction 100 times to calculate the average response time of the optimized model.

In [15]:
times = []

# Warmup for 30 iterations
for encoded_input in encoded_inputs[:30]:
    with torch.no_grad():
        final_out = optimized_model(**encoded_input)

# Benchmark
for encoded_input in encoded_inputs:
    st = time.time()
    with torch.no_grad():
        final_out = optimized_model(**encoded_input)
    times.append(time.time()-st)
optimized_model_time = sum(times)/len(times)*1000
print(f"Average response time for optimized BERT (no metric drop): {optimized_model_time} ms")

Average response time for optimized BERT (no metric drop): 7.079629898071289 ms
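The texts in this dataset vary from 1 to 30 sentences, so per-call latency varies too and a single average hides that spread. As a sketch, a list of timings like `times` can be summarized with a median and a nearest-rank 95th percentile. The function name `summarize` is hypothetical, and the timings below are made up for illustration; they are not measurements from this notebook.

```python
import math
import statistics

def summarize(times_s):
    """Summarize per-call latencies given in seconds; results are in milliseconds."""
    ms = sorted(t * 1000 for t in times_s)
    p95_index = math.ceil(0.95 * len(ms)) - 1  # nearest-rank 95th percentile
    return {
        "mean": statistics.fmean(ms),
        "median": statistics.median(ms),
        "p95": ms[p95_index],
        "max": ms[-1],
    }

# Made-up timings in seconds, for illustration only
example_times = [0.005, 0.006, 0.006, 0.007, 0.007, 0.007, 0.008, 0.008, 0.009, 0.020]
stats = summarize(example_times)
print(", ".join(f"{name}: {value:.2f} ms" for name, value in stats.items()))
```

In this made-up sample a single slow call pulls the mean above the median, which is the kind of effect a mean-only report cannot show.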


## Speed up inference with Speedster: metric drop

This time we will use the `metric_drop_ths` argument to accept a small drop in precision, which enables quantization and yields a higher speedup.
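The Speedster logs in this notebook list candidates with `q_type: QuantizationType.HALF`. Assuming this refers to 16-bit floating point (an assumption, not confirmed by the notebook), the sketch below shows why such a conversion costs a little precision: each value is rounded to the nearest representable half-precision number. It uses only the standard library's `struct` module and says nothing about how Speedster performs the conversion internally.

```python
import struct

def to_half(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

for value in (0.1, 3.14159, 1234.5):
    half = to_half(value)
    rel_err = abs(half - value) / abs(value)
    print(f"{value!r} -> {half!r} (relative error {rel_err:.1e})")
```

Each rounding error is small, but the errors accumulate across the layers of a model, which is why a tolerance such as `metric_drop_ths` has to be set before reduced-precision candidates are accepted.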

In [17]:
optimized_model = optimize_model(
    model=model,
    input_data=encoded_inputs,
    optimization_time="constrained",
    ignore_compilers=["tensor_rt", "tvm"],  # TensorRT does not work for this model
    dynamic_info=dynamic_info,
    metric_drop_ths=0.1,
)

2023-02-11 07:03:55 | INFO     | Running Speedster on GPU
2023-02-11 07:03:59 | INFO     | Benchmark performance of original model
2023-02-11 07:04:00 | INFO     | Original model latency: 0.012968626022338867 sec/iter
2023-02-11 07:04:09 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-02-11 07:04:11 | INFO     | Optimized model latency: 0.0058023929595947266 sec/iter
2023-02-11 07:04:11 | INFO     | Optimizing with PytorchBackendCompiler and q_type: QuantizationType.HALF.
2023-02-11 07:04:12 | INFO     | Optimizing with ONNXCompiler and q_type: None.
2023-02-11 07:04:15 | INFO     | Optimized model latency: 0.009480714797973633 sec/iter
2023-02-11 07:04:15 | INFO     | Optimizing with ONNXCompiler and q_type: QuantizationType.HALF.

In [24]:
times = []

# Warmup for 30 iterations
for encoded_input in encoded_inputs[:30]:
    with torch.no_grad():
        final_out = model(**encoded_input)

# Benchmark
for encoded_input in encoded_inputs:
    st = time.time()
    with torch.no_grad():
        final_out = model(**encoded_input)
    times.append(time.time()-st)
original_model_time = sum(times)/len(times)*1000
print(f"Average response time for original BERT: {original_model_time} ms")

Average response time for original BERT: 9.293725490570068 ms


In [25]:
times = []

# Warmup for 30 iterations
for encoded_input in encoded_inputs[:30]:
    with torch.no_grad():
        final_out = optimized_model(**encoded_input)

# Benchmark
for encoded_input in encoded_inputs:
    st = time.time()
    with torch.no_grad():
        final_out = optimized_model(**encoded_input)
    times.append(time.time()-st)
optimized_model_time = sum(times)/len(times)*1000
print(f"Average response time for optimized BERT (metric drop): {optimized_model_time} ms")

Average response time for optimized BERT (metric drop): 3.913660049438477 ms
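To compare the two optimization runs, the averages printed in this notebook can be turned into speedup factors. The sketch below only divides the reported numbers; the helper name `speedup` is hypothetical.

```python
def speedup(original_ms, optimized_ms):
    """How many times faster the optimized model is than the original."""
    return original_ms / optimized_ms

# (original, optimized) average response times in ms, copied from the outputs above
runs = {
    "no metric drop": (9.37291145324707, 7.079629898071289),
    "metric drop": (9.293725490570068, 3.913660049438477),
}

for name, (original_ms, optimized_ms) in runs.items():
    print(f"{name}: {speedup(original_ms, optimized_ms):.2f}x faster")
```

On these numbers the run without metric drop is roughly 1.3x faster than the original model, and the run with `metric_drop_ths=0.1` roughly 2.4x faster.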
