## BERT CPU

In [1]:
%env CUDA_VISIBLE_DEVICES=-1

env: CUDA_VISIBLE_DEVICES=-1


## Model and Dataset setup

We use BERT (`bert-base-uncased`) as the pre-trained model to optimize. Let's download both the model and its tokenizer from the Hugging Face model hub.

In [2]:
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', torchscript=True)

# Use the GPU if one is visible (none here, since CUDA_VISIBLE_DEVICES=-1) and set eval mode
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device).eval()

  from .autonotebook import tqdm as notebook_tqdm
2023-02-11 12:51:29.523648: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F AVX512_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-02-11 12:51:29.840977: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-02-11 12:51:30.884280: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/venom/lib/:/usr/local/cuda/lib6

Let's create an example dataset of 100 texts, each built by joining between 1 and 30 randomly chosen sentence fragments, so that the input lengths vary.

In [3]:
import random

sentences = [
    "Mars is the fourth planet from the Sun.",
    "has a crust primarily composed of elements",
    "However, it is unknown",
    "can be viewed from Earth",
    "It was the Romans",
]

len_dataset = 100

texts = []
for _ in range(len_dataset):
    n_times = random.randint(1, 30)
    texts.append(" ".join(random.choice(sentences) for _ in range(n_times)))

In [4]:
encoded_inputs = [tokenizer(text, return_tensors="pt") for text in texts]
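Because each text concatenates between 1 and 30 fragments, the inputs differ widely in length. The sketch below rebuilds texts the same way and reports the spread in whitespace-separated word counts. The word count is only a rough proxy for the number of BERT tokens (an assumption made to keep the example free of the tokenizer), and the fixed seed and the `build_texts` and `word_count_range` helpers are illustrative names, not part of the notebook.

```python
import random

# Same sentence fragments as in the dataset cell.
fragments = [
    "Mars is the fourth planet from the Sun.",
    "has a crust primarily composed of elements",
    "However, it is unknown",
    "can be viewed from Earth",
    "It was the Romans",
]


def build_texts(n_texts, rng, max_repeats=30):
    """Join 1..max_repeats randomly chosen fragments for each text."""
    texts = []
    for _ in range(n_texts):
        n_times = rng.randint(1, max_repeats)
        texts.append(" ".join(rng.choice(fragments) for _ in range(n_times)))
    return texts


def word_count_range(texts):
    """Return the smallest and largest whitespace-separated word count."""
    counts = [len(text.split()) for text in texts]
    return min(counts), max(counts)


rng = random.Random(0)  # fixed seed (assumption) so the run is repeatable
sample_texts = build_texts(100, rng)
shortest, longest = word_count_range(sample_texts)
print(f"word counts range from {shortest} to {longest}")
```

A wide spread like this is the reason the sequence-length axis is declared dynamic when the model is optimized.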

## Speed up inference with Speedster: no metric drop

Now let's improve the inference speed using `Speedster`.

In [7]:
from speedster import optimize_model

2023-02-11 12:51:51.543620: E tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:267] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
2023-02-11 12:51:51.543699: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: predator
2023-02-11 12:51:51.543704: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: predator
2023-02-11 12:51:51.543796: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 525.89.2
2023-02-11 12:51:51.543814: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 525.89.2
2023-02-11 12:51:51.543817: I tensorflow/compiler/xla/stream_executor/cuda/cuda_diagnostics.cc:310] kernel version seems to match DSO: 525.89.2
ERROR: [Torch-TensorRT] - Cannot get current device
ERROR: [Torch-TensorRT] - Cannot get current device
ERROR: [Torch-TensorRT] -

Using Speedster is straightforward: call the `optimize_model` function and pass it the model, some example input data, and the optimization time mode. Optionally, a `dynamic_info` dictionary can also be provided to support inputs with dynamic shapes.
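To make the `dynamic_info` layout concrete: the tokenizer yields three tensors (`input_ids`, `token_type_ids`, `attention_mask`), each of shape `(batch, num_tokens)`, and with `torchscript=True` the model returns a tuple holding the last hidden state `(batch, num_tokens, hidden)` and the pooled output `(batch, hidden)`. Each dictionary maps an axis index to a label that marks that axis as variable. The sketch below is a plain-Python illustration and not part of Speedster; `describe_shape` is a hypothetical helper, and the example shapes (12 tokens, hidden size 768 for `bert-base-uncased`) are assumed for illustration.

```python
def describe_shape(shape, dynamic_axes):
    """Replace the size of every dynamic axis with its symbolic label."""
    return tuple(dynamic_axes.get(axis, size) for axis, size in enumerate(shape))


dynamic_info = {
    "inputs": [
        {0: "batch", 1: "num_tokens"},
        {0: "batch", 1: "num_tokens"},
        {0: "batch", 1: "num_tokens"},
    ],
    "outputs": [
        {0: "batch", 1: "num_tokens"},
        {0: "batch"},
    ],
}

# Assumed concrete shapes for a single 12-token sentence.
input_names = ["input_ids", "token_type_ids", "attention_mask"]
input_shapes = [(1, 12), (1, 12), (1, 12)]
output_names = ["last_hidden_state", "pooler_output"]
output_shapes = [(1, 12, 768), (1, 768)]

described_inputs = [
    describe_shape(shape, axes)
    for shape, axes in zip(input_shapes, dynamic_info["inputs"])
]
described_outputs = [
    describe_shape(shape, axes)
    for shape, axes in zip(output_shapes, dynamic_info["outputs"])
]

for name, shape in zip(input_names + output_names, described_inputs + described_outputs):
    print(f"{name}: {shape}")
```

Axes without an entry, such as the hidden dimension, are treated as fixed.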

In [11]:
dynamic_info = {
    "inputs": [
        {0: 'batch', 1: 'num_tokens'},
        {0: 'batch', 1: 'num_tokens'},
        {0: 'batch', 1: 'num_tokens'},
    ],
    "outputs": [
        {0: 'batch', 1: 'num_tokens'},
        {0: 'batch'},
    ]
}

optimized_model = optimize_model(
    model=model,
    input_data=encoded_inputs,
    optimization_time="constrained",
    ignore_compilers=["tensor_rt", "tvm"],  # TensorRT does not work for this model
    dynamic_info=dynamic_info,
    device='cpu',
)

2023-02-11 12:45:40 | INFO     | Running Speedster on CPU
2023-02-11 12:45:49 | INFO     | Benchmark performance of original model
2023-02-11 12:45:57 | INFO     | Original model latency: 0.06308063268661498 sec/iter
2023-02-11 12:46:01 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-02-11 12:46:08 | INFO     | Optimized model latency: 0.06523668766021729 sec/iter
2023-02-11 12:46:08 | INFO     | Optimizing with DeepSparseCompiler and q_type: None.

DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.3.2 COMMUNITY | (7d31c4bf) (release) (optimized) (system=avx512_vnni, binary=avx512)

2023-02-11 12:46:14 | INFO     | Optimized model latency: 0.06429529190063477 sec/iter
2023-02-11 12:46:14 | INFO     | Optimizing with ONNXCompiler and q_type: None.
2023-02-11 12:46:19 | INFO     | Optimized model latency: 0.04046010971069336 sec/iter
2023-02-11 12:46:19 | INFO     | Optimizing with ONNXCompiler and q_type: QuantizationType.HALF.
2023-02-11 12:46:30 | INFO     | Optimizing with OpenVINOCompiler and q_type: None.
[ INFO ] The model was converted to IR v11, the latest model format that corresponds to the source DL framework input/output format. While IR v11 is backwards compatible with OpenVINO Inference Engine API v1.0, please use API v2.0 (as of 2022.1) to take advantage of the latest improvements in IR v11.
Find more information about API v2.0 and IR v11 at https://docs.openvino.ai/latest/openvino_2_0_transition_guide.html
[ SUCCESS ] Generated IR version 11 model.
[ SUCCESS ] XML file: /tmp/tmp7ss5xzcm/fp32/temp.xml
[ SUCCESS ] BIN file: /tmp/tmp7ss5xzcm/fp32/temp.bin

[Speedster results on 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz]
┏━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Metric      ┃ Original Model   ┃ Optimized Model   ┃ Improvement   ┃
┣━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━┫
┃ backend     ┃ PYTORCH          ┃ ONNXRuntime       ┃               ┃
┃ latency     ┃ 0.0631 sec/batch ┃ 0.0405 sec/batch  ┃ 1.56x         ┃
┃ thr

In [12]:
import time

# Move inputs to the selected device (CPU in this notebook)
encoded_inputs = [tokenizer(text, return_tensors="pt").to(device) for text in texts]

Let's run the prediction 100 times to calculate the average response time of the original model.

In [13]:
times = []

# Warmup for 30 iterations
for encoded_input in encoded_inputs[:30]:
    with torch.no_grad():
        final_out = model(**encoded_input)

# Benchmark
for encoded_input in encoded_inputs:
    st = time.time()
    with torch.no_grad():
        final_out = model(**encoded_input)
    times.append(time.time()-st)
original_model_time = sum(times)/len(times)*1000
print(f"Average response time for original BERT: {original_model_time} ms")

Average response time for original BERT: 60.156662464141846 ms


Let's run the prediction 100 times to calculate the average response time of the optimized model.

In [14]:
times = []

# Warmup for 30 iterations
for encoded_input in encoded_inputs[:30]:
    with torch.no_grad():
        final_out = optimized_model(**encoded_input)

# Benchmark
for encoded_input in encoded_inputs:
    st = time.time()
    with torch.no_grad():
        final_out = optimized_model(**encoded_input)
    times.append(time.time()-st)
optimized_model_time = sum(times)/len(times)*1000
print(f"Average response time for optimized BERT (no metric drop): {optimized_model_time} ms")

Average response time for optimized BERT (no metric drop): 57.145066261291504 ms
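The two timing loops above share the same structure: a warmup pass followed by one timed call per input. A minimal sketch of that pattern as a reusable helper is shown below. The `benchmark` function and the `dummy_model` workload are illustrative names, not part of Speedster, and the sketch uses `time.perf_counter()`, a monotonic high-resolution clock, in place of `time.time()`.

```python
import statistics
import time


def benchmark(fn, inputs, warmup=30):
    """Time one call of fn per input and return latency statistics in ms."""
    # Warmup: run the first inputs without recording their timings.
    for item in inputs[:warmup]:
        fn(item)

    timings_ms = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        timings_ms.append((time.perf_counter() - start) * 1000)

    return {
        "mean_ms": statistics.mean(timings_ms),
        "median_ms": statistics.median(timings_ms),
        "runs": len(timings_ms),
    }


def dummy_model(n):
    """Stand-in workload so the sketch runs without PyTorch."""
    return sum(i * i for i in range(n))


stats = benchmark(dummy_model, [1000] * 100)
print(f"mean: {stats['mean_ms']:.4f} ms, median: {stats['median_ms']:.4f} ms")
```

In this notebook, `fn` could be something like `lambda enc: model(**enc)` called inside a `torch.no_grad()` block. Reporting the median next to the mean makes occasional slow iterations easier to spot.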


## Speed up inference with Speedster: metric drop

This time we use the `metric_drop_ths` argument to accept a small drop in precision, which enables quantization and can yield a higher speedup.
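The idea behind such a threshold can be sketched as follows: compare the outputs of a candidate model with those of the original, and keep the candidate only if the difference stays within the allowed drop. The metric Speedster actually applies is not shown in this notebook, so the sketch assumes a mean relative difference. The helper names and the sample values are hypothetical.

```python
def mean_relative_difference(reference, candidate, eps=1e-12):
    """Average of |reference - candidate| / |reference| over all elements."""
    diffs = [abs(r - c) / (abs(r) + eps) for r, c in zip(reference, candidate)]
    return sum(diffs) / len(diffs)


def within_threshold(reference, candidate, metric_drop_ths):
    """Accept the candidate if its outputs stay within the allowed drop."""
    return mean_relative_difference(reference, candidate) <= metric_drop_ths


# Made-up output values, standing in for original and quantized model outputs.
reference_outputs = [0.50, -1.20, 2.00, 0.80]
quantized_outputs = [0.52, -1.15, 1.90, 0.80]

drop = mean_relative_difference(reference_outputs, quantized_outputs)
accepted = within_threshold(reference_outputs, quantized_outputs, metric_drop_ths=0.1)
print(f"measured drop: {drop:.4f}, accepted: {accepted}")
```

With a stricter threshold, for example `0.01`, the same candidate would be rejected, which shows how the threshold trades speed against fidelity.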

In [8]:
optimized_model = optimize_model(
    model=model,
    input_data=encoded_inputs,
    optimization_time="constrained",
    ignore_compilers=["tensor_rt", "tvm"],  # TensorRT does not work for this model
    dynamic_info=dynamic_info,
    metric_drop_ths=0.1,
    device='cpu',
)

2023-02-11 12:51:56 | INFO     | Running Speedster on CPU
2023-02-11 12:52:04 | INFO     | Benchmark performance of original model
2023-02-11 12:52:11 | INFO     | Original model latency: 0.05229429244995117 sec/iter
2023-02-11 12:52:15 | INFO     | Optimizing with PytorchBackendCompiler and q_type: None.
2023-02-11 12:52:21 | INFO     | Optimized model latency: 0.04668235778808594 sec/iter
2023-02-11 12:52:21 | INFO     | Optimizing with PytorchBackendCompiler and q_type: QuantizationType.DYNAMIC.
2023-02-11 12:52:21 | INFO     | Optimizing with PytorchBackendCompiler and q_type: QuantizationType.STATIC.
2023-02-11 12:52:21 | INFO     | Optimizing with DeepSparseCompiler and q_type: None.

DeepSparse, Copyright 2021-present / Neuralmagic, Inc. version: 1.3.2 COMMUNITY | (7d31c4bf) (release) (optimized) (system=avx512_vnni, binary=avx512)

2023-02-11 12:52:27 | INFO     | Optimized model latency: 0.06431055068969727 sec/iter
2023-02-11 12:52:27 | INFO     | Optimizing with IntelNeuralCompressorCompiler and q_type: QuantizationType.DYNAMIC.


2023-02-11 12:52:27 [INFO] Because both eval_dataloader_cfg and user-defined eval_func are None, automatically setting 'tuning.exit_policy.performance_only = True'.
2023-02-11 12:52:27 [INFO] Generate a fake evaluation function.
2023-02-11 12:52:28 [INFO] Pass query framework capability elapsed time: 356.84 ms
2023-02-11 12:52:28 [INFO] Get FP32 model baseline.
2023-02-11 12:52:28 [INFO] Save tuning history to /home/venom/repo/nebullvm/notebooks/speedster/pytorch/nc_workspace/2023-02-11_12-51-54/./history.snapshot.
2023-02-11 12:52:28 [INFO] FP32 baseline is: [Accuracy: 1.0000, Duration (seconds): 0.0000]
2023-02-11 12:52:28 [INFO] Fx trace of the entire model failed, We will conduct auto quantization
2023-02-11 12:52:30 [INFO] |******Mixed Precision Statistics******|
2023-02-11 12:52:30 [INFO] +-----------------+----------+---------+
2023-02-11 12:52:30 [INFO] |     Op Type     |  Total   |   INT8  |
2023-02-11 12:52:30 [INFO] +-----------------+----------+---------+
2023-02-11 12:52:

2023-02-11 12:52:31 | INFO     | Optimizing with IntelNeuralCompressorCompiler and q_type: QuantizationType.STATIC.


2023-02-11 12:52:31 [INFO] Pass query framework capability elapsed time: 360.04 ms
2023-02-11 12:52:31 [INFO] Get FP32 model baseline.
2023-02-11 12:52:31 [ERROR] Unexpected exception AssertionError('The dataloader must include label to measure the metric!') happened during tuning.
Traceback (most recent call last):
  File "/home/venom/.local/lib/python3.8/site-packages/neural_compressor/adaptor/pytorch.py", line 885, in eval_func
    metric.update(output, label)
  File "/home/venom/.local/lib/python3.8/site-packages/neural_compressor/experimental/metric/metric.py", line 969, in update
    preds, labels = _topk_shape_validate(preds, labels)
  File "/home/venom/.local/lib/python3.8/site-packages/neural_compressor/experimental/metric/metric.py", line 426, in _topk_shape_validate
    if len(preds.shape) == 1:
AttributeError: 'tuple' object has no attribute 'shape'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/venom/.

2023-02-11 12:52:31 | INFO     | Optimizing with ONNXCompiler and q_type: None.
2023-02-11 12:52:35 | INFO     | Optimized model latency: 0.0500410795211792 sec/iter
2023-02-11 12:52:35 | INFO     | Optimizing with ONNXCompiler and q_type: QuantizationType.HALF.
2023-02-11 12:52:49 | INFO     | Optimized model latency: 0.19515776634216309 sec/iter
2023-02-11 12:52:49 | INFO     | Optimizing with ONNXCompiler and q_type: QuantizationType.DYNAMIC.
Ignore MatMul due to non constant B: /[/core_model/encoder/layer.0/attention/self/MatMul]
Ignore MatMul due to non constant B: /[/core_model/encoder/layer.0/attention/self/MatMul_1]
Ignore MatMul due to non constant B: /[/core_model/encoder/layer.1/attention/self/MatMul]
Ignore MatMul due to non constant B: /[/core_model/encoder/layer.1/attention/self/MatMul_1]
Ignore MatMul due to non constant B: /[/core_model/encoder/la

2023-02-11 12:53:22.069625382 [E:onnxruntime:, inference_session.cc:1499 operator()] Exception during initialization: /onnxruntime_src/onnxruntime/contrib_ops/cpu/quantization/qlinear_softmax.cc:74 onnxruntime::contrib::QLinearSoftmax::QLinearSoftmax(const onnxruntime::OpKernelInfo&) x_shape != nullptr && x_shape->dim_size() > 0 was false. input_shape of QLinearSoftmax must be existed



[ INFO ] The model was converted to IR v11, the latest model format that corresponds to the source DL framework input/output format. While IR v11 is backwards compatible with OpenVINO Inference Engine API v1.0, please use API v2.0 (as of 2022.1) to take advantage of the latest improvements in IR v11.
Find more information about API v2.0 and IR v11 at https://docs.openvino.ai/latest/openvino_2_0_transition_guide.html
[ SUCCESS ] Generated IR version 11 model.
[ SUCCESS ] XML file: /tmp/tmpg84zkk1x/fp32/temp.xml
[ SUCCESS ] BIN file: /tmp/tmpg84zkk1x/fp32/temp.bin
2023-02-11 12:53:28 | INFO     | Optimized model latency: 0.0462033748626709 sec/iter
2023-02-11 12:53:28 | INFO     | Optimizing with OpenVINOCompiler and q_type: QuantizationType.HALF.




[ INFO ] The model was converted to IR v11, the latest model format that corresponds to the source DL framework input/output format. While IR v11 is backwards compatible with OpenVINO Inference Engine API v1.0, please use API v2.0 (as of 2022.1) to take advantage of the latest improvements in IR v11.
Find more information about API v2.0 and IR v11 at https://docs.openvino.ai/latest/openvino_2_0_transition_guide.html
[ SUCCESS ] Generated IR version 11 model.
[ SUCCESS ] XML file: /tmp/tmpg84zkk1x/fp32/temp.xml
[ SUCCESS ] BIN file: /tmp/tmpg84zkk1x/fp32/temp.bin
2023-02-11 12:53:36 | INFO     | Optimized model latency: 0.0430908203125 sec/iter
2023-02-11 12:53:36 | INFO     | Optimizing with OpenVINOCompiler and q_type: QuantizationType.STATIC.
[ INFO ] The model was converted to IR v11, the latest model format that corresponds to the source DL framework input/output format. While IR v11 is backwards compatible with OpenVINO Inference E

In [11]:
times = []

# Warmup for 30 iterations
for encoded_input in encoded_inputs[:30]:
    with torch.no_grad():
        final_out = model(**encoded_input)

# Benchmark
for encoded_input in encoded_inputs:
    st = time.time()
    with torch.no_grad():
        final_out = model(**encoded_input)
    times.append(time.time()-st)
original_model_time = sum(times)/len(times)*1000
print(f"Average response time for original BERT: {original_model_time} ms")

Average response time for original BERT: 56.99615478515625 ms


In [12]:
times = []

# Warmup for 30 iterations
for encoded_input in encoded_inputs[:30]:
    with torch.no_grad():
        final_out = optimized_model(**encoded_input)

# Benchmark
for encoded_input in encoded_inputs:
    st = time.time()
    with torch.no_grad():
        final_out = optimized_model(**encoded_input)
    times.append(time.time()-st)
optimized_model_time = sum(times)/len(times)*1000
print(f"Average response time for optimized BERT (metric drop): {optimized_model_time} ms")

Average response time for optimized BERT (metric drop): 41.04656457901001 ms
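The measured averages can be turned into speedup factors by dividing the original latency by the optimized one. The short sketch below does this for the four averages printed in this notebook; `speedup` is an illustrative helper name.

```python
def speedup(original_ms, optimized_ms):
    """Ratio of original to optimized latency; values above 1 mean faster."""
    return original_ms / optimized_ms


# Averages printed by the benchmark cells in this notebook (in ms).
no_drop = speedup(60.156662464141846, 57.145066261291504)
with_drop = speedup(56.99615478515625, 41.04656457901001)

print(f"no metric drop: {no_drop:.2f}x")
print(f"metric drop:    {with_drop:.2f}x")
```

On these runs the ratios come out at roughly 1.05x without a metric drop and 1.39x with `metric_drop_ths=0.1`.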
