Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# Inference TensorFlow Bert Model with ONNX Runtime on CPU

This tutorial shows how to load a BERT model with TensorFlow, convert it to ONNX with tf2onnx, and run high-performance inference with ONNX Runtime. The following sections use a BERT model trained on the Stanford Question Answering Dataset (SQuAD) as an example. A BERT SQuAD model is used for question answering, where the answer to each question is a segment of text (a span) from the corresponding reading passage, or the question may be unanswerable.

## 0. Prerequisites ##
First, set up a Python environment to run this notebook.

Install [Anaconda](https://www.anaconda.com/distribution/) and [Git](https://git-scm.com/downloads), then open an Anaconda console and run the following commands to create a conda environment named cpu_env:

```console
conda create -n cpu_env python=3.8
conda activate cpu_env
conda install -c anaconda ipykernel
conda install -c conda-forge ipywidgets
python -m ipykernel install --user --name=cpu_env
```

Finally, launch Jupyter Notebook and choose cpu_env as the kernel to run this notebook.

Next, install [TensorFlow](https://www.tensorflow.org/install), [ONNX Runtime](https://microsoft.github.io/onnxruntime/), [tf2onnx](https://github.com/onnx/tensorflow-onnx) and the other required packages:

In [2]:
import sys
 
!{sys.executable} -m pip install --quiet --upgrade tensorflow==2.6.0
!{sys.executable} -m pip install --quiet --upgrade onnxruntime==1.8.1
!{sys.executable} -m pip install --quiet --upgrade tf2onnx==1.9.2
!{sys.executable} -m pip install --quiet transformers==4.9.2
!{sys.executable} -m pip install --quiet onnxconverter_common
!{sys.executable} -m pip install --quiet psutil wget pandas
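
To confirm that the pinned versions were actually installed into the active kernel, a small check like the following can help. It is a minimal sketch that uses only the standard library (`importlib.metadata`, available from Python 3.8); the helper names `installed_version` and `report_versions` are illustrative and not part of any package.

```python
from importlib import metadata

# Versions pinned by the install cell above.
PINNED = {
    "tensorflow": "2.6.0",
    "onnxruntime": "1.8.1",
    "tf2onnx": "1.9.2",
    "transformers": "4.9.2",
}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report_versions(pinned=PINNED):
    """Map each package to (pinned, installed) so mismatches are easy to spot."""
    return {name: (want, installed_version(name)) for name, want in pinned.items()}


for name, (want, have) in report_versions().items():
    status = "ok" if have == want else "check"
    print("{:<14} pinned={:<7} installed={} [{}]".format(name, want, have, status))
```

A `check` status does not necessarily mean the notebook will fail; it only flags that the environment differs from the one used to produce the outputs shown here.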

Let's define some constants:

In [3]:
# Whether to allow overwriting an existing script or model.
enable_overwrite = False

# Number of runs to get average latency.
total_runs = 100

# Max sequence length for the exported model
max_sequence_length = 512

In [4]:
import os
cache_dir = './cache_models'
output_dir = './onnx_models'
for directory in [cache_dir, output_dir]:
    os.makedirs(directory, exist_ok=True)

In [5]:
import tensorflow as tf
tf.config.set_visible_devices([], 'GPU') # Disable GPU for fair comparison

## 1. Load Pretrained Bert model ##

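Start by loading the model. The first run takes a few minutes because the model must be downloaded. The fine-tuned SQuAD checkpoint is commented out below; with `bert-base-cased`, the question-answering head is newly initialized, so predicted answers are not meaningful and only latency is compared.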

In [6]:
from transformers import (TFBertForQuestionAnswering, BertTokenizer)

#model_name_or_path = 'bert-large-uncased-whole-word-masking-finetuned-squad'
model_name_or_path = "bert-base-cased"
is_fine_tuned = (model_name_or_path == 'bert-large-uncased-whole-word-masking-finetuned-squad')

# Load model and tokenizer
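# Lowercase the input only for uncased checkpoints; a cased vocabulary expects the original casing.
tokenizer = BertTokenizer.from_pretrained(model_name_or_path, do_lower_case=('uncased' in model_name_or_path), cache_dir=cache_dir)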
model = TFBertForQuestionAnswering.from_pretrained(model_name_or_path, cache_dir=cache_dir)
# This is needed to export an ONNX model with multiple inputs with TF 2.2
model._saved_model_inputs_spec = None

All model checkpoint layers were used when initializing TFBertForQuestionAnswering.

Some layers of TFBertForQuestionAnswering were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['qa_outputs']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## 2. TensorFlow Inference

Run inference on one example with TensorFlow to establish a baseline.

In [7]:
import numpy

question, text = "What is ONNX Runtime?", "ONNX Runtime is a performance-focused inference engine for ONNX models."
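# Padding to max length is needed. Otherwise, position embedding might be truncated by constant folding.
inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors='tf',
                               max_length=max_sequence_length, padding='max_length', truncation=True)
# The model returns an output object; read the logits by attribute instead of unpacking it,
# because iterating over the output object yields its keys rather than its tensors.
outputs = model(inputs)
start_scores, end_scores = outputs.start_logits, outputs.end_logits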

num_tokens = len(inputs["input_ids"][0])
if is_fine_tuned:
    all_tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    print("The answer is:", ' '.join(all_tokens[numpy.argmax(start_scores) : numpy.argmax(end_scores)+1]))
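
The cell above takes the argmax of the start and end scores independently, which can yield an end position before the start. A common alternative is to search for the best-scoring span with `start <= end`. The sketch below is an illustration in plain Python, not part of the notebook; `best_span` and its `max_answer_len` limit are hypothetical names, and the scores are toy values.

```python
def best_span(start_scores, end_scores, max_answer_len=30):
    """Return (start, end) maximizing start_scores[start] + end_scores[end]
    subject to start <= end < start + max_answer_len."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_scores):
        stop = min(len(end_scores), start + max_answer_len)
        for end in range(start, stop):
            score = start_score + end_scores[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# Toy scores: the independent argmax picks start=3 and end=1, which is not a valid span.
toy_start = [0.1, 2.0, 0.3, 2.5]
toy_end = [0.2, 3.0, 0.1, 0.4]
print(best_span(toy_start, toy_end))
```

The function accepts any 1-D sequences of numbers, so a row of the model's start and end logits converted to a list or NumPy array could be passed in the same way.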



In [8]:
import time
start = time.time()
for _ in range(total_runs):
    model(inputs)  # outputs are discarded; only latency is measured here
end = time.time()
print("Tensorflow Inference time for sequence length {} = {} ms".format(num_tokens, format((end - start) * 1000 / total_runs, '.2f')))

Tensorflow Inference time for sequence length 512 = 497.17 ms
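
The loop above times all runs together and includes any one-time setup cost in the first call. A small helper that separates warm-up from timed runs and reports more than the mean can make comparisons steadier. This is a sketch using only the standard library; `measure_latency` is a hypothetical helper, not part of TensorFlow or ONNX Runtime.

```python
import statistics
import time


def measure_latency(fn, runs=100, warmup=1):
    """Call fn() `warmup` times untimed, then `runs` times timed.

    Returns latency statistics in milliseconds.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)
    return {
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
    }


# Demo with a cheap stand-in workload instead of a model call.
stats = measure_latency(lambda: sum(range(10000)), runs=20)
print("mean {mean_ms:.3f} ms, median {median_ms:.3f} ms, max {max_ms:.3f} ms".format(**stats))
```

In this notebook the callable could be `lambda: model(inputs)` for TensorFlow or `lambda: session.run(None, inputs_onnx)` for ONNX Runtime, so both are measured the same way.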


## 3. Export model to ONNX using tf2onnx

Now we use tf2onnx to export the model to ONNX format.
Alternatively, TensorFlow checkpoints can be converted to PyTorch (supported by the Hugging Face team; see https://huggingface.co/transformers/converting_tensorflow_models.html) and then exported to ONNX with torch.onnx.export().

In [9]:
import tf2onnx
tf2onnx.logging.set_level(tf2onnx.logging.ERROR)

output_model_path =  os.path.join(output_dir, 'tf2onnx_{}.onnx'.format(model_name_or_path))
opset_version = 13
use_external_data_format = False

specs = []
for name, value in inputs.items():
    dims = [None] * len(value.shape)
    specs.append(tf.TensorSpec(tuple(dims), value.dtype, name=name))

if enable_overwrite or not os.path.exists(output_model_path):
    start = time.time()
    _, _ = tf2onnx.convert.from_keras(model,
                                      input_signature=tuple(specs),
                                      opset=opset_version,
                                      large_model=use_external_data_format,
                                      output_path=output_model_path)
    print("tf2onnx run time = {} s".format(format(time.time() - start, '.2f')))

Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: module, class, method, function, traceback, frame, or code object was expected, got cython_function_or_method
Instructions for updating:
Use `tf.compat.v1.graph_util.extract_sub_graph`
tf2onnx run time = 76.58 s


## 4. Inference the Exported Model with ONNX Runtime

Now we are ready to run the model with ONNX Runtime. In this example, ONNX Runtime is faster than TensorFlow even without optimization (452.52 ms versus 497.17 ms per inference in the runs recorded here).

In [10]:
import psutil
import onnxruntime
import numpy

sess_options = onnxruntime.SessionOptions()

# Set the intra_op_num_threads
sess_options.intra_op_num_threads = psutil.cpu_count(logical=True)

# The providers argument is optional here. It is only needed when the onnxruntime-gpu package is used for CPU inference.
session = onnxruntime.InferenceSession(output_model_path, sess_options, providers=['CPUExecutionProvider'])

batch_size = 1
inputs_onnx = {k_: numpy.repeat(v_, batch_size, axis=0) for k_, v_ in inputs.items()}

# Warm up with one run.
results = session.run(None, inputs_onnx)

# Measure the latency.
start = time.time()
for _ in range(total_runs):
    results = session.run(None, inputs_onnx)
end = time.time()
print("ONNX Runtime cpu inference time for sequence length {} (model not optimized): {} ms".format(num_tokens, format((end - start) * 1000 / total_runs, '.2f')))
del session

ONNX Runtime cpu inference time for sequence length 512 (model not optimized): 452.52 ms
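
`session.run` expects a dictionary keyed by the graph's input names. Because the export above named each `tf.TensorSpec` after the tokenizer key, the names line up here, but a quick check gives a clearer error when they do not. The sketch below relies only on `InferenceSession.get_inputs()`, whose entries expose a `name` attribute; `build_feed` is a hypothetical helper, and `FakeSession` is a stand-in so the example runs without a model file.

```python
from types import SimpleNamespace


def build_feed(session, encoded):
    """Select entries of `encoded` by the model's declared input names.

    Extra keys in `encoded` are ignored; a missing input raises KeyError.
    """
    expected = [node.name for node in session.get_inputs()]
    missing = [name for name in expected if name not in encoded]
    if missing:
        raise KeyError("missing model inputs: {}".format(missing))
    return {name: encoded[name] for name in expected}


class FakeSession:
    """Stand-in exposing get_inputs() like onnxruntime.InferenceSession (illustration only)."""

    def __init__(self, names):
        self._inputs = [SimpleNamespace(name=n) for n in names]

    def get_inputs(self):
        return self._inputs


fake = FakeSession(["input_ids", "token_type_ids", "attention_mask"])
feed = build_feed(fake, {
    "attention_mask": [[1, 1]],
    "input_ids": [[101, 102]],
    "token_type_ids": [[0, 0]],
    "extra": None,  # not a model input, so it is dropped
})
print(list(feed))
```

With a real session, the same call would be `build_feed(session, inputs_onnx)` before `session.run(None, feed)`.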


In [11]:
# Some weights of TFBertForQuestionAnswering might not be initialized without fine-tuning.
if is_fine_tuned:
    print("***** Verifying correctness (TensorFlow and ONNX Runtime) *****")
    print('start_scores are close:', numpy.allclose(results[0], start_scores.numpy(), rtol=1e-05, atol=1e-04))
    print('end_scores are close:', numpy.allclose(results[1], end_scores.numpy(), rtol=1e-05, atol=1e-04))
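
`numpy.allclose(a, b, rtol, atol)` accepts element pairs that satisfy `|a - b| <= atol + rtol * |b|`, so the second argument serves as the reference. The plain-Python sketch below mirrors that rule on flat lists to show how the two tolerances interact; `all_close` and `max_abs_diff` are illustrative helpers, and the values are toy numbers rather than model outputs.

```python
def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


def all_close(a, b, rtol=1e-5, atol=1e-4):
    """True when every pair satisfies |x - y| <= atol + rtol * |y| (b is the reference)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))


candidate = [1.00005, -2.0]
reference = [1.0, -2.0]
print("max abs diff:", max_abs_diff(candidate, reference))
print("close with atol=1e-4:", all_close(candidate, reference))
print("close with atol=1e-5:", all_close(candidate, reference, atol=1e-5))
```

Printing the maximum absolute difference alongside the boolean result makes it easier to judge how far the outputs are from the tolerance when a check fails.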

## 5. Model Optimization

[ONNX Runtime BERT Model Optimization Tools](https://github.com/microsoft/onnxruntime/tree/master/onnxruntime/python/tools/transformers) is a set of tools for optimizing and testing BERT models. Let's try some of them on the exported models.

### BERT Optimization Script

The **optimizer.py** script optimizes BERT models exported by PyTorch, tf2onnx or keras2onnx. Since our model was exported by tf2onnx, we use the **--model_type bert_tf** parameter.

It also reports whether the model is fully optimized. If it is not, you might need to change the script to fuse a new subgraph pattern.

In [12]:
!{sys.executable} -m pip install --quiet coloredlogs sympy 

optimized_model_path =  os.path.join(output_dir, 'tf2onnx_{}_opt_cpu.onnx'.format(model_name_or_path))

from onnxruntime.transformers import optimizer
optimized_model = optimizer.optimize_model(output_model_path, model_type='bert_tf', num_heads=12, hidden_size=768)
optimized_model.use_dynamic_axes()
optimized_model.save_model_to_file(optimized_model_path)

failed in shape inference <class 'AttributeError'>
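
One way to see what the optimizer changed is to count fused operators in the saved graph. The sketch below tallies `op_type` values over a list of nodes; with the `onnx` package, the nodes would come from `onnx.load(optimized_model_path).graph.node`. The list of fused operator names is an assumption based on the contrib operators this kind of optimization typically emits, `count_fused_ops` is a hypothetical helper, and the demo nodes are stand-ins so the example runs without a model file.

```python
from collections import Counter
from types import SimpleNamespace

# Assumed names of fused operators; adjust the list to the operators you expect.
FUSED_OPS = (
    "Attention",
    "EmbedLayerNormalization",
    "SkipLayerNormalization",
    "LayerNormalization",
    "Gelu",
    "BiasGelu",
    "FastGelu",
)


def count_fused_ops(nodes, fused_ops=FUSED_OPS):
    """Count how many nodes of each fused operator type appear in `nodes`."""
    counts = Counter(node.op_type for node in nodes)
    return {op: counts.get(op, 0) for op in fused_ops}


# Stand-in nodes that only carry an op_type attribute, like onnx NodeProto does.
demo_nodes = [SimpleNamespace(op_type=t)
              for t in ["Attention", "MatMul", "SkipLayerNormalization", "Attention", "Add"]]
for op, n in count_fused_ops(demo_nodes).items():
    print("{:<24} {}".format(op, n))
```

If the counts for an optimized model are mostly zero, the fusions probably did not apply, and latency would be expected to stay close to that of the unoptimized model.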


We run the optimized model using the same inputs. Optimization may reduce inference latency, and the outputs should match those from before optimization.

In [13]:
session = onnxruntime.InferenceSession(optimized_model_path, sess_options)
# use one run to warm up a session
session.run(None, inputs_onnx)

# measure the latency.
start = time.time()
for _ in range(total_runs):
    opt_results = session.run(None, inputs_onnx)
end = time.time()
print("ONNX Runtime cpu inference time on optimized model: {} ms".format(format((end - start) * 1000 / total_runs, '.2f')))
del session

ONNX Runtime cpu inference time on optimized model: 437.09 ms


In [14]:
print("***** Verifying correctness (before and after optimization) *****")
print('start_scores are close:', numpy.allclose(opt_results[0], results[0], rtol=1e-05, atol=1e-04))
print('end_scores are close:', numpy.allclose(opt_results[1], results[1], rtol=1e-05, atol=1e-04))

***** Verifying correctness (before and after optimization) *****
start_scores are close: True
end_scores are close: True


### Model Results Comparison Tool

If your BERT model has three inputs, the compare_bert_results.py script can be used for a quick verification. The tool generates fake input data and compares the results from the original and optimized models. If all outputs are close, it is safe to use the optimized model.

Example of comparing the models before and after optimization:

In [15]:
# The baseline model is exported using max sequence length, and no dynamic axes
!{sys.executable} -m onnxruntime.transformers.compare_bert_results --baseline_model $output_model_path --optimized_model $optimized_model_path --batch_size 1 --sequence_length $max_sequence_length --samples 10

100% passed for 10 random inputs given thresholds (rtol=0.001, atol=0.0001).
maximum absolute difference=0
maximum relative difference=0
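
For intuition, randomly generated inputs for a three-input BERT model can look like the sketch below. This is not the tool's actual implementation; `fake_bert_inputs`, the default vocabulary size (28996, assumed for `bert-base-cased`), and the padding scheme are assumptions for illustration.

```python
import random


def fake_bert_inputs(batch_size, sequence_length, vocab_size=28996, seed=0):
    """Build random input_ids, attention_mask and token_type_ids as nested lists."""
    rng = random.Random(seed)
    input_ids, attention_mask, token_type_ids = [], [], []
    for _ in range(batch_size):
        real_len = rng.randint(1, sequence_length)  # number of tokens before padding
        pad_len = sequence_length - real_len
        ids = [rng.randrange(vocab_size) for _ in range(real_len)]
        input_ids.append(ids + [0] * pad_len)             # pad with id 0
        attention_mask.append([1] * real_len + [0] * pad_len)
        token_type_ids.append([0] * sequence_length)       # single-segment input
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }


sample = fake_bert_inputs(batch_size=2, sequence_length=8)
for name, rows in sample.items():
    print(name, rows[0])
```

Before passing such data to `session.run`, each entry would need to be converted to an array of the dtype the model expects, for example `numpy.asarray(rows, dtype=numpy.int32)` if the exported inputs are int32 as in this notebook.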


### Performance Test Tool

This tool measures the performance of BERT model inference using the ONNX Runtime Python API.

The following command creates 100 samples with batch size 1 and sequence length 128, runs inference on them, and then calculates performance numbers such as average latency and throughput.

In [17]:
THREAD_SETTING = '-n {}'.format(psutil.cpu_count(logical=True))

!{sys.executable} -m onnxruntime.transformers.bert_perf_test --model $optimized_model_path --batch_size 1 --sequence_length 128 --samples 100 --test_times 1 $THREAD_SETTING

test setting TestSetting(batch_size=1, sequence_length=128, test_cases=100, test_times=1, use_gpu=False, intra_op_num_threads=24, seed=3, verbose=False)
Generating 100 samples for batch_size=1 sequence_length=128
Running test: model=tf2onnx_bert-base-cased_opt_cpu.onnx,graph_optimization_level=ENABLE_ALL,intra_op_num_threads=24,batch_size=1,sequence_length=128,test_cases=100,test_times=1,use_gpu=False
Average latency = 136.36 ms, Throughput = 7.33 QPS
Test summary is saved to onnx_models/perf_results_CPU_B1_S128_20210830-220600.txt


Let's load the summary file and take a look. This run tested a single thread setting, so the summary contains one row. The best setting may differ depending on the hardware and model.

In [21]:
import glob     
import pandas

latest_result_file = max(glob.glob(os.path.join(output_dir, "perf_results_*.txt")), key=os.path.getmtime)
result_data = pandas.read_table(latest_result_file)
print(latest_result_file)

result_data.drop(['model', 'graph_optimization_level', 'batch_size', 'sequence_length', 'test_cases', 'test_times', 'use_gpu'], axis=1, inplace=True)
result_data.drop(['Latency_P50', 'Latency_P75', 'Latency_P90', 'Latency_P95'], axis=1, inplace=True)
cols = result_data.columns.tolist()
cols = cols[-4:] + cols[:-4]
result_data = result_data[cols]
result_data

./onnx_models/perf_results_CPU_B1_S128_20210830-220600.txt


| Unnamed: 0 | Latency(ms) | Latency_P99 | Throughput(QPS) | intra_op_num_threads |
|---|---|---|---|---|
| 0 | 136.36 | 324.84 | 7.33 | 24 |
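
When the summary file contains several rows (for example, after testing more than one thread setting), the fastest row can be picked without pandas. The sketch below assumes the file is tab-separated, as `pandas.read_table` expects by default, and uses the column names shown above. `best_setting` is a hypothetical helper, and the second row of the sample text is made up purely to demonstrate the selection.

```python
import csv
import io


def best_setting(tsv_text, latency_col="Latency(ms)"):
    """Return the row (as a dict of strings) with the lowest average latency."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        raise ValueError("no result rows found")
    return min(rows, key=lambda row: float(row[latency_col]))


# First data row is taken from the output above; the second row is invented for the demo.
SAMPLE = (
    "Latency(ms)\tLatency_P99\tThroughput(QPS)\tintra_op_num_threads\n"
    "136.36\t324.84\t7.33\t24\n"
    "150.00\t340.00\t6.67\t12\n"
)

best = best_setting(SAMPLE)
print("fastest: {} threads at {} ms".format(best["intra_op_num_threads"], best["Latency(ms)"]))
```

For a real file, the text could be read with `open(latest_result_file).read()` and passed to the same function.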


## 6. Additional Info

Note that running inside Jupyter Notebook affects the performance results, since Jupyter Notebook itself uses system resources such as CPU and memory. For more accurate numbers, close Jupyter Notebook and other applications, then run the performance test tool in a console.

We have a [benchmark script](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/transformers/run_benchmark.sh). It is recommended to use it to measure inference speed of OnnxRuntime.

The [OnnxRuntime C API](https://github.com/microsoft/onnxruntime/blob/master/docs/C_API.md) can deliver slightly better performance than the Python API. If you use the C API for inference, you can instead measure performance with OnnxRuntime_Perf_Test.exe built from source.

Here is the configuration of the machine that generated the above results. The machine has GPUs, but they are not used for CPU inference.
Your results may be slower or faster depending on your hardware.

In [25]:
!{sys.executable} -m pip install --quiet py-cpuinfo py3nvml
!{sys.executable} -m onnxruntime.transformers.machine_info --silent

{
  "gpu": {
    "driver_version": "455.45.01",
    "devices": [
      {
        "memory_total": 16945512448,
        "memory_available": 13019643904,
        "name": "Tesla V100-PCIE-16GB"
      },
      {
        "memory_total": 16945512448,
        "memory_available": 16457924608,
        "name": "Tesla V100-PCIE-16GB"
      },
      {
        "memory_total": 16945512448,
        "memory_available": 16457924608,
        "name": "Tesla V100-PCIE-16GB"
      },
      {
        "memory_total": 16945512448,
        "memory_available": 16457924608,
        "name": "Tesla V100-PCIE-16GB"
      }
    ]
  },
  "cpu": {
    "brand": "Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz",
    "cores": 24,
    "logical_cores": 24,
    "hz": [
      2593997000,
      0
    ],
    "l2_cache": 262144,
    "flags": [
      "3dnowprefetch",
      "abm",
      "adx",
      "aes",
      "apic",
      "avx",
      "avx2",
      "bmi1",
      "bmi2",
      "clflush",
      "cmov",
      "constant_tsc",
      "cpu