Copyright (c) Microsoft Corporation. All rights reserved.  
Licensed under the MIT License.

# Inference PyTorch GPT2 Model with ONNX Runtime on CPU

In this tutorial, you will learn how to load a GPT2 model from PyTorch, convert it to ONNX, and run inference with ONNX Runtime using IO Binding. Past state is used to get better performance.

## Prerequisites ##

If you have Jupyter Notebook, you can run this notebook directly. We will use pip to install or upgrade [PyTorch](https://pytorch.org/), [OnnxRuntime](https://microsoft.github.io/onnxruntime/) and other required packages.

Otherwise, you can set up a new environment. First, install [Anaconda](https://www.anaconda.com/distribution/). Then open an Anaconda prompt window and run the following commands:

```console
conda create -n cpu_env python=3.8
conda activate cpu_env
conda install jupyter
jupyter notebook
```
The last command launches Jupyter Notebook. Open this notebook in the browser to continue.

In [2]:
# Install PyTorch 1.6.0 and OnnxRuntime 1.5.1 for CPU-only
import sys
if sys.platform == 'darwin': # Mac
    !{sys.executable} -m pip install --upgrade torch torchvision
else:
    !{sys.executable} -m pip install --upgrade torch==1.6.0+cpu torchvision==0.7.0+cpu -f https://download.pytorch.org/whl/torch_stable.html
!{sys.executable} -m pip install onnxruntime

# Install other packages used in this notebook.
!{sys.executable} -m pip install transformers
!{sys.executable} -m pip install onnx onnxconverter_common psutil pytz pandas py-cpuinfo py3nvml netron coloredlogs

You should consider upgrading via the '/usr/local/opt/python@3.9/bin/python3.9 -m pip install --upgrade pip' command.
Collecting transformers==3.0.2
  Using cached transformers-3.0.2-py3-none-any.whl (769 kB)
Collecting filelock
  Using cached filelock-3.3.1-py3-none-any.whl (9.7 kB)
Collecting regex!=2019.12.17
  Using cached regex-2021.10.23-cp39-cp39-macosx_10_9_x86_64.whl (288 kB)
Collecting sentencepiece!=0.1.92
  Using cached sentencepiece-0.1.96-cp39-cp39-macosx_10_6_x86_64.whl (1.1 MB)
Collecting tokenizers==0.8.1.rc1
  Using cached tokenizers-0.8.1rc1.tar.gz (97 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting tqdm>=4.27
  Using cached tqdm-4.62.3-py2.py3-none-any.whl (76 kB)
Collecting packaging
  Using cached packaging-21.0-py3-non

In [3]:
import os

# Create a cache directory to store pretrained model.
cache_dir = os.path.join(".", "cache_models")
if not os.path.exists(cache_dir):
    os.makedirs(cache_dir)

## Convert GPT2 model from PyTorch to ONNX ##

The script [convert_to_onnx.py](https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/python/tools/transformers/convert_to_onnx.py) converts GPT2 with past state to ONNX.

The script accepts a pretrained model name or the path of a checkpoint directory as input, and converts the model to ONNX. It also verifies that the ONNX model produces the same output as the PyTorch model. Usage:
```
python -m onnxruntime.transformers.convert_to_onnx -m model_name_or_path --output gpt2.onnx -o -p fp32|fp16|int8
```
The `-p` option selects the precision: fp32 (float32), fp16 (mixed precision) or int8 (quantization). The `-o` option generates an optimized model, which is required for fp16 or int8.
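
As a minimal sketch, the command line above can also be assembled programmatically. The helper name `build_convert_command` is hypothetical and not part of ONNX Runtime. Only the flags from the usage line (`-m`, `--output`, `-o`, `-p`) are used, and the rule that `-o` is needed for fp16 and int8 is taken from the paragraph above.

```python
import sys

VALID_PRECISIONS = ("fp32", "fp16", "int8")


def build_convert_command(model_name_or_path, output_path, precision="fp32", optimize=None):
    """Build the argument list for convert_to_onnx (hypothetical helper)."""
    if precision not in VALID_PRECISIONS:
        raise ValueError(f"precision must be one of {VALID_PRECISIONS}")
    # The optimized model is required for fp16 and int8, so enable -o by default there.
    if optimize is None:
        optimize = precision != "fp32"
    if precision != "fp32" and not optimize:
        raise ValueError("fp16 and int8 require the optimized model (-o)")

    cmd = [sys.executable, "-m", "onnxruntime.transformers.convert_to_onnx",
           "-m", model_name_or_path, "--output", output_path, "-p", precision]
    if optimize:
        cmd.append("-o")
    return cmd


cmd = build_convert_command("gpt2", "gpt2_int8.onnx", precision="int8")
print(" ".join(cmd[1:]))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` in an environment where ONNX Runtime is installed.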

Here we use a pretrained model as an example:

In [6]:
from onnxruntime.transformers.gpt2_helper import Gpt2Helper, MyGPT2LMHeadModel
from transformers import AutoConfig
import torch

model_name_or_path = "gpt2"
config = AutoConfig.from_pretrained(model_name_or_path, cache_dir=cache_dir)
model = MyGPT2LMHeadModel.from_pretrained(model_name_or_path, config=config, cache_dir=cache_dir)
device = torch.device("cpu")
model.eval().to(device)

print(model.config)

num_attention_heads = model.config.n_head
hidden_size = model.config.n_embd
num_layer = model.config.n_layer

Collecting transformers
  Downloading transformers-4.11.3-py3-none-any.whl (2.9 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp39-cp39-macosx_10_9_x86_64.whl (197 kB)
Collecting sacremoses
  Using cached sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting filelock
  Using cached filelock-3.3.1-py3-none-any.whl (9.7 kB)
Collecting huggingface-hub>=0.0.17
  Downloading huggingface_hub-0.0.19-py3-none-any.whl (56 kB)
Collecting regex!=2019.12.17
  Using cached regex-2021.10.23-cp39-cp39-macosx_10_9_x86_64.whl (288 kB)
Collecting tqdm>=4.27
  Using cached tqdm-4.62.3-py2.py3-none-any.whl (76 kB)
Collecting packaging>=20.0
  Using cached packaging-21.0-py3-none-any.whl (40 kB)
Collecting requests
  Using cached requests-2.26.0-py2.py3-none-any.whl (62 kB)
Collecting tokenizers<0.11,>=0.1

Downloading: 100%|██████████| 665/665 [00:00<00:00, 185kB/s]
Downloading: 100%|██████████| 523M/523M [02:39<00:00, 3.44MB/s]


GPT2Config {
  "_name_or_path": "gpt2",
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 12,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.11.3",
  "use_cache": true,
  "vocab_size": 50257
}



In [7]:
onnx_model_path = "gpt2.onnx"
Gpt2Helper.export_onnx(model, device, onnx_model_path) # add parameter use_external_data_format=True when model size > 2 GB

  if batch_size <= 0:
  past_key, past_value = layer_past
  attn_weights = attn_weights / (float(value.size(-1)) ** 0.5)


## PyTorch Inference using Hugging Face Transformers ##

In the following, we use an example input to get the output from PyTorch for comparison.
For the first inference, there is no past state yet, so we prepare an empty state as input.
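
To make the "empty state" concrete, here is a small sketch of the shape of one layer's past tensor. The function names are illustrative only. The values 12 (heads) and 768 (hidden size) are taken from the `GPT2Config` printed earlier, and the leading dimension of 2 is assumed to hold the key and the value.

```python
def empty_past_shape(batch_size, num_attention_heads, hidden_size):
    """Shape of one layer's past state before any token has been processed."""
    head_size = hidden_size // num_attention_heads
    # [key/value, batch, heads, past_sequence_length, head_size]
    return [2, batch_size, num_attention_heads, 0, head_size]


def num_elements(shape):
    """Number of scalars stored in a tensor of the given shape."""
    count = 1
    for dim in shape:
        count *= dim
    return count


shape = empty_past_shape(batch_size=2, num_attention_heads=12, hidden_size=768)
print(shape)                # [2, 2, 12, 0, 64]
print(num_elements(shape))  # 0
```

Because the past sequence length is 0, the tensor holds no data. It only tells the model the batch size, head count and head size.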

In [None]:
from transformers import AutoTokenizer

EXAMPLE_Text = ['best hotel in bay area', 'here is an example of gpt2 model']

def get_tokenizer(model_name_or_path, cache_dir):
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, cache_dir=cache_dir)
    tokenizer.padding_side = "left"
    tokenizer.pad_token = tokenizer.eos_token
    # tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    return tokenizer

def get_example_inputs(prompt_text=EXAMPLE_Text):    
    tokenizer = get_tokenizer(model_name_or_path, cache_dir)
    encodings_dict = tokenizer.batch_encode_plus(prompt_text, padding=True)

    input_ids = torch.tensor(encodings_dict['input_ids'], dtype=torch.int64)
    attention_mask = torch.tensor(encodings_dict['attention_mask'], dtype=torch.float32)
    position_ids = (attention_mask.long().cumsum(-1) - 1)
    position_ids.masked_fill_(position_ids < 0, 0)

    # Empty past state for generating the first word
    empty_past = []
    batch_size = input_ids.size(0)
    sequence_length = input_ids.size(1)
    past_shape = [2, batch_size, num_attention_heads, 0, hidden_size // num_attention_heads]
    for i in range(num_layer):
        empty_past.append(torch.empty(past_shape).type(torch.float32).to(device))
       
    return input_ids, attention_mask, position_ids, empty_past


from transformers import GPT2LMHeadModel
torch_model = GPT2LMHeadModel.from_pretrained(model_name_or_path, config=config, cache_dir=cache_dir)
device = torch.device("cpu")
torch_model.eval().to(device)

input_ids, attention_mask, position_ids, empty_past = get_example_inputs()
print("input_ids", input_ids)
print("attention_mask", attention_mask)
print("position_ids", position_ids)

Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at gpt2 and are newly initialized: ['h.0.attn.masked_bias', 'h.1.attn.masked_bias', 'h.2.attn.masked_bias', 'h.3.attn.masked_bias', 'h.4.attn.masked_bias', 'h.5.attn.masked_bias', 'h.6.attn.masked_bias', 'h.7.attn.masked_bias', 'h.8.attn.masked_bias', 'h.9.attn.masked_bias', 'h.10.attn.masked_bias', 'h.11.attn.masked_bias', 'lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Downloading: 100%|██████████| 1.04M/1.04M [00:01<00:00, 689kB/s]
Downloading: 100%|██████████| 456k/456k [00:01<00:00, 394kB/s]

input_ids tensor([[50256, 50256, 50256, 50256, 13466,  7541,   287, 15489,  1989],
        [ 1456,   318,   281,  1672,   286,   308,   457,    17,  2746]])
attention_mask tensor([[0., 0., 0., 0., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1.]])
position_ids tensor([[0, 0, 0, 0, 0, 1, 2, 3, 4],
        [0, 1, 2, 3, 4, 5, 6, 7, 8]])
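
The `position_ids` above follow from the attention mask: a running sum of the mask, minus one, clamped at zero. The following pure-Python sketch reproduces that arithmetic without PyTorch. The function name is illustrative only.

```python
def position_ids_from_mask(attention_mask):
    """Running count of attended tokens minus one, clamped at zero for left padding."""
    position_ids = []
    for row in attention_mask:
        running = 0
        ids = []
        for mask_value in row:
            running += int(mask_value)
            ids.append(max(running - 1, 0))
        position_ids.append(ids)
    return position_ids


mask = [[0, 0, 0, 0, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1]]
for ids in position_ids_from_mask(mask):
    print(ids)
# [0, 0, 0, 0, 0, 1, 2, 3, 4]
# [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

With left padding, the first real token of every sequence gets position 0, regardless of how many padding tokens precede it.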





In [None]:
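with torch.no_grad():
    # `past` is the argument name in transformers 3.0.x; later releases renamed it to `past_key_values`.
    torch_output = torch_model(input_ids, past=empty_past, attention_mask=attention_mask, position_ids=position_ids)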

## ONNX Runtime Inference ##

We can use ONNX Runtime to run inference. The inputs are a dictionary that maps each input name to a numpy array, and the outputs are a list of numpy arrays. Note that both inputs and outputs are in CPU memory. When you run inference on GPU, data is copied between CPU and GPU for the inputs and outputs.

Let's create an inference session for ONNX Runtime given the exported ONNX model, and see the output.

In [None]:
import onnxruntime
import numpy

input_ids, attention_mask, position_ids, empty_past = get_example_inputs()

onnx_model_path = "gpt2.onnx"
session = onnxruntime.InferenceSession(onnx_model_path)
ort_inputs = {'input_ids': numpy.ascontiguousarray(input_ids.cpu().numpy()),
              'attention_mask' : numpy.ascontiguousarray(attention_mask.cpu().numpy()),
              'position_ids': numpy.ascontiguousarray(position_ids.cpu().numpy())
             }
print(ort_inputs)
for i, past_i in enumerate(empty_past):
    ort_inputs[f'past_{i}'] = numpy.ascontiguousarray(past_i.cpu().numpy())
ort_outputs = session.run(None, ort_inputs)

{'input_ids': array([[50256, 50256, 50256, 50256, 13466,  7541,   287, 15489,  1989],
       [ 1456,   318,   281,  1672,   286,   308,   457,    17,  2746]],
      dtype=int64), 'attention_mask': array([[0., 0., 0., 0., 1., 1., 1., 1., 1.],
       [1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32), 'position_ids': array([[0, 0, 0, 0, 0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4, 5, 6, 7, 8]], dtype=int64)}


We can compare the outputs from PyTorch and ONNX Runtime. The logits are very close (the maximum difference is below 1e-4).

In [None]:
logits_masked_diff = (torch_output[0] - ort_outputs[0]) * attention_mask.unsqueeze(2)
max_logits_diff = logits_masked_diff.abs().max()
print("max logits diff (ignored padding)", max_logits_diff)

max logits diff (ignored padding) tensor(6.8665e-05)
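
The comparison above multiplies the difference by the attention mask so that padded positions do not count. Here is a pure-Python sketch of the same idea on toy nested lists. The function name and the toy values are illustrative only.

```python
def max_masked_abs_diff(a, b, attention_mask):
    """Largest |a - b| over positions whose mask is non-zero.

    a, b: nested lists shaped [batch][sequence][vocab]
    attention_mask: nested list shaped [batch][sequence]
    """
    worst = 0.0
    for a_seq, b_seq, mask_seq in zip(a, b, attention_mask):
        for a_token, b_token, mask_value in zip(a_seq, b_seq, mask_seq):
            if not mask_value:
                continue  # padding position: ignore whatever the models produced
            for x, y in zip(a_token, b_token):
                worst = max(worst, abs(x - y))
    return worst


reference = [[[1.0, 2.0], [3.0, 4.0]]]
candidate = [[[9.0, 9.0], [3.0, 4.5]]]  # first position differs a lot, but it is padding
toy_mask = [[0, 1]]
print(max_masked_abs_diff(reference, candidate, toy_mask))  # 0.5
```

Skipping masked positions gives the same result as multiplying by a 0/1 mask, because a zeroed difference can never be the maximum of non-negative values unless every difference is zero.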


## ONNX Runtime Inference with IO Binding ##

To avoid copying data for inputs and outputs, ONNX Runtime also supports IO Binding. Users can provide their own buffers for inputs and outputs. For GPU inference, the buffers can be in GPU memory to reduce memory copies between CPU and GPU, which helps high-performance inference on GPU. For GPT-2, IO Binding might help performance when the batch size or (past) sequence length is large.

In [None]:
def inference_with_io_binding(session, config, input_ids, position_ids, attention_mask, past):
    output_shapes = Gpt2Helper.get_output_shapes(batch_size=input_ids.size(0),
                                                 past_sequence_length=past[0].size(3),
                                                 sequence_length=input_ids.size(1),
                                                 config=config)
    print(output_shapes)
    output_buffers = Gpt2Helper.get_output_buffers(output_shapes, device)

    print(output_buffers)
    io_binding = Gpt2Helper.prepare_io_binding(session, input_ids, position_ids, attention_mask, past,
                                               output_buffers, output_shapes)
    session.run_with_iobinding(io_binding)

    outputs = Gpt2Helper.get_outputs_from_io_binding_buffer(session, output_buffers, output_shapes,
                                                            return_numpy=False)
    return outputs
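
Output buffers have to be allocated before the run, so the output shapes must be known in advance. The sketch below shows how such shapes can be derived from the model configuration. It is an illustration under the assumption that each `present_i` stacks key and value and covers the past plus the new tokens. It is not the implementation of `Gpt2Helper.get_output_shapes`.

```python
def gpt2_output_shapes(batch_size, past_sequence_length, sequence_length,
                       num_attention_heads, hidden_size, num_layer, vocab_size):
    """Expected output shapes for a GPT-2 model with past state (illustrative)."""
    head_size = hidden_size // num_attention_heads
    total_sequence_length = past_sequence_length + sequence_length
    shapes = {"logits": [batch_size, sequence_length, vocab_size]}
    for i in range(num_layer):
        # [key/value, batch, heads, past + new tokens, head_size]
        shapes[f"present_{i}"] = [2, batch_size, num_attention_heads,
                                  total_sequence_length, head_size]
    return shapes


shapes = gpt2_output_shapes(batch_size=2, past_sequence_length=0, sequence_length=9,
                            num_attention_heads=12, hidden_size=768,
                            num_layer=12, vocab_size=50257)
print(shapes["logits"])      # [2, 9, 50257]
print(shapes["present_0"])   # [2, 2, 12, 9, 64]
```

The configuration values (12 heads, hidden size 768, 12 layers, vocabulary size 50257) come from the `GPT2Config` printed earlier.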

We can see that the result is exactly the same with and without IO Binding:

In [None]:
input_ids, attention_mask, position_ids, empty_past = get_example_inputs()
outputs = inference_with_io_binding(session, config, input_ids, position_ids, attention_mask, empty_past)
for i in range(len(outputs)):
    assert torch.eq(outputs[i], torch.from_numpy(ort_outputs[i])).all()
print("IO Binding result is good")

{'logits': [2, 9, 50257], 'present_0': [2, 2, 12, 9, 64], 'present_1': [2, 2, 12, 9, 64], 'present_2': [2, 2, 12, 9, 64], 'present_3': [2, 2, 12, 9, 64], 'present_4': [2, 2, 12, 9, 64], 'present_5': [2, 2, 12, 9, 64], 'present_6': [2, 2, 12, 9, 64], 'present_7': [2, 2, 12, 9, 64], 'present_8': [2, 2, 12, 9, 64], 'present_9': [2, 2, 12, 9, 64], 'present_10': [2, 2, 12, 9, 64], 'present_11': [2, 2, 12, 9, 64]}
{'logits': tensor([8.4078e-45, 9.8091e-45, 1.1210e-44,  ..., 7.6294e-06, 1.5259e-05,
        3.8147e-05]), 'present_0': tensor([ 2.7551e-39,  7.7592e-37,         nan,  ...,  7.3708e-43,
        -9.1699e+27,  7.3708e-43]), 'present_1': tensor([-0.2771,  2.7640, -0.6938,  ...,  0.1836,  0.2120,  1.4961]), 'present_2': tensor([-0.0141, -0.1919,  0.1306,  ..., -0.8329, -0.3928, -0.2892]), 'present_3': tensor([ 0.2841,  0.0346, -0.8563,  ..., -0.1179,  0.1191,  0.3991]), 'present_4': tensor([-0.1280, -0.7493,  0.6054,  ...,  0.1108,  0.0243, -0.0911]), 'present_5': tensor([ 0.0877, -0.1

## Batch Text Generation ##

Here is an example of text generation using ONNX Runtime or PyTorch. For ONNX Runtime, IO Binding is used for better performance.

In [None]:
def test_generation(tokenizer, input_text, ort_session=None, num_tokens_to_produce = 30):
    use_onnxruntime = (ort_session is not None)
    print("Text generation using", "OnnxRuntime" if use_onnxruntime else "PyTorch", "...")
    eos_token_id = tokenizer.eos_token_id
    
    input_ids, attention_mask, position_ids, past = get_example_inputs(input_text)
    batch_size = input_ids.size(0)

    has_eos = torch.zeros(batch_size, dtype=torch.bool)

    all_token_ids = input_ids.clone()

    for step in range(num_tokens_to_produce):
        if ort_session is not None:
            outputs = inference_with_io_binding(ort_session, config, input_ids, position_ids, attention_mask, past)
        else:
            outputs = torch_model(input_ids, attention_mask=attention_mask, position_ids=position_ids, past=past)  

        next_token_logits = outputs[0][:, -1, :]
        # Greedy approach is used here. You can easily extend it to use beam search and sampling to pick next tokens.
        next_tokens = torch.argmax(next_token_logits, dim=-1)

        has_eos = has_eos | (next_tokens == eos_token_id)
        tokens_to_add = next_tokens.masked_fill(has_eos, eos_token_id)
        all_token_ids = torch.cat([all_token_ids, tokens_to_add.unsqueeze(-1)], dim=-1)

        # Update input_ids, attention_mask, position_ids and past
        input_ids = tokens_to_add.clone().detach().reshape([batch_size, 1]).to(device)    
        position_ids = (position_ids[:,-1] + 1).reshape(batch_size,1)
        attention_mask = torch.cat([attention_mask, torch.ones([batch_size, 1]).type_as(attention_mask)], 1).to(device)    

        past = []
        if not use_onnxruntime:
            past = list(outputs[1]) # past in torch output is tuple
        else:
            for i in range(num_layer):
                past_i = torch.from_numpy(outputs[i + 1]) if isinstance(outputs[i + 1], numpy.ndarray) else outputs[i + 1].clone().detach()
                past.append(past_i.to(device))

        if torch.all(has_eos):
            break

    for i, output in enumerate(all_token_ids):
        print("------------")
        print(tokenizer.decode(output, skip_special_tokens=True))
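
The control flow of the loop above (greedy pick, per-sequence EOS tracking, early exit) can be seen in isolation with a toy stand-in for the model. Everything in this sketch is hypothetical: `toy_next_token_logits` is not GPT-2, the vocabulary has four tokens, and token 0 plays the role of EOS.

```python
EOS = 0
VOCAB_SIZE = 4


def toy_next_token_logits(tokens):
    """Hypothetical stand-in for a model: favors (last token + 1) modulo the vocabulary size."""
    target = (tokens[-1] + 1) % VOCAB_SIZE
    return [1.0 if token == target else 0.0 for token in range(VOCAB_SIZE)]


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


def greedy_generate(batch, max_new_tokens):
    sequences = [list(seq) for seq in batch]
    has_eos = [False] * len(sequences)
    for _ in range(max_new_tokens):
        for i, seq in enumerate(sequences):
            next_token = argmax(toy_next_token_logits(seq))
            has_eos[i] = has_eos[i] or next_token == EOS
            # A finished sequence keeps receiving EOS so the batch stays rectangular.
            seq.append(EOS if has_eos[i] else next_token)
        if all(has_eos):
            break  # every sequence has finished, so stop early
    return sequences


print(greedy_generate([[1], [3]], max_new_tokens=5))
# [[1, 2, 3, 0], [3, 0, 0, 0]]
```

The second sequence reaches EOS after one step and is padded with EOS until the first one finishes, which mirrors the `masked_fill(has_eos, eos_token_id)` call in the notebook code.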

In [None]:
tokenizer = get_tokenizer(model_name_or_path, cache_dir)
input_text = EXAMPLE_Text
test_generation(tokenizer, input_text, ort_session=session)

Text generation using OnnxRuntime ...
------------
best hotel in bay area.

The hotel is located in the historic Bayview neighborhood of San Francisco.

The hotel is open daily from 9 a.m.
------------
here is an example of gpt2 model.

The gpt2 model is a simple, but powerful, way to generate a GPT2-like data structure. It is a


Next, we run the same generation with PyTorch and see that the result is exactly the same.

In [None]:
test_generation(tokenizer, input_text)

Text generation using PyTorch ...
------------
best hotel in bay area.

The hotel is located in the historic Bayview neighborhood of San Francisco.

The hotel is open daily from 9 a.m.
------------
here is an example of gpt2 model.

The gpt2 model is a simple, but powerful, way to generate a GPT2-like data structure. It is a


## Int8 Quantization ##
Next, we will apply dynamic quantization to the model. We optimize the model before quantization to get better performance.

Note that the text generated by the fp32 and int8 models could be quite different. You should evaluate the precision metric of your application for both fp32 and int8 models. If the quality of the int8 model's results is acceptable, it is faster than the fp32 model in inference.

Note that you can leverage [quantization aware training (QAT)](https://pytorch.org/blog/introduction-to-quantization-on-pytorch/) for accuracy improvement if needed.
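
To give a feeling for why int8 results can differ from fp32, here is a generic sketch of affine 8-bit quantization: floats are mapped to integers in [0, 255] with a scale and a zero point, and mapped back with some rounding error. This is a textbook illustration under that assumption, not a description of what the ONNX Runtime quantizer does internally. The function names are illustrative only.

```python
def quantize_uint8(values):
    """Map floats to integers in [0, 255] using a scale and a zero point."""
    low = min(min(values), 0.0)    # the range must include 0 so that 0.0 is exact
    high = max(max(values), 0.0)
    scale = (high - low) / 255.0 or 1.0  # fall back to 1.0 when all values are 0
    zero_point = round(-low / scale)
    quantized = [min(255, max(0, round(v / scale) + zero_point)) for v in values]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Map the integers back to approximate floats."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-1.0, -0.25, 0.0, 0.5, 1.5]
quantized, scale, zero_point = quantize_uint8(weights)
restored = dequantize(quantized, scale, zero_point)

print("quantized:", quantized)
print("max round-trip error:", max(abs(w - r) for w, r in zip(weights, restored)))
```

The round-trip error is bounded by the scale, so tensors with a wide value range lose more precision than tensors with a narrow one.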

In [None]:
from onnxruntime.transformers.quantize_helper import QuantizeHelper

optimized_fp32_model_path = "gpt2_fp32.onnx"
quantized_int8_model_path = "gpt2_int8.onnx"
Gpt2Helper.optimize_onnx("gpt2.onnx", optimized_fp32_model_path, False, model.config.num_attention_heads, model.config.hidden_size)
QuantizeHelper.quantize_onnx_model(optimized_fp32_model_path, quantized_int8_model_path)

         Please use quantize_static for static quantization, quantize_dynamic for dynamic quantization.


In [None]:
session_int8 = onnxruntime.InferenceSession(quantized_int8_model_path)
input_text = ['bert model optimization']
test_generation(tokenizer, input_text, ort_session=session_int8, num_tokens_to_produce=14)

Text generation using OnnxRuntime ...
------------
bert model optimization, and the NLP model is a generalizable and robust model.


## Benchmark ##
The tool benchmark_gpt2.py measures GPT-2 performance with PyTorch and with ONNX Runtime, with or without IO Binding.
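
As a rough idea of what such a measurement involves, the following sketch averages the latency of a callable over repeated runs and computes a speedup ratio. It is not how benchmark_gpt2.py is implemented. The helper names and the warm-up run are assumptions, and the callable is a placeholder rather than a model.

```python
import time


def average_latency_ms(run_once, test_times=100, warmup=1):
    """Average wall-clock latency of run_once in milliseconds."""
    for _ in range(warmup):
        run_once()  # warm-up runs are excluded from the measurement
    start = time.perf_counter()
    for _ in range(test_times):
        run_once()
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / test_times


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


# Placeholder workload; replace it with a call that runs one inference.
latency = average_latency_ms(lambda: sum(range(1000)), test_times=100)
print(f"average latency: {latency:.4f} ms")
print(speedup(20.0, 10.0))  # 2.0
```

To compare two sessions, pass a callable such as `lambda: session.run(None, ort_inputs)` for each and feed the two averages to `speedup`.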

In [None]:
!{sys.executable} -m onnxruntime.transformers.benchmark_gpt2 -m gpt2 -o

ATen/Parallel:
	at::get_num_threads() : 12
	at::get_num_interop_threads() : 6
OpenMP 2019
	omp_get_max_threads() : 12
Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191125 for Intel(R) 64 architecture applications
	mkl_get_max_threads() : 12
Intel(R) MKL-DNN v1.5.0 (Git Hash e2ac1fac44c5078ca927cb9b90e1b3066a0b2ed0)
std::thread::hardware_concurrency() : 12
Environment variables:
	OMP_NUM_THREADS : [not set]
	MKL_NUM_THREADS : [not set]
ATen parallel backend: OpenMP



2020-09-30 18:44:40.720277: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
Arguments:Namespace(batch_sizes=[1], cache_dir='.\\cache_models', include_copy_output_latency=False, model_class='GPT2LMHeadModel', model_name_or_path='gpt2', onnx_dir='.\\onnx_models', optimize_onnx=True, past_sequence_lengths=[8, 16, 32, 64, 128, 256], precision=<Precision.FLOAT32: 'fp32'>, result_csv=None, test_times=100, thread_num=-1, torchscript=False, use_gpu=False, validate_onnx=False, verbose=False)
PyTorch Version:1.6.0+cpu
Transformers Version:3.0.2
Onnxruntime Version:1.5.1
Shapes: input_ids=torch.Size([1, 1]) past=torch.Size([2, 1, 12, 1, 64]) output=torch.Size([1, 1, 50257]) present=torch.Size([2, 1, 12, 2, 64])
  assert batch_size > 0, "batch_size has to be defined and > 0"
  w = w / (float(v.size(-1)) ** 0.5)
  mask = self.bias[:, :, ns - nd : ns, :ns]
Fused LayerNormalization count: 25
Fused FastGelu count: 12
Fused Attention(

In [None]:
!{sys.executable} -m onnxruntime.transformers.benchmark_gpt2 -m gpt2 -o --precision int8

ATen/Parallel:
	at::get_num_threads() : 12
	at::get_num_interop_threads() : 6
OpenMP 2019
	omp_get_max_threads() : 12
Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191125 for Intel(R) 64 architecture applications
	mkl_get_max_threads() : 12
Intel(R) MKL-DNN v1.5.0 (Git Hash e2ac1fac44c5078ca927cb9b90e1b3066a0b2ed0)
std::thread::hardware_concurrency() : 12
Environment variables:
	OMP_NUM_THREADS : [not set]
	MKL_NUM_THREADS : [not set]
ATen parallel backend: OpenMP

         Please use quantize_static for static quantization, quantize_dynamic for dynamic quantization.


2020-09-30 18:47:09.756025: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
Arguments:Namespace(batch_sizes=[1], cache_dir='.\\cache_models', include_copy_output_latency=False, model_class='GPT2LMHeadModel', model_name_or_path='gpt2', onnx_dir='.\\onnx_models', optimize_onnx=True, past_sequence_lengths=[8, 16, 32, 64, 128, 256], precision=<Precision.INT8: 'int8'>, result_csv=None, test_times=100, thread_num=-1, torchscript=False, use_gpu=False, validate_onnx=False, verbose=False)
PyTorch Version:1.6.0+cpu
Transformers Version:3.0.2
Onnxruntime Version:1.5.1
Shapes: input_ids=torch.Size([1, 1]) past=torch.Size([2, 1, 12, 1, 64]) output=torch.Size([1, 1, 50257]) present=torch.Size([2, 1, 12, 2, 64])
  assert batch_size > 0, "batch_size has to be defined and > 0"
  w = w / (float(v.size(-1)) ** 0.5)
  mask = self.bias[:, :, ns - nd : ns, :ns]
Fused LayerNormalization count: 25
Fused FastGelu count: 12
Fused Attention(wit

We can see that the quantized model has a significant speedup (close to 2x).

### Test Environment ###
The following shows the hardware and software versions of the test machine:

In [None]:
!{sys.executable} -m onnxruntime.transformers.machine_info --silent

{
  "gpu": {
    "driver_version": "451.67",
    "devices": [
      {
        "memory_total": 8589934592,
        "memory_available": 8480882688,
        "name": "GeForce GTX 1070"
      }
    ]
  },
  "cpu": {
    "brand": "Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz",
    "cores": 6,
    "logical_cores": 12,
    "hz": "3.1920 GHz",
    "l2_cache": "1536 KB",
    "flags": [
      "3dnow",
      "3dnowprefetch",
      "abm",
      "acpi",
      "adx",
      "aes",
      "apic",
      "avx",
      "avx2",
      "bmi1",
      "bmi2",
      "clflush",
      "clflushopt",
      "cmov",
      "cx16",
      "cx8",
      "de",
      "dtes64",
      "dts",
      "erms",
      "est",
      "f16c",
      "fma",
      "fpu",
      "fxsr",
      "hle",
      "ht",
      "hypervisor",
      "ia64",
      "invpcid",
      "lahf_lm",
      "mca",
      "mce",
      "mmx",
      "movbe",
      "mpx",
      "msr",
      "mtrr",
      "osxsave",
      "pae",
      "pat",
      "pbe",
      "pcid",
      "pc

2020-09-30 18:49:40.600527: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library cudart64_101.dll
