
OnnxT5 slower than Pytorch #23

Open
GRIGORR opened this issue Oct 13, 2021 · 16 comments

@GRIGORR

GRIGORR commented Oct 13, 2021

Hi. I have created an OnnxT5 model (non-quantized) as shown in the README, but OnnxT5 is 10-20% slower than the original Hugging Face T5. Could you share how the latency difference shown in the repo was obtained? Thanks
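
For reference, a minimal sketch of one way to time the two models side by side (this is not necessarily how the repo's numbers were produced; the checkpoint, prompt, and the time_generate helper below are illustrative placeholders):

from time import perf_counter
from fastT5 import export_and_get_onnx_model
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-small"                        # placeholder checkpoint
onnx_model = export_and_get_onnx_model(model_name)
torch_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

tokens = tokenizer("translate English to French: Hello there", return_tensors="pt")

def time_generate(model, runs=5):
    model.generate(**tokens, max_length=32)    # warm-up run, excludes one-time setup cost
    start = perf_counter()
    for _ in range(runs):
        model.generate(**tokens, max_length=32)
    return (perf_counter() - start) / runs

print("onnx :", time_generate(onnx_model))
print("torch:", time_generate(torch_model))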

@piEsposito

Same here.

@sworddish

same here too

@Ki6an
Owner

Ki6an commented Jan 20, 2022

Can you provide the device specifications and the code you are using to test the speed?

@pramodith

Hi, I'm seeing the same problem. The quantized ONNX version is faster than the PyTorch model when I run it with a batch size of 1, but with a batch size of 32 it's much slower. I'm using fastt5==0.1.2, transformers==4.10.0 and pytorch==1.7.1. Is there something wrong with my setup?
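
For what it's worth, a rough sketch of how that batch-size effect could be reproduced (the checkpoint, prompt, and batch sizes below are placeholders, not part of fastT5):

from time import perf_counter
from fastT5 import export_and_get_onnx_model
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-small"                        # placeholder checkpoint
onnx_model = export_and_get_onnx_model(model_name)
torch_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "translate English to French: The house is wonderful."
for batch_size in (1, 8, 32):
    batch = tokenizer([text] * batch_size, return_tensors="pt", padding=True)
    for label, model in (("onnx", onnx_model), ("torch", torch_model)):
        model.generate(**batch, max_length=32)  # warm-up
        start = perf_counter()
        model.generate(**batch, max_length=32)
        print(label, batch_size, round(perf_counter() - start, 3), "s")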

@jasontian6666

In my case, even the quantized version is slower than PyTorch when the input sequence length is >100 tokens. Not sure if this is expected.

@JoeREISys

JoeREISys commented May 16, 2022

I experienced this as well. I'm quite disappointed.

from fastT5 import get_onnx_model
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from timeit import default_timer as timer

py_model_name = 'Salesforce/mixqg-large'
model_name = 'mixqg-large'
custom_output_path = './onnx_t5'
model = get_onnx_model(model_name, custom_output_path)

py_model = AutoModelForSeq2SeqLM.from_pretrained(py_model_name)
tokenizer = AutoTokenizer.from_pretrained(py_model_name)

# this is also the batch size
num_texts = 4                             # Number of input texts to decode
num_beams = 1                             # Number of beams per input text
max_encoder_length = 768                   # Maximum input token length
max_decoder_length = 768                   # Maximum output token length

def infer(model, tokenizer, text):

    # Truncate and pad to the maximum length so the token size matches the fixed-size encoder (not necessary for pure CPU execution)
    batch = tokenizer(text, max_length=max_encoder_length, truncation=True, padding='max_length', return_tensors="pt")
    output = model.generate(**batch, max_length=max_decoder_length, num_beams=num_beams, num_return_sequences=num_beams)
    results = [tokenizer.decode(t, skip_special_tokens=True) for t in output]

    print('Texts:')
    for i, summary in enumerate(results):
        print(i + 1, summary)

seq_0 = "Speed bumps are designed to make drivers to slow down. \n Speed bumps are designed to make drivers to slow down. Going over a typical speed bump at 5 miles per hour results in a gentle bounce, while hitting one at 20 delivers a sizable jolt. It's natural to assume that hitting a speed bump at 60mph would deliver a proportionally larger jolt, but it probably wouldn't."
seq_1 = "Going over a typical speed bump at 5 miles per hour results in a gentle bounce, while hitting one at 20 delivers a sizable jolt. \n Speed bumps are designed to make drivers to slow down. Going over a typical speed bump at 5 miles per hour results in a gentle bounce, while hitting one at 20 delivers a sizable jolt. It's natural to assume that hitting a speed bump at 60mph would deliver a proportionally larger jolt, but it probably wouldn't."
seq_2 = "Toyota was by far the most in-demand manufacturer of 2020, totalling over 8.5 million car sales last year. \n Toyota was by far the most in-demand manufacturer of 2020, totalling over 8.5 million car sales last year. They also out-sold rivals Volkswagen by 3.4 million, which equates to just under 10,000 more sales every day and almost 400 more per hour."
seq_3 = "They also out-sold rivals Volkswagen by 3.4 million, which equates to just under 10,000 more sales every day and almost 400 more per hour. \n Toyota was by far the most in-demand manufacturer of 2020, totalling over 8.5 million car sales last year. They also out-sold rivals Volkswagen by 3.4 million, which equates to just under 10,000 more sales every day and almost 400 more per hour."

start = timer()
infer(model, tokenizer, [seq_0, seq_1, seq_2, seq_3])
end = timer()
print("Onnx time:", end - start)

start = timer()
infer(py_model, tokenizer, [seq_0, seq_1, seq_2, seq_3])
end = timer()

print("PyTorch time:", end - start)

Output:

Texts:
1 What do speed bumps do?
2 What does a speed bump do to a driver?
3 What car manufacturer sold the most cars in 2020?
4 How much more did Toyota sell than Volkswagen in 2020?
Pytorch time: 7.330018510000627

Texts:
1 What do speed bumps cause?
2 What does a 20 mph speed bump do?
3 What car manufacturer sold the most cars in 2020?
4 How much more did Toyota sell than Volkswagen in 2020?
Onnx time: 14.700341603999732

@Ki6an
Owner

Ki6an commented May 17, 2022

@JoeREISys I ran the same script in Colab and I'm getting the following results. Maybe it's a device issue.

Exporting to onnx... |################################| 3/3
Quantizing... |################################| 3/3
Setting up onnx model...
Done!
Texts:
1 What do speed bumps cause?
2 What does a speed bump do to a driver?
3 What car manufacturer sold the most cars in 2020?
4 How much more did Toyota sell than their rival in 2020?
Onnx time: 37.83371752200014
Texts:
1 What do speed bumps do?
2 What does a speed bump do to a driver?
3 What car manufacturer sold the most cars in 2020?
4 How much more did Toyota sell than Volkswagen in 2020?
PyTorch time: 54.744560202

@ierezell

ierezell commented May 19, 2022

Hi there, first thanks a lot for this repo!
I experimented with huggingface/optimum which is really nice but they do not support text2text for now (beam search is a beast).

So I tried this repo (on GPU) and got 0.282s for OnnxT5 and 0.160s for a Hugging Face pipeline... so twice as slow for ONNX...
The gap was roughly the same magnitude on CPU.

For OnnxT5 I followed the README.
Note that for GPU I changed the code (in ort_settings.py) to make it work with CUDAExecutionProvider (with onnxruntime-gpu installed and the model loaded on the GPU, confirmed with nvidia-smi).
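
Roughly, the change looks like this (a sketch, assuming onnxruntime-gpu is installed; the create_session helper below is illustrative, not the repo's actual ort_settings.py code):

import onnxruntime as ort

def create_session(model_path: str) -> ort.InferenceSession:
    # Prefer CUDA when onnxruntime-gpu is installed; fall back to CPU otherwise.
    providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
    options = ort.SessionOptions()
    options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    return ort.InferenceSession(model_path, options, providers=providers)

# Sanity check that the GPU provider is actually available at runtime.
print(ort.get_available_providers())

The generation settings I used in both cases: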

DEFAULT_GENERATOR_OPTIONS = {
'max_length': 128, 'min_length': 2, 'early_stopping': True,
'num_beams': 3, 'temperature': 1.0, 'num_return_sequences': 3,
'top_k': 50, 'top_p': 1.0, 'repetition_penalty': 2.0,  'length_penalty': 1.0
}

OnnxT5

from time import perf_counter
from fastT5 import export_and_get_onnx_model
from transformers import AutoTokenizer

model_name = 'mrm8488/t5-base-finetuned-question-generation-ap'
model = export_and_get_onnx_model(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name)
t_input = "answer: bananas context: I like to eat bananas for breakfast"
token = tokenizer(t_input, return_tensors='pt')

start = perf_counter()
tokens = model.generate(input_ids=token['input_ids'],
               attention_mask=token['attention_mask'],
               **DEFAULT_GENERATOR_OPTIONS)
print(perf_counter()-start)
output = tokenizer.batch_decode(tokens, skip_special_tokens=True)  # num_return_sequences=3 returns multiple sequences
print(output)

Hugging Face pipeline:

from time import perf_counter
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Baseline: the same checkpoint loaded as a plain Hugging Face model on GPU.
model_name = 'mrm8488/t5-base-finetuned-question-generation-ap'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to("cuda")

input_texts = ["answer: bananas context: I like to eat bananas for breakfast"]
inputs = tokenizer.batch_encode_plus(
              input_texts, return_tensors='pt', add_special_tokens=True,
              padding=True, truncation=True
          )
inputs = inputs.to("cuda")
start = perf_counter()
output = model.generate(
                input_ids=inputs['input_ids'],
                attention_mask=inputs["attention_mask"],
                 **DEFAULT_GENERATOR_OPTIONS
             )
print(perf_counter()-start)
all_sentences= tokenizer.batch_decode(output, skip_special_tokens=True)
print(all_sentences)

Conclusion: even with the ONNX optimizations, and on GPU, the model is twice as slow.
Note that with optimum and a TokenClassifier model I got a 10x improvement.

Thanks in advance for any help,
Have a great day.

@JoeREISys

JoeREISys commented May 19, 2022

@JoeREISys I ran the same script in Colab and I'm getting the following results. Maybe it's a device issue.

@Ki6an I was using R6i instances; I will retry with C6g and C6i instances in AWS. As a side note, could there be performance gains from exporting the generate implementation as a TorchScript module with traced encoder and decoder submodules, or can ONNX not handle traced submodules?

@JoeREISys

JoeREISys commented May 19, 2022

So I tried this repo (on GPU) and got 0.282s for OnnxT5 and 0.160s for a Hugging Face pipeline... so twice as slow for ONNX...

@ierezell It's mentioned in the documentation that this repo is not optimized for CUDA. GPU performance is expected to be the same as or worse than the HF pipeline on GPU. A GPU implementation is in progress, but ONNX T5 optimizations don't exist yet for GPU.

@ierezell

@JoeREISys, I updated my comment; I had the same issue on CPU (I tried GPU on the off chance that it would improve things...).

@xingenju

Hi Ki6an,
I'm hitting the same issue: fastT5 seems faster on my Mac, but slower on an AWS P2 instance. Could you clarify which machine configurations fastT5 is expected to work well on?

@Oxi84

Oxi84 commented Jan 14, 2023

Same thing for me: it is around 10 percent slower. I run a batch size of around 10-15, beam size 4, and the sequence length is on average 15-20 tokens.

Probably the best optimization you can do is to run multiple batches.

@Oxi84

Oxi84 commented Jan 18, 2023

I tried on another CPU and now it is 2x slower (without quantization) than PyTorch with the same settings as above: batch size around 10-15, beam size 4, and average sequence length 15-20 tokens.

@Oxi84

Oxi84 commented Jan 20, 2023

It does work faster when using smaller batches and fewer cores. It is probably optimal to divide the CPU cores by setting the PyTorch thread count and then running a few separate Flask servers for the interface, since fastT5 works amazingly well on one core but a second core does not give any speedup.
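
Roughly the kind of thread pinning I mean (a sketch; fastT5 builds its onnxruntime sessions internally, so the SessionOptions part only applies if you construct or patch the session yourself, e.g. in ort_settings.py):

import os
import torch
import onnxruntime as ort

# Cap OpenMP threads; ideally set this before torch/onnxruntime are imported.
os.environ["OMP_NUM_THREADS"] = "1"

# Limit PyTorch to a single intra-op thread per worker process.
torch.set_num_threads(1)

# onnxruntime thread counts live on SessionOptions; they take effect only if
# these options are passed to the InferenceSession when it is created.
options = ort.SessionOptions()
options.intra_op_num_threads = 1
options.inter_op_num_threads = 1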

@jayiitp

jayiitp commented Jun 10, 2023

@ierezell It would be helpful if you could share the script of ort_settings.py for GPU. I tried the method you described but I am getting an error.
