Different behaviour when extending this project to Bart #7

Open
HenryDashwood opened this issue Apr 6, 2021 · 35 comments

Comments

@HenryDashwood

Hello there. This is a really fantastic project. I'm trying to extend your work to Bart, but I've run into some strange behaviour.

I've made a Colab notebook to illustrate the problem. Specifically, when converting Bart to ONNX, the encoder_hidden_states input does not get included in the ONNX model's graph. As you can see from the notebook, though, it works perfectly for T5.

I realise this is out of scope for the fastT5 project, but thought someone who comes across this issue might have experienced a similar problem and be able to help. It may also be useful to know in case you have plans to expand this project to include models like Bart in the future.
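A quick way to confirm which inputs actually ended up in the exported graph (the file name is a placeholder for the notebook's output path):

    import onnx

    model = onnx.load("bart-decoder.onnx")
    print([inp.name for inp in model.graph.input])
    # If encoder_hidden_states is missing from this list, it was dropped during export.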

@Ki6an
Owner

Ki6an commented Apr 6, 2021

Thank you!

@tobigue was able to export mBART to ONNX; he might be able to help.

@tobigue

tobigue commented Apr 6, 2021

Hey, I also came across this when trying to adapt the fastT5 code for converting (m)BART to ONNX. I think it is because the encoder_hidden_states are passed as key_value_states into MBartAttention down the line and are not used when past_key_value is given.

So my guess was that the input is not included in the exported graph for that reason.
I'll try to share my progress in a Colab notebook tomorrow.
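As a minimal sketch of that suspicion (a toy module, not the notebook's code): the tracing-based export drops inputs that never influence the output, so they disappear from the graph.

    import torch
    import onnx

    class Toy(torch.nn.Module):
        def forward(self, x, unused):
            # `unused` plays the role of encoder_hidden_states when past_key_value
            # is already supplied: it never influences the output.
            return x * 2

    torch.onnx.export(Toy(), (torch.randn(1, 3), torch.randn(1, 3)), "toy.onnx")
    print(len(onnx.load("toy.onnx").graph.input))  # 1, not 2: the unused input was pruned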

@HenryDashwood
Author

Ah, that's really good to know. I thought I was going mad for a bit! Please do keep us updated!

@tobigue

tobigue commented Apr 7, 2021

@HenryDashwood this is my current progress when converting mbart to onnx:
https://colab.research.google.com/drive/1b4gCQJbfdr2nEKyb5xvbjeBrTvkelG9L?usp=sharing

Currently the ONNX model gives a (slightly) different output than the PyTorch model for the same input, and I'm not sure why.
You can see that at the very end of the notebook.

Also, I had to use the external data format: the mBART export failed without it because the exported model is >2GB for some reason, even though T5-large fits into <2GB with fastT5. That makes quantization not straightforward, as ONNXQuantizer uses infer_shapes, but the external data format is only supported by infer_shapes_path. There is also a problem when loading models that are not saved in ./, because f.name is used instead of f.
I changed those places in the onnx/onnxruntime code and was able to run the quantization; however, when trying to load the quantized models I get this:

[ONNXRuntimeError] : 1 : FAIL : Load model from models/mbart-large-en-ro/decoder-quantized/model.onnx failed:This is an invalid model. Error: Duplicate definition of name (decoder.embed_tokens.weight).

If you have any insights on the differing outputs, or how to make the quantization work, or manage to export and quantize a BART model in another way, sharing is much appreciated. :)

Cheers!
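For reference, the path-based shape inference mentioned above looks roughly like this (paths are placeholders):

    from onnx.shape_inference import infer_shapes_path

    # Writes a model with inferred shapes to the output path; unlike the in-memory
    # infer_shapes, this variant works with models saved in the external data format.
    infer_shapes_path(
        "models/mbart-large-en-ro/decoder/model.onnx",
        "models/mbart-large-en-ro/decoder/model-with-shapes.onnx",
    )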

@tobigue

tobigue commented Apr 8, 2021

The reason the init decoder was so large was that it exported the token embeddings twice. With do_constant_folding=False they are only exported once, and the model can then be exported without the external data format. I also looked at the mBART implementation again and saw that it does not rescale the output (no * (self.model_dim ** -0.5)) but does add a final_logits_bias. I adjusted the notebook accordingly and will try quantization tomorrow.
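For reference, a rough sketch of the export call being adjusted here (the init_decoder wrapper and the sample tensors are placeholders for what the notebook builds):

    import torch

    torch.onnx.export(
        init_decoder,                                         # placeholder: decoder + lm_head wrapper
        (input_ids, attention_mask, encoder_hidden_states),   # placeholder sample inputs
        "init-decoder.onnx",
        input_names=["input_ids", "attention_mask", "encoder_hidden_states"],
        output_names=["logits"],
        dynamic_axes={
            "input_ids": {0: "batch", 1: "sequence"},
            "attention_mask": {0: "batch", 1: "sequence"},
            "encoder_hidden_states": {0: "batch", 1: "sequence"},
        },
        do_constant_folding=False,  # avoids exporting the token embeddings twice
        opset_version=12,
    )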

@tobigue

tobigue commented Apr 9, 2021

I got the quantization for the exported model to work, but only using onnxruntime==1.6.0. See this issue.

@Ki6an
Owner

Ki6an commented Apr 9, 2021

Cool!
I also had some issues with 1.7.0: while using onnxruntime==1.7.0 for quantizing, it created extra models (here's the issue). Setting optimize_model=False fixed it.
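For reference, a sketch of the quantization call with that workaround (paths are placeholders, and optimize_model is an argument of the onnxruntime versions discussed here):

    from onnxruntime.quantization import quantize_dynamic, QuantType

    quantize_dynamic(
        model_input="models/mbart-large-en-ro/decoder/model.onnx",            # placeholder path
        model_output="models/mbart-large-en-ro/decoder-quantized/model.onnx",
        weight_type=QuantType.QUInt8,
        optimize_model=False,  # avoids the extra models created during quantization with 1.7.0
    )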

@Ki6an
Owner

Ki6an commented Apr 9, 2021

Constant folding replaces some of the operations that have all-constant inputs; it's not clear why it creates the embeddings twice in BART. In T5 I did not face any issue with do_constant_folding=True.

@tobigue

tobigue commented Apr 9, 2021

It only happens for the init_decoder and I saw that in fastT5 you do not do constant folding for the init decoder (only for encoder and decoder). https://github.com/Ki6an/fastT5/blob/master/fastT5/onnx_exporter.py#L196

If you also export the init decoder with constant folding and use the external data format (or inspect the model with netron), we might see the same behaviour.

@Ki6an
Owner

Ki6an commented Apr 9, 2021

Also, I noticed this in the notebook:

        input_names = [x.name for x in self.decoder.get_inputs()]
        inputs = [
            input_ids.cpu().numpy(),
            attention_mask.cpu().numpy(),
        ] + [
            tensor.cpu().numpy() for tensor in flat_past_key_values
        ]

        decoder_inputs = dict(zip(input_names, inputs))
        decoder_outputs = self.decoder.run(None, decoder_inputs)

In the above code you are using two loops and zip; it might affect the model's speed during inference.
The fastT5 code for comparison:

    past_key_values = {
        f"pkv_{i}": pkv.cpu().numpy() for i, pkv in enumerate(flat_past_key_values)
    }

    decoder_outputs = self.decoder.run(None, {**decoder_inputs, **past_key_values})

I've not tested how much effect it might have.

@Ki6an
Owner

Ki6an commented Apr 9, 2021

It only happens for the init_decoder and I saw that in fastT5 you do not do constant folding for the init decoder (only for encoder and decoder). https://github.com/Ki6an/fastT5/blob/master/fastT5/onnx_exporter.py#L196

You're right, I did not notice it before, but after testing I got these results:

# t5-large => 
# init_decoder = 1.75G with do_constant_folding=True
# init_decoder = 1.62G with do_constant_folding=False

Just a small difference, and for smaller models it's negligible :)

@tobigue

tobigue commented Apr 9, 2021

Just a small difference, and for smaller models it's negligible :)

Yeah, the English models have quite a small vocabulary compared to multilingual models like mT5 and mBART.

You are right about the loops + zip; I would suspect this should be quite fast though, because we are only iterating over ~50 items. When everything works as expected, maybe just naming the inputs from 1 to X is the easier and faster way. 👍

@HenryDashwood
Author

Nice work! Interestingly, I have been able to quantize Bart with onnxruntime 1.7.0 on my Mac without setting optimize_model to False.

@tobigue

tobigue commented Apr 13, 2021

Interesting that you don't get the Duplicate definition of name (decoder.embed_tokens.weight) error with 1.7.0. Did you do something different when building the simplified models?

I think optimization works in your case, because the vocabulary of BART is much smaller than mBART, so duplicating the embeddings will probably not result in a model >2GB.

@HenryDashwood
Author

Yeah, I suspect that's right. I haven't had to use external_data_format or anything like that. How are you doing in terms of speed? I'm not really seeing any speedups in my code, which is probably a bug I've introduced, because when I run your notebook I do see them. Here's mine: https://colab.research.google.com/drive/1e3b9_a6UvNjJWemqubYA7YKxC7vYrvAX?usp=sharing

@Worien

Worien commented Apr 22, 2021

@HenryDashwood Hi, I'm trying to export the bart-large-cnn model for summarization to ONNX and tried to use your Colab, but I'm stuck with an error at the export step:
RuntimeError: Unable to cast from non-held to held instance (T& to Holder<T>) (compile in debug mode for type information)

Did you face it, and how did you fix it?

@HenryDashwood
Author

Very odd error that I also get sometimes, and my only answer at this point is... have you tried running the cell again? That usually sorts it.

@Worien

Worien commented Apr 26, 2021

@HenryDashwood thanks, I was able to run it, but I have a question. Is there a difference between running the model through the generate function and running it with

ort_session = onnxruntime.InferenceSession("super_resolution.onnx")
ort_outs = ort_session.run(None, ort_inputs)

?

@HenryDashwood
Author

If I understand your question correctly, we do both. ort_session.run() gets called inside generate() to get predictions from the model. But we need more than just a single set of predictions to do something like beam search, so generate() takes care of that as well.
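To make that concrete, here is a rough greedy-decoding sketch of what generate() does on top of ort_session.run(); the input names are assumptions and depend on how the model was exported:

    import numpy as np

    def greedy_generate(encoder_sess, decoder_sess, input_ids, attention_mask,
                        decoder_start_token_id, eos_token_id, max_length=64):
        # One encoder pass, then one decoder run() per generated token.
        encoder_hidden_states = encoder_sess.run(
            None, {"input_ids": input_ids, "attention_mask": attention_mask})[0]
        decoder_input_ids = np.array([[decoder_start_token_id]], dtype=np.int64)
        for _ in range(max_length):
            logits = decoder_sess.run(None, {
                "input_ids": decoder_input_ids,
                "encoder_attention_mask": attention_mask,
                "encoder_hidden_states": encoder_hidden_states,
            })[0]
            next_token = np.argmax(logits[:, -1], axis=-1).reshape(1, 1)
            decoder_input_ids = np.concatenate([decoder_input_ids, next_token], axis=-1)
            if next_token.item() == eos_token_id:
                break
        return decoder_input_ids

Beam search keeps several candidate sequences instead of a single argmax, which is why generate() has to manage the loop rather than a single run() call.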

@anshoomehra

anshoomehra commented May 12, 2021

@HenryDashwood great work. I have been struggling with BART summarization to ONNX, and this gives me a ray of hope... [My earlier tries at ONNX never gave final inference summaries similar to the non-ONNX versions. I am pretty sure I made some errors refactoring...]

I downloaded the Colab and am running it locally as-is without any modifications; it's throwing the error stated below. Curious if there are particular library versions I should be on?

RuntimeError: THPVariable_Check(tuple_elem) INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1614378098133/work/torch/csrc/jit/passes/onnx/shape_type_inference.cpp":676, please report a bug to PyTorch.


On a side note, can something similar be achieved with a JIT export? I tried that as well, and after getting the model export and Triton Server to work, it's failing at inference ... triton-inference-server/server#2846

@anshoomehra

Resolved "RuntimeError: THPVariable_Check(tuple_elem) INTERNAL ASSERT FAILED" by upgrading to torch 1.8.1.

@anshoomehra

anshoomehra commented May 12, 2021

@HenryDashwood et al, am I reading it right:

  1. Is the outcome of your analysis that ONNX and the quantized model are slower than PyTorch?

  2. If so, what is the best way to get better performance? I am clocking 900 ms to 1.4 s per call on a T4. I am working on a use case that requires this capability in real time, which is why I am studying how to optimize it to 500 ms or less. Any suggestions are much appreciated.

  3. I know very little about ONNX and JIT exports, but my goal was to export these models, host them on Triton Server with T4s, and hopefully get a speed boost. Not sure if I am on the right path?

@HenryDashwood
Author

Hi @anshoomehra, sorry it took a while to get back. I haven't done much ONNX work in the last couple of weeks, so I'm not really sure why I didn't get a speedup. I'll get back to you if I work it out though!

@anshoomehra

@HenryDashwood appreciate the acknowledgment! Yes, please do continue (if time permits); yours is the only implementation I have seen working without losing the final output quality, and if ONNX brings down the inference time it will be a significant value addition.

I am dedicating the next few days to it, and if I have any breakthroughs I shall share them back ... Thanks much!!

@anshoomehra

anshoomehra commented May 19, 2021

@HenryDashwood I have been able to make some progress with ONNX; however, shifting to GPUs seems challenging.

I have gotten ONNX Runtime to use the GPU with providers=['CUDAExecutionProvider'] (this step was itself an issue because of conflicting libraries; we had to install onnxruntime-gpu and uninstall onnx and onnxruntime to make it work).

Now it seems that, with the way the graph is exported, some variables are still on the CPU, causing the error below. I tried moving the model and tokenizer to the cuda device, along with running the exported version with CUDAExecutionProvider, but it still fails. Would you have any immediate thoughts on what code fix we may need to resolve this?

[screenshots of the error traceback]

@HenryDashwood
Author

Did you try explicitly setting the device of token["input_ids"] and token["attention_mask"]? That's the only thing I can think of.
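For illustration, a minimal sketch of that suggestion (tokenizer, text, and the model path are placeholders):

    import onnxruntime

    # Session backed by the GPU provider (requires onnxruntime-gpu); path is a placeholder.
    sess = onnxruntime.InferenceSession("bart-decoder.onnx",
                                        providers=["CUDAExecutionProvider"])

    # Explicitly place the tokenized tensors on the GPU so the PyTorch side of the
    # pipeline does not mix CPU and CUDA tensors.
    tokens = tokenizer(text, return_tensors="pt")
    input_ids = tokens["input_ids"].to("cuda")
    attention_mask = tokens["attention_mask"].to("cuda")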

You might be interested in following this work as it makes progress: huggingface/transformers#11786

@ayushch3

@HenryDashwood @tobigue I am looking at your implementation of mBART on ONNX and wanted to see if you had any feedback on how to make this work for mBART-50 many-to-many: https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt

Specifically, the generate method passes forced_bos_token_id as the first generated token to translate into the target language. Could you provide any pointers on where this parameter can be passed in your implementation: https://colab.research.google.com/drive/1b4gCQJbfdr2nEKyb5xvbjeBrTvkelG9L?usp=sharing#scrollTo=z5j0jM4CRJHj

@HenryDashwood
Author

Presumably it would be an extra argument passed to the decoder, and would be added to the list of decoder inputs?
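For example, a hedged sketch of that idea for mBART-50 many-to-many (the tokenizer helper and variable names are assumptions based on the Hugging Face mBART-50 documentation, not on the notebook):

    import numpy as np

    # Force the target-language code as the first generated token.
    forced_bos_token_id = tokenizer.lang_code_to_id["ro_RO"]   # placeholder target language
    decoder_input_ids = np.array(
        [[decoder_start_token_id, forced_bos_token_id]], dtype=np.int64
    )
    # Feed decoder_input_ids to the exported decoder so decoding continues
    # after the forced language token.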

@sgiorgis-atypon

Hello @anshoomehra, did you manage to solve the issues with the GPU shifting?

@sidsharma72

sidsharma72 commented Feb 9, 2022

@HenryDashwood @Ki6an when replicating your Colab notebook, I am facing the following error:

TypeError: forward() got an unexpected keyword argument 'cross_attn_head_mask'

This seems like an issue similar to the following one: #18

Following are the library versions I am using:

torch - 1.10.0
transformers - 4.11.3
onnx - 1.10.2
onnxruntime - 1.7.0

@Ki6an
Owner

Ki6an commented Feb 11, 2022

@sidsharma72 we had the same issue with T5 and were able to resolve it in this commit.

Make sure to add the cross_attn_head_mask=None argument to the forward method of OnnxMBart in the notebook, and try again.
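A skeleton of what that fix looks like (the wrapper below is an assumption about the notebook's OnnxMBart, shown only to illustrate where the argument goes):

    import torch

    class OnnxMBart(torch.nn.Module):
        # Hypothetical wrapper mirroring the notebook's decoder wrapper.
        def __init__(self, decoder, lm_head):
            super().__init__()
            self.decoder = decoder
            self.lm_head = lm_head

        def forward(self, input_ids, attention_mask, encoder_hidden_states,
                    cross_attn_head_mask=None):  # accept (and ignore) the new argument
            out = self.decoder(
                input_ids=input_ids,
                attention_mask=attention_mask,
                encoder_hidden_states=encoder_hidden_states,
            )
            return self.lm_head(out.last_hidden_state)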

@sidsharma72

Thanks for the response - I have been able to fix the issue.

@mohanvamsibu-kore

mohanvamsibu-kore commented Feb 15, 2022

Hey @HenryDashwood @tobigue, I was able to use the Colab for the model lidiya/bart-large-xsum-samsum. ONNX was considerably better on CPU, i.e. 4 s whereas PyTorch was 11 s, but on GPU ONNX was taking ~5 s whereas PyTorch was ~500 ms.
I have moved all the tensors to the GPU and converted all the models to GPU. Can you please let me know if this is expected?

@tobigue

tobigue commented Feb 15, 2022

Hey @mohanvamsibu-kore, I don't have experience with ONNX on GPU, but from what I understand the most efficient way to run things on GPU is by using TensorRT. They have an example for T5, which is pretty similar to BART, here: https://github.com/NVIDIA/TensorRT/tree/main/demo/HuggingFace

@will-wiki

will-wiki commented Feb 16, 2022

@tobigue @HenryDashwood @Ki6an Thank you very much for your work, it's very helpful! I can now convert the BART model to an ONNX model, and the outputs of the two are consistent. But I would like to ask whether you have tried to deploy the ONNX model to TensorRT. So far I have been able to run on TensorRT, but the TensorRT results are not consistent with the ONNX model :(
