Different behaviour when extending this project to Bart #7

Open
HenryDashwood opened this issue Apr 6, 2021 · 35 comments

Comments

@HenryDashwood

Hello there. This is a really fantastic project. I'm trying to extend your work to Bart, but I've run into some strange behaviour.

I've made a Colab notebook to illustrate the problem. Specifically, when converting Bart to ONNX, the encoder_hidden_states input does not get included in the ONNX model's graph. As you can see from the notebook, though, it works perfectly for T5.

I realise this is out of scope for the fastT5 project, but thought someone who comes across this issue might have experienced a similar problem and be able to help. It may also be useful to know in case you have plans to expand this project to include models like Bart in the future.
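A quick way to confirm which inputs actually ended up in the exported graph (the file name is a placeholder for the notebook's output path):

    import onnx

    model = onnx.load("bart-decoder.onnx")
    print([inp.name for inp in model.graph.input])
    # If encoder_hidden_states is missing from this list, it was dropped during export.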

@Ki6an
Owner

Ki6an commented Apr 6, 2021

Thank you!

@tobigue was able to export mBART to ONNX; he might be able to help.

@tobigue

tobigue commented Apr 6, 2021

Hey, I also came across this when trying to adapt the fastT5 code for converting (m)BART to ONNX. I think it is because the encoder_hidden_states are passed as key_value_states into MBartAttention down the line and are not used when past_key_value is given.

So my guess was that the input is not included in the exported graph for that reason.
I'll try to share my progress in a Colab notebook tomorrow.
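As a minimal sketch of that suspicion (a toy module, not the notebook's code): the tracing-based export drops inputs that never influence the output, so they disappear from the graph.

    import torch
    import onnx

    class Toy(torch.nn.Module):
        def forward(self, x, unused):
            # `unused` plays the role of encoder_hidden_states when past_key_value
            # is already supplied: it never influences the output.
            return x * 2

    torch.onnx.export(Toy(), (torch.randn(1, 3), torch.randn(1, 3)), "toy.onnx")
    print(len(onnx.load("toy.onnx").graph.input))  # 1, not 2: the unused input was pruned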

@HenryDashwood
Author

Ah, that's really good to know. I thought I was going mad for a bit! Please do keep us updated!

@tobigue

tobigue commented Apr 7, 2021

@HenryDashwood this is my current progress when converting mbart to onnx:
https://colab.research.google.com/drive/1b4gCQJbfdr2nEKyb5xvbjeBrTvkelG9L?usp=sharing

Currently the ONNX model gives a (slightly) different output than the PyTorch model for the same input, and I'm not sure why.
You can see that at the very end of the notebook.

Also, I had to use the external data format: the mBART export failed without it because the exported model is >2GB for some reason, even though T5-large fits into <2GB with fastT5. That makes quantization not straightforward, as ONNXQuantizer uses infer_shapes, but the external data format is only supported by infer_shapes_path. There is also a problem when loading models that are not saved in ./, because f.name is used instead of f.
I changed those places in the onnx/onnxruntime code and was able to run the quantization; however, when trying to load the quantized models I get this:

[ONNXRuntimeError] : 1 : FAIL : Load model from models/mbart-large-en-ro/decoder-quantized/model.onnx failed:This is an invalid model. Error: Duplicate definition of name (decoder.embed_tokens.weight).

If you have any insights on the differing outputs, or how to make the quantization work, or manage to export and quantize a BART model in another way, sharing is much appreciated. :)

Cheers!
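For reference, the path-based shape inference mentioned above looks roughly like this (paths are placeholders):

    from onnx.shape_inference import infer_shapes_path

    # Writes a model with inferred shapes to the output path; unlike the in-memory
    # infer_shapes, this variant works with models saved in the external data format.
    infer_shapes_path(
        "models/mbart-large-en-ro/decoder/model.onnx",
        "models/mbart-large-en-ro/decoder/model-with-shapes.onnx",
    )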

@tobigue

tobigue commented Apr 8, 2021

The reason the init decoder was so large was that it exported the token embeddings twice. With do_constant_folding=False they are only exported once, and the model can then be exported without the external data format. I also looked at the mBART implementation again and saw that it does not rescale the output (no * (self.model_dim ** -0.5)) but does add a final_logits_bias. I adjusted the notebook accordingly and will try quantization tomorrow.
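For reference, a rough sketch of the export call being adjusted here (the init_decoder wrapper and the sample tensors are placeholders for what the notebook builds):

    import torch

    torch.onnx.export(
        init_decoder,                                         # placeholder: decoder + lm_head wrapper
        (input_ids, attention_mask, encoder_hidden_states),   # placeholder sample inputs
        "init-decoder.onnx",
        input_names=["input_ids", "attention_mask", "encoder_hidden_states"],
        output_names=["logits"],
        dynamic_axes={
            "input_ids": {0: "batch", 1: "sequence"},
            "attention_mask": {0: "batch", 1: "sequence"},
            "encoder_hidden_states": {0: "batch", 1: "sequence"},
        },
        do_constant_folding=False,  # avoids exporting the token embeddings twice
        opset_version=12,
    )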

@tobigue

tobigue commented Apr 9, 2021

I got the quantization for the exported model to work, but only using onnxruntime==1.6.0. See this issue.

@Ki6an
Owner

Ki6an commented Apr 9, 2021

Cool!
I also had some issues with 1.7.0: while using onnxruntime==1.7.0 for quantizing, it created extra models (here's the issue). Setting optimize_model=False fixed it.
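For reference, a sketch of the quantization call with that workaround (paths are placeholders, and optimize_model is an argument of the onnxruntime versions discussed here):

    from onnxruntime.quantization import quantize_dynamic, QuantType

    quantize_dynamic(
        model_input="models/mbart-large-en-ro/decoder/model.onnx",            # placeholder path
        model_output="models/mbart-large-en-ro/decoder-quantized/model.onnx",
        weight_type=QuantType.QUInt8,
        optimize_model=False,  # avoids the extra models created during quantization with 1.7.0
    )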

@Ki6an
Owner

Ki6an commented Apr 9, 2021

Constant folding replaces some of the operations that have all-constant inputs; it's not clear why it creates the embeddings twice in BART. In T5 I did not face any issue with do_constant_folding=True.

@tobigue

tobigue commented Apr 9, 2021

It only happens for the init_decoder and I saw that in fastT5 you do not do constant folding for the init decoder (only for encoder and decoder). https://github.com/Ki6an/fastT5/blob/master/fastT5/onnx_exporter.py#L196

If you also export the init decoder with constant folding and use the external data format (or inspect the model with netron), we might see the same behaviour.

@Ki6an
Owner

Ki6an commented Apr 9, 2021

Also, I noticed this in the notebook:

        input_names = [x.name for x in self.decoder.get_inputs()]
        inputs = [
            input_ids.cpu().numpy(),
            attention_mask.cpu().numpy(),
        ] + [
            tensor.cpu().numpy() for tensor in flat_past_key_values
        ]

        decoder_inputs = dict(zip(input_names, inputs))
        decoder_outputs = self.decoder.run(None, decoder_inputs)

In the above code you are using two loops and zip; it might affect the model's speed during inference.
The fastT5 code for comparison:

    past_key_values = {
        f"pkv_{i}": pkv.cpu().numpy() for i, pkv in enumerate(flat_past_key_values)
    }

    decoder_outputs = self.decoder.run(None, {**decoder_inputs, **past_key_values})

I've not tested how much effect it might have.

@Ki6an
Owner

Ki6an commented Apr 9, 2021

It only happens for the init_decoder and I saw that in fastT5 you do not do constant folding for the init decoder (only for encoder and decoder). https://github.com/Ki6an/fastT5/blob/master/fastT5/onnx_exporter.py#L196

You're right, I did not notice it before, but after testing I got these results:

# t5-large => 
# init_decoder = 1.75G with do_constant_folding=True
# init_decoder = 1.62G with do_constant_folding=False

Just a small difference, and for smaller models it's negligible :)

@tobigue

tobigue commented Apr 9, 2021

Just a small difference, and for smaller models it's negligible :)

Yeah, the English models have quite a small vocabulary compared to multilingual models like mT5 and mBART.

You are right about the loops + zip; I would suspect this should be quite fast though, because we are only iterating over ~50 items. When everything works as expected, maybe just naming the inputs from 1 to X is the easier and faster way. 👍

@HenryDashwood
Author

Nice work! Interestingly, I have been able to quantize Bart with onnxruntime 1.7.0 on my Mac without setting optimize_model to False.

@tobigue

tobigue commented Apr 13, 2021

Interesting that you don't get the Duplicate definition of name (decoder.embed_tokens.weight) error with 1.7.0. Did you do something different when building the simplified models?

I think optimization works in your case, because the vocabulary of BART is much smaller than mBART, so duplicating the embeddings will probably not result in a model >2GB.

@HenryDashwood
Author

Yeah, I suspect that's right. I haven't had to use external_data_format or anything like that. How are you doing in terms of speed? I'm not really seeing any speedups in my code, which is probably a bug I've introduced, because when I run your notebook I do see them. Here's mine: https://colab.research.google.com/drive/1e3b9_a6UvNjJWemqubYA7YKxC7vYrvAX?usp=sharing

@Worien

Worien commented Apr 22, 2021

@HenryDashwood Hi, I'm trying to export the bart-large-cnn model for summarization to ONNX and tried to use your Colab, but I'm stuck with an error at the export step:
RuntimeError: Unable to cast from non-held to held instance (T& to Holder<T>) (compile in debug mode for type information)

Did you face it, and how did you fix it?

@HenryDashwood
Author

Very odd error that I also get sometimes, and my only answer at this point is... have you tried running the cell again? That usually sorts it.

@Worien

Worien commented Apr 26, 2021

@HenryDashwood thanks, I was able to run it, but I have a question. Is there a difference between running the model through the generate function and running it with

ort_session = onnxruntime.InferenceSession("super_resolution.onnx")
ort_outs = ort_session.run(None, ort_inputs)

?

@HenryDashwood
Author

If I understand your question correctly, we do both. ort_session.run() gets called inside generate() to get predictions from the model. But we need more than just a single set of predictions to do something like beam search, so generate() takes care of that as well.
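To make that concrete, here is a rough greedy-decoding sketch of what generate() does on top of ort_session.run(); the input names are assumptions and depend on how the model was exported:

    import numpy as np

    def greedy_generate(encoder_sess, decoder_sess, input_ids, attention_mask,
                        decoder_start_token_id, eos_token_id, max_length=64):
        # One encoder pass, then one decoder run() per generated token.
        encoder_hidden_states = encoder_sess.run(
            None, {"input_ids": input_ids, "attention_mask": attention_mask})[0]
        decoder_input_ids = np.array([[decoder_start_token_id]], dtype=np.int64)
        for _ in range(max_length):
            logits = decoder_sess.run(None, {
                "input_ids": decoder_input_ids,
                "encoder_attention_mask": attention_mask,
                "encoder_hidden_states": encoder_hidden_states,
            })[0]
            next_token = np.argmax(logits[:, -1], axis=-1).reshape(1, 1)
            decoder_input_ids = np.concatenate([decoder_input_ids, next_token], axis=-1)
            if next_token.item() == eos_token_id:
                break
        return decoder_input_ids

Beam search keeps several candidate sequences instead of a single argmax, which is why generate() has to manage the loop rather than a single run() call.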

@anshoomehra

anshoomehra commented May 12, 2021

@HenryDashwood great work. I have been struggling with BART summarization to ONNX, and this gives me a ray of hope... [My earlier tries at ONNX never gave final inference summaries similar to the non-ONNX versions. I am pretty sure I made some errors refactoring...]

I downloaded the Colab and am running it locally as-is without any modifications; it's throwing the error stated below. Curious if there are particular library versions I should be on?

RuntimeError: THPVariable_Check(tuple_elem) INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1614378098133/work/torch/csrc/jit/passes/onnx/shape_type_inference.cpp":676, please report a bug to PyTorch.


On a side note, can something similar be achieved with a JIT export? I tried that as well, and after getting the model export and Triton Server to work, it's failing at inference ... triton-inference-server/server#2846

@anshoomehra

Resolved "RuntimeError: THPVariable_Check(tuple_elem) INTERNAL ASSERT FAILED" by upgrading to torch 1.8.1.

@anshoomehra

anshoomehra commented May 12, 2021

@HenryDashwood et al, am I reading it right:

  1. Is the outcome of your analysis that ONNX and the quantized model are slower than PyTorch?

  2. If so, what is the best way to get better performance? I am clocking 900 ms to 1.4 s per call on a T4. I am working on a use case that requires this capability in real time, which is why I am studying how to optimize it to 500 ms or less. Any suggestions are much appreciated.

  3. I know very little about ONNX and JIT exports, but my goal was to export these models, host them on Triton Server with T4s, and hopefully get a speed boost. Not sure if I am on the right path?

@HenryDashwood
Author

Hi @anshoomehra, sorry it took a while to get back. I haven't done much ONNX work in the last couple of weeks, so I'm not really sure why I didn't get a speedup. I'll get back to you if I work it out though!

@anshoomehra

@HenryDashwood appreciate the acknowledgment! Yes, please do continue (if time permits); yours is the only implementation I have seen working without losing the final output quality, and if ONNX brings down the inference time it will be a significant value addition.

I am dedicating the next few days to it, and if I have any breakthroughs I shall share them back ... Thanks much!!

@anshoomehra

anshoomehra commented May 19, 2021

@HenryDashwood I have been able to make some progress with ONNX; however, shifting to GPUs seems challenging.

I have gotten ONNX Runtime to use the GPU with providers=['CUDAExecutionProvider'] (this step was itself an issue because of conflicting libraries; we had to install onnxruntime-gpu and uninstall onnx and onnxruntime to make it work).

Now it seems that, with the way the graph is exported, some variables are still on the CPU, causing the error below. I tried moving the model and tokenizer to the cuda device, along with running the exported version with CUDAExecutionProvider, but it still fails. Would you have any immediate thoughts on what code fix we may need to resolve this?

[screenshots of the error traceback]

@HenryDashwood
Author

Did you try explicitly setting the device of token["input_ids"] and token["attention_mask"]? That's the only thing I can think of.
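For illustration, a minimal sketch of that suggestion (tokenizer, text, and the model path are placeholders):

    import onnxruntime

    # Session backed by the GPU provider (requires onnxruntime-gpu); path is a placeholder.
    sess = onnxruntime.InferenceSession("bart-decoder.onnx",
                                        providers=["CUDAExecutionProvider"])

    # Explicitly place the tokenized tensors on the GPU so the PyTorch side of the
    # pipeline does not mix CPU and CUDA tensors.
    tokens = tokenizer(text, return_tensors="pt")
    input_ids = tokens["input_ids"].to("cuda")
    attention_mask = tokens["attention_mask"].to("cuda")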

You might be interested in following this work as it makes progress: huggingface/transformers#11786

@ayushch3

@HenryDashwood @tobigue I am looking at your implementation of mBART on ONNX and wanted to see if you had any feedback on how to make this work for mBART-50 many-to-many: https://huggingface.co/facebook/mbart-large-50-many-to-many-mmt

Specifically, the generate method passes forced_bos_token_id as the first generated token to translate into the target language. Could you provide any pointers on where this parameter can be passed in your implementation: https://colab.research.google.com/drive/1b4gCQJbfdr2nEKyb5xvbjeBrTvkelG9L?usp=sharing#scrollTo=z5j0jM4CRJHj

@HenryDashwood
Author

Presumably it would be an extra argument passed to the decoder, and would be added to the list of decoder inputs?
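For example, a hedged sketch of that idea for mBART-50 many-to-many (the tokenizer helper and variable names are assumptions based on the Hugging Face mBART-50 documentation, not on the notebook):

    import numpy as np

    # Force the target-language code as the first generated token.
    forced_bos_token_id = tokenizer.lang_code_to_id["ro_RO"]   # placeholder target language
    decoder_input_ids = np.array(
        [[decoder_start_token_id, forced_bos_token_id]], dtype=np.int64
    )
    # Feed decoder_input_ids to the exported decoder so decoding continues
    # after the forced language token.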

@sgiorgis-atypon

Hello @anshoomehra, did you manage to solve the issues with the GPU shifting?

@sidsharma72

sidsharma72 commented Feb 9, 2022

@HenryDashwood @Ki6an when replicating your Colab notebook, I am facing the following error:

TypeError: forward() got an unexpected keyword argument 'cross_attn_head_mask'

This seems like an issue similar to the following one: #18

Following are the library versions I am using:

torch - 1.10.0
transformers - 4.11.3
onnx - 1.10.2
onnxruntime - 1.7.0

@Ki6an
Owner

Ki6an commented Feb 11, 2022

@sidsharma72 we had the same issue with T5 and were able to resolve it in this commit.

Make sure to add the cross_attn_head_mask=None argument to the forward method of OnnxMBart in the notebook, and try again.
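A skeleton of what that fix looks like (the wrapper below is an assumption about the notebook's OnnxMBart, shown only to illustrate where the argument goes):

    import torch

    class OnnxMBart(torch.nn.Module):
        # Hypothetical wrapper mirroring the notebook's decoder wrapper.
        def __init__(self, decoder, lm_head):
            super().__init__()
            self.decoder = decoder
            self.lm_head = lm_head

        def forward(self, input_ids, attention_mask, encoder_hidden_states,
                    cross_attn_head_mask=None):  # accept (and ignore) the new argument
            out = self.decoder(
                input_ids=input_ids,
                attention_mask=attention_mask,
                encoder_hidden_states=encoder_hidden_states,
            )
            return self.lm_head(out.last_hidden_state)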

@sidsharma72

Thanks for the response - I have been able to fix the issue.

@mohanvamsibu-kore

mohanvamsibu-kore commented Feb 15, 2022

Hey @HenryDashwood @tobigue, I was able to use the Colab for the model lidiya/bart-large-xsum-samsum. ONNX was considerably better on CPU, i.e. 4 s whereas PyTorch was 11 s, but on GPU ONNX was taking ~5 s whereas PyTorch was ~500 ms.
I have moved all the tensors to the GPU and converted all the models to GPU. Can you please let me know if this is expected?

@tobigue

tobigue commented Feb 15, 2022

Hey @mohanvamsibu-kore, I don't have experience with ONNX on GPU, but from what I understand the most efficient way to run things on GPU is by using TensorRT. They have an example for T5, which is pretty similar to BART, here: https://github.com/NVIDIA/TensorRT/tree/main/demo/HuggingFace

@will-wiki

will-wiki commented Feb 16, 2022

@tobigue @HenryDashwood @Ki6an Thank you very much for your work, it's very helpful! I can now convert the BART model to an ONNX model, and the outputs of the two are consistent. But I would like to ask whether you have tried to deploy the ONNX model to TensorRT. So far I have been able to run on TensorRT, but the TensorRT results are not consistent with the ONNX model :(
