enabling context_fmha_type in enc_dec example #444
Comments
Hi @robosina , I think you're referring to "fused MHA during the context phase". Right, it's deliberately not enabled for the enc_dec model for the following reason: encoder-decoder models (especially T5 and Flan-T5; BART doesn't have it) have relative attention bias, and moreover we have implemented implicit relative attention bias in the decoder attention during the generation phase. Such a feature is not compatible with contextFMHA at this point, which is why we take the unfused MHA path for the encoder's attention and for the decoder's attention in the context phase. For the decoder's attention in the generation phase we do use the fused Masked MHA, though. My suggestions and questions would be:
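For readers, here is a minimal, hypothetical PyTorch sketch (not TRT-LLM code) of where the relative attention bias enters the unfused path. The bias is added to the QK^T scores before the softmax, which is exactly the step a fused context-MHA kernel without bias support cannot reproduce; shapes and names below are assumptions for illustration only.

```python
import torch

def unfused_attention_with_rel_bias(q, k, v, rel_bias):
    """Plain (unfused) attention with a T5-style additive relative attention bias.

    q, k, v:  [batch, heads, seq_q, head_dim] / [batch, heads, seq_k, head_dim]
    rel_bias: [1, heads, seq_q, seq_k], looked up from relative-position buckets.
    """
    scores = torch.matmul(q, k.transpose(-1, -2))  # [batch, heads, seq_q, seq_k]
    scores = scores + rel_bias                     # bias injected before softmax
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v)

# Tiny usage example with assumed dimensions.
b, h, sq, sk, d = 1, 8, 16, 16, 64
out = unfused_attention_with_rel_bias(
    torch.randn(b, h, sq, d), torch.randn(b, h, sk, d),
    torch.randn(b, h, sk, d), torch.randn(1, h, sq, sk))
```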
@symphonylyh, thank you for responding.
Currently, I am developing a custom model to explore whether TRT-LLM can improve its speed, and this model does not use any relative attention bias.
Indeed, it belongs to the BART family. I encountered the errors I previously mentioned upon enabling it.
Yes, the encoder input length is quite large. It is similar to receiving input from an image (input length is 5000). I am investigating this issue and will keep you informed of any noteworthy developments here. Also, at the moment I am not seeing any performance improvement over the PyTorch implementation; if you have any insights or advice, I would greatly appreciate your input. My base architecture and runtime originated from the enc_dec example, and I've spent some time redeveloping my model to align with it. The output tokens so far are identical to the PyTorch version, but there is no improvement in speed. I should also note that my maximum output token count is 160. Finally, since my encoder input length is quite large, this would likely affect the performance of the cross-attention mechanism, correct? As a result, I should expect to see reduced performance in both the encoder and the decoder.
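As a rough, hypothetical back-of-envelope check of that last point (numbers taken from the description above: encoder length 5000, up to 160 decoder steps; head count and head size are assumed), the cross-attention score matrix grows linearly with the encoder length:

```python
# Rough cost sketch, only to illustrate how cross-attention work scales with
# the encoder sequence length; num_heads and head_dim are assumed values.
encoder_len = 5000
max_new_tokens = 160
num_heads = 16   # assumed
head_dim = 64    # assumed

# Per decoder step, cross-attention computes a [heads, 1, encoder_len] score
# matrix against the cached encoder keys/values.
per_step_scores = num_heads * 1 * encoder_len
total_scores = per_step_scores * max_new_tokens
print(f"cross-attention score entries over the whole generation: {total_scores:,}")
# ~12.8M score entries, each backed by a head_dim-long dot product plus a softmax,
# so a 5000-token encoder output dominates the decoder-side cross-attention cost.
```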
@robosina, regarding perf, I have two comments:
@symphonylyh, before proceeding to test this, allow me to review the dev branch you referenced. In one of my previous issues I ran into a problem with the main branch, which is why I haven't tested it yet. However, I appreciate your suggestion and will check out that branch for testing. Thank you for your comprehensive response. I will examine this further and provide you with an update soon.
@robosina sounds good! My recommendation would be: don't stay away from the dev branch.
Let me close that issue. Feel free to reopen if needed.
I noticed that fused context attention (context_fmha_type) has not been enabled in the encoder-decoder (enc_dec) example. To enable it in this example, I used the following line of code:
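The original snippet is not preserved in this thread; as a hedged sketch, enabling context FMHA in a TensorRT-LLM build script of that era typically looks like the following, assuming the standard `ContextFMHAType` plugin-config API and a `network` object as configured in the enc_dec `build.py`:

```python
# Hypothetical sketch: switching on fused context MHA in a TensorRT-LLM build
# script. `network` is assumed to be the tensorrt_llm network being configured
# in examples/enc_dec/build.py.
from tensorrt_llm.plugin.plugin import ContextFMHAType

network.plugin_config.set_context_fmha(ContextFMHAType.enabled)
```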
Following the modification, and after successfully building the engines, I encountered several errors at runtime. Could you please clarify whether context FMHA is intentionally not enabled in the encoder-decoder (enc_dec) example, or am I facing a bug? Or do I need to do something else here?