enabling context_fmha_type in enc_dec example #444

Closed

robosina opened this issue Nov 20, 2023 · 6 comments
Labels
triaged Issue has been triaged by maintainers

@robosina

I noticed that context FMHA has not been enabled in the encoder-decoder (enc_dec) example. To enable it, I added the following line of code:

network.plugin_config.set_context_fmha(ContextFMHAType.enabled)
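
For context, here is where that line sits (a minimal sketch; the import path below is the one used by the other examples, and the surrounding network setup follows the enc_dec example's build script):

from tensorrt_llm.plugin.plugin import ContextFMHAType

# Added next to the other plugin_config settings while building the engines:
network.plugin_config.set_context_fmha(ContextFMHAType.enabled)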

After this modification the engines built successfully, but I encountered several errors at runtime. Could you please clarify whether context FMHA is intentionally not supported in the encoder-decoder (enc_dec) example, whether I am hitting a bug, or whether I need to do something else here?

terminate called after throwing an instance of 'tensorrt_llm::common::TllmException'
  what():  [TensorRT-LLM][ERROR] CUDA runtime error in cublasGemmStridedBatchedEx(getCublasHandle(), transa, transb, m, n, k, alpha, A, AType, lda, strideA, B, BType, ldb, strideB, beta, C, CType, ldc, strideC, batchCount, computeType, mAType == CUDA_R_32F ? CUBLAS_GEMM_DEFAULT : CUBLAS_GEMM_DEFAULT_TENSOR_OP): CUBLAS_STATUS_INVALID_VALUE (/src/tensorrt_llm/cpp/tensorrt_llm/common/cublasMMWrapper.cpp:189)
1       0x7fc8b30e7cde /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x8ecde) [0x7fc8b30e7cde]
2       0x7fc8b3128a0f /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xcfa0f) [0x7fc8b3128a0f]
3       0x7fc8b30f28b3 /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x998b3) [0x7fc8b30f28b3]
4       0x7fc8b31002ea /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xa72ea) [0x7fc8b31002ea]
5       0x7fc8b3101ebb /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xa8ebb) [0x7fc8b3101ebb]
6       0x7fc8b30fcb9d tensorrt_llm::plugins::GPTAttentionPlugin::enqueue(nvinfer1::PluginTensorDesc const*, nvinfer1::PluginTensorDesc const*, void const* const*, void* const*, void*, CUstream_st*) + 189
7       0x7fca01b39fc9 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10aefc9) [0x7fca01b39fc9]
8       0x7fca01afce04 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1071e04) [0x7fca01afce04]
9       0x7fca01afe9a0 /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10739a0) [0x7fca01afe9a0]
10      0x7fca1049dc60 /usr/local/lib/python3.10/dist-packages/tensorrt/tensorrt.so(+0x9dc60) [0x7fca1049dc60]
11      0x7fca10442ea3 /usr/local/lib/python3.10/dist-packages/tensorrt/tensorrt.so(+0x42ea3) [0x7fca10442ea3]
12      0x565423d04e0e python3(+0x15fe0e) [0x565423d04e0e]
13      0x565423cfb5eb _PyObject_MakeTpCall + 603
14      0x565423d137bb python3(+0x16e7bb) [0x565423d137bb]
15      0x565423cf38a2 _PyEval_EvalFrameDefault + 24914
16      0x565423d0570c _PyFunction_Vectorcall + 124
17      0x565423cedf52 _PyEval_EvalFrameDefault + 2050
18      0x565423d0570c _PyFunction_Vectorcall + 124
19      0x565423cedf52 _PyEval_EvalFrameDefault + 2050
20      0x565423d0570c _PyFunction_Vectorcall + 124
21      0x565423cedf52 _PyEval_EvalFrameDefault + 2050
22      0x565423d0570c _PyFunction_Vectorcall + 124
23      0x565423d14192 PyObject_Call + 290
24      0x565423cf02c1 _PyEval_EvalFrameDefault + 11121
25      0x565423d134e1 python3(+0x16e4e1) [0x565423d134e1]
26      0x565423cef0d1 _PyEval_EvalFrameDefault + 6529
27      0x565423d134e1 python3(+0x16e4e1) [0x565423d134e1]
28      0x565423cef0d1 _PyEval_EvalFrameDefault + 6529
29      0x565423ddee56 python3(+0x239e56) [0x565423ddee56]
30      0x565423ddecf6 PyEval_EvalCode + 134
31      0x565423e097d8 python3(+0x2647d8) [0x565423e097d8]
32      0x565423e030bb python3(+0x25e0bb) [0x565423e030bb]
33      0x565423e09525 python3(+0x264525) [0x565423e09525]
34      0x565423e08a08 _PyRun_SimpleFileObject + 424
35      0x565423e08653 _PyRun_AnyFileObject + 67
36      0x565423dfb41e Py_RunMain + 702
37      0x565423dd1cad Py_BytesMain + 45
38      0x7fca10efcd90 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fca10efcd90]
39      0x7fca10efce40 __libc_start_main + 128
40      0x565423dd1ba5 _start + 37
[367ca7c49632:221381] *** Process received signal ***
[367ca7c49632:221381] Signal: Aborted (6)
[367ca7c49632:221381] Signal code:  (-6)
[367ca7c49632:221381] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x42520)[0x7fca10f15520]
[367ca7c49632:221381] [ 1] /usr/lib/x86_64-linux-gnu/libc.so.6(pthread_kill+0x12c)[0x7fca10f69a7c]
[367ca7c49632:221381] [ 2] /usr/lib/x86_64-linux-gnu/libc.so.6(raise+0x16)[0x7fca10f15476]
[367ca7c49632:221381] [ 3] /usr/lib/x86_64-linux-gnu/libc.so.6(abort+0xd3)[0x7fca10efb7f3]
[367ca7c49632:221381] [ 4] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa2b9e)[0x7fc9fddecb9e]
[367ca7c49632:221381] [ 5] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7fc9fddf820c]
[367ca7c49632:221381] [ 6] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xad1e9)[0x7fc9fddf71e9]
[367ca7c49632:221381] [ 7] /usr/lib/x86_64-linux-gnu/libstdc++.so.6(__gxx_personality_v0+0x99)[0x7fc9fddf7959]
[367ca7c49632:221381] [ 8] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(+0x16884)[0x7fca109a4884]
[367ca7c49632:221381] [ 9] /usr/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Resume+0x12d)[0x7fca109a52dd]
[367ca7c49632:221381] [10] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x3b184)[0x7fc8b3094184]
[367ca7c49632:221381] [11] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0x998b3)[0x7fc8b30f28b3]
[367ca7c49632:221381] [12] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xa72ea)[0x7fc8b31002ea]
[367ca7c49632:221381] [13] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(+0xa8ebb)[0x7fc8b3101ebb]
[367ca7c49632:221381] [14] /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/libnvinfer_plugin_tensorrt_llm.so(_ZN12tensorrt_llm7plugins18GPTAttentionPlugin7enqueueEPKN8nvinfer116PluginTensorDescES5_PKPKvPKPvSA_P11CUstream_st+0xbd)[0x7fc8b30fcb9d]
[367ca7c49632:221381] [15] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10aefc9)[0x7fca01b39fc9]
[367ca7c49632:221381] [16] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x1071e04)[0x7fca01afce04]
[367ca7c49632:221381] [17] /usr/local/tensorrt/lib/libnvinfer.so.9(+0x10739a0)[0x7fca01afe9a0]
[367ca7c49632:221381] [18] /usr/local/lib/python3.10/dist-packages/tensorrt/tensorrt.so(+0x9dc60)[0x7fca1049dc60]
[367ca7c49632:221381] [19] /usr/local/lib/python3.10/dist-packages/tensorrt/tensorrt.so(+0x42ea3)[0x7fca10442ea3]
[367ca7c49632:221381] [20] python3(+0x15fe0e)[0x565423d04e0e]
[367ca7c49632:221381] [21] python3(_PyObject_MakeTpCall+0x25b)[0x565423cfb5eb]
[367ca7c49632:221381] [22] python3(+0x16e7bb)[0x565423d137bb]
[367ca7c49632:221381] [23] python3(_PyEval_EvalFrameDefault+0x6152)[0x565423cf38a2]
[367ca7c49632:221381] [24] python3(_PyFunction_Vectorcall+0x7c)[0x565423d0570c]
[367ca7c49632:221381] [25] python3(_PyEval_EvalFrameDefault+0x802)[0x565423cedf52]
[367ca7c49632:221381] [26] python3(_PyFunction_Vectorcall+0x7c)[0x565423d0570c]
[367ca7c49632:221381] [27] python3(_PyEval_EvalFrameDefault+0x802)[0x565423cedf52]
[367ca7c49632:221381] [28] python3(_PyFunction_Vectorcall+0x7c)[0x565423d0570c]
[367ca7c49632:221381] [29] python3(_PyEval_EvalFrameDefault+0x802)[0x565423cedf52]
[367ca7c49632:221381] *** End of error message ***
@symphonylyh
Collaborator

Hi @robosina, I think you're referring to "fused MHA during the context phase". Right, it's deliberately not enabled for the enc_dec models, for the following reason: encoder-decoder models (especially T5 and Flan-T5; BART doesn't have it) have relative attention bias, and moreover we have implemented implicit relative attention bias in the decoder attention during the generation phase. That feature is not compatible with contextFMHA at this point, which is why we take the unfused MHA path for the encoder's attention and for the decoder's attention in the context phase. For the decoder's attention in the generation phase we do use the fused Masked MHA, though.

My suggestions and questions would be:

  1. Which model are you working on? If it's from the BART family, in principle this can be enabled with the contextFMHA flag (see the sketch after this list), and we will investigate if it doesn't work. If it's from the T5 family, we need to keep contextFMHA disabled for now.
  2. Is the motivation behind your experiment a very long encoder input length? If not, say your input length is short (<1024) and your output length is long (>1024), the performance impact would probably be less significant because the encoder is only run once.
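
To make the first point concrete, a guard along these lines in the build script would express the rule. Note that the has_relative_attention_bias flag below is only illustrative, not an existing config field; adapt it to however your model config exposes that property:

from tensorrt_llm.plugin.plugin import ContextFMHAType

# Illustrative guard: only fuse the context-phase MHA when the model has no
# relative attention bias (BART-like); T5/Flan-T5 must stay on the unfused path.
if not model_config.has_relative_attention_bias:  # hypothetical config field
    network.plugin_config.set_context_fmha(ContextFMHAType.enabled)
else:
    network.plugin_config.set_context_fmha(ContextFMHAType.disabled)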

@byshiue added the triaged label Nov 21, 2023
@robosina
Author

robosina commented Nov 21, 2023

@symphonylyh, thank you for responding.

encoder-decoder models (especially T5 and Flan-T5; BART doesn't have it) have relative attention bias

Currently, I am developing a custom model to explore whether trt-llm can improve its speed, and this model does not have any relative attention bias.

If it's from the BART family

Indeed, it belongs to the BART family. I encountered the errors I previously mentioned upon enabling it.

Is the motivation behind your experiment a very long encoder input length?

Yes, the encoder input length is quite large; it is similar to receiving input from an image (the input length is 5000).

I am researching this issue and will keep you informed of any noteworthy developments here.

Also, at the moment, I am not seeing any performance improvement over the PyTorch implementation. If you have any insights or advice, I would greatly appreciate them. My base architecture and execution flow originated from the enc_dec example; I've spent some time rebuilding my model to match it. The output tokens so far are identical to the PyTorch version, but there is no improvement in speed. I should also note that my maximum number of tokens is 160.

Additionally, since my encoder input length is quite large, this would likely affect the performance of the cross-attention mechanism as well, correct? As a result, I should expect to see reduced performance in both the encoder and the decoder.
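
For what it's worth, this is roughly how I am timing both implementations (a simple sketch; generate_fn stands in for either the TRT-LLM generation call or the PyTorch one, both prepared with the same inputs):

import time
import torch

def benchmark(generate_fn, n_warmup=3, n_runs=10):
    # Warm-up iterations so one-time allocations and autotuning don't skew the numbers.
    for _ in range(n_warmup):
        generate_fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        generate_fn()
    torch.cuda.synchronize()  # wait for all queued GPU work before stopping the clock
    return (time.perf_counter() - start) / n_runs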

@symphonylyh
Collaborator

@robosina,
For the BART family, we're enabling it and plan to add it to the enc_dec example soon. We'll investigate the contextFMHA path you pointed out and do our best to enable it for BART when that's released.

Regarding perf, I have two comments:

  1. Between release/0.5.0 and the latest main branch, we have added some perf fixes with this announcement (in case you don't know, the main branch is considered the dev branch, which we update more frequently, while release/x.x.x is roughly monthly) -- which version is your implementation based on? Can you make sure you sync all the changes from the latest main, especially tensorrt_llm/runtime/generation.py?
  2. You're right: when the input length is long, the contextFMHA vs. unfused perf gap becomes more significant. But that gap mainly affects the encoder part. For the decoder's cross attention, we already do the fusion within the Masked MHA. There is also a context phase in the decoder's self attention, which has the same contextFMHA vs. unfused difference. However, for most enc-dec models there is no such decoder context phase, because the decoder starts generation from a single start id instead of a sequence, unless the user explicitly uses decoder_forced_input_ids (which is like giving the decoder a prompt).

@robosina
Author

@symphonylyh, before proceeding to test this, let me review the dev branch you referenced. In one of my previous issues I ran into a problem with the main branch, which is why I haven't tested it yet, but I appreciate your suggestion and will check out that branch for testing. Thank you for your comprehensive response; I will look into this further and give you an update soon.

@symphonylyh
Collaborator

symphonylyh commented Nov 22, 2023

@robosina sounds good! My recommendation would be: don't stay away from the dev (main) branch just because of a previous blocking issue. The release branch is updated roughly monthly while the dev branch is updated roughly weekly, and oftentimes we fix bugs and release new features on the dev branch first (e.g., enc-dec has been improved and extended a lot on dev compared to release/0.5.0); those changes are then collated into the monthly release.

@jdemouth-nvidia
Collaborator

Let me close this issue. Feel free to reopen if needed.
