[Inference]add blha_get_max_len op & modify block_multihead_attention op #64246
Conversation
Your PR has been submitted successfully. Thank you for your contribution to the open-source project!
>>> batch_size = paddle.ones(shape=[bsz])
>>> max_enc_len_this_time, max_dec_len_this_time = paddle.incubate.nn.functional.blha_get_max_len(seq_lens_encoder, seq_lens_decoder, batch_size)
"""
if in_dynamic_mode():
This should be in_dynamic_or_pir_mode.
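A minimal sketch of the requested change, assuming the Python wrapper dispatches through Paddle's in_dynamic_or_pir_mode helper and the new _C_ops.blha_get_max_len binding added by this PR; the static-graph branch is omitted here:

```python
from paddle import _C_ops
from paddle.framework import in_dynamic_or_pir_mode


def blha_get_max_len(seq_lens_encoder, seq_lens_decoder, batch_size):
    # Dispatch to the C++ op in both dynamic and PIR modes,
    # instead of only checking in_dynamic_mode().
    if in_dynamic_or_pir_mode():
        return _C_ops.blha_get_max_len(
            seq_lens_encoder, seq_lens_decoder, batch_size
        )
    # Static-graph (legacy IR) branch is omitted in this sketch.
    raise NotImplementedError
```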
Fixed.
Thanks, received!
Hi, running the sample code. Btw: changes to the API no longer require my approval since #64212, but please feel free to assign me if you think my review helps.
Thanks for the review. PR-CI-Static-Check raised a NotImplementedError because there is no FlashAttention implementation on that platform; it is not caused by this PR.
I think we have two options:
PR Category
Inference
PR Types
New features
Description
Pcard-71500
New operator blha_get_max_len: its inputs are seq_lens_encoder, seq_lens_decoder, and bsz; its outputs are max_enc_len_this_time and max_dec_len_this_time.
Usage example:
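A runnable sketch based on the docstring snippet quoted above; the tensor shapes and dtypes are illustrative assumptions, and the op is assumed to require a CUDA build of Paddle:

```python
import paddle

bsz = 10
# Per-sequence lengths for the encoder (prefill) and decoder (generation) stages.
seq_lens_encoder = paddle.randint(low=0, high=1024, shape=[bsz], dtype="int32")
seq_lens_decoder = paddle.randint(low=0, high=1024, shape=[bsz], dtype="int32")
batch_size = paddle.ones(shape=[bsz], dtype="int32")

# Returns the maximum encoder / decoder sequence length in the current batch.
max_enc_len_this_time, max_dec_len_this_time = (
    paddle.incubate.nn.functional.blha_get_max_len(
        seq_lens_encoder, seq_lens_decoder, batch_size
    )
)
```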
Modified block_multihead_attention to add two optional parameters, max_enc_len_this_time and max_dec_len_this_time; when they are passed in, they are no longer computed inside the kernel.
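A caller-side sketch of how the two new optional arguments might be prepared, assuming the caller precomputes them with blha_get_max_len and forwards them so the kernel can skip its internal computation; the helper name resolve_max_lens and the fallback path are illustrative only, not part of this PR:

```python
from paddle.incubate.nn.functional import blha_get_max_len


def resolve_max_lens(
    seq_lens_encoder,
    seq_lens_decoder,
    batch_size,
    max_enc_len_this_time=None,
    max_dec_len_this_time=None,
):
    # If the caller did not supply the per-batch max lengths, compute them
    # once with the new op; otherwise reuse the provided values so the
    # attention kernel does not recompute them.
    if max_enc_len_this_time is None or max_dec_len_this_time is None:
        max_enc_len_this_time, max_dec_len_this_time = blha_get_max_len(
            seq_lens_encoder, seq_lens_decoder, batch_size
        )
    return max_enc_len_this_time, max_dec_len_this_time
```

The resulting tensors would then be passed to block_multihead_attention through the two new keyword arguments alongside its existing inputs.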