[Feat][Plugin] Enable Sparse MLA and GLM-5 for vLLM-ATOM #399
kliuae-amd wants to merge 28 commits into main
Conversation
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
@kliuae could you please help fix the CI issue?
atom/model_ops/attention_mla.py (outdated)
# The kernel operates on non-padded inputs. Hence, pre-compiling
# triton kernel to avoid runtime compilation for unseen batch sizes
# Pre-compile for batch sizes 1 to 1024 to cover most use-cases.
# On DS-R1, this step adds roughly 50s to the model loading time.
I think it's a good idea here. How about the other Triton kernels?
cc @valarLip @ZhangLirong-amd @ganyi1996ppo @zejunchen-zejun Can you help comment on this feature?
vLLM mainline precompiles the BMM kernel with different M values before executing the model, so I think we can leverage Kuanfu's code here.
I don't like this part, and I don't think we need these.
Hi, @kliuae-amd
There are lots of changes in this PR. It's the first PR to support sparse MLA in the OOT backend, so we really need your help reviewing the code to keep the ATOM architecture elegant. This PR is worth a review. @valarLip @ZhangLirong-amd @ganyi1996ppo
atom/model_ops/attention_mla.py (outdated)
total=max_batch_size,
)
for m in pre_compilation_list:
The 1024 iterations call torch.empty() twice each; is there a better method here? Can we slice from a buffer?
Slicing from a buffer would produce compiled caches different from what non-sliced inputs expect at runtime, since for this kernel tensor strides are passed to the kernel as int scalars, and Triton may try to specialize on them. This step runs only at initialization, though, so it does not add any runtime load.
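To illustrate the warm-up pattern under discussion, here is a minimal stand-alone sketch. The compile step, batch-size range, and stride value are all hypothetical stand-ins (not the actual ATOM/Triton code); the point is that a JIT that specializes on the batch size and on integer stride arguments caches one kernel per distinct argument tuple, so iterating over all expected batch sizes at load time avoids runtime compilation.

```python
from functools import lru_cache

# Hypothetical stand-in for a JIT compile step: like Triton, it
# specializes on the batch size m and on integer stride arguments,
# so each distinct (m, stride) pair yields a separate cached kernel.
@lru_cache(maxsize=None)
def compile_kernel(m: int, stride: int) -> str:
    return f"kernel[m={m}, stride={stride}]"

MAX_BATCH_SIZE = 1024  # matches the 1..1024 range discussed above
ROW_STRIDE = 512       # hypothetical row stride of a contiguous input

# Warm-up loop: pre-compile once per batch size so that no unseen
# batch size triggers compilation at runtime.
for m in range(1, MAX_BATCH_SIZE + 1):
    compile_kernel(m, ROW_STRIDE)

# All 1024 variants are now cached; later calls are cache hits.
assert compile_kernel.cache_info().currsize == MAX_BATCH_SIZE
```

This also shows why slicing from a shared buffer can defeat the warm-up: if the sliced tensor's strides differ from those of a freshly allocated contiguous tensor, the runtime call hits a different specialization key than the one pre-compiled.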
@valarLip @ZhangLirong-amd @zejunchen-zejun @ganyi1996ppo @whx-sjtu Have you had a chance to take another look at this PR? Are any further changes needed?
@AiterBackendDecoratorForPluginMode
class AiterMLASparseBackend(AttentionBackend):
Maybe move this to atom/plugin/attention_mla.py.
Done; moved these modules to the attention backends for the vLLM plugin.
--async-scheduling \
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
--no-enable-prefix-caching \
--block-size 1
@ganyi1996ppo could you please help comment: do we need block-size 1 here?
OOT seems to only support block-size 1?
Yes, currently only block_size 1 is supported for sparse MLA OOT.
Motivation
Following #126, this PR enables sparse MLA in ATOM's vLLM plugin mode, adding support for GLM-5 models that use index-based top-k sparse attention.
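For readers unfamiliar with the mechanism, here is a toy sketch of index-based top-k selection: for each query, only the k key positions with the highest index scores are kept, and attention is computed over that subset. All names are illustrative; this is not the GLM-5 or ATOM implementation.

```python
import math

def topk_indices(scores, k):
    """Positions of the k largest index scores, best first."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

def sparse_attention_weights(scores, k):
    """Softmax restricted to the selected key positions.

    Keys outside the top-k set receive zero weight, which is what
    makes the attention sparse.
    """
    idx = topk_indices(scores, k)
    exps = {i: math.exp(scores[i]) for i in idx}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

# Example: keep 2 of 4 key positions.
w = sparse_attention_weights([0.1, 2.0, -1.0, 0.5], k=2)
assert set(w) == {1, 3}  # only the two highest-scoring keys attend
```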
Technical Details
Test Plan
Accuracy test with lm_eval
Server command:
Test Result
lm_eval command
Model: zai-org/GLM-5-FP8
ATOM
vLLM Plugin (bf16 kv cache)
vLLM Plugin (fp8 kv cache)
Performance on MI300X, TP8
Submission Checklist