
[Feat][Plugin] Enable Sparse MLA and GLM-5 for vLLM-ATOM #399

Open
kliuae-amd wants to merge 28 commits into main from plugin_sparse_mla

Conversation

@kliuae-amd
Contributor

Motivation

Following #126, this PR enables sparse MLA in ATOM's vLLM plugin mode, adding support for GLM-5 models, which use index-based top-k sparse attention.

Technical Details

  • Add Indexer and Sparse MLA backends for vLLM OOT plugin
  • Register GLM-5 as supported models

Test Plan

Accuracy test with lm_eval

Server command:

ATOM_DISABLE_VLLM_PLUGIN=0 \
ATOM_DISABLE_VLLM_PLUGIN_ATTENTION=0 \
VLLM_LOGGING_LEVEL=DEBUG \
VLLM_USE_V1=1 \
VLLM_ROCM_USE_AITER=1 \
vllm serve /path/to/GLM-5-FP8/ \
  -tp 8 \
  --max-num-seqs 1024 \
  --gpu-memory-utilization 0.9 \
  --no-enable-prefix-caching \
  --disable-uvicorn-access-log \
  --trust-remote-code \
  --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
  --kv-cache-dtype {auto,fp8} \
  --block-size 1
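Once the server is up, a quick smoke test against the OpenAI-compatible completions endpoint can look like the following (assuming the default port 8000; the model path must match the one passed to `vllm serve`):

```shell
# Hypothetical sanity check; adjust host, port, and model path to your deployment.
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/path/to/GLM-5-FP8/",
        "prompt": "The capital of France is",
        "max_tokens": 8,
        "temperature": 0
      }'
```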

Test Result

lm_eval command

lm_eval --model local-completions \
  --model_args model=/path/to/GLM-5-FP8/,base_url=http://localhost:8000/v1/completions \
  --batch_size 100 \
  --tasks gsm8k \
  --num_fewshot 20

Model: zai-org/GLM-5-FP8

ATOM

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|±  |0.9538|±  |0.0058|
|     |       |strict-match    |     5|exact_match|±  |0.9515|±  |0.0059|

vLLM Plugin (bf16 kv cache)

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|    20|exact_match|±  |0.953|±  |0.0058|
|     |       |strict-match    |    20|exact_match|±  |0.953|±  |0.0058|

vLLM Plugin (fp8 kv cache)

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|    20|exact_match|±  |0.9439|±  |0.0063|
|     |       |strict-match    |    20|exact_match|±  |0.9439|±  |0.0063|

Performance on MI300X, TP8

| ISL/OSL | Concurrency | KV Cache | vLLM Plugin Req/s | ATOM Req/s | vLLM Plugin over ATOM (Req/s) | vLLM Plugin Total tok/s | ATOM Total tok/s | vLLM Plugin over ATOM (tok/s) |
|---------|------------:|----------|------------------:|-----------:|------------------------------:|------------------------:|-----------------:|------------------------------:|
| 1k/1k   |         128 | bf16     |              2.06 |       2.02 |                        +1.98% |                 4224.31 |          4137.09 |                        +2.11% |
| 1k/1k   |          64 | bf16     |              1.40 |       1.36 |                        +2.94% |                 2874.85 |          2784.97 |                        +3.23% |
| 1k/1k   |         128 | fp8      |              2.14 |       2.23 |                        -4.04% |                 4383.43 |          4568.58 |                        -4.05% |
| 1k/1k   |          64 | fp8      |              1.44 |       1.43 |                        +0.70% |                 2938.97 |          2935.14 |                        +0.13% |

Submission Checklist

kliuae added 8 commits March 20, 2026 07:01
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
@wuhuikx
Contributor

wuhuikx commented Mar 25, 2026

@kliuae could you please help fix the CI issue?

@wuhuikx wuhuikx requested a review from ganyi1996ppo March 25, 2026 05:48
# The kernel operates on non-padded inputs. Hence, pre-compiling
# triton kernel to avoid runtime compilation for unseen batch sizes
# Pre-compile for batch sizes 1 to 1024 to cover most use-cases.
# On DS-R1, this step adds roughly 50s to the model loading time.
Contributor

I think it's a good idea here. How about the other Triton kernels?

cc @valarLip @ZhangLirong-amd @ganyi1996ppo @zejunchen-zejun Can you help comment on this feature?

Contributor

vLLM mainline precompiles the BMM kernel with different M values before executing the model; I think we can leverage Kuanfu's code here.

Collaborator

I don't like this part, and I don't think we need these.

kliuae added 2 commits March 25, 2026 09:05
@zejunchen-zejun
Contributor

Hi, @kliuae-amd
Wonderful work! Could we have a recipe for GLM-5 OOT?

@wuhuikx wuhuikx requested a review from valarLip March 26, 2026 08:39
@wuhuikx
Contributor

wuhuikx commented Mar 26, 2026

There are lots of changes in this PR. It's the first PR to support Sparse MLA in the OOT backend, so we really need your help reviewing the code to keep the ATOM architecture elegant. This PR is worth a review. @valarLip @ZhangLirong-amd @ganyi1996ppo

total=max_batch_size,
)

for m in pre_compilation_list:
Contributor

1024 iterations call torch.empty() twice each here; is there any better method? Can we use a buffer and slice from it?

Contributor Author

Slicing from a buffer would produce compiled caches different from what non-sliced inputs expect at runtime: this kernel receives tensor strides as int scalars, and Triton can try to specialize on them. This step runs only at initialization, though, so it doesn't add runtime overhead.
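As a toy illustration of the stride-specialization point (names and shapes here are invented for this sketch, not taken from the PR): if the JIT cache key includes concrete stride values, a view with the same shape but different strides misses the cache and triggers a fresh compile.

```python
# Toy model of a JIT cache that specializes on integer stride values, the
# way Triton can when strides are passed as scalar ints. Illustrative only.

def contiguous_strides(shape):
    """Row-major strides (in elements) for a freshly allocated tensor."""
    strides, acc = [], 1
    for dim in reversed(shape):
        strides.append(acc)
        acc *= dim
    return tuple(reversed(strides))

compile_cache = {}

def launch(shape, strides):
    key = (tuple(shape), tuple(strides))       # specialization key
    if key not in compile_cache:
        compile_cache[key] = f"compiled{key}"  # stands in for a real JIT compile
    return compile_cache[key]

# Pre-compile with freshly allocated inputs for batch sizes 1..4:
for m in range(1, 5):
    launch((m, 128), contiguous_strides((m, 128)))

# At runtime, a fresh (2, 128) tensor hits the cache...
assert launch((2, 128), (128, 1)) == compile_cache[((2, 128), (128, 1))]

# ...but a (2, 128) view sliced out of a padded (1024, 256) buffer carries the
# parent's row stride and would trigger a recompile:
assert ((2, 128), (256, 1)) not in compile_cache
```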

kliuae and others added 7 commits March 26, 2026 17:26
@wuhuikx
Contributor

wuhuikx commented Apr 3, 2026

@valarLip @ZhangLirong-amd @zejunchen-zejun @ganyi1996ppo @whx-sjtu Have you had a chance to take another look at this PR? Are any further changes needed?

kliuae added 3 commits April 6, 2026 15:02


@AiterBackendDecoratorForPluginMode
class AiterMLASparseBackend(AttentionBackend):
Collaborator

Maybe move this stuff to atom/plugin/attention_mla.py.

Contributor Author

Done, moved these modules to the attention backends for the vLLM plugin.
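For context, vLLM's out-of-tree plugin mechanism discovers a backend through a Python entry point in the `vllm.platform_plugins` group. A minimal sketch follows; the package and class names are hypothetical, and the exact hook signatures vary across vLLM versions, so treat this as an outline rather than the PR's actual wiring.

```python
# Hypothetical plugin module, e.g. my_plugin/__init__.py. vLLM discovers it
# through the "vllm.platform_plugins" entry-point group declared in the
# package metadata; register() returns the dotted path of the Platform class.

def register() -> str:
    # Returning a class path string (rather than importing the class here)
    # keeps heavy dependencies from loading at plugin-discovery time.
    return "my_plugin.platform.MyRocmPlatform"

# In my_plugin/platform.py, the Platform subclass would then point vLLM at
# the out-of-tree attention backend (e.g. a sparse MLA backend) from its
# get_attn_backend_cls hook, again as a dotted class path string.
```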

kliuae added 4 commits April 8, 2026 07:07
@kliuae-amd kliuae-amd requested a review from valarLip April 8, 2026 08:20
--async-scheduling \
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
--no-enable-prefix-caching \
--block-size 1
Contributor

@ganyi1996ppo could you please help comment: do we need block-size 1 here?

Contributor

OOT seems to only support block-size 1?

Contributor

Yes, currently only block_size 1 is supported for sparse MLA OOT.
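To illustrate the constraint (a toy sketch, not the PR's kernel): with `block_size` 1 every KV-cache block holds exactly one token, so each top-k token index from the indexer maps directly to a block id and the sparse gather becomes a plain lookup. With larger blocks, each selected token would first have to be located inside a multi-token block.

```python
# Toy KV-cache gather under block_size == 1. Purely illustrative; the cache
# layout and index values here are made up for the sketch.
block_size = 1
num_tokens = 16

# One "block" per token; block id == token position when block_size == 1.
kv_cache = {i: f"kv[{i}]" for i in range(num_tokens)}

# Token indices chosen by the top-k indexer (made-up values):
topk_indices = [3, 7, 11]

# block id = token // block_size; the in-block offset (token % block_size)
# is always 0 when block_size == 1.
gathered = [kv_cache[tok // block_size] for tok in topk_indices]
assert gathered == ["kv[3]", "kv[7]", "kv[11]"]
```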

wuhuikx and others added 2 commits April 9, 2026 13:10
@wuhuikx wuhuikx changed the title [Feat][Plugin] Enable Sparse MLA and GLM-5 for vLLM OOT Plugin [Feat][Plugin] Enable Sparse MLA and GLM-5 for vLLM-ATOM Apr 11, 2026