[Feat][Plugin] Enable Sparse MLA and GLM-5 for vLLM-ATOM #399
kliuae-amd wants to merge 28 commits into main
Conversation
Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com>
@kliuae could you please help fix the CI issue?
atom/model_ops/attention_mla.py (outdated)
# The kernel operates on non-padded inputs. Hence, pre-compiling
# triton kernel to avoid runtime compilation for unseen batch sizes
# Pre-compile for batch sizes 1 to 1024 to cover most use-cases.
# On DS-R1, this step adds roughly 50s to the model loading time.
I think it's a good idea here. How about the other Triton kernels?
cc @valarLip @ZhangLirong-amd @ganyi1996ppo @zejunchen-zejun Can you help comment on this feature?
vLLM mainline precompiles the BMM kernel with different M values before executing the model, so I think we can leverage Kuanfu's code here.
I don't like this part, and I don't think we need these.
Hi, @kliuae-amd
There are lots of changes in this PR. It's the first PR to support sparse MLA in the OOT backend, so we really need your help reviewing the code to keep the ATOM architecture elegant. This PR is worth a review. @valarLip @ZhangLirong-amd @ganyi1996ppo
atom/model_ops/attention_mla.py (outdated)
total=max_batch_size,
)
for m in pre_compilation_list:
The 1024 iterations call torch.empty() twice each; is there a better method here? Can we slice from a buffer?
Slicing from a buffer would produce compiled caches different from what non-sliced inputs expect at runtime, since for this kernel tensor strides are passed to the kernel as int scalars, and Triton may try to specialize on them. This step runs only at initialization, though, so it does not add any runtime load.
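To illustrate the warm-up pattern under discussion, here is a minimal stand-alone sketch. The compile step, batch-size range, and stride value are all hypothetical stand-ins (not the actual ATOM/Triton code); the point is that a JIT that specializes on the batch size and on integer stride arguments caches one kernel per distinct argument tuple, so iterating over all expected batch sizes at load time avoids runtime compilation.

```python
from functools import lru_cache

# Hypothetical stand-in for a JIT compile step: like Triton, it
# specializes on the batch size m and on integer stride arguments,
# so each distinct (m, stride) pair yields a separate cached kernel.
@lru_cache(maxsize=None)
def compile_kernel(m: int, stride: int) -> str:
    return f"kernel[m={m}, stride={stride}]"

MAX_BATCH_SIZE = 1024  # matches the 1..1024 range discussed above
ROW_STRIDE = 512       # hypothetical row stride of a contiguous input

# Warm-up loop: pre-compile once per batch size so that no unseen
# batch size triggers compilation at runtime.
for m in range(1, MAX_BATCH_SIZE + 1):
    compile_kernel(m, ROW_STRIDE)

# All 1024 variants are now cached; later calls are cache hits.
assert compile_kernel.cache_info().currsize == MAX_BATCH_SIZE
```

This also shows why slicing from a shared buffer can defeat the warm-up: if the sliced tensor's strides differ from those of a freshly allocated contiguous tensor, the runtime call hits a different specialization key than the one pre-compiled.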
@valarLip @ZhangLirong-amd @zejunchen-zejun @ganyi1996ppo @whx-sjtu Have you had a chance to take another look at this PR? Are any further changes needed?
@AiterBackendDecoratorForPluginMode
class AiterMLASparseBackend(AttentionBackend):
Maybe move this to atom/plugin/attention_mla.py.
Done; moved these modules to the attention backends for the vLLM plugin.
--async-scheduling \
--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}' \
--no-enable-prefix-caching \
--block-size 1
@ganyi1996ppo could you please help comment: do we need block-size 1 here?
OOT seems to only support block-size 1?
Yes, currently only block_size 1 is supported for sparse MLA OOT.
Motivation
Following #126, this PR enables sparse MLA in ATOM's vLLM plugin mode, adding support for GLM-5 models that use index-based top-k sparse attention.
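For readers unfamiliar with the mechanism, here is a toy sketch of index-based top-k selection: for each query, only the k key positions with the highest index scores are kept, and attention is computed over that subset. All names are illustrative; this is not the GLM-5 or ATOM implementation.

```python
import math

def topk_indices(scores, k):
    """Positions of the k largest index scores, best first."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

def sparse_attention_weights(scores, k):
    """Softmax restricted to the selected key positions.

    Keys outside the top-k set receive zero weight, which is what
    makes the attention sparse.
    """
    idx = topk_indices(scores, k)
    exps = {i: math.exp(scores[i]) for i in idx}
    z = sum(exps.values())
    return {i: e / z for i, e in exps.items()}

# Example: keep 2 of 4 key positions.
w = sparse_attention_weights([0.1, 2.0, -1.0, 0.5], k=2)
assert set(w) == {1, 3}  # only the two highest-scoring keys attend
```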
Technical Details
Test Plan
Accuracy test with lm_eval
Server command:
Test Result
lm_eval command
Model: zai-org/GLM-5-FP8
ATOM
vLLM Plugin (bf16 kv cache)
vLLM Plugin (fp8 kv cache)
Performance on MI300X, TP8
Submission Checklist