[Refactor][ATOM-vLLM][Attention] Refactor ATOM-vLLM Attention by zejunchen-zejun · Pull Request #750 · ROCm/ATOM

zejunchen-zejun · 2026-05-11T11:15:35Z

This PR refactors the attention architecture for ATOM-vLLM. Here is the RFC: #758

Accuracy:

mode	model case name	acc value	result / core error
atom-test	MiniMax-M2.7	0.897650	passed
atom-test	gpt-oss-120b (2 GPUs)	0.892343	passed
atom-test	DeepSeek-R1-0528-FP4	0.940864	passed
atom-test	DeepSeek-R1-0528-FP4 MTP	0.940864	passed
atom-test	GLM-5.1-MXFP4	0.886277	passed
atom-test	GLM-5.1-MXFP4 MTP	0.893101	passed
atom-test	Kimi-K2.5-MXFP4	0.937832	passed
atom-test	Qwen3.5-397B-A17B-FP8	0.858984	passed
atom-test	Qwen3.5-397B-A17B-FP8 MTP	0.854435	passed
atom-test	Qwen3.5-397B-A17B-MXFP4	0.827142	passed
atom-test	Qwen3.5-397B-A17B-MXFP4 MTP	0.815011	passed
atom-test	Llama-3.3-70B-Instruct-MXFP4-Preview	0.906748	passed
atom-test	MiniMax-M2.7-MXFP4	0.897650	passed
atom-test	DeepSeek-R1-0528	0.946171	passed
atom-test	DeepSeek-R1-0528 MTP	0.949962	passed
atom-test	DeepSeek-V4-Pro	0.949962	passed
atom-test	DeepSeek-V4-Pro MTP	0.956027	passed
atom-test	GLM-5-FP8	0.934041	passed
atom-test	GLM-5.1-FP8	0.893859	passed
atom-test	Kimi-K2.5-MXFP4 Eagle3	0.928734	passed
atom-test	Qwen3-235B-A22B-Instruct-2507-FP8	0.893101	passed
atom-test	Qwen3-235B-A22B-Instruct-2507-MXFP4	0.874147	passed
atom-test	Qwen3-Next-80B-A3B-Thinking	0.705080	passed
atom-test	Qwen3.5-397B-A17B	0.839272	passed
atom-vllm-test	Qwen3.5-35B-A3B-FP8 TP2	0.802123	passed
atom-vllm-test	Kimi-K2-Thinking-MXFP4 TP4	0.935557	passed
atom-vllm-test	DeepSeek-R1-FP8 TP8	0.946171	passed
atom-vllm-accuracy-validation	MiniMax-M2.5 TP2	0.929492	passed
atom-vllm-accuracy-validation	DeepSeek-V3.2-FP8 TP4	0.952995	passed
atom-vllm-accuracy-validation	GLM-4.7-FP8 MTP TP4	0.945413	passed
atom-vllm-accuracy-validation	Kimi-K2.5-MXFP4 TP4	0.935557	passed
atom-vllm-accuracy-validation	Qwen3-Next-80B-A3B-Instruct-FP8-MTP TP4	0.817285	passed
atom-vllm-accuracy-validation	Qwen3-Next-80B-A3B-Instruct-FP8 TP1	0.824109	passed
atom-vllm-accuracy-validation	gpt-oss-120b TP1	0.884761	passed
atom-vllm-accuracy-validation	GLM-4.7-FP8 TP8	0.943897	passed
atom-vllm-accuracy-validation	DeepSeek-R1-0528-MXFP4 TP8	0.940106	passed
atom-vllm-accuracy-validation	Qwen3.5-397B-A17B-FP8 TP4	0.828658	passed
atom-vllm-accuracy-validation	Qwen3.5-397B-A17B-MXFP4 TP4	0.850644	passed
atom-vllm-accuracy-validation	Llama-3.1-8B-Instruct TP1	0.763457	passed
atom-vllm-accuracy-validation	GLM-5.1-FP8 TP8	0.940864	passed
atom-vllm-accuracy-validation	Meta-Llama-3.1-405B-Instruct-FP8 TP8	FAILED	AITER GEMM A16W16 asm: B must be Bf16 or Fp16, got fp8 - same as main branch
atom-vllm-accuracy-validation	Qwen3-235B-A22B-Instruct-2507-FP8 TP8+EP8	0.896133	passed
atom-vllm-accuracy-validation	Qwen3.5-397B-A17B TP8	0.856710	passed
atom-vllm-accuracy-validation	GLM4.7 TP4 MTP	0.945413	passed
atom-vllm-accuracy-validation	DeepseekV3.2 TP4 MTP	0.951478	passed

Performance:

model	ISL	OSL	concurrency	target TTFT (ms)	ref TTFT (ms)	TTFT ratio	target TPOT (ms)	ref TPOT (ms)	TPOT ratio	target tok/s	ref tok/s	tok/s ratio
DeepSeek-V3.2 FP8 TP8 (AW)	1000	100	4	191.810	165.205	1.1610	13.549	14.188	0.9549	2866.011	2799.385	1.0238
DeepSeek-V3.2 FP8 TP8 (AW)	1000	100	16	548.827	586.060	0.9365	15.835	16.392	0.9660	8303.899	7960.113	1.0432
DeepSeek-V3.2 FP8 TP8 (AW)	1000	100	64	1287.020	1211.786	1.0621	31.677	33.459	0.9467	15887.511	15530.246	1.0230
DeepSeek-V3.2 FP8 TP8 (AW)	5000	500	4	716.283	716.028	1.0004	14.293	14.944	0.9564	2802.453	2691.363	1.0413
DeepSeek-V3.2 FP8 TP8 (AW)	5000	500	16	1794.509	1954.756	0.9180	19.378	20.061	0.9659	7672.828	7351.777	1.0437
DeepSeek-V3.2 FP8 TP8 (AW)	5000	500	64	2968.983	2988.041	0.9936	42.946	44.588	0.9632	14414.222	13933.617	1.0345
DeepSeek-V3.2 FP8 TP8 (AW)	10000	1000	4	1300.597	1328.347	0.9791	14.690	15.329	0.9583	2753.716	2643.605	1.0417
DeepSeek-V3.2 FP8 TP8 (AW)	10000	1000	16	3139.281	2718.458	1.1548	20.531	22.048	0.9312	7439.199	7102.136	1.0475
Kimi-K2.5-MXFP4 TP4 (MET)	1024	1024	4	118.011	189.830	0.6217	8.904	8.987	0.9908	864.054	849.252	1.0174
Kimi-K2.5-MXFP4 TP4 (MET)	1024	1024	16	147.477	163.491	0.9021	13.870	14.000	0.9907	2248.009	2225.695	1.0100
Kimi-K2.5-MXFP4 TP4 (MET)	1024	1024	64	260.491	232.016	1.1227	23.389	24.012	0.9741	5282.662	5149.047	1.0259
Kimi-K2.5-MXFP4 TP4 (MET)	8192	1024	4	326.367	268.828	1.2140	9.623	9.623	1.0000	3521.238	3544.580	0.9934
Kimi-K2.5-MXFP4 TP4 (MET)	8192	1024	16	467.643	523.202	0.8938	15.859	15.807	1.0033	8578.083	8577.464	1.0001
Kimi-K2.5-MXFP4 TP4 (MET)	8192	1024	64	1102.470	968.481	1.1384	32.052	32.078	0.9992	17045.570	17115.564	0.9959
MiniMax-M2.5 TP2 (AW)	1000	100	4	149.691	153.531	0.9750	11.318	11.686	0.9685	3460.323	3355.319	1.0313
MiniMax-M2.5 TP2 (AW)	1000	100	16	315.194	329.133	0.9577	16.801	16.217	1.0360	8784.884	9092.237	0.9662
MiniMax-M2.5 TP2 (AW)	1000	100	64	823.167	611.189	1.3468	30.642	31.338	0.9778	18203.173	18913.041	0.9625
MiniMax-M2.5 TP2 (AW)	5000	500	4	430.848	446.388	0.9652	12.459	13.157	0.9469	3308.629	3136.914	1.0547
MiniMax-M2.5 TP2 (AW)	5000	500	16	997.702	1030.166	0.9685	18.264	19.030	0.9598	8698.623	8356.078	1.0410
MiniMax-M2.5 TP2 (AW)	5000	500	64	1694.690	1578.686	1.0735	37.317	37.999	0.9820	17307.591	17116.183	1.0112
MiniMax-M2.5 TP2 (AW)	10000	1000	4	782.573	771.495	1.0144	13.966	14.705	0.9497	2985.661	2845.276	1.0493
MiniMax-M2.5 TP2 (AW)	10000	1000	16	1739.725	1454.144	1.1964	20.271	21.105	0.9605	8000.188	7797.082	1.0260
MiniMax-M2.5 TP4 (AW)	1000	100	4	121.288	122.789	0.9878	10.857	11.520	0.9424	3674.230	3480.444	1.0557
MiniMax-M2.5 TP4 (AW)	1000	100	16	241.797	224.300	1.0780	13.774	14.128	0.9749	10856.180	10825.633	1.0028
MiniMax-M2.5 TP4 (AW)	1000	100	64	534.281	525.831	1.0161	21.516	20.961	1.0265	26218.130	27020.177	0.9703
MiniMax-M2.5 TP4 (AW)	5000	500	4	296.570	280.703	1.0565	11.991	12.808	0.9362	3502.130	3296.826	1.0623
MiniMax-M2.5 TP4 (AW)	5000	500	16	625.342	628.822	0.9945	14.734	15.773	0.9341	11023.858	10348.110	1.0653
MiniMax-M2.5 TP4 (AW)	5000	500	64	1470.764	1047.958	1.4035	22.999	24.840	0.9259	27149.733	26142.966	1.0385
MiniMax-M2.5 TP4 (AW)	10000	1000	4	493.436	510.947	0.9657	13.500	14.517	0.9299	3146.815	2930.252	1.0739
MiniMax-M2.5 TP4 (AW)	10000	1000	16	1125.619	1052.983	1.0690	16.319	17.679	0.9231	10093.530	9386.688	1.0753
gpt-oss-120b TP1 (MET)	1024	1024	4	221.232	64.683	3.4202	4.400	4.366	1.0078	1683.058	1755.842	0.9585
gpt-oss-120b TP1 (MET)	1024	1024	16	84.198	113.865	0.7395	6.064	6.442	0.9414	5118.304	4797.600	1.0668
gpt-oss-120b TP1 (MET)	1024	1024	64	133.073	506.669	0.2626	9.413	10.319	0.9122	13048.273	11502.707	1.1344
gpt-oss-120b TP1 (MET)	8192	1024	4	155.372	124.010	1.2529	4.551	4.495	1.0124	7440.434	7580.865	0.9815
gpt-oss-120b TP1 (MET)	8192	1024	16	238.877	259.477	0.9206	6.585	6.640	0.9917	20423.008	20270.075	1.0075
gpt-oss-120b TP1 (MET)	8192	1024	64	479.096	523.035	0.9160	13.410	13.486	0.9943	40648.508	40289.726	1.0089
gpt-oss-120b TP2 (AW)	1000	100	4	160.746	189.158	0.8498	4.467	4.803	0.9302	7285.317	6616.429	1.1011
gpt-oss-120b TP2 (AW)	1000	100	16	217.329	498.099	0.4363	5.475	4.935	1.1094	23048.770	17819.736	1.2934
gpt-oss-120b TP2 (AW)	1000	100	64	987.988	1277.777	0.7732	14.902	16.151	0.9227	28459.595	24451.526	1.1639
gpt-oss-120b TP2 (AW)	5000	500	4	191.956	215.537	0.8906	3.661	3.836	0.9543	10892.630	10327.397	1.0547
gpt-oss-120b TP2 (AW)	5000	500	16	445.523	619.888	0.7187	4.976	5.240	0.9496	30003.312	27191.671	1.1034
gpt-oss-120b TP2 (AW)	5000	500	64	1061.295	820.410	1.2936	9.625	10.360	0.9290	59835.729	58685.730	1.0196
gpt-oss-120b TP2 (AW)	10000	1000	4	329.607	358.053	0.9206	3.745	3.993	0.9378	10805.485	10119.606	1.0678
gpt-oss-120b TP2 (AW)	10000	1000	16	815.211	779.199	1.0462	5.072	5.606	0.9048	29862.303	27573.939	1.0830
gpt-oss-120b TP8 (AW)	1000	100	4	200.069	222.561	0.8989	3.304	3.554	0.9298	8330.988	7655.160	1.0883
gpt-oss-120b TP8 (AW)	1000	100	16	133.407	469.634	0.2841	3.703	3.577	1.0351	34953.427	21320.936	1.6394
gpt-oss-120b TP8 (AW)	1000	100	64	220.301	420.512	0.5239	6.500	6.483	1.0027	79499.313	65494.306	1.2138
gpt-oss-120b TP8 (AW)	5000	500	4	137.229	284.586	0.4822	2.941	3.022	0.9732	13671.523	12271.706	1.1141
gpt-oss-120b TP8 (AW)	5000	500	16	237.306	272.157	0.8719	3.819	4.244	0.8997	40726.704	36493.533	1.1160
gpt-oss-120b TP8 (AW)	5000	500	64	681.872	708.788	0.9620	5.712	6.229	0.9170	98719.189	91813.936	1.0752
gpt-oss-120b TP8 (AW)	10000	1000	4	193.068	258.147	0.7479	3.058	3.364	0.9091	13522.024	12158.427	1.1122
gpt-oss-120b TP8 (AW)	10000	1000	16	445.195	395.815	1.1248	3.881	4.167	0.9314	40225.783	38443.004	1.0464
Qwen3-Next FP8 TP1 (AW)	1000	100	4	118.397	134.602	0.8796	5.221	5.926	0.8811	6915.721	6097.421	1.1342
Qwen3-Next FP8 TP1 (AW)	1000	100	16	245.085	309.724	0.7913	7.625	7.653	0.9962	17532.344	16421.825	1.0676
Qwen3-Next FP8 TP1 (AW)	1000	100	64	611.394	664.342	0.9203	14.527	13.732	1.0579	33895.313	34572.320	0.9804
Qwen3-Next FP8 TP1 (AW)	5000	500	4	286.613	267.322	1.0722	5.276	6.162	0.8561	7533.591	6581.901	1.1446
Qwen3-Next FP8 TP1 (AW)	5000	500	16	808.739	780.316	1.0364	7.259	8.036	0.9032	19847.769	18363.911	1.0808
Qwen3-Next FP8 TP1 (AW)	5000	500	64	2278.998	1959.056	1.1633	14.480	15.738	0.9200	36963.863	35846.447	1.0312
Qwen3-Next FP8 TP1 (AW)	10000	1000	4	521.919	526.609	0.9911	5.482	6.368	0.8609	7333.530	6387.626	1.1481
Qwen3-Next FP8 TP1 (AW)	10000	1000	16	1398.791	1458.320	0.9592	7.632	8.124	0.9395	19491.285	18378.452	1.0606
Qwen3-Next FP8 TP1 (MET)	1024	1024	4	88.832	70.514	1.2598	5.395	5.213	1.0350	1413.593	1467.754	0.9631
Qwen3-Next FP8 TP1 (MET)	1024	1024	16	108.621	104.135	1.0431	7.613	6.689	1.1380	4074.062	4600.737	0.8855
Qwen3-Next FP8 TP1 (MET)	1024	1024	64	173.022	159.059	1.0878	13.061	10.923	1.1957	9420.281	9462.649	0.996
Qwen3-Next FP8 TP1 (MET)	8192	1024	4	190.233	157.673	1.2065	5.560	5.549	1.0020	6085.826	6114.170	0.9954
Qwen3-Next FP8 TP1 (MET)	8192	1024	16	307.567	262.706	1.1708	8.252	7.531	1.0957	16274.086	17778.607	0.9154
Qwen3-Next FP8 TP1 (MET)	8192	1024	64	647.506	554.209	1.1683	16.996	15.648	1.0862	32020.956	31777.138	1.0207

P0 atom-vLLM Performance Regression Check

Branch head: 776e0b3
Image: docker.io/rocm/atom-dev:vllm-v0.19.0-nightly_20260526

Model	Case	Target TPOT ms	Native TPOT ms	TPOT Ratio	Target total token/s	Native total token/s	Token/s Ratio
DeepSeek-V3.2 FP8 MTP TP4 (AW)	1k/100 con4	10.137	9.912	0.978x	3486.740	3505.564	0.995x
DeepSeek-V3.2 FP8 MTP TP4 (AW)	1k/100 con8	13.554	13.283	0.980x	5465.788	5418.222	1.009x
DeepSeek-V3.2 FP8 TP4 (AW)	1k/100 con4	14.816	15.034	1.015x	2679.705	2631.569	1.018x
DeepSeek-V3.2 FP8 TP4 (AW)	1k/100 con8	19.873	20.074	1.010x	3993.358	3980.111	1.003x
Qwen3-Next-80B-A3B-Instruct-FP8 TP1 (MET)	1k/100 con4	6.239	6.693	1.073x	5773.805	5672.996	1.018x
Qwen3-Next-80B-A3B-Instruct-FP8 TP1 (MET)	1k/100 con8	8.583	8.659	1.009x	8983.642	8869.306	1.013x
gpt-oss-120b TP1 (MET)	1k/100 con4	5.585	5.408	0.968x	5844.730	5769.323	1.013x
gpt-oss-120b TP1 (MET)	1k/100 con8	6.593	7.202	1.092x	11695.984	10529.120	1.111x
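
The ratio columns in this regression table appear to be oriented so that a value above 1.0 favors the target branch: the TPOT ratio matches native / target, and the token/s ratio matches target / native. A minimal sketch that reproduces two rows under that assumption (the function name is illustrative and not part of ATOM):

```python
def regression_ratios(target_tpot_ms, native_tpot_ms, target_tok_s, native_tok_s):
    """Return (tpot_ratio, tok_s_ratio), both oriented so that > 1.0 favors target.

    Assumption: this orientation is inferred from the table values, it is not
    stated in the PR itself.
    """
    tpot_ratio = native_tpot_ms / target_tpot_ms  # lower TPOT is better
    tok_s_ratio = target_tok_s / native_tok_s     # higher throughput is better
    return round(tpot_ratio, 3), round(tok_s_ratio, 3)


# DeepSeek-V3.2 FP8 MTP TP4 (AW), 1k/100 con4
mtp_con4 = regression_ratios(10.137, 9.912, 3486.740, 3505.564)
# DeepSeek-V3.2 FP8 TP4 (AW), 1k/100 con4
tp4_con4 = regression_ratios(14.816, 15.034, 2679.705, 2631.569)
print(mtp_con4, tp4_con4)
```

Note that the larger performance table above uses a different convention: its ratio columns are all target / ref, so a TTFT or TPOT ratio below 1.0 is an improvement there.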

Remaining perf test

atom-vllm

Qwen3-Next FP8 TP1 (MET) 1k/1k con64 - no regression
Qwen3-Next FP8 TP1 (MET) 8k/1k con16 - no regression

File Layout:

File	Responsibility
`backend.py`	Backend descriptors (MHA, MLA, SparseMLA, SparseIndexer, GDN); each maps to its builder and impl class
`layer.py`	Factory: dispatches to MHA/MLA/SparseMLA layer based on model config
`layer_common.py`	Shared helper: registers layer into vLLM `static_forward_context`
`layer_mha.py`	`AttentionForVllmMHA`: MHA attention layer (forward, KV cache, scales)
`layer_mla.py`	`AttentionForVllmMLA`: MLA attention layer + helper functions (reorg_kvcache, triton BMM wrappers)
`layer_sparse_mla.py`	Sparse MLA: indexer/cache decorators, sparse seqlen triton kernel
`layer_gdn.py`	`GatedDeltaNet` attention layer
`metadata.py`	All metadata dataclasses + builders (MHA, MLA, SparseMLA, SparseIndexer)
`ops.py`	`torch.compile` custom ops (`atom_vllm_mha_attention`, `atom_vllm_mla_attention`)
`__init__.py`	Package init

Code:

Category	OLD file	OLD name	NEW file	NEW name
MHA impl	`atom/plugin/attention_mha.py`	`PagedAttentionImplDecoratorForPluginMode`	`atom/plugin/vllm/attention/layer_mha.py`	`AttentionForVllmMHA`
MLA impl (dense)	`atom/plugin/attention_mla.py`	`MLAAttentionImplDecoratorForPluginMode`	`atom/plugin/vllm/attention/layer_mla.py`	`AttentionForVllmMLA`
MLA impl (sparse)	`atom/plugin/attention_mla_sparse.py`	`MLASparseAttentionImplDecoratorForPluginMode`	`atom/plugin/vllm/attention/layer_mla.py`	`AttentionForVllmSparseMLA`
GDN impl	`atom/plugin/vllm/attention_backend/attention_gdn.py`	`GatedDeltaNet`	`atom/plugin/vllm/attention/layer_gdn.py`	`GatedDeltaNet`
Entry factory	`atom/model_ops/paged_attention.py` (vllm branch in PagedAttention)	`PagedAttention`	`atom/plugin/vllm/attention/layer.py`	`AttentionForVllm`
MHA backend	`atom/model_ops/attentions/aiter_attention.py`	`AiterBackend` (reused native)	`atom/plugin/vllm/attention/backend.py`	`AiterMhaBackendForVllm`
MLA backend (dense)	`atom/model_ops/attentions/aiter_mla.py`	`AiterMLABackend` (reused native)	`atom/plugin/vllm/attention/backend.py`	`AiterMlaBackendForVllm`
MLA backend (sparse)	`atom/plugin/vllm/attention_backend/mla_sparse.py`	`AiterMLASparseBackend`	`atom/plugin/vllm/attention/backend.py`	`AiterSparseMlaBackendForVllm`
MLA backend (sparse indexer)	`atom/plugin/vllm/attention_backend/mla_sparse.py`	`AiterMLASparseIndexerBackend`	`atom/plugin/vllm/attention/backend.py`	`AiterSparseMlaIndexerBackendForVllm`
GDN backend	`atom/plugin/vllm/attention_backend/gdn_attn.py`	`GDNAttentionBackend`	`atom/plugin/vllm/attention/backend.py`	`GDNAttentionBackend`
MHA metadata	`atom/plugin/attention.py`	`AiterFlashAttentionMetadataForPluginMode`	`atom/plugin/vllm/attention/metadata.py`	`AiterMhaMetadataForVllm`
MHA phase metadata	`atom/plugin/attention.py`	`AiterFlashAttentionPhaseMetadata`	`atom/plugin/vllm/attention/metadata.py`	`AiterMhaPhaseMetadata`
MHA chunk-prefill metadata	`atom/plugin/attention.py`	`AiterFlashAttentionChunkPrefillMetadata`	`atom/plugin/vllm/attention/metadata.py`	`AiterChunkPrefillMetadata`
MLA metadata	`atom/plugin/attention.py`	`AiterMLACommonMetadataForPluginMode`	`atom/plugin/vllm/attention/metadata.py`	`AiterMlaMetadataForVllm`
MLA decode metadata	`atom/plugin/attention.py`	`AiterMLADecodeMetadataForPluginMode`	`atom/plugin/vllm/attention/metadata.py`	`AiterMlaDecodeMetadataForVllm`
MLA prefill metadata	`atom/plugin/attention.py`	`AiterMLACommonPrefillMetadataForPluginMode`	`atom/plugin/vllm/attention/metadata.py`	`AiterMlaPrefillMetadataForVllm`
MLA chunked-context metadata	`atom/plugin/attention.py`	`AiterMLAChunkedContextMetadataForPluginMode`	`atom/plugin/vllm/attention/metadata.py`	`AiterMlaChunkedContextMetadataForVllm`
MLA sparse metadata	`atom/plugin/attention.py`	`AiterMLASparseMetadataForPluginMode`	`atom/plugin/vllm/attention/metadata.py`	`AiterMlaSparseMetadataForVllm`
Sparse indexer metadata	`atom/plugin/attention.py`	`vllmDeepseekV32IndexerMetadata`	`atom/plugin/vllm/attention/metadata.py`	`AiterMlaSparseIndexerMetadataForVllm`
MHA builder	`atom/plugin/attention.py`	`vllmAttentionMetadataBuilderMethods`	`atom/plugin/vllm/attention/metadata.py`	`AiterMhaMetadataBuilderForVllm`
MLA builder	`atom/plugin/attention.py`	`vllmMLAAttentionMetadataBuilderMethods`	`atom/plugin/vllm/attention/metadata.py`	`AiterMlaMetadataBuilderForVllm`
MLA sparse builder	`atom/plugin/attention.py`	`vllmMLASparseAttentionMetadataBuilderMethods`	`atom/plugin/vllm/attention/metadata.py`	`AiterMlaSparseMetadataBuilder`
MLA sparse indexer builder	`atom/plugin/attention.py`	`vllmMLASparseIndexerAttentionMetadataBuilderMethods`	`atom/plugin/vllm/attention/metadata.py`	`AiterMlaSparseIndexerMetadataBuilder`
MHA forward dispatch	`atom/plugin/attention.py`	`unified_attention_with_output_base_for_plugin_mode`	`atom/plugin/vllm/attention/ops.py`	`torch.ops.aiter.atom_vllm_mha_attention`
MLA forward dispatch	`atom/plugin/attention.py`	`unified_attention_with_output_base_for_plugin_mode`	`atom/plugin/vllm/attention/ops.py`	`torch.ops.aiter.atom_vllm_mla_attention`
Indexer decorator	`atom/plugin/attention_mla_sparse.py`	`IndexerDecoratorForPluginMode`	`atom/plugin/vllm/attention/layer_sparse_mla.py`	`IndexerDecoratorForPluginMode`
Indexer cache decorator	`atom/plugin/attention_mla_sparse.py`	`DeepseekV32IndexerCacheDecoratorForPluginMode`	`atom/plugin/vllm/attention/layer_sparse_mla.py`	`DeepseekV32IndexerCacheDecoratorForPluginMode`
MoE decorator	`atom/plugin/moe.py`	`FusedMoEDecoratorForPluginMode`	`atom/plugin/vllm/moe.py`	`FusedMoEDecoratorForPluginMode`
vLLM MLA patch	`atom/plugin/vllm/mla_patch.py`	`patch_vllm_mla_attention`	—	removed (no longer needed)
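
For downstream code that still references the old names, a small scan can flag stale identifiers. The sketch below uses a handful of rows copied from the table above; the helper `find_stale_names` is hypothetical and only reports matches. It does not rewrite code, since the module paths and the decorator-versus-class semantics changed as well.

```python
import re

# Subset of the OLD name -> NEW name rows from the table above.
RENAMES = {
    "PagedAttentionImplDecoratorForPluginMode": "AttentionForVllmMHA",
    "MLAAttentionImplDecoratorForPluginMode": "AttentionForVllmMLA",
    "MLASparseAttentionImplDecoratorForPluginMode": "AttentionForVllmSparseMLA",
    "AiterFlashAttentionMetadataForPluginMode": "AiterMhaMetadataForVllm",
    "AiterMLACommonMetadataForPluginMode": "AiterMlaMetadataForVllm",
}

# Match whole identifiers only, so longer names are not partially matched.
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def find_stale_names(source: str) -> dict:
    """Return {old_name: new_name} for every old identifier found in source."""
    return {name: RENAMES[name] for name in _PATTERN.findall(source)}


example = "from atom.plugin.attention import AiterFlashAttentionMetadataForPluginMode"
print(find_stale_names(example))
```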

wuhuikx · 2026-05-19T12:10:23Z

@zejunchen-zejun Please resolve the conflict. Is this PR ready for review?

Copilot

Pull request overview

This PR refactors ATOM’s attention integration to clearly separate native ATOM, ATOM-vLLM plugin, and ATOM-SGLang plugin attention paths. It removes decorator/monkey-patch driven behavior in favor of explicit mode dispatch + vLLM-owned attention-layer implementations, and drops the now-unsupportable “disable only attention” fallback flag.

Changes:

Introduces a frontend Attention dispatcher (atom.model_ops.base_attention.Attention) that selects the correct attention implementation per runtime mode (native / vLLM / SGLang).
Replaces prior vLLM attention patching/decorators with a dedicated atom/plugin/vllm/attention/ stack (layers, backends, metadata, custom ops) and removes ATOM_DISABLE_VLLM_PLUGIN_ATTENTION + PluginConfig.vllm_use_atom_attention.
Updates tests and documentation/recipes to match the new plugin behavior and env-flag semantics.
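
The explicit mode dispatch described above could look roughly like the following sketch. The classes are empty stand-ins that only borrow the names `AttentionForVllm` and `AttentionForSGLang` from this PR; `RuntimeMode`, `NativeAttention`, and `build_attention` are hypothetical names, and the real dispatcher in `atom.model_ops.base_attention` may be structured differently.

```python
from enum import Enum


class RuntimeMode(Enum):
    NATIVE = "native"
    VLLM = "vllm"
    SGLANG = "sglang"


class _AttentionStub:
    """Stand-in base class; real layers would own kernels and KV-cache logic."""

    def __init__(self, num_heads: int, head_dim: int):
        self.num_heads = num_heads
        self.head_dim = head_dim


class NativeAttention(_AttentionStub):
    pass


class AttentionForVllm(_AttentionStub):
    pass


class AttentionForSGLang(_AttentionStub):
    pass


# One explicit table instead of decorators or monkey-patching.
_IMPL_BY_MODE = {
    RuntimeMode.NATIVE: NativeAttention,
    RuntimeMode.VLLM: AttentionForVllm,
    RuntimeMode.SGLANG: AttentionForSGLang,
}


def build_attention(mode: RuntimeMode, num_heads: int, head_dim: int):
    """Pick the attention implementation for the given runtime mode."""
    return _IMPL_BY_MODE[mode](num_heads, head_dim)


layer = build_attention(RuntimeMode.VLLM, num_heads=8, head_dim=64)
print(type(layer).__name__)
```

The point of such a table is that the selection happens in one visible place, so no implementation class needs to be mutated at import time.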

Reviewed changes

Copilot reviewed 46 out of 47 changed files in this pull request and generated 1 comment.

File	Description
tests/test_envs.py	Removes the deprecated `ATOM_DISABLE_VLLM_PLUGIN_ATTENTION` env var expectations/tests.
tests/plugin/test_plugin_env_flags.py	Simplifies plugin-disable behavior test to only cover `ATOM_DISABLE_VLLM_PLUGIN`.
tests/plugin/test_plugin_config_translation.py	Removes translation expectations tied to `vllm_use_atom_attention`.
recipes/atom_vllm/Qwen3Next.md	Documents the new “no attention-only disable” behavior (note: currently placed inside a bash block).
recipes/atom_vllm/Qwen3.5.md	Adds guidance about full plugin disable vs attention-only disable.
recipes/atom_vllm/Llama.md	Removes usage/docs of the removed attention-only disable flag.
pyproject.toml	Removes obsolete entry-point comment referencing the removed flag.
docs/vllm_plugin_backend_guide.md	Updates lifecycle/architecture docs to reflect new vLLM attention layers/backends layout and semantics.
docs/rfc_attention_refactor.md	Adds an RFC-style doc describing the refactor motivation and architecture.
docs/rfc_attention_refactor_atom_vllm_sglang.md	Adds an alternate RFC doc covering the same refactor at a high level.
docs/review_comment_for_attn_refactor.md	Adds an internal review notes document for the refactor.
docs/environment_variables.md	Removes the deprecated attention-only disable env var from the catalog.
docs/atom_vllm_attention_refactor_plan.md	Adds a refactor planning/architecture document (CN).
docs/atom_vllm_attention_architecture_analysis.md	Adds a detailed analysis document of old vs new attention architecture (CN).
atom/utils/forward_context.py	Removes plugin-only `plugin_metadata` plumbing from `AttentionMetaData`.
atom/utils/envs.py	Removes parsing of `ATOM_DISABLE_VLLM_PLUGIN_ATTENTION`.
atom/plugin/vllm/register.py	Removes MLA patching hook and attention-only disable handling.
atom/plugin/vllm/platform.py	Stops overriding vLLM attention backend selection; documents that attention is owned by ATOM vLLM layers.
atom/plugin/vllm/moe.py	Moves vLLM-only MoE naming adaptation into `atom.plugin.vllm`.
atom/plugin/vllm/mla_patch.py	Deletes legacy vLLM MLA patching module.
atom/plugin/vllm/attention/ops.py	Adds ATOM-owned vLLM custom ops (`atom_vllm_mha_attention`, `atom_vllm_mla_attention`) and marks them as splitting ops.
atom/plugin/vllm/attention/mla_sparse_impl.py	Removes legacy plugin-metadata assumptions and aligns sparse indexer path with new metadata types/backends.
atom/plugin/vllm/attention/mla_impl.py	Adds MLA helper utilities (e.g., fused GEMM imports, `reorg_kvcache`).
atom/plugin/vllm/attention/metadata.py	Adds vLLM-specific metadata dataclasses and helper utilities for MHA/MLA/sparse/indexer.
atom/plugin/vllm/attention/layer.py	Adds `AttentionForVllm` factory and ensures custom ops are registered via import side-effect.
atom/plugin/vllm/attention/layer_mla.py	Implements vLLM MLA layer(s) using native `MLAAttention` for weight processing + vLLM `AttentionLayerBase` contract for execution.
atom/plugin/vllm/attention/layer_mha.py	Implements vLLM MHA layer implementing `AttentionLayerBase`, using ATOM kernels + ATOM custom ops.
atom/plugin/vllm/attention/layer_common.py	Adds shared vLLM layer helpers (kv-cache dtype init, static context registration, default scale init).
atom/plugin/vllm/attention/__init__.py	Adds package docstring; avoids importing heavy submodules by default.
atom/plugin/vllm/attention_backend/mla_sparse.py	Refactors sparse MLA backends/builders to explicit vLLM-facing classes and metadata builders.
atom/plugin/sglang/attention.py	Adds `AttentionForSGLang` wrapper as the SGLang attention entrypoint for the dispatcher.
atom/plugin/register.py	Makes `set_attn_cls()` a compatibility no-op (logs only); attention selection now happens in the dispatcher.
atom/plugin/config.py	Removes `vllm_use_atom_attention` from `PluginConfig` and translation.
atom/models/deepseek_v2.py	Updates imports to point at new sparse MLA/indexer integration location.
atom/model_ops/paged_attention.py	Renames native attention layer to `Attention` and asserts it’s not instantiated in plugin mode.
atom/model_ops/moe.py	Updates import path for the vLLM-only MoE decorator.
atom/model_ops/base_attention.py	Adds mode-dispatching `Attention` constructor; adds wrappers for PA kernels; removes redundant `layer=` passing.
atom/model_ops/attentions/triton_mha.py	Removes dependency on mutable `atom.model_ops.Attention`; always returns `PagedAttentionImpl`.
atom/model_ops/attentions/aiter_mla.py	Removes plugin decorators/branching; restores native-only backend naming and builder behavior.
atom/model_ops/attentions/aiter_attention.py	Removes plugin decorators/branching; simplifies impl selection; uses proper `super()`.
atom/model_ops/attention_mla.py	Removes plugin-mode decorator injection and plugin forward branching; consolidates on native `forward_impl`.
atom/model_ops/attention_mha.py	Removes plugin-mode decorator injection and plugin branching; consolidates on native `forward_impl`.
atom/model_ops/__init__.py	Stops exporting mutable attention symbols; exports only the frontend dispatcher `Attention`.
atom/config.py	Adds ATOM vLLM attention ops to the default splitting ops list.
.claude/commands/atom-vllm-benchmark-guide.md	Updates benchmark guide to remove references to the deprecated attention-only disable flag.

Comments suppressed due to low confidence (1)

atom/plugin/vllm/attention/layer_mha.py:896

get_kv_cache_spec() always returns SlidingWindowSpec because self.sliding_window is set to -1 when per_layer_sliding_window is None, and the check only tests is not None. This can generate an invalid/meaningless sliding-window KV cache spec for the default case. Consider storing None when sliding window is disabled, or checking self.sliding_window > 0 (or != -1) before returning SlidingWindowSpec.

Copilot

Pull request overview

Copilot reviewed 41 out of 42 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

atom/plugin/vllm/attention/layer_mha.py:896

self.sliding_window is initialized to -1 when sliding window is disabled, but get_kv_cache_spec() checks only is not None, so it will always return SlidingWindowSpec (including for sliding_window=-1). This likely misinforms vLLM KV-cache allocation for non-sliding-window models. Treat -1 (and possibly None) as the disabled case and return FullAttentionSpec then.
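
A minimal sketch of the check suggested in this comment, treating both `None` and `-1` as "sliding window disabled". The two spec classes below are simplified stand-ins defined only for this example, not vLLM's actual `FullAttentionSpec` and `SlidingWindowSpec` (whose constructors take more fields), and `select_kv_cache_spec` is a hypothetical helper name.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class FullAttentionSpec:
    """Stand-in for a full-attention KV cache spec."""
    block_size: int


@dataclass(frozen=True)
class SlidingWindowSpec:
    """Stand-in for a sliding-window KV cache spec."""
    block_size: int
    sliding_window: int


def select_kv_cache_spec(
    block_size: int, sliding_window: Optional[int]
) -> Union[FullAttentionSpec, SlidingWindowSpec]:
    # Only a strictly positive window means sliding-window attention is on;
    # None and the -1 sentinel both fall through to full attention.
    if sliding_window is not None and sliding_window > 0:
        return SlidingWindowSpec(block_size=block_size, sliding_window=sliding_window)
    return FullAttentionSpec(block_size=block_size)


print(select_kv_cache_spec(16, -1))
print(select_kv_cache_spec(16, 128))
```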

wuhuikx · 2026-05-21T03:23:11Z

I want to hold off on this and merge DS-V3.2-MTP and GLM-4.7-MTP first. Otherwise I'm afraid there will be conflicts.

whx-sjtu · 2026-05-21T08:02:15Z

Can we also refactor the atom config part? I think it would be better to make `set_current_atom_config` a context manager, so that `get_current_atom_config` only takes effect inside the context created by `set_current_atom_config`, just like vLLM. Then we no longer have to pass `atom_config` into `forward_context`, which looks really ugly. @zejunchen-zejun

zejunchen-zejun · 2026-05-22T03:04:22Z

> Can we also refactor the atom config part? I think it would be better to make `set_current_atom_config` a context manager, so that `get_current_atom_config` only takes effect inside the context created by `set_current_atom_config`, just like vLLM. Then we no longer have to pass `atom_config` into `forward_context`, which looks really ugly. @zejunchen-zejun

Makes perfect sense. The current atom-vllm config has two risky points:

1. The config is fetched from a global stateful singleton, for both the main model and the draft model.
2. The global atom-vllm config is passed through the forward context.

I agree with refactoring the atom-vllm config to use a `with` context, but it will not happen in this PR. This PR should stay focused on the attention refactor; the atom-vllm config can be refactored in the next PR.
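
The context-manager approach proposed in this thread could be sketched as follows. Only the two function names come from the discussion; the `ContextVar`-based implementation is an assumption for illustration and is not ATOM's or vLLM's actual code.

```python
from contextlib import contextmanager
from contextvars import ContextVar

# Holds the active config; deliberately has no default, so reading it
# outside of a `with set_current_atom_config(...)` block fails loudly.
_current_atom_config: ContextVar = ContextVar("current_atom_config")


@contextmanager
def set_current_atom_config(config):
    """Make `config` the current config for the duration of the block."""
    token = _current_atom_config.set(config)
    try:
        yield config
    finally:
        # Restore whatever was active before (or the unset state).
        _current_atom_config.reset(token)


def get_current_atom_config():
    """Return the active config; only valid inside set_current_atom_config."""
    try:
        return _current_atom_config.get()
    except LookupError:
        raise RuntimeError(
            "get_current_atom_config() called outside set_current_atom_config()"
        ) from None


# Nested scopes let a draft model temporarily override the main model's config.
with set_current_atom_config({"role": "main"}):
    with set_current_atom_config({"role": "draft"}):
        inner_role = get_current_atom_config()["role"]
    outer_role = get_current_atom_config()["role"]
print(inner_role, outer_role)
```

Scoping the config this way would address both risky points listed above: there is no long-lived global singleton, and the config does not need to travel through the forward context.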

zejunchen-zejun · 2026-06-01T13:41:33Z

ATOM native GPTOSS: Accuracy test: 0.8749052312357847 < 0.88
ATOM native Qwen3.5: Accuracy test: 0.8233510235026535 < 0.835

zejunchen-zejun force-pushed the zejun/refact_attn_0511 branch from 3b061e9 to b832aee Compare May 15, 2026 05:30

zejunchen-zejun changed the title ~~[feat][Attention Refactor] Reconstruct the Attention arch~~ [feat][ATOM-vLLM][Attention Refactor] Reconstruct the Attention arch May 15, 2026

zejunchen-zejun changed the title ~~[feat][ATOM-vLLM][Attention Refactor] Reconstruct the Attention arch~~ [feat][ATOM-vLLM][Attention Refactor] Reconstruct the Attention Arch May 15, 2026

zejunchen-zejun mentioned this pull request May 15, 2026

[ATOM-Plugin][Attention] Refactor ATOM-Plugin Attention Architecture #758

Open

PerryZhang01 reviewed May 19, 2026

Comment thread atom/model_ops/base_attention.py

wuhuikx requested review from ganyi1996ppo and whx-sjtu May 19, 2026 12:09

zejunchen-zejun force-pushed the zejun/refact_attn_0511 branch from 5ca6fbc to 174c96e Compare May 19, 2026 13:30

zejunchen-zejun changed the title ~~[feat][ATOM-vLLM][Attention Refactor] Reconstruct the Attention Arch~~ [Refactor][ATOM-vLLM][Attention] Refactor ATOM-vLLM Attention May 20, 2026

zejunchen-zejun marked this pull request as ready for review May 20, 2026 04:58

Copilot AI review requested due to automatic review settings May 20, 2026 04:58

Copilot started reviewing on behalf of zejunchen-zejun May 20, 2026 05:00

Copilot AI reviewed May 20, 2026

Comment thread recipes/atom_vllm/Qwen3Next.md

zejunchen-zejun requested review from ZhangLirong-amd, ZhiweiYan-96, Copilot, valarLip, wuhuikx and zhuyuhua-v May 20, 2026 05:31

zejunchen-zejun force-pushed the zejun/refact_attn_0511 branch from e53ccff to 41c09ac Compare May 20, 2026 07:09

Copilot started reviewing on behalf of zejunchen-zejun May 20, 2026 07:11

Copilot AI reviewed May 20, 2026

Comment thread atom/plugin/vllm/attention/backend.py Outdated

Comment thread atom/plugin/vllm/attention/metadata.py

ZhangLirong-amd reviewed May 20, 2026

Comment thread atom/plugin/vllm/attention/ops.py Outdated

whx-sjtu requested changes May 21, 2026

Comment thread atom/plugin/vllm/attention/backend.py Outdated

ganyi1996ppo reviewed May 21, 2026

Comment thread atom/model_ops/base_attention.py

zejunchen-zejun marked this pull request as draft May 22, 2026 02:38

zejunchen-zejun added 23 commits May 31, 2026 14:45

refine attn metadata name

bcccc4b

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

[rebase] finish rebase main 0519

8968f4c

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

make lint happy

c440a60

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

fix import module failure

5e44987

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

find back missing code after rebase

0b7ec35

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

remove new added doc file

ba1a1d2

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

use mark split decorator

d3dbc15

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

put all atom-vllm related metadata builders into metadata.py

d3e7782

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

make lint happy

0bd3d93

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

fix rebase conflict

db17460

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

make lint happy

bc45a1e

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

add new metadata name to MTP allow list

9992f4e

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

retrieve back code change

d490065

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

remove plugin_mode name

f06cc93

remove legacy code and comment add FIXME for one legacy method used by atom-sgl Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

refine the plugin attn folder

34bda1e

move gdn into atom/plugin/vllm/attention remove folder atom/plugin/vllm/attention_backend Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

make lint happy

9928d39

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

remove splitting path for atom-vllm

5be91bb

as atom-vllm doesn't need it for now Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

fix force assign max qo len to 1

423f783

fix missing kv dtype for sparse MLA Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

refine the attention file in atom-vllm folder

1adbdc6

remove the multi inheritance and inline the attention methods Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

change the sparse mla metadata builder class name

d74471f

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

retrieve missing code

dc2c6bc

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

remove context info and finish rebase

72f7b76

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

finish rebase and port the main code to new arch

f9edca3

instead of deprecating it Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

zejunchen-zejun force-pushed the zejun/refact_attn_0511 branch from 35c0aad to f9edca3 Compare May 31, 2026 06:45

whx-sjtu mentioned this pull request Jun 1, 2026

[ATOM-vLLM] Upgrade vLLM version to v0.22.0 #1006

Open

valarLip approved these changes Jun 1, 2026

ganyi1996ppo approved these changes Jun 1, 2026

zejunchen-zejun merged commit f03f845 into main Jun 1, 2026
54 of 65 checks passed

zejunchen-zejun deleted the zejun/refact_attn_0511 branch June 1, 2026 13:42

zejunchen-zejun merged 33 commits into main from zejun/refact_attn_0511


10 participants
