[Refactor][ATOM-vLLM][Attention] Refactor ATOM-vLLM Attention #750

Merged
zejunchen-zejun merged 33 commits into main from zejun/refact_attn_0511
Jun 1, 2026

Collaborator

@zejunchen-zejun zejunchen-zejun commented May 11, 2026

This PR refactors the attention architecture for ATOM-vLLM. See the RFC: #758

Accuracy:

| mode | model case name | acc value | result / core error |
|---|---|---|---|
| atom-test | MiniMax-M2.7 | 0.897650 | passed |
| atom-test | gpt-oss-120b (2 GPUs) | 0.892343 | passed |
| atom-test | DeepSeek-R1-0528-FP4 | 0.940864 | passed |
| atom-test | DeepSeek-R1-0528-FP4 MTP | 0.940864 | passed |
| atom-test | GLM-5.1-MXFP4 | 0.886277 | passed |
| atom-test | GLM-5.1-MXFP4 MTP | 0.893101 | passed |
| atom-test | Kimi-K2.5-MXFP4 | 0.937832 | passed |
| atom-test | Qwen3.5-397B-A17B-FP8 | 0.858984 | passed |
| atom-test | Qwen3.5-397B-A17B-FP8 MTP | 0.854435 | passed |
| atom-test | Qwen3.5-397B-A17B-MXFP4 | 0.827142 | passed |
| atom-test | Qwen3.5-397B-A17B-MXFP4 MTP | 0.815011 | passed |
| atom-test | Llama-3.3-70B-Instruct-MXFP4-Preview | 0.906748 | passed |
| atom-test | MiniMax-M2.7-MXFP4 | 0.897650 | passed |
| atom-test | DeepSeek-R1-0528 | 0.946171 | passed |
| atom-test | DeepSeek-R1-0528 MTP | 0.949962 | passed |
| atom-test | DeepSeek-V4-Pro | 0.949962 | passed |
| atom-test | DeepSeek-V4-Pro MTP | 0.956027 | passed |
| atom-test | GLM-5-FP8 | 0.934041 | passed |
| atom-test | GLM-5.1-FP8 | 0.893859 | passed |
| atom-test | Kimi-K2.5-MXFP4 Eagle3 | 0.928734 | passed |
| atom-test | Qwen3-235B-A22B-Instruct-2507-FP8 | 0.893101 | passed |
| atom-test | Qwen3-235B-A22B-Instruct-2507-MXFP4 | 0.874147 | passed |
| atom-test | Qwen3-Next-80B-A3B-Thinking | 0.705080 | passed |
| atom-test | Qwen3.5-397B-A17B | 0.839272 | passed |
| atom-vllm-test | Qwen3.5-35B-A3B-FP8 TP2 | 0.802123 | passed |
| atom-vllm-test | Kimi-K2-Thinking-MXFP4 TP4 | 0.935557 | passed |
| atom-vllm-test | DeepSeek-R1-FP8 TP8 | 0.946171 | passed |
| atom-vllm-accuracy-validation | MiniMax-M2.5 TP2 | 0.929492 | passed |
| atom-vllm-accuracy-validation | DeepSeek-V3.2-FP8 TP4 | 0.952995 | passed |
| atom-vllm-accuracy-validation | GLM-4.7-FP8 MTP TP4 | 0.945413 | passed |
| atom-vllm-accuracy-validation | Kimi-K2.5-MXFP4 TP4 | 0.935557 | passed |
| atom-vllm-accuracy-validation | Qwen3-Next-80B-A3B-Instruct-FP8-MTP TP4 | 0.817285 | passed |
| atom-vllm-accuracy-validation | Qwen3-Next-80B-A3B-Instruct-FP8 TP1 | 0.824109 | passed |
| atom-vllm-accuracy-validation | gpt-oss-120b TP1 | 0.884761 | passed |
| atom-vllm-accuracy-validation | GLM-4.7-FP8 TP8 | 0.943897 | passed |
| atom-vllm-accuracy-validation | DeepSeek-R1-0528-MXFP4 TP8 | 0.940106 | passed |
| atom-vllm-accuracy-validation | Qwen3.5-397B-A17B-FP8 TP4 | 0.828658 | passed |
| atom-vllm-accuracy-validation | Qwen3.5-397B-A17B-MXFP4 TP4 | 0.850644 | passed |
| atom-vllm-accuracy-validation | Llama-3.1-8B-Instruct TP1 | 0.763457 | passed |
| atom-vllm-accuracy-validation | GLM-5.1-FP8 TP8 | 0.940864 | passed |
| atom-vllm-accuracy-validation | Meta-Llama-3.1-405B-Instruct-FP8 TP8 | | FAILED AITER GEMM A16W16 asm: B must be Bf16 or Fp16, got fp8 - same as main branch |
| atom-vllm-accuracy-validation | Qwen3-235B-A22B-Instruct-2507-FP8 TP8+EP8 | 0.896133 | passed |
| atom-vllm-accuracy-validation | Qwen3.5-397B-A17B TP8 | 0.856710 | passed |
| atom-vllm-accuracy-validation | GLM4.7 TP4 MTP | 0.945413 | passed |
| atom-vllm-accuracy-validation | DeepseekV3.2 TP4 MTP | 0.951478 | passed |

Performance:

| model | ISL | OSL | C | target TTFT (ms) | ref TTFT (ms) | TTFT ratio | target TPOT (ms) | ref TPOT (ms) | TPOT ratio | target tok/s | ref tok/s | tok/s ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-V3.2 FP8 TP8 (AW) | 1000 | 100 | 4 | 191.810 | 165.205 | 1.1610 | 13.549 | 14.188 | 0.9549 | 2866.011 | 2799.385 | 1.0238 |
| DeepSeek-V3.2 FP8 TP8 (AW) | 1000 | 100 | 16 | 548.827 | 586.060 | 0.9365 | 15.835 | 16.392 | 0.9660 | 8303.899 | 7960.113 | 1.0432 |
| DeepSeek-V3.2 FP8 TP8 (AW) | 1000 | 100 | 64 | 1287.020 | 1211.786 | 1.0621 | 31.677 | 33.459 | 0.9467 | 15887.511 | 15530.246 | 1.0230 |
| DeepSeek-V3.2 FP8 TP8 (AW) | 5000 | 500 | 4 | 716.283 | 716.028 | 1.0004 | 14.293 | 14.944 | 0.9564 | 2802.453 | 2691.363 | 1.0413 |
| DeepSeek-V3.2 FP8 TP8 (AW) | 5000 | 500 | 16 | 1794.509 | 1954.756 | 0.9180 | 19.378 | 20.061 | 0.9659 | 7672.828 | 7351.777 | 1.0437 |
| DeepSeek-V3.2 FP8 TP8 (AW) | 5000 | 500 | 64 | 2968.983 | 2988.041 | 0.9936 | 42.946 | 44.588 | 0.9632 | 14414.222 | 13933.617 | 1.0345 |
| DeepSeek-V3.2 FP8 TP8 (AW) | 10000 | 1000 | 4 | 1300.597 | 1328.347 | 0.9791 | 14.690 | 15.329 | 0.9583 | 2753.716 | 2643.605 | 1.0417 |
| DeepSeek-V3.2 FP8 TP8 (AW) | 10000 | 1000 | 16 | 3139.281 | 2718.458 | 1.1548 | 20.531 | 22.048 | 0.9312 | 7439.199 | 7102.136 | 1.0475 |
| Kimi-K2.5-MXFP4 TP4 (MET) | 1024 | 1024 | 4 | 118.011 | 189.830 | 0.6217 | 8.904 | 8.987 | 0.9908 | 864.054 | 849.252 | 1.0174 |
| Kimi-K2.5-MXFP4 TP4 (MET) | 1024 | 1024 | 16 | 147.477 | 163.491 | 0.9021 | 13.870 | 14.000 | 0.9907 | 2248.009 | 2225.695 | 1.0100 |
| Kimi-K2.5-MXFP4 TP4 (MET) | 1024 | 1024 | 64 | 260.491 | 232.016 | 1.1227 | 23.389 | 24.012 | 0.9741 | 5282.662 | 5149.047 | 1.0259 |
| Kimi-K2.5-MXFP4 TP4 (MET) | 8192 | 1024 | 4 | 326.367 | 268.828 | 1.2140 | 9.623 | 9.623 | 1.0000 | 3521.238 | 3544.580 | 0.9934 |
| Kimi-K2.5-MXFP4 TP4 (MET) | 8192 | 1024 | 16 | 467.643 | 523.202 | 0.8938 | 15.859 | 15.807 | 1.0033 | 8578.083 | 8577.464 | 1.0001 |
| Kimi-K2.5-MXFP4 TP4 (MET) | 8192 | 1024 | 64 | 1102.470 | 968.481 | 1.1384 | 32.052 | 32.078 | 0.9992 | 17045.570 | 17115.564 | 0.9959 |
| MiniMax-M2.5 TP2 (AW) | 1000 | 100 | 4 | 149.691 | 153.531 | 0.9750 | 11.318 | 11.686 | 0.9685 | 3460.323 | 3355.319 | 1.0313 |
| MiniMax-M2.5 TP2 (AW) | 1000 | 100 | 16 | 315.194 | 329.133 | 0.9577 | 16.801 | 16.217 | 1.0360 | 8784.884 | 9092.237 | 0.9662 |
| MiniMax-M2.5 TP2 (AW) | 1000 | 100 | 64 | 823.167 | 611.189 | 1.3468 | 30.642 | 31.338 | 0.9778 | 18203.173 | 18913.041 | 0.9625 |
| MiniMax-M2.5 TP2 (AW) | 5000 | 500 | 4 | 430.848 | 446.388 | 0.9652 | 12.459 | 13.157 | 0.9469 | 3308.629 | 3136.914 | 1.0547 |
| MiniMax-M2.5 TP2 (AW) | 5000 | 500 | 16 | 997.702 | 1030.166 | 0.9685 | 18.264 | 19.030 | 0.9598 | 8698.623 | 8356.078 | 1.0410 |
| MiniMax-M2.5 TP2 (AW) | 5000 | 500 | 64 | 1694.690 | 1578.686 | 1.0735 | 37.317 | 37.999 | 0.9820 | 17307.591 | 17116.183 | 1.0112 |
| MiniMax-M2.5 TP2 (AW) | 10000 | 1000 | 4 | 782.573 | 771.495 | 1.0144 | 13.966 | 14.705 | 0.9497 | 2985.661 | 2845.276 | 1.0493 |
| MiniMax-M2.5 TP2 (AW) | 10000 | 1000 | 16 | 1739.725 | 1454.144 | 1.1964 | 20.271 | 21.105 | 0.9605 | 8000.188 | 7797.082 | 1.0260 |
| MiniMax-M2.5 TP4 (AW) | 1000 | 100 | 4 | 121.288 | 122.789 | 0.9878 | 10.857 | 11.520 | 0.9424 | 3674.230 | 3480.444 | 1.0557 |
| MiniMax-M2.5 TP4 (AW) | 1000 | 100 | 16 | 241.797 | 224.300 | 1.0780 | 13.774 | 14.128 | 0.9749 | 10856.180 | 10825.633 | 1.0028 |
| MiniMax-M2.5 TP4 (AW) | 1000 | 100 | 64 | 534.281 | 525.831 | 1.0161 | 21.516 | 20.961 | 1.0265 | 26218.130 | 27020.177 | 0.9703 |
| MiniMax-M2.5 TP4 (AW) | 5000 | 500 | 4 | 296.570 | 280.703 | 1.0565 | 11.991 | 12.808 | 0.9362 | 3502.130 | 3296.826 | 1.0623 |
| MiniMax-M2.5 TP4 (AW) | 5000 | 500 | 16 | 625.342 | 628.822 | 0.9945 | 14.734 | 15.773 | 0.9341 | 11023.858 | 10348.110 | 1.0653 |
| MiniMax-M2.5 TP4 (AW) | 5000 | 500 | 64 | 1470.764 | 1047.958 | 1.4035 | 22.999 | 24.840 | 0.9259 | 27149.733 | 26142.966 | 1.0385 |
| MiniMax-M2.5 TP4 (AW) | 10000 | 1000 | 4 | 493.436 | 510.947 | 0.9657 | 13.500 | 14.517 | 0.9299 | 3146.815 | 2930.252 | 1.0739 |
| MiniMax-M2.5 TP4 (AW) | 10000 | 1000 | 16 | 1125.619 | 1052.983 | 1.0690 | 16.319 | 17.679 | 0.9231 | 10093.530 | 9386.688 | 1.0753 |
| gpt-oss-120b TP1 (MET) | 1024 | 1024 | 4 | 221.232 | 64.683 | 3.4202 | 4.400 | 4.366 | 1.0078 | 1683.058 | 1755.842 | 0.9585 |
| gpt-oss-120b TP1 (MET) | 1024 | 1024 | 16 | 84.198 | 113.865 | 0.7395 | 6.064 | 6.442 | 0.9414 | 5118.304 | 4797.600 | 1.0668 |
| gpt-oss-120b TP1 (MET) | 1024 | 1024 | 64 | 133.073 | 506.669 | 0.2626 | 9.413 | 10.319 | 0.9122 | 13048.273 | 11502.707 | 1.1344 |
| gpt-oss-120b TP1 (MET) | 8192 | 1024 | 4 | 155.372 | 124.010 | 1.2529 | 4.551 | 4.495 | 1.0124 | 7440.434 | 7580.865 | 0.9815 |
| gpt-oss-120b TP1 (MET) | 8192 | 1024 | 16 | 238.877 | 259.477 | 0.9206 | 6.585 | 6.640 | 0.9917 | 20423.008 | 20270.075 | 1.0075 |
| gpt-oss-120b TP1 (MET) | 8192 | 1024 | 64 | 479.096 | 523.035 | 0.9160 | 13.410 | 13.486 | 0.9943 | 40648.508 | 40289.726 | 1.0089 |
| gpt-oss-120b TP2 (AW) | 1000 | 100 | 4 | 160.746 | 189.158 | 0.8498 | 4.467 | 4.803 | 0.9302 | 7285.317 | 6616.429 | 1.1011 |
| gpt-oss-120b TP2 (AW) | 1000 | 100 | 16 | 217.329 | 498.099 | 0.4363 | 5.475 | 4.935 | 1.1094 | 23048.770 | 17819.736 | 1.2934 |
| gpt-oss-120b TP2 (AW) | 1000 | 100 | 64 | 987.988 | 1277.777 | 0.7732 | 14.902 | 16.151 | 0.9227 | 28459.595 | 24451.526 | 1.1639 |
| gpt-oss-120b TP2 (AW) | 5000 | 500 | 4 | 191.956 | 215.537 | 0.8906 | 3.661 | 3.836 | 0.9543 | 10892.630 | 10327.397 | 1.0547 |
| gpt-oss-120b TP2 (AW) | 5000 | 500 | 16 | 445.523 | 619.888 | 0.7187 | 4.976 | 5.240 | 0.9496 | 30003.312 | 27191.671 | 1.1034 |
| gpt-oss-120b TP2 (AW) | 5000 | 500 | 64 | 1061.295 | 820.410 | 1.2936 | 9.625 | 10.360 | 0.9290 | 59835.729 | 58685.730 | 1.0196 |
| gpt-oss-120b TP2 (AW) | 10000 | 1000 | 4 | 329.607 | 358.053 | 0.9206 | 3.745 | 3.993 | 0.9378 | 10805.485 | 10119.606 | 1.0678 |
| gpt-oss-120b TP2 (AW) | 10000 | 1000 | 16 | 815.211 | 779.199 | 1.0462 | 5.072 | 5.606 | 0.9048 | 29862.303 | 27573.939 | 1.0830 |
| gpt-oss-120b TP8 (AW) | 1000 | 100 | 4 | 200.069 | 222.561 | 0.8989 | 3.304 | 3.554 | 0.9298 | 8330.988 | 7655.160 | 1.0883 |
| gpt-oss-120b TP8 (AW) | 1000 | 100 | 16 | 133.407 | 469.634 | 0.2841 | 3.703 | 3.577 | 1.0351 | 34953.427 | 21320.936 | 1.6394 |
| gpt-oss-120b TP8 (AW) | 1000 | 100 | 64 | 220.301 | 420.512 | 0.5239 | 6.500 | 6.483 | 1.0027 | 79499.313 | 65494.306 | 1.2138 |
| gpt-oss-120b TP8 (AW) | 5000 | 500 | 4 | 137.229 | 284.586 | 0.4822 | 2.941 | 3.022 | 0.9732 | 13671.523 | 12271.706 | 1.1141 |
| gpt-oss-120b TP8 (AW) | 5000 | 500 | 16 | 237.306 | 272.157 | 0.8719 | 3.819 | 4.244 | 0.8997 | 40726.704 | 36493.533 | 1.1160 |
| gpt-oss-120b TP8 (AW) | 5000 | 500 | 64 | 681.872 | 708.788 | 0.9620 | 5.712 | 6.229 | 0.9170 | 98719.189 | 91813.936 | 1.0752 |
| gpt-oss-120b TP8 (AW) | 10000 | 1000 | 4 | 193.068 | 258.147 | 0.7479 | 3.058 | 3.364 | 0.9091 | 13522.024 | 12158.427 | 1.1122 |
| gpt-oss-120b TP8 (AW) | 10000 | 1000 | 16 | 445.195 | 395.815 | 1.1248 | 3.881 | 4.167 | 0.9314 | 40225.783 | 38443.004 | 1.0464 |
| Qwen3-Next FP8 TP1 (AW) | 1000 | 100 | 4 | 118.397 | 134.602 | 0.8796 | 5.221 | 5.926 | 0.8811 | 6915.721 | 6097.421 | 1.1342 |
| Qwen3-Next FP8 TP1 (AW) | 1000 | 100 | 16 | 245.085 | 309.724 | 0.7913 | 7.625 | 7.653 | 0.9962 | 17532.344 | 16421.825 | 1.0676 |
| Qwen3-Next FP8 TP1 (AW) | 1000 | 100 | 64 | 611.394 | 664.342 | 0.9203 | 14.527 | 13.732 | 1.0579 | 33895.313 | 34572.320 | 0.9804 |
| Qwen3-Next FP8 TP1 (AW) | 5000 | 500 | 4 | 286.613 | 267.322 | 1.0722 | 5.276 | 6.162 | 0.8561 | 7533.591 | 6581.901 | 1.1446 |
| Qwen3-Next FP8 TP1 (AW) | 5000 | 500 | 16 | 808.739 | 780.316 | 1.0364 | 7.259 | 8.036 | 0.9032 | 19847.769 | 18363.911 | 1.0808 |
| Qwen3-Next FP8 TP1 (AW) | 5000 | 500 | 64 | 2278.998 | 1959.056 | 1.1633 | 14.480 | 15.738 | 0.9200 | 36963.863 | 35846.447 | 1.0312 |
| Qwen3-Next FP8 TP1 (AW) | 10000 | 1000 | 4 | 521.919 | 526.609 | 0.9911 | 5.482 | 6.368 | 0.8609 | 7333.530 | 6387.626 | 1.1481 |
| Qwen3-Next FP8 TP1 (AW) | 10000 | 1000 | 16 | 1398.791 | 1458.320 | 0.9592 | 7.632 | 8.124 | 0.9395 | 19491.285 | 18378.452 | 1.0606 |
| Qwen3-Next FP8 TP1 (MET) | 1024 | 1024 | 4 | 88.832 | 70.514 | 1.2598 | 5.395 | 5.213 | 1.0350 | 1413.593 | 1467.754 | 0.9631 |
| Qwen3-Next FP8 TP1 (MET) | 1024 | 1024 | 16 | 108.621 | 104.135 | 1.0431 | 7.613 | 6.689 | 1.1380 | 4074.062 | 4600.737 | 0.8855 |
| Qwen3-Next FP8 TP1 (MET) | 1024 | 1024 | 64 | 173.022 | 159.059 | 1.0878 | 13.061 | 10.923 | 1.1957 | 9420.281 | 9462.649 | 0.996 |
| Qwen3-Next FP8 TP1 (MET) | 8192 | 1024 | 4 | 190.233 | 157.673 | 1.2065 | 5.560 | 5.549 | 1.0020 | 6085.826 | 6114.170 | 0.9954 |
| Qwen3-Next FP8 TP1 (MET) | 8192 | 1024 | 16 | 307.567 | 262.706 | 1.1708 | 8.252 | 7.531 | 1.0957 | 16274.086 | 17778.607 | 0.9154 |
| Qwen3-Next FP8 TP1 (MET) | 8192 | 1024 | 64 | 647.506 | 554.209 | 1.1683 | 16.996 | 15.648 | 1.0862 | 32020.956 | 31777.138 | 1.0207 |
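
The ratio columns in the performance table are not defined in the PR text. The numbers are consistent with each ratio being the target value divided by the ref value; that reading is an inference, not something the author states. Under it, a TTFT or TPOT ratio below 1 means the target is faster, and a tok/s ratio above 1 means the target has higher throughput. A minimal sketch that recomputes the ratios for the first row (`PerfRow` and `ratios` are illustrative names, not part of ATOM):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PerfRow:
    """One benchmark row; field names are hypothetical, values come from the table."""
    target_ttft_ms: float
    ref_ttft_ms: float
    target_tpot_ms: float
    ref_tpot_ms: float
    target_tok_s: float
    ref_tok_s: float


def ratios(row: PerfRow) -> dict:
    """Assumed convention: every ratio is target / ref."""
    return {
        "ttft": row.target_ttft_ms / row.ref_ttft_ms,
        "tpot": row.target_tpot_ms / row.ref_tpot_ms,
        "tok_s": row.target_tok_s / row.ref_tok_s,
    }


# First row: DeepSeek-V3.2 FP8 TP8 (AW), ISL 1000, OSL 100, C 4.
first_row = PerfRow(191.810, 165.205, 13.549, 14.188, 2866.011, 2799.385)
r = ratios(first_row)
print({name: round(value, 4) for name, value in r.items()})
```

The recomputed values agree with the table to about three decimal places; small differences in the last digit are expected if the table was produced from unrounded measurements.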

P0 atom-vLLM Performance Regression Check

Branch head: 776e0b3
Image: docker.io/rocm/atom-dev:vllm-v0.19.0-nightly_20260526

| Model | Case | Target TPOT ms | Native TPOT ms | TPOT Ratio | Target total token/s | Native total token/s | Token/s Ratio |
|---|---|---|---|---|---|---|---|
| DeepSeek-V3.2 FP8 MTP TP4 (AW) | 1k/100 con4 | 10.137 | 9.912 | 0.978x | 3486.740 | 3505.564 | 0.995x |
| DeepSeek-V3.2 FP8 MTP TP4 (AW) | 1k/100 con8 | 13.554 | 13.283 | 0.980x | 5465.788 | 5418.222 | 1.009x |
| DeepSeek-V3.2 FP8 TP4 (AW) | 1k/100 con4 | 14.816 | 15.034 | 1.015x | 2679.705 | 2631.569 | 1.018x |
| DeepSeek-V3.2 FP8 TP4 (AW) | 1k/100 con8 | 19.873 | 20.074 | 1.010x | 3993.358 | 3980.111 | 1.003x |
| Qwen3-Next-80B-A3B-Instruct-FP8 TP1 (MET) | 1k/100 con4 | 6.239 | 6.693 | 1.073x | 5773.805 | 5672.996 | 1.018x |
| Qwen3-Next-80B-A3B-Instruct-FP8 TP1 (MET) | 1k/100 con8 | 8.583 | 8.659 | 1.009x | 8983.642 | 8869.306 | 1.013x |
| gpt-oss-120b TP1 (MET) | 1k/100 con4 | 5.585 | 5.408 | 0.968x | 5844.730 | 5769.323 | 1.013x |
| gpt-oss-120b TP1 (MET) | 1k/100 con8 | 6.593 | 7.202 | 1.092x | 11695.984 | 10529.120 | 1.111x |

Remaining perf test

atom-vllm

  • Qwen3-Next FP8 TP1 (MET) 1k/1k con64 - no regression
  • Qwen3-Next FP8 TP1 (MET) 8k/1k con16 - no regression

File Layout:

| File | Responsibility |
|---|---|
| `backend.py` | Backend descriptors (MHA, MLA, SparseMLA, SparseIndexer, GDN); each maps to its builder and impl class |
| `layer.py` | Factory: dispatches to MHA/MLA/SparseMLA layer based on model config |
| `layer_common.py` | Shared helper: registers layer into vLLM static_forward_context |
| `layer_mha.py` | AttentionForVllmMHA: MHA attention layer (forward, KV cache, scales) |
| `layer_mla.py` | AttentionForVllmMLA: MLA attention layer + helper functions (reorg_kvcache, triton BMM wrappers) |
| `layer_sparse_mla.py` | Sparse MLA: indexer/cache decorators, sparse seqlen triton kernel |
| `layer_gdn.py` | GatedDeltaNet attention layer |
| `metadata.py` | All metadata dataclasses + builders (MHA, MLA, SparseMLA, SparseIndexer) |
| `ops.py` | torch.compile custom ops (atom_vllm_mha_attention, atom_vllm_mla_attention) |
| `__init__.py` | Package init |

Code:

| Category | OLD file | OLD name | NEW file | NEW name |
|---|---|---|---|---|
| MHA impl | atom/plugin/attention_mha.py | PagedAttentionImplDecoratorForPluginMode | atom/plugin/vllm/attention/layer_mha.py | AttentionForVllmMHA |
| MLA impl (dense) | atom/plugin/attention_mla.py | MLAAttentionImplDecoratorForPluginMode | atom/plugin/vllm/attention/layer_mla.py | AttentionForVllmMLA |
| MLA impl (sparse) | atom/plugin/attention_mla_sparse.py | MLASparseAttentionImplDecoratorForPluginMode | atom/plugin/vllm/attention/layer_mla.py | AttentionForVllmSparseMLA |
| GDN impl | atom/plugin/vllm/attention_backend/attention_gdn.py | GatedDeltaNet | atom/plugin/vllm/attention/layer_gdn.py | GatedDeltaNet |
| Entry factory | atom/model_ops/paged_attention.py (vllm branch in PagedAttention) | PagedAttention | atom/plugin/vllm/attention/layer.py | AttentionForVllm |
| MHA backend | atom/model_ops/attentions/aiter_attention.py | AiterBackend (reused native) | atom/plugin/vllm/attention/backend.py | AiterMhaBackendForVllm |
| MLA backend (dense) | atom/model_ops/attentions/aiter_mla.py | AiterMLABackend (reused native) | atom/plugin/vllm/attention/backend.py | AiterMlaBackendForVllm |
| MLA backend (sparse) | atom/plugin/vllm/attention_backend/mla_sparse.py | AiterMLASparseBackend | atom/plugin/vllm/attention/backend.py | AiterSparseMlaBackendForVllm |
| MLA backend (sparse indexer) | atom/plugin/vllm/attention_backend/mla_sparse.py | AiterMLASparseIndexerBackend | atom/plugin/vllm/attention/backend.py | AiterSparseMlaIndexerBackendForVllm |
| GDN backend | atom/plugin/vllm/attention_backend/gdn_attn.py | GDNAttentionBackend | atom/plugin/vllm/attention/backend.py | GDNAttentionBackend |
| MHA metadata | atom/plugin/attention.py | AiterFlashAttentionMetadataForPluginMode | atom/plugin/vllm/attention/metadata.py | AiterMhaMetadataForVllm |
| MHA phase metadata | atom/plugin/attention.py | AiterFlashAttentionPhaseMetadata | atom/plugin/vllm/attention/metadata.py | AiterMhaPhaseMetadata |
| MHA chunk-prefill metadata | atom/plugin/attention.py | AiterFlashAttentionChunkPrefillMetadata | atom/plugin/vllm/attention/metadata.py | AiterChunkPrefillMetadata |
| MLA metadata | atom/plugin/attention.py | AiterMLACommonMetadataForPluginMode | atom/plugin/vllm/attention/metadata.py | AiterMlaMetadataForVllm |
| MLA decode metadata | atom/plugin/attention.py | AiterMLADecodeMetadataForPluginMode | atom/plugin/vllm/attention/metadata.py | AiterMlaDecodeMetadataForVllm |
| MLA prefill metadata | atom/plugin/attention.py | AiterMLACommonPrefillMetadataForPluginMode | atom/plugin/vllm/attention/metadata.py | AiterMlaPrefillMetadataForVllm |
| MLA chunked-context metadata | atom/plugin/attention.py | AiterMLAChunkedContextMetadataForPluginMode | atom/plugin/vllm/attention/metadata.py | AiterMlaChunkedContextMetadataForVllm |
| MLA sparse metadata | atom/plugin/attention.py | AiterMLASparseMetadataForPluginMode | atom/plugin/vllm/attention/metadata.py | AiterMlaSparseMetadataForVllm |
| Sparse indexer metadata | atom/plugin/attention.py | vllmDeepseekV32IndexerMetadata | atom/plugin/vllm/attention/metadata.py | AiterMlaSparseIndexerMetadataForVllm |
| MHA builder | atom/plugin/attention.py | vllmAttentionMetadataBuilderMethods | atom/plugin/vllm/attention/metadata.py | AiterMhaMetadataBuilderForVllm |
| MLA builder | atom/plugin/attention.py | vllmMLAAttentionMetadataBuilderMethods | atom/plugin/vllm/attention/metadata.py | AiterMlaMetadataBuilderForVllm |
| MLA sparse builder | atom/plugin/attention.py | vllmMLASparseAttentionMetadataBuilderMethods | atom/plugin/vllm/attention/metadata.py | AiterMlaSparseMetadataBuilder |
| MLA sparse indexer builder | atom/plugin/attention.py | vllmMLASparseIndexerAttentionMetadataBuilderMethods | atom/plugin/vllm/attention/metadata.py | AiterMlaSparseIndexerMetadataBuilder |
| MHA forward dispatch | atom/plugin/attention.py | unified_attention_with_output_base_for_plugin_mode | atom/plugin/vllm/attention/ops.py | torch.ops.aiter.atom_vllm_mha_attention |
| MLA forward dispatch | atom/plugin/attention.py | unified_attention_with_output_base_for_plugin_mode | atom/plugin/vllm/attention/ops.py | torch.ops.aiter.atom_vllm_mla_attention |
| Indexer decorator | atom/plugin/attention_mla_sparse.py | IndexerDecoratorForPluginMode | atom/plugin/vllm/attention/layer_sparse_mla.py | IndexerDecoratorForPluginMode |
| Indexer cache decorator | atom/plugin/attention_mla_sparse.py | DeepseekV32IndexerCacheDecoratorForPluginMode | atom/plugin/vllm/attention/layer_sparse_mla.py | DeepseekV32IndexerCacheDecoratorForPluginMode |
| MoE decorator | atom/plugin/moe.py | FusedMoEDecoratorForPluginMode | atom/plugin/vllm/moe.py | FusedMoEDecoratorForPluginMode |
| vLLM MLA patch | atom/plugin/vllm/mla_patch.py | patch_vllm_mla_attention | removed (no longer needed) | |

@zejunchen-zejun zejunchen-zejun force-pushed the zejun/refact_attn_0511 branch from 3b061e9 to b832aee Compare May 15, 2026 05:30
@zejunchen-zejun zejunchen-zejun changed the title [feat][Attention Refactor] Reconstruct the Attention arch [feat][ATOM-vLLM][Attention Refactor] Reconstruct the Attention arch May 15, 2026
@zejunchen-zejun zejunchen-zejun changed the title [feat][ATOM-vLLM][Attention Refactor] Reconstruct the Attention arch [feat][ATOM-vLLM][Attention Refactor] Reconstruct the Attention Arch May 15, 2026
Comment thread atom/model_ops/base_attention.py
@wuhuikx wuhuikx requested review from ganyi1996ppo and whx-sjtu May 19, 2026 12:09
Collaborator

wuhuikx commented May 19, 2026

@zejunchen-zejun Please resolve the conflict. Is this PR ready for review?

@zejunchen-zejun zejunchen-zejun force-pushed the zejun/refact_attn_0511 branch from 5ca6fbc to 174c96e Compare May 19, 2026 13:30
@zejunchen-zejun zejunchen-zejun changed the title [feat][ATOM-vLLM][Attention Refactor] Reconstruct the Attention Arch [Refactor][ATOM-vLLM][Attention] Refactor ATOM-vLLM Attention May 20, 2026
@zejunchen-zejun zejunchen-zejun marked this pull request as ready for review May 20, 2026 04:58
Copilot AI review requested due to automatic review settings May 20, 2026 04:58
Contributor

Copilot AI left a comment

Pull request overview

This PR refactors ATOM’s attention integration to clearly separate native ATOM, ATOM-vLLM plugin, and ATOM-SGLang plugin attention paths. It removes decorator/monkey-patch driven behavior in favor of explicit mode dispatch + vLLM-owned attention-layer implementations, and drops the now-unsupportable “disable only attention” fallback flag.

Changes:

  • Introduces a frontend Attention dispatcher (atom.model_ops.base_attention.Attention) that selects the correct attention implementation per runtime mode (native / vLLM / SGLang).
  • Replaces prior vLLM attention patching/decorators with a dedicated atom/plugin/vllm/attention/ stack (layers, backends, metadata, custom ops) and removes ATOM_DISABLE_VLLM_PLUGIN_ATTENTION + PluginConfig.vllm_use_atom_attention.
  • Updates tests and documentation/recipes to match the new plugin behavior and env-flag semantics.
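
The mode dispatch described in the first bullet can be pictured as a lookup from runtime mode to implementation class. The sketch below is only an illustration of that idea, not the code in `atom/model_ops/base_attention.py`: `RuntimeMode`, `NativeAttention`, and `make_attention` are hypothetical names, and the three classes are empty stand-ins that share names with (but are not) the real `AttentionForVllm` and `AttentionForSGLang`.

```python
from enum import Enum


class RuntimeMode(Enum):
    """Hypothetical enum for the three runtime modes named in the overview."""
    NATIVE = "native"
    VLLM = "vllm"
    SGLANG = "sglang"


class _AttentionStub:
    """Stand-in base class; the real layers take many more arguments."""
    def __init__(self, num_heads: int):
        self.num_heads = num_heads


class NativeAttention(_AttentionStub):
    pass


class AttentionForVllm(_AttentionStub):
    pass


class AttentionForSGLang(_AttentionStub):
    pass


# Explicit table instead of decorators or monkey-patching: one entry per mode.
_IMPLS = {
    RuntimeMode.NATIVE: NativeAttention,
    RuntimeMode.VLLM: AttentionForVllm,
    RuntimeMode.SGLANG: AttentionForSGLang,
}


def make_attention(mode, **kwargs):
    """Return the attention implementation registered for the given mode."""
    try:
        impl = _IMPLS[mode]
    except KeyError:
        raise ValueError(f"unsupported runtime mode: {mode!r}") from None
    return impl(**kwargs)


layer = make_attention(RuntimeMode.VLLM, num_heads=8)
print(type(layer).__name__)
```

Keeping the mapping in one table makes the selection visible at the call site, which is the property the overview contrasts with the earlier decorator-driven behavior.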

Reviewed changes

Copilot reviewed 46 out of 47 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| `tests/test_envs.py` | Removes the deprecated ATOM_DISABLE_VLLM_PLUGIN_ATTENTION env var expectations/tests. |
| `tests/plugin/test_plugin_env_flags.py` | Simplifies plugin-disable behavior test to only cover ATOM_DISABLE_VLLM_PLUGIN. |
| `tests/plugin/test_plugin_config_translation.py` | Removes translation expectations tied to vllm_use_atom_attention. |
| `recipes/atom_vllm/Qwen3Next.md` | Documents the new “no attention-only disable” behavior (note: currently placed inside a bash block). |
| `recipes/atom_vllm/Qwen3.5.md` | Adds guidance about full plugin disable vs attention-only disable. |
| `recipes/atom_vllm/Llama.md` | Removes usage/docs of the removed attention-only disable flag. |
| `pyproject.toml` | Removes obsolete entry-point comment referencing the removed flag. |
| `docs/vllm_plugin_backend_guide.md` | Updates lifecycle/architecture docs to reflect new vLLM attention layers/backends layout and semantics. |
| `docs/rfc_attention_refactor.md` | Adds an RFC-style doc describing the refactor motivation and architecture. |
| `docs/rfc_attention_refactor_atom_vllm_sglang.md` | Adds an alternate RFC doc covering the same refactor at a high level. |
| `docs/review_comment_for_attn_refactor.md` | Adds an internal review notes document for the refactor. |
| `docs/environment_variables.md` | Removes the deprecated attention-only disable env var from the catalog. |
| `docs/atom_vllm_attention_refactor_plan.md` | Adds a refactor planning/architecture document (CN). |
| `docs/atom_vllm_attention_architecture_analysis.md` | Adds a detailed analysis document of old vs new attention architecture (CN). |
| `atom/utils/forward_context.py` | Removes plugin-only plugin_metadata plumbing from AttentionMetaData. |
| `atom/utils/envs.py` | Removes parsing of ATOM_DISABLE_VLLM_PLUGIN_ATTENTION. |
| `atom/plugin/vllm/register.py` | Removes MLA patching hook and attention-only disable handling. |
| `atom/plugin/vllm/platform.py` | Stops overriding vLLM attention backend selection; documents that attention is owned by ATOM vLLM layers. |
| `atom/plugin/vllm/moe.py` | Moves vLLM-only MoE naming adaptation into atom.plugin.vllm. |
| `atom/plugin/vllm/mla_patch.py` | Deletes legacy vLLM MLA patching module. |
| `atom/plugin/vllm/attention/ops.py` | Adds ATOM-owned vLLM custom ops (atom_vllm_mha_attention, atom_vllm_mla_attention) and marks them as splitting ops. |
| `atom/plugin/vllm/attention/mla_sparse_impl.py` | Removes legacy plugin-metadata assumptions and aligns sparse indexer path with new metadata types/backends. |
| `atom/plugin/vllm/attention/mla_impl.py` | Adds MLA helper utilities (e.g., fused GEMM imports, reorg_kvcache). |
| `atom/plugin/vllm/attention/metadata.py` | Adds vLLM-specific metadata dataclasses and helper utilities for MHA/MLA/sparse/indexer. |
| `atom/plugin/vllm/attention/layer.py` | Adds AttentionForVllm factory and ensures custom ops are registered via import side-effect. |
| `atom/plugin/vllm/attention/layer_mla.py` | Implements vLLM MLA layer(s) using native MLAAttention for weight processing + vLLM AttentionLayerBase contract for execution. |
| `atom/plugin/vllm/attention/layer_mha.py` | Implements vLLM MHA layer implementing AttentionLayerBase, using ATOM kernels + ATOM custom ops. |
| `atom/plugin/vllm/attention/layer_common.py` | Adds shared vLLM layer helpers (kv-cache dtype init, static context registration, default scale init). |
| `atom/plugin/vllm/attention/__init__.py` | Adds package docstring; avoids importing heavy submodules by default. |
| `atom/plugin/vllm/attention_backend/mla_sparse.py` | Refactors sparse MLA backends/builders to explicit vLLM-facing classes and metadata builders. |
| `atom/plugin/sglang/attention.py` | Adds AttentionForSGLang wrapper as the SGLang attention entrypoint for the dispatcher. |
| `atom/plugin/register.py` | Makes set_attn_cls() a compatibility no-op (logs only); attention selection now happens in the dispatcher. |
| `atom/plugin/config.py` | Removes vllm_use_atom_attention from PluginConfig and translation. |
| `atom/models/deepseek_v2.py` | Updates imports to point at new sparse MLA/indexer integration location. |
| `atom/model_ops/paged_attention.py` | Renames native attention layer to Attention and asserts it’s not instantiated in plugin mode. |
| `atom/model_ops/moe.py` | Updates import path for the vLLM-only MoE decorator. |
| `atom/model_ops/base_attention.py` | Adds mode-dispatching Attention constructor; adds wrappers for PA kernels; removes redundant layer= passing. |
| `atom/model_ops/attentions/triton_mha.py` | Removes dependency on mutable atom.model_ops.Attention; always returns PagedAttentionImpl. |
| `atom/model_ops/attentions/aiter_mla.py` | Removes plugin decorators/branching; restores native-only backend naming and builder behavior. |
| `atom/model_ops/attentions/aiter_attention.py` | Removes plugin decorators/branching; simplifies impl selection; uses proper super(). |
| `atom/model_ops/attention_mla.py` | Removes plugin-mode decorator injection and plugin forward branching; consolidates on native forward_impl. |
| `atom/model_ops/attention_mha.py` | Removes plugin-mode decorator injection and plugin branching; consolidates on native forward_impl. |
| `atom/model_ops/__init__.py` | Stops exporting mutable attention symbols; exports only the frontend dispatcher Attention. |
| `atom/config.py` | Adds ATOM vLLM attention ops to the default splitting ops list. |
| `.claude/commands/atom-vllm-benchmark-guide.md` | Updates benchmark guide to remove references to the deprecated attention-only disable flag. |
Comments suppressed due to low confidence (1)

atom/plugin/vllm/attention/layer_mha.py:896

  • get_kv_cache_spec() always returns SlidingWindowSpec because self.sliding_window is set to -1 when per_layer_sliding_window is None, and the check only tests is not None. This can generate an invalid/meaningless sliding-window KV cache spec for the default case. Consider storing None when sliding window is disabled, or checking self.sliding_window > 0 (or != -1) before returning SlidingWindowSpec.

Comment thread recipes/atom_vllm/Qwen3Next.md
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 41 out of 42 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

atom/plugin/vllm/attention/layer_mha.py:896

  • self.sliding_window is initialized to -1 when sliding window is disabled, but get_kv_cache_spec() checks only is not None, so it will always return SlidingWindowSpec (including for sliding_window=-1). This likely misinforms vLLM KV-cache allocation for non-sliding-window models. Treat -1 (and possibly None) as the disabled case and return FullAttentionSpec then.
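
The check suggested in this comment can be sketched as follows. This is a minimal illustration under stated assumptions, not the fix applied in `layer_mha.py`: `FullAttentionSpec` and `SlidingWindowSpec` below are simplified stand-in dataclasses that only borrow the names used in the comment (the real vLLM specs carry more fields), and `select_kv_cache_spec` is a hypothetical helper.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class FullAttentionSpec:
    """Stand-in for a full-attention KV cache spec (illustrative only)."""
    block_size: int


@dataclass(frozen=True)
class SlidingWindowSpec:
    """Stand-in for a sliding-window KV cache spec (illustrative only)."""
    block_size: int
    sliding_window: int


def select_kv_cache_spec(
    block_size: int, sliding_window: Optional[int]
) -> Union[FullAttentionSpec, SlidingWindowSpec]:
    """Treat None and non-positive values (such as -1) as 'sliding window disabled'."""
    if sliding_window is None or sliding_window <= 0:
        return FullAttentionSpec(block_size=block_size)
    return SlidingWindowSpec(block_size=block_size, sliding_window=sliding_window)


# -1 is the sentinel the review comment says is stored when the window is disabled.
print(select_kv_cache_spec(16, -1))
print(select_kv_cache_spec(16, 4096))
```

An `is not None` test alone would send the `-1` sentinel down the sliding-window branch, which is the behavior the comment flags.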

Comment thread atom/plugin/vllm/attention/backend.py Outdated
Comment thread atom/plugin/vllm/attention/metadata.py
Comment thread atom/plugin/vllm/attention/ops.py Outdated
Collaborator

wuhuikx commented May 21, 2026

I want to hold off on this and merge DS-V3.2-MTP and GLM-4.7-MTP first; I'm afraid there will be conflicts.

Comment thread atom/plugin/vllm/attention/backend.py Outdated
@whx-sjtu
Contributor

Can we also refactor the atom config part? I think it's better to make set_current_atom_config a contextmanager and get_current_atom_config can only take effect inside the context created by set_current_atom_config, just like vLLM. Then we don't have to pass atom_config into forward_context anymore, which looks really ugly. @zejunchen-zejun
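
One way to read this proposal is a context manager backed by a context variable, so that the getter only works inside an active `with` block. The sketch below is an assumption-laden illustration, not ATOM's or vLLM's implementation: only the two function names come from the comment, while the `ContextVar` storage, the `_UNSET` sentinel, and the `RuntimeError` are choices made here for the example.

```python
from contextlib import contextmanager
from contextvars import ContextVar

# Sentinel so that "no config set" is distinguishable from a config of None.
_UNSET = object()
_current_atom_config = ContextVar("current_atom_config", default=_UNSET)


@contextmanager
def set_current_atom_config(config):
    """Make `config` the current config for the duration of the with-block."""
    token = _current_atom_config.set(config)
    try:
        yield config
    finally:
        # Restore whatever was active before, which also supports nesting
        # (for example a main model config around a draft model config).
        _current_atom_config.reset(token)


def get_current_atom_config():
    """Return the active config; fail loudly outside any with-block."""
    config = _current_atom_config.get()
    if config is _UNSET:
        raise RuntimeError(
            "get_current_atom_config() called outside set_current_atom_config()"
        )
    return config


main_cfg = {"role": "main"}    # placeholder objects standing in for real configs
draft_cfg = {"role": "draft"}
with set_current_atom_config(main_cfg):
    with set_current_atom_config(draft_cfg):
        inner = get_current_atom_config()
    outer = get_current_atom_config()
print(inner["role"], outer["role"])
```

With this shape, code that needs the config calls the getter instead of receiving it through the forward context, and a call made outside any block raises instead of silently reading stale global state.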

Comment thread atom/model_ops/base_attention.py
@zejunchen-zejun zejunchen-zejun marked this pull request as draft May 22, 2026 02:38
Collaborator Author

zejunchen-zejun commented May 22, 2026

> Can we also refactor the atom config part? I think it's better to make set_current_atom_config a contextmanager and get_current_atom_config can only take effect inside the context created by set_current_atom_config, just like vLLM. Then we don't have to pass atom_config into forward_context anymore, which looks really ugly. @zejunchen-zejun

Makes perfect sense. The current atom-vllm config has two risky points:

  1. the config is fetched from a global stateful singleton, for both the main model and the draft model
  2. the global atom-vllm config is passed through the forward context

I agree to refactor the atom-vllm config using a with-context, but it will not happen in this PR. This PR should focus on the attention refactor; the atom-vllm config can be refactored in the next PR.

Commit messages (each signed off by zejunchen-zejun <zejun.chen@amd.com>):

  • remove legacy code and comment; add FIXME for one legacy method used by atom-sgl
  • move gdn into atom/plugin/vllm/attention; remove folder atom/plugin/vllm/attention_backend
  • … as atom-vllm doesn't need it for now
  • fix missing kv dtype for sparse MLA
  • remove the multi inheritance and inline the attention methods
  • … instead of deprecating it

@zejunchen-zejun
Collaborator Author

ATOM native GPTOSS: accuracy test: 0.8749052312357847 < 0.88
ATOM native Qwen3.5: accuracy test: 0.8233510235026535 < 0.835

@zejunchen-zejun zejunchen-zejun merged commit f03f845 into main Jun 1, 2026
54 of 65 checks passed
@zejunchen-zejun zejunchen-zejun deleted the zejun/refact_attn_0511 branch June 1, 2026 13:42
10 participants