[Refactor][ATOM-vLLM][Attention] Refactor ATOM-vLLM Attention#750
@zejunchen-zejun Please resolve the conflict. Is this PR ready for review?
Pull request overview
This PR refactors ATOM’s attention integration to clearly separate native ATOM, ATOM-vLLM plugin, and ATOM-SGLang plugin attention paths. It removes decorator/monkey-patch driven behavior in favor of explicit mode dispatch + vLLM-owned attention-layer implementations, and drops the now-unsupportable “disable only attention” fallback flag.
Changes:
- Introduces a frontend Attention dispatcher (atom.model_ops.base_attention.Attention) that selects the correct attention implementation per runtime mode (native / vLLM / SGLang).
- Replaces prior vLLM attention patching/decorators with a dedicated atom/plugin/vllm/attention/ stack (layers, backends, metadata, custom ops) and removes ATOM_DISABLE_VLLM_PLUGIN_ATTENTION + PluginConfig.vllm_use_atom_attention.
- Updates tests and documentation/recipes to match the new plugin behavior and env-flag semantics.
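The explicit mode dispatch described in the first bullet can be pictured with a minimal sketch. Everything here is hypothetical: `RuntimeMode`, the stand-in classes, and passing the mode as a keyword argument are assumptions for illustration, not the PR's actual signatures (the PR does not show how the runtime mode is detected).

```python
from enum import Enum


class RuntimeMode(Enum):
    """Hypothetical enum for the three runtime modes named in the overview."""
    NATIVE = "native"
    VLLM = "vllm"
    SGLANG = "sglang"


class _AttentionBase:
    """Stand-in base class; the real layers take many more arguments."""

    def __init__(self, num_heads: int, head_dim: int) -> None:
        self.num_heads = num_heads
        self.head_dim = head_dim


class NativeAttention(_AttentionBase):
    """Stand-in for the native ATOM attention layer."""


class AttentionForVllm(_AttentionBase):
    """Stand-in for the vLLM plugin attention layer."""


class AttentionForSGLang(_AttentionBase):
    """Stand-in for the SGLang plugin attention wrapper."""


# One explicit table instead of decorators or monkey-patching.
_IMPLEMENTATIONS = {
    RuntimeMode.NATIVE: NativeAttention,
    RuntimeMode.VLLM: AttentionForVllm,
    RuntimeMode.SGLANG: AttentionForSGLang,
}


def Attention(num_heads: int, head_dim: int, *, mode: RuntimeMode) -> _AttentionBase:
    """Frontend dispatcher: build the implementation that matches `mode`."""
    if mode not in _IMPLEMENTATIONS:
        raise ValueError(f"unsupported runtime mode: {mode!r}")
    return _IMPLEMENTATIONS[mode](num_heads, head_dim)
```

With this shape, `Attention(8, 128, mode=RuntimeMode.VLLM)` returns an `AttentionForVllm` instance, and no module-level symbol has to be mutated at plugin registration time.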
Reviewed changes
Copilot reviewed 46 out of 47 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/test_envs.py | Removes the deprecated ATOM_DISABLE_VLLM_PLUGIN_ATTENTION env var expectations/tests. |
| tests/plugin/test_plugin_env_flags.py | Simplifies plugin-disable behavior test to only cover ATOM_DISABLE_VLLM_PLUGIN. |
| tests/plugin/test_plugin_config_translation.py | Removes translation expectations tied to vllm_use_atom_attention. |
| recipes/atom_vllm/Qwen3Next.md | Documents the new “no attention-only disable” behavior (note: currently placed inside a bash block). |
| recipes/atom_vllm/Qwen3.5.md | Adds guidance about full plugin disable vs attention-only disable. |
| recipes/atom_vllm/Llama.md | Removes usage/docs of the removed attention-only disable flag. |
| pyproject.toml | Removes obsolete entry-point comment referencing the removed flag. |
| docs/vllm_plugin_backend_guide.md | Updates lifecycle/architecture docs to reflect new vLLM attention layers/backends layout and semantics. |
| docs/rfc_attention_refactor.md | Adds an RFC-style doc describing the refactor motivation and architecture. |
| docs/rfc_attention_refactor_atom_vllm_sglang.md | Adds an alternate RFC doc covering the same refactor at a high level. |
| docs/review_comment_for_attn_refactor.md | Adds an internal review notes document for the refactor. |
| docs/environment_variables.md | Removes the deprecated attention-only disable env var from the catalog. |
| docs/atom_vllm_attention_refactor_plan.md | Adds a refactor planning/architecture document (CN). |
| docs/atom_vllm_attention_architecture_analysis.md | Adds a detailed analysis document of old vs new attention architecture (CN). |
| atom/utils/forward_context.py | Removes plugin-only plugin_metadata plumbing from AttentionMetaData. |
| atom/utils/envs.py | Removes parsing of ATOM_DISABLE_VLLM_PLUGIN_ATTENTION. |
| atom/plugin/vllm/register.py | Removes MLA patching hook and attention-only disable handling. |
| atom/plugin/vllm/platform.py | Stops overriding vLLM attention backend selection; documents that attention is owned by ATOM vLLM layers. |
| atom/plugin/vllm/moe.py | Moves vLLM-only MoE naming adaptation into atom.plugin.vllm. |
| atom/plugin/vllm/mla_patch.py | Deletes legacy vLLM MLA patching module. |
| atom/plugin/vllm/attention/ops.py | Adds ATOM-owned vLLM custom ops (atom_vllm_mha_attention, atom_vllm_mla_attention) and marks them as splitting ops. |
| atom/plugin/vllm/attention/mla_sparse_impl.py | Removes legacy plugin-metadata assumptions and aligns sparse indexer path with new metadata types/backends. |
| atom/plugin/vllm/attention/mla_impl.py | Adds MLA helper utilities (e.g., fused GEMM imports, reorg_kvcache). |
| atom/plugin/vllm/attention/metadata.py | Adds vLLM-specific metadata dataclasses and helper utilities for MHA/MLA/sparse/indexer. |
| atom/plugin/vllm/attention/layer.py | Adds AttentionForVllm factory and ensures custom ops are registered via import side-effect. |
| atom/plugin/vllm/attention/layer_mla.py | Implements vLLM MLA layer(s) using native MLAAttention for weight processing + vLLM AttentionLayerBase contract for execution. |
| atom/plugin/vllm/attention/layer_mha.py | Implements vLLM MHA layer implementing AttentionLayerBase, using ATOM kernels + ATOM custom ops. |
| atom/plugin/vllm/attention/layer_common.py | Adds shared vLLM layer helpers (kv-cache dtype init, static context registration, default scale init). |
| atom/plugin/vllm/attention/__init__.py | Adds package docstring; avoids importing heavy submodules by default. |
| atom/plugin/vllm/attention_backend/mla_sparse.py | Refactors sparse MLA backends/builders to explicit vLLM-facing classes and metadata builders. |
| atom/plugin/sglang/attention.py | Adds AttentionForSGLang wrapper as the SGLang attention entrypoint for the dispatcher. |
| atom/plugin/register.py | Makes set_attn_cls() a compatibility no-op (logs only); attention selection now happens in the dispatcher. |
| atom/plugin/config.py | Removes vllm_use_atom_attention from PluginConfig and translation. |
| atom/models/deepseek_v2.py | Updates imports to point at new sparse MLA/indexer integration location. |
| atom/model_ops/paged_attention.py | Renames native attention layer to Attention and asserts it’s not instantiated in plugin mode. |
| atom/model_ops/moe.py | Updates import path for the vLLM-only MoE decorator. |
| atom/model_ops/base_attention.py | Adds mode-dispatching Attention constructor; adds wrappers for PA kernels; removes redundant layer= passing. |
| atom/model_ops/attentions/triton_mha.py | Removes dependency on mutable atom.model_ops.Attention; always returns PagedAttentionImpl. |
| atom/model_ops/attentions/aiter_mla.py | Removes plugin decorators/branching; restores native-only backend naming and builder behavior. |
| atom/model_ops/attentions/aiter_attention.py | Removes plugin decorators/branching; simplifies impl selection; uses proper super(). |
| atom/model_ops/attention_mla.py | Removes plugin-mode decorator injection and plugin forward branching; consolidates on native forward_impl. |
| atom/model_ops/attention_mha.py | Removes plugin-mode decorator injection and plugin branching; consolidates on native forward_impl. |
| atom/model_ops/__init__.py | Stops exporting mutable attention symbols; exports only the frontend dispatcher Attention. |
| atom/config.py | Adds ATOM vLLM attention ops to the default splitting ops list. |
| .claude/commands/atom-vllm-benchmark-guide.md | Updates benchmark guide to remove references to the deprecated attention-only disable flag. |
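The table notes that set_attn_cls() becomes a compatibility no-op that only logs. A minimal sketch of such a shim follows; the accept-anything signature and the logger name are assumptions, since the PR summary does not show the real ones.

```python
import logging

# Hypothetical logger name, chosen to mirror the module path in the table.
logger = logging.getLogger("atom.plugin.register")


def set_attn_cls(*args, **kwargs) -> None:
    """Compatibility shim: accept any arguments, change nothing, and log.

    Old call sites keep working, while attention selection is left to the
    frontend dispatcher.
    """
    logger.info(
        "set_attn_cls() is a no-op; attention selection happens in the dispatcher"
    )
```

Keeping the function instead of deleting it means existing callers do not break during the transition.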
Comments suppressed due to low confidence (1)
atom/plugin/vllm/attention/layer_mha.py:896
get_kv_cache_spec() always returns SlidingWindowSpec because self.sliding_window is set to -1 when per_layer_sliding_window is None, and the check only tests is not None. This can generate an invalid/meaningless sliding-window KV cache spec for the default case. Consider storing None when sliding window is disabled, or checking self.sliding_window > 0 (or != -1) before returning SlidingWindowSpec.
Pull request overview
Copilot reviewed 41 out of 42 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
atom/plugin/vllm/attention/layer_mha.py:896
self.sliding_window is initialized to -1 when sliding window is disabled, but get_kv_cache_spec() checks only is not None, so it will always return SlidingWindowSpec (including for sliding_window=-1). This likely misinforms vLLM KV-cache allocation for non-sliding-window models. Treat -1 (and possibly None) as the disabled case and return FullAttentionSpec then.
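The suggested fix can be sketched as follows. The two dataclasses are simplified stand-ins that only borrow the names FullAttentionSpec and SlidingWindowSpec; they are not vLLM's classes, and the free-function form of get_kv_cache_spec and its parameters are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class FullAttentionSpec:
    """Simplified stand-in, not vLLM's FullAttentionSpec."""
    block_size: int
    num_kv_heads: int
    head_size: int


@dataclass
class SlidingWindowSpec:
    """Simplified stand-in, not vLLM's SlidingWindowSpec."""
    block_size: int
    num_kv_heads: int
    head_size: int
    sliding_window: int


def normalize_sliding_window(value: Optional[int]) -> Optional[int]:
    """Map every 'disabled' encoding (None, -1, 0) to None."""
    if value is None or value <= 0:
        return None
    return value


def get_kv_cache_spec(
    block_size: int,
    num_kv_heads: int,
    head_size: int,
    sliding_window: Optional[int],
) -> Union[FullAttentionSpec, SlidingWindowSpec]:
    """Return a sliding-window spec only when a real window size is set."""
    window = normalize_sliding_window(sliding_window)
    if window is None:
        return FullAttentionSpec(block_size, num_kv_heads, head_size)
    return SlidingWindowSpec(block_size, num_kv_heads, head_size, window)
```

Normalizing once, at the point where the value is stored, keeps a later `is not None` check correct regardless of whether the caller passed None or -1.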
I want to hold off on this and merge DS-V3.2-MTP and GLM-4.7-MTP first. I'm afraid there will be conflicts.
Can we also refactor the atom config part? I think it's better to make
Makes perfect sense. The current atom-vllm config has 2 risky points:
Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
remove legacy code and comment add FIXME for one legacy method used by atom-sgl Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
move gdn into atom/plugin/vllm/attention remove folder atom/plugin/vllm/attention_backend Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
as atom-vllm doesn't need it for now Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
fix missing kv dtype for sparse MLA Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
remove the multi inheritance and inline the attention methods Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
instead of deprecating it Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
This PR refactors the attention architecture for ATOM-vLLM. Here is the RFC: #758
Accuracy:
Performance:
P0 atom-vLLM Performance Regression Check
Branch head: 776e0b3
Image: docker.io/rocm/atom-dev:vllm-v0.19.0-nightly_20260526
Remaining perf test
atom-vllm
File Layout:
| File | Description |
|---|---|
| backend.py | |
| layer.py | |
| layer_common.py | static_forward_context |
| layer_mha.py | AttentionForVllmMHA: MHA attention layer (forward, KV cache, scales) |
| layer_mla.py | AttentionForVllmMLA: MLA attention layer + helper functions (reorg_kvcache, triton BMM wrappers) |
| layer_sparse_mla.py | |
| layer_gdn.py | GatedDeltaNet attention layer |
| metadata.py | |
| ops.py | torch.compile custom ops (atom_vllm_mha_attention, atom_vllm_mla_attention) |
| __init__.py | |

Code:

| Old file | Old symbol | New file | New symbol |
|---|---|---|---|
| atom/plugin/attention_mha.py | PagedAttentionImplDecoratorForPluginMode | atom/plugin/vllm/attention/layer_mha.py | AttentionForVllmMHA |
| atom/plugin/attention_mla.py | MLAAttentionImplDecoratorForPluginMode | atom/plugin/vllm/attention/layer_mla.py | AttentionForVllmMLA |
| atom/plugin/attention_mla_sparse.py | MLASparseAttentionImplDecoratorForPluginMode | atom/plugin/vllm/attention/layer_mla.py | AttentionForVllmSparseMLA |
| atom/plugin/vllm/attention_backend/attention_gdn.py | GatedDeltaNet | atom/plugin/vllm/attention/layer_gdn.py | GatedDeltaNet |
| atom/model_ops/paged_attention.py (vllm branch in PagedAttention) | PagedAttention | atom/plugin/vllm/attention/layer.py | AttentionForVllm |
| atom/model_ops/attentions/aiter_attention.py | AiterBackend (reused native) | atom/plugin/vllm/attention/backend.py | AiterMhaBackendForVllm |
| atom/model_ops/attentions/aiter_mla.py | AiterMLABackend (reused native) | atom/plugin/vllm/attention/backend.py | AiterMlaBackendForVllm |
| atom/plugin/vllm/attention_backend/mla_sparse.py | AiterMLASparseBackend | atom/plugin/vllm/attention/backend.py | AiterSparseMlaBackendForVllm |
| atom/plugin/vllm/attention_backend/mla_sparse.py | AiterMLASparseIndexerBackend | atom/plugin/vllm/attention/backend.py | AiterSparseMlaIndexerBackendForVllm |
| atom/plugin/vllm/attention_backend/gdn_attn.py | GDNAttentionBackend | atom/plugin/vllm/attention/backend.py | GDNAttentionBackend |
| atom/plugin/attention.py | AiterFlashAttentionMetadataForPluginMode | atom/plugin/vllm/attention/metadata.py | AiterMhaMetadataForVllm |
| atom/plugin/attention.py | AiterFlashAttentionPhaseMetadata | atom/plugin/vllm/attention/metadata.py | AiterMhaPhaseMetadata |
| atom/plugin/attention.py | AiterFlashAttentionChunkPrefillMetadata | atom/plugin/vllm/attention/metadata.py | AiterChunkPrefillMetadata |
| atom/plugin/attention.py | AiterMLACommonMetadataForPluginMode | atom/plugin/vllm/attention/metadata.py | AiterMlaMetadataForVllm |
| atom/plugin/attention.py | AiterMLADecodeMetadataForPluginMode | atom/plugin/vllm/attention/metadata.py | AiterMlaDecodeMetadataForVllm |
| atom/plugin/attention.py | AiterMLACommonPrefillMetadataForPluginMode | atom/plugin/vllm/attention/metadata.py | AiterMlaPrefillMetadataForVllm |
| atom/plugin/attention.py | AiterMLAChunkedContextMetadataForPluginMode | atom/plugin/vllm/attention/metadata.py | AiterMlaChunkedContextMetadataForVllm |
| atom/plugin/attention.py | AiterMLASparseMetadataForPluginMode | atom/plugin/vllm/attention/metadata.py | AiterMlaSparseMetadataForVllm |
| atom/plugin/attention.py | vllmDeepseekV32IndexerMetadata | atom/plugin/vllm/attention/metadata.py | AiterMlaSparseIndexerMetadataForVllm |
| atom/plugin/attention.py | vllmAttentionMetadataBuilderMethods | atom/plugin/vllm/attention/metadata.py | AiterMhaMetadataBuilderForVllm |
| atom/plugin/attention.py | vllmMLAAttentionMetadataBuilderMethods | atom/plugin/vllm/attention/metadata.py | AiterMlaMetadataBuilderForVllm |
| atom/plugin/attention.py | vllmMLASparseAttentionMetadataBuilderMethods | atom/plugin/vllm/attention/metadata.py | AiterMlaSparseMetadataBuilder |
| atom/plugin/attention.py | vllmMLASparseIndexerAttentionMetadataBuilderMethods | atom/plugin/vllm/attention/metadata.py | AiterMlaSparseIndexerMetadataBuilder |
| atom/plugin/attention.py | unified_attention_with_output_base_for_plugin_mode | atom/plugin/vllm/attention/ops.py | torch.ops.aiter.atom_vllm_mha_attention |
| atom/plugin/attention.py | unified_attention_with_output_base_for_plugin_mode | atom/plugin/vllm/attention/ops.py | torch.ops.aiter.atom_vllm_mla_attention |
| atom/plugin/attention_mla_sparse.py | IndexerDecoratorForPluginMode | atom/plugin/vllm/attention/layer_sparse_mla.py | IndexerDecoratorForPluginMode |
| atom/plugin/attention_mla_sparse.py | DeepseekV32IndexerCacheDecoratorForPluginMode | atom/plugin/vllm/attention/layer_sparse_mla.py | DeepseekV32IndexerCacheDecoratorForPluginMode |
| atom/plugin/moe.py | FusedMoEDecoratorForPluginMode | atom/plugin/vllm/moe.py | FusedMoEDecoratorForPluginMode |
| atom/plugin/vllm/mla_patch.py | patch_vllm_mla_attention | | |