[None][refactor] Flatten thop.attention sequence kwargs + rename rotary_embedding_* to rope_*#14569
Conversation
|
/bot run --disable-fail-fast --add-multi-gpu-test --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" |
📝 WalkthroughWalkthroughThis PR refactors the attention operation API to eliminate grouped/vectorized parameter passing. Layer indices are renamed for clarity, rotary embeddings become explicit rope scalars, helix and spec-decoding parameters transition from vector containers to explicit optional tensors, and a new sync test validates Python-C++ parameter binding alignment across the boundary. ChangesAttention Parameter API Refactoring: From Grouped Vectors to Explicit Scalars
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cpp/tensorrt_llm/kernels/flashMLA/flash_fwd_mla_kernel.h (1)
24-24:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUpdate NVIDIA copyright year on this modified file.
This header still ends at 2024 even though this file is being modified in this PR; please bump it to the latest modification year.
Proposed fix
- * Copyright (c) 2022-2024, NVIDIA CORPORATION. All rights reserved. + * Copyright (c) 2022-2026, NVIDIA CORPORATION. All rights reserved.As per coding guidelines
**/*.{cpp,cc,h,hpp,py,cu,cuh}: Include NVIDIA copyright header on ALL new files; update year on modified files.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tensorrt_llm/kernels/flashMLA/flash_fwd_mla_kernel.h` at line 24, Update the copyright year in the file header comment (flash_fwd_mla_kernel.h) from 2024 to the current modification year (2026); locate the top-of-file copyright block and bump the year range (e.g., "2022-2026") so the header matches the coding guideline for modified files.cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp (1)
1-15:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUpdate the NVIDIA copyright year.
This file is modified in this PR, but the header still ends at
2025.As per coding guidelines,
**/*.{cpp,cc,h,hpp,py,cu,cuh}: Include NVIDIA copyright header on ALL new files; update year on modified files.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp` around lines 1 - 15, Update the NVIDIA copyright header in cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp to include the current year (change "2020-2025" to "2020-2026") so the file header reflects the modification year per the project's copyright guidelines.
🧹 Nitpick comments (3)
tests/unittest/_torch/attention_backend/test_attention_op_sync.py (3)
338-344: 💤 Low valuePEP 604 union syntax (
X | Y) won't be recognized.With Python 3.10+,
typing.get_origin(X | Y)returnstypes.UnionType, nottyping.Union. This means Optional fields declared withX | Nonesyntax will classify as"unknown"instead of unwrapping to the inner type—causing the type check to be skipped rather than validated.Since the fallback is safe (skip rather than false-positive), this is low priority but worth noting for future robustness.
♻️ Suggested fix to handle both union styles
+import types + def _python_category(py_type) -> str: ... origin = typing.get_origin(py_type) - if origin is typing.Union: + if origin is typing.Union or origin is types.UnionType: args = [a for a in typing.get_args(py_type) if a is not type(None)] if len(args) == 1: return _python_category(args[0]) return "unknown"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/attention_backend/test_attention_op_sync.py` around lines 338 - 344, The union-unwrapping in _python_category currently only checks for typing.Union so it misses PEP 604 unions (X | Y) whose origin is types.UnionType; update the logic that inspects origin = typing.get_origin(py_type) to treat both typing.Union and types.UnionType the same (e.g., import types and check if origin in (typing.Union, types.UnionType) or isinstance(origin, types.UnionType) ), then extract typing.get_args(py_type) and unwrap the None case exactly as done now so X | None and Optional[X] both return _python_category(args[0]).
463-493: 💤 Low valueMissing return type annotation.
Per coding guidelines, all Python functions should have return type annotations.
♻️ Add return type
-def _verify_consumed(cls, chains: set[tuple[str, ...]], excluded=frozenset()): +def _verify_consumed(cls, chains: set[tuple[str, ...]], excluded=frozenset()) -> None:As per coding guidelines: "Always annotate Python functions with return type; use
Noneif the function does not return anything".🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/attention_backend/test_attention_op_sync.py` around lines 463 - 493, The function _verify_consumed is missing a return type annotation; update its signature to include an explicit return type of None (i.e., -> None) since it only asserts and doesn't return a value. Locate _verify_consumed (which references fields, dataclasses.is_dataclass, and _self_attrs_in_property) and add the return annotation without changing behavior or logic; no other code changes are needed.
1-551: QA list updates are not required.This is a static/AST-only unit test under
tests/unittest/. It validates Python-C++ binding alignment without runtime execution. No QA test list additions are needed—this test will run with the standardpytestunit test suite and doesn't require multi-GPU scheduling or end-to-end integration coverage.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/attention_backend/test_attention_op_sync.py` around lines 1 - 551, Single-line summary: this static AST-only unit test requires no QA list additions. Fix: remove any edits that add this test to QA/automation lists and revert CI metadata changes; ensure the PR/body and any test registration only include the new test file name (test_attention_op_sync.py) without placing it in multi-GPU or E2E QA lists, and add a short note in the PR description that the test is AST-only and runs under regular pytest so no QA test-list update is necessary (refer to TrtllmAttention._run and AttentionForwardArgs when explaining if needed).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp`:
- Around line 216-220: In loadChunkedKVCacheForMLA(), reintroduce the FP8-only
guard so that when hasKvCacheQuant() is true but hasFp8KvCache() is false the
function rejects/returns an error instead of falling through to the non-FP8
template path; update the control flow around the kv_scale_quant_orig_ptr setup
(and related logic that chooses the template specialization) to check
hasFp8KvCache() and short-circuit with an error/log if FP8 support is missing,
referencing loadChunkedKVCacheForMLA(), hasKvCacheQuant(), and hasFp8KvCache()
to locate and fix the branch.
In `@tensorrt_llm/_torch/attention_backend/trtllm.py`:
- Around line 161-168: spec_decoding_position_offsets_for_cpp currently reshapes
the raw spec_decoding_position_offsets buffer which can expose the full backing
width instead of the live runtime width; use the precomputed C++ view computed
by update_position_offsets_for_cpp (spec_decoding_position_offsets_cpp) instead.
Replace the logic in spec_decoding_position_offsets_for_cpp to return
spec_decoding_position_offsets_cpp when available (falling back to reshaping
spec_decoding_position_offsets only if the _cpp view is None) so the C++ kernel
receives the correct (max_num_requests, runtime_query_len) shape rather than
buf_dim; reference symbols: spec_decoding_position_offsets_for_cpp,
spec_decoding_position_offsets_cpp, update_position_offsets_for_cpp,
spec_decoding_position_offsets, max_num_requests.
---
Outside diff comments:
In `@cpp/tensorrt_llm/kernels/flashMLA/flash_fwd_mla_kernel.h`:
- Line 24: Update the copyright year in the file header comment
(flash_fwd_mla_kernel.h) from 2024 to the current modification year (2026);
locate the top-of-file copyright block and bump the year range (e.g.,
"2022-2026") so the header matches the coding guideline for modified files.
In `@cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp`:
- Around line 1-15: Update the NVIDIA copyright header in
cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp to include the current year (change
"2020-2025" to "2020-2026") so the file header reflects the modification year
per the project's copyright guidelines.
---
Nitpick comments:
In `@tests/unittest/_torch/attention_backend/test_attention_op_sync.py`:
- Around line 338-344: The union-unwrapping in _python_category currently only
checks for typing.Union so it misses PEP 604 unions (X | Y) whose origin is
types.UnionType; update the logic that inspects origin =
typing.get_origin(py_type) to treat both typing.Union and types.UnionType the
same (e.g., import types and check if origin in (typing.Union, types.UnionType)
or isinstance(origin, types.UnionType) ), then extract typing.get_args(py_type)
and unwrap the None case exactly as done now so X | None and Optional[X] both
return _python_category(args[0]).
- Around line 463-493: The function _verify_consumed is missing a return type
annotation; update its signature to include an explicit return type of None
(i.e., -> None) since it only asserts and doesn't return a value. Locate
_verify_consumed (which references fields, dataclasses.is_dataclass, and
_self_attrs_in_property) and add the return annotation without changing behavior
or logic; no other code changes are needed.
- Around line 1-551: Single-line summary: this static AST-only unit test
requires no QA list additions. Fix: remove any edits that add this test to
QA/automation lists and revert CI metadata changes; ensure the PR/body and any
test registration only include the new test file name
(test_attention_op_sync.py) without placing it in multi-GPU or E2E QA lists, and
add a short note in the PR description that the test is AST-only and runs under
regular pytest so no QA test-list update is necessary (refer to
TrtllmAttention._run and AttentionForwardArgs when explaining if needed).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: de0ebf46-86ea-4d79-b6d5-4ee18b1136b5
📒 Files selected for processing (14)
cpp/tensorrt_llm/kernels/flashMLA/flash_fwd_mla_kernel.hcpp/tensorrt_llm/nanobind/thop/bindings.cppcpp/tensorrt_llm/thop/attentionOp.cppcpp/tensorrt_llm/thop/attentionOp.hcpp/tensorrt_llm/thop/mlaPreprocessOp.cpptensorrt_llm/_torch/attention_backend/interface.pytensorrt_llm/_torch/attention_backend/sparse/dsa.pytensorrt_llm/_torch/attention_backend/trtllm.pytensorrt_llm/_torch/attention_backend/trtllm_gen.pytensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.pytensorrt_llm/_torch/auto_deploy/custom_ops/mla/trtllm_mla.pytensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_into_trtllm_attention.pytensorrt_llm/_torch/modules/attention.pytests/unittest/_torch/attention_backend/test_attention_op_sync.py
…ature Replace 5 list-typed parameters with their flat named components in thop::attention() and the 2 internal Runner virtual signatures (attentionOp.cpp lines ~361 and ~423). Each list was a fixed-arity bundle whose ordinal positions encoded semantic meaning; flattening eliminates the bookkeeping and lets the static sync test verify each element by name. Flattenings: - rotary_embedding_scales (3 doubles) -> rotary_embedding_scale, rotary_embedding_short_m_scale, rotary_embedding_long_m_scale - rotary_embedding_max_position_info (2 ints) -> rotary_embedding_max_positions, rotary_embedding_original_max_positions - helix_tensor_params (2 optional tensors) -> helix_position_offsets, helix_is_inactive_rank - spec_decoding_bool_params (3 bools) -> is_spec_decoding_enabled, use_spec_decoding, is_spec_dec_tree - spec_decoding_tensor_params (3 or 6 optional tensors) -> spec_decoding_generation_lengths, spec_decoding_position_offsets_for_cpp, spec_decoding_packed_mask, spec_decoding_bl_tree_mask_offset, spec_decoding_bl_tree_mask, spec_bl_tree_first_sparse_mask_offset_kv (last 3 None on non-SM100) Body cleanups: - Drop the [i] unpacking at the start of attention() (now params arrive named). - Spec-dec branch: replace size()==3/6 with named has_value() checks gated on isSM100Family(). - extractHelixParams lambda now captures the 2 optional tensors directly instead of indexing the vector. The nanobind binding (bindings.cpp) and Python callers are updated in follow-up commits; this commit alone does not link cleanly. The intermediate state is for review-grouping only; the PR merges as a single squashed change. Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
…n C++ surface Apply the rope_* prefix consistently to all 8 rotary embedding params visible at the thop.attention boundary: - rotary_embedding_dim -> rope_dim - rotary_embedding_base -> rope_base - rotary_embedding_scale_type -> rope_scale_type - rotary_embedding_scale -> rope_scale - rotary_embedding_short_m_scale -> rope_short_m_scale - rotary_embedding_long_m_scale -> rope_long_m_scale - rotary_embedding_max_positions -> rope_max_positions - rotary_embedding_original_max_positions -> rope_original_max_positions Touches only the public C++ surface (attention() params in attentionOp.cpp/.h). The AttentionOp class member fields (mRotaryEmbedding*) keep their existing names — they are internal class state, out of scope. rotary_inv_freq / rotary_cos_sin / mrope_* keep their current names since they use different prefixes and were not asked to be renamed. Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
…ntion kwargs Mirror the C++ signature change in commits b7a5e4d and 071622f into the m.def("attention", ...) nb::arg chain. Replace 5 sequence kwargs with their 16 flat counterparts: - rotary_embedding_scales -> rope_scale, rope_short_m_scale, rope_long_m_scale - rotary_embedding_max_position_info -> rope_max_positions, rope_original_max_positions - helix_tensor_params -> helix_position_offsets, helix_is_inactive_rank - spec_decoding_bool_params -> is_spec_decoding_enabled, use_spec_decoding, is_spec_dec_tree - spec_decoding_tensor_params -> 6 named optional tensors (spec_decoding_generation_lengths, spec_decoding_position_offsets_for_cpp, spec_decoding_packed_mask, spec_decoding_bl_tree_mask_offset, spec_decoding_bl_tree_mask, spec_bl_tree_first_sparse_mask_offset_kv) Rename 5 scalar kwargs: - rotary_embedding_{dim,base,scale_type} -> rope_{dim,base,scale_type} Order matches the C++ attention() signature param order so positional dispatch works. Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
…ntion call Mirror the C++ schema change in the PyTorch backend call site: Call site (thop.attention kwargs in TrtllmAttention._run): - Replace 5 sequence kwargs with 16 flat ones, each sourced from a named metadata/self attribute. - Rename 5 scalar kwargs from rotary_embedding_* to rope_*. Property cleanup on TrtllmAttention: - Replace the 5 rotary_embedding_* properties (scalar + 2 list builders) with 8 rope_* scalar properties — one per flat kwarg. - The 8 properties are all one-liner accessors over rope_params. Property cleanup on TrtllmAttentionMetadata: - Delete helix_tensor_params, spec_decoding_bool_params, and spec_decoding_tensor_params — all three were list builders whose sole consumer was the thop.attention call site, which now reads the underlying fields directly. Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
… + renamed kwargs Mirror the C++ schema change in the AutoDeploy custom ops that call thop.attention via positional args. trtllm_attention.py (single call site + planner state): - Replace planner's spec_decoding_bool_params/spec_decoding_tensor_params list fields with a single use_spec_decoding scalar; the kernel-level 3-bool + 6-tensor decomposition is now passed directly at the call site. Remove the SM-version-dependent list-sizing in reset() and the per-batch list mutations in init_spec_decoding. - Replace the local rotary_embedding_scales / rotary_embedding_max_position_info / mla_tensor_params lists with their flat counterparts at the positional call site. - Rename rotary_embedding_* comments to rope_* at the call site. trtllm_mla.py (4 positional call sites): - Remove the 3 identical 7-line local-list construction blocks + their now-unused sm_version computation. - Inline the flat constants at each call site. fuse_rope_into_trtllm_attention.py (internal rope_info dict): - Rename the "rotary_embedding_dim" key to "rope_dim" and update its one reader in trtllm_attention.py. Internal to AutoDeploy; no external consumers. All AutoDeploy modules import cleanly. Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Add test_no_sequence_kwargs_at_thop_attention_boundary to test_attention_op_sync.py: parse the void attention(...) declaration in attentionOp.h and assert no param has an outer container type (std::vector, c10::ArrayRef, std::array). The existing per-element source/type/literal sync tests cannot verify ordinal positions inside a list. Flat named params keep every slot checkable individually. If a future change tries to introduce a new sequence kwarg, this test fails with the specific param name and type and an instruction to flatten it. Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
cd34033 to
6aafd09
Compare
|
/bot run --disable-fail-fast --add-multi-gpu-test |
|
PR_Github #50753 [ run ] triggered by Bot. Commit: |
|
PR_Github #50753 [ run ] completed with state
|
|
/bot run --disable-fail-fast --add-multi-gpu-test |
|
PR_Github #50953 [ run ] triggered by Bot. Commit: |
|
PR_Github #50953 [ run ] completed with state |
Summary
Eliminate every
Sequence[...]kwarg fromthop.attentionand rename the remainingrotary_embedding_*kwargs torope_*for consistency.Flattened (5 sequence → 16 scalar/tensor kwargs)
rotary_embedding_scales(3 doubles)rope_scale,rope_short_m_scale,rope_long_m_scalerotary_embedding_max_position_info(2 ints)rope_max_positions,rope_original_max_positionshelix_tensor_params(2 optional tensors)helix_position_offsets,helix_is_inactive_rankspec_decoding_bool_params(3 bools)is_spec_decoding_enabled,use_spec_decoding,is_spec_dec_treespec_decoding_tensor_params(3 or 6 optional tensors)Noneon non-SM100Renamed at the boundary
rotary_embedding_{dim,base,scale_type,scale,short_m_scale,long_m_scale,max_positions,original_max_positions}→rope_{...}.The internal
AttentionOp::mRotaryEmbedding*fields and therotary_inv_freq/rotary_cos_sin/mrope_*kwargs are out of scope.AutoDeploy
Both
auto_deploy/custom_ops/attention/trtllm_attention.pyandauto_deploy/custom_ops/mla/trtllm_mla.pycallthop.attentionvia positional args._GlobalTrtllmPlannerno longer holds spec-dec list buffers; each call site passes the 3 bool + 6 tensor slots inline. The internalrope_info["rotary_embedding_dim"]key infuse_rope_into_trtllm_attention.pyis renamed torope_dim.Test
New
test_no_sequence_kwargs_at_thop_attention_boundaryintest_attention_op_sync.pyparses thevoid attention(...)declaration inattentionOp.hand asserts nostd::vector/c10::ArrayRef/std::arrayouter types. Prevents future sequence-kwarg regressions.Test plan
pre-commit runon all 6 touched files — greenpytest test_attention_op_sync.py— 6/6 pass (5 existing + 1 new)test_attention_mla -k v1_kv_cacheon H100 — 40 passed / 2 skipped / 0 failedtest_trtllm_attention_op.py+test_trtllm_mla_op.pyon H100 — 68/68 passSummary by CodeRabbit
Refactor
Tests