[None][refactor] Flatten thop.attention sequence kwargs + rename rotary_embedding_* to rope_* #14569

Merged
yuxianq merged 6 commits into NVIDIA:main from yuxianq:flatten-attn-args
May 29, 2026
Conversation

@yuxianq
Collaborator

@yuxianq yuxianq commented May 26, 2026

Summary

Eliminate every Sequence[...] kwarg from thop.attention and rename the remaining rotary_embedding_* kwargs to rope_* for consistency.

Flattened (5 sequence → 16 scalar/tensor kwargs)

| Old (sequence) | New (flat) |
| --- | --- |
| rotary_embedding_scales (3 doubles) | rope_scale, rope_short_m_scale, rope_long_m_scale |
| rotary_embedding_max_position_info (2 ints) | rope_max_positions, rope_original_max_positions |
| helix_tensor_params (2 optional tensors) | helix_position_offsets, helix_is_inactive_rank |
| spec_decoding_bool_params (3 bools) | is_spec_decoding_enabled, use_spec_decoding, is_spec_dec_tree |
| spec_decoding_tensor_params (3 or 6 optional tensors) | 6 named optionals; last 3 are None on non-SM100 |
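To make the mapping in the table concrete, here is a minimal Python sketch that expands the five old sequence kwargs into the 16 flat ones. The helper `flatten_attention_kwargs` and its dict-based calling convention are illustrative assumptions and not part of the PR; the six spec-decoding tensor names are the ones listed in this PR's commit messages.

```python
from typing import Any

# Flat names for each old sequence kwarg, in positional order.
FLAT_NAMES: dict[str, tuple[str, ...]] = {
    "rotary_embedding_scales": (
        "rope_scale", "rope_short_m_scale", "rope_long_m_scale"),
    "rotary_embedding_max_position_info": (
        "rope_max_positions", "rope_original_max_positions"),
    "helix_tensor_params": (
        "helix_position_offsets", "helix_is_inactive_rank"),
    "spec_decoding_bool_params": (
        "is_spec_decoding_enabled", "use_spec_decoding", "is_spec_dec_tree"),
    "spec_decoding_tensor_params": (
        "spec_decoding_generation_lengths",
        "spec_decoding_position_offsets_for_cpp",
        "spec_decoding_packed_mask",
        "spec_decoding_bl_tree_mask_offset",
        "spec_decoding_bl_tree_mask",
        "spec_bl_tree_first_sparse_mask_offset_kv",
    ),
}


def flatten_attention_kwargs(old: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: expand old sequence kwargs into flat named kwargs."""
    flat: dict[str, Any] = {}
    for key, value in old.items():
        names = FLAT_NAMES.get(key)
        if names is None:
            flat[key] = value  # not a sequence kwarg; pass through unchanged
            continue
        items = list(value)
        # The old spec-dec tensor list had 3 entries on non-SM100 and 6 on
        # SM100; the three missing slots become explicit None values.
        if key == "spec_decoding_tensor_params" and len(items) == 3:
            items += [None, None, None]
        if len(items) != len(names):
            raise ValueError(
                f"{key}: expected {len(names)} items, got {len(items)}")
        flat.update(zip(names, items))
    return flat
```

Because every slot now has its own name, a mismatch in arity surfaces as an error on a specific kwarg instead of a silently shifted list index.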

Renamed at the boundary

rotary_embedding_{dim,base,scale_type,scale,short_m_scale,long_m_scale,max_positions,original_max_positions} → rope_{...}.

The internal AttentionOp::mRotaryEmbedding* fields and the rotary_inv_freq / rotary_cos_sin / mrope_* kwargs are out of scope.
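The brace shorthand above expands to eight one-to-one renames. A small Python sketch of that expansion follows; `rename_rope_kwargs` is a hypothetical helper used only to show which names change and which stay untouched.

```python
# Suffixes shared by the old rotary_embedding_* and new rope_* kwargs.
ROPE_SUFFIXES = (
    "dim", "base", "scale_type", "scale", "short_m_scale",
    "long_m_scale", "max_positions", "original_max_positions",
)

# Old name -> new name for the eight kwargs renamed at the boundary.
ROPE_RENAMES = {f"rotary_embedding_{s}": f"rope_{s}" for s in ROPE_SUFFIXES}


def rename_rope_kwargs(kwargs: dict) -> dict:
    """Hypothetical helper: apply the boundary rename, leaving other keys as-is."""
    return {ROPE_RENAMES.get(name, name): value for name, value in kwargs.items()}
```

Names such as rotary_inv_freq, rotary_cos_sin, and mrope_* do not start with the rotary_embedding_ prefix, so they pass through unchanged, which matches the out-of-scope note.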

AutoDeploy

Both auto_deploy/custom_ops/attention/trtllm_attention.py and auto_deploy/custom_ops/mla/trtllm_mla.py call thop.attention via positional args. _GlobalTrtllmPlanner no longer holds spec-dec list buffers; each call site passes the 3 bool + 6 tensor slots inline. The internal rope_info["rotary_embedding_dim"] key in fuse_rope_into_trtllm_attention.py is renamed to rope_dim.

Test

New test_no_sequence_kwargs_at_thop_attention_boundary in test_attention_op_sync.py parses the void attention(...) declaration in attentionOp.h and asserts no std::vector / c10::ArrayRef / std::array outer types. Prevents future sequence-kwarg regressions.
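A check of this kind can be sketched in a few lines of Python. The version below is an assumption about the general approach (regex plus a top-level comma split), not the code in test_attention_op_sync.py; `SAMPLE_HEADER` is a made-up declaration, and `split_params` / `find_sequence_params` are hypothetical names.

```python
import re

# Outer container types that count as "sequence" parameters.
SEQUENCE_TYPES = ("std::vector", "c10::ArrayRef", "std::array")

# Made-up declaration for illustration; not the real attentionOp.h contents.
SAMPLE_HEADER = """
void attention(torch::Tensor q, std::optional<torch::Tensor> k,
    double rope_scale, std::vector<double> rotary_embedding_scales,
    std::optional<torch::Tensor> helix_position_offsets);
"""


def split_params(param_list: str) -> list[str]:
    """Split a C++ parameter list on commas that are not inside <...>."""
    params, current, depth = [], [], 0
    for ch in param_list:
        if ch == "<":
            depth += 1
        elif ch == ">":
            depth -= 1
        if ch == "," and depth == 0:
            params.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        params.append(tail)
    return params


def find_sequence_params(header_text: str) -> list[str]:
    """Return parameters of `void attention(...)` whose outer type is a sequence."""
    match = re.search(r"void\s+attention\s*\((.*?)\)\s*;", header_text, re.DOTALL)
    if match is None:
        raise ValueError("attention declaration not found")
    offenders = []
    for param in split_params(match.group(1)):
        outer = re.sub(r"^const\s+", "", param)  # ignore a leading const
        if outer.startswith(SEQUENCE_TYPES):
            offenders.append(param)
    return offenders
```

A test built on this would assert that `find_sequence_params(header_text)` is empty and report the offending parameters otherwise. Only the outer type is inspected, so `std::optional<torch::Tensor>` passes while `std::vector<double>` is flagged.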

Test plan

  • pre-commit run on all 6 touched files — green
  • pytest test_attention_op_sync.py — 6/6 pass (5 existing + 1 new)
  • Incremental C++ rebuild on H100 (ipp2-1776) — clean
  • PyTorch backend functional: test_attention_mla -k v1_kv_cache on H100 — 40 passed / 2 skipped / 0 failed
  • AutoDeploy unit tests: test_trtllm_attention_op.py + test_trtllm_mla_op.py on H100 — 68/68 pass
  • CI with AutoDeploy stages

Summary by CodeRabbit

  • Refactor

    • Restructured attention operation parameters for improved clarity and maintainability (layer indexing, RoPE configurations, helix and speculative decoding inputs refactored from vector-based to explicit parameters).
  • Tests

    • Added comprehensive validation test for attention operation parameter synchronization between Python and C++ layers.

@yuxianq yuxianq requested review from a team as code owners May 26, 2026 10:59
@yuxianq yuxianq requested a review from QiJune May 26, 2026 10:59
@yuxianq
Collaborator Author

yuxianq commented May 26, 2026

/bot run --disable-fail-fast --add-multi-gpu-test --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@yuxianq yuxianq marked this pull request as draft May 26, 2026 11:04
@yuxianq yuxianq removed the request for review from QiJune May 26, 2026 11:05
@coderabbitai
Contributor

coderabbitai Bot commented May 26, 2026

📝 Walkthrough

This PR refactors the attention operation API to eliminate grouped/vectorized parameter passing. Layer indices are renamed for clarity, rotary embeddings become explicit rope scalars, helix and spec-decoding parameters transition from vector containers to explicit optional tensors, and a new sync test validates Python-C++ parameter binding alignment across the boundary.

Changes

Attention Parameter API Refactoring: From Grouped Vectors to Explicit Scalars

Layer / File(s) Summary
C++ API contract: header and nanobind binding
cpp/tensorrt_llm/thop/attentionOp.h, cpp/tensorrt_llm/nanobind/thop/bindings.cpp
The torch_ext::attention signature is refactored: layer_idx → local_layer_idx; rotary-embedding vectors (rotary_embedding_dim/base/scales/max_position_info) → rope scalars (rope_dim/base/scale_type/scale/short_m_scale/long_m_scale/max_positions/original_max_positions); helix tensor vector → explicit tensors (helix_position_offsets, helix_is_inactive_rank); spec-decoding bool/tensor vectors → explicit flags (is_spec_decoding_enabled, use_spec_decoding, is_spec_dec_tree) and individual tensors.
C++ implementation: parameter wiring and runner signatures
cpp/tensorrt_llm/thop/attentionOp.cpp (lines 359-368, 419-435, 900-932, 941-1051, 1144-1220, 1235-1245)
Update RunnerBase::run, Runner::run, and attention() to accept and wire explicit parameters; set mLayerIdx from local_layer_idx; assign mRotaryEmbedding* fields from rope scalars; set spec-decoding flags directly; update tracing/logging to use local_layer_idx.
C++ helix parameter extraction and wiring
cpp/tensorrt_llm/thop/attentionOp.cpp (lines 472-520, 737-754)
Refactor helix extraction in MLA path and extractHelixParams helper to read directly from helix_position_offsets and helix_is_inactive_rank optional tensors instead of vector indexing.
C++ spec-decoding tensor handling in generation stage
cpp/tensorrt_llm/thop/attentionOp.cpp (lines 810-839, 1071-1073)
Rewrite generation-stage logic to consume explicit spec-decoding tensors directly with validation for "tllm-gen" path; add rank check for position offsets; set spec-decoding flags from explicit booleans.
MLA KV cache quantization: parameter specialization
cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp
Drop kv_scale_orig_quant from loadPagedKVCacheForMLA and loadChunkedKVCacheForMLA; drop kv_scale_quant_orig from MLARopeAppendPagedKVAssignQ; update Torch binding schemas and pointer initialization logic to match.
Python interface contract: AttentionForwardArgs and sparse args
tensorrt_llm/_torch/attention_backend/interface.py
Extend AttentionForwardArgs with KV scale quantization tensors (kv_scale_orig_quant, kv_scale_quant_orig), control flags (is_fused_qkv, update_kv_cache), sparse attention config, and mask_type property mapping enums to C++ constants; add AttentionSparseArgs dataclass.
TrtllmAttentionMetadata: derived properties for C++ dispatch
tensorrt_llm/_torch/attention_backend/trtllm.py (lines 150-175, 221-225)
Add effective_workspace, spec_decoding_position_offsets_for_cpp (with 1D reshaping), max_context_length properties and use_paged_context_fmha field to support refactored keyword-based thop.attention dispatch.
TrtllmAttention: rope and skip-softmax properties
tensorrt_llm/_torch/attention_backend/trtllm.py (lines 1287-1333)
Add read-only properties delegating to rope_params and sparse_attention_config for rope dimensions, base, scale types, short/long scales, max positions, and skip-softmax thresholds; remove _get_mask_type helper.
TrtllmAttention._run refactoring: keyword-based dispatch
tensorrt_llm/_torch/attention_backend/trtllm.py (lines 90bfaf0244f1, 1139-1146, 1349-1410, 1420-1566)
Simplify _run to (q, k, v, metadata, forward_args), cache local_layer_idx, prime RoPE and metadata, and rewrite thop.attention call with explicit keyword arguments sourced from new metadata properties and module state instead of positional/older arguments.
TrtllmAttention.forward: output creation and sparse args setup
tensorrt_llm/_torch/attention_backend/trtllm.py (lines 1593-1669)
Set metadata.use_paged_context_fmha, create outputs when needed, compute is_fused_qkv/update_kv_cache flags, build sparse args conditionally, update Blackwell first-sparse offsets, and call simplified _run.
TrtllmAttention MLA KV-scale parameter updates
tensorrt_llm/_torch/attention_backend/trtllm.py (lines 1733, 1778, 1821, 1935-1936)
Update all MLA kernel invocations to pass None for KV scale arguments, aligning with the refactored parameter contract.
FlashInferTrtllmGenAttention: simplified rope and output scale
tensorrt_llm/_torch/attention_backend/trtllm_gen.py (lines 622-626, 980, 1089)
Simplify _get_mrope_rotary_cos_sin to directly return forward_args.mrope_rotary_cos_sin; remove _get_attention_output_orig_quant helper and use forward.out_scale directly in context and generation paths.
Attention module: KV scale and output scale restructuring
tensorrt_llm/_torch/modules/attention.py (lines 715, 744-761, 770-771)
Replace kv_scales_sf/kv_scales_sf_inv with kv_scale_orig_quant/kv_scale_quant_orig sourced from inv_kv_scales/kv_scales; compute single out_scale based on quantize-output flag; pass new field names into AttentionForwardArgs.
DSATrtllmAttention: MLA RoPE append parameter update
tensorrt_llm/_torch/attention_backend/sparse/dsa.py (line 2535)
Update mla_rope_append_paged_kv_assign_q call to pass None for KV scale arguments, matching simplified signature.
Auto-Deploy attention: rope and spec-decoding parameter rewiring
tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py
Replace rotary_embedding_* vector construction with explicit rope_* scalars; refactor spec-decoding from pre-packed lists to per-batch use_spec_decoding and individual tensors; switch layer_idx → local_layer_idx.
Auto-Deploy MLA: rope scalar parameters and explicit helix/spec-decoding
tensorrt_llm/_torch/auto_deploy/custom_ops/mla/trtllm_mla.py
Eliminate get_sm_version()-dependent vector construction across fresh/cached/chunked prefill and decode; use direct YaRN rope scalars and explicit per-parameter null/false/tensor values in all thop.attention calls.
RoPE dimension parameter rename: rotary_embedding_dim → rope_dim
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_into_trtllm_attention.py
Rename dimension parameter in _convert_to_thop_cos_sin helper and metadata dictionary key in _build_rope_info for consistency with new rope parameter naming scheme.
TrtllmAttention: import and constant setup
tensorrt_llm/_torch/attention_backend/trtllm.py (lines 5, 25-50)
Remove unused replace import; add environment-controlled feature flags and sync-test configuration constants (_TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION, _THOP_EXCLUDED_FIELDS, _THOP_LITERALS).
Test suite: Python-C++ parameter binding validation
tests/unittest/_torch/attention_backend/test_attention_op_sync.py
Add comprehensive unit test that parses C++ header and Python AST to validate: kwarg set equivalence, unique attribute resolution, literal allowlists, field consumption tracking, and absence of sequence-typed C++ parameters in the thop.attention binding.
FlashMLA kernel: null-pointer tolerance for descale factors
cpp/tensorrt_llm/kernels/flashMLA/flash_fwd_mla_kernel.h (lines 689-690)
Update flash_fwd_splitkv_mla_kernel to load descale_q and descale_k conditionally from non-null pointers, defaulting to 1.f when pointer is null.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Possibly related PRs

  • NVIDIA/TensorRT-LLM#14275: Updates the same attention() C++ entry point and nanobind binding in attentionOp.h/cpp and bindings.cpp, removing sink_token_length while this PR restructures rotary/helix/spec-decoding parameters.

Suggested reviewers

  • liji-nv
  • brb-nv
  • nv-guomingz
  • wenmingw
  • hyukn
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 53.33% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly describes the main refactoring work: flattening sequence keyword arguments and renaming rotary_embedding_* parameters to rope_*. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description provides a clear summary of changes, includes a detailed comparison table, explains the refactor scope, test coverage, and references a complete test plan with all validation steps. |


Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cpp/tensorrt_llm/kernels/flashMLA/flash_fwd_mla_kernel.h (1)

24-24: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Update NVIDIA copyright year on this modified file.

This header still ends at 2024 even though this file is being modified in this PR; please bump it to the latest modification year.

Proposed fix
- * Copyright (c) 2022-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION.  All rights reserved.

As per coding guidelines **/*.{cpp,cc,h,hpp,py,cu,cuh}: Include NVIDIA copyright header on ALL new files; update year on modified files.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tensorrt_llm/kernels/flashMLA/flash_fwd_mla_kernel.h` at line 24, Update
the copyright year in the file header comment (flash_fwd_mla_kernel.h) from 2024
to the current modification year (2026); locate the top-of-file copyright block
and bump the year range (e.g., "2022-2026") so the header matches the coding
guideline for modified files.
cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp (1)

1-15: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the NVIDIA copyright year.

This file is modified in this PR, but the header still ends at 2025.

As per coding guidelines, **/*.{cpp,cc,h,hpp,py,cu,cuh}: Include NVIDIA copyright header on ALL new files; update year on modified files.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp` around lines 1 - 15, Update the
NVIDIA copyright header in cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp to include
the current year (change "2020-2025" to "2020-2026") so the file header reflects
the modification year per the project's copyright guidelines.
🧹 Nitpick comments (3)
tests/unittest/_torch/attention_backend/test_attention_op_sync.py (3)

338-344: 💤 Low value

PEP 604 union syntax (X | Y) won't be recognized.

With Python 3.10+, typing.get_origin(X | Y) returns types.UnionType, not typing.Union. This means Optional fields declared with X | None syntax will classify as "unknown" instead of unwrapping to the inner type—causing the type check to be skipped rather than validated.

Since the fallback is safe (skip rather than false-positive), this is low priority but worth noting for future robustness.

♻️ Suggested fix to handle both union styles
+import types
+
 def _python_category(py_type) -> str:
     ...
     origin = typing.get_origin(py_type)
-    if origin is typing.Union:
+    if origin is typing.Union or origin is types.UnionType:
         args = [a for a in typing.get_args(py_type) if a is not type(None)]
         if len(args) == 1:
             return _python_category(args[0])
         return "unknown"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/attention_backend/test_attention_op_sync.py` around
lines 338 - 344, The union-unwrapping in _python_category currently only checks
for typing.Union so it misses PEP 604 unions (X | Y) whose origin is
types.UnionType; update the logic that inspects origin =
typing.get_origin(py_type) to treat both typing.Union and types.UnionType the
same (e.g., import types and check if origin in (typing.Union, types.UnionType)),
then extract typing.get_args(py_type) and unwrap the None case exactly as done
now so X | None and Optional[X] both return _python_category(args[0]).

463-493: 💤 Low value

Missing return type annotation.

Per coding guidelines, all Python functions should have return type annotations.

♻️ Add return type
-def _verify_consumed(cls, chains: set[tuple[str, ...]], excluded=frozenset()):
+def _verify_consumed(cls, chains: set[tuple[str, ...]], excluded=frozenset()) -> None:

As per coding guidelines: "Always annotate Python functions with return type; use None if the function does not return anything".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/attention_backend/test_attention_op_sync.py` around
lines 463 - 493, The function _verify_consumed is missing a return type
annotation; update its signature to include an explicit return type of None
(i.e., -> None) since it only asserts and doesn't return a value. Locate
_verify_consumed (which references fields, dataclasses.is_dataclass, and
_self_attrs_in_property) and add the return annotation without changing behavior
or logic; no other code changes are needed.

1-551: QA list updates are not required.

This is a static/AST-only unit test under tests/unittest/. It validates Python-C++ binding alignment without runtime execution. No QA test list additions are needed—this test will run with the standard pytest unit test suite and doesn't require multi-GPU scheduling or end-to-end integration coverage.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/attention_backend/test_attention_op_sync.py` around
lines 1 - 551, Single-line summary: this static AST-only unit test requires no
QA list additions. Fix: remove any edits that add this test to QA/automation
lists and revert CI metadata changes; ensure the PR/body and any test
registration only include the new test file name (test_attention_op_sync.py)
without placing it in multi-GPU or E2E QA lists, and add a short note in the PR
description that the test is AST-only and runs under regular pytest so no QA
test-list update is necessary (refer to TrtllmAttention._run and
AttentionForwardArgs when explaining if needed).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp`:
- Around line 216-220: In loadChunkedKVCacheForMLA(), reintroduce the FP8-only
guard so that when hasKvCacheQuant() is true but hasFp8KvCache() is false the
function rejects/returns an error instead of falling through to the non-FP8
template path; update the control flow around the kv_scale_quant_orig_ptr setup
(and related logic that chooses the template specialization) to check
hasFp8KvCache() and short-circuit with an error/log if FP8 support is missing,
referencing loadChunkedKVCacheForMLA(), hasKvCacheQuant(), and hasFp8KvCache()
to locate and fix the branch.

In `@tensorrt_llm/_torch/attention_backend/trtllm.py`:
- Around line 161-168: spec_decoding_position_offsets_for_cpp currently reshapes
the raw spec_decoding_position_offsets buffer which can expose the full backing
width instead of the live runtime width; use the precomputed C++ view computed
by update_position_offsets_for_cpp (spec_decoding_position_offsets_cpp) instead.
Replace the logic in spec_decoding_position_offsets_for_cpp to return
spec_decoding_position_offsets_cpp when available (falling back to reshaping
spec_decoding_position_offsets only if the _cpp view is None) so the C++ kernel
receives the correct (max_num_requests, runtime_query_len) shape rather than
buf_dim; reference symbols: spec_decoding_position_offsets_for_cpp,
spec_decoding_position_offsets_cpp, update_position_offsets_for_cpp,
spec_decoding_position_offsets, max_num_requests.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: de0ebf46-86ea-4d79-b6d5-4ee18b1136b5

📥 Commits

Reviewing files that changed from the base of the PR and between 1f8312d and cd34033.

📒 Files selected for processing (14)
  • cpp/tensorrt_llm/kernels/flashMLA/flash_fwd_mla_kernel.h
  • cpp/tensorrt_llm/nanobind/thop/bindings.cpp
  • cpp/tensorrt_llm/thop/attentionOp.cpp
  • cpp/tensorrt_llm/thop/attentionOp.h
  • cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp
  • tensorrt_llm/_torch/attention_backend/interface.py
  • tensorrt_llm/_torch/attention_backend/sparse/dsa.py
  • tensorrt_llm/_torch/attention_backend/trtllm.py
  • tensorrt_llm/_torch/attention_backend/trtllm_gen.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mla/trtllm_mla.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_into_trtllm_attention.py
  • tensorrt_llm/_torch/modules/attention.py
  • tests/unittest/_torch/attention_backend/test_attention_op_sync.py

Comment thread cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp
Comment thread tensorrt_llm/_torch/attention_backend/trtllm.py
yuxianq added 6 commits May 28, 2026 08:43
…ature

Replace 5 list-typed parameters with their flat named components in
thop::attention() and the 2 internal Runner virtual signatures
(attentionOp.cpp lines ~361 and ~423). Each list was a fixed-arity
bundle whose ordinal positions encoded semantic meaning; flattening
eliminates the bookkeeping and lets the static sync test verify each
element by name.

Flattenings:
- rotary_embedding_scales (3 doubles) ->
  rotary_embedding_scale, rotary_embedding_short_m_scale,
  rotary_embedding_long_m_scale
- rotary_embedding_max_position_info (2 ints) ->
  rotary_embedding_max_positions,
  rotary_embedding_original_max_positions
- helix_tensor_params (2 optional tensors) ->
  helix_position_offsets, helix_is_inactive_rank
- spec_decoding_bool_params (3 bools) ->
  is_spec_decoding_enabled, use_spec_decoding, is_spec_dec_tree
- spec_decoding_tensor_params (3 or 6 optional tensors) ->
  spec_decoding_generation_lengths,
  spec_decoding_position_offsets_for_cpp,
  spec_decoding_packed_mask,
  spec_decoding_bl_tree_mask_offset,
  spec_decoding_bl_tree_mask,
  spec_bl_tree_first_sparse_mask_offset_kv (last 3 None on non-SM100)
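On the Python side, the "last 3 None on non-SM100" rule amounts to always passing six named slots and blanking the tree-mask ones off the SM100 family. A minimal sketch of that caller-side pattern follows; `spec_dec_tensor_slots`, the `tensors` mapping, and the `is_sm100_family` flag are hypothetical and only illustrate the shape of the call.

```python
from typing import Any, Mapping, Optional

# Slots that existed in both the 3-entry and 6-entry forms of the old list.
BASE_SLOTS = (
    "spec_decoding_generation_lengths",
    "spec_decoding_position_offsets_for_cpp",
    "spec_decoding_packed_mask",
)
# Slots that were only present in the 6-entry (SM100) form.
SM100_ONLY_SLOTS = (
    "spec_decoding_bl_tree_mask_offset",
    "spec_decoding_bl_tree_mask",
    "spec_bl_tree_first_sparse_mask_offset_kv",
)


def spec_dec_tensor_slots(
    tensors: Mapping[str, Any], is_sm100_family: bool
) -> dict[str, Optional[Any]]:
    """Hypothetical helper: build the six named spec-dec tensor kwargs."""
    slots: dict[str, Optional[Any]] = {name: tensors.get(name) for name in BASE_SLOTS}
    for name in SM100_ONLY_SLOTS:
        # Off SM100 the slot is still passed, but always as None.
        slots[name] = tensors.get(name) if is_sm100_family else None
    return slots
```

The result always has the same six keys, so the call site looks identical on every architecture and only the values differ.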

Body cleanups:
- Drop the [i] unpacking at the start of attention() (now params
  arrive named).
- Spec-dec branch: replace size()==3/6 with named has_value() checks
  gated on isSM100Family().
- extractHelixParams lambda now captures the 2 optional tensors
  directly instead of indexing the vector.

The nanobind binding (bindings.cpp) and Python callers are updated in
follow-up commits; this commit alone does not link cleanly. The
intermediate state is for review-grouping only; the PR merges as a
single squashed change.

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
…n C++ surface

Apply the rope_* prefix consistently to all 8 rotary embedding params
visible at the thop.attention boundary:
- rotary_embedding_dim -> rope_dim
- rotary_embedding_base -> rope_base
- rotary_embedding_scale_type -> rope_scale_type
- rotary_embedding_scale -> rope_scale
- rotary_embedding_short_m_scale -> rope_short_m_scale
- rotary_embedding_long_m_scale -> rope_long_m_scale
- rotary_embedding_max_positions -> rope_max_positions
- rotary_embedding_original_max_positions -> rope_original_max_positions

Touches only the public C++ surface (attention() params in
attentionOp.cpp/.h). The AttentionOp class member fields
(mRotaryEmbedding*) keep their existing names — they are internal
class state, out of scope.

rotary_inv_freq / rotary_cos_sin / mrope_* keep their current names
since they use different prefixes and were not asked to be renamed.

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
…ntion kwargs

Mirror the C++ signature change in commits b7a5e4d and 071622f
into the m.def("attention", ...) nb::arg chain.

Replace 5 sequence kwargs with their 16 flat counterparts:
- rotary_embedding_scales -> rope_scale, rope_short_m_scale,
  rope_long_m_scale
- rotary_embedding_max_position_info -> rope_max_positions,
  rope_original_max_positions
- helix_tensor_params -> helix_position_offsets,
  helix_is_inactive_rank
- spec_decoding_bool_params -> is_spec_decoding_enabled,
  use_spec_decoding, is_spec_dec_tree
- spec_decoding_tensor_params -> 6 named optional tensors
  (spec_decoding_generation_lengths,
  spec_decoding_position_offsets_for_cpp,
  spec_decoding_packed_mask, spec_decoding_bl_tree_mask_offset,
  spec_decoding_bl_tree_mask,
  spec_bl_tree_first_sparse_mask_offset_kv)

Rename 3 scalar kwargs:
- rotary_embedding_{dim,base,scale_type} -> rope_{dim,base,scale_type}

Order matches the C++ attention() signature param order so positional
dispatch works.

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
…ntion call

Mirror the C++ schema change in the PyTorch backend call site:

Call site (thop.attention kwargs in TrtllmAttention._run):
- Replace 5 sequence kwargs with 16 flat ones, each sourced from a
  named metadata/self attribute.
- Rename 3 scalar kwargs from rotary_embedding_* to rope_*.

Property cleanup on TrtllmAttention:
- Replace the 5 rotary_embedding_* properties (scalar + 2 list
  builders) with 8 rope_* scalar properties — one per flat kwarg.
- The 8 properties are all one-liner accessors over rope_params.

Property cleanup on TrtllmAttentionMetadata:
- Delete helix_tensor_params, spec_decoding_bool_params, and
  spec_decoding_tensor_params — all three were list builders whose
  sole consumer was the thop.attention call site, which now reads
  the underlying fields directly.

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
… + renamed kwargs

Mirror the C++ schema change in the AutoDeploy custom ops that call
thop.attention via positional args.

trtllm_attention.py (single call site + planner state):
- Replace planner's spec_decoding_bool_params/spec_decoding_tensor_params
  list fields with a single use_spec_decoding scalar; the kernel-level
  3-bool + 6-tensor decomposition is now passed directly at the call
  site. Remove the SM-version-dependent list-sizing in reset() and
  the per-batch list mutations in init_spec_decoding.
- Replace the local rotary_embedding_scales / rotary_embedding_max_position_info
  / mla_tensor_params lists with their flat counterparts at the
  positional call site.
- Rename rotary_embedding_* comments to rope_* at the call site.

trtllm_mla.py (4 positional call sites):
- Remove the 3 identical 7-line local-list construction blocks +
  their now-unused sm_version computation.
- Inline the flat constants at each call site.

fuse_rope_into_trtllm_attention.py (internal rope_info dict):
- Rename the "rotary_embedding_dim" key to "rope_dim" and update its
  one reader in trtllm_attention.py. Internal to AutoDeploy; no
  external consumers.

All AutoDeploy modules import cleanly.

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
Add test_no_sequence_kwargs_at_thop_attention_boundary to
test_attention_op_sync.py: parse the void attention(...)
declaration in attentionOp.h and assert no param has an outer
container type (std::vector, c10::ArrayRef, std::array).

The existing per-element source/type/literal sync tests cannot verify
ordinal positions inside a list. Flat named params keep every slot
checkable individually. If a future change tries to introduce a new
sequence kwarg, this test fails with the specific param name and type
and an instruction to flatten it.

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
@yuxianq yuxianq force-pushed the flatten-attn-args branch from cd34033 to 6aafd09 Compare May 28, 2026 08:46
@yuxianq yuxianq marked this pull request as ready for review May 28, 2026 09:43
@yuxianq
Collaborator Author

yuxianq commented May 28, 2026

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #50753 [ run ] triggered by Bot. Commit: 6aafd09 Link to invocation

@yuxianq yuxianq requested review from MrGeva and brb-nv May 28, 2026 09:59
Collaborator

@MrGeva MrGeva left a comment

AD changes LGTM

@tensorrt-cicd
Collaborator

PR_Github #50753 [ run ] completed with state FAILURE. Commit: 6aafd09
/LLM/main/L0_MergeRequest_PR pipeline #40232 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@yuxianq
Collaborator Author

yuxianq commented May 29, 2026

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #50953 [ run ] triggered by Bot. Commit: 6aafd09 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #50953 [ run ] completed with state SUCCESS. Commit: 6aafd09
/LLM/main/L0_MergeRequest_PR pipeline #40410 completed with status: 'SUCCESS'

CI Report

Link to invocation

Collaborator

@yihwang-nv yihwang-nv left a comment

LGTM

@yuxianq yuxianq merged commit 0a19205 into NVIDIA:main May 29, 2026
14 of 16 checks passed
4 participants