[None][refactor] Flatten thop.attention sequence kwargs + rename rotary_embedding_* to rope_* by yuxianq · Pull Request #14569 · NVIDIA/TensorRT-LLM

yuxianq · 2026-05-26T10:59:38Z

Summary

Eliminate every Sequence[...] kwarg from thop.attention and rename the remaining rotary_embedding_* kwargs to rope_* for consistency.

Flattened (5 sequence → 16 scalar/tensor kwargs)

Old (sequence)	New (flat)
`rotary_embedding_scales` (3 doubles)	`rope_scale`, `rope_short_m_scale`, `rope_long_m_scale`
`rotary_embedding_max_position_info` (2 ints)	`rope_max_positions`, `rope_original_max_positions`
`helix_tensor_params` (2 optional tensors)	`helix_position_offsets`, `helix_is_inactive_rank`
`spec_decoding_bool_params` (3 bools)	`is_spec_decoding_enabled`, `use_spec_decoding`, `is_spec_dec_tree`
`spec_decoding_tensor_params` (3 or 6 optional tensors)	6 named optionals; last 3 are `None` on non-SM100

Renamed at the boundary

rotary_embedding_{dim,base,scale_type,scale,short_m_scale,long_m_scale,max_positions,original_max_positions} → rope_{...}.

The internal AttentionOp::mRotaryEmbedding* fields and the rotary_inv_freq / rotary_cos_sin / mrope_* kwargs are out of scope.
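The mapping from the old sequence kwargs to the flat ones can be sketched as a small adapter. This is only an illustration of the table above, not code from the PR: `flatten_legacy_kwargs`, `_FLAT_NAMES`, and `_RENAMED` are hypothetical names, and the six spec-decoding tensor names are taken from the commit messages in this PR.

```python
from typing import Any

# Flat kwarg names, in the order the old sequences packed them (per the PR table).
_FLAT_NAMES: dict[str, tuple[str, ...]] = {
    "rotary_embedding_scales": ("rope_scale", "rope_short_m_scale", "rope_long_m_scale"),
    "rotary_embedding_max_position_info": ("rope_max_positions", "rope_original_max_positions"),
    "helix_tensor_params": ("helix_position_offsets", "helix_is_inactive_rank"),
    "spec_decoding_bool_params": (
        "is_spec_decoding_enabled", "use_spec_decoding", "is_spec_dec_tree"),
    "spec_decoding_tensor_params": (
        "spec_decoding_generation_lengths",
        "spec_decoding_position_offsets_for_cpp",
        "spec_decoding_packed_mask",
        "spec_decoding_bl_tree_mask_offset",
        "spec_decoding_bl_tree_mask",
        "spec_bl_tree_first_sparse_mask_offset_kv",
    ),
}

# Scalar kwargs that are only renamed, not flattened.
_RENAMED = {
    "rotary_embedding_dim": "rope_dim",
    "rotary_embedding_base": "rope_base",
    "rotary_embedding_scale_type": "rope_scale_type",
}


def flatten_legacy_kwargs(legacy: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: translate old-style kwargs into the flat names."""
    flat: dict[str, Any] = {}
    for key, value in legacy.items():
        if key in _FLAT_NAMES:
            names = _FLAT_NAMES[key]
            if len(value) > len(names):
                raise ValueError(f"{key}: expected at most {len(names)} items")
            # A short list (3 of 6 spec-dec tensors on non-SM100) is padded with None.
            padded = list(value) + [None] * (len(names) - len(value))
            flat.update(zip(names, padded))
        else:
            flat[_RENAMED.get(key, key)] = value
    return flat
```

Because every slot now has its own name, a reader of a call site no longer needs to remember which index of a list held which value.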

AutoDeploy

Both auto_deploy/custom_ops/attention/trtllm_attention.py and auto_deploy/custom_ops/mla/trtllm_mla.py call thop.attention via positional args. _GlobalTrtllmPlanner no longer holds spec-dec list buffers; each call site passes the 3 bool + 6 tensor slots inline. The internal rope_info["rotary_embedding_dim"] key in fuse_rope_into_trtllm_attention.py is renamed to rope_dim.

Test

New test_no_sequence_kwargs_at_thop_attention_boundary in test_attention_op_sync.py parses the void attention(...) declaration in attentionOp.h and asserts no std::vector / c10::ArrayRef / std::array outer types. Prevents future sequence-kwarg regressions.
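A minimal sketch of how such a check might work, assuming a simple text scan of the declaration. The helper name `find_sequence_params` and the sample header are hypothetical and are not taken from test_attention_op_sync.py; the real test may parse the header differently.

```python
import re

_SEQUENCE_TYPES = ("std::vector", "c10::ArrayRef", "std::array")

# Hypothetical stand-in for the declaration in attentionOp.h.
SAMPLE_HEADER = """
void attention(torch::Tensor q, std::optional<torch::Tensor> k, int64_t const rope_dim,
    std::vector<double> rotary_embedding_scales, std::optional<std::vector<int64_t>> x);
"""


def find_sequence_params(header_text: str, func_name: str = "attention") -> list[str]:
    """Return the parameters whose outermost type is a sequence container."""
    match = re.search(rf"\bvoid\s+{re.escape(func_name)}\s*\(", header_text)
    if match is None:
        raise ValueError(f"declaration of {func_name} not found")
    # Walk forward to the parenthesis that closes the parameter list.
    depth, start = 1, match.end()
    for i in range(start, len(header_text)):
        if header_text[i] == "(":
            depth += 1
        elif header_text[i] == ")":
            depth -= 1
            if depth == 0:
                params_text = header_text[start:i]
                break
    else:
        raise ValueError("unterminated parameter list")
    # Split on commas that are not nested inside template angle brackets.
    params, current, angle = [], [], 0
    for ch in params_text:
        if ch == "<":
            angle += 1
        elif ch == ">":
            angle -= 1
        if ch == "," and angle == 0:
            params.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    if "".join(current).strip():
        params.append("".join(current).strip())
    # Only the outer type counts, so std::optional<std::vector<...>> is not flagged.
    return [p for p in params if re.sub(r"^const\s+", "", p).startswith(_SEQUENCE_TYPES)]
```

On the sample header this flags only `std::vector<double> rotary_embedding_scales`; a test built on it would assert that the returned list is empty and report the offending parameter otherwise.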

Test plan

pre-commit run on all 6 touched files — green
pytest test_attention_op_sync.py — 6/6 pass (5 existing + 1 new)
Incremental C++ rebuild on H100 (ipp2-1776) — clean
PyTorch backend functional: test_attention_mla -k v1_kv_cache on H100 — 40 passed / 2 skipped / 0 failed
AutoDeploy unit tests: test_trtllm_attention_op.py + test_trtllm_mla_op.py on H100 — 68/68 pass
CI with AutoDeploy stages

Summary by CodeRabbit

Refactor
- Restructured attention operation parameters for improved clarity and maintainability (layer indexing, RoPE configurations, helix and speculative decoding inputs refactored from vector-based to explicit parameters).
Tests
- Added comprehensive validation test for attention operation parameter synchronization between Python and C++ layers.

yuxianq · 2026-05-26T10:59:46Z

/bot run --disable-fail-fast --add-multi-gpu-test --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

coderabbitai · 2026-05-26T11:10:10Z

📝 Walkthrough


This PR refactors the attention operation API to eliminate grouped/vectorized parameter passing. Layer indices are renamed for clarity, rotary embeddings become explicit rope scalars, helix and spec-decoding parameters transition from vector containers to explicit optional tensors, and a new sync test validates Python-C++ parameter binding alignment across the boundary.

Changes

Attention Parameter API Refactoring: From Grouped Vectors to Explicit Scalars

Layer / File(s)	Summary
C++ API contract: header and nanobind binding `cpp/tensorrt_llm/thop/attentionOp.h`, `cpp/tensorrt_llm/nanobind/thop/bindings.cpp`	The `torch_ext::attention` signature is refactored: `layer_idx` → `local_layer_idx`; rotary-embedding vectors (`rotary_embedding_dim/base/scales/max_position_info`) → rope scalars (`rope_dim/base/scale_type/scale/short_m_scale/long_m_scale/max_positions/original_max_positions`); helix tensor vector → explicit tensors (`helix_position_offsets`, `helix_is_inactive_rank`); spec-decoding bool/tensor vectors → explicit flags (`is_spec_decoding_enabled`, `use_spec_decoding`, `is_spec_dec_tree`) and individual tensors.
C++ implementation: parameter wiring and runner signatures `cpp/tensorrt_llm/thop/attentionOp.cpp` (lines 359-368, 419-435, 900-932, 941-1051, 1144-1220, 1235-1245)	Update `RunnerBase::run`, `Runner::run`, and `attention()` to accept and wire explicit parameters; set `mLayerIdx` from `local_layer_idx`; assign `mRotaryEmbedding*` fields from rope scalars; set spec-decoding flags directly; update tracing/logging to use `local_layer_idx`.
C++ helix parameter extraction and wiring `cpp/tensorrt_llm/thop/attentionOp.cpp` (lines 472-520, 737-754)	Refactor helix extraction in MLA path and `extractHelixParams` helper to read directly from `helix_position_offsets` and `helix_is_inactive_rank` optional tensors instead of vector indexing.
C++ spec-decoding tensor handling in generation stage `cpp/tensorrt_llm/thop/attentionOp.cpp` (lines 810-839, 1071-1073)	Rewrite generation-stage logic to consume explicit spec-decoding tensors directly with validation for "tllm-gen" path; add rank check for position offsets; set spec-decoding flags from explicit booleans.
MLA KV cache quantization: parameter specialization `cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp`	Drop `kv_scale_orig_quant` from `loadPagedKVCacheForMLA` and `loadChunkedKVCacheForMLA`; drop `kv_scale_quant_orig` from `MLARopeAppendPagedKVAssignQ`; update Torch binding schemas and pointer initialization logic to match.
Python interface contract: AttentionForwardArgs and sparse args `tensorrt_llm/_torch/attention_backend/interface.py`	Extend `AttentionForwardArgs` with KV scale quantization tensors (`kv_scale_orig_quant`, `kv_scale_quant_orig`), control flags (`is_fused_qkv`, `update_kv_cache`), sparse attention config, and `mask_type` property mapping enums to C++ constants; add `AttentionSparseArgs` dataclass.
TrtllmAttentionMetadata: derived properties for C++ dispatch `tensorrt_llm/_torch/attention_backend/trtllm.py` (lines 150-175, 221-225)	Add `effective_workspace`, `spec_decoding_position_offsets_for_cpp` (with 1D reshaping), `max_context_length` properties and `use_paged_context_fmha` field to support refactored keyword-based `thop.attention` dispatch.
TrtllmAttention: rope and skip-softmax properties `tensorrt_llm/_torch/attention_backend/trtllm.py` (lines 1287-1333)	Add read-only properties delegating to `rope_params` and `sparse_attention_config` for rope dimensions, base, scale types, short/long scales, max positions, and skip-softmax thresholds; remove `_get_mask_type` helper.
TrtllmAttention._run refactoring: keyword-based dispatch `tensorrt_llm/_torch/attention_backend/trtllm.py` (lines 90bfaf0244f1, 1139-1146, 1349-1410, 1420-1566)	Simplify `_run` to `(q, k, v, metadata, forward_args)`, cache `local_layer_idx`, prime RoPE and metadata, and rewrite `thop.attention` call with explicit keyword arguments sourced from new metadata properties and module state instead of positional/older arguments.
TrtllmAttention.forward: output creation and sparse args setup `tensorrt_llm/_torch/attention_backend/trtllm.py` (lines 1593-1669)	Set `metadata.use_paged_context_fmha`, create outputs when needed, compute `is_fused_qkv`/`update_kv_cache` flags, build sparse args conditionally, update Blackwell first-sparse offsets, and call simplified `_run`.
TrtllmAttention MLA KV-scale parameter updates `tensorrt_llm/_torch/attention_backend/trtllm.py` (lines 1733, 1778, 1821, 1935-1936)	Update all MLA kernel invocations to pass `None` for KV scale arguments, aligning with the refactored parameter contract.
FlashInferTrtllmGenAttention: simplified rope and output scale `tensorrt_llm/_torch/attention_backend/trtllm_gen.py` (lines 622-626, 980, 1089)	Simplify `_get_mrope_rotary_cos_sin` to directly return `forward_args.mrope_rotary_cos_sin`; remove `_get_attention_output_orig_quant` helper and use `forward.out_scale` directly in context and generation paths.
Attention module: KV scale and output scale restructuring `tensorrt_llm/_torch/modules/attention.py` (lines 715, 744-761, 770-771)	Replace `kv_scales_sf`/`kv_scales_sf_inv` with `kv_scale_orig_quant`/`kv_scale_quant_orig` sourced from `inv_kv_scales`/`kv_scales`; compute single `out_scale` based on quantize-output flag; pass new field names into `AttentionForwardArgs`.
DSATrtllmAttention: MLA RoPE append parameter update `tensorrt_llm/_torch/attention_backend/sparse/dsa.py` (line 2535)	Update `mla_rope_append_paged_kv_assign_q` call to pass `None` for KV scale arguments, matching simplified signature.
Auto-Deploy attention: rope and spec-decoding parameter rewiring `tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py`	Replace `rotary_embedding_` vector construction with explicit `rope_` scalars; refactor spec-decoding from pre-packed lists to per-batch `use_spec_decoding` and individual tensors; switch `layer_idx` → `local_layer_idx`.
Auto-Deploy MLA: rope scalar parameters and explicit helix/spec-decoding `tensorrt_llm/_torch/auto_deploy/custom_ops/mla/trtllm_mla.py`	Eliminate `get_sm_version()`-dependent vector construction across fresh/cached/chunked prefill and decode; use direct YaRN rope scalars and explicit per-parameter null/false/tensor values in all `thop.attention` calls.
RoPE dimension parameter rename: rotary_embedding_dim → rope_dim `tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_into_trtllm_attention.py`	Rename dimension parameter in `_convert_to_thop_cos_sin` helper and metadata dictionary key in `_build_rope_info` for consistency with new rope parameter naming scheme.
TrtllmAttention: import and constant setup `tensorrt_llm/_torch/attention_backend/trtllm.py` (lines 5, 25-50)	Remove unused `replace` import; add environment-controlled feature flags and sync-test configuration constants (`_TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION`, `_THOP_EXCLUDED_FIELDS`, `_THOP_LITERALS`).
Test suite: Python-C++ parameter binding validation `tests/unittest/_torch/attention_backend/test_attention_op_sync.py`	Add comprehensive unit test that parses C++ header and Python AST to validate: kwarg set equivalence, unique attribute resolution, literal allowlists, field consumption tracking, and absence of sequence-typed C++ parameters in the `thop.attention` binding.
FlashMLA kernel: null-pointer tolerance for descale factors `cpp/tensorrt_llm/kernels/flashMLA/flash_fwd_mla_kernel.h` (lines 689-690)	Update `flash_fwd_splitkv_mla_kernel` to load `descale_q` and `descale_k` conditionally from non-null pointers, defaulting to `1.f` when pointer is null.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

NVIDIA/TensorRT-LLM#14275: Updates the same attention() C++ entry point and nanobind binding in attentionOp.h/cpp and bindings.cpp, removing sink_token_length while this PR restructures rotary/helix/spec-decoding parameters.

Suggested reviewers

liji-nv
brb-nv
nv-guomingz
wenmingw
hyukn

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 53.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly describes the main refactoring work: flattening sequence keyword arguments and renaming rotary_embedding_* parameters to rope_*.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description check	✅ Passed	The PR description provides a clear summary of changes, includes a detailed comparison table, explains the refactor scope, test coverage, and references a complete test plan with all validation steps.


coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)

cpp/tensorrt_llm/kernels/flashMLA/flash_fwd_mla_kernel.h (1)
24-24: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Update NVIDIA copyright year on this modified file.

This header still ends at 2024 even though this file is being modified in this PR; please bump it to the latest modification year.
Proposed fix
- * Copyright (c) 2022-2024, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION.  All rights reserved.
As per coding guidelines **/*.{cpp,cc,h,hpp,py,cu,cuh}: Include NVIDIA copyright header on ALL new files; update year on modified files.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tensorrt_llm/kernels/flashMLA/flash_fwd_mla_kernel.h` at line 24, Update
the copyright year in the file header comment (flash_fwd_mla_kernel.h) from 2024
to the current modification year (2026); locate the top-of-file copyright block
and bump the year range (e.g., "2022-2026") so the header matches the coding
guideline for modified files.
cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp (1)
1-15: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Update the NVIDIA copyright year.

This file is modified in this PR, but the header still ends at 2025.

As per coding guidelines, **/*.{cpp,cc,h,hpp,py,cu,cuh}: Include NVIDIA copyright header on ALL new files; update year on modified files.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp` around lines 1 - 15, Update the
NVIDIA copyright header in cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp to include
the current year (change "2020-2025" to "2020-2026") so the file header reflects
the modification year per the project's copyright guidelines.

🧹 Nitpick comments (3)

tests/unittest/_torch/attention_backend/test_attention_op_sync.py (3)
338-344: 💤 Low value

PEP 604 union syntax (X | Y) won't be recognized.

With Python 3.10+, typing.get_origin(X | Y) returns types.UnionType, not typing.Union. This means Optional fields declared with X | None syntax will classify as "unknown" instead of unwrapping to the inner type—causing the type check to be skipped rather than validated.

Since the fallback is safe (skip rather than false-positive), this is low priority but worth noting for future robustness.
♻️ Suggested fix to handle both union styles
+import types
+
 def _python_category(py_type) -> str:
     ...
     origin = typing.get_origin(py_type)
-    if origin is typing.Union:
+    if origin is typing.Union or origin is types.UnionType:
         args = [a for a in typing.get_args(py_type) if a is not type(None)]
         if len(args) == 1:
             return _python_category(args[0])
         return "unknown"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/attention_backend/test_attention_op_sync.py` around
lines 338 - 344, The union-unwrapping in _python_category currently only checks
for typing.Union so it misses PEP 604 unions (X | Y) whose origin is
types.UnionType; update the logic that inspects origin =
typing.get_origin(py_type) to treat both typing.Union and types.UnionType the
same (e.g., import types and check if origin in (typing.Union, types.UnionType)
or isinstance(origin, types.UnionType) ), then extract typing.get_args(py_type)
and unwrap the None case exactly as done now so X | None and Optional[X] both
return _python_category(args[0]).
463-493: 💤 Low value

Missing return type annotation.

Per coding guidelines, all Python functions should have return type annotations.
♻️ Add return type
-def _verify_consumed(cls, chains: set[tuple[str, ...]], excluded=frozenset()):
+def _verify_consumed(cls, chains: set[tuple[str, ...]], excluded=frozenset()) -> None:
As per coding guidelines: "Always annotate Python functions with return type; use None if the function does not return anything".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/attention_backend/test_attention_op_sync.py` around
lines 463 - 493, The function _verify_consumed is missing a return type
annotation; update its signature to include an explicit return type of None
(i.e., -> None) since it only asserts and doesn't return a value. Locate
_verify_consumed (which references fields, dataclasses.is_dataclass, and
_self_attrs_in_property) and add the return annotation without changing behavior
or logic; no other code changes are needed.
1-551: QA list updates are not required.

This is a static/AST-only unit test under tests/unittest/. It validates Python-C++ binding alignment without runtime execution. No QA test list additions are needed—this test will run with the standard pytest unit test suite and doesn't require multi-GPU scheduling or end-to-end integration coverage.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/unittest/_torch/attention_backend/test_attention_op_sync.py` around
lines 1 - 551, Single-line summary: this static AST-only unit test requires no
QA list additions. Fix: remove any edits that add this test to QA/automation
lists and revert CI metadata changes; ensure the PR/body and any test
registration only include the new test file name (test_attention_op_sync.py)
without placing it in multi-GPU or E2E QA lists, and add a short note in the PR
description that the test is AST-only and runs under regular pytest so no QA
test-list update is necessary (refer to TrtllmAttention._run and
AttentionForwardArgs when explaining if needed).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp`:
- Around line 216-220: In loadChunkedKVCacheForMLA(), reintroduce the FP8-only
guard so that when hasKvCacheQuant() is true but hasFp8KvCache() is false the
function rejects/returns an error instead of falling through to the non-FP8
template path; update the control flow around the kv_scale_quant_orig_ptr setup
(and related logic that chooses the template specialization) to check
hasFp8KvCache() and short-circuit with an error/log if FP8 support is missing,
referencing loadChunkedKVCacheForMLA(), hasKvCacheQuant(), and hasFp8KvCache()
to locate and fix the branch.

In `@tensorrt_llm/_torch/attention_backend/trtllm.py`:
- Around line 161-168: spec_decoding_position_offsets_for_cpp currently reshapes
the raw spec_decoding_position_offsets buffer which can expose the full backing
width instead of the live runtime width; use the precomputed C++ view computed
by update_position_offsets_for_cpp (spec_decoding_position_offsets_cpp) instead.
Replace the logic in spec_decoding_position_offsets_for_cpp to return
spec_decoding_position_offsets_cpp when available (falling back to reshaping
spec_decoding_position_offsets only if the _cpp view is None) so the C++ kernel
receives the correct (max_num_requests, runtime_query_len) shape rather than
buf_dim; reference symbols: spec_decoding_position_offsets_for_cpp,
spec_decoding_position_offsets_cpp, update_position_offsets_for_cpp,
spec_decoding_position_offsets, max_num_requests.



ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: de0ebf46-86ea-4d79-b6d5-4ee18b1136b5

📥 Commits

Reviewing files that changed from the base of the PR and between 1f8312d and cd34033.

📒 Files selected for processing (14)

cpp/tensorrt_llm/kernels/flashMLA/flash_fwd_mla_kernel.h
cpp/tensorrt_llm/nanobind/thop/bindings.cpp
cpp/tensorrt_llm/thop/attentionOp.cpp
cpp/tensorrt_llm/thop/attentionOp.h
cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp
tensorrt_llm/_torch/attention_backend/interface.py
tensorrt_llm/_torch/attention_backend/sparse/dsa.py
tensorrt_llm/_torch/attention_backend/trtllm.py
tensorrt_llm/_torch/attention_backend/trtllm_gen.py
tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py
tensorrt_llm/_torch/auto_deploy/custom_ops/mla/trtllm_mla.py
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_into_trtllm_attention.py
tensorrt_llm/_torch/modules/attention.py
tests/unittest/_torch/attention_backend/test_attention_op_sync.py

…ature

Replace 5 list-typed parameters with their flat named components in thop::attention() and the 2 internal Runner virtual signatures (attentionOp.cpp lines ~361 and ~423). Each list was a fixed-arity bundle whose ordinal positions encoded semantic meaning; flattening eliminates the bookkeeping and lets the static sync test verify each element by name.

Flattenings:
- rotary_embedding_scales (3 doubles) -> rotary_embedding_scale, rotary_embedding_short_m_scale, rotary_embedding_long_m_scale
- rotary_embedding_max_position_info (2 ints) -> rotary_embedding_max_positions, rotary_embedding_original_max_positions
- helix_tensor_params (2 optional tensors) -> helix_position_offsets, helix_is_inactive_rank
- spec_decoding_bool_params (3 bools) -> is_spec_decoding_enabled, use_spec_decoding, is_spec_dec_tree
- spec_decoding_tensor_params (3 or 6 optional tensors) -> spec_decoding_generation_lengths, spec_decoding_position_offsets_for_cpp, spec_decoding_packed_mask, spec_decoding_bl_tree_mask_offset, spec_decoding_bl_tree_mask, spec_bl_tree_first_sparse_mask_offset_kv (last 3 None on non-SM100)

Body cleanups:
- Drop the [i] unpacking at the start of attention() (now params arrive named).
- Spec-dec branch: replace size()==3/6 with named has_value() checks gated on isSM100Family().
- extractHelixParams lambda now captures the 2 optional tensors directly instead of indexing the vector.

The nanobind binding (bindings.cpp) and Python callers are updated in follow-up commits; this commit alone does not link cleanly. The intermediate state is for review-grouping only; the PR merges as a single squashed change.

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>

…n C++ surface

Apply the rope_* prefix consistently to all 8 rotary embedding params visible at the thop.attention boundary:
- rotary_embedding_dim -> rope_dim
- rotary_embedding_base -> rope_base
- rotary_embedding_scale_type -> rope_scale_type
- rotary_embedding_scale -> rope_scale
- rotary_embedding_short_m_scale -> rope_short_m_scale
- rotary_embedding_long_m_scale -> rope_long_m_scale
- rotary_embedding_max_positions -> rope_max_positions
- rotary_embedding_original_max_positions -> rope_original_max_positions

Touches only the public C++ surface (attention() params in attentionOp.cpp/.h). The AttentionOp class member fields (mRotaryEmbedding*) keep their existing names; they are internal class state, out of scope. rotary_inv_freq / rotary_cos_sin / mrope_* keep their current names since they use different prefixes and were not asked to be renamed.

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>

…ntion kwargs

Mirror the C++ signature change in commits b7a5e4d and 071622f into the m.def("attention", ...) nb::arg chain.

Replace 5 sequence kwargs with their 16 flat counterparts:
- rotary_embedding_scales -> rope_scale, rope_short_m_scale, rope_long_m_scale
- rotary_embedding_max_position_info -> rope_max_positions, rope_original_max_positions
- helix_tensor_params -> helix_position_offsets, helix_is_inactive_rank
- spec_decoding_bool_params -> is_spec_decoding_enabled, use_spec_decoding, is_spec_dec_tree
- spec_decoding_tensor_params -> 6 named optional tensors (spec_decoding_generation_lengths, spec_decoding_position_offsets_for_cpp, spec_decoding_packed_mask, spec_decoding_bl_tree_mask_offset, spec_decoding_bl_tree_mask, spec_bl_tree_first_sparse_mask_offset_kv)

Rename 5 scalar kwargs:
- rotary_embedding_{dim,base,scale_type} -> rope_{dim,base,scale_type}

Order matches the C++ attention() signature param order so positional dispatch works.

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>

…ntion call

Mirror the C++ schema change in the PyTorch backend call site:

Call site (thop.attention kwargs in TrtllmAttention._run):
- Replace 5 sequence kwargs with 16 flat ones, each sourced from a named metadata/self attribute.
- Rename 5 scalar kwargs from rotary_embedding_* to rope_*.

Property cleanup on TrtllmAttention:
- Replace the 5 rotary_embedding_* properties (scalar + 2 list builders) with 8 rope_* scalar properties, one per flat kwarg.
- The 8 properties are all one-liner accessors over rope_params.

Property cleanup on TrtllmAttentionMetadata:
- Delete helix_tensor_params, spec_decoding_bool_params, and spec_decoding_tensor_params; all three were list builders whose sole consumer was the thop.attention call site, which now reads the underlying fields directly.

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>
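The "one-liner accessors over rope_params" pattern might look roughly like the sketch below. Everything here is an assumption made for illustration: `RopeParamsSketch`, its field names and defaults, and `AttentionSketch` are hypothetical and do not reproduce the real `rope_params` object or `TrtllmAttention`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RopeParamsSketch:
    """Hypothetical stand-in for the rope_params object; field names are assumed."""
    dim: int = 0
    base: float = 10000.0
    scale_type: int = 0
    scale: float = 1.0
    short_m_scale: float = 1.0
    long_m_scale: float = 1.0
    max_positions: int = 1024
    original_max_positions: int = 1024


# The 8 flat kwarg names listed in the PR description.
ROPE_KWARGS = (
    "rope_dim", "rope_base", "rope_scale_type", "rope_scale",
    "rope_short_m_scale", "rope_long_m_scale",
    "rope_max_positions", "rope_original_max_positions",
)


class AttentionSketch:
    """Hypothetical module exposing one read-only property per flat kwarg."""

    def __init__(self, rope_params: RopeParamsSketch) -> None:
        self.rope_params = rope_params

    def rope_kwargs(self) -> dict:
        # Collect the flat values by name, ready to splat into a keyword call.
        return {name: getattr(self, name) for name in ROPE_KWARGS}


def _accessor(field_name: str) -> property:
    # Each property simply forwards to the matching rope_params field.
    return property(lambda self: getattr(self.rope_params, field_name))


for _name in ROPE_KWARGS:
    setattr(AttentionSketch, _name, _accessor(_name.removeprefix("rope_")))
```

With one property per kwarg, a static check can match each keyword at the call site to exactly one attribute, which is what the list-building properties prevented.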

… + renamed kwargs

Mirror the C++ schema change in the AutoDeploy custom ops that call thop.attention via positional args.

trtllm_attention.py (single call site + planner state):
- Replace planner's spec_decoding_bool_params/spec_decoding_tensor_params list fields with a single use_spec_decoding scalar; the kernel-level 3-bool + 6-tensor decomposition is now passed directly at the call site. Remove the SM-version-dependent list-sizing in reset() and the per-batch list mutations in init_spec_decoding.
- Replace the local rotary_embedding_scales / rotary_embedding_max_position_info / mla_tensor_params lists with their flat counterparts at the positional call site.
- Rename rotary_embedding_* comments to rope_* at the call site.

trtllm_mla.py (4 positional call sites):
- Remove the 3 identical 7-line local-list construction blocks + their now-unused sm_version computation.
- Inline the flat constants at each call site.

fuse_rope_into_trtllm_attention.py (internal rope_info dict):
- Rename the "rotary_embedding_dim" key to "rope_dim" and update its one reader in trtllm_attention.py. Internal to AutoDeploy; no external consumers.

All AutoDeploy modules import cleanly.

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>

Add test_no_sequence_kwargs_at_thop_attention_boundary to test_attention_op_sync.py: parse the void attention(...) declaration in attentionOp.h and assert no param has an outer container type (std::vector, c10::ArrayRef, std::array).

The existing per-element source/type/literal sync tests cannot verify ordinal positions inside a list. Flat named params keep every slot checkable individually. If a future change tries to introduce a new sequence kwarg, this test fails with the specific param name and type and an instruction to flatten it.

Signed-off-by: Yuxian Qiu <142763828+yuxianq@users.noreply.github.com>

yuxianq · 2026-05-28T09:44:29Z

/bot run --disable-fail-fast --add-multi-gpu-test

tensorrt-cicd · 2026-05-28T09:50:14Z

PR_Github #50753 [ run ] triggered by Bot. Commit: 6aafd09 Link to invocation

MrGeva

AD changes LGTM

tensorrt-cicd · 2026-05-28T23:03:23Z

PR_Github #50753 [ run ] completed with state FAILURE. Commit: 6aafd09
/LLM/main/L0_MergeRequest_PR pipeline #40232 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

yuxianq · 2026-05-29T03:29:23Z

/bot run --disable-fail-fast --add-multi-gpu-test

tensorrt-cicd · 2026-05-29T03:34:56Z

PR_Github #50953 [ run ] triggered by Bot. Commit: 6aafd09 Link to invocation

tensorrt-cicd · 2026-05-29T07:46:59Z

PR_Github #50953 [ run ] completed with state SUCCESS. Commit: 6aafd09
/LLM/main/L0_MergeRequest_PR pipeline #40410 completed with status: 'SUCCESS'

CI Report

Link to invocation

yihwang-nv

LGTM

yuxianq requested review from a team as code owners May 26, 2026 10:59

yuxianq requested a review from QiJune May 26, 2026 10:59

yuxianq marked this pull request as draft May 26, 2026 11:04

yuxianq removed the request for review from QiJune May 26, 2026 11:05

coderabbitai Bot reviewed May 26, 2026


Comment thread cpp/tensorrt_llm/thop/mlaPreprocessOp.cpp

Comment thread tensorrt_llm/_torch/attention_backend/trtllm.py

yuxianq added 6 commits May 28, 2026 08:43

yuxianq force-pushed the flatten-attn-args branch from cd34033 to 6aafd09 Compare May 28, 2026 08:46

yuxianq marked this pull request as ready for review May 28, 2026 09:43

yuxianq requested review from MrGeva and brb-nv May 28, 2026 09:59

MrGeva approved these changes May 28, 2026


yihwang-nv approved these changes May 29, 2026


yuxianq merged commit 0a19205 into NVIDIA:main May 29, 2026
14 of 16 checks passed
