
[#11823][feat] AutoDeploy MLA with PWCG support on Deepseek R1#13497

Merged
MrGeva merged 1 commit into NVIDIA:main from nv-auto-deploy:eg/pwcg_tmla
May 4, 2026

Conversation

@MrGeva
Collaborator

@MrGeva MrGeva commented Apr 27, 2026

  • piecewise_utils.py: re-classify auto_deploy::trtllm_mla_prepare_metadata from _METADATA_PREP_OPS to _PERSISTENT_BUFFER_OPS — its output (planner.block_offsets) lives at a stable persistent address.

  • torch_cudagraph.py + piecewise_runner.py (finalize_capture hook): keep dynamic-op output buffers strong-ref'd through the entire bucket's split-graph capture; without this the shared graph pool reuses captured slots across segments
    mid-capture and replay reads garbage (verified — MMLU collapsed from 87.33 → 35.70 when we tried immediate weak-refs).

  • trtllm_mla.py: add the out= kwarg to trtllm_mla_with_cache and plumb it through _mla_with_cache_impl to honor the DynamicOpWrapper contract (write real-token rows into the bucket-sized buffer, zero the padded tail, return the
    empty alias). A sketch of this contract follows this list.

  • Split the AutoDeploy MLA thop.attention workspace into separate eager (prefill) and captured (decode) tensors so the C++ side's resize_() can grow the eager one freely without invalidating CUDA-graph-captured pointers — replacing the prior 512MB static reserve which failed on large setups.

  • examples/auto_deploy/model_registry/configs/deepseek-r1.yaml: enable PWCG (buckets [256..8192], max_num_tokens=15360).
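
A minimal sketch of the out= contract referenced above (illustrative only: the function name and the stand-in compute are not the PR's actual kernel; only the fill-real-rows / zero-tail / empty-alias shape follows the bullet):

import torch

def dynamic_op_with_out(x: torch.Tensor, num_real_tokens: int,
                        out: torch.Tensor | None = None) -> torch.Tensor:
    result = x[:num_real_tokens] * 2.0       # stand-in for the real MLA kernel output
    if out is None:                          # eager path: no bucket buffer supplied
        return result
    out[:num_real_tokens].copy_(result)      # write real-token rows into the bucket buffer
    out[num_real_tokens:].zero_()            # zero the padded tail of the bucket
    return out.new_empty(0)                  # return an empty alias per the wrapper contract

# Usage: a bucket sized for 8 tokens currently holding 5 real ones.
bucket = torch.empty(8, 4)
ret = dynamic_op_with_out(torch.randn(8, 4), 5, out=bucket)
assert ret.numel() == 0 and bucket[5:].abs().sum() == 0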

Summary by CodeRabbit

  • New Features

    • Added TRT-LLM MLA (Multi-Head Latent Attention) backend with paged KV cache support.
    • Introduced RoPE fusion optimization into TRT-LLM MLA attention path.
    • Added explicit V tensor stride override for improved non-contiguous tensor handling.
  • Bug Fixes

    • Fixed extend request routing in decode-only detection logic.
    • Improved output buffer lifecycle management during CUDA graph capture.
  • Tests

    • Added coverage for non-contiguous tensor view handling and RMSNorm operations.
    • Added tests for decode-only detection and output buffer finalization.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@MrGeva MrGeva force-pushed the eg/pwcg_tmla branch 3 times, most recently from 3306064 to c9fd46a on April 28, 2026 14:40
@MrGeva MrGeva changed the title from "Eg/pwcg tmla" to "[#11823][feat] AutoDeploy PWCG fixes to support MLA and Deepseek R1" Apr 28, 2026
@MrGeva MrGeva marked this pull request as ready for review April 28, 2026 15:22
@MrGeva MrGeva requested review from a team as code owners April 28, 2026 15:22
@MrGeva MrGeva requested a review from Fridah-nv April 28, 2026 15:22
@MrGeva
Collaborator Author

MrGeva commented Apr 28, 2026

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@coderabbitai
Contributor

coderabbitai Bot commented Apr 28, 2026

📝 Walkthrough


Adds V tensor stride parameters to FMHA operations for handling non-contiguous tensor layouts, introduces TensorRT-LLM MLA cached attention backend with paged KV cache support, refactors CUDA graph capture to finalize dynamic output references, and adds RoPE fusion transform for MLA with supporting configs and tests.

Changes

  • FMHA V Stride Parameters (cpp/tensorrt_llm/common/attentionOp.h, cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fused_multihead_attention_common.h, cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmhaRunner.cpp, cpp/tensorrt_llm/thop/attentionOp.cpp): Added a v_stride_in_bytes parameter to attention context enqueue operations, enabling explicit control over the V tensor token stride for both contiguous and non-contiguous layouts. Existing logic falls back to computed defaults when the stride is not provided.
  • TensorRT-LLM MLA Backend (tensorrt_llm/_torch/auto_deploy/custom_ops/mla/trtllm_mla.py, tensorrt_llm/_torch/auto_deploy/custom_ops/mla/__init__.py): Introduces a complete TRT-LLM MLA cached attention backend with paged latent KV cache, block-based offset tracking, RoPE/identity cos-sin tables, mixed prefill/decode handling, and workspace management. Registers custom ops auto_deploy::trtllm_mla_prepare_metadata and auto_deploy::trtllm_mla_with_cache with fake implementations for tracing.
  • MLA Constant Extraction (tensorrt_llm/_torch/auto_deploy/custom_ops/mla/flashinfer_mla.py, tensorrt_llm/_torch/auto_deploy/custom_ops/mla/torch_backend_mla.py): Updated scale constant retrieval to use the extract_op_args method instead of direct kwargs indexing, for consistency across MLA backends.
  • CUDA Graph Capture Refactoring (tensorrt_llm/_torch/auto_deploy/compile/torch_cudagraph.py, tensorrt_llm/_torch/auto_deploy/compile/piecewise_runner.py, tensorrt_llm/_torch/auto_deploy/compile/piecewise_utils.py): Enhanced piecewise capture to track static runners, finalize captures via try/finally, and retain dynamic output buffer references during capture, converting them to weak refs after finalization. Tightened decode-only routing to check both prefill and extend counts. Extended op classifications to include TRT-LLM MLA cached ops.
  • RoPE Fusion Transform (tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_mla.py): Introduces a FuseRopeIntoTrtllmMLA transform that fuses RoPE into TRT-LLM MLA by rewiring attention nodes, materializing cos/sin tensors from graph buffers or config, and reversing the RoPE de-interleave on MLA weights for tensor-parallel compatibility.
  • Configuration Updates (examples/auto_deploy/model_registry/configs/deepseek-r1.yaml, tensorrt_llm/_torch/auto_deploy/config/default.yaml, examples/auto_deploy/create_standalone_package.py): Added a new YAML model configuration with increased token limits, FP8 KV cache, CUDA graph overrides, MLA attention transforms, and piecewise compilation breakpoints. Introduced a disabled-by-default RoPE fusion transform. Updated the test exclusion list.
  • Model Factory Updates (tensorrt_llm/_torch/auto_deploy/models/hf.py): Changed get_cache_config_updates to only apply the kv_cache_dtype override when it is explicitly provided in the quantization config, allowing a user-specified kv_cache_config.dtype to remain effective when not overridden.
  • Accuracy References (tests/integration/defs/accuracy/references/gsm8k.yaml, tests/integration/defs/accuracy/references/mmlu.yaml): Added accuracy reference entries for DeepSeek-R1 with FP8 KV cache quantization, recording expected accuracy values for validation.
  • Test Additions (tests/unittest/auto_deploy/singlegpu/compile/test_captured_graph.py, tests/unittest/auto_deploy/singlegpu/custom_ops/normalization/test_triton_rms_norm.py, tests/unittest/auto_deploy/singlegpu/smoke/test_ad_trtllm_serve.py): Added test coverage for decode-only heuristics with extend requests, the dynamic output buffer finalization lifecycle, non-contiguous RMSNorm inputs, and runtime model generation for smoke testing.

Sequence Diagram(s)

sequenceDiagram
    participant Frontend as Frontend<br/>(PyTorch)
    participant Prepare as Metadata<br/>Preparation
    participant Cache as Cache<br/>Management
    participant KV as KV<br/>Processing
    participant Attention as THOP<br/>Attention

    rect rgba(100, 150, 200, 0.5)
    Note over Frontend,Attention: TRT-LLM MLA Prefill (Fresh or Cached)
    Frontend->>Prepare: prepare_trtllm_mla_metadata<br/>(batch_info, cache locations)
    Prepare->>Cache: compute block_offsets<br/>block_ids_per_seq
    Cache->>KV: RoPE/append new tokens<br/>to paged cache
    KV->>KV: reload [past+new] KV
    KV->>KV: project compressed KV
    KV->>Attention: call thop.attention<br/>(with latent_cache)
    end

    rect rgba(150, 200, 100, 0.5)
    Note over Frontend,Attention: TRT-LLM MLA Decode
    Frontend->>Prepare: prepare_trtllm_mla_metadata<br/>(batch_info, cache locations)
    Prepare->>Cache: compute block_offsets
    Cache->>KV: mla_rope_generation<br/>write decoded tokens
    KV->>Attention: call thop.attention<br/>(generation mode)
    Attention->>KV: project from latent space<br/>back to v_head_dim
    end

    rect rgba(200, 150, 100, 0.5)
    Note over Frontend,Attention: Mixed Prefill+Decode (In-Order Slices)
    Frontend->>Frontend: auto_deploy::trtllm_mla_with_cache<br/>processes prefill and decode slices
    Frontend->>Prepare: for each slice:<br/>prepare metadata
    Prepare->>Attention: execute appropriate<br/>prefill or decode path
    Attention->>Frontend: return output (reuse buffer if provided)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Suggested reviewers

  • symphonylyh
  • liji-nv
  • niukuo
  • poweiw
  • hchings
  • govind-ramnarayan
  • yuxianq
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 63.04%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Description check ⚠️ Warning: the PR description lacks a clear title following the required template (missing JIRA/issue ticket and type identifier), and the Description and Test Coverage sections are empty. Resolution: add a proper PR title in the format '[JIRA/Issue/None][type] Summary', complete the Description section explaining the issue and solution, and document relevant test coverage for the changes.
✅ Passed checks (3 passed)
  • Linked Issues check ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: check skipped because no linked issues were found for this pull request.
  • Title check ✅ Passed: the title accurately describes the main change, adding AutoDeploy MLA support with piecewise CUDA graph (PWCG) on DeepSeek R1, which matches the primary objective of enabling the MLA backend with PWCG support.




Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fused_multihead_attention_common.h (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Update the copyright year on this modified file.

Line 2 still ends at 2025 even though this file is modified in this PR.

📝 Suggested patch
- * Copyright (c) 2020-2025, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2020-2026, NVIDIA CORPORATION.  All rights reserved.

As per coding guidelines, "Include NVIDIA copyright header on all new files; update year on modified files".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fused_multihead_attention_common.h`
at line 2, Update the copyright year range in the file header string "Copyright
(c) 2020-2025, NVIDIA CORPORATION.  All rights reserved." to include the current
modification year (e.g., change 2025 to 2026) so the header on
fused_multihead_attention_common.h reflects the file was modified in this PR.
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmhaRunner.cpp (1)

2-2: ⚠️ Potential issue | 🟡 Minor

Update the file header year to 2026.

This modified file still shows 2025 in the copyright line.

📝 Suggested patch
- * Copyright (c) 2020-2025, NVIDIA CORPORATION.  All rights reserved.
+ * Copyright (c) 2020-2026, NVIDIA CORPORATION.  All rights reserved.

As per coding guidelines, "Include NVIDIA copyright header on all new files; update year on modified files".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmhaRunner.cpp` at
line 2, Update the file header copyright year from 2025 to 2026 in the
fmhaRunner.cpp file header; locate the top-of-file copyright line (the file
header comment block containing "Copyright (c) 2020-2025, NVIDIA CORPORATION.")
and change "2025" to "2026" (also scan the same header block for any duplicated
year entries and update them as well).
🧹 Nitpick comments (5)
tests/unittest/auto_deploy/singlegpu/smoke/test_ad_trtllm_serve.py (1)

93-109: QA list update is unnecessary for this test-only scope.

The changes here are under tests/unittest/, so no tests/integration/test_lists/qa/* update is needed.

As per coding guidelines: “If the PR only touches unittest/ or narrow unit scope, say explicitly whether QA list updates are unnecessary or optional.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/auto_deploy/singlegpu/smoke/test_ad_trtllm_serve.py` around
lines 93 - 109, This test lives under tests/unittest/ and does not require any
QA list changes; update the PR or the test header to explicitly state that QA
list updates under tests/integration/test_lists/qa/* are unnecessary for this
change, e.g., add a brief comment near the
test_trtllm_serve_openai_chat_completion definition clarifying the QA list
update is not required for unit-scope changes.
cpp/tensorrt_llm/common/attentionOp.h (1)

146-147: Include the new V-stride field in debug output.

Line 147 adds v_stride_in_bytes, but enqueueContextParamsToString() does not print it, which makes stride/layout triage harder.

💡 Suggested patch
             ss << "k_ptr: " << this->k_ptr << std::endl;
             ss << "v_ptr: " << this->v_ptr << std::endl;
+            ss << "v_stride_in_bytes: " << this->v_stride_in_bytes << std::endl;
             return ss.str();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tensorrt_llm/common/attentionOp.h` around lines 146 - 147,
enqueueContextParamsToString() currently omits the newly added field
v_stride_in_bytes, so update that function to include v_stride_in_bytes in its
debug output; locate enqueueContextParamsToString (used to stringify the
attention/enqueue context) and append a labeled entry like "v_stride_in_bytes="
+ std::to_string(v_stride_in_bytes) (or the existing value formatting used for
other stride fields) to the returned string so the V tensor stride is printed
alongside the other stride/layout fields.
examples/auto_deploy/model_registry/configs/deepseek-r1.yaml (1)

33-35: Clarify the intentional bucket cap vs max token limit.

max_num_tokens is 15360, but piecewise_num_tokens tops out at 8192. Requests above 8192 will run eager. If that is intended, add a short comment here so this doesn’t look like accidental under-capture.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@examples/auto_deploy/model_registry/configs/deepseek-r1.yaml` around lines 33
- 35, Clarify that the piecewise bucket cap is intentional by adding an inline
comment next to piecewise_enabled or piecewise_num_tokens explaining that
piecewise_num_tokens intentionally tops out at 8192 while max_num_tokens remains
15360, and that any requests above 8192 will be handled eagerly; update the YAML
comment near the symbols piecewise_enabled, piecewise_num_tokens, and
max_num_tokens to state this explicit design decision so it is not mistaken for
an accidental under-capture.
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_mla.py (1)

57-61: Simplify _trace_rope_node return shape to remove duplicate/unused value.

Line 461 unpacks rope_node_i but never uses it; the helper currently returns rope_node twice. This can be simplified for clarity.

Optional cleanup
-def _trace_rope_node(mla_node: Node) -> Optional[Tuple[Node, Node, Node, Node]]:
+def _trace_rope_node(mla_node: Node) -> Optional[Tuple[Node, Node, Node]]:
@@
-    Returns (rope_node, q_pe_pre, kpe_pre, rope_node) or None if
+    Returns (rope_node, q_pe_pre, kpe_pre) or None if
@@
-    return rope_node, q_pe_pre, kpe_pre, rope_node
+    return rope_node, q_pe_pre, kpe_pre
@@
-        rope_node, _, _, _ = trace_result
+        rope_node, _, _ = trace_result
@@
-            rope_node_i, q_pe_pre, kpe_pre, _ = result
+            _, q_pe_pre, kpe_pre = result

Also applies to: 108-109, 445-446, 461-462

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_mla.py` around
lines 57 - 61, The helper _trace_rope_node currently returns (rope_node,
q_pe_pre, kpe_pre, rope_node) and callers unpack four values but never use the
duplicate rope_node; change the return type to Optional[Tuple[Node, Node, Node]]
and return (rope_node, q_pe_pre, kpe_pre) instead, update the docstring and
function signature accordingly, and fix all call sites that currently unpack
four values (remove the redundant fourth variable such as rope_node_i and unpack
only rope_node, q_pe_pre, kpe_pre) so callers at the locations referenced are
updated to expect three values.
tests/unittest/auto_deploy/singlegpu/custom_ops/mla/test_trtllm_mla_op.py (1)

370-474: Cover a decode path with out= too.

This validates bucketed prefill, but the new out= contract is also relevant for extend/decode. Add one single-token decode assertion so a regression in the dynamic-output path doesn’t slip through.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/auto_deploy/singlegpu/custom_ops/mla/test_trtllm_mla_op.py`
around lines 370 - 474, Add a single-token decode/extend assertion: call
_run_trtllm_mla with an out= buffer using inputs/meta that indicate one
additional decode token (update padded_meta or create a new meta with per-batch
lengths increased by 1), compute the expected decode output by running
_run_trtllm_mla for the same scenario without padding (or with
real_inputs/real_meta extended by one token), then assert out_result.numel()==0
and compare out[:, total_tokens] (the new decode token) to the expected decode
token with torch.testing.assert_close and verify out[:, total_tokens+1:] remains
zero; update test_trtllm_mla_out_buffer_padding to include this check using the
existing helpers (_run_trtllm_mla, padded_inputs, padded_meta, real_inputs,
real_meta).
🪄 Prompt for inline review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/auto_deploy/models/hf.py`:
- Around line 335-337: Replace the runtime assert that checks kv_cache_dtype
with an explicit ValueError raise so validation always runs (assert can be
skipped with -O). Locate the check using the kv_cache_dtype variable (the
current assert line) and change it to raise ValueError(f"Unsupported dtype:
{kv_cache_dtype}. Only fp8 and auto are supported.") so the same message is
preserved and the validation is enforced at runtime.

In `@tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_mla.py`:
- Around line 221-225: The try/except around factory._get_model_config() is too
broad; replace the bare except with specific exception types that
_get_model_config can raise (e.g., AttributeError, KeyError, ValueError or a
factory-specific exception if one exists), log the caught exception via
ad_logger.debug including the exception details, and only return None for those
expected config-access failures while letting unexpected exceptions propagate;
target the block using factory._get_model_config(), ad_logger.debug(...), and
the current return None.

In `@tests/unittest/auto_deploy/singlegpu/custom_ops/mla/test_trtllm_mla_op.py`:
- Around line 182-184: The zip() usage that computes kv_lengths and
pages_per_seq can silently truncate if input_positions and seq_lengths have
different lengths; update both occurrences (the comprehension creating
kv_lengths and the similar logic in _build_metadata_with_pages()) to call
zip(input_positions, seq_lengths, strict=True) so the code fails fast on
mismatched metadata lengths and surfaces bad test fixtures immediately.
- Around line 645-656: The zip between input_positions and seq_lengths that
builds kv_lengths should be made strict to fail fast on mismatched test data:
change the expression creating kv_lengths from zip(input_positions, seq_lengths)
to zip(input_positions, seq_lengths, strict=True) so a length mismatch raises
immediately; similarly update any other zip usages in this helper that pair
input_positions with seq_lengths (e.g., the kv_lengths comprehension and any
loops that iterate page_assignments alongside seq_lengths) to use strict=True to
prevent silent truncation.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f2511657-e58d-4ad3-b70e-6d193df2dc34

📥 Commits

Reviewing files that changed from the base of the PR and between 1e8640c and c9fd46a.

📒 Files selected for processing (22)
  • cpp/tensorrt_llm/common/attentionOp.h
  • cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmhaRunner.cpp
  • cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fused_multihead_attention_common.h
  • cpp/tensorrt_llm/thop/attentionOp.cpp
  • examples/auto_deploy/create_standalone_package.py
  • examples/auto_deploy/model_registry/configs/deepseek-r1.yaml
  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
  • tensorrt_llm/_torch/auto_deploy/compile/piecewise_runner.py
  • tensorrt_llm/_torch/auto_deploy/compile/piecewise_utils.py
  • tensorrt_llm/_torch/auto_deploy/config/default.yaml
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mla/__init__.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mla/flashinfer_mla.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mla/torch_backend_mla.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mla/trtllm_mla.py
  • tensorrt_llm/_torch/auto_deploy/models/hf.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_mla.py
  • tests/integration/defs/accuracy/references/gsm8k.yaml
  • tests/integration/defs/accuracy/references/mmlu.yaml
  • tests/unittest/auto_deploy/singlegpu/compile/test_captured_graph.py
  • tests/unittest/auto_deploy/singlegpu/custom_ops/mla/test_trtllm_mla_op.py
  • tests/unittest/auto_deploy/singlegpu/custom_ops/normalization/test_triton_rms_norm.py
  • tests/unittest/auto_deploy/singlegpu/smoke/test_ad_trtllm_serve.py

Comment thread tensorrt_llm/_torch/auto_deploy/models/hf.py
Comment thread tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_mla.py
@tensorrt-cicd
Collaborator

PR_Github #45947 [ run ] triggered by Bot. Commit: c9fd46a Link to invocation

Comment thread tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
Comment thread tensorrt_llm/_torch/auto_deploy/compile/piecewise_runner.py
Comment thread tensorrt_llm/_torch/auto_deploy/compile/piecewise_runner.py
@tensorrt-cicd
Collaborator

PR_Github #45947 [ run ] completed with state SUCCESS. Commit: c9fd46a
/LLM/main/L0_MergeRequest_PR pipeline #36103 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@MrGeva MrGeva force-pushed the eg/pwcg_tmla branch 3 times, most recently from d6d411f to 6bb225c on April 29, 2026 09:47
Collaborator

@nvchenghaoz nvchenghaoz left a comment


approve as the memory usage does not increase.

Comment thread tensorrt_llm/_torch/auto_deploy/compile/piecewise_utils.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/compile/piecewise_runner.py
@MrGeva
Collaborator Author

MrGeva commented Apr 30, 2026

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@MrGeva MrGeva changed the title from "[#11823][feat] AutoDeploy PWCG fixes to support MLA and Deepseek R1" to "[#11823][feat] AutoDeploy MLA with PWCG support on Deepseek R1" Apr 30, 2026
@tensorrt-cicd
Collaborator

PR_Github #46384 [ run ] triggered by Bot. Commit: f24a8cc Link to invocation

Comment thread tensorrt_llm/_torch/auto_deploy/custom_ops/mla/trtllm_mla.py
@tensorrt-cicd
Collaborator

PR_Github #46384 [ run ] completed with state ABORTED. Commit: f24a8cc

Link to invocation

@MrGeva
Collaborator Author

MrGeva commented May 3, 2026

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@MrGeva MrGeva enabled auto-merge (squash) May 3, 2026 06:33
@tensorrt-cicd
Collaborator

PR_Github #46583 [ run ] triggered by Bot. Commit: f24a8cc Link to invocation

@MrGeva
Collaborator Author

MrGeva commented May 3, 2026

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd
Collaborator

PR_Github #46584 [ run ] triggered by Bot. Commit: 0115ade Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46584 [ run ] completed with state SUCCESS. Commit: 0115ade
/LLM/main/L0_MergeRequest_PR pipeline #36633 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@MrGeva
Collaborator Author

MrGeva commented May 3, 2026

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd
Collaborator

PR_Github #46600 [ run ] triggered by Bot. Commit: d2a4e2d Link to invocation

@MrGeva
Collaborator Author

MrGeva commented May 3, 2026

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd
Collaborator

PR_Github #46601 [ run ] triggered by Bot. Commit: d172009 Link to invocation

Piecewise CUDA graph (PWCG) infrastructure already exists in AutoDeploy.
This commit wires the trtllm_mla cached-attention op into PWCG so that
DeepSeek-R1 can run under PWCG.

Required changes:

* ``piecewise_utils.py`` — re-classify
  ``auto_deploy::trtllm_mla_prepare_metadata`` from
  ``_METADATA_PREP_OPS`` to ``_PERSISTENT_BUFFER_OPS``.  The MLA metadata
  op produces a stable persistent buffer (``planner.block_offsets``);
  the persistent-buffer classification is the right contract for it.
* ``torch_cudagraph.py`` — track ``_static_runners`` and call
  ``finalize_capture(nt)`` per bucket so dynamic-op output buffers stay
  strong-ref'd until the bucket's split-graph capture finishes —
  otherwise the shared graph pool can reuse those addresses for
  downstream graph outputs and replay reads garbage.
* ``piecewise_runner.py`` — add the ``finalize_capture`` lifecycle hook
  on ``ADPiecewiseRunner`` and store ``dynamic_out_bufs`` as strong refs
  during capture (transitioned to weak refs in ``finalize_capture``; a
  sketch of this lifecycle follows the list).
* ``trtllm_mla.py`` — add the ``out=`` kwarg to ``trtllm_mla_with_cache``
  and plumb it through ``_mla_with_cache_impl`` so PWCG's
  ``DynamicOpWrapper`` can pre-allocate the bucket-sized output buffer
  in the graph pool, giving the next captured static segment a stable
  read address.
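
A minimal sketch of the strong-to-weak reference lifecycle above (class internals are illustrative; only the ``finalize_capture`` name and the strong-refs-during-capture behavior come from the list):

import weakref

import torch

class PiecewiseRunnerSketch:
    def __init__(self):
        self._dynamic_out_bufs: list[torch.Tensor] = []  # strong refs while capturing
        self._weak_out_bufs: list[weakref.ref] = []

    def record_dynamic_output(self, buf: torch.Tensor) -> None:
        # During capture: a strong ref pins the allocation so the shared
        # graph pool cannot hand its address to a later segment's output.
        self._dynamic_out_bufs.append(buf)

    def finalize_capture(self) -> None:
        # Once the whole bucket's split-graph capture finishes, downgrade to
        # weak refs so the pool can reclaim or reuse the memory safely.
        self._weak_out_bufs = [weakref.ref(b) for b in self._dynamic_out_bufs]
        self._dynamic_out_bufs = []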

Config:

* ``deepseek-r1.yaml`` — enable PWCG (``compile_model.piecewise_enabled``,
  ``piecewise_num_tokens=[256..8192]``, ``max_num_tokens=15360``).  The
  pre-existing trailing-static lm_head exclusion (``capture_lm_head`` ⇒
  False by default in ``PiecewiseCapturedGraph``) keeps lm_head out of
  the captured buckets, so PWCG runs at the default 0.9
  ``free_gpu_memory_fraction`` without a lossy KV-cache budget (an
  illustrative config shape follows below).
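
An illustrative shape for that config (the YAML nesting is an assumption; the 256/8192 endpoints and 15360 come from the bullet, and the full bucket list between them is elided here):

# deepseek-r1.yaml (sketch)
max_num_tokens: 15360
compile_model:
  piecewise_enabled: true
  # Endpoints only; the actual list enumerates buckets between 256 and 8192.
  # Requests above the top bucket fall back to eager execution.
  piecewise_num_tokens: [256, 8192]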

Signed-off-by: Eran Geva <egeva@prenyx0035.a51.clusters.nvidia.com>

[None][test] AutoDeploy: re-enable MMLU for DeepSeek-R1-0528 PWCG accuracy test

PWCG + trtllm_mla now passes both MMLU (~82.7) and GSM8K (~94) on
DeepSeek-R1-0528, so the temporary MMLU skip from the previous
debugging commit can be removed.

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>

[None][fix] AutoDeploy MLA: fix IMA in cache-reused prefill at high concurrency

At TP=8 / max_batch_size=256 / max_num_tokens=15360 / isl=1000 / conc=256,
the cache-reused-prefill path (``_handle_prefill_thop_cached_kv``) hits
a CUDA illegal memory access inside ``AttentionOp::enqueueContext``.  The
failure surfaces asynchronously as collateral ``CUBLAS_STATUS_EXECUTION_FAILED``
errors; under ``CUDA_LAUNCH_BLOCKING=1`` the first synchronous fault is in
``cudaStreamIsCapturing``, after a kernel launched from
``invokeMLAContextFp8Quantize``.

Repro: ``bench-sweep --isl 1000 --osl 1000 --concurrencies 256
--world-size 8`` with ``max_batch_size=256`` set in ``deepseek-r1.yaml``.

Root cause (Python caller-side bug):

* ``thop.attention``'s FP8 context-MLA workspace (``fp8_k_buf`` /
  ``fp8_v_buf``) is sized to
  ``chunked_prefill_buffer_batch_size * max_num_tokens`` tokens.  The
  cache-reused-prefill kernel writes ``total_kv_len`` (sum of
  ``[past + new]`` over all in-flight prefill seqs) FP8 K/V tokens into
  that buffer.  ``trtllm_mla.py`` hard-coded
  ``chunked_prefill_buffer_batch_size = 1``, so the buffer covered only
  ``max_num_tokens`` tokens while a real chunked-prefill batch routinely
  exceeds that — at the repro config we observed
  ``num_full_tokens = 84252`` vs ``max_num_tokens = 15360`` (5.5× OOB).

Fix:

1. Pass ``chunked_prefill_buffer_batch_size = 16`` at the cache-reused-
   prefill call site only (the fresh-prefill and decode call sites are
   correct at ``1``).  ``16 * 15360 = 245760`` token budget — covers the
   diag's 84,252 tokens with ~3× margin.  Avoids ``max_num_requests``
   (which the regular AD trtllm attention path uses) because for MLA the
   formula multiplies by ``total_k_dim_all_heads`` and at
   ``max_num_requests=256`` would size ``fp8_k_buf`` to ~12 GiB / rank
   (the sizing arithmetic is worked through after this list).
2. Bump the static workspace reserve from 512 MiB to 2 GiB.  Without
   this, the C++ side (``cpp/tensorrt_llm/thop/attentionOp.cpp``) calls
   ``workspace.resize_()`` mid-run when N=16 needs ~1.3 GiB — which
   reallocates storage and invalidates the monolithic decode CUDA graphs
   that were captured against the original 512-MiB buffer, producing a
   SIGSEGV in ``at::cuda::CUDAGraph::replay()``.  Reserving 2 GiB up
   front keeps ``resize_()`` quiescent so captured-graph pointers stay
   valid.
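
The sizing arithmetic behind fix (1), written out as a quick check (numbers taken from this commit message, not code from the PR):

max_num_tokens = 15360
old_capacity = 1 * max_num_tokens    # buggy hard-coded knob: 15,360-token buffer
new_capacity = 16 * max_num_tokens   # chunked_prefill_buffer_batch_size = 16: 245,760 tokens

observed_total_kv_len = 84_252       # sum of [past + new] tokens at the repro config
print(observed_total_kv_len / old_capacity)  # ~5.5x past the old buffer (the OOB write)
print(new_capacity / observed_total_kv_len)  # ~2.9x headroom with the fix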

Validated end-to-end: yeonbok's bench-sweep at bs=256 completes cleanly
with no IMA, no CUBLAS errors, no workspace resize warnings, and 1281
successful 200 OK responses at 5,270 output tok/s (matches the
``max_num_tokens<=8192`` workaround's throughput).

Signed-off-by: Eran Geva <egeva@prenyx0109.a51.clusters.nvidia.com>

[None][refactor] AutoDeploy MLA: split workspace tensor for eager / captured paths

Replaces the 2 GiB static workspace reserve from 250a0ce with the
two-tensor pattern the standard ``trtllm_attention`` backend already
uses (``workspace`` / ``cuda_graph_workspace``).

Background: ``thop.attention``'s C++ side resizes the workspace tensor
in-place (``resize_()``) when its sizing formula exceeds the current
capacity.  ``resize_()`` reallocates storage and rebinds ``data_ptr_``,
which **invalidates any captured CUDA graph that recorded the old
address**.  The previous fix worked around this by pre-allocating 2 GiB
up front so ``resize_()`` would never fire — at the cost of permanently
reserving memory that is rarely needed and depending on a hand-tuned
upper bound that has to track future config changes (max_num_tokens,
chunked_prefill_buffer_batch_size, etc.).

This change splits the single workspace into two tensors and routes per
call site:

* ``workspace`` — used by eager paths (fresh + cache-reused prefill).
  Free to grow on demand via ``resize_()``; no captured graph references
  it, so storage churn is harmless.
* ``cuda_graph_workspace`` — used during CUDA-graph warmup and capture
  (decode).  Grows lazily during warmup so the captured graph records
  the final pointer; afterwards no resize fires for the captured
  workload.

Routing happens in a new ``_TrtllmMLAPlanner._select_workspace()``
helper, called at all three ``thop.attention`` sites.  The discriminator
is the same signal ``plan_host`` already uses:
``torch.cuda.is_current_stream_capturing() or
cuda_graph_state.in_warm_up()``.  Both tensors start size-0 and grow on
first use, mirroring the standard backend.
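
A sketch of that routing (the helper name and the capture/warmup signal match the text above; the class state and ``in_warm_up`` plumbing are illustrative):

import torch

class TrtllmMLAPlannerSketch:
    def __init__(self, device: str = "cuda"):
        # Both tensors start size-0 and grow on first use.
        self.workspace = torch.empty(0, dtype=torch.uint8, device=device)
        self.cuda_graph_workspace = torch.empty(0, dtype=torch.uint8, device=device)

    def _select_workspace(self, in_warm_up: bool) -> torch.Tensor:
        # Capturing or warming up: route to the tensor whose final pointer the
        # captured graph records; eager paths may resize_() theirs freely.
        if torch.cuda.is_current_stream_capturing() or in_warm_up:
            return self.cuda_graph_workspace
        return self.workspace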

Validated on B200 TP=8:

* Yeonbok's bs=256 / isl=osl=1000 / conc=256 bench-sweep:
  1281 × 200 OK, 5283 output tok/s, 0 IMA, 0 CUBLAS errors, 0 SIGSEGV.
  Resize warnings fire as expected (``cuda_graph_workspace`` 0 → 287 MiB
  during warmup, ``workspace`` 0 → 168 MiB across eager prefill chunks)
  and none touches a tensor a captured graph references.
* Registry accuracy run for DeepSeek-R1-0528:
  MMLU 87.33 (ref 84.72, threshold 82.91), GSM8K 95.30 (ref 92.72,
  threshold 89.52).  Both pass.

Recovers ~1.5 GiB / rank versus the static reserve and removes the
dependency on the 2 GiB upper bound.  ``chunked_prefill_buffer_batch_size
= 16`` at the cache-reused-prefill site (the actual IMA fix from
250a0ce) is unchanged.

Signed-off-by: Eran Geva <egeva@prenyx0074.a51.clusters.nvidia.com>
Signed-off-by: Eran Geva <egeva@prenyx0167.a51.clusters.nvidia.com>
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
@MrGeva
Collaborator Author

MrGeva commented May 3, 2026

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd
Collaborator

PR_Github #46607 [ run ] triggered by Bot. Commit: def073e Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46607 [ run ] completed with state SUCCESS. Commit: def073e
/LLM/main/L0_MergeRequest_PR pipeline #36654 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@MrGeva
Collaborator Author

MrGeva commented May 4, 2026

/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@MrGeva
Collaborator Author

MrGeva commented May 4, 2026

/bot help

@github-actions

github-actions Bot commented May 4, 2026

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental) --high-priority]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Supports wildcard * for pattern matching (e.g., "*PerfSanity*" matches all stages containing PerfSanity). Examples: "A10-PyTorch-1, xxx", "PerfSanity". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Supports wildcard * for pattern matching. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx", --extra-stage "Post-Merge".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

--high-priority (OPTIONAL) : Run the pipeline with high priority. This option is restricted to authorized users only and will route the job to a high-priority queue.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@tensorrt-cicd
Collaborator

PR_Github #46621 [ run ] triggered by Bot. Commit: def073e Link to invocation

@MrGeva
Collaborator Author

MrGeva commented May 4, 2026

/bot skip --comment "failed on non related issue, all the rest passed"

@tensorrt-cicd
Collaborator

PR_Github #46622 [ skip ] triggered by Bot. Commit: def073e Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46622 [ skip ] completed with state SUCCESS. Commit: def073e
Skipping testing for commit def073e

Link to invocation

@MrGeva MrGeva merged commit f504047 into NVIDIA:main May 4, 2026
6 checks passed
