[#11823][feat] AutoDeploy MLA with PWCG support on Deepseek R1#13497
MrGeva merged 1 commit into NVIDIA:main
Conversation
Force-pushed from 3306064 to c9fd46a
/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
📝 Walkthrough
Adds V tensor stride parameters to FMHA operations for handling non-contiguous tensor layouts, introduces TensorRT-LLM MLA cached attention backend with paged KV cache support, refactors CUDA graph capture to finalize dynamic output references, and adds RoPE fusion transform for MLA with supporting configs and tests.
Sequence Diagram(s)
sequenceDiagram
participant Frontend as Frontend<br/>(PyTorch)
participant Prepare as Metadata<br/>Preparation
participant Cache as Cache<br/>Management
participant KV as KV<br/>Processing
participant Attention as THOP<br/>Attention
rect rgba(100, 150, 200, 0.5)
Note over Frontend,Attention: TRT-LLM MLA Prefill (Fresh or Cached)
Frontend->>Prepare: prepare_trtllm_mla_metadata<br/>(batch_info, cache locations)
Prepare->>Cache: compute block_offsets<br/>block_ids_per_seq
Cache->>KV: RoPE/append new tokens<br/>to paged cache
KV->>KV: reload [past+new] KV
KV->>KV: project compressed KV
KV->>Attention: call thop.attention<br/>(with latent_cache)
end
rect rgba(150, 200, 100, 0.5)
Note over Frontend,Attention: TRT-LLM MLA Decode
Frontend->>Prepare: prepare_trtllm_mla_metadata<br/>(batch_info, cache locations)
Prepare->>Cache: compute block_offsets
Cache->>KV: mla_rope_generation<br/>write decoded tokens
KV->>Attention: call thop.attention<br/>(generation mode)
Attention->>KV: project from latent space<br/>back to v_head_dim
end
rect rgba(200, 150, 100, 0.5)
Note over Frontend,Attention: Mixed Prefill+Decode (In-Order Slices)
Frontend->>Frontend: auto_deploy::trtllm_mla_with_cache<br/>processes prefill and decode slices
Frontend->>Prepare: for each slice:<br/>prepare metadata
Prepare->>Attention: execute appropriate<br/>prefill or decode path
Attention->>Frontend: return output (reuse buffer if provided)
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fused_multihead_attention_common.h (1)
2-2: ⚠️ Potential issue | 🟡 Minor — Update the copyright year on this modified file.
Line 2 still ends at 2025 even though this file is modified in this PR.
📝 Suggested patch
- * Copyright (c) 2020-2025, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2020-2026, NVIDIA CORPORATION. All rights reserved.
As per coding guidelines, "Include NVIDIA copyright header on all new files; update year on modified files".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fused_multihead_attention_common.h` at line 2, update the copyright year range in the file header string "Copyright (c) 2020-2025, NVIDIA CORPORATION. All rights reserved." to include the current modification year (e.g., change 2025 to 2026) so the header on fused_multihead_attention_common.h reflects that the file was modified in this PR.
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmhaRunner.cpp (1)
2-2: ⚠️ Potential issue | 🟡 Minor — Update the file header year to 2026.
This modified file still shows 2025 in the copyright line.
📝 Suggested patch
- * Copyright (c) 2020-2025, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2020-2026, NVIDIA CORPORATION. All rights reserved.
As per coding guidelines, "Include NVIDIA copyright header on all new files; update year on modified files".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmhaRunner.cpp` at line 2, Update the file header copyright year from 2025 to 2026 in the fmhaRunner.cpp file header; locate the top-of-file copyright line (the file header comment block containing "Copyright (c) 2020-2025, NVIDIA CORPORATION.") and change "2025" to "2026" (also scan the same header block for any duplicated year entries and update them as well).
🧹 Nitpick comments (5)
tests/unittest/auto_deploy/singlegpu/smoke/test_ad_trtllm_serve.py (1)
93-109: QA list update is unnecessary for this test-only scope.
This PR's scope here is under tests/unittest/, so no tests/integration/test_lists/qa/* update is needed.
As per coding guidelines: "If the PR only touches unittest/ or narrow unit scope, say explicitly whether QA list updates are unnecessary or optional."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/auto_deploy/singlegpu/smoke/test_ad_trtllm_serve.py` around lines 93 - 109, this test lives under tests/unittest/ and does not require any QA list changes; update the PR or the test header to explicitly state that QA list updates under tests/integration/test_lists/qa/* are unnecessary for this change, e.g., add a brief comment near the test_trtllm_serve_openai_chat_completion definition clarifying that the QA list update is not required for unit-scope changes.
cpp/tensorrt_llm/common/attentionOp.h (1)
146-147: Include the new V-stride field in debug output.
Line 147 adds v_stride_in_bytes, but enqueueContextParamsToString() does not print it, which makes stride/layout triage harder.
💡 Suggested patch
  ss << "k_ptr: " << this->k_ptr << std::endl;
  ss << "v_ptr: " << this->v_ptr << std::endl;
+ ss << "v_stride_in_bytes: " << this->v_stride_in_bytes << std::endl;
  return ss.str();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/common/attentionOp.h` around lines 146 - 147, enqueueContextParamsToString() currently omits the newly added field v_stride_in_bytes, so update that function to include v_stride_in_bytes in its debug output; locate enqueueContextParamsToString (used to stringify the attention/enqueue context) and append a labeled entry like "v_stride_in_bytes=" + std::to_string(v_stride_in_bytes) (or the existing value formatting used for other stride fields) to the returned string so the V tensor stride is printed alongside the other stride/layout fields.
examples/auto_deploy/model_registry/configs/deepseek-r1.yaml (1)
33-35: Clarify the intentional bucket cap vs max token limit.
max_num_tokens is 15360, but piecewise_num_tokens tops out at 8192. Requests above 8192 will run eager. If that is intended, add a short comment here so this doesn’t look like accidental under-capture.
Verify each finding against the current code and only fix it if needed. In `@examples/auto_deploy/model_registry/configs/deepseek-r1.yaml` around lines 33 - 35, clarify that the piecewise bucket cap is intentional by adding an inline comment next to piecewise_enabled or piecewise_num_tokens explaining that piecewise_num_tokens intentionally tops out at 8192 while max_num_tokens remains 15360, and that any requests above 8192 will be handled eagerly; update the YAML comment near the symbols piecewise_enabled, piecewise_num_tokens, and max_num_tokens to state this explicit design decision so it is not mistaken for an accidental under-capture.
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_mla.py (1)
57-61: Simplify _trace_rope_node return shape to remove duplicate/unused value.
Line 461 unpacks rope_node_i but never uses it; the helper currently returns rope_node twice. This can be simplified for clarity.
Optional cleanup
-def _trace_rope_node(mla_node: Node) -> Optional[Tuple[Node, Node, Node, Node]]:
+def _trace_rope_node(mla_node: Node) -> Optional[Tuple[Node, Node, Node]]:
@@
-    Returns (rope_node, q_pe_pre, kpe_pre, rope_node) or None if
+    Returns (rope_node, q_pe_pre, kpe_pre) or None if
@@
-    return rope_node, q_pe_pre, kpe_pre, rope_node
+    return rope_node, q_pe_pre, kpe_pre
@@
-    rope_node, _, _, _ = trace_result
+    rope_node, _, _ = trace_result
@@
-    rope_node_i, q_pe_pre, kpe_pre, _ = result
+    _, q_pe_pre, kpe_pre = result
Also applies to: 108-109, 445-446, 461-462
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_mla.py` around lines 57 - 61, the helper _trace_rope_node currently returns (rope_node, q_pe_pre, kpe_pre, rope_node) and callers unpack four values but never use the duplicate rope_node; change the return type to Optional[Tuple[Node, Node, Node]] and return (rope_node, q_pe_pre, kpe_pre) instead, update the docstring and function signature accordingly, and fix all call sites that currently unpack four values (remove the redundant fourth variable such as rope_node_i and unpack only rope_node, q_pe_pre, kpe_pre) so callers at the locations referenced are updated to expect three values.
tests/unittest/auto_deploy/singlegpu/custom_ops/mla/test_trtllm_mla_op.py (1)
370-474: Cover a decode path with out= too.
This validates bucketed prefill, but the new out= contract is also relevant for extend/decode. Add one single-token decode assertion so a regression in the dynamic-output path doesn’t slip through.
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/auto_deploy/singlegpu/custom_ops/mla/test_trtllm_mla_op.py` around lines 370 - 474, Add a single-token decode/extend assertion: call _run_trtllm_mla with an out= buffer using inputs/meta that indicate one additional decode token (update padded_meta or create a new meta with per-batch lengths increased by 1), compute the expected decode output by running _run_trtllm_mla for the same scenario without padding (or with real_inputs/real_meta extended by one token), then assert out_result.numel()==0 and compare out[:, total_tokens] (the new decode token) to the expected decode token with torch.testing.assert_close and verify out[:, total_tokens+1:] remains zero; update test_trtllm_mla_out_buffer_padding to include this check using the existing helpers (_run_trtllm_mla, padded_inputs, padded_meta, real_inputs, real_meta).
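To make the suggested assertion concrete, here is a minimal, self-contained sketch of the check pattern. run_decode_with_out is a hypothetical stand-in for calling _run_trtllm_mla with an out= buffer on one decode token; the shapes and values are illustrative only, not the real test fixtures.

```python
import torch

def run_decode_with_out(out: torch.Tensor, total_tokens: int) -> torch.Tensor:
    """Hypothetical stand-in for _run_trtllm_mla(..., out=out) on one decode token."""
    out[:, total_tokens] = 1.0      # the op writes only the new decode-token row
    return out.new_empty(0)         # and returns an empty alias per the out= contract

bucket_tokens, num_heads, v_head_dim = 8, 2, 4
total_tokens = 3                    # tokens already present before the decode step
out = torch.zeros(1, bucket_tokens, num_heads, v_head_dim)

ret = run_decode_with_out(out, total_tokens)
expected_token = torch.ones(1, num_heads, v_head_dim)

assert ret.numel() == 0                                            # empty alias returned
torch.testing.assert_close(out[:, total_tokens], expected_token)   # decode token written
assert torch.all(out[:, total_tokens + 1:] == 0)                   # padded tail untouched
```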
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/auto_deploy/models/hf.py`:
- Around line 335-337: Replace the runtime assert that checks kv_cache_dtype
with an explicit ValueError raise so validation always runs (assert can be
skipped with -O). Locate the check using the kv_cache_dtype variable (the
current assert line) and change it to raise ValueError(f"Unsupported dtype:
{kv_cache_dtype}. Only fp8 and auto are supported.") so the same message is
preserved and the validation is enforced at runtime.
In `@tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_mla.py`:
- Around line 221-225: The try/except around factory._get_model_config() is too
broad; replace the bare except with specific exception types that
_get_model_config can raise (e.g., AttributeError, KeyError, ValueError or a
factory-specific exception if one exists), log the caught exception via
ad_logger.debug including the exception details, and only return None for those
expected config-access failures while letting unexpected exceptions propagate;
target the block using factory._get_model_config(), ad_logger.debug(...), and
the current return None.
In `@tests/unittest/auto_deploy/singlegpu/custom_ops/mla/test_trtllm_mla_op.py`:
- Around line 182-184: The zip() usage that computes kv_lengths and
pages_per_seq can silently truncate if input_positions and seq_lengths have
different lengths; update both occurrences (the comprehension creating
kv_lengths and the similar logic in _build_metadata_with_pages()) to call
zip(input_positions, seq_lengths, strict=True) so the code fails fast on
mismatched metadata lengths and surfaces bad test fixtures immediately.
- Around line 645-656: The zip between input_positions and seq_lengths that
builds kv_lengths should be made strict to fail fast on mismatched test data:
change the expression creating kv_lengths from zip(input_positions, seq_lengths)
to zip(input_positions, seq_lengths, strict=True) so a length mismatch raises
immediately; similarly update any other zip usages in this helper that pair
input_positions with seq_lengths (e.g., the kv_lengths comprehension and any
loops that iterate page_assignments alongside seq_lengths) to use strict=True to
prevent silent truncation.
---
Outside diff comments:
In `@cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmhaRunner.cpp`:
- Line 2: Update the file header copyright year from 2025 to 2026 in the
fmhaRunner.cpp file header; locate the top-of-file copyright line (the file
header comment block containing "Copyright (c) 2020-2025, NVIDIA CORPORATION.")
and change "2025" to "2026" (also scan the same header block for any duplicated
year entries and update them as well).
In
`@cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fused_multihead_attention_common.h`:
- Line 2: Update the copyright year range in the file header string "Copyright
(c) 2020-2025, NVIDIA CORPORATION. All rights reserved." to include the current
modification year (e.g., change 2025 to 2026) so the header on
fused_multihead_attention_common.h reflects the file was modified in this PR.
---
Nitpick comments:
In `@cpp/tensorrt_llm/common/attentionOp.h`:
- Around line 146-147: enqueueContextParamsToString() currently omits the newly
added field v_stride_in_bytes, so update that function to include
v_stride_in_bytes in its debug output; locate enqueueContextParamsToString (used
to stringify the attention/enqueue context) and append a labeled entry like
"v_stride_in_bytes=" + std::to_string(v_stride_in_bytes) (or the existing value
formatting used for other stride fields) to the returned string so the V tensor
stride is printed alongside the other stride/layout fields.
In `@examples/auto_deploy/model_registry/configs/deepseek-r1.yaml`:
- Around line 33-35: Clarify that the piecewise bucket cap is intentional by
adding an inline comment next to piecewise_enabled or piecewise_num_tokens
explaining that piecewise_num_tokens intentionally tops out at 8192 while
max_num_tokens remains 15360, and that any requests above 8192 will be handled
eagerly; update the YAML comment near the symbols piecewise_enabled,
piecewise_num_tokens, and max_num_tokens to state this explicit design decision
so it is not mistaken for an accidental under-capture.
In `@tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_mla.py`:
- Around line 57-61: The helper _trace_rope_node currently returns (rope_node,
q_pe_pre, kpe_pre, rope_node) and callers unpack four values but never use the
duplicate rope_node; change the return type to Optional[Tuple[Node, Node, Node]]
and return (rope_node, q_pe_pre, kpe_pre) instead, update the docstring and
function signature accordingly, and fix all call sites that currently unpack
four values (remove the redundant fourth variable such as rope_node_i and unpack
only rope_node, q_pe_pre, kpe_pre) so callers at the locations referenced are
updated to expect three values.
In `@tests/unittest/auto_deploy/singlegpu/custom_ops/mla/test_trtllm_mla_op.py`:
- Around line 370-474: Add a single-token decode/extend assertion: call
_run_trtllm_mla with an out= buffer using inputs/meta that indicate one
additional decode token (update padded_meta or create a new meta with per-batch
lengths increased by 1), compute the expected decode output by running
_run_trtllm_mla for the same scenario without padding (or with
real_inputs/real_meta extended by one token), then assert out_result.numel()==0
and compare out[:, total_tokens] (the new decode token) to the expected decode
token with torch.testing.assert_close and verify out[:, total_tokens+1:] remains
zero; update test_trtllm_mla_out_buffer_padding to include this check using the
existing helpers (_run_trtllm_mla, padded_inputs, padded_meta, real_inputs,
real_meta).
In `@tests/unittest/auto_deploy/singlegpu/smoke/test_ad_trtllm_serve.py`:
- Around line 93-109: This test lives under tests/unittest/ and does not require
any QA list changes; update the PR or the test header to explicitly state that
QA list updates under tests/integration/test_lists/qa/* are unnecessary for this
change, e.g., add a brief comment near the
test_trtllm_serve_openai_chat_completion definition clarifying the QA list
update is not required for unit-scope changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f2511657-e58d-4ad3-b70e-6d193df2dc34
📒 Files selected for processing (22)
cpp/tensorrt_llm/common/attentionOp.h
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmhaRunner.cpp
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fused_multihead_attention_common.h
cpp/tensorrt_llm/thop/attentionOp.cpp
examples/auto_deploy/create_standalone_package.py
examples/auto_deploy/model_registry/configs/deepseek-r1.yaml
tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
tensorrt_llm/_torch/auto_deploy/compile/piecewise_runner.py
tensorrt_llm/_torch/auto_deploy/compile/piecewise_utils.py
tensorrt_llm/_torch/auto_deploy/config/default.yaml
tensorrt_llm/_torch/auto_deploy/custom_ops/mla/__init__.py
tensorrt_llm/_torch/auto_deploy/custom_ops/mla/flashinfer_mla.py
tensorrt_llm/_torch/auto_deploy/custom_ops/mla/torch_backend_mla.py
tensorrt_llm/_torch/auto_deploy/custom_ops/mla/trtllm_mla.py
tensorrt_llm/_torch/auto_deploy/models/hf.py
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_mla.py
tests/integration/defs/accuracy/references/gsm8k.yaml
tests/integration/defs/accuracy/references/mmlu.yaml
tests/unittest/auto_deploy/singlegpu/compile/test_captured_graph.py
tests/unittest/auto_deploy/singlegpu/custom_ops/mla/test_trtllm_mla_op.py
tests/unittest/auto_deploy/singlegpu/custom_ops/normalization/test_triton_rms_norm.py
tests/unittest/auto_deploy/singlegpu/smoke/test_ad_trtllm_serve.py
PR_Github #45947 [ run ] triggered by Bot. Commit:
PR_Github #45947 [ run ] completed with state
Force-pushed from d6d411f to 6bb225c
nvchenghaoz left a comment:
approve as the memory usage does not increase.
Force-pushed from 204b225 to f24a8cc
/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #46384 [ run ] triggered by Bot. Commit:
PR_Github #46384 [ run ] completed with state
/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #46583 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #46584 [ run ] triggered by Bot. Commit:
PR_Github #46584 [ run ] completed with state
/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #46600 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #46601 [ run ] triggered by Bot. Commit:
Piecewise CUDA graph (PWCG) infrastructure already exists in AutoDeploy. This commit wires the trtllm_mla cached-attention op into PWCG so that DeepSeek-R1 can run under PWCG. Required changes:

* ``piecewise_utils.py`` — re-classify ``auto_deploy::trtllm_mla_prepare_metadata`` from ``_METADATA_PREP_OPS`` to ``_PERSISTENT_BUFFER_OPS``. The MLA metadata op produces a stable persistent buffer (``planner.block_offsets``); the persistent-buffer classification is the right contract for it.
* ``torch_cudagraph.py`` — track ``_static_runners`` and call ``finalize_capture(nt)`` per bucket so dynamic-op output buffers stay strong-ref'd until the bucket's split-graph capture finishes — otherwise the shared graph pool can reuse those addresses for downstream graph outputs and replay reads garbage.
* ``piecewise_runner.py`` — add the ``finalize_capture`` lifecycle hook on ``ADPiecewiseRunner`` and store ``dynamic_out_bufs`` as strong refs during capture (transitioned to weak refs in ``finalize_capture``).
* ``trtllm_mla.py`` — add the ``out=`` kwarg to ``trtllm_mla_with_cache`` and plumb it through ``_mla_with_cache_impl`` so PWCG's ``DynamicOpWrapper`` can pre-allocate the bucket-sized output buffer in the graph pool, giving the next captured static segment a stable read address.

Config:

* ``deepseek-r1.yaml`` — enable PWCG (``compile_model.piecewise_enabled``, ``piecewise_num_tokens=[256..8192]``, ``max_num_tokens=15360``). The pre-existing trailing-static lm_head exclusion (``capture_lm_head`` ⇒ False by default in ``PiecewiseCapturedGraph``) keeps lm_head out of the captured buckets, so PWCG runs at the default 0.9 ``free_gpu_memory_fraction`` without lossy KV-cache budget.

Signed-off-by: Eran Geva <egeva@prenyx0035.a51.clusters.nvidia.com>

[None][test] AutoDeploy: re-enable MMLU for DeepSeek-R1-0528 PWCG accuracy test

PWCG + trtllm_mla now passes both MMLU (~82.7) and GSM8K (~94) on DeepSeek-R1-0528, so the temporary MMLU skip from the previous debugging commit can be removed.

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>

[None][fix] AutoDeploy MLA: fix IMA in cache-reused prefill at high concurrency

At TP=8 / max_batch_size=256 / max_num_tokens=15360 / isl=1000 / conc=256, the cache-reused-prefill path (``_handle_prefill_thop_cached_kv``) hits a CUDA illegal memory access inside ``AttentionOp::enqueueContext``. The async surface signature is ``CUBLAS_STATUS_EXECUTION_FAILED`` collateral; under ``CUDA_LAUNCH_BLOCKING=1`` the first synchronous fault is in ``cudaStreamIsCapturing`` after a kernel launched from ``invokeMLAContextFp8Quantize``.

Repro: ``bench-sweep --isl 1000 --osl 1000 --concurrencies 256 --world-size 8`` with ``max_batch_size=256`` set in ``deepseek-r1.yaml``.

Root cause (Python caller-side bug):

* ``thop.attention``'s FP8 context-MLA workspace (``fp8_k_buf`` / ``fp8_v_buf``) is sized to ``chunked_prefill_buffer_batch_size * max_num_tokens`` tokens. The cache-reused-prefill kernel writes ``total_kv_len`` (sum of ``[past + new]`` over all in-flight prefill seqs) FP8 K/V tokens into that buffer. ``trtllm_mla.py`` hard-coded ``chunked_prefill_buffer_batch_size = 1``, so the buffer covered only ``max_num_tokens`` tokens while a real chunked-prefill batch routinely exceeds that — at the repro config we observed ``num_full_tokens = 84252`` vs ``max_num_tokens = 15360`` (5.5× OOB).

Fix:

1. Pass ``chunked_prefill_buffer_batch_size = 16`` at the cache-reused-prefill call site only (the fresh-prefill and decode call sites are correct at ``1``).
   The ``16 * 15360 = 245760`` token budget covers the diag's 84,252 tokens with ~3× margin. Avoids ``max_num_requests`` (which the regular AD trtllm attention path uses) because for MLA the formula multiplies by ``total_k_dim_all_heads`` and at ``max_num_requests=256`` would size ``fp8_k_buf`` to ~12 GiB / rank.

2. Bump the static workspace reserve from 512 MiB to 2 GiB. Without this, the C++ side (``cpp/tensorrt_llm/thop/attentionOp.cpp``) calls ``workspace.resize_()`` mid-run when N=16 needs ~1.3 GiB — which reallocates storage and invalidates the monolithic decode CUDA graphs that were captured against the original 512-MiB buffer, producing a SIGSEGV in ``at::cuda::CUDAGraph::replay()``. Reserving 2 GiB up front keeps ``resize_()`` quiescent so captured-graph pointers stay valid.

Validated end-to-end: yeonbok's bench-sweep at bs=256 completes cleanly with no IMA, no CUBLAS errors, no workspace resize warnings, and 1281 successful 200 OK responses at 5,270 output tok/s (matches the ``max_num_tokens<=8192`` workaround's throughput).

Signed-off-by: Eran Geva <egeva@prenyx0109.a51.clusters.nvidia.com>

[None][refactor] AutoDeploy MLA: split workspace tensor for eager / captured paths

Replaces the 2 GiB static workspace reserve from 250a0ce with the two-tensor pattern the standard ``trtllm_attention`` backend already uses (``workspace`` / ``cuda_graph_workspace``).

Background: ``thop.attention``'s C++ side resizes the workspace tensor in-place (``resize_()``) when its sizing formula exceeds the current capacity. ``resize_()`` reallocates storage and rebinds ``data_ptr_``, which **invalidates any captured CUDA graph that recorded the old address**. The previous fix worked around this by pre-allocating 2 GiB up front so ``resize_()`` would never fire — at the cost of permanently reserving memory that is rarely needed and depending on a hand-tuned upper bound that has to track future config changes (max_num_tokens, chunked_prefill_buffer_batch_size, etc.).

This change splits the single workspace into two tensors and routes per call site:

* ``workspace`` — used by eager paths (fresh + cache-reused prefill). Free to grow on demand via ``resize_()``; no captured graph references it, so storage churn is harmless.
* ``cuda_graph_workspace`` — used during CUDA-graph warmup and capture (decode). Grows lazily during warmup so the captured graph records the final pointer; afterwards no resize fires for the captured workload.

Routing happens in a new ``_TrtllmMLAPlanner._select_workspace()`` helper, called at all three ``thop.attention`` sites. The discriminator is the same signal ``plan_host`` already uses: ``torch.cuda.is_current_stream_capturing() or cuda_graph_state.in_warm_up()``. Both tensors start size-0 and grow on first use, mirroring the standard backend.

Validated on B200 TP=8:

* Yeonbok's bs=256 / isl=osl=1000 / conc=256 bench-sweep: 1281 × 200 OK, 5283 output tok/s, 0 IMA, 0 CUBLAS errors, 0 SIGSEGV. Resize warnings fire as expected (``cuda_graph_workspace`` 0 → 287 MiB during warmup, ``workspace`` 0 → 168 MiB across eager prefill chunks) and none touches a tensor a captured graph references.
* Registry accuracy run for DeepSeek-R1-0528: MMLU 87.33 (ref 84.72, threshold 82.91), GSM8K 95.30 (ref 92.72, threshold 89.52). Both pass.

Recovers ~1.5 GiB / rank versus the static reserve and removes the dependency on the 2 GiB upper bound. ``chunked_prefill_buffer_batch_size = 16`` at the cache-reused-prefill site (the actual IMA fix from 250a0ce) is unchanged.
Signed-off-by: Eran Geva <egeva@prenyx0074.a51.clusters.nvidia.com>
Signed-off-by: Eran Geva <egeva@prenyx0167.a51.clusters.nvidia.com>
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
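For reference, the buffer-sizing arithmetic behind the chunked_prefill_buffer_batch_size = 16 fix above, restated as a tiny check using only the numbers quoted in the commit message (the formula and figures come from the commit text, not from reading the kernel code):

```python
max_num_tokens = 15360
observed_peak_tokens = 84252   # num_full_tokens reported at the repro config

def fp8_kv_buffer_tokens(chunked_prefill_buffer_batch_size: int) -> int:
    # Per the commit message, the FP8 K/V workspace covers this many tokens.
    return chunked_prefill_buffer_batch_size * max_num_tokens

assert fp8_kv_buffer_tokens(1) == 15360      # old sizing: ~5.5x too small for 84,252 tokens
assert fp8_kv_buffer_tokens(16) == 245760    # new sizing: ~2.9x margin over the observed peak
assert fp8_kv_buffer_tokens(16) > observed_peak_tokens
```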
/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #46607 [ run ] triggered by Bot. Commit:
PR_Github #46607 [ run ] completed with state
/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
/bot help
GitHub Bot Help
Provide a user friendly way for developers to interact with a Jenkins server. Run
See details below for each supported subcommand.
Details
run
Launch build/test pipelines. All previously running jobs will be killed.
kill
Kill all running builds associated with pull request.
skip
Skip testing for latest commit on pull request.
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
PR_Github #46621 [ run ] triggered by Bot. Commit:
/bot skip --comment "failed on non related issue, all the rest passed"
PR_Github #46622 [ skip ] triggered by Bot. Commit:
PR_Github #46622 [ skip ] completed with state
piecewise_utils.py: re-classify auto_deploy::trtllm_mla_prepare_metadata from _METADATA_PREP_OPS to _PERSISTENT_BUFFER_OPS — its output (planner.block_offsets) lives at a stable persistent address.
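For illustration, a minimal sketch of what this reclassification amounts to, assuming the two registries in piecewise_utils.py are plain collections of op-name strings (the real data structures may differ):

```python
# Hypothetical shape of the registries; only the membership change is the point.
_METADATA_PREP_OPS = {
    # ... other metadata-prep ops stay here ...
}
_PERSISTENT_BUFFER_OPS = {
    # Ops whose outputs live at a stable persistent address (e.g. planner.block_offsets),
    # so PWCG can treat them as persistent buffers rather than per-capture metadata prep.
    "auto_deploy::trtllm_mla_prepare_metadata",  # moved here from _METADATA_PREP_OPS
}
```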
torch_cudagraph.py + piecewise_runner.py (finalize_capture hook): keep dynamic-op output buffers strong-ref'd through the entire bucket's split-graph capture; without this the shared graph pool reuses captured slots across segments
mid-capture and replay reads garbage (verified — MMLU collapsed from 87.33 → 35.70 when we tried immediate weak-refs).
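A rough sketch of that strong-ref/weak-ref hand-off, assuming a runner shaped loosely like ADPiecewiseRunner; everything except the finalize_capture and dynamic_out_bufs names is illustrative:

```python
import weakref
import torch

class PiecewiseRunnerSketch:
    """Holds dynamic-op output buffers strongly while a bucket's split-graph
    capture is in flight, then downgrades them to weak refs afterwards."""

    def __init__(self) -> None:
        self.dynamic_out_bufs: list[torch.Tensor] = []   # strong refs during capture
        self._weak_out_bufs: list[weakref.ref] = []

    def record_dynamic_output(self, buf: torch.Tensor) -> None:
        # Strong ref: keeps the allocation alive so the shared graph pool cannot
        # hand the same address to a later segment while capture is still ongoing.
        self.dynamic_out_bufs.append(buf)

    def finalize_capture(self) -> None:
        # Called once the whole bucket has finished capturing: downgrade to weak
        # refs so the pool can reclaim the memory when nothing else holds it.
        self._weak_out_bufs = [weakref.ref(b) for b in self.dynamic_out_bufs]
        self.dynamic_out_bufs.clear()
```

The MMLU collapse noted above is exactly what the strong refs prevent: with immediate weak refs, the pool could recycle a captured output address before the remaining segments of the bucket were recorded.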
trtllm_mla.py: add the out= kwarg to trtllm_mla_with_cache and plumb it through _mla_with_cache_impl to honor the DynamicOpWrapper contract (write real-token rows into the bucket-sized buffer, zero the padded tail, return the empty
alias).
Split the AutoDeploy MLA thop.attention workspace into separate eager (prefill) and captured (decode) tensors so the C++ side's resize_() can grow the eager one freely without invalidating CUDA-graph-captured pointers — replacing the prior 512MB static reserve which failed on large setups.
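A minimal sketch of the split-workspace routing, assuming a planner-style object that owns both tensors; torch.cuda.is_current_stream_capturing() is the real PyTorch query, while the in_warm_up flag stands in for cuda_graph_state.in_warm_up():

```python
import torch

class WorkspaceRouterSketch:
    def __init__(self, device: str = "cuda") -> None:
        # Both tensors start empty and are grown on demand by the C++ side's resize_().
        self.workspace = torch.empty(0, dtype=torch.uint8, device=device)
        self.cuda_graph_workspace = torch.empty(0, dtype=torch.uint8, device=device)

    def _select_workspace(self, in_warm_up: bool) -> torch.Tensor:
        # Warmup/capture uses the dedicated tensor so the captured graph records its
        # final pointer; eager prefill is free to churn the other tensor's storage.
        if torch.cuda.is_current_stream_capturing() or in_warm_up:
            return self.cuda_graph_workspace
        return self.workspace
```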
examples/auto_deploy/model_registry/configs/deepseek-r1.yaml: enable PWCG (buckets [256..8192], max_num_tokens=15360).
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.