[#11823][feat] AutoDeploy MLA with PWCG support on Deepseek R1#13497
MrGeva merged 1 commit into NVIDIA:main
Conversation
Force-pushed from 3306064 to c9fd46a
/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
📝 Walkthrough
Adds V tensor stride parameters to FMHA operations for handling non-contiguous tensor layouts, introduces TensorRT-LLM MLA cached attention backend with paged KV cache support, refactors CUDA graph capture to finalize dynamic output references, and adds RoPE fusion transform for MLA with supporting configs and tests.
Sequence Diagram(s)
sequenceDiagram
participant Frontend as Frontend<br/>(PyTorch)
participant Prepare as Metadata<br/>Preparation
participant Cache as Cache<br/>Management
participant KV as KV<br/>Processing
participant Attention as THOP<br/>Attention
rect rgba(100, 150, 200, 0.5)
Note over Frontend,Attention: TRT-LLM MLA Prefill (Fresh or Cached)
Frontend->>Prepare: prepare_trtllm_mla_metadata<br/>(batch_info, cache locations)
Prepare->>Cache: compute block_offsets<br/>block_ids_per_seq
Cache->>KV: RoPE/append new tokens<br/>to paged cache
KV->>KV: reload [past+new] KV
KV->>KV: project compressed KV
KV->>Attention: call thop.attention<br/>(with latent_cache)
end
rect rgba(150, 200, 100, 0.5)
Note over Frontend,Attention: TRT-LLM MLA Decode
Frontend->>Prepare: prepare_trtllm_mla_metadata<br/>(batch_info, cache locations)
Prepare->>Cache: compute block_offsets
Cache->>KV: mla_rope_generation<br/>write decoded tokens
KV->>Attention: call thop.attention<br/>(generation mode)
Attention->>KV: project from latent space<br/>back to v_head_dim
end
rect rgba(200, 150, 100, 0.5)
Note over Frontend,Attention: Mixed Prefill+Decode (In-Order Slices)
Frontend->>Frontend: auto_deploy::trtllm_mla_with_cache<br/>processes prefill and decode slices
Frontend->>Prepare: for each slice:<br/>prepare metadata
Prepare->>Attention: execute appropriate<br/>prefill or decode path
Attention->>Frontend: return output (reuse buffer if provided)
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fused_multihead_attention_common.h (1)
2-2: ⚠️ Potential issue | 🟡 Minor — Update the copyright year on this modified file.
Line 2 still ends at 2025 even though this file is modified in this PR.
📝 Suggested patch
- * Copyright (c) 2020-2025, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2020-2026, NVIDIA CORPORATION. All rights reserved.
As per coding guidelines, "Include NVIDIA copyright header on all new files; update year on modified files".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fused_multihead_attention_common.h` at line 2, update the copyright year range in the file header string "Copyright (c) 2020-2025, NVIDIA CORPORATION. All rights reserved." to include the current modification year (e.g., change 2025 to 2026) so the header on fused_multihead_attention_common.h reflects that the file was modified in this PR.
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmhaRunner.cpp (1)
2-2: ⚠️ Potential issue | 🟡 Minor — Update the file header year to 2026.
This modified file still shows 2025 in the copyright line.
📝 Suggested patch
- * Copyright (c) 2020-2025, NVIDIA CORPORATION. All rights reserved.
+ * Copyright (c) 2020-2026, NVIDIA CORPORATION. All rights reserved.
As per coding guidelines, "Include NVIDIA copyright header on all new files; update year on modified files".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmhaRunner.cpp` at line 2, Update the file header copyright year from 2025 to 2026 in the fmhaRunner.cpp file header; locate the top-of-file copyright line (the file header comment block containing "Copyright (c) 2020-2025, NVIDIA CORPORATION.") and change "2025" to "2026" (also scan the same header block for any duplicated year entries and update them as well).
🧹 Nitpick comments (5)
tests/unittest/auto_deploy/singlegpu/smoke/test_ad_trtllm_serve.py (1)
93-109: QA list update is unnecessary for this test-only scope.
This PR's scope here is under tests/unittest/, so no tests/integration/test_lists/qa/* update is needed.
As per coding guidelines: "If the PR only touches unittest/ or narrow unit scope, say explicitly whether QA list updates are unnecessary or optional."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/auto_deploy/singlegpu/smoke/test_ad_trtllm_serve.py` around lines 93 - 109, this test lives under tests/unittest/ and does not require any QA list changes; update the PR or the test header to explicitly state that QA list updates under tests/integration/test_lists/qa/* are unnecessary for this change, e.g., add a brief comment near the test_trtllm_serve_openai_chat_completion definition clarifying that the QA list update is not required for unit-scope changes.
cpp/tensorrt_llm/common/attentionOp.h (1)
146-147: Include the new V-stride field in debug output.
Line 147 adds v_stride_in_bytes, but enqueueContextParamsToString() does not print it, which makes stride/layout triage harder.
💡 Suggested patch
  ss << "k_ptr: " << this->k_ptr << std::endl;
  ss << "v_ptr: " << this->v_ptr << std::endl;
+ ss << "v_stride_in_bytes: " << this->v_stride_in_bytes << std::endl;
  return ss.str();
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/common/attentionOp.h` around lines 146 - 147, enqueueContextParamsToString() currently omits the newly added field v_stride_in_bytes, so update that function to include v_stride_in_bytes in its debug output; locate enqueueContextParamsToString (used to stringify the attention/enqueue context) and append a labeled entry like "v_stride_in_bytes=" + std::to_string(v_stride_in_bytes) (or the existing value formatting used for other stride fields) to the returned string so the V tensor stride is printed alongside the other stride/layout fields.
examples/auto_deploy/model_registry/configs/deepseek-r1.yaml (1)
33-35: Clarify the intentional bucket cap vs max token limit.
max_num_tokens is 15360, but piecewise_num_tokens tops out at 8192. Requests above 8192 will run eager. If that is intended, add a short comment here so this doesn’t look like accidental under-capture.
Verify each finding against the current code and only fix it if needed. In `@examples/auto_deploy/model_registry/configs/deepseek-r1.yaml` around lines 33 - 35, clarify that the piecewise bucket cap is intentional by adding an inline comment next to piecewise_enabled or piecewise_num_tokens explaining that piecewise_num_tokens intentionally tops out at 8192 while max_num_tokens remains 15360, and that any requests above 8192 will be handled eagerly; update the YAML comment near the symbols piecewise_enabled, piecewise_num_tokens, and max_num_tokens to state this explicit design decision so it is not mistaken for an accidental under-capture.
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_mla.py (1)
57-61: Simplify _trace_rope_node return shape to remove duplicate/unused value.
Line 461 unpacks rope_node_i but never uses it; the helper currently returns rope_node twice. This can be simplified for clarity.
Optional cleanup
-def _trace_rope_node(mla_node: Node) -> Optional[Tuple[Node, Node, Node, Node]]:
+def _trace_rope_node(mla_node: Node) -> Optional[Tuple[Node, Node, Node]]:
@@
-    Returns (rope_node, q_pe_pre, kpe_pre, rope_node) or None if
+    Returns (rope_node, q_pe_pre, kpe_pre) or None if
@@
-    return rope_node, q_pe_pre, kpe_pre, rope_node
+    return rope_node, q_pe_pre, kpe_pre
@@
-    rope_node, _, _, _ = trace_result
+    rope_node, _, _ = trace_result
@@
-    rope_node_i, q_pe_pre, kpe_pre, _ = result
+    _, q_pe_pre, kpe_pre = result
Also applies to: 108-109, 445-446, 461-462
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_mla.py` around lines 57 - 61, the helper _trace_rope_node currently returns (rope_node, q_pe_pre, kpe_pre, rope_node) and callers unpack four values but never use the duplicate rope_node; change the return type to Optional[Tuple[Node, Node, Node]] and return (rope_node, q_pe_pre, kpe_pre) instead, update the docstring and function signature accordingly, and fix all call sites that currently unpack four values (remove the redundant fourth variable such as rope_node_i and unpack only rope_node, q_pe_pre, kpe_pre) so callers at the locations referenced are updated to expect three values.
tests/unittest/auto_deploy/singlegpu/custom_ops/mla/test_trtllm_mla_op.py (1)
370-474: Cover a decode path with out= too.
This validates bucketed prefill, but the new out= contract is also relevant for extend/decode. Add one single-token decode assertion so a regression in the dynamic-output path doesn’t slip through.
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/auto_deploy/singlegpu/custom_ops/mla/test_trtllm_mla_op.py` around lines 370 - 474, Add a single-token decode/extend assertion: call _run_trtllm_mla with an out= buffer using inputs/meta that indicate one additional decode token (update padded_meta or create a new meta with per-batch lengths increased by 1), compute the expected decode output by running _run_trtllm_mla for the same scenario without padding (or with real_inputs/real_meta extended by one token), then assert out_result.numel()==0 and compare out[:, total_tokens] (the new decode token) to the expected decode token with torch.testing.assert_close and verify out[:, total_tokens+1:] remains zero; update test_trtllm_mla_out_buffer_padding to include this check using the existing helpers (_run_trtllm_mla, padded_inputs, padded_meta, real_inputs, real_meta).
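To make the suggested assertion concrete, here is a minimal, self-contained sketch of the check pattern. run_decode_with_out is a hypothetical stand-in for calling _run_trtllm_mla with an out= buffer on one decode token; the shapes and values are illustrative only, not the real test fixtures.

```python
import torch

def run_decode_with_out(out: torch.Tensor, total_tokens: int) -> torch.Tensor:
    """Hypothetical stand-in for _run_trtllm_mla(..., out=out) on one decode token."""
    out[:, total_tokens] = 1.0      # the op writes only the new decode-token row
    return out.new_empty(0)         # and returns an empty alias per the out= contract

bucket_tokens, num_heads, v_head_dim = 8, 2, 4
total_tokens = 3                    # tokens already present before the decode step
out = torch.zeros(1, bucket_tokens, num_heads, v_head_dim)

ret = run_decode_with_out(out, total_tokens)
expected_token = torch.ones(1, num_heads, v_head_dim)

assert ret.numel() == 0                                            # empty alias returned
torch.testing.assert_close(out[:, total_tokens], expected_token)   # decode token written
assert torch.all(out[:, total_tokens + 1:] == 0)                   # padded tail untouched
```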
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/auto_deploy/models/hf.py`:
- Around line 335-337: Replace the runtime assert that checks kv_cache_dtype
with an explicit ValueError raise so validation always runs (assert can be
skipped with -O). Locate the check using the kv_cache_dtype variable (the
current assert line) and change it to raise ValueError(f"Unsupported dtype:
{kv_cache_dtype}. Only fp8 and auto are supported.") so the same message is
preserved and the validation is enforced at runtime.
In `@tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_mla.py`:
- Around line 221-225: The try/except around factory._get_model_config() is too
broad; replace the bare except with specific exception types that
_get_model_config can raise (e.g., AttributeError, KeyError, ValueError or a
factory-specific exception if one exists), log the caught exception via
ad_logger.debug including the exception details, and only return None for those
expected config-access failures while letting unexpected exceptions propagate;
target the block using factory._get_model_config(), ad_logger.debug(...), and
the current return None.
In `@tests/unittest/auto_deploy/singlegpu/custom_ops/mla/test_trtllm_mla_op.py`:
- Around line 182-184: The zip() usage that computes kv_lengths and
pages_per_seq can silently truncate if input_positions and seq_lengths have
different lengths; update both occurrences (the comprehension creating
kv_lengths and the similar logic in _build_metadata_with_pages()) to call
zip(input_positions, seq_lengths, strict=True) so the code fails fast on
mismatched metadata lengths and surfaces bad test fixtures immediately.
- Around line 645-656: The zip between input_positions and seq_lengths that
builds kv_lengths should be made strict to fail fast on mismatched test data:
change the expression creating kv_lengths from zip(input_positions, seq_lengths)
to zip(input_positions, seq_lengths, strict=True) so a length mismatch raises
immediately; similarly update any other zip usages in this helper that pair
input_positions with seq_lengths (e.g., the kv_lengths comprehension and any
loops that iterate page_assignments alongside seq_lengths) to use strict=True to
prevent silent truncation.
---
Outside diff comments:
In `@cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmhaRunner.cpp`:
- Line 2: Update the file header copyright year from 2025 to 2026 in the
fmhaRunner.cpp file header; locate the top-of-file copyright line (the file
header comment block containing "Copyright (c) 2020-2025, NVIDIA CORPORATION.")
and change "2025" to "2026" (also scan the same header block for any duplicated
year entries and update them as well).
In
`@cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fused_multihead_attention_common.h`:
- Line 2: Update the copyright year range in the file header string "Copyright
(c) 2020-2025, NVIDIA CORPORATION. All rights reserved." to include the current
modification year (e.g., change 2025 to 2026) so the header on
fused_multihead_attention_common.h reflects the file was modified in this PR.
---
Nitpick comments:
In `@cpp/tensorrt_llm/common/attentionOp.h`:
- Around line 146-147: enqueueContextParamsToString() currently omits the newly
added field v_stride_in_bytes, so update that function to include
v_stride_in_bytes in its debug output; locate enqueueContextParamsToString (used
to stringify the attention/enqueue context) and append a labeled entry like
"v_stride_in_bytes=" + std::to_string(v_stride_in_bytes) (or the existing value
formatting used for other stride fields) to the returned string so the V tensor
stride is printed alongside the other stride/layout fields.
In `@examples/auto_deploy/model_registry/configs/deepseek-r1.yaml`:
- Around line 33-35: Clarify that the piecewise bucket cap is intentional by
adding an inline comment next to piecewise_enabled or piecewise_num_tokens
explaining that piecewise_num_tokens intentionally tops out at 8192 while
max_num_tokens remains 15360, and that any requests above 8192 will be handled
eagerly; update the YAML comment near the symbols piecewise_enabled,
piecewise_num_tokens, and max_num_tokens to state this explicit design decision
so it is not mistaken for an accidental under-capture.
In `@tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_mla.py`:
- Around line 57-61: The helper _trace_rope_node currently returns (rope_node,
q_pe_pre, kpe_pre, rope_node) and callers unpack four values but never use the
duplicate rope_node; change the return type to Optional[Tuple[Node, Node, Node]]
and return (rope_node, q_pe_pre, kpe_pre) instead, update the docstring and
function signature accordingly, and fix all call sites that currently unpack
four values (remove the redundant fourth variable such as rope_node_i and unpack
only rope_node, q_pe_pre, kpe_pre) so callers at the locations referenced are
updated to expect three values.
In `@tests/unittest/auto_deploy/singlegpu/custom_ops/mla/test_trtllm_mla_op.py`:
- Around line 370-474: Add a single-token decode/extend assertion: call
_run_trtllm_mla with an out= buffer using inputs/meta that indicate one
additional decode token (update padded_meta or create a new meta with per-batch
lengths increased by 1), compute the expected decode output by running
_run_trtllm_mla for the same scenario without padding (or with
real_inputs/real_meta extended by one token), then assert out_result.numel()==0
and compare out[:, total_tokens] (the new decode token) to the expected decode
token with torch.testing.assert_close and verify out[:, total_tokens+1:] remains
zero; update test_trtllm_mla_out_buffer_padding to include this check using the
existing helpers (_run_trtllm_mla, padded_inputs, padded_meta, real_inputs,
real_meta).
In `@tests/unittest/auto_deploy/singlegpu/smoke/test_ad_trtllm_serve.py`:
- Around line 93-109: This test lives under tests/unittest/ and does not require
any QA list changes; update the PR or the test header to explicitly state that
QA list updates under tests/integration/test_lists/qa/* are unnecessary for this
change, e.g., add a brief comment near the
test_trtllm_serve_openai_chat_completion definition clarifying the QA list
update is not required for unit-scope changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f2511657-e58d-4ad3-b70e-6d193df2dc34
📒 Files selected for processing (22)
cpp/tensorrt_llm/common/attentionOp.h
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fmhaRunner.cpp
cpp/tensorrt_llm/kernels/contextFusedMultiHeadAttention/fused_multihead_attention_common.h
cpp/tensorrt_llm/thop/attentionOp.cpp
examples/auto_deploy/create_standalone_package.py
examples/auto_deploy/model_registry/configs/deepseek-r1.yaml
tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
tensorrt_llm/_torch/auto_deploy/compile/piecewise_runner.py
tensorrt_llm/_torch/auto_deploy/compile/piecewise_utils.py
tensorrt_llm/_torch/auto_deploy/config/default.yaml
tensorrt_llm/_torch/auto_deploy/custom_ops/mla/__init__.py
tensorrt_llm/_torch/auto_deploy/custom_ops/mla/flashinfer_mla.py
tensorrt_llm/_torch/auto_deploy/custom_ops/mla/torch_backend_mla.py
tensorrt_llm/_torch/auto_deploy/custom_ops/mla/trtllm_mla.py
tensorrt_llm/_torch/auto_deploy/models/hf.py
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_rope_mla.py
tests/integration/defs/accuracy/references/gsm8k.yaml
tests/integration/defs/accuracy/references/mmlu.yaml
tests/unittest/auto_deploy/singlegpu/compile/test_captured_graph.py
tests/unittest/auto_deploy/singlegpu/custom_ops/mla/test_trtllm_mla_op.py
tests/unittest/auto_deploy/singlegpu/custom_ops/normalization/test_triton_rms_norm.py
tests/unittest/auto_deploy/singlegpu/smoke/test_ad_trtllm_serve.py
PR_Github #45947 [ run ] triggered by Bot. Commit:
PR_Github #45947 [ run ] completed with state
Force-pushed from d6d411f to 6bb225c
nvchenghaoz left a comment:
approve as the memory usage does not increase.
Force-pushed from 204b225 to f24a8cc
/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #46384 [ run ] triggered by Bot. Commit:
PR_Github #46384 [ run ] completed with state
/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #46583 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #46584 [ run ] triggered by Bot. Commit:
PR_Github #46584 [ run ] completed with state
/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #46600 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #46601 [ run ] triggered by Bot. Commit:
Piecewise CUDA graph (PWCG) infrastructure already exists in AutoDeploy. This commit wires the trtllm_mla cached-attention op into PWCG so that DeepSeek-R1 can run under PWCG. Required changes:

* ``piecewise_utils.py`` — re-classify ``auto_deploy::trtllm_mla_prepare_metadata`` from ``_METADATA_PREP_OPS`` to ``_PERSISTENT_BUFFER_OPS``. The MLA metadata op produces a stable persistent buffer (``planner.block_offsets``); the persistent-buffer classification is the right contract for it.
* ``torch_cudagraph.py`` — track ``_static_runners`` and call ``finalize_capture(nt)`` per bucket so dynamic-op output buffers stay strong-ref'd until the bucket's split-graph capture finishes — otherwise the shared graph pool can reuse those addresses for downstream graph outputs and replay reads garbage.
* ``piecewise_runner.py`` — add the ``finalize_capture`` lifecycle hook on ``ADPiecewiseRunner`` and store ``dynamic_out_bufs`` as strong refs during capture (transitioned to weak refs in ``finalize_capture``).
* ``trtllm_mla.py`` — add the ``out=`` kwarg to ``trtllm_mla_with_cache`` and plumb it through ``_mla_with_cache_impl`` so PWCG's ``DynamicOpWrapper`` can pre-allocate the bucket-sized output buffer in the graph pool, giving the next captured static segment a stable read address.

Config:

* ``deepseek-r1.yaml`` — enable PWCG (``compile_model.piecewise_enabled``, ``piecewise_num_tokens=[256..8192]``, ``max_num_tokens=15360``). The pre-existing trailing-static lm_head exclusion (``capture_lm_head`` ⇒ False by default in ``PiecewiseCapturedGraph``) keeps lm_head out of the captured buckets, so PWCG runs at the default 0.9 ``free_gpu_memory_fraction`` without lossy KV-cache budget.

Signed-off-by: Eran Geva <egeva@prenyx0035.a51.clusters.nvidia.com>

[None][test] AutoDeploy: re-enable MMLU for DeepSeek-R1-0528 PWCG accuracy test

PWCG + trtllm_mla now passes both MMLU (~82.7) and GSM8K (~94) on DeepSeek-R1-0528, so the temporary MMLU skip from the previous debugging commit can be removed.

Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>

[None][fix] AutoDeploy MLA: fix IMA in cache-reused prefill at high concurrency

At TP=8 / max_batch_size=256 / max_num_tokens=15360 / isl=1000 / conc=256, the cache-reused-prefill path (``_handle_prefill_thop_cached_kv``) hits a CUDA illegal memory access inside ``AttentionOp::enqueueContext``. The async surface signature is ``CUBLAS_STATUS_EXECUTION_FAILED`` collateral; under ``CUDA_LAUNCH_BLOCKING=1`` the first synchronous fault is in ``cudaStreamIsCapturing`` after a kernel launched from ``invokeMLAContextFp8Quantize``.

Repro: ``bench-sweep --isl 1000 --osl 1000 --concurrencies 256 --world-size 8`` with ``max_batch_size=256`` set in ``deepseek-r1.yaml``.

Root cause (Python caller-side bug):

* ``thop.attention``'s FP8 context-MLA workspace (``fp8_k_buf`` / ``fp8_v_buf``) is sized to ``chunked_prefill_buffer_batch_size * max_num_tokens`` tokens. The cache-reused-prefill kernel writes ``total_kv_len`` (sum of ``[past + new]`` over all in-flight prefill seqs) FP8 K/V tokens into that buffer. ``trtllm_mla.py`` hard-coded ``chunked_prefill_buffer_batch_size = 1``, so the buffer covered only ``max_num_tokens`` tokens while a real chunked-prefill batch routinely exceeds that — at the repro config we observed ``num_full_tokens = 84252`` vs ``max_num_tokens = 15360`` (5.5× OOB).

Fix:

1. Pass ``chunked_prefill_buffer_batch_size = 16`` at the cache-reused-prefill call site only (the fresh-prefill and decode call sites are correct at ``1``).
   The ``16 * 15360 = 245760`` token budget covers the diag's 84,252 tokens with ~3× margin. Avoids ``max_num_requests`` (which the regular AD trtllm attention path uses) because for MLA the formula multiplies by ``total_k_dim_all_heads`` and at ``max_num_requests=256`` would size ``fp8_k_buf`` to ~12 GiB / rank.

2. Bump the static workspace reserve from 512 MiB to 2 GiB. Without this, the C++ side (``cpp/tensorrt_llm/thop/attentionOp.cpp``) calls ``workspace.resize_()`` mid-run when N=16 needs ~1.3 GiB — which reallocates storage and invalidates the monolithic decode CUDA graphs that were captured against the original 512-MiB buffer, producing a SIGSEGV in ``at::cuda::CUDAGraph::replay()``. Reserving 2 GiB up front keeps ``resize_()`` quiescent so captured-graph pointers stay valid.

Validated end-to-end: yeonbok's bench-sweep at bs=256 completes cleanly with no IMA, no CUBLAS errors, no workspace resize warnings, and 1281 successful 200 OK responses at 5,270 output tok/s (matches the ``max_num_tokens<=8192`` workaround's throughput).

Signed-off-by: Eran Geva <egeva@prenyx0109.a51.clusters.nvidia.com>

[None][refactor] AutoDeploy MLA: split workspace tensor for eager / captured paths

Replaces the 2 GiB static workspace reserve from 250a0ce with the two-tensor pattern the standard ``trtllm_attention`` backend already uses (``workspace`` / ``cuda_graph_workspace``).

Background: ``thop.attention``'s C++ side resizes the workspace tensor in-place (``resize_()``) when its sizing formula exceeds the current capacity. ``resize_()`` reallocates storage and rebinds ``data_ptr_``, which **invalidates any captured CUDA graph that recorded the old address**. The previous fix worked around this by pre-allocating 2 GiB up front so ``resize_()`` would never fire — at the cost of permanently reserving memory that is rarely needed and depending on a hand-tuned upper bound that has to track future config changes (max_num_tokens, chunked_prefill_buffer_batch_size, etc.).

This change splits the single workspace into two tensors and routes per call site:

* ``workspace`` — used by eager paths (fresh + cache-reused prefill). Free to grow on demand via ``resize_()``; no captured graph references it, so storage churn is harmless.
* ``cuda_graph_workspace`` — used during CUDA-graph warmup and capture (decode). Grows lazily during warmup so the captured graph records the final pointer; afterwards no resize fires for the captured workload.

Routing happens in a new ``_TrtllmMLAPlanner._select_workspace()`` helper, called at all three ``thop.attention`` sites. The discriminator is the same signal ``plan_host`` already uses: ``torch.cuda.is_current_stream_capturing() or cuda_graph_state.in_warm_up()``. Both tensors start size-0 and grow on first use, mirroring the standard backend.

Validated on B200 TP=8:

* Yeonbok's bs=256 / isl=osl=1000 / conc=256 bench-sweep: 1281 × 200 OK, 5283 output tok/s, 0 IMA, 0 CUBLAS errors, 0 SIGSEGV. Resize warnings fire as expected (``cuda_graph_workspace`` 0 → 287 MiB during warmup, ``workspace`` 0 → 168 MiB across eager prefill chunks) and none touches a tensor a captured graph references.
* Registry accuracy run for DeepSeek-R1-0528: MMLU 87.33 (ref 84.72, threshold 82.91), GSM8K 95.30 (ref 92.72, threshold 89.52). Both pass.

Recovers ~1.5 GiB / rank versus the static reserve and removes the dependency on the 2 GiB upper bound. ``chunked_prefill_buffer_batch_size = 16`` at the cache-reused-prefill site (the actual IMA fix from 250a0ce) is unchanged.
Signed-off-by: Eran Geva <egeva@prenyx0074.a51.clusters.nvidia.com>
Signed-off-by: Eran Geva <egeva@prenyx0167.a51.clusters.nvidia.com>
Signed-off-by: Eran Geva <19514940+MrGeva@users.noreply.github.com>
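For reference, the buffer-sizing arithmetic behind the chunked_prefill_buffer_batch_size = 16 fix above, restated as a tiny check using only the numbers quoted in the commit message (the formula and figures come from the commit text, not from reading the kernel code):

```python
max_num_tokens = 15360
observed_peak_tokens = 84252   # num_full_tokens reported at the repro config

def fp8_kv_buffer_tokens(chunked_prefill_buffer_batch_size: int) -> int:
    # Per the commit message, the FP8 K/V workspace covers this many tokens.
    return chunked_prefill_buffer_batch_size * max_num_tokens

assert fp8_kv_buffer_tokens(1) == 15360      # old sizing: ~5.5x too small for 84,252 tokens
assert fp8_kv_buffer_tokens(16) == 245760    # new sizing: ~2.9x margin over the observed peak
assert fp8_kv_buffer_tokens(16) > observed_peak_tokens
```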
/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
PR_Github #46607 [ run ] triggered by Bot. Commit:
PR_Github #46607 [ run ] completed with state
/bot run --disable-fail-fast --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"
/bot help
GitHub Bot Help
Provide a user friendly way for developers to interact with a Jenkins server. Run
See details below for each supported subcommand.
Details
run
Launch build/test pipelines. All previously running jobs will be killed.
kill
Kill all running builds associated with pull request.
skip
Skip testing for latest commit on pull request.
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
PR_Github #46621 [ run ] triggered by Bot. Commit:
/bot skip --comment "failed on non related issue, all the rest passed"
PR_Github #46622 [ skip ] triggered by Bot. Commit:
PR_Github #46622 [ skip ] completed with state
piecewise_utils.py: re-classify auto_deploy::trtllm_mla_prepare_metadata from _METADATA_PREP_OPS to _PERSISTENT_BUFFER_OPS — its output (planner.block_offsets) lives at a stable persistent address.
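For illustration, a minimal sketch of what this reclassification amounts to, assuming the two registries in piecewise_utils.py are plain collections of op-name strings (the real data structures may differ):

```python
# Hypothetical shape of the registries; only the membership change is the point.
_METADATA_PREP_OPS = {
    # ... other metadata-prep ops stay here ...
}
_PERSISTENT_BUFFER_OPS = {
    # Ops whose outputs live at a stable persistent address (e.g. planner.block_offsets),
    # so PWCG can treat them as persistent buffers rather than per-capture metadata prep.
    "auto_deploy::trtllm_mla_prepare_metadata",  # moved here from _METADATA_PREP_OPS
}
```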
torch_cudagraph.py + piecewise_runner.py (finalize_capture hook): keep dynamic-op output buffers strong-ref'd through the entire bucket's split-graph capture; without this the shared graph pool reuses captured slots across segments
mid-capture and replay reads garbage (verified — MMLU collapsed from 87.33 → 35.70 when we tried immediate weak-refs).
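A rough sketch of that strong-ref/weak-ref hand-off, assuming a runner shaped loosely like ADPiecewiseRunner; everything except the finalize_capture and dynamic_out_bufs names is illustrative:

```python
import weakref
import torch

class PiecewiseRunnerSketch:
    """Holds dynamic-op output buffers strongly while a bucket's split-graph
    capture is in flight, then downgrades them to weak refs afterwards."""

    def __init__(self) -> None:
        self.dynamic_out_bufs: list[torch.Tensor] = []   # strong refs during capture
        self._weak_out_bufs: list[weakref.ref] = []

    def record_dynamic_output(self, buf: torch.Tensor) -> None:
        # Strong ref: keeps the allocation alive so the shared graph pool cannot
        # hand the same address to a later segment while capture is still ongoing.
        self.dynamic_out_bufs.append(buf)

    def finalize_capture(self) -> None:
        # Called once the whole bucket has finished capturing: downgrade to weak
        # refs so the pool can reclaim the memory when nothing else holds it.
        self._weak_out_bufs = [weakref.ref(b) for b in self.dynamic_out_bufs]
        self.dynamic_out_bufs.clear()
```

The MMLU collapse noted above is exactly what the strong refs prevent: with immediate weak refs, the pool could recycle a captured output address before the remaining segments of the bucket were recorded.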
trtllm_mla.py: add the out= kwarg to trtllm_mla_with_cache and plumb it through _mla_with_cache_impl to honor the DynamicOpWrapper contract (write real-token rows into the bucket-sized buffer, zero the padded tail, return the empty
alias).
Split the AutoDeploy MLA thop.attention workspace into separate eager (prefill) and captured (decode) tensors so the C++ side's resize_() can grow the eager one freely without invalidating CUDA-graph-captured pointers — replacing the prior 512MB static reserve which failed on large setups.
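A minimal sketch of the split-workspace routing, assuming a planner-style object that owns both tensors; torch.cuda.is_current_stream_capturing() is the real PyTorch query, while the in_warm_up flag stands in for cuda_graph_state.in_warm_up():

```python
import torch

class WorkspaceRouterSketch:
    def __init__(self, device: str = "cuda") -> None:
        # Both tensors start empty and are grown on demand by the C++ side's resize_().
        self.workspace = torch.empty(0, dtype=torch.uint8, device=device)
        self.cuda_graph_workspace = torch.empty(0, dtype=torch.uint8, device=device)

    def _select_workspace(self, in_warm_up: bool) -> torch.Tensor:
        # Warmup/capture uses the dedicated tensor so the captured graph records its
        # final pointer; eager prefill is free to churn the other tensor's storage.
        if torch.cuda.is_current_stream_capturing() or in_warm_up:
            return self.cuda_graph_workspace
        return self.workspace
```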
examples/auto_deploy/model_registry/configs/deepseek-r1.yaml: enable PWCG (buckets [256..8192], max_num_tokens=15360).
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.