
[https://nvbugs/6071081][fix] Disable spec decoding on Blackwell for MLA models#13000

Open
sunnyqgg wants to merge 3 commits into NVIDIA:main from sunnyqgg:bug_6071081

Conversation

Collaborator

@sunnyqgg sunnyqgg commented Apr 13, 2026

Summary

  • Re-adds a guard to disable speculative decoding on Blackwell (sm100+) GPUs specifically for MLA models (DeepSeek-V2/V3/R1)
  • The blanket Blackwell spec-decoding guard was intentionally removed in 4ece13c to enable EAGLE3 dynamic tree support. This patch restores the guard selectively for MLA models only, keeping EAGLE3 functional for non-MLA models (e.g. LLaMA) on Blackwell
  • The trtllmGen FMHA kernel does not support spec-decoding with MLA's non-standard V head dimensions

Changes

  • interface.py: Add is_mla_enable field to AttentionMetadata base class
  • trtllm.py: Add guard in update_spec_dec_param to disable spec decoding when MLA + Blackwell
  • model_engine.py: Pass is_mla_enable when constructing attention metadata
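
Taken together, the three changes amount to a guard of roughly the following shape. This is a minimal, self-contained sketch, not the actual patch: the real update_spec_dec_param in trtllm.py takes more parameters, and get_sm_version here is a stand-in for the real SM-version query.

```python
def get_sm_version() -> int:
    """Stand-in for the real SM-version query (90 = Hopper, 100+ = Blackwell)."""
    return 100


def update_spec_dec_param(is_spec_decoding_enabled: bool,
                          is_mla_enable: bool) -> bool:
    # trtllmGen FMHA does not support spec decoding with MLA's non-standard
    # V head dimensions, so force-disable the feature on sm100+.
    if is_spec_decoding_enabled and is_mla_enable and get_sm_version() >= 100:
        return False
    return is_spec_decoding_enabled


# MLA model (e.g. DeepSeek-V3) on Blackwell: spec decoding is disabled.
assert update_spec_dec_param(True, True) is False
# Non-MLA model (e.g. LLaMA) keeps EAGLE3 spec decoding enabled.
assert update_spec_dec_param(True, False) is True
```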

Test plan

  • Verify non-MLA models (LLaMA) with EAGLE3 spec decoding on Blackwell still work
  • Verify MLA models (DeepSeek-V3) on Blackwell no longer hit trtllmGen FMHA assertion
  • Verify MLA models on Hopper with spec decoding are unaffected

Bug 6071081

Summary by CodeRabbit

  • New Features

    • Enhanced Multi-head Latent Attention (MLA) support with improved configuration handling.
  • Bug Fixes

    • Fixed spec decoding behavior on newer GPU architectures when using MLA.
  • Improvements

    • Optimized MLA enablement calculation for better performance.

On Blackwell (sm100+), the trtllmGen FMHA kernel does not support
spec-decoding with MLA's non-standard V head dimensions used by
DeepSeek-V2/V3/R1. The guard that previously blocked all spec
decoding on Blackwell was intentionally removed in 4ece13c to
enable EAGLE3 dynamic tree support. This patch re-adds the guard
selectively for MLA models only, keeping EAGLE3 working for
non-MLA models (e.g. LLaMA) on Blackwell.

Bug 6071081

Signed-off-by: qgai <qgai@nvidia.com>
@sunnyqgg sunnyqgg requested review from a team as code owners April 13, 2026 13:30
@sunnyqgg
Collaborator Author

/bot run

@coderabbitai
Contributor

coderabbitai bot commented Apr 13, 2026

📝 Walkthrough

Walkthrough

Three files are modified to introduce MLA (Multi-head Latent Attention) enablement tracking. A new boolean field is_mla_enable is added to AttentionMetadata, propagated through model engine initialization, and used to conditionally disable spec decoding on Blackwell+ GPUs when MLA is active.

Changes

  • MLA Enablement Field (tensorrt_llm/_torch/attention_backend/interface.py): Added boolean field is_mla_enable: bool = False to the AttentionMetadata class to track MLA enablement state.
  • Spec Decoding Override Logic (tensorrt_llm/_torch/attention_backend/trtllm.py): Conditional logic in update_spec_dec_param forcibly disables spec decoding when MLA is enabled and a Blackwell+ GPU with the TRTLLM-Gen FMHA kernel is active.
  • MLA Computation & Propagation (tensorrt_llm/_torch/pyexecutor/model_engine.py): Refactored the MLA enablement calculation in _set_up_attn_metadata by caching is_mla_enable from the config; the new field is passed to both the KV-cache and non-KV-cache attention metadata initializations.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 50.00%, which is below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

  • Description check ✅ Passed: The PR description clearly explains the issue, solution, changes, and test plan. All required sections are present and substantially filled.
  • Title check ✅ Passed: The title clearly and specifically summarizes the main change, disabling spec decoding on Blackwell GPUs for MLA models, which is the core objective of the PR.



Comment @coderabbitai help to get the list of available commands and usage tips.

@sunnyqgg
Collaborator Author

/bot run --extra-stage "DGX_H100-PyTorch-1"

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
tensorrt_llm/_torch/attention_backend/trtllm.py (1)

1551-1556: Optional: add a one-time log when spec decoding is force-disabled.

This makes runtime fallback easier to diagnose in perf/feature validation.

Suggested tweak
-        if is_spec_decoding_enabled and self.is_mla_enable \
-                and self.is_sm_version_trtllm_gen_kernel(sm=get_sm_version()):
+        if (is_spec_decoding_enabled and self.is_mla_enable
+                and self.is_sm_version_trtllm_gen_kernel(sm=get_sm_version())):
+            logger.info(
+                "Disabling speculative decoding for MLA on Blackwell+ "
+                "because trtllmGen FMHA does not support this combination.")
             is_spec_decoding_enabled = False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/attention_backend/trtllm.py` around lines 1551 - 1556,
Summary: Add a one-time log entry when spec decoding is force-disabled for MLA
on SM100+ to aid diagnostics. Update the block that checks
is_spec_decoding_enabled, self.is_mla_enable and
self.is_sm_version_trtllm_gen_kernel(sm=get_sm_version()) so that when you set
is_spec_decoding_enabled = False you also emit a single log message (including
the SM version and reason) but only once per process/instance; implement this by
adding and checking a boolean flag (e.g.,
self._has_logged_spec_decoding_disabled or a module-level
_has_logged_spec_decoding_disabled) before logging, and use the existing logger
(e.g., self.logger or logging.getLogger(__name__)) to record the one-time
message.
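
The reviewer's one-time-log suggestion could look roughly like this. The class and attribute names are hypothetical, as flagged in the prompt above; only the once-per-process logging pattern is the point.

```python
import logging

logger = logging.getLogger(__name__)


class TrtllmAttention:
    """Hypothetical holder for the spec-decoding guard."""

    # Class-level flag so the fallback is logged once per process.
    _has_logged_spec_decoding_disabled = False

    def disable_spec_decoding(self, sm: int) -> bool:
        cls = TrtllmAttention
        if not cls._has_logged_spec_decoding_disabled:
            logger.info(
                "Disabling speculative decoding for MLA on SM%d because "
                "trtllmGen FMHA does not support this combination.", sm)
            cls._has_logged_spec_decoding_disabled = True
        return False  # the new value of is_spec_decoding_enabled


attn = TrtllmAttention()
attn.disable_spec_decoding(100)  # logs once
attn.disable_spec_decoding(100)  # silent on repeat calls
```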

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 6df3f7d7-77e7-4f72-977d-939a7fd1bf7c

📥 Commits

Reviewing files that changed from the base of the PR and between 3605638 and 82758e5.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/attention_backend/interface.py
  • tensorrt_llm/_torch/attention_backend/trtllm.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py

@tensorrt-cicd
Collaborator

PR_Github #43046 [ run ] triggered by Bot. Commit: 82758e5 Link to invocation

@sunnyqgg sunnyqgg changed the title [None][fix] Disable spec decoding on Blackwell for MLA models [https://nvbugs/6071081][fix] Disable spec decoding on Blackwell for MLA models Apr 13, 2026
@tensorrt-cicd
Collaborator

PR_Github #43048 [ run ] triggered by Bot. Commit: 82758e5 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43046 [ run ] completed with state ABORTED. Commit: 82758e5

Link to invocation

@nv-guomingz
Collaborator

Hi @sunnyqgg, I just waived several cases via #13001. If this PR fixes those cases, would you please unwaive them in this PR?

@tensorrt-cicd
Collaborator

PR_Github #43048 [ run ] completed with state SUCCESS. Commit: 82758e5
/LLM/main/L0_MergeRequest_PR pipeline #33693 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@brb-nv
Collaborator

brb-nv commented Apr 13, 2026

/bot run --disable-fail-fast --extra-stage "DGX_H100-PyTorch-1"

@tensorrt-cicd
Collaborator

PR_Github #43075 [ run ] triggered by Bot. Commit: 82758e5 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43075 [ run ] completed with state SUCCESS. Commit: 82758e5
/LLM/main/L0_MergeRequest_PR pipeline #33713 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Broaden the scope from MLA-only to all models on Blackwell (sm100+).
The trtllmGen FMHA kernels do not yet support speculative decoding mode,
which causes assertion failures for all model architectures on B200/GB200.

Remove the is_mla_enable plumbing that is no longer needed.

Signed-off-by: qgai <qgai@nvidia.com>
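
After this commit the guard no longer consults an MLA flag. A sketch of the broadened check, with the same illustrative names as before (the real code in trtllm.py differs in signature and context):

```python
def get_sm_version() -> int:
    """Stand-in for the real SM query; 100+ means Blackwell (B200/GB200)."""
    return 100


def update_spec_dec_param(is_spec_decoding_enabled: bool) -> bool:
    # Broadened guard: trtllmGen FMHA kernels do not yet support
    # speculative decoding mode on sm100+ for any model architecture.
    if is_spec_decoding_enabled and get_sm_version() >= 100:
        return False
    return is_spec_decoding_enabled


assert update_spec_dec_param(True) is False  # disabled on Blackwell for all models
```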
@sunnyqgg
Collaborator Author

/bot run --disable-fail-fast --extra-stage "DGX_H100-PyTorch-1"

@sunnyqgg
Collaborator Author

/bot run --disable-fail-fast


Signed-off-by: qgai <qgai@nvidia.com>
@sunnyqgg
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #43129 [ run ] triggered by Bot. Commit: a728973 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43130 [ run ] triggered by Bot. Commit: a728973 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43129 [ run ] completed with state ABORTED. Commit: a728973

Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43132 [ run ] triggered by Bot. Commit: a728973 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43133 [ run ] triggered by Bot. Commit: a728973 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43133 [ run ] completed with state SUCCESS. Commit: a728973
/LLM/main/L0_MergeRequest_PR pipeline #33765 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@yiqingy0
Collaborator

/bot --help

@github-actions

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental) --high-priority]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

--high-priority (OPTIONAL) : Run the pipeline with high priority. This option is restricted to authorized users only and will route the job to a high-priority queue.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@yiqingy0
Collaborator

/bot run --disable-fail-fast --high-priority

@tensorrt-cicd
Collaborator

PR_Github #43198 [ run ] triggered by Bot. Commit: a728973 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43198 [ run ] completed with state DISABLED
Freeze main and open the PR merge only after CI is back to healthy https://nvidia.slack.com/archives/C059LSY62BT/p1776141760843319?thread_ts=1775985925.442509&cid=C059LSY62BT

Link to invocation

@yiqingy0
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #43199 [ run ] triggered by Bot. Commit: a728973 Link to invocation

@yiqingy0
Collaborator

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #43199 [ run ] completed with state SUCCESS. Commit: a728973
/LLM/main/L0_MergeRequest_PR pipeline #33804 completed with status: 'SUCCESS'

CI Report

Link to invocation

@yiqingy0
Collaborator

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #43218 [ run ] triggered by Bot. Commit: a728973 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43218 [ run ] completed with state SUCCESS. Commit: a728973
/LLM/main/L0_MergeRequest_PR pipeline #33807 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

