
[https://nvbugs/6071081][fix] Disable spec decoding on Blackwell for MLA models#13000

Open
sunnyqgg wants to merge 3 commits into NVIDIA:main from sunnyqgg:bug_6071081

Conversation

Collaborator

@sunnyqgg sunnyqgg commented Apr 13, 2026

Summary

  • Re-adds a guard to disable speculative decoding on Blackwell (sm100+) GPUs specifically for MLA models (DeepSeek-V2/V3/R1)
  • The blanket Blackwell spec-decoding guard was intentionally removed in 4ece13c to enable EAGLE3 dynamic tree support. This patch restores the guard selectively for MLA models only, keeping EAGLE3 functional for non-MLA models (e.g. LLaMA) on Blackwell
  • The trtllmGen FMHA kernel does not support spec-decoding with MLA's non-standard V head dimensions

Changes

  • interface.py: Add is_mla_enable field to AttentionMetadata base class
  • trtllm.py: Add guard in update_spec_dec_param to disable spec decoding when MLA + Blackwell
  • model_engine.py: Pass is_mla_enable when constructing attention metadata
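
Taken together, the three changes amount to a guard of roughly the following shape. This is a minimal, self-contained sketch, not the actual patch: the real update_spec_dec_param in trtllm.py takes more parameters, and get_sm_version here is a stand-in for the real SM-version query.

```python
def get_sm_version() -> int:
    """Stand-in for the real SM-version query (90 = Hopper, 100+ = Blackwell)."""
    return 100


def update_spec_dec_param(is_spec_decoding_enabled: bool,
                          is_mla_enable: bool) -> bool:
    # trtllmGen FMHA does not support spec decoding with MLA's non-standard
    # V head dimensions, so force-disable the feature on sm100+.
    if is_spec_decoding_enabled and is_mla_enable and get_sm_version() >= 100:
        return False
    return is_spec_decoding_enabled


# MLA model (e.g. DeepSeek-V3) on Blackwell: spec decoding is disabled.
assert update_spec_dec_param(True, True) is False
# Non-MLA model (e.g. LLaMA) keeps EAGLE3 spec decoding enabled.
assert update_spec_dec_param(True, False) is True
```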

Test plan

  • Verify non-MLA models (LLaMA) with EAGLE3 spec decoding on Blackwell still work
  • Verify MLA models (DeepSeek-V3) on Blackwell no longer hit trtllmGen FMHA assertion
  • Verify MLA models on Hopper with spec decoding are unaffected

Bug 6071081

Summary by CodeRabbit

  • New Features

    • Enhanced Multi-head Latent Attention (MLA) support with improved configuration handling.
  • Bug Fixes

    • Fixed spec decoding behavior on newer GPU architectures when using MLA.
  • Improvements

    • Optimized MLA enablement calculation for better performance.

On Blackwell (sm100+), the trtllmGen FMHA kernel does not support
spec-decoding with MLA's non-standard V head dimensions used by
DeepSeek-V2/V3/R1. The guard that previously blocked all spec
decoding on Blackwell was intentionally removed in 4ece13c to
enable EAGLE3 dynamic tree support. This patch re-adds the guard
selectively for MLA models only, keeping EAGLE3 working for
non-MLA models (e.g. LLaMA) on Blackwell.

Bug 6071081

Signed-off-by: qgai <qgai@nvidia.com>
@sunnyqgg sunnyqgg requested review from a team as code owners April 13, 2026 13:30
@sunnyqgg
Collaborator Author

/bot run

@coderabbitai
Contributor

coderabbitai bot commented Apr 13, 2026

📝 Walkthrough

Walkthrough

Three files are modified to introduce MLA (Multi-head Latent Attention) enablement tracking. A new boolean field is_mla_enable is added to AttentionMetadata, propagated through model engine initialization, and used to conditionally disable spec decoding on Blackwell+ GPUs when MLA is active.

Changes

  • MLA Enablement Field (tensorrt_llm/_torch/attention_backend/interface.py): Added boolean field is_mla_enable: bool = False to the AttentionMetadata class to track MLA enablement state.
  • Spec Decoding Override Logic (tensorrt_llm/_torch/attention_backend/trtllm.py): Conditional logic in update_spec_dec_param forcibly disables spec decoding when MLA is enabled and a Blackwell+ GPU with the TRTLLM-Gen FMHA kernel is active.
  • MLA Computation & Propagation (tensorrt_llm/_torch/pyexecutor/model_engine.py): Refactored the MLA enablement calculation in _set_up_attn_metadata by caching is_mla_enable from the config; the new field is passed to both the KV-cache and non-KV-cache attention metadata initializations.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 50.00%, which is below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

  • Description check ✅ Passed: The PR description clearly explains the issue, solution, changes, and test plan. All required sections are present and substantially filled.
  • Title check ✅ Passed: The title clearly and specifically summarizes the main change, disabling spec decoding on Blackwell GPUs for MLA models, which is the core objective of the PR.



Comment @coderabbitai help to get the list of available commands and usage tips.

@sunnyqgg
Collaborator Author

/bot run --extra-stage "DGX_H100-PyTorch-1"

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
tensorrt_llm/_torch/attention_backend/trtllm.py (1)

1551-1556: Optional: add a one-time log when spec decoding is force-disabled.

This makes runtime fallback easier to diagnose in perf/feature validation.

Suggested tweak
-        if is_spec_decoding_enabled and self.is_mla_enable \
-                and self.is_sm_version_trtllm_gen_kernel(sm=get_sm_version()):
+        if (is_spec_decoding_enabled and self.is_mla_enable
+                and self.is_sm_version_trtllm_gen_kernel(sm=get_sm_version())):
+            logger.info(
+                "Disabling speculative decoding for MLA on Blackwell+ "
+                "because trtllmGen FMHA does not support this combination.")
             is_spec_decoding_enabled = False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/attention_backend/trtllm.py` around lines 1551 - 1556,
Summary: Add a one-time log entry when spec decoding is force-disabled for MLA
on SM100+ to aid diagnostics. Update the block that checks
is_spec_decoding_enabled, self.is_mla_enable and
self.is_sm_version_trtllm_gen_kernel(sm=get_sm_version()) so that when you set
is_spec_decoding_enabled = False you also emit a single log message (including
the SM version and reason) but only once per process/instance; implement this by
adding and checking a boolean flag (e.g.,
self._has_logged_spec_decoding_disabled or a module-level
_has_logged_spec_decoding_disabled) before logging, and use the existing logger
(e.g., self.logger or logging.getLogger(__name__)) to record the one-time
message.
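
The reviewer's one-time-log suggestion could look roughly like this. The class and attribute names are hypothetical, as flagged in the prompt above; only the once-per-process logging pattern is the point.

```python
import logging

logger = logging.getLogger(__name__)


class TrtllmAttention:
    """Hypothetical holder for the spec-decoding guard."""

    # Class-level flag so the fallback is logged once per process.
    _has_logged_spec_decoding_disabled = False

    def disable_spec_decoding(self, sm: int) -> bool:
        cls = TrtllmAttention
        if not cls._has_logged_spec_decoding_disabled:
            logger.info(
                "Disabling speculative decoding for MLA on SM%d because "
                "trtllmGen FMHA does not support this combination.", sm)
            cls._has_logged_spec_decoding_disabled = True
        return False  # the new value of is_spec_decoding_enabled


attn = TrtllmAttention()
attn.disable_spec_decoding(100)  # logs once
attn.disable_spec_decoding(100)  # silent on repeat calls
```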

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 6df3f7d7-77e7-4f72-977d-939a7fd1bf7c

📥 Commits

Reviewing files that changed from the base of the PR and between 3605638 and 82758e5.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/attention_backend/interface.py
  • tensorrt_llm/_torch/attention_backend/trtllm.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py

@tensorrt-cicd
Collaborator

PR_Github #43046 [ run ] triggered by Bot. Commit: 82758e5 Link to invocation

@sunnyqgg sunnyqgg changed the title [None][fix] Disable spec decoding on Blackwell for MLA models [https://nvbugs/6071081][fix] Disable spec decoding on Blackwell for MLA models Apr 13, 2026
@tensorrt-cicd
Collaborator

PR_Github #43048 [ run ] triggered by Bot. Commit: 82758e5 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43046 [ run ] completed with state ABORTED. Commit: 82758e5

Link to invocation

@nv-guomingz
Collaborator

Hi @sunnyqgg, I just waived several cases via #13001. If this PR fixes those cases, would you please unwaive them in this PR?

@tensorrt-cicd
Collaborator

PR_Github #43048 [ run ] completed with state SUCCESS. Commit: 82758e5
/LLM/main/L0_MergeRequest_PR pipeline #33693 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@brb-nv
Collaborator

brb-nv commented Apr 13, 2026

/bot run --disable-fail-fast --extra-stage "DGX_H100-PyTorch-1"

@tensorrt-cicd
Collaborator

PR_Github #43075 [ run ] triggered by Bot. Commit: 82758e5 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43075 [ run ] completed with state SUCCESS. Commit: 82758e5
/LLM/main/L0_MergeRequest_PR pipeline #33713 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Broaden the scope from MLA-only to all models on Blackwell (sm100+).
The trtllmGen FMHA kernels do not yet support speculative decoding mode,
which causes assertion failures for all model architectures on B200/GB200.

Remove the is_mla_enable plumbing that is no longer needed.

Signed-off-by: qgai <qgai@nvidia.com>
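
After this commit the guard no longer consults an MLA flag. A sketch of the broadened check, with the same illustrative names as before (the real code in trtllm.py differs in signature and context):

```python
def get_sm_version() -> int:
    """Stand-in for the real SM query; 100+ means Blackwell (B200/GB200)."""
    return 100


def update_spec_dec_param(is_spec_decoding_enabled: bool) -> bool:
    # Broadened guard: trtllmGen FMHA kernels do not yet support
    # speculative decoding mode on sm100+ for any model architecture.
    if is_spec_decoding_enabled and get_sm_version() >= 100:
        return False
    return is_spec_decoding_enabled


assert update_spec_dec_param(True) is False  # disabled on Blackwell for all models
```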
@sunnyqgg
Collaborator Author

/bot run --disable-fail-fast --extra-stage "DGX_H100-PyTorch-1"

@sunnyqgg
Collaborator Author

/bot run --disable-fail-fast


Signed-off-by: qgai <qgai@nvidia.com>
@sunnyqgg
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #43129 [ run ] triggered by Bot. Commit: a728973 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43130 [ run ] triggered by Bot. Commit: a728973 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43129 [ run ] completed with state ABORTED. Commit: a728973

Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43132 [ run ] triggered by Bot. Commit: a728973 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43133 [ run ] triggered by Bot. Commit: a728973 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43133 [ run ] completed with state SUCCESS. Commit: a728973
/LLM/main/L0_MergeRequest_PR pipeline #33765 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@yiqingy0
Collaborator

/bot --help

@github-actions

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental) --high-priority]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

--high-priority (OPTIONAL) : Run the pipeline with high priority. This option is restricted to authorized users only and will route the job to a high-priority queue.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@yiqingy0
Collaborator

/bot run --disable-fail-fast --high-priority

@tensorrt-cicd
Collaborator

PR_Github #43198 [ run ] triggered by Bot. Commit: a728973 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43198 [ run ] completed with state DISABLED
Freeze main and open the PR merge only after CI is back to healthy https://nvidia.slack.com/archives/C059LSY62BT/p1776141760843319?thread_ts=1775985925.442509&cid=C059LSY62BT

Link to invocation

@yiqingy0
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #43199 [ run ] triggered by Bot. Commit: a728973 Link to invocation

@yiqingy0
Collaborator

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #43199 [ run ] completed with state SUCCESS. Commit: a728973
/LLM/main/L0_MergeRequest_PR pipeline #33804 completed with status: 'SUCCESS'

CI Report

Link to invocation

@yiqingy0
Collaborator

/bot run --disable-fail-fast --add-multi-gpu-test

@tensorrt-cicd
Collaborator

PR_Github #43218 [ run ] triggered by Bot. Commit: a728973 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #43218 [ run ] completed with state SUCCESS. Commit: a728973
/LLM/main/L0_MergeRequest_PR pipeline #33807 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

