
[None][feat] Enable speculative decoding in TrtllmGen attention backend#12267

Merged
yihwang-nv merged 3 commits into NVIDIA:main from yihwang-nv:yihwang/trtllm_gen_attn_spec_decoding
Mar 19, 2026

Conversation

@yihwang-nv
Collaborator

@yihwang-nv yihwang-nv commented Mar 17, 2026

  • Speculative decoding is now supported by the TRTLLM-Gen attention backend.
  • The TRTLLM-Gen attention backend is now permanently enabled, simplifying configuration management.

Signed-off-by: Yihan Wang <yihwang@nvidia.com>
@yihwang-nv yihwang-nv requested a review from a team as a code owner March 17, 2026 02:30
@yihwang-nv yihwang-nv requested a review from PerkzZheng March 17, 2026 02:30
@yihwang-nv
Collaborator Author

/bot run --disable-fail-fast

@coderabbitai
Contributor

coderabbitai bot commented Mar 17, 2026

📝 Walkthrough

These changes enable the TRTLLM-Gen attention backend by default and remove the guard that blocked speculative decoding. Backend activation is converted from an environment-variable toggle to hard-coded enablement, and speculative decoding is now treated as a valid use case within the backend.

Changes

TRTLLM-Gen Backend Enablement
  File: tensorrt_llm/_torch/attention_backend/trtllm.py
  Summary: Changed TRTLLM-Gen attention activation from the environment variable TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION (defaulting to false) to hard-coded True, making the backend always active.

Speculative Decoding Support
  File: tensorrt_llm/_torch/attention_backend/trtllm_gen.py
  Summary: Removed the validation guard that previously rejected speculative decoding, allowing the backend to support speculative-decoding scenarios.
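The guard removal can be pictured with a minimal sketch. The class and attribute names below are hypothetical stand-ins, not the actual identifiers in trtllm_gen.py:

```python
# Hypothetical sketch of the change in trtllm_gen.py; the real class and
# attribute names may differ. Before this PR, metadata validation rejected
# speculative decoding; afterwards the same input is simply accepted.
class TrtllmGenAttentionMetadata:
    """Minimal stand-in for the backend's attention metadata."""

    def __init__(self, is_spec_decoding: bool = False):
        self.is_spec_decoding = is_spec_decoding

    def validate(self) -> None:
        # The removed guard looked roughly like this:
        #     if self.is_spec_decoding:
        #         raise NotImplementedError(
        #             "TRTLLM-Gen attention does not support speculative decoding")
        # With the guard gone, speculative-decoding metadata passes validation.
        pass


# Speculative-decoding metadata is now a valid input.
meta = TrtllmGenAttentionMetadata(is_spec_decoding=True)
meta.validate()  # no longer raises
```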

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Description check — ⚠️ Warning
  Explanation: The PR description is incomplete and does not follow the required template. Critical sections are missing: no issue/ticket reference in the title, an incomplete Description section, and unchecked PR Checklist items with no evidence of review.
  Resolution: Add a proper PR title with a ticket reference (e.g., [TRTLLM-XXXX][feat]), complete the Description section explaining the changes and rationale, and check and verify all PR Checklist items before submission.
✅ Passed checks (2 passed)
Title check — ✅ Passed
  The title clearly and specifically describes the main feature being added: enabling speculative decoding in the TrtllmGen attention backend, which aligns with the code changes.
Docstring Coverage — ✅ Passed
  No functions were found in the changed files, so the docstring coverage check was skipped.


Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
tensorrt_llm/_torch/attention_backend/trtllm.py (1)

33-35: Keep a default-on runtime kill switch instead of hard-coding True.

Line 35 removes the operational fallback path entirely. A safer approach is to keep the backend enabled by default but read an env override, so regressions can be mitigated at runtime without a code change.

Proposed change
-# Enable TRTLLM-Gen attention backend via environment variable.
-# _TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION = os.environ.get(
-#     "TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION", "0") == "1"
-_TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION = True
+# Enable TRTLLM-Gen attention backend by default, with runtime override.
+_TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION = os.environ.get(
+    "TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION", "1"
+) == "1"
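If the env-override variant from the diff above were adopted, an operator could flip the kill switch per process without touching code. An illustrative one-liner (not from the PR):

```shell
# Illustrative use of the proposed default-on kill switch: the backend stays
# enabled unless the variable is explicitly set to something other than "1".
TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION=0 python3 -c '
import os
enabled = os.environ.get("TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION", "1") == "1"
print("TRTLLM-Gen attention enabled:", enabled)
'
# prints: TRTLLM-Gen attention enabled: False
```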
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/attention_backend/trtllm.py` around lines 33 - 35,
Restore a runtime kill-switch for the gen-attention flag instead of hardcoding
True: set _TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION by reading an environment variable
(e.g., os.environ.get("TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION", "1") == "1") so it
is enabled by default but can be disabled at runtime; ensure os is imported and
replace the hardcoded assignment to _TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION with
this env-driven expression.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7a114aa4-8d7f-46f4-97c1-9d1165df7967

📥 Commits

Reviewing files that changed from the base of the PR and between 5003d38 and a5b4a8b.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/attention_backend/trtllm.py
  • tensorrt_llm/_torch/attention_backend/trtllm_gen.py
💤 Files with no reviewable changes (1)
  • tensorrt_llm/_torch/attention_backend/trtllm_gen.py

@tensorrt-cicd
Collaborator

PR_Github #39163 [ run ] triggered by Bot. Commit: a5b4a8b Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39163 [ run ] completed with state SUCCESS. Commit: a5b4a8b
/LLM/main/L0_MergeRequest_PR pipeline #30421 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@yihwang-nv
Collaborator Author

/bot run --disable-fail-fast

1 similar comment
@yihwang-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39259 [ run ] triggered by Bot. Commit: a5b4a8b Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39259 [ run ] completed with state FAILURE. Commit: a5b4a8b
/LLM/main/L0_MergeRequest_PR pipeline #30510 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@yihwang-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39277 [ run ] triggered by Bot. Commit: a5b4a8b Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39277 [ run ] completed with state FAILURE. Commit: a5b4a8b
/LLM/main/L0_MergeRequest_PR pipeline #30524 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@yihwang-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39458 [ run ] triggered by Bot. Commit: a5b4a8b Link to invocation

@yihwang-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39458 [ run ] completed with state FAILURE. Commit: a5b4a8b
/LLM/main/L0_MergeRequest_PR pipeline #30684 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39469 [ run ] triggered by Bot. Commit: 0749891 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39469 [ run ] completed with state SUCCESS. Commit: 0749891
/LLM/main/L0_MergeRequest_PR pipeline #30693 completed with status: 'SUCCESS'
Pipeline passed after automatic test retries; check the rerun report for details.

CI Report

Link to invocation

@yihwang-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39528 [ run ] triggered by Bot. Commit: 0749891 Link to invocation

@yihwang-nv yihwang-nv requested a review from yuxianq March 19, 2026 03:56
@tensorrt-cicd
Collaborator

PR_Github #39528 [ run ] completed with state SUCCESS. Commit: 0749891
/LLM/main/L0_MergeRequest_PR pipeline #30749 completed with status: 'SUCCESS'

CI Report

Link to invocation

Signed-off-by: Yihan Wang <yihwang@nvidia.com>
@yihwang-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39538 [ run ] triggered by Bot. Commit: 12f865d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39538 [ run ] completed with state SUCCESS. Commit: 12f865d
/LLM/main/L0_MergeRequest_PR pipeline #30756 completed with status: 'SUCCESS'

CI Report

Link to invocation

@yihwang-nv yihwang-nv merged commit db92761 into NVIDIA:main Mar 19, 2026
5 checks passed
