
[None][feat] Enable speculative decoding in TrtllmGen attention backend#12267

Merged
yihwang-nv merged 3 commits into NVIDIA:main from yihwang-nv:yihwang/trtllm_gen_attn_spec_decoding
Mar 19, 2026

Conversation

@yihwang-nv
Collaborator

@yihwang-nv yihwang-nv commented Mar 17, 2026

  • Speculative decoding is now supported by the TRTLLM-Gen attention backend.
  • The TRTLLM-Gen attention backend is now permanently enabled, simplifying configuration management.

Signed-off-by: Yihan Wang <yihwang@nvidia.com>
@yihwang-nv yihwang-nv requested a review from a team as a code owner March 17, 2026 02:30
@yihwang-nv yihwang-nv requested a review from PerkzZheng March 17, 2026 02:30
@yihwang-nv
Collaborator Author

/bot run --disable-fail-fast

@coderabbitai
Contributor

coderabbitai bot commented Mar 17, 2026

📝 Walkthrough

These changes enable the TRTLLM-Gen attention backend by default and remove the guard that blocked speculative decoding. Backend activation is converted from an environment-variable toggle to hard-coded enablement, and speculative decoding is now treated as a valid use case within the backend.

Changes

TRTLLM-Gen Backend Enablement
  File: tensorrt_llm/_torch/attention_backend/trtllm.py
  Summary: Changed TRTLLM-Gen attention activation from the environment variable TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION (defaulting to false) to hard-coded True, making the backend always active.

Speculative Decoding Support
  File: tensorrt_llm/_torch/attention_backend/trtllm_gen.py
  Summary: Removed the validation guard that previously rejected speculative decoding, allowing the backend to support speculative-decoding scenarios.
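The guard removal can be pictured with a minimal sketch. The class and attribute names below are hypothetical stand-ins, not the actual identifiers in trtllm_gen.py:

```python
# Hypothetical sketch of the change in trtllm_gen.py; the real class and
# attribute names may differ. Before this PR, metadata validation rejected
# speculative decoding; afterwards the same input is simply accepted.
class TrtllmGenAttentionMetadata:
    """Minimal stand-in for the backend's attention metadata."""

    def __init__(self, is_spec_decoding: bool = False):
        self.is_spec_decoding = is_spec_decoding

    def validate(self) -> None:
        # The removed guard looked roughly like this:
        #     if self.is_spec_decoding:
        #         raise NotImplementedError(
        #             "TRTLLM-Gen attention does not support speculative decoding")
        # With the guard gone, speculative-decoding metadata passes validation.
        pass


# Speculative-decoding metadata is now a valid input.
meta = TrtllmGenAttentionMetadata(is_spec_decoding=True)
meta.validate()  # no longer raises
```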

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Description check — ⚠️ Warning
  Explanation: The PR description is incomplete and does not follow the required template. Critical sections are missing: no issue/ticket reference in the title, an incomplete Description section, and unchecked PR Checklist items with no evidence of review.
  Resolution: Add a proper PR title with a ticket reference (e.g., [TRTLLM-XXXX][feat]), complete the Description section explaining the changes and rationale, and check and verify all PR Checklist items before submission.
✅ Passed checks (2 passed)
Title check — ✅ Passed
  The title clearly and specifically describes the main feature being added: enabling speculative decoding in the TrtllmGen attention backend, which aligns with the code changes.
Docstring Coverage — ✅ Passed
  No functions were found in the changed files, so the docstring coverage check was skipped.


Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
tensorrt_llm/_torch/attention_backend/trtllm.py (1)

33-35: Keep a default-on runtime kill switch instead of hard-coding True.

Line 35 removes the operational fallback path entirely. A safer approach is to keep the backend enabled by default but read an env override, so regressions can be mitigated at runtime without a code change.

Proposed change
-# Enable TRTLLM-Gen attention backend via environment variable.
-# _TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION = os.environ.get(
-#     "TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION", "0") == "1"
-_TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION = True
+# Enable TRTLLM-Gen attention backend by default, with runtime override.
+_TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION = os.environ.get(
+    "TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION", "1"
+) == "1"
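If the env-override variant from the diff above were adopted, an operator could flip the kill switch per process without touching code. An illustrative one-liner (not from the PR):

```shell
# Illustrative use of the proposed default-on kill switch: the backend stays
# enabled unless the variable is explicitly set to something other than "1".
TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION=0 python3 -c '
import os
enabled = os.environ.get("TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION", "1") == "1"
print("TRTLLM-Gen attention enabled:", enabled)
'
# prints: TRTLLM-Gen attention enabled: False
```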
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/attention_backend/trtllm.py` around lines 33 - 35,
Restore a runtime kill-switch for the gen-attention flag instead of hardcoding
True: set _TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION by reading an environment variable
(e.g., os.environ.get("TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION", "1") == "1") so it
is enabled by default but can be disabled at runtime; ensure os is imported and
replace the hardcoded assignment to _TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION with
this env-driven expression.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7a114aa4-8d7f-46f4-97c1-9d1165df7967

📥 Commits

Reviewing files that changed from the base of the PR and between 5003d38 and a5b4a8b.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/attention_backend/trtllm.py
  • tensorrt_llm/_torch/attention_backend/trtllm_gen.py
💤 Files with no reviewable changes (1)
  • tensorrt_llm/_torch/attention_backend/trtllm_gen.py

@tensorrt-cicd
Collaborator

PR_Github #39163 [ run ] triggered by Bot. Commit: a5b4a8b Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39163 [ run ] completed with state SUCCESS. Commit: a5b4a8b
/LLM/main/L0_MergeRequest_PR pipeline #30421 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@yihwang-nv
Collaborator Author

/bot run --disable-fail-fast

1 similar comment
@yihwang-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39259 [ run ] triggered by Bot. Commit: a5b4a8b Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39259 [ run ] completed with state FAILURE. Commit: a5b4a8b
/LLM/main/L0_MergeRequest_PR pipeline #30510 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@yihwang-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39277 [ run ] triggered by Bot. Commit: a5b4a8b Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39277 [ run ] completed with state FAILURE. Commit: a5b4a8b
/LLM/main/L0_MergeRequest_PR pipeline #30524 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@yihwang-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39458 [ run ] triggered by Bot. Commit: a5b4a8b Link to invocation

@yihwang-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39458 [ run ] completed with state FAILURE. Commit: a5b4a8b
/LLM/main/L0_MergeRequest_PR pipeline #30684 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39469 [ run ] triggered by Bot. Commit: 0749891 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39469 [ run ] completed with state SUCCESS. Commit: 0749891
/LLM/main/L0_MergeRequest_PR pipeline #30693 completed with status: 'SUCCESS'
Pipeline passed after automatic test retries; check the rerun report for details.

CI Report

Link to invocation

@yihwang-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39528 [ run ] triggered by Bot. Commit: 0749891 Link to invocation

@yihwang-nv yihwang-nv requested a review from yuxianq March 19, 2026 03:56
@tensorrt-cicd
Collaborator

PR_Github #39528 [ run ] completed with state SUCCESS. Commit: 0749891
/LLM/main/L0_MergeRequest_PR pipeline #30749 completed with status: 'SUCCESS'

CI Report

Link to invocation

Signed-off-by: Yihan Wang <yihwang@nvidia.com>
@yihwang-nv
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39538 [ run ] triggered by Bot. Commit: 12f865d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39538 [ run ] completed with state SUCCESS. Commit: 12f865d
/LLM/main/L0_MergeRequest_PR pipeline #30756 completed with status: 'SUCCESS'

CI Report

Link to invocation

@yihwang-nv yihwang-nv merged commit db92761 into NVIDIA:main Mar 19, 2026
5 checks passed
