[None][feat] Enable speculative decoding in TrtllmGen attention backend #12267
Signed-off-by: Yihan Wang <yihwang@nvidia.com>
/bot run --disable-fail-fast
📝 Walkthrough
These changes enable the TRTLLM-Gen attention backend by default and remove the speculative decoding guard. Backend activation changes from an environment-variable toggle to a hard-coded value, and speculative decoding is now treated as a valid use case within the backend.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~8 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
🧹 Nitpick comments (1)
tensorrt_llm/_torch/attention_backend/trtllm.py (1)
33-35: Keep a default-on runtime kill switch instead of hard-coding `True`.
Line 35 removes the operational fallback path entirely. A safer option is to enable the backend by default while keeping an environment override, so regressions can be mitigated without code changes.
Proposed change
-# Enable TRTLLM-Gen attention backend via environment variable.
-# _TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION = os.environ.get(
-#     "TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION", "0") == "1"
-_TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION = True
+# Enable TRTLLM-Gen attention backend by default, with runtime override.
+_TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION = os.environ.get(
+    "TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION", "1"
+) == "1"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/attention_backend/trtllm.py` around lines 33 - 35, Restore a runtime kill-switch for the gen-attention flag instead of hardcoding True: set _TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION by reading an environment variable (e.g., os.environ.get("TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION", "1") == "1") so it is enabled by default but can be disabled at runtime; ensure os is imported and replace the hardcoded assignment to _TRTLLM_ENABLE_TRTLLM_GEN_ATTENTION with this env-driven expression.
📒 Files selected for processing (2)
- tensorrt_llm/_torch/attention_backend/trtllm.py
- tensorrt_llm/_torch/attention_backend/trtllm_gen.py
💤 Files with no reviewable changes (1)
- tensorrt_llm/_torch/attention_backend/trtllm_gen.py
PR_Github #39163 [ run ] triggered by Bot. Commit:
PR_Github #39163 [ run ] completed with state
/bot run --disable-fail-fast
/bot run --disable-fail-fast
PR_Github #39259 [ run ] triggered by Bot. Commit:
PR_Github #39259 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #39277 [ run ] triggered by Bot. Commit:
PR_Github #39277 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #39458 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast
PR_Github #39458 [ run ] completed with state
PR_Github #39469 [ run ] triggered by Bot. Commit:
PR_Github #39469 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #39528 [ run ] triggered by Bot. Commit:
PR_Github #39528 [ run ] completed with state
Signed-off-by: Yihan Wang <yihwang@nvidia.com>
/bot run --disable-fail-fast
PR_Github #39538 [ run ] triggered by Bot. Commit:
PR_Github #39538 [ run ] completed with state