[None][fix] Replace assertions with warnings for unsupported logits/logprobs in speculative sampler by yifjiang · Pull Request #12547 · NVIDIA/TensorRT-LLM

yifjiang · 2026-03-25T17:07:31Z

Summary

When return_context_logits, return_generation_logits, or return_log_probs is requested with speculative decoding, the server crashes with an AssertionError in spec_sampler_base.py.
Replace these fatal assertions with logger.warning() so the server stays alive. The request completes normally — the unsupported fields are simply not populated.
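
A minimal sketch of the assert-to-warning pattern described above. It is not the actual TensorRT-LLM code: `FakeRequest` and `handle_request` are hypothetical stand-ins, and the standard-library `logging` module is used in place of the project's logger. Only the `py_return_*` and `py_request_id` field names come from this PR.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("spec_sampler_sketch")


@dataclass
class FakeRequest:
    """Hypothetical stand-in for LlmRequest; only the fields named in the PR."""
    py_request_id: int
    py_return_context_logits: bool = False
    py_return_generation_logits: bool = False
    py_return_log_probs: bool = False


# User-facing option name -> request attribute, as named in the PR.
UNSUPPORTED_OPTIONS = {
    "return_context_logits": "py_return_context_logits",
    "return_generation_logits": "py_return_generation_logits",
    "return_log_probs": "py_return_log_probs",
}


def handle_request(request: FakeRequest) -> list[str]:
    """Warn about unsupported options instead of asserting; return the skipped names."""
    skipped = []
    for option, attr in UNSUPPORTED_OPTIONS.items():
        if getattr(request, attr):
            # An assert at this point would raise AssertionError and take the
            # whole server down; a warning lets the request continue.
            logger.warning(
                "%s not supported with speculative decoding, skipping for request %s",
                option, request.py_request_id)
            skipped.append(option)
    return skipped
```

Returning the skipped option names is a choice made for this sketch so the behaviour is easy to check; the real method returns nothing.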

Background

We encountered this crash in production on build.nvidia.com when serving Qwen3-Coder-480B with MTP speculative decoding. Some API clients send logprobs=True in their requests, which triggers the assertion and kills the engine. Repeated assertion failures may also cause resource leakage (unreleased KV cache blocks, dangling request state) in the serving integration layer before the crash.

After deploying this fix, the server handles logprobs=True requests gracefully — it returns a response without the logprobs field and logs a warning, instead of crashing.

History

These assertions have been present since April 2025 (PR #3221), originally in the individual MTP/Eagle sampler files. They were consolidated into the shared SpecSamplerBase class in March 2026 (PR #11434).

Test plan

- Deployed in production on build.nvidia.com: no more crashes from logprobs requests
- Send a request with logprobs=True to a model served with speculative decoding (e.g. MTP) and verify that the server logs a warning instead of crashing
- Verify the response completes successfully (without logprobs)
- Existing speculative decoding tests pass
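
The second and third checks in the test plan could be automated roughly as follows. This is a sketch under stated assumptions: `process` is a hypothetical handler that imitates the new behaviour, `ListHandler` and `run_check` are illustrative helpers, and the standard-library `logging` module stands in for the project's logger.

```python
import logging

logger = logging.getLogger("spec_sampler_test_sketch")


class ListHandler(logging.Handler):
    """Collects formatted log messages so a test can inspect them."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def process(request_id: int, return_log_probs: bool) -> dict:
    """Hypothetical request handler: warns and omits logprobs rather than asserting."""
    if return_log_probs:
        logger.warning(
            "return_log_probs not supported with speculative decoding, "
            "skipping for request %s", request_id)
    # The response completes, just without a logprobs field.
    return {"request_id": request_id, "text": "ok"}


def run_check() -> tuple[dict, list[str]]:
    """Run one request with logprobs requested and capture the warnings it logs."""
    handler = ListHandler()
    logger.addHandler(handler)
    try:
        response = process(request_id=1, return_log_probs=True)
    finally:
        logger.removeHandler(handler)
    return response, handler.messages
```

A test would then assert that `run_check()` returns a response with no `logprobs` key and exactly one captured warning.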

🤖 Generated with Claude Code

tensorrt_llm/_torch/speculative/spec_sampler_base.py

pengbowang-nv · 2026-04-02T04:39:54Z

/bot run --disable-fail-fast

coderabbitai · 2026-04-02T04:41:43Z

📝 Walkthrough


Replace three assertion-based validation checks with conditional logic that logs warnings and skips unsupported py_return_* options per request ID instead of hard-failing the request.

Changes

Cohort / File(s)	Summary
Graceful error handling `tensorrt_llm/_torch/speculative/spec_sampler_base.py`	Updated `_request_common_handling` to replace three assertions for `py_return_context_logits`, `py_return_generation_logits`, and `py_return_log_probs` with conditional checks that log warnings and skip unsupported options rather than fail. Added logger import.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: replacing assertions with warnings for unsupported logits/logprobs handling in the speculative sampler.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description check	✅ Passed	PR description is comprehensive and well-structured, clearly explaining the problem, solution, background, and test coverage.


coderabbitai

🧹 Nitpick comments (1)

tensorrt_llm/_torch/speculative/spec_sampler_base.py (1)

147-158: Good fix — consider throttling warnings to avoid log noise.

The change correctly prevents server crashes by converting assertions to warnings. However, _request_common_handling is called on every decoding iteration (line 196), which means the same warning will be logged repeatedly for the same request throughout its entire generation lifecycle.

Consider logging only on the first iteration:

♻️ Optional: Log warning only once per request

     def _request_common_handling(
         self,
         request: LlmRequest,
         next_draft_tokens: list[list[int]],
         runtime_draft_len: Optional[int],
     ) -> None:
         """Common handling for both context and generation requests."""
-        if request.py_return_context_logits:
-            logger.warning(
-                "return_context_logits not supported with speculative decoding, "
-                "skipping for request %s", request.py_request_id)
-        if request.py_return_generation_logits:
-            logger.warning(
-                "return_generation_logits not supported with speculative decoding, "
-                "skipping for request %s", request.py_request_id)
-        if request.py_return_log_probs:
-            logger.warning(
-                "return_log_probs not supported with speculative decoding, "
-                "skipping for request %s", request.py_request_id)
+        if request.py_decoding_iter == 0:
+            if request.py_return_context_logits:
+                logger.warning(
+                    "return_context_logits not supported with speculative decoding, "
+                    "skipping for request %s", request.py_request_id)
+            if request.py_return_generation_logits:
+                logger.warning(
+                    "return_generation_logits not supported with speculative decoding, "
+                    "skipping for request %s", request.py_request_id)
+            if request.py_return_log_probs:
+                logger.warning(
+                    "return_log_probs not supported with speculative decoding, "
+                    "skipping for request %s", request.py_request_id)
         request.py_draft_tokens = next_draft_tokens[request.py_seq_slot][:runtime_draft_len]
         request.py_decoding_iter += 1

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/speculative/spec_sampler_base.py` around lines 147 - 158,
The warnings in _request_common_handling about request.py_return_context_logits,
request.py_return_generation_logits, and request.py_return_log_probs are emitted
every decode iteration and should be throttled; modify _request_common_handling
(or the SpecSamplerBase instance) to record that a given request (use
request.py_request_id or attach a bool like request._spec_warnings_logged) has
already had its warnings emitted and only log them the first time, e.g., check
the flag/set before calling logger.warning and set it after the first warning so
subsequent calls skip logging.
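
The flag/set approach described in this prompt could look roughly like the sketch below. `WarnOncePerRequest` is a hypothetical helper, not part of TensorRT-LLM, and the standard-library `logging` module stands in for the project's logger. It differs from the `py_decoding_iter == 0` suggestion above by tracking state outside the request object.

```python
import logging

logger = logging.getLogger("spec_sampler_sketch")


class WarnOncePerRequest:
    """Hypothetical helper: remembers which (request, option) pairs were already warned about."""

    def __init__(self) -> None:
        self._seen: set[tuple[int, str]] = set()

    def warn(self, request_id: int, option: str) -> bool:
        """Log a warning the first time only; return True if a warning was emitted."""
        key = (request_id, option)
        if key in self._seen:
            return False
        self._seen.add(key)
        logger.warning(
            "%s not supported with speculative decoding, skipping for request %s",
            option, request_id)
        return True

    def forget(self, request_id: int) -> None:
        """Drop state for a finished request so the set does not grow without bound."""
        self._seen = {key for key in self._seen if key[0] != request_id}
```

A set kept on the sampler needs explicit cleanup when a request finishes (`forget` here), which is the main cost of this variant compared with checking a counter that already lives on the request.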


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c6530550-0dd3-4eed-b35e-f4b6d9ef53ca

📥 Commits

Reviewing files that changed from the base of the PR and between e92ee4f and 6a8abf9.

📒 Files selected for processing (1)

tensorrt_llm/_torch/speculative/spec_sampler_base.py

tensorrt-cicd · 2026-04-02T04:46:28Z

PR_Github #41339 [ run ] triggered by Bot. Commit: 6a8abf9 Link to invocation

pengbowang-nv · 2026-04-02T06:27:11Z

/bot run --disable-fail-fast

pengbowang-nv · 2026-04-02T06:30:34Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-02T06:33:06Z

PR_Github #41358 [ run ] triggered by Bot. Commit: a21787d Link to invocation

tensorrt-cicd · 2026-04-02T06:36:06Z

PR_Github #41359 [ run ] triggered by Bot. Commit: a21787d Link to invocation

tensorrt-cicd · 2026-04-02T06:36:08Z

PR_Github #41358 [ run ] completed with state ABORTED. Commit: a21787d

Link to invocation

tensorrt-cicd · 2026-04-02T12:18:27Z

PR_Github #41359 [ run ] completed with state SUCCESS. Commit: a21787d
/LLM/main/L0_MergeRequest_PR pipeline #32303 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

pengbowang-nv · 2026-04-02T15:20:20Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-02T15:26:17Z

PR_Github #41450 [ run ] triggered by Bot. Commit: c556c54 Link to invocation

tensorrt-cicd · 2026-04-02T20:38:39Z

PR_Github #41450 [ run ] completed with state SUCCESS. Commit: c556c54
/LLM/main/L0_MergeRequest_PR pipeline #32381 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

pengbowang-nv · 2026-04-03T02:00:44Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-03T02:06:11Z

PR_Github #41542 [ run ] triggered by Bot. Commit: c556c54 Link to invocation

pengbowang-nv · 2026-04-03T02:06:49Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-03T02:12:44Z

PR_Github #41544 [ run ] triggered by Bot. Commit: c556c54 Link to invocation

pengbowang-nv · 2026-04-03T04:57:29Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-03T05:02:45Z

PR_Github #41592 [ run ] triggered by Bot. Commit: 2fc537c Link to invocation

tensorrt-cicd · 2026-04-03T05:02:48Z

PR_Github #41544 [ run ] completed with state ABORTED. Commit: c556c54
/LLM/main/L0_MergeRequest_PR pipeline #32458 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

pengbowang-nv · 2026-04-03T06:03:56Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-03T06:10:40Z

PR_Github #41603 [ run ] triggered by Bot. Commit: 2fc537c Link to invocation

pengbowang-nv · 2026-04-03T09:51:34Z

/bot kill

tensorrt-cicd · 2026-04-03T09:57:59Z

PR_Github #41647 [ kill ] triggered by Bot. Commit: 2fc537c Link to invocation

tensorrt-cicd · 2026-04-03T09:58:42Z

PR_Github #41647 [ kill ] completed with state SUCCESS. Commit: 2fc537c
Successfully killed previous jobs for commit 2fc537c

Link to invocation

…ogprobs in speculative sampler When return_context_logits, return_generation_logits, or return_log_probs is requested with speculative decoding, the server crashes with an AssertionError. Replace these assertions with warnings so the server stays alive and the request completes without the unsupported fields. Signed-off-by: yifjiang <19356972+yifjiang@users.noreply.github.com>

pengbowang-nv · 2026-04-03T16:22:07Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-03T16:28:05Z

PR_Github #41704 [ run ] triggered by Bot. Commit: 8cbc209 Link to invocation

tensorrt-cicd · 2026-04-03T21:00:16Z

PR_Github #41704 [ run ] completed with state SUCCESS. Commit: 8cbc209
/LLM/main/L0_MergeRequest_PR pipeline #32606 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

pengbowang-nv · 2026-04-04T02:01:26Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-04T02:07:49Z

PR_Github #41785 [ run ] triggered by Bot. Commit: 8cbc209 Link to invocation

tensorrt-cicd · 2026-04-04T05:05:21Z

PR_Github #41785 [ run ] completed with state SUCCESS. Commit: 8cbc209
/LLM/main/L0_MergeRequest_PR pipeline #32679 completed with status: 'SUCCESS'

CI Report

Link to invocation

…ogprobs in speculative sampler (NVIDIA#12547) With this change, the server returns a response without the logprobs/logits fields populated; the request completes normally, just without the unsupported data. This prevents repeated assertion errors from crashing Dynamo. Signed-off-by: yifjiang <19356972+yifjiang@users.noreply.github.com>

yifjiang requested a review from a team as a code owner March 25, 2026 17:07

yifjiang requested a review from ziyixiong-nv March 25, 2026 17:07

svc-trtllm-gh-bot added the "Community want to contribute" label (PRs initiated from Community) Mar 25, 2026

mikeiovine approved these changes Mar 25, 2026


pengbowang-nv reviewed Mar 31, 2026


tensorrt_llm/_torch/speculative/spec_sampler_base.py (outdated)

pengbowang-nv approved these changes Apr 2, 2026


yifjiang force-pushed the fix/spec-sampler-no-crash-v2 branch from ad27254 to 6a8abf9 Compare April 2, 2026 04:37

coderabbitai bot reviewed Apr 2, 2026


yifjiang force-pushed the fix/spec-sampler-no-crash-v2 branch from 6a8abf9 to 3710c3b Compare April 2, 2026 06:26

yifjiang force-pushed the fix/spec-sampler-no-crash-v2 branch from 3710c3b to a21787d Compare April 2, 2026 06:29

pengbowang-nv force-pushed the fix/spec-sampler-no-crash-v2 branch from a21787d to c556c54 Compare April 2, 2026 15:20

pengbowang-nv enabled auto-merge (squash) April 2, 2026 15:21

auto-merge was automatically disabled April 3, 2026 03:30
Head branch was pushed to by a user without write access

yifjiang force-pushed the fix/spec-sampler-no-crash-v2 branch from c556c54 to 2fc537c Compare April 3, 2026 03:30

pengbowang-nv enabled auto-merge (squash) April 3, 2026 07:34

pengbowang-nv force-pushed the fix/spec-sampler-no-crash-v2 branch from 2fc537c to 8cbc209 Compare April 3, 2026 16:21

pengbowang-nv merged commit 9ab5cef into NVIDIA:main Apr 4, 2026
5 checks passed
