
[None][fix] Replace assertions with warnings for unsupported logits/logprobs in speculative sampler#12547

Merged
pengbowang-nv merged 1 commit into NVIDIA:main from yifjiang:fix/spec-sampler-no-crash-v2 on Apr 4, 2026

Conversation

@yifjiang
Contributor

@yifjiang yifjiang commented Mar 25, 2026

Summary

  • When return_context_logits, return_generation_logits, or return_log_probs is requested with speculative decoding, the server crashes with an AssertionError in spec_sampler_base.py.
  • Replace these fatal assertions with logger.warning() so the server stays alive. The request completes normally — the unsupported fields are simply not populated.
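The shape of the change can be sketched as follows. This is a minimal, self-contained illustration using Python's standard `logging` module and a stand-in request object — the real code lives in `spec_sampler_base.py` and uses the TensorRT-LLM logger and `LlmRequest`; `DummyRequest` and `handle_request` are hypothetical names for this sketch only:

```python
import logging

logger = logging.getLogger("spec_sampler")

class DummyRequest:
    """Stand-in for LlmRequest with only the fields this sketch needs."""
    def __init__(self, request_id, return_log_probs=False):
        self.py_request_id = request_id
        self.py_return_log_probs = return_log_probs

def handle_request(request):
    # Before the fix: `assert not request.py_return_log_probs` would raise
    # an AssertionError and take down the engine.
    # After the fix: warn and continue; the logprobs field is simply
    # not populated in the response.
    if request.py_return_log_probs:
        logger.warning(
            "return_log_probs is not supported with speculative decoding; "
            "skipping for request %s", request.py_request_id)
    return {"request_id": request.py_request_id, "status": "completed"}
```

With this pattern, a request carrying `return_log_probs=True` logs a warning and still completes, rather than killing the server.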

Background

We encountered this crash in production on build.nvidia.com when serving Qwen3-Coder-480B with MTP speculative decoding. Some API clients send logprobs=True in their requests, which triggers the assertion and kills the engine. Repeated assertion failures may also cause resource leakage (unreleased KV cache blocks, dangling request state) in the serving integration layer before the crash.

After deploying this fix, the server handles logprobs=True requests gracefully — it returns a response without the logprobs field and logs a warning, instead of crashing.

History

These assertions have been present since April 2025 (PR #3221), originally in the individual MTP/Eagle sampler files. They were consolidated into the shared SpecSamplerBase class in March 2026 (PR #11434).

Test plan

  • Deployed in production on build.nvidia.com — no more crashes from logprobs requests
  • Send a request with logprobs=True to a model served with speculative decoding (e.g. MTP) — verify the server logs a warning instead of crashing
  • Verify the response completes successfully (without logprobs)
  • Existing speculative decoding tests pass

🤖 Generated with Claude Code

@yifjiang yifjiang requested a review from a team as a code owner March 25, 2026 17:07
@yifjiang yifjiang requested a review from ziyixiong-nv March 25, 2026 17:07
@svc-trtllm-gh-bot svc-trtllm-gh-bot added the "Community want to contribute" label (PRs initiated from Community) Mar 25, 2026
@yifjiang yifjiang force-pushed the fix/spec-sampler-no-crash-v2 branch from ad27254 to 6a8abf9 Compare April 2, 2026 04:37
@pengbowang-nv
Collaborator

/bot run --disable-fail-fast

@coderabbitai
Contributor

coderabbitai bot commented Apr 2, 2026

📝 Walkthrough

Replace three assertion-based validation checks with conditional logic that logs warnings and skips unsupported py_return_* options per request ID instead of hard-failing the request.

Changes

Cohort / File(s): Graceful error handling — tensorrt_llm/_torch/speculative/spec_sampler_base.py
Summary: Updated _request_common_handling to replace three assertions for py_return_context_logits, py_return_generation_logits, and py_return_log_probs with conditional checks that log warnings and skip unsupported options rather than fail. Added logger import.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3 passed
  • Title check — ✅ Passed: The title accurately describes the main change: replacing assertions with warnings for unsupported logits/logprobs handling in the speculative sampler.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, above the required threshold of 80.00%.
  • Description check — ✅ Passed: The PR description is comprehensive and well-structured, clearly explaining the problem, solution, background, and test coverage.




Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
tensorrt_llm/_torch/speculative/spec_sampler_base.py (1)

147-158: Good fix — consider throttling warnings to avoid log noise.

The change correctly prevents server crashes by converting assertions to warnings. However, _request_common_handling is called on every decoding iteration (line 196), which means the same warning will be logged repeatedly for the same request throughout its entire generation lifecycle.

Consider logging only on the first iteration:

♻️ Optional: Log warning only once per request
     def _request_common_handling(
         self,
         request: LlmRequest,
         next_draft_tokens: list[list[int]],
         runtime_draft_len: Optional[int],
     ) -> None:
         """Common handling for both context and generation requests."""
-        if request.py_return_context_logits:
-            logger.warning(
-                "return_context_logits not supported with speculative decoding, "
-                "skipping for request %s", request.py_request_id)
-        if request.py_return_generation_logits:
-            logger.warning(
-                "return_generation_logits not supported with speculative decoding, "
-                "skipping for request %s", request.py_request_id)
-        if request.py_return_log_probs:
-            logger.warning(
-                "return_log_probs not supported with speculative decoding, "
-                "skipping for request %s", request.py_request_id)
+        if request.py_decoding_iter == 0:
+            if request.py_return_context_logits:
+                logger.warning(
+                    "return_context_logits not supported with speculative decoding, "
+                    "skipping for request %s", request.py_request_id)
+            if request.py_return_generation_logits:
+                logger.warning(
+                    "return_generation_logits not supported with speculative decoding, "
+                    "skipping for request %s", request.py_request_id)
+            if request.py_return_log_probs:
+                logger.warning(
+                    "return_log_probs not supported with speculative decoding, "
+                    "skipping for request %s", request.py_request_id)
         request.py_draft_tokens = next_draft_tokens[request.py_seq_slot][:runtime_draft_len]
         request.py_decoding_iter += 1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/speculative/spec_sampler_base.py` around lines 147 - 158,
The warnings in _request_common_handling about request.py_return_context_logits,
request.py_return_generation_logits, and request.py_return_log_probs are emitted
every decode iteration and should be throttled; modify _request_common_handling
(or the SpecSamplerBase instance) to record that a given request (use
request.py_request_id or attach a bool like request._spec_warnings_logged) has
already had its warnings emitted and only log them the first time, e.g., check
the flag/set before calling logger.warning and set it after the first warning so
subsequent calls skip logging.
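The once-per-request throttling suggested above could be implemented along these lines. This is a hedged sketch, not the merged code: `warn_unsupported_once` and the `_spec_warnings_logged` flag are hypothetical names (the flag is the one proposed in the prompt), and `DummyRequest` stands in for `LlmRequest`:

```python
import logging

logger = logging.getLogger("spec_sampler")

class DummyRequest:
    """Stand-in for LlmRequest; only the fields the sketch touches."""
    def __init__(self, request_id):
        self.py_request_id = request_id
        self.py_return_log_probs = True

def warn_unsupported_once(request):
    # Emit the unsupported-option warning only the first time this
    # request is handled, using a per-request boolean flag.
    if getattr(request, "_spec_warnings_logged", False):
        return False  # already warned for this request; skip logging
    request._spec_warnings_logged = True
    if request.py_return_log_probs:
        logger.warning(
            "return_log_probs is not supported with speculative decoding; "
            "skipping for request %s", request.py_request_id)
    return True
```

Because the flag lives on the request object itself, no sampler-side bookkeeping (e.g. a set of request IDs that must be cleaned up when requests finish) is needed.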

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c6530550-0dd3-4eed-b35e-f4b6d9ef53ca

📥 Commits

Reviewing files that changed from the base of the PR and between e92ee4f and 6a8abf9.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/speculative/spec_sampler_base.py

@tensorrt-cicd
Collaborator

PR_Github #41339 [ run ] triggered by Bot. Commit: 6a8abf9 Link to invocation

@yifjiang yifjiang force-pushed the fix/spec-sampler-no-crash-v2 branch from 6a8abf9 to 3710c3b Compare April 2, 2026 06:26
@pengbowang-nv
Collaborator

/bot run --disable-fail-fast

@yifjiang yifjiang force-pushed the fix/spec-sampler-no-crash-v2 branch from 3710c3b to a21787d Compare April 2, 2026 06:29
@pengbowang-nv
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #41358 [ run ] triggered by Bot. Commit: a21787d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41359 [ run ] triggered by Bot. Commit: a21787d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41358 [ run ] completed with state ABORTED. Commit: a21787d

Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41359 [ run ] completed with state SUCCESS. Commit: a21787d
/LLM/main/L0_MergeRequest_PR pipeline #32303 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@pengbowang-nv pengbowang-nv force-pushed the fix/spec-sampler-no-crash-v2 branch from a21787d to c556c54 Compare April 2, 2026 15:20
@pengbowang-nv
Collaborator

/bot run --disable-fail-fast

@pengbowang-nv pengbowang-nv enabled auto-merge (squash) April 2, 2026 15:21
@tensorrt-cicd
Collaborator

PR_Github #41450 [ run ] triggered by Bot. Commit: c556c54 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41450 [ run ] completed with state SUCCESS. Commit: c556c54
/LLM/main/L0_MergeRequest_PR pipeline #32381 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@pengbowang-nv
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #41542 [ run ] triggered by Bot. Commit: c556c54 Link to invocation

@pengbowang-nv
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #41544 [ run ] triggered by Bot. Commit: c556c54 Link to invocation

auto-merge was automatically disabled April 3, 2026 03:30

Head branch was pushed to by a user without write access

@yifjiang yifjiang force-pushed the fix/spec-sampler-no-crash-v2 branch from c556c54 to 2fc537c Compare April 3, 2026 03:30
@pengbowang-nv
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #41592 [ run ] triggered by Bot. Commit: 2fc537c Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41544 [ run ] completed with state ABORTED. Commit: c556c54
/LLM/main/L0_MergeRequest_PR pipeline #32458 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@pengbowang-nv
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #41603 [ run ] triggered by Bot. Commit: 2fc537c Link to invocation

@pengbowang-nv pengbowang-nv enabled auto-merge (squash) April 3, 2026 07:34
@pengbowang-nv
Collaborator

/bot kill

@tensorrt-cicd
Collaborator

PR_Github #41647 [ kill ] triggered by Bot. Commit: 2fc537c Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41647 [ kill ] completed with state SUCCESS. Commit: 2fc537c
Successfully killed previous jobs for commit 2fc537c

Link to invocation

…ogprobs in speculative sampler

When return_context_logits, return_generation_logits, or
return_log_probs is requested with speculative decoding, the server
crashes with an AssertionError. Replace these assertions with warnings
so the server stays alive and the request completes without the
unsupported fields.

Signed-off-by: yifjiang <19356972+yifjiang@users.noreply.github.com>
@pengbowang-nv pengbowang-nv force-pushed the fix/spec-sampler-no-crash-v2 branch from 2fc537c to 8cbc209 Compare April 3, 2026 16:21
@pengbowang-nv
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #41704 [ run ] triggered by Bot. Commit: 8cbc209 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41704 [ run ] completed with state SUCCESS. Commit: 8cbc209
/LLM/main/L0_MergeRequest_PR pipeline #32606 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@pengbowang-nv
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #41785 [ run ] triggered by Bot. Commit: 8cbc209 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41785 [ run ] completed with state SUCCESS. Commit: 8cbc209
/LLM/main/L0_MergeRequest_PR pipeline #32679 completed with status: 'SUCCESS'

CI Report

Link to invocation

@pengbowang-nv pengbowang-nv merged commit 9ab5cef into NVIDIA:main Apr 4, 2026
5 checks passed
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request Apr 7, 2026
…ogprobs in speculative sampler (NVIDIA#12547)

With this change, the server returns a response without the logprobs/logits fields populated — the request completes normally, just without the unsupported data. This avoids repeated assertion errors crashing Dynamo.
Signed-off-by: yifjiang <19356972+yifjiang@users.noreply.github.com>
karen-sy pushed a commit to karen-sy/TensorRT-LLM that referenced this pull request Apr 7, 2026
…ogprobs in speculative sampler (NVIDIA#12547)

With this change, the server returns a response without the logprobs/logits fields populated — the request completes normally, just without the unsupported data. This avoids repeated assertion errors crashing Dynamo.
Signed-off-by: yifjiang <19356972+yifjiang@users.noreply.github.com>

Labels

Community want to contribute — PRs initiated from Community


5 participants