
[https://nvbugs/5615248][perf] Early emission of first token with overlap scheduling #14061

Merged
brb-nv merged 3 commits into NVIDIA:main from brb-nv:user/brb/early-first-token-emission
May 20, 2026


@brb-nv
Collaborator

@brb-nv brb-nv commented May 13, 2026

Description

With overlap scheduling enabled, this PR moves the enqueue_responses() call for requests that have just generated their first token ahead of sample_async, so the first-token response is emitted earlier. This is particularly useful for small language models with tight time-to-first-token (TTFT) expectations.

It's experimental and opt-in, defaulting to false.
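
The reordering can be pictured with a minimal, self-contained sketch. This is not the TensorRT-LLM code: `run_iteration`, the `early_first_token` flag name, and the event strings are hypothetical, chosen only to show where the first-token response sits relative to sampling.

```python
from typing import List


def run_iteration(prev_batch: List[str], early_first_token: bool) -> List[str]:
    """Return the order of events in one toy overlap-scheduler iteration.

    prev_batch holds IDs of requests whose first token was produced by the
    previous iteration. All names here are illustrative assumptions.
    """
    events: List[str] = []

    if early_first_token:
        # Opt-in path: first-token responses are enqueued before sampling
        # of the current batch is launched.
        events += [f"enqueue_response({rid})" for rid in prev_batch]

    events.append("sample_async(current batch)")

    if not early_first_token:
        # Default path: responses are enqueued after sampling was launched.
        events += [f"enqueue_response({rid})" for rid in prev_batch]

    return events


default_order = run_iteration(["req0"], early_first_token=False)
early_order = run_iteration(["req0"], early_first_token=True)
```

In this toy model the only difference between the two paths is the position of `enqueue_response(req0)` relative to `sample_async(current batch)`; everything else in the iteration is unchanged.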

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Summary by CodeRabbit

  • New Features

    • Enabled early emission of first-token responses in overlap scheduling mode for improved Time-To-First-Token (TTFT) performance.
    • Enhanced streaming response behavior to maintain parity across scheduling modes.
  • Bug Fixes

    • Prevented duplicate first-token response emission during request processing.
  • Tests

    • Added comprehensive streaming validation tests for overlap scheduler with various configurations.


@brb-nv brb-nv requested a review from a team as a code owner May 13, 2026 00:30
@brb-nv brb-nv requested a review from achartier May 13, 2026 00:30
@brb-nv brb-nv requested a review from reasonsolo May 13, 2026 00:30
@coderabbitai
Contributor

coderabbitai Bot commented May 13, 2026

📝 Walkthrough

This PR implements early first-token response emission in the overlap scheduler to reduce time-to-first-token, adds comprehensive shutdown debug instrumentation across executor components, and validates the feature with streaming tests.

Changes

Early First-Token Response Emission for TTFT Optimization

| Layer / File(s) | Summary |
| --- | --- |
| Early response emission and deduplication (tensorrt_llm/_torch/pyexecutor/py_executor.py) | Adds _emit_first_token_responses() to create and enqueue first-token responses early after KV transfer, before sampling. Updates _handle_responses() to accept emit_first_iter flag to prevent duplicate response creation. _executor_loop_overlap calls the new method after _send_kv_async() for previous batches and enqueues empty responses when skipping previous-batch processing to maintain collective symmetry. |
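
A rough sketch of why an empty response list is still enqueued, assuming "collective symmetry" means that every rank must take part in each cross-rank response gather. `ToyGather` is a hypothetical stand-in and not a TensorRT-LLM class; a real collective would block rather than raise.

```python
from typing import Dict, List, Tuple

Response = Tuple[int, str]  # (request_id, payload); hypothetical shape


class ToyGather:
    """Stand-in for a cross-rank gather: every rank contributes once per round."""

    def __init__(self, world_size: int) -> None:
        self.world_size = world_size
        self.pending: Dict[int, List[Response]] = {}

    def contribute(self, rank: int, responses: List[Response]) -> None:
        # A rank with nothing to report still contributes an empty list.
        self.pending[rank] = responses

    def complete(self) -> List[Response]:
        missing = set(range(self.world_size)) - set(self.pending)
        if missing:
            # A real collective would hang here; the toy raises instead.
            raise RuntimeError(f"ranks {sorted(missing)} never joined the gather")
        merged = [r for rank in sorted(self.pending) for r in self.pending[rank]]
        self.pending.clear()
        return merged


gather = ToyGather(world_size=2)
gather.contribute(0, [(7, "first token")])
gather.contribute(1, [])  # rank 1 skipped previous-batch work but still joins
merged = gather.complete()
```

If rank 1 skipped the call entirely instead of contributing an empty list, `complete()` would fail in this toy, which mirrors the hang a mismatched collective would cause.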

Shutdown Debug Instrumentation

| Layer / File(s) | Summary |
| --- | --- |
| Shutdown logging infrastructure (tensorrt_llm/_torch/pyexecutor/py_executor.py, tensorrt_llm/_torch/pyexecutor/executor_request_queue.py, tensorrt_llm/executor/proxy.py, tensorrt_llm/executor/worker.py) | Adds module-level _shutdown_log() helper in each file that logs via project logger with sys.stderr fallback for interpreter teardown robustness. |
| PyExecutor lifecycle tracing (tensorrt_llm/_torch/pyexecutor/py_executor.py) | Instruments _event_loop_wrapper (entry, return, exception, finally cleanup), _executor_loop_cleanup (entry/exit), main shutdown() method (entry, enqueue/wait phases, join, exit), and both _executor_loop and _executor_loop_overlap (entry, sentinel break) with shutdown logs. Logs shutdown-state counters in _prepare_and_schedule_batch and SHUTDOWN_REQUEST processing. |
| Request queue and orchestration shutdown (tensorrt_llm/_torch/pyexecutor/executor_request_queue.py, tensorrt_llm/executor/proxy.py) | ExecutorRequestQueue.enqueue_shutdown_request() logs entry/exit. GenerationExecutorProxy.pre_shutdown() and shutdown() log entry/exit, early-return cases, sentinel enqueue decision (SENT vs SKIPPED), per-MPI-future status and exceptions, and MPI session shutdown sequence. |
| Worker shutdown tracing (tensorrt_llm/executor/worker.py) | Instruments GenerationExecutorWorker.shutdown() (entry, early return, thread/engine operations, exit) and worker_main() entry, request-queue sentinel receipt, and exception paths. notify_proxy_threads_to_quit() logs queue type and count. |
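
The logger-with-stderr-fallback pattern described for the shutdown helper might look roughly like the sketch below. It uses the standard `logging` module as a stand-in for the project logger, and the `shutdown_log` name, its parameters, and the `[shutdown]` prefix are assumptions, not the PR's actual `_shutdown_log()` code.

```python
import logging
import sys

_logger = logging.getLogger("shutdown_demo")  # hypothetical logger name


def shutdown_log(message: str, logger=_logger, stream=None) -> None:
    """Log a shutdown message, falling back to a raw stream if logging fails."""
    try:
        logger.info("[shutdown] %s", message)
    except Exception:
        # During interpreter teardown the logging machinery may be unusable,
        # so write directly to stderr (or the injected stream) instead.
        target = stream if stream is not None else sys.stderr
        try:
            print(f"[shutdown] {message}", file=target, flush=True)
        except Exception:
            pass  # Nothing more can be done safely at this point.
```

The `logger` and `stream` parameters exist only so the fallback path can be exercised in isolation; a module-level helper would not necessarily need them.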

Overlap Scheduler Streaming Tests

Layer / File(s) Summary
| Layer / File(s) | Summary |
| --- | --- |
| Streaming test infrastructure (tests/unittest/_torch/executor/test_overlap_scheduler.py) | Extends create_llm() to accept stream_interval parameter; adds _collect_streaming_chunks() helper to gather per-request streamed chunks. |
| Chunk parity and single-token validation (tests/unittest/_torch/executor/test_overlap_scheduler.py) | Introduces test_overlap_scheduler_streaming_chunk_parity() comparing chunks between overlap-enabled and disabled across stream_interval values, and test_overlap_scheduler_streaming_single_token() validating single-token streaming behavior when overlap is enabled. |
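
The idea behind a chunk-parity check can be sketched without an engine. Everything below is hypothetical: `fake_stream` merely imitates a streaming backend that emits up to `stream_interval` tokens per chunk (an assumption about how the interval behaves), and `collect_streaming_chunks` is an illustrative helper, not the test file's `_collect_streaming_chunks()`.

```python
from typing import Dict, Iterator, List, Tuple

Chunk = Tuple[int, List[int]]  # (request_id, newly streamed token IDs)


def collect_streaming_chunks(stream: Iterator[Chunk]) -> Dict[int, List[List[int]]]:
    """Group streamed chunks by request, preserving per-request arrival order."""
    chunks: Dict[int, List[List[int]]] = {}
    for request_id, tokens in stream:
        chunks.setdefault(request_id, []).append(list(tokens))
    return chunks


def fake_stream(outputs: Dict[int, List[int]], stream_interval: int,
                interleave: bool = False) -> Iterator[Chunk]:
    """Yield each request's tokens in groups of stream_interval.

    interleave=True round-robins across requests, imitating a scheduler that
    delivers the same chunks in a different global order.
    """
    per_request = {
        rid: [toks[i:i + stream_interval] for i in range(0, len(toks), stream_interval)]
        for rid, toks in outputs.items()
    }
    if not interleave:
        for rid, chunks in per_request.items():
            for chunk in chunks:
                yield rid, chunk
        return
    depth = max((len(chunks) for chunks in per_request.values()), default=0)
    for step in range(depth):
        for rid, chunks in per_request.items():
            if step < len(chunks):
                yield rid, chunks[step]


outputs = {0: [1, 2, 3, 4, 5], 1: [9, 8]}
sequential = collect_streaming_chunks(fake_stream(outputs, stream_interval=2))
interleaved = collect_streaming_chunks(fake_stream(outputs, stream_interval=2, interleave=True))
```

Parity here means that the per-request chunk lists match even though the global arrival order differs between the two runs.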

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ⚠️ Warning | PR description is incomplete; missing Test Coverage section and several important PR checklist items are incomplete or unchecked. | Complete the Test Coverage section listing relevant test cases. Check off completed checklist items and clarify API compatibility, dependency scanning, CODEOWNERS updates, and documentation status. Include the PR type (e.g., [feat], [fix]) and ticket ID in the title. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title accurately describes the main change: early emission of first tokens with overlap scheduling to reduce TTFT latency. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 4091-4096: The bug is that create_response(...) is called before
perf bookkeeping so early-final responses miss the current step's metrics;
update the perf bookkeeping (append to time_breakdown_metrics and update C++
perf/timing fields) for the request before calling request.create_response(...)
in the early-emission branch, or instead skip creating an early response and let
the finished request continue through _handle_responses(...) so it receives the
same metric updates; modify the branch where py_first_token_response_sent is set
(and where new_responses is appended with (request.py_request_id, response) and
response.result.cached_tokens is assigned) to run the same perf-update logic
used in _handle_responses() (or route to that function) so timing/last-token
metrics are populated.
- Around line 4159-4166: The current logic clears
request.py_first_token_response_sent and skips creating a final response
unconditionally, which drops terminal/cancel responses when a request becomes
finished after _emit_first_token_responses(); change it so the skip only applies
if the request is still non-terminal: when request.py_first_token_response_sent
is true, check request.is_finished — if finished, do not suppress building the
response (set request_done = True and allow downstream _handle_responses() to
create the terminal/cancel response); if not finished, clear the flag and set
request_done accordingly to skip creating a duplicate first-token response. This
involves editing the branch handling request.py_first_token_response_sent in the
same block that references request.py_decoding_iter and request.is_finished so
only non-terminal requests get the early-skip behavior.
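
The second finding can be illustrated with a small sketch of the suggested guard. It models the reviewer's proposal only, not the merged code: `ToyRequest` and `build_response` are hypothetical, and clearing the flag in the finished case is an assumption the comment does not spell out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyRequest:
    request_id: int
    first_token_response_sent: bool = False  # set by the early-emission path
    is_finished: bool = False


def build_response(request: ToyRequest) -> Optional[str]:
    """Return the response to emit for this step, or None to skip it."""
    if request.first_token_response_sent:
        request.first_token_response_sent = False  # the flag is consumed once
        if not request.is_finished:
            # The early path already delivered this first token, so building
            # another response here would duplicate it.
            return None
        # Finished requests fall through: the terminal response must not be lost.
    return "final" if request.is_finished else "partial"


streaming = ToyRequest(request_id=1, first_token_response_sent=True)
finished = ToyRequest(request_id=2, first_token_response_sent=True, is_finished=True)
results = (build_response(streaming), build_response(streaming), build_response(finished))
```

The first call for the still-running request is suppressed, its next call produces a normal partial response, and the request that finished right after early emission still gets its terminal response.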

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8caaf8b9-64e1-41d9-9c28-2b594bc3a1bf

📥 Commits

Reviewing files that changed from the base of the PR and between 2225b7f and 8c9b7cb.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py

Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py Outdated
Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py Outdated
@brb-nv brb-nv force-pushed the user/brb/early-first-token-emission branch from 8c9b7cb to 2cdd17b Compare May 13, 2026 00:40
@brb-nv brb-nv changed the title [https://nvbugs/5615248][perf] Early emission of first-token with ovelap scheduling [https://nvbugs/5615248][perf] Early emission of first-token with overlap scheduling May 13, 2026
@brb-nv brb-nv force-pushed the user/brb/early-first-token-emission branch from 7c601ae to d31a112 Compare May 13, 2026 01:05
@brb-nv
Collaborator Author

brb-nv commented May 13, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48194 [ run ] triggered by Bot. Commit: d31a112 Link to invocation

@brb-nv brb-nv force-pushed the user/brb/early-first-token-emission branch from d31a112 to 778a634 Compare May 13, 2026 18:19
Collaborator

@achartier achartier left a comment


LGTM

@brb-nv brb-nv force-pushed the user/brb/early-first-token-emission branch from 778a634 to 8b97604 Compare May 14, 2026 02:18
@brb-nv
Collaborator Author

brb-nv commented May 14, 2026

/bot run --disable-fail-fast

@brb-nv brb-nv enabled auto-merge (squash) May 14, 2026 02:40
@tensorrt-cicd
Collaborator

PR_Github #48278 [ run ] triggered by Bot. Commit: 8b97604 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48278 [ run ] completed with state SUCCESS. Commit: 8b97604
/LLM/main/L0_MergeRequest_PR pipeline #38090 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@brb-nv brb-nv force-pushed the user/brb/early-first-token-emission branch from 8b97604 to a83b767 Compare May 14, 2026 16:49
@brb-nv
Collaborator Author

brb-nv commented May 14, 2026

/bot run --disable-fail-fast

@brb-nv brb-nv force-pushed the user/brb/early-first-token-emission branch from a83b767 to edb8118 Compare May 14, 2026 17:47
@brb-nv
Collaborator Author

brb-nv commented May 14, 2026

/bot run --stage-list "DGX_B200-PyTorch-4" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48412 [ run ] triggered by Bot. Commit: edb8118 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48412 [ run ] completed with state SUCCESS. Commit: edb8118
/LLM/main/L0_MergeRequest_PR pipeline #38213 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

@brb-nv
Collaborator Author

brb-nv commented May 14, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48441 [ run ] triggered by Bot. Commit: edb8118 Link to invocation

Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py Outdated
Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py Outdated
@brb-nv
Collaborator Author

brb-nv commented May 15, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48496 [ run ] triggered by Bot. Commit: f4027e5 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48809 [ run ] triggered by Bot. Commit: 98cf5ac Link to invocation

Comment thread tests/integration/test_lists/test-db/l0_h100.yml Outdated
@tensorrt-cicd
Collaborator

PR_Github #48792 [ run ] completed with state ABORTED. Commit: 7dbadb9

Link to invocation

@brb-nv brb-nv force-pushed the user/brb/early-first-token-emission branch from 98cf5ac to 8012f83 Compare May 18, 2026 02:36
@brb-nv brb-nv requested a review from a team as a code owner May 18, 2026 02:36
@brb-nv
Collaborator Author

brb-nv commented May 18, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48811 [ run ] triggered by Bot. Commit: 8012f83 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48809 [ run ] completed with state ABORTED. Commit: 98cf5ac

Link to invocation

Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py Outdated
Comment thread tensorrt_llm/llmapi/llm_args.py Outdated
@brb-nv brb-nv force-pushed the user/brb/early-first-token-emission branch from 8012f83 to 30da359 Compare May 18, 2026 16:58
…rlap scheduling

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
@brb-nv brb-nv force-pushed the user/brb/early-first-token-emission branch from 30da359 to a48362d Compare May 18, 2026 17:16
@tensorrt-cicd
Collaborator

PR_Github #48811 [ run ] completed with state SUCCESS. Commit: 8012f83
/LLM/main/L0_MergeRequest_PR pipeline #38573 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

brb-nv added 2 commits May 18, 2026 17:49
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
@brb-nv brb-nv requested a review from QiJune May 18, 2026 18:03
@brb-nv
Collaborator Author

brb-nv commented May 18, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #48955 [ run ] triggered by Bot. Commit: b02a69f Link to invocation

Collaborator

@QiJune QiJune left a comment


LGTM

@brb-nv brb-nv enabled auto-merge (squash) May 19, 2026 02:11
@tensorrt-cicd
Collaborator

PR_Github #48955 [ run ] completed with state SUCCESS. Commit: b02a69f
/LLM/main/L0_MergeRequest_PR pipeline #38701 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@pcastonguay
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #49201 [ run ] triggered by Bot. Commit: b02a69f Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #49201 [ run ] completed with state FAILURE. Commit: b02a69f
/LLM/main/L0_MergeRequest_PR pipeline #38876 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@brb-nv
Collaborator Author

brb-nv commented May 20, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #49294 [ run ] triggered by Bot. Commit: b02a69f Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #49294 [ run ] completed with state SUCCESS. Commit: b02a69f
/LLM/main/L0_MergeRequest_PR pipeline #38956 completed with status: 'SUCCESS'

CI Report

Link to invocation

@brb-nv brb-nv merged commit cf87a8b into NVIDIA:main May 20, 2026
8 checks passed

Labels

api-compatible Accepted LLM API contract change that is backwards-compatible


9 participants