[https://nvbugs/5615248][perf] Early emission of first token with overlap scheduling by brb-nv · Pull Request #14061 · NVIDIA/TensorRT-LLM

brb-nv · 2026-05-13T00:30:42Z

Description

This MR moves enqueue_responses() for requests that have just generated their first token ahead of sample_async when overlap scheduling is enabled, so the first token is returned earlier. This is particularly useful for small language models with tight time-to-first-token (TTFT) expectations.

It's experimental and opt-in, defaulting to false.
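
As a rough illustration of the reordering described above, here is a minimal sketch. It is not the PyExecutor code: `iteration_events` and the event strings are hypothetical, and the parameter name is only borrowed from the later commit message "rename to enable_early_first_token_response". It records where the first-token enqueue would fall relative to sampling in one toy overlap-loop iteration.

```python
from typing import List

# Hypothetical event labels for one iteration of a toy overlap loop.
FORWARD = "forward(current batch)"
EMIT = "enqueue first-token responses(previous batch)"
SAMPLE = "sample_async(current batch)"


def iteration_events(enable_early_first_token_response: bool = False) -> List[str]:
    """Return the order of events in one toy overlap-loop iteration."""
    events = [FORWARD]
    if enable_early_first_token_response:
        # Opt-in path: first-token responses go out before sampling starts.
        events.append(EMIT)
    events.append(SAMPLE)
    if not enable_early_first_token_response:
        # Default path: responses are enqueued only after sampling.
        events.append(EMIT)
    return events
```

With the flag left at its default, the enqueue comes after `sample_async`; with it enabled, the same enqueue happens first, and nothing else in the iteration changes.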

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Summary by CodeRabbit

New Features
- Enabled early emission of first-token responses in overlap scheduling mode for improved Time-To-First-Token (TTFT) performance.
- Enhanced streaming response behavior to maintain parity across scheduling modes.
Bug Fixes
- Prevented duplicate first-token response emission during request processing.
Tests
- Added comprehensive streaming validation tests for overlap scheduler with various configurations.

coderabbitai · 2026-05-13T00:36:06Z

📝 Walkthrough

This PR implements early first-token response emission in the overlap scheduler to reduce time-to-first-token, adds comprehensive shutdown debug instrumentation across executor components, and validates the feature with streaming tests.

Changes

Early First-Token Response Emission for TTFT Optimization

| Layer / File(s) | Summary |
| --- | --- |
| Early response emission and deduplication `tensorrt_llm/_torch/pyexecutor/py_executor.py` | Adds `_emit_first_token_responses()` to create and enqueue first-token responses early after KV transfer, before sampling. Updates `_handle_responses()` to accept `emit_first_iter` flag to prevent duplicate response creation. `_executor_loop_overlap` calls the new method after `_send_kv_async()` for previous batches and enqueues empty responses when skipping previous-batch processing to maintain collective symmetry. |

Shutdown Debug Instrumentation

| Layer / File(s) | Summary |
| --- | --- |
| Shutdown logging infrastructure `tensorrt_llm/_torch/pyexecutor/py_executor.py`, `tensorrt_llm/_torch/pyexecutor/executor_request_queue.py`, `tensorrt_llm/executor/proxy.py`, `tensorrt_llm/executor/worker.py` | Adds module-level `_shutdown_log()` helper in each file that logs via project logger with `sys.stderr` fallback for interpreter teardown robustness. |
| PyExecutor lifecycle tracing `tensorrt_llm/_torch/pyexecutor/py_executor.py` | Instruments `_event_loop_wrapper` (entry, return, exception, finally cleanup), `_executor_loop_cleanup` (entry/exit), main `shutdown()` method (entry, enqueue/wait phases, join, exit), and both `_executor_loop` and `_executor_loop_overlap` (entry, sentinel break) with shutdown logs. Logs shutdown-state counters in `_prepare_and_schedule_batch` and SHUTDOWN_REQUEST processing. |
| Request queue and orchestration shutdown `tensorrt_llm/_torch/pyexecutor/executor_request_queue.py`, `tensorrt_llm/executor/proxy.py` | `ExecutorRequestQueue.enqueue_shutdown_request()` logs entry/exit. `GenerationExecutorProxy.pre_shutdown()` and `shutdown()` log entry/exit, early-return cases, sentinel enqueue decision (SENT vs SKIPPED), per-MPI-future status and exceptions, and MPI session shutdown sequence. |
| Worker shutdown tracing `tensorrt_llm/executor/worker.py` | Instruments `GenerationExecutorWorker.shutdown()` (entry, early return, thread/engine operations, exit) and `worker_main()` entry, request-queue sentinel receipt, and exception paths. `notify_proxy_threads_to_quit()` logs queue type and count. |
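
The "project logger with `sys.stderr` fallback" pattern mentioned in the first row could look roughly like the sketch below. This is an assumption about the shape of such a helper, not the code added in the PR: `shutdown_log`, the `shutdown_sketch` logger name, and the `[shutdown]` prefix are all made up for illustration.

```python
import logging
import sys

# Stand-in for the project logger; the real helper uses the project's own.
_logger = logging.getLogger("shutdown_sketch")


def shutdown_log(msg: str, logger=_logger) -> None:
    """Log a shutdown message, falling back to stderr if the logger fails."""
    try:
        logger.info("[shutdown] %s", msg)
    except Exception:
        # Late in interpreter teardown the logger may no longer be usable;
        # if the call raises for any reason, write straight to stderr.
        try:
            sys.stderr.write(f"[shutdown] {msg}\n")
            sys.stderr.flush()
        except Exception:
            pass  # nothing else can safely be done this late
```

The nested `try` keeps the helper from raising during shutdown, which matters because an exception in a logging call should not mask the shutdown path it is meant to trace.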

Overlap Scheduler Streaming Tests

| Layer / File(s) | Summary |
| --- | --- |
| Streaming test infrastructure `tests/unittest/_torch/executor/test_overlap_scheduler.py` | Extends `create_llm()` to accept `stream_interval` parameter; adds `_collect_streaming_chunks()` helper to gather per-request streamed chunks. |
| Chunk parity and single-token validation `tests/unittest/_torch/executor/test_overlap_scheduler.py` | Introduces `test_overlap_scheduler_streaming_chunk_parity()` comparing chunks between overlap-enabled and disabled across `stream_interval` values, and `test_overlap_scheduler_streaming_single_token()` validating single-token streaming behavior when overlap is enabled. |
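
To make the idea of a chunk-parity check concrete, here is a small self-contained sketch. It does not use the TensorRT-LLM API or the PR's test helpers: `collect_streaming_chunks` and `chunk_mismatches` are hypothetical names, and each stream is modelled as a plain iterable of token-id lists.

```python
from typing import Dict, Iterable, List

Chunk = List[int]  # token ids carried by one streamed chunk


def collect_streaming_chunks(
        streams: Dict[int, Iterable[Chunk]]) -> Dict[int, List[Chunk]]:
    """Drain each request's stream into a list of chunks, keyed by request id."""
    return {
        req_id: [list(chunk) for chunk in stream]
        for req_id, stream in streams.items()
    }


def chunk_mismatches(ref: Dict[int, List[Chunk]],
                     other: Dict[int, List[Chunk]]) -> List[int]:
    """Return sorted request ids whose chunk sequences differ between two runs."""
    all_ids = ref.keys() | other.keys()
    return sorted(r for r in all_ids if ref.get(r) != other.get(r))
```

A parity test would collect chunks once with overlap scheduling disabled and once with it enabled, then assert that `chunk_mismatches` returns an empty list. Comparing whole chunk sequences, rather than only the final text, is what catches a duplicated or missing first-token chunk.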

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ⚠️ Warning | PR description is incomplete; missing Test Coverage section and several important PR checklist items are incomplete or unchecked. | Complete the Test Coverage section listing relevant test cases. Check off completed checklist items and clarify API compatibility, dependency scanning, CODEOWNERS updates, and documentation status. Include the PR type (e.g., [feat], [fix]) and ticket ID in the title. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title accurately describes the main change: early emission of first tokens with overlap scheduling to reduce TTFT latency. |

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 4091-4096: The bug is that create_response(...) is called before
perf bookkeeping so early-final responses miss the current step's metrics;
update the perf bookkeeping (append to time_breakdown_metrics and update C++
perf/timing fields) for the request before calling request.create_response(...)
in the early-emission branch, or instead skip creating an early response and let
the finished request continue through _handle_responses(...) so it receives the
same metric updates; modify the branch where py_first_token_response_sent is set
(and where new_responses is appended with (request.py_request_id, response) and
response.result.cached_tokens is assigned) to run the same perf-update logic
used in _handle_responses() (or route to that function) so timing/last-token
metrics are populated.
- Around line 4159-4166: The current logic clears
request.py_first_token_response_sent and skips creating a final response
unconditionally, which drops terminal/cancel responses when a request becomes
finished after _emit_first_token_responses(); change it so the skip only applies
if the request is still non-terminal: when request.py_first_token_response_sent
is true, check request.is_finished — if finished, do not suppress building the
response (set request_done = True and allow downstream _handle_responses() to
create the terminal/cancel response); if not finished, clear the flag and set
request_done accordingly to skip creating a duplicate first-token response. This
involves editing the branch handling request.py_first_token_response_sent in the
same block that references request.py_decoding_iter and request.is_finished so
only non-terminal requests get the early-skip behavior.
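
The second finding above boils down to a small decision rule, sketched here under stated assumptions. `ToyRequest` and `should_skip_response` are hypothetical stand-ins, not names from `py_executor.py`, and the sketch mirrors only the reviewer's suggestion for that finding; it says nothing about the perf bookkeeping raised in the first one, nor about how the merged code finally handles either.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a request tracked by the executor."""
    first_token_response_sent: bool = False  # set by the early-emission path
    is_finished: bool = False


def should_skip_response(request: ToyRequest) -> bool:
    """Decide whether the regular response path should skip this request.

    Only a still-running request whose first token was already emitted early
    is skipped; a finished request still gets its terminal response.
    """
    if not request.first_token_response_sent:
        return False
    # Assumption of this sketch: the flag is one-shot and cleared on read.
    request.first_token_response_sent = False
    return not request.is_finished
```

The point of the guard is that a request which finishes (or is cancelled) right after early emission still reaches the client with a final response instead of being silently dropped.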

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8caaf8b9-64e1-41d9-9c28-2b594bc3a1bf

📥 Commits

Reviewing files that changed from the base of the PR and between 2225b7f and 8c9b7cb.

📒 Files selected for processing (2)

tensorrt_llm/_torch/pyexecutor/llm_request.py
tensorrt_llm/_torch/pyexecutor/py_executor.py

brb-nv · 2026-05-13T14:51:44Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-13T14:57:14Z

PR_Github #48194 [ run ] triggered by Bot. Commit: d31a112 Link to invocation

achartier

LGTM

brb-nv · 2026-05-14T02:40:14Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-14T02:46:22Z

PR_Github #48278 [ run ] triggered by Bot. Commit: 8b97604 Link to invocation

tensorrt-cicd · 2026-05-14T16:43:21Z

PR_Github #48278 [ run ] completed with state SUCCESS. Commit: 8b97604
/LLM/main/L0_MergeRequest_PR pipeline #38090 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

brb-nv · 2026-05-14T16:55:57Z

/bot run --disable-fail-fast

brb-nv · 2026-05-14T17:49:13Z

/bot run --stage-list "DGX_B200-PyTorch-4" --disable-fail-fast

tensorrt-cicd · 2026-05-14T17:55:43Z

PR_Github #48412 [ run ] triggered by Bot. Commit: edb8118 Link to invocation

tensorrt-cicd · 2026-05-14T21:18:10Z

PR_Github #48412 [ run ] completed with state SUCCESS. Commit: edb8118
/LLM/main/L0_MergeRequest_PR pipeline #38213 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

brb-nv · 2026-05-14T21:37:44Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-14T21:45:32Z

PR_Github #48441 [ run ] triggered by Bot. Commit: edb8118 Link to invocation

brb-nv · 2026-05-15T03:10:17Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-15T03:17:30Z

PR_Github #48496 [ run ] triggered by Bot. Commit: f4027e5 Link to invocation

tensorrt-cicd · 2026-05-18T02:28:03Z

PR_Github #48809 [ run ] triggered by Bot. Commit: 98cf5ac Link to invocation

tensorrt-cicd · 2026-05-18T02:32:10Z

PR_Github #48792 [ run ] completed with state ABORTED. Commit: 7dbadb9

Link to invocation

brb-nv · 2026-05-18T02:37:40Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-18T02:43:24Z

PR_Github #48811 [ run ] triggered by Bot. Commit: 8012f83 Link to invocation

tensorrt-cicd · 2026-05-18T02:46:53Z

PR_Github #48809 [ run ] completed with state ABORTED. Commit: 98cf5ac

Link to invocation

…rlap scheduling Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

tensorrt-cicd · 2026-05-18T17:40:03Z

PR_Github #48811 [ run ] completed with state SUCCESS. Commit: 8012f83
/LLM/main/L0_MergeRequest_PR pipeline #38573 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

brb-nv · 2026-05-18T18:03:58Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-18T18:10:06Z

PR_Github #48955 [ run ] triggered by Bot. Commit: b02a69f Link to invocation

QiJune

LGTM

tensorrt-cicd · 2026-05-19T13:09:45Z

PR_Github #48955 [ run ] completed with state SUCCESS. Commit: b02a69f
/LLM/main/L0_MergeRequest_PR pipeline #38701 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

pcastonguay · 2026-05-19T14:00:43Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-19T14:06:23Z

PR_Github #49201 [ run ] triggered by Bot. Commit: b02a69f Link to invocation

tensorrt-cicd · 2026-05-20T01:04:46Z

PR_Github #49201 [ run ] completed with state FAILURE. Commit: b02a69f
/LLM/main/L0_MergeRequest_PR pipeline #38876 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

brb-nv · 2026-05-20T01:07:17Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-20T01:13:23Z

PR_Github #49294 [ run ] triggered by Bot. Commit: b02a69f Link to invocation

tensorrt-cicd · 2026-05-20T02:31:21Z

PR_Github #49294 [ run ] completed with state SUCCESS. Commit: b02a69f
/LLM/main/L0_MergeRequest_PR pipeline #38956 completed with status: 'SUCCESS'

CI Report

Link to invocation

brb-nv requested a review from a team as a code owner May 13, 2026 00:30

brb-nv requested a review from achartier May 13, 2026 00:30

github-actions Bot assigned brb-nv May 13, 2026

brb-nv requested a review from reasonsolo May 13, 2026 00:30

coderabbitai Bot reviewed May 13, 2026

Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py Outdated

Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py Outdated

brb-nv force-pushed the user/brb/early-first-token-emission branch from 8c9b7cb to 2cdd17b Compare May 13, 2026 00:40

brb-nv changed the title ~~[https://nvbugs/5615248][perf] Early emission of first-token with ovelap scheduling~~ [https://nvbugs/5615248][perf] Early emission of first-token with overlap scheduling May 13, 2026

brb-nv force-pushed the user/brb/early-first-token-emission branch from 7c601ae to d31a112 Compare May 13, 2026 01:05

reasonsolo approved these changes May 13, 2026

brb-nv force-pushed the user/brb/early-first-token-emission branch from d31a112 to 778a634 Compare May 13, 2026 18:19

achartier approved these changes May 13, 2026

brb-nv force-pushed the user/brb/early-first-token-emission branch from 778a634 to 8b97604 Compare May 14, 2026 02:18

brb-nv enabled auto-merge (squash) May 14, 2026 02:40

brb-nv force-pushed the user/brb/early-first-token-emission branch from 8b97604 to a83b767 Compare May 14, 2026 16:49

brb-nv force-pushed the user/brb/early-first-token-emission branch from a83b767 to edb8118 Compare May 14, 2026 17:47

Tabrizian approved these changes May 14, 2026

Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py Outdated

Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py Outdated

StanleySun639 approved these changes May 18, 2026

Comment thread tests/integration/test_lists/test-db/l0_h100.yml Outdated

brb-nv force-pushed the user/brb/early-first-token-emission branch from 98cf5ac to 8012f83 Compare May 18, 2026 02:36

brb-nv requested a review from a team as a code owner May 18, 2026 02:36

QiJune reviewed May 18, 2026

Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py Outdated

QiJune reviewed May 18, 2026

Comment thread tensorrt_llm/llmapi/llm_args.py Outdated

brb-nv force-pushed the user/brb/early-first-token-emission branch from 8012f83 to 30da359 Compare May 18, 2026 16:58

[https://nvbugs/5615248][perf] Early emission of first token with ove…

a48362d

…rlap scheduling Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

brb-nv force-pushed the user/brb/early-first-token-emission branch from 30da359 to a48362d Compare May 18, 2026 17:16

brb-nv added 2 commits May 18, 2026 17:49

rename to enable_early_first_token_response

9de7aac

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

update tests to cover returning logits

b02a69f

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

brb-nv requested a review from QiJune May 18, 2026 18:03

QiJune approved these changes May 19, 2026

brb-nv enabled auto-merge (squash) May 19, 2026 02:11

yingguo-trt approved these changes May 20, 2026

brb-nv merged commit cf87a8b into NVIDIA:main May 20, 2026
8 checks passed
