[https://nvbugs/5615248][perf] Early emission of first token with overlap scheduling #14061
Walkthrough
This PR implements early first-token response emission in the overlap scheduler to reduce time-to-first-token, adds comprehensive shutdown debug instrumentation across executor components, and validates the feature with streaming tests.
Changes
Early First-Token Response Emission for TTFT Optimization
Shutdown Debug Instrumentation
Overlap Scheduler Streaming Tests
Review effort: 3 (Moderate), ~20 minutes
Pre-merge checks: 3 passed, 2 failed (2 warnings)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 4091-4096: The bug is that create_response(...) is called before
perf bookkeeping so early-final responses miss the current step's metrics;
update the perf bookkeeping (append to time_breakdown_metrics and update C++
perf/timing fields) for the request before calling request.create_response(...)
in the early-emission branch, or instead skip creating an early response and let
the finished request continue through _handle_responses(...) so it receives the
same metric updates; modify the branch where py_first_token_response_sent is set
(and where new_responses is appended with (request.py_request_id, response) and
response.result.cached_tokens is assigned) to run the same perf-update logic
used in _handle_responses() (or route to that function) so timing/last-token
metrics are populated.
- Around line 4159-4166: The current logic clears
request.py_first_token_response_sent and skips creating a final response
unconditionally, which drops terminal/cancel responses when a request becomes
finished after _emit_first_token_responses(); change it so the skip only applies
if the request is still non-terminal: when request.py_first_token_response_sent
is true, check request.is_finished — if finished, do not suppress building the
response (set request_done = True and allow downstream _handle_responses() to
create the terminal/cancel response); if not finished, clear the flag and set
request_done accordingly to skip creating a duplicate first-token response. This
involves editing the branch handling request.py_first_token_response_sent in the
same block that references request.py_decoding_iter and request.is_finished so
only non-terminal requests get the early-skip behavior.
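The two findings above can be sketched in isolation. This is a minimal illustration, not the executor's code: `SketchRequest`, `emit_first_token_response`, `should_skip_duplicate`, and the placement of `time_breakdown_metrics` on the request are all hypothetical; only the attribute names `py_request_id`, `is_finished`, and `py_first_token_response_sent` come from the review text.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SketchRequest:
    """Hypothetical stand-in for the executor's request object."""
    py_request_id: int
    is_finished: bool = False
    py_first_token_response_sent: bool = False
    # Assumption: metrics live on the request; the real code may store them elsewhere.
    time_breakdown_metrics: List[str] = field(default_factory=list)

    def create_response(self) -> dict:
        # Snapshot whatever metrics exist at the moment the response is built.
        return {"id": self.py_request_id,
                "metrics": list(self.time_breakdown_metrics)}


def emit_first_token_response(request: SketchRequest, step_metric: str,
                              new_responses: List[Tuple[int, dict]]) -> None:
    """First finding: record the step's metrics before building the response."""
    request.time_breakdown_metrics.append(step_metric)
    response = request.create_response()
    request.py_first_token_response_sent = True
    new_responses.append((request.py_request_id, response))


def should_skip_duplicate(request: SketchRequest) -> bool:
    """Second finding: skip the regular response only for a non-terminal request."""
    if not request.py_first_token_response_sent:
        return False
    if request.is_finished:
        # A terminal or cancel response must still be produced downstream.
        return False
    request.py_first_token_response_sent = False
    return True
```

With this ordering the early response already carries the current step's metric, and a request that finishes after early emission is never suppressed.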
📒 Files selected for processing (2)
tensorrt_llm/_torch/pyexecutor/llm_request.py
tensorrt_llm/_torch/pyexecutor/py_executor.py
8c9b7cb to 2cdd17b
7c601ae to d31a112
/bot run --disable-fail-fast
PR_Github #48194 [ run ] triggered by Bot. Commit:
d31a112 to 778a634
778a634 to 8b97604
/bot run --disable-fail-fast
PR_Github #48278 [ run ] triggered by Bot. Commit:
PR_Github #48278 [ run ] completed with state
8b97604 to a83b767
/bot run --disable-fail-fast
a83b767 to edb8118
/bot run --stage-list "DGX_B200-PyTorch-4" --disable-fail-fast
PR_Github #48412 [ run ] triggered by Bot. Commit:
PR_Github #48412 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #48441 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast
PR_Github #48496 [ run ] triggered by Bot. Commit:
PR_Github #48809 [ run ] triggered by Bot. Commit:
PR_Github #48792 [ run ] completed with state
98cf5ac to 8012f83
/bot run --disable-fail-fast
PR_Github #48811 [ run ] triggered by Bot. Commit:
PR_Github #48809 [ run ] completed with state
8012f83 to 30da359
…rlap scheduling Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
30da359 to a48362d
PR_Github #48811 [ run ] completed with state
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
/bot run --disable-fail-fast
PR_Github #48955 [ run ] triggered by Bot. Commit:
PR_Github #48955 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #49201 [ run ] triggered by Bot. Commit:
PR_Github #49201 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #49294 [ run ] triggered by Bot. Commit:
PR_Github #49294 [ run ] completed with state
Description
This MR moves `enqueue_responses()` for requests that have just produced their first generated token ahead of `sample_async` when using overlap scheduling. This is particularly useful for small language models with tight TTFT expectations.
It's experimental and opt-in, defaulting to false.
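The reordering can be pictured with a toy model of one loop iteration. This is only a sketch of the event order, assuming responses are otherwise enqueued after sampling is launched; `overlap_iteration`, the `is_first_token` key, and the callback parameters are hypothetical names, not the executor's API.

```python
from typing import Callable, Dict, List


def overlap_iteration(previous_batch: List[Dict],
                      enqueue_responses: Callable[[List[Dict]], None],
                      sample_async: Callable[[], None],
                      early_first_token: bool = False) -> None:
    """Toy model of one overlap-scheduler iteration (ordering only)."""
    if not early_first_token:
        # Assumed default order: launch sampling, then send all responses.
        sample_async()
        enqueue_responses(previous_batch)
        return

    # Opt-in order: first-token responses leave before sampling is launched.
    first = [r for r in previous_batch if r["is_first_token"]]
    rest = [r for r in previous_batch if not r["is_first_token"]]
    if first:
        enqueue_responses(first)
    sample_async()
    if rest:
        enqueue_responses(rest)
```

In this model the flag changes only when first-token responses are sent; every request still receives exactly one response per iteration.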
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.
Summary by CodeRabbit
New Features
Bug Fixes
Tests