[https://nvbugs/5963665][refactor] Refactor warmup orchestration in ModelEngine #12407
liji-nv wants to merge 1 commit into NVIDIA:main
Conversation
/bot run
📝 Walkthrough: Refactored the model engine's warmup flow by adding two helper methods for deriving warmup request configurations and restructuring the warmup execution path.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks: ✅ 1 passed | ❌ 2 inconclusive
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/model_engine.py`:
- Around lines 683-688: The curr_max_num_tokens calculation calls
kv_cache_manager.get_num_available_tokens with
max_num_draft_tokens=self.original_max_draft_len and no batch_size, which
differs from the constraints used by the later warmup path. Change the call to
pass the real constraints used for warmups — max_num_draft_tokens=self.max_total_draft_tokens
and batch_size=self.batch_size (keeping token_num_upper_bound as-is) — so that
curr_max_num_tokens reflects the actual KV-cache limits and the derived
"max-shape" configs remain achievable.
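A minimal sketch of the suggested call change. The class below is a fake stand-in for the KV-cache manager: its token-accounting rule, the attribute names, and the helper function are illustrative assumptions, not the real TRT-LLM API.

```python
# Hypothetical sketch: pass the same constraints the warmup path actually
# uses (max_total_draft_tokens, batch_size) when sizing the max-shape config.

class FakeKVCacheManager:
    """Stand-in mimicking get_num_available_tokens semantics (assumed)."""
    def __init__(self, free_tokens: int):
        self.free_tokens = free_tokens

    def get_num_available_tokens(self, max_num_draft_tokens: int = 0,
                                 batch_size: int = 1) -> int:
        # Assume each request in the batch reserves room for its draft tokens.
        return self.free_tokens - batch_size * max_num_draft_tokens

def compute_curr_max_num_tokens(kv_cache_manager, max_total_draft_tokens,
                                batch_size, token_num_upper_bound):
    # Use the real warmup constraints so the derived "max shape" is
    # achievable within the actual KV-cache budget.
    available = kv_cache_manager.get_num_available_tokens(
        max_num_draft_tokens=max_total_draft_tokens,
        batch_size=batch_size,
    )
    return min(available, token_num_upper_bound)

mgr = FakeKVCacheManager(free_tokens=4096)
print(compute_curr_max_num_tokens(mgr, max_total_draft_tokens=4,
                                  batch_size=8, token_num_upper_bound=8192))
# 4096 - 8*4 = 4064
```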
- Around lines 748-752: The new pre-specialization pass (the _general_warmup
call, via forward, over the requests from _get_full_general_warmup_requests)
currently runs while CUDA-graph capture may still be enabled, which can
prematurely populate the CUDA-graph cache. Run that pass with capture disabled
(for example, temporarily set cuda_graph_runner.enabled = False, or wrap the
self._general_warmup call in a non-capturing context), restore the original
runner state, and only then call self._run_cuda_graph_warmup(), so the
dedicated CUDA-graph warmup is the only path that populates the graph cache.
Keep forward() unchanged apart from observing the temporary non-capturing
runner state.
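One way to realize the "temporarily disable capture" suggestion is a small context manager. FakeCudaGraphRunner and its enabled flag are hypothetical stand-ins for the engine's cuda_graph_runner, sketching only the enable/restore pattern:

```python
from contextlib import contextmanager

class FakeCudaGraphRunner:
    """Hypothetical stand-in for the engine's cuda_graph_runner."""
    def __init__(self):
        self.enabled = True

@contextmanager
def no_cuda_graph_capture(runner):
    # Disable capture for the duration of the block, then restore the
    # previous state even if the warmed-up forward pass raises.
    prev = runner.enabled
    runner.enabled = False
    try:
        yield
    finally:
        runner.enabled = prev

runner = FakeCudaGraphRunner()
with no_cuda_graph_capture(runner):
    # The pre-specialization _general_warmup call would run here,
    # without populating the CUDA-graph cache.
    assert runner.enabled is False
assert runner.enabled is True  # original state restored
```

The try/finally guarantees the runner is re-enabled before the dedicated _run_cuda_graph_warmup() pass, even on error.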
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8861ba34-93f9-4691-9252-05fb84e1dd51
📒 Files selected for processing (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py
PR_Github #39719 [ run ] triggered by Bot. Commit:
[https://nvbugs/5963665][refactor] Refactor warmup orchestration in ModelEngine

Refactor the warmup flow to improve clarity and structure:
- Extract _get_max_shape_warmup_requests and _get_full_general_warmup_requests to separate warmup config construction from execution
- Replace _run_torch_compile_warmup with a unified _general_warmup call that handles both torch.compile specialization and memory pool pre-population
- Consolidate the can_run_general_warmup condition into a named variable for readability
- Fix deduplication in warmup config list construction

Signed-off-by: Jin Li <59594262+liji-nv@users.noreply.github.com>
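The "fix deduplication in warmup config list construction" item can be illustrated with an order-preserving dedup; the (num_tokens, batch_size) tuple shape is an assumption for illustration, not the engine's actual config structure:

```python
def dedup_warmup_configs(configs):
    """Drop duplicate warmup configs while preserving first-seen order.

    Assumes configs are hashable tuples such as (num_tokens, batch_size).
    """
    seen = set()
    out = []
    for cfg in configs:
        if cfg not in seen:
            seen.add(cfg)
            out.append(cfg)
    return out

configs = [(64, 1), (128, 2), (64, 1), (256, 4)]
print(dedup_warmup_configs(configs))  # [(64, 1), (128, 2), (256, 4)]
```

Preserving order matters here because warmup passes are typically run from smallest to largest shape; a set-based dedup alone would scramble that ordering.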
Force-pushed ac89ebb to 850f28a
/bot run
PR_Github #39720 [ run ] triggered by Bot. Commit:
PR_Github #39720 [ run ] completed with state
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.