[None][fix] Gate cudaProfilerStart/Stop on iter_counter, not loop counter#13744
Conversation
|
/bot run --disable-fail-fast |
📝 WalkthroughWalkthroughThe ChangesProfiling Counter Source Update
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py (1)
1054-1060:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUse
self.iter_counterin the early-exit stop log for consistency.The
finallypath still logs stop iteration using localit, which can diverge from the new gate source and produce misleading stop logs on early exit.Suggested fix
- logger.info(f"Profiling stopped at iteration {it}, " + logger.info( + f"Profiling stopped at iteration {self.iter_counter}, " f"trace saved to {torch_trace_path}")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 1054 - 1060, The early-exit profiling stop log uses the local variable it which can differ from the actual iteration counter; change the log to use self.iter_counter for consistency. In the block guarded by enable_torch_trace where torch_profiler.stop(), torch_profiler.export_chrome_trace(torch_trace_path) and logger.info(...) are called, replace the usage of it with self.iter_counter in the f-string so the message reflects the real iteration tracked by the executor (refer to symbols torch_profiler, torch_trace_path, logger, and self.iter_counter).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 1054-1060: The early-exit profiling stop log uses the local
variable it which can differ from the actual iteration counter; change the log
to use self.iter_counter for consistency. In the block guarded by
enable_torch_trace where torch_profiler.stop(),
torch_profiler.export_chrome_trace(torch_trace_path) and logger.info(...) are
called, replace the usage of it with self.iter_counter in the f-string so the
message reflects the real iteration tracked by the executor (refer to symbols
torch_profiler, torch_trace_path, logger, and self.iter_counter).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: bf98dde4-4c93-4d07-8363-c39d6e42955a
📒 Files selected for processing (1)
tensorrt_llm/_torch/pyexecutor/py_executor.py
|
/bot run --disable-fail-fast |
|
PR_Github #47099 [ run ] triggered by Bot. Commit: |
|
PR_Github #47099 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
…nter The local ``it`` variable in ``profile_step`` increments on every call — including idle worker-loop spins where no batch is scheduled (common on disagg gen workers waiting for KV transfer to complete). The user-facing iter logs and ``self.iter_counter`` only advance when an actual forward pass executes, so the two diverge: ``it`` reaches ``profile_start_iters`` during pre-benchmark spin-wait and fires ``cudaProfilerStart`` before any real kernels run, leaving the captured nsys trace with no kernel data. Switch the profile gating to ``self.iter_counter`` so it lines up with the iter the user specified via ``TLLM_PROFILE_START_STOP``. Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>
|
/bot run --disable-fail-fast |
d7ddc3f to
83567a0
Compare
|
PR_Github #47228 [ run ] triggered by Bot. Commit: |
|
PR_Github #47228 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #47355 [ run ] triggered by Bot. Commit: |
|
PR_Github #47355 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #47462 [ run ] triggered by Bot. Commit: |
|
PR_Github #47462 [ run ] completed with state |
Summary by CodeRabbit
Description
The local
itvariable insideprofile_step()(py_executor.py) increments on every call toprofile_step()— which includes worker-loop iterations that perform no actual forward pass (e.g. disagg gen workers spinning while waiting for KV transfer to complete from the ctx side).The user-facing iter log line and
self.iter_counteronly advance when a real forward pass executes, soitandself.iter_counterdiverge dramatically during gen-side startup. With the existing gating onit,cudaProfilerStartfires after N idle spins (where N happens to land onprofile_start_iters) — long before any benchmark request reaches the gen worker. The captured nsys trace then contains only empty loop iterations: no kernel events, no GPU activity. The user-visible "Profiling started at iteration 5" message is misleading because it's logged withit, not the real iter the user asked for viaTLLM_PROFILE_START_STOP.This switches the profile-start/stop gating (and the log lines) to
self.iter_counter, which matches the iter semantics used in the per-iter log and initer_stats.iter. After the fix,cudaProfilerStartfires when the real benchmark iter reaches the configured range, and the captured nsys trace contains the intended decode kernels.Test Coverage
Reproduced the bug and verified the fix on the disagg gen-server path:
TLLM_PROFILE_START_STOP=5-15on the gen worker.nsys_worker_proc_GEN_0_*.nsys-repfiles ~0.6 MB each, no CUDA kernel rows in the SQLite export. Log showsProfiling started at iteration 5whileiter = 0is logged repeatedly throughout the run.Profiling started at iteration 5andProfiling stopped at iteration 15log lines now matchself.iter_counter.Manually verified on a 4-node gb300 disagg cluster; no behavior change for non-disagg workloads (
itandself.iter_counteradvance in lockstep there).PR Checklist