[None][fix] Gate cudaProfilerStart/Stop on iter_counter, not loop counter by Tabrizian · Pull Request #13744 · NVIDIA/TensorRT-LLM

Tabrizian · 2026-05-05T01:20:58Z

Summary by CodeRabbit

Bug Fixes
- Improved profiling state consistency in the PyTorch executor by synchronizing CUDA and PyTorch profiling start/stop events with the executor's iteration counter, ensuring accurate profiling metrics and reliable trace exports during execution.

Description

The local `it` variable inside profile_step() (py_executor.py) increments on every call to profile_step(), including worker-loop iterations that perform no actual forward pass (e.g. disagg gen workers spinning while waiting for KV transfer to complete from the ctx side).

The user-facing iter log line and self.iter_counter advance only when a real forward pass executes, so `it` and self.iter_counter diverge sharply during gen-side startup. With the existing gating on `it`, cudaProfilerStart fires once the idle spin count reaches a value in profile_start_iters, long before any benchmark request reaches the gen worker. The captured nsys trace then contains only empty loop iterations: no kernel events, no GPU activity. The user-visible "Profiling started at iteration 5" message is misleading because it is logged with `it`, not the real iteration the user requested via TLLM_PROFILE_START_STOP.

This switches the profile-start/stop gating (and the log lines) to self.iter_counter, which matches the iter semantics used in the per-iter log and in iter_stats.iter. After the fix, cudaProfilerStart fires when the real benchmark iter reaches the configured range, and the captured nsys trace contains the intended decode kernels.
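The divergence described above can be illustrated with a small stand-alone simulation. This is only a sketch under stated assumptions, not the code in py_executor.py: `parse_range`, `simulate`, and the `schedule` list are hypothetical names introduced here, the `start-stop` string format is inferred from the `5-15` value used in this PR, and the exact point at which the real executor increments its counter relative to the gate check is assumed.

```python
from typing import List, Optional, Tuple


def parse_range(spec: str) -> Tuple[int, int]:
    """Parse a 'start-stop' string such as '5-15' (assumed format)."""
    start, stop = spec.split("-")
    return int(start), int(stop)


def simulate(schedule: List[bool], spec: str,
             gate_on_iter_counter: bool) -> Optional[Tuple[int, int]]:
    """Return (real iteration, loop call index) at which profiling would start.

    schedule[i] is True when loop call i runs a real forward pass and
    False when it is an idle spin. Returns None if the gate never matches.
    """
    start_iter, _ = parse_range(spec)
    iter_counter = 0  # stands in for self.iter_counter
    for it, did_forward in enumerate(schedule):  # `it` counts every call
        gate = iter_counter if gate_on_iter_counter else it
        if gate == start_iter:
            return iter_counter, it
        if did_forward:
            iter_counter += 1  # only real forward passes advance it
    return None


# 20 idle spins (gen worker waiting for KV transfer), then 10 real passes.
disagg_schedule = [False] * 20 + [True] * 10

old_gate = simulate(disagg_schedule, "5-15", gate_on_iter_counter=False)
new_gate = simulate(disagg_schedule, "5-15", gate_on_iter_counter=True)
print("gated on it:          ", old_gate)
print("gated on iter_counter:", new_gate)
```

With this schedule, gating on `it` starts profiling on the sixth loop call while the real iteration count is still 0, whereas gating on the executor counter waits until five forward passes have run. When every call performs a forward pass, both gates fire at the same point, which is consistent with the PR's claim that non-disagg workloads are unaffected.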

Test Coverage

Reproduced the bug and verified the fix on the disagg gen-server path:

DSv4-Pro 8k/1k DEP8 conc=512 with TLLM_PROFILE_START_STOP=5-15 on the gen worker.
Before: gen nsys_worker_proc_GEN_0_*.nsys-rep files ~0.6 MB each, no CUDA kernel rows in the SQLite export. Log shows Profiling started at iteration 5 while iter = 0 is logged repeatedly throughout the run.
After: gen traces ~2 MB each, CUDA kernel rows present. Profiling started at iteration 5 and Profiling stopped at iteration 15 log lines now match self.iter_counter.
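The "no CUDA kernel rows in the SQLite export" check can be scripted roughly as follows. This is a hedged sketch, not the procedure the author ran: it assumes the trace was already exported to SQLite with Nsight Systems and that kernel records live in a table named `CUPTI_ACTIVITY_KIND_KERNEL`; the helper name `count_kernel_rows` is hypothetical. Confirm the table name against the schema of your nsys version before relying on it.

```python
import sqlite3

# Assumed name of the kernel-activity table in an nsys SQLite export.
KERNEL_TABLE = "CUPTI_ACTIVITY_KIND_KERNEL"


def count_kernel_rows(db_path: str) -> int:
    """Count kernel rows in an exported trace; 0 if the table is absent."""
    conn = sqlite3.connect(db_path)
    try:
        exists = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
            (KERNEL_TABLE,),
        ).fetchone()
        if exists is None:
            # An export with no kernel activity may omit the table entirely.
            return 0
        (count,) = conn.execute(
            f"SELECT COUNT(*) FROM {KERNEL_TABLE}").fetchone()
        return count
    finally:
        conn.close()
```

A result of 0 for a gen-worker trace would match the "before" symptom reported here; a positive count would match the "after" state.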

Manually verified on a 4-node gb300 disagg cluster; no behavior change for non-disagg workloads (`it` and self.iter_counter advance in lockstep there).

PR Checklist

I have read the contributing guidelines.
My change is appropriately scoped (single-file change, surgical fix).
New tests cover this code path. (Existing profile-gating logic has no unit-test harness; manual repro on a multi-node cluster is the standard verification path.)
Pre-commit hooks pass.

Tabrizian · 2026-05-05T01:22:55Z

/bot run --disable-fail-fast

coderabbitai · 2026-05-05T01:24:13Z

📝 Walkthrough

The _profiler() method's profile_step() function now uses self.iter_counter (the executor's iteration counter) instead of the local it counter to gate CUDA/PyTorch profiling start and stop events. Profiling operations, assertions, and logging are updated to reference the executor's counter for consistency.

Changes

Profiling Counter Source Update

Layer / File(s)	Summary
Profiling Stop Logic `tensorrt_llm/_torch/pyexecutor/py_executor.py` (lines 977–988)	Condition `it in self.profile_stop_iters` is replaced with `self.iter_counter in self.profile_stop_iters`. The torch profiler stop, trace export, and stop iteration log now use `self.iter_counter`.
Profiling Start Logic `tensorrt_llm/_torch/pyexecutor/py_executor.py` (lines 1030–1039)	Condition `it in self.profile_start_iters` is replaced with `self.iter_counter in self.profile_start_iters`. Profiling state assertion, calibrator/CUDA profiler start, optional torch profiler start, and start iteration log now use `self.iter_counter`.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main change: switching profiler gating from loop counter to iter_counter to fix profiling behavior.
Description check	✅ Passed	The PR description is thorough and well-structured, covering the problem, solution, test verification, and checklist items as per the template requirements.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tensorrt_llm/_torch/pyexecutor/py_executor.py (1)

1054-1060: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use self.iter_counter in the early-exit stop log for consistency.

The finally path still logs the stop iteration using the local `it`, which can diverge from the new gate source (self.iter_counter) and produce misleading stop logs on early exit.

Suggested fix

-                    logger.info(f"Profiling stopped at iteration {it}, "
+                    logger.info(
+                        f"Profiling stopped at iteration {self.iter_counter}, "
                                 f"trace saved to {torch_trace_path}")

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tensorrt_llm/_torch/pyexecutor/py_executor.py` around lines 1054 - 1060, The
early-exit profiling stop log uses the local variable it which can differ from
the actual iteration counter; change the log to use self.iter_counter for
consistency. In the block guarded by enable_torch_trace where
torch_profiler.stop(), torch_profiler.export_chrome_trace(torch_trace_path) and
logger.info(...) are called, replace the usage of it with self.iter_counter in
the f-string so the message reflects the real iteration tracked by the executor
(refer to symbols torch_profiler, torch_trace_path, logger, and
self.iter_counter).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: bf98dde4-4c93-4d07-8363-c39d6e42955a

📥 Commits

Reviewing files that changed from the base of the PR and between ad2fc22 and d7ddc3f.

📒 Files selected for processing (1)

tensorrt_llm/_torch/pyexecutor/py_executor.py

lfr-0531 · 2026-05-07T03:31:04Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-07T03:38:14Z

PR_Github #47099 [ run ] triggered by Bot. Commit: d7ddc3f Link to invocation

tensorrt-cicd · 2026-05-07T12:57:24Z

PR_Github #47099 [ run ] completed with state SUCCESS. Commit: d7ddc3f
/LLM/main/L0_MergeRequest_PR pipeline #37068 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

pcastonguay · 2026-05-07T15:08:55Z

/bot run --disable-fail-fast

…nter

The local ``it`` variable in ``profile_step`` increments on every call, including idle worker-loop spins where no batch is scheduled (common on disagg gen workers waiting for KV transfer to complete). The user-facing iter logs and ``self.iter_counter`` only advance when an actual forward pass executes, so the two diverge: ``it`` reaches ``profile_start_iters`` during pre-benchmark spin-wait and fires ``cudaProfilerStart`` before any real kernels run, leaving the captured nsys trace with no kernel data.

Switch the profile gating to ``self.iter_counter`` so it lines up with the iter the user specified via ``TLLM_PROFILE_START_STOP``.

Signed-off-by: Iman Tabrizian <10105175+tabrizian@users.noreply.github.com>

Tabrizian · 2026-05-07T17:09:24Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-07T17:15:06Z

PR_Github #47228 [ run ] triggered by Bot. Commit: 83567a0 Link to invocation

tensorrt-cicd · 2026-05-08T03:47:12Z

PR_Github #47228 [ run ] completed with state SUCCESS. Commit: 83567a0
/LLM/main/L0_MergeRequest_PR pipeline #37183 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

lfr-0531 · 2026-05-08T08:09:59Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-08T08:16:41Z

PR_Github #47355 [ run ] triggered by Bot. Commit: 83567a0 Link to invocation

tensorrt-cicd · 2026-05-08T14:57:42Z

PR_Github #47355 [ run ] completed with state SUCCESS. Commit: 83567a0
/LLM/main/L0_MergeRequest_PR pipeline #37291 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

lfr-0531 · 2026-05-09T01:57:15Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-09T02:02:29Z

PR_Github #47462 [ run ] triggered by Bot. Commit: 83567a0 Link to invocation

tensorrt-cicd · 2026-05-09T02:59:04Z

PR_Github #47462 [ run ] completed with state SUCCESS. Commit: 83567a0
/LLM/main/L0_MergeRequest_PR pipeline #37383 completed with status: 'SUCCESS'

CI Report

Link to invocation

…nter (NVIDIA#13744)

Tabrizian requested a review from a team as a code owner May 5, 2026 01:20

Tabrizian requested a review from lancelly May 5, 2026 01:21

github-actions Bot assigned Tabrizian May 5, 2026

coderabbitai Bot reviewed May 5, 2026

pcastonguay approved these changes May 5, 2026

Tabrizian force-pushed the user/imant/fix-profile-iter-counter branch from d7ddc3f to 83567a0 Compare May 7, 2026 17:09

Tabrizian merged commit f44f752 into NVIDIA:main May 9, 2026
5 of 6 checks passed

Tabrizian mentioned this pull request May 10, 2026

[None][fix] Gate cudaProfilerStart/Stop on iter_counter, not loop counter #13958

Merged

4 tasks

yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request May 19, 2026

[None][fix] Gate cudaProfilerStart/Stop on iter_counter, not loop cou…

d55bcb3

…nter (NVIDIA#13744)

4 participants
