[None][fix] Fix Cuda event crash with perf metrics#12639

Merged
jthomson04 merged 2 commits into NVIDIA:main from jthomson04:jthomson04/fix-crash
Apr 5, 2026

Conversation

@jthomson04
Collaborator

@jthomson04 jthomson04 commented Mar 31, 2026

Summary by CodeRabbit

Bug Fixes

  • Fixed GPU timing measurements to ensure accurate elapsed time calculations by properly synchronizing timing events before computation.

When return_perf_metrics=True, compute_batch_gpu_times() in tensorrt_llm/_torch/pyexecutor/perf_metrics_manager.py can call torch.cuda.Event.elapsed_time() before the recorded CUDA events have actually completed. This causes:

Exception in thread Thread-3 (_event_loop_wrapper):
Traceback (most recent call last):
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/dynamo/venv/lib/python3.12/site-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 595, in _event_loop_wrapper
    raise e
  File "/opt/dynamo/venv/lib/python3.12/site-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 591, in _event_loop_wrapper
    self.event_loop()
  File "/opt/dynamo/venv/lib/python3.12/site-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1982, in _executor_loop
    self.perf_manager.compute_batch_gpu_times(
  File "/opt/dynamo/venv/lib/python3.12/site-packages/tensorrt_llm/_torch/pyexecutor/perf_metrics_manager.py", line 166, in compute_batch_gpu_times
    batch_gpu_forward_time = perf.gpu_forward_start_event.elapsed_time(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/dynamo/venv/lib/python3.12/site-packages/torch/cuda/streams.py", line 233, in elapsed_time
    return super().elapsed_time(end_event)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Both events must be completed before calculating elapsed time.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@jthomson04 jthomson04 requested a review from a team as a code owner March 31, 2026 20:25
@jthomson04 jthomson04 requested a review from achartier March 31, 2026 20:25
@jthomson04 jthomson04 changed the title from "[None][fix] Fix Cuda event crash" to "[None][fix] Fix Cuda event crash with perf metrics" on Mar 31, 2026
@jthomson04
Collaborator Author

/bot run --disable-fail-fast

@coderabbitai
Contributor

coderabbitai bot commented Mar 31, 2026

📝 Walkthrough

Walkthrough

Added synchronization checks in compute_batch_gpu_times to ensure GPU timing events are completed before computing elapsed times. The code now verifies event completion using query() and calls synchronize() on any event that has not yet finished before calculating GPU timings.

Changes

GPU Timing Synchronization (tensorrt_llm/_torch/pyexecutor/perf_metrics_manager.py): Added event completion checks and synchronization calls before computing GPU forward and sample elapsed times, preventing incomplete event data from being used in timing calculations.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

Description check — ❓ Inconclusive: The PR description explains the issue well with a detailed error traceback, but the required template sections (Description, Test Coverage, PR Checklist) are incomplete or only partially filled. Resolution: complete the Description section explaining the solution, add Test Coverage details listing relevant tests, and verify all PR Checklist items are properly addressed.
✅ Passed checks (2 passed)
Title check — ✅ Passed: The title accurately describes the main change: fixing a CUDA event crash in the perf metrics manager when GPU timing events aren't completed before elapsed_time() is called.
Docstring Coverage — ✅ Passed: Docstring coverage is 100.00%, above the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (1)
tensorrt_llm/_torch/pyexecutor/perf_metrics_manager.py (1)

166-169: Consider adding a regression test for incomplete-event timing.

This bug was timing-dependent; a focused test that exercises compute_batch_gpu_times() when events are not yet complete would help prevent regressions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/pyexecutor/perf_metrics_manager.py` around lines 166 -
169, Add a regression test that exercises compute_batch_gpu_times() when CUDA
events are still incomplete: create a PerfMetrics-like object with
gpu_forward_end_event and gpu_sample_end_event that initially return False for
query() (or use real CUDA events recorded without synchronization) and verify
compute_batch_gpu_times() calls synchronize() and returns correct timings
without hanging; specifically target the compute_batch_gpu_times function and
assert that gpu_forward_end_event.synchronize() and
gpu_sample_end_event.synchronize() are invoked (or that returned times are
finite/non-zero) to catch regressions in the code paths handling incomplete
events.
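The suggested regression test could be sketched as a self-contained unit that needs no GPU. Everything here is hypothetical illustration: `StubEvent` stands in for torch.cuda.Event, and `guarded_elapsed_time` stands in for the guarded timing path inside compute_batch_gpu_times(), whose real internals are not shown in this thread:

```python
class StubEvent:
    """Mimics the subset of torch.cuda.Event used by the timing code."""

    def __init__(self, timestamp_ms, completed=False):
        self.timestamp_ms = timestamp_ms
        self.completed = completed
        self.synchronize_calls = 0

    def query(self):
        return self.completed

    def synchronize(self):
        self.synchronize_calls += 1
        self.completed = True

    def elapsed_time(self, end):
        if not (self.completed and end.completed):
            raise RuntimeError(
                "Both events must be completed before calculating elapsed time.")
        return end.timestamp_ms - self.timestamp_ms


def guarded_elapsed_time(start, end):
    """Hypothetical guard mirroring the fix: sync incomplete events first."""
    for ev in (start, end):
        if not ev.query():
            ev.synchronize()
    return start.elapsed_time(end)


def test_incomplete_events_are_synchronized():
    start = StubEvent(0.0, completed=True)
    end = StubEvent(3.25)  # not yet complete, as in the reported crash
    # The guard must synchronize the pending event and return a valid timing
    # instead of raising RuntimeError.
    assert guarded_elapsed_time(start, end) == 3.25
    assert end.synchronize_calls == 1
```

A test against the real code would construct a PerfMetrics-like object whose end events report query() == False on first poll, then assert that compute_batch_gpu_times() completes without raising.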

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: cdcffe51-dad7-41ef-9d5b-e1e32cdfa9e9

📥 Commits

Reviewing files that changed from the base of the PR and between 6ac5c15 and cc973cf.

📒 Files selected for processing (1)
  • tensorrt_llm/_torch/pyexecutor/perf_metrics_manager.py

@tensorrt-cicd
Collaborator

PR_Github #41010 [ run ] triggered by Bot. Commit: cc973cf Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41010 [ run ] completed with state FAILURE. Commit: cc973cf
/LLM/main/L0_MergeRequest_PR pipeline #31990 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@jthomson04 jthomson04 force-pushed the jthomson04/fix-crash branch from cc973cf to 2cc23b0 on April 1, 2026 at 16:24
@jthomson04
Collaborator Author

/bot run --disable-fail-fast

@jthomson04 jthomson04 enabled auto-merge (squash) April 1, 2026 16:25
@tensorrt-cicd
Collaborator

PR_Github #41227 [ run ] triggered by Bot. Commit: 2cc23b0 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41227 [ run ] completed with state SUCCESS. Commit: 2cc23b0
/LLM/main/L0_MergeRequest_PR pipeline #32188 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@jthomson04 jthomson04 force-pushed the jthomson04/fix-crash branch from 2cc23b0 to fdde119 on April 3, 2026 at 01:56
@jthomson04
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #41743 [ run ] triggered by Bot. Commit: fdde119 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41743 [ run ] completed with state SUCCESS. Commit: fdde119
/LLM/main/L0_MergeRequest_PR pipeline #32644 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
@jthomson04 jthomson04 force-pushed the jthomson04/fix-crash branch from fdde119 to 4784555 on April 5, 2026 at 02:53
@jthomson04
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #41845 [ run ] triggered by Bot. Commit: 4784555 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41845 [ run ] completed with state SUCCESS. Commit: 4784555
/LLM/main/L0_MergeRequest_PR pipeline #32714 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

CI Report

Link to invocation

@jthomson04 jthomson04 merged commit 24d2340 into NVIDIA:main Apr 5, 2026
5 checks passed
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request Apr 7, 2026
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>
karen-sy pushed a commit to karen-sy/TensorRT-LLM that referenced this pull request Apr 7, 2026
Signed-off-by: jthomson04 <jwillthomson19@gmail.com>