[None][infra] Fix hang when generating report by EmmaQiaoCh · Pull Request #14625 · NVIDIA/TensorRT-LLM

EmmaQiaoCh · 2026-05-27T08:49:00Z

Summary by CodeRabbit

Chores
- Added a 15-minute timeout to the rerun report generation step in the test pipeline, so a hung report step can no longer stall the run indefinitely.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

- PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
- PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
- Test cases are provided for new code paths (see test instructions)
- If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
- Any new dependencies have been scanned for license and vulnerabilities
- CODEOWNERS updated if ownership changes
- Documentation updated as needed
- Update tava architecture diagram if there is a significant design change in PR.
- The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

EmmaQiaoCh · 2026-05-27T08:49:13Z

/bot run

coderabbitai · 2026-05-27T08:51:52Z

📝 Walkthrough

The PR wraps the rerun report generation step of the LLM test pipeline in a 15-minute timeout, so the step is aborted instead of running indefinitely if report generation hangs.

Changes

Rerun Report Generation Timeout

Layer / File(s)	Summary
Timeout wrapper for rerun report generation `jenkins/L0_Test.groovy`	The `generateRerunReport` call is wrapped in a 15-minute timeout block, aborting the report step if it exceeds the time limit.
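
For orientation, the wrapped call might look roughly like the sketch below. It is reconstructed from the lines quoted in the review diff on this PR, not copied from the merged file. `timeout` is the standard Jenkins Pipeline step; the surrounding stage is shown only for context, and the rest of the pipeline is omitted.

```groovy
// Sketch only: a fragment of a scripted Jenkins pipeline, not a standalone script.
// generateRerunReport, stageName, and llmSrc are the names quoted in the review
// comments on this PR; their definitions live elsewhere in jenkins/L0_Test.groovy.
stage("Generate Report") {
    // If the body runs longer than 15 minutes, Jenkins interrupts it and the
    // timeout step throws, which fails this stage unless the caller catches it.
    timeout(time: 15, unit: 'MINUTES') {
        generateRerunReport(stageName, llmSrc)
    }
}
```

Because the step throws when the limit is exceeded, any code that follows it in the same block runs only if report generation finishes in time.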

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description is completely empty of substantive content—all sections (Description, Test Coverage, and checklist items) are unfilled template placeholders with no actual explanation of the issue or solution.	Add a clear description of the hang issue, explain why the timeout fix resolves it, and document relevant test coverage that validates the fix.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title '[None][infra] Fix hang when generating report' directly matches the changeset which wraps report generation in a timeout to prevent hangs.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

coderabbitai

🧹 Nitpick comments (1)

jenkins/L0_Test.groovy (1)

3285-3287: ⚖️ Poor tradeoff

Consider graceful timeout handling to avoid masking test failures.

The timeout wrapper correctly prevents indefinite hangs during report generation. However, if the timeout fires, it will throw an exception that fails the stage before the rerunFailed check at line 3290, potentially masking actual test failures. Since rerun report generation is supplementary to the core test results, consider wrapping the timeout block in a try-catch to log a warning and continue:

♻️ Proposed fix for graceful timeout handling

         // Generate comprehensive rerun report if any reruns occurred
         stage ("Generate Report") {
-            timeout(time: 15, unit: 'MINUTES'){
-                generateRerunReport(stageName, llmSrc)
+            try {
+                timeout(time: 15, unit: 'MINUTES'){
+                    generateRerunReport(stageName, llmSrc)
+                }
+            } catch (org.jenkinsci.plugins.workflow.steps.FlowInterruptedException e) {
+                if (e.causes.any { it instanceof org.jenkinsci.plugins.workflow.steps.TimeoutStepExecution.ExceededTimeout }) {
+                    echo "WARNING: Rerun report generation timed out after 15 minutes. Continuing without rerun report."
+                } else {
+                    throw e
+                }
             }
         }

This ensures that test failures are properly reported even if report generation hangs.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@jenkins/L0_Test.groovy` around lines 3285 - 3287, Wrap the timeout(...) {
generateRerunReport(stageName, llmSrc) } call in a try-catch so that if the
timeout throws it is caught, a warning is logged, and execution continues to the
existing rerunFailed check; specifically, catch the timeout/Exception around the
timeout block that invokes generateRerunReport and use the pipeline logger to
emit a warning (including the exception message) rather than letting the
exception fail the stage.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 72595a57-6b81-4d8f-9007-da096ebb2caf

📥 Commits

Reviewing files that changed from the base of the PR and between 8cd28ed and 40ec89e.

📒 Files selected for processing (1)

jenkins/L0_Test.groovy

tensorrt-cicd · 2026-05-27T08:55:32Z

PR_Github #50508 [ run ] triggered by Bot. Commit: 40ec89e Link to invocation

tensorrt-cicd · 2026-05-27T10:51:22Z

PR_Github #50508 [ run ] completed with state SUCCESS. Commit: 40ec89e
/LLM/main/L0_MergeRequest_PR pipeline #40016 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

- Please check the failed tests and fix your PR
- If you cannot view the failures, ask the CI triggerer to share details
- Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

tburt-nv · 2026-05-27T15:10:08Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-27T15:17:08Z

PR_Github #50565 [ run ] triggered by Bot. Commit: 40ec89e Link to invocation

tensorrt-cicd · 2026-05-27T22:56:54Z

PR_Github #50565 [ run ] completed with state FAILURE. Commit: 40ec89e
/LLM/main/L0_MergeRequest_PR pipeline #40066 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

- Please check the failed tests and fix your PR
- If you cannot view the failures, ask the CI triggerer to share details
- Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

EmmaQiaoCh · 2026-05-28T02:32:24Z

/bot skip --comment "The timeout will not impact CI flow"

tensorrt-cicd · 2026-05-28T02:38:44Z

PR_Github #50670 [ skip ] triggered by Bot. Commit: 40ec89e Link to invocation

tensorrt-cicd · 2026-05-28T02:44:37Z

PR_Github #50670 [ skip ] completed with state SUCCESS. Commit: 40ec89e
Skipping testing for commit 40ec89e

Link to invocation

Fix hang when generating report

40ec89e

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>

EmmaQiaoCh requested review from a team as code owners May 27, 2026 08:49

EmmaQiaoCh requested review from ZhanruiSunCh and dpitman-nvda May 27, 2026 08:49

github-actions Bot assigned EmmaQiaoCh May 27, 2026

coderabbitai Bot reviewed May 27, 2026

tburt-nv approved these changes May 27, 2026

mzweilz approved these changes May 28, 2026

EmmaQiaoCh enabled auto-merge (squash) May 28, 2026 02:32

EmmaQiaoCh merged commit fd5fa61 into NVIDIA:main May 28, 2026
14 of 15 checks passed

coderabbitai Bot mentioned this pull request May 29, 2026

[TRTLLMINF-113][infra] Add timeout protection to Setup/Initialize stages #14682

Open

1 task

4 participants
