[None][infra] Fix hang when generating report #14625

Merged
EmmaQiaoCh merged 1 commit into NVIDIA:main from EmmaQiaoCh:emma/fix_generate_report_hang
May 28, 2026

Conversation

@EmmaQiaoCh (Collaborator) commented May 27, 2026

Summary by CodeRabbit

  • Chores
    • Added a 15-minute timeout to the rerun report generation step in the test pipeline, so a hung report step can no longer block a run indefinitely.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: EmmaQiaoCh <qqiao@nvidia.com>
@EmmaQiaoCh (Collaborator, Author) commented:

/bot run

@coderabbitai Bot (Contributor) commented May 27, 2026

Walkthrough

The PR adds a 15-minute timeout wrapper around the rerun report generation step in the LLM test execution pipeline, so the pipeline no longer waits indefinitely if report generation hangs or runs far longer than expected.

Changes

Rerun Report Generation Timeout

| Layer / File(s) | Summary |
|---|---|
| Timeout wrapper for rerun report generation (`jenkins/L0_Test.groovy`) | The `generateRerunReport` call is wrapped in a 15-minute timeout block, aborting the report step if it exceeds the time limit. |
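
For orientation, the change summarized above has roughly the following shape. This is a sketch reconstructed from the removed lines of the reviewer's proposed diff later in this thread, not a copy of `jenkins/L0_Test.groovy`; the surrounding pipeline code, including where `stageName` and `llmSrc` come from, is omitted.

```groovy
// Sketch only: reconstructed from the diff quoted in this PR conversation.
// stageName and llmSrc are assumed to be defined by the surrounding pipeline code.
stage ("Generate Report") {
    // Abort the report step if it runs longer than 15 minutes,
    // rather than letting a hung report block the pipeline.
    timeout(time: 15, unit: 'MINUTES') {
        generateRerunReport(stageName, llmSrc)
    }
}
```

When the limit is exceeded, the Jenkins `timeout` step interrupts its body, and that interruption fails the enclosing stage unless something catches it.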

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The PR description is completely empty of substantive content: all sections (Description, Test Coverage, and checklist items) are unfilled template placeholders with no actual explanation of the issue or solution. | Add a clear description of the hang issue, explain why the timeout fix resolves it, and document relevant test coverage that validates the fix. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title '[None][infra] Fix hang when generating report' directly matches the changeset, which wraps report generation in a timeout to prevent hangs. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai Bot (Contributor) left a review comment

🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)

3285-3287: ⚖️ Poor tradeoff

Consider graceful timeout handling to avoid masking test failures.

The timeout wrapper correctly prevents indefinite hangs during report generation. However, if the timeout fires, it will throw an exception that fails the stage before the rerunFailed check at line 3290, potentially masking actual test failures. Since rerun report generation is supplementary to the core test results, consider wrapping the timeout block in a try-catch to log a warning and continue:

♻️ Proposed fix for graceful timeout handling
         // Generate comprehensive rerun report if any reruns occurred
         stage ("Generate Report") {
-            timeout(time: 15, unit: 'MINUTES'){
-                generateRerunReport(stageName, llmSrc)
+            try {
+                timeout(time: 15, unit: 'MINUTES'){
+                    generateRerunReport(stageName, llmSrc)
+                }
+            } catch (org.jenkinsci.plugins.workflow.steps.FlowInterruptedException e) {
+                if (e.causes.any { it instanceof org.jenkinsci.plugins.workflow.steps.TimeoutStepExecution.ExceededTimeout }) {
+                    echo "WARNING: Rerun report generation timed out after 15 minutes. Continuing without rerun report."
+                } else {
+                    throw e
+                }
             }
         }

This ensures that test failures are properly reported even if report generation hangs.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@jenkins/L0_Test.groovy` around lines 3285 - 3287, Wrap the timeout(...) {
generateRerunReport(stageName, llmSrc) } call in a try-catch so that if the
timeout throws it is caught, a warning is logged, and execution continues to the
existing rerunFailed check; specifically, catch the timeout/Exception around the
timeout block that invokes generateRerunReport and use the pipeline logger to
emit a warning (including the exception message) rather than letting the
exception fail the stage.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 72595a57-6b81-4d8f-9007-da096ebb2caf

📥 Commits

Reviewing files that changed from the base of the PR and between 8cd28ed and 40ec89e.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy

@tensorrt-cicd (Collaborator) commented:

PR_Github #50508 [ run ] triggered by Bot. Commit: 40ec89e

@tensorrt-cicd (Collaborator) commented:

PR_Github #50508 [ run ] completed with state SUCCESS. Commit: 40ec89e
/LLM/main/L0_MergeRequest_PR pipeline #40016 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@tburt-nv (Collaborator) commented:

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator) commented:

PR_Github #50565 [ run ] triggered by Bot. Commit: 40ec89e

@tensorrt-cicd (Collaborator) commented:

PR_Github #50565 [ run ] completed with state FAILURE. Commit: 40ec89e
/LLM/main/L0_MergeRequest_PR pipeline #40066 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@EmmaQiaoCh (Collaborator, Author) commented:

/bot skip --comment "The timeout will not impact CI flow"

@EmmaQiaoCh EmmaQiaoCh enabled auto-merge (squash) May 28, 2026 02:32

@tensorrt-cicd (Collaborator) commented:

PR_Github #50670 [ skip ] triggered by Bot. Commit: 40ec89e

@tensorrt-cicd (Collaborator) commented:

PR_Github #50670 [ skip ] completed with state SUCCESS. Commit: 40ec89e
Skipping testing for commit 40ec89e

@EmmaQiaoCh EmmaQiaoCh merged commit fd5fa61 into NVIDIA:main May 28, 2026
14 of 15 checks passed

4 participants