[None][fix] Reuse prior-attempt passes when infra retry fires by dpitman-nvda · Pull Request #14002 · NVIDIA/TensorRT-LLM

dpitman-nvda · 2026-05-11T15:31:12Z

Summary by CodeRabbit

Chores
- Enhanced infrastructure retry handling with improved recovery mechanism that reuses previously passing tests from earlier pipeline attempts
- Added capability to extract and identify passed test cases from JUnit XML result files for better test tracking

Description

When the infra-failure retry loop fires (e.g. a stage hits the SLURM walltime "DUE TO TIME LIMIT" pattern, the K8s pod gets evicted with "Reason: Evicted", etc.), each retry attempt today runs the test list from scratch even though the prior attempt may have completed many passes before the failure. The existing reusePassedTestResults() only queries OpenSearch, which is populated after a pipeline run completes -- it has no visibility into passes from the current run's earlier attempts because those attempts never reached completion.

Observed in a real build of L0_Test-SBSA-Multi-GPU #974: the first attempt timed out partway through; the retry attempt re-ran the entire test list including everything that had already passed.

Fix: reusePassedTestResults() now also downloads any prior-attempt tarballs from this build's own Artifactory upload path and parses their results.xml for passes, merging them with the OpenSearch list before appending the union to waives.txt as SKIPs.

- reusePassedTestResults() takes a new postTag arg.
- New helper priorAttemptTags(postTag) decodes the suffix and returns the postTag values used by earlier attempts of the same stage in this build.
- For each prior tag, wget the tarball from ${UPLOAD_PATH}/test-results/. 404 is benign and skipped.
- Untar, find results.xml, feed into a new test_rerun.py extract_passed_tests mode that walks JUnit XML and emits passed test names via the existing parse_name helper.
- Merge with OpenSearch results, dedupe, append to waives.txt with the same SKIP-reason marker -- downstream pytest waive-list handling is unchanged.
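
The JUnit extraction step in the list above can be sketched roughly as follows. This is a minimal illustration, not the code in `test_rerun.py`: the function name `extract_passed_names` and the `classname::name` naming scheme are assumptions made for the example (the real script routes names through its existing `parse_name` helper, which is not reproduced here), and the sketch uses the standard-library XML parser only to stay self-contained.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List

# A testcase counts as passed only if it has none of these child elements.
NOT_PASSED_TAGS = ("failure", "error", "skipped")


def extract_passed_names(xml_paths: Iterable[str]) -> List[str]:
    """Collect unique names of testcases with no failure/error/skipped child."""
    passed: List[str] = []
    seen = set()
    for path in xml_paths:
        root = ET.parse(path).getroot()
        for case in root.iter("testcase"):
            if any(case.find(tag) is not None for tag in NOT_PASSED_TAGS):
                continue
            # Hypothetical naming scheme; the real script uses parse_name.
            name = f"{case.get('classname', '')}::{case.get('name', '')}"
            if name not in seen:
                seen.add(name)
                passed.append(name)
    return passed
```

Preserving first-seen order keeps the output stable across runs, which makes the lines appended to waives.txt easier to diff.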

Threading:

- runLLMTestlistOnPlatformImpl(): new postTag parameter.
- runLLMTestlistOnPlatform() and executeLLMTestOnSlurm() pass it through.
- The two reusePassedTestResults() call sites (SLURM path at L1298, K8s path at L3176) now pass postTag.

If REUSE_TEST is explicitly disabled, the whole function is short-circuited at the call site, so the new behaviour is opt-in via the existing flag.
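
As a rough illustration of the merge step and the flag gate described above, the sketch below unions the two sources of passed tests and appends one SKIP line per test. The function names, the `reuse_enabled` parameter, and the `<test name> SKIP (...)` line layout are assumptions for this example; only the marker text is taken from the review summary of this PR.

```python
from typing import Iterable, List

# Marker text as quoted in the CodeRabbit walkthrough of this PR.
SKIP_MARKER = "SKIP (Reused from previous pipeline attempt)"


def merge_passed_tests(opensearch_passes: Iterable[str],
                       prior_attempt_passes: Iterable[str]) -> List[str]:
    """Union of both sources, deduplicated, first-seen order preserved."""
    seen = set()
    merged: List[str] = []
    for raw in list(opensearch_passes) + list(prior_attempt_passes):
        name = raw.strip()
        if name and name not in seen:
            seen.add(name)
            merged.append(name)
    return merged


def append_reused_skips(waives_lines: List[str],
                        passed: Iterable[str],
                        reuse_enabled: bool = True) -> List[str]:
    """Return the waive lines plus one SKIP line per reused pass.

    With reuse_enabled=False nothing is appended, mirroring the
    call-site short circuit (hypothetical parameter for this sketch).
    """
    if not reuse_enabled:
        return list(waives_lines)
    return list(waives_lines) + [f"{name} {SKIP_MARKER}" for name in passed]
```

Because the appended lines use the same marker as the existing reuse path, whatever consumes waives.txt downstream would not need to tell the two sources apart.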

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

dpitman-nvda · 2026-05-11T15:31:36Z

/bot run

coderabbitai · 2026-05-11T15:37:45Z

📝 Walkthrough

This PR extends the Jenkins CI pipeline's test reuse logic to recover passed tests from previous infra retry attempts. When a pipeline retries due to infrastructure failures, it now extracts passed tests from prior attempt artifacts, deduplicates them, and reuses them. A new test extraction utility parses JUnit XML, a helper decodes retry tags, and the core reuse function orchestrates recovery across SLURM and Kubernetes platforms.

Changes

Prior Attempt Test Recovery

| Layer / File(s) | Summary |
| --- | --- |
| Test Extraction Utility `jenkins/scripts/test_rerun.py` | New `extract_passed_tests(output_file, xml_filenames)` parses JUnit XML files, identifies testcases with no failure/error/skipped elements, deduplicates test names, and writes to output file. CLI subcommand `extract_passed_tests` added with `--output-file` and `--input-files` arguments. |
| Prior Attempt Tag Decoder `jenkins/L0_Test.groovy` | New `priorAttemptTags(String postTag)` helper function decodes retry attempt suffix format (e.g., `-attempt-N`) and returns prior attempt tags to search for recovery; returns empty list when no retry context detected. |
| Reuse Logic with Prior Attempt Recovery `jenkins/L0_Test.groovy` | `reusePassedTestResults(llmSrc, stageName, waivesTxt, String postTag)` rewritten to accept postTag parameter. When postTag indicates an infra retry, downloads prior attempt tarballs (`results-${stageName}${priorTag}.tar.gz`) from Artifactory, extracts passed tests via `test_rerun.py extract_passed_tests`, deduplicates across sources (OpenSearch + prior attempts), and appends to waives file with `SKIP (Reused from previous pipeline attempt)` marker. |
| Function Signature Extension `jenkins/L0_Test.groovy` | `runLLMTestlistOnPlatformImpl` signature extended with `String postTag=""` trailing parameter to enable threading of retry attempt tag through test execution path. |
| Call Site Wiring `jenkins/L0_Test.groovy` | Four call sites updated to pass postTag: (1) SLURM runner in `executeLLMTestOnSlurm` passes `typeCheck` and `postTag` to `runLLMTestlistOnPlatformImpl`; (2) SLURM-sbatch path passes `postTag` to `reusePassedTestResults`; (3) Kubernetes/non-sbatch path passes `postTag` to `reusePassedTestResults`; (4) Main wrapper finally block forwards `typeCheck` and `postTag` into platform implementation. |
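
The tag decoder summarized in the table can be pictured with a small Python sketch (the real helper is Groovy). It assumes, purely for illustration, that every attempt carries a `-attempt-N` suffix and that attempts are numbered from 1; the actual suffix scheme in `priorAttemptTags` may differ. The tarball name pattern is the one quoted in the table.

```python
import re
from typing import List

# Assumed suffix shape: "<anything>-attempt-<N>", following the "-attempt-N"
# example in the table above. Hypothetical; not taken from the Groovy source.
_ATTEMPT_SUFFIX = re.compile(r"^(?P<base>.*)-attempt-(?P<n>\d+)$")


def prior_attempt_tags(post_tag: str) -> List[str]:
    """Return the tags of earlier attempts, or [] when there is no retry context."""
    match = _ATTEMPT_SUFFIX.match(post_tag or "")
    if match is None:
        return []
    base, current = match.group("base"), int(match.group("n"))
    return [f"{base}-attempt-{i}" for i in range(1, current)]


def prior_tarball_names(stage_name: str, post_tag: str) -> List[str]:
    """Build results-<stage><tag>.tar.gz for each earlier attempt."""
    return [f"results-{stage_name}{tag}.tar.gz"
            for tag in prior_attempt_tags(post_tag)]
```

Returning an empty list for an unrecognized tag means a first attempt, or a run without retries, falls through to the OpenSearch-only behaviour.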

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: reusing passed tests from prior attempts when infrastructure retries occur. |
| Description check | ✅ Passed | The PR description provides comprehensive explanation of the problem, solution, and implementation details including threading of parameters. However, the Test Coverage section is empty, which is a required section in the template. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@jenkins/L0_Test.groovy`:
- Around line 2825-2831: Only treat a genuine HTTP 404 from the prior-artifact
fetch as benign: change the wget + fetchStatus logic to capture wget/curl
response and inspect for a 404 string (or use curl -fS to get HTTP status) and
only on an explicit 404 return/skip; for any other non-zero fetchStatus
(network/auth/other HTTP errors) fail the build or surface the error instead of
silently skipping. Also remove the unconditional "|| true" from the tar
extraction so that tar failures surface; locate the variables/commands
fetchStatus, priorDir, tarUrl, tarName and the wget/tar invocations in this
block and implement the conditional 404 check and fail-on-other-errors behavior.

In `@jenkins/scripts/test_rerun.py`:
- Around line 19-33: The extract_passed_tests function is parsing untrusted
JUnit XML artifacts with xml.etree.ElementTree (vulnerable to XXE/XML bombs);
switch to a hardened parser by replacing uses/imports of xml.etree.ElementTree
with defusedxml.ElementTree (e.g., import defusedxml.ElementTree as ET) in
jenkins/scripts/test_rerun.py so ET.parse(...) uses the defused implementation,
and add defusedxml to the project's requirements.txt so the package is
installed.
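
The first finding above (treat only a genuine HTTP 404 as benign) can be illustrated with a short Python sketch. It swaps the pipeline's shell `wget` for the standard-library `urllib` so the example is self-contained; the function name and the injectable `opener` parameter are assumptions for the example, and nothing here shows how the merged Groovy code actually handles the fetch.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional


def fetch_prior_tarball(
        url: str,
        opener: Callable = urllib.request.urlopen) -> Optional[bytes]:
    """Return the tarball bytes, or None only when the server answers 404.

    Any other HTTP status (401, 403, 5xx, ...) is re-raised, and network
    errors such as URLError propagate, so they surface instead of being
    skipped silently.
    """
    try:
        with opener(url) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            # No artifact was uploaded for that attempt: benign, skip it.
            return None
        raise
```

Passing the opener as a parameter lets the 404 and non-404 branches be exercised without network access.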

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2ed10fac-b4bb-4d46-bb89-ab81b15aedb3

📥 Commits

Reviewing files that changed from the base of the PR and between f3570a8 and 8ceb366.

📒 Files selected for processing (2)

jenkins/L0_Test.groovy
jenkins/scripts/test_rerun.py

tensorrt-cicd · 2026-05-11T15:38:07Z

PR_Github #47766 [ run ] triggered by Bot. Commit: 8ceb366 Link to invocation

tensorrt-cicd · 2026-05-11T16:14:29Z

PR_Github #47766 [ run ] completed with state ABORTED. Commit: 8ceb366

Link to invocation

dpitman-nvda · 2026-05-11T16:18:59Z

/bot run

tensorrt-cicd · 2026-05-11T16:25:38Z

PR_Github #47774 [ run ] triggered by Bot. Commit: 698d449 Link to invocation

tensorrt-cicd · 2026-05-11T18:22:07Z

PR_Github #47774 [ run ] completed with state SUCCESS. Commit: 698d449
/LLM/main/L0_MergeRequest_PR pipeline #37666 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

dpitman-nvda · 2026-05-11T18:27:50Z

/bot run

tensorrt-cicd · 2026-05-11T18:33:30Z

PR_Github #47787 [ run ] triggered by Bot. Commit: 698d449 Link to invocation

tensorrt-cicd · 2026-05-11T23:10:46Z

PR_Github #47787 [ run ] completed with state SUCCESS. Commit: 698d449
/LLM/main/L0_MergeRequest_PR pipeline #37678 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

dpitman-nvda · 2026-05-12T16:41:38Z

/bot run

tensorrt-cicd · 2026-05-12T16:46:49Z

PR_Github #47996 [ run ] triggered by Bot. Commit: e54f9db Link to invocation

tensorrt-cicd · 2026-05-12T22:14:15Z

PR_Github #47996 [ run ] completed with state FAILURE. Commit: e54f9db
/LLM/main/L0_MergeRequest_PR pipeline #37833 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

dpitman-nvda · 2026-05-13T12:35:31Z

/bot run

When the infra-failure retry loop fires (e.g. a stage hits the SLURM walltime "DUE TO TIME LIMIT" pattern, the K8s pod gets evicted with "Reason: Evicted", etc.), each retry attempt today runs the test list from scratch even though the prior attempt may have completed many passes before the failure. The existing reusePassedTestResults() only queries OpenSearch, which is populated *after* a pipeline run completes -- it has no visibility into passes from the current run's earlier attempts because those attempts never reached completion.

Observed in a real build of L0_Test-SBSA-Multi-GPU #974: the first attempt timed out partway through; the retry attempt re-ran the entire test list including everything that had already passed.

Fix: reusePassedTestResults() now also downloads any prior-attempt tarballs from this build's own Artifactory upload path and parses their results.xml for passes, merging them with the OpenSearch list before appending the union to waives.txt as SKIPs.

- reusePassedTestResults() takes a new postTag arg.
- New helper priorAttemptTags(postTag) decodes the suffix and returns the postTag values used by earlier attempts of the same stage in this build.
- For each prior tag, wget the tarball from ${UPLOAD_PATH}/test-results/. 404 is benign and skipped.
- Untar, find results.xml, feed into a new test_rerun.py extract_passed_tests mode that walks JUnit XML and emits passed test names via the existing parse_name helper.
- Merge with OpenSearch results, dedupe, append to waives.txt with the same SKIP-reason marker -- downstream pytest waive-list handling is unchanged.

Threading:

- runLLMTestlistOnPlatformImpl(): new postTag parameter.
- runLLMTestlistOnPlatform() and executeLLMTestOnSlurm() pass it through.
- The two reusePassedTestResults() call sites (SLURM path at L1298, K8s path at L3176) now pass postTag.

If REUSE_TEST is explicitly disabled, the whole function is short-circuited at the call site, so the new behaviour is opt-in via the existing flag.

Signed-off-by: Derek Pitman <dpitman@nvidia.com>

dpitman-nvda · 2026-05-13T15:06:56Z

/bot skip --comment "We got through single-GPU runs already, this should be good to go"

tensorrt-cicd · 2026-05-13T15:13:11Z

PR_Github #48199 [ skip ] triggered by Bot. Commit: 313b2a2 Link to invocation

tensorrt-cicd · 2026-05-13T15:19:04Z

PR_Github #48199 [ skip ] completed with state SUCCESS. Commit: 313b2a2
Skipping testing for commit 313b2a2

Link to invocation

Signed-off-by: Derek Pitman <dpitman@nvidia.com>

Obsolete given the dependency has been dropped

dpitman-nvda · 2026-05-14T14:56:50Z

/bot skip --comment "No material changes since last update"

tensorrt-cicd · 2026-05-14T15:02:13Z

PR_Github #48382 [ skip ] triggered by Bot. Commit: 5596c4f Link to invocation

tensorrt-cicd · 2026-05-14T15:08:24Z

PR_Github #48382 [ skip ] completed with state SUCCESS. Commit: 5596c4f
Skipping testing for commit 5596c4f

Link to invocation

dpitman-nvda requested review from a team as code owners May 11, 2026 15:31

dpitman-nvda requested review from EmmaQiaoCh and mlefeb01 May 11, 2026 15:31

github-actions Bot assigned dpitman-nvda May 11, 2026

coderabbitai Bot reviewed May 11, 2026

Comment thread jenkins/L0_Test.groovy Outdated

Comment thread jenkins/scripts/test_rerun.py

dpitman-nvda force-pushed the fix/reuse-passed-tests-on-retry branch from 8ceb366 to 4a432e7 Compare May 11, 2026 16:16

dpitman-nvda requested a review from a team as a code owner May 11, 2026 16:16

litaotju previously requested changes May 12, 2026

Comment thread requirements.txt Outdated

dpitman-nvda force-pushed the fix/reuse-passed-tests-on-retry branch from 698d449 to d13e931 Compare May 12, 2026 15:57

dpitman-nvda force-pushed the fix/reuse-passed-tests-on-retry branch from e54f9db to 313b2a2 Compare May 13, 2026 14:50

mzweilz approved these changes May 14, 2026

tburt-nv requested a review from yiqingy0 May 14, 2026 09:19

tburt-nv reviewed May 14, 2026

Comment thread requirements-dev.txt Outdated

Comment thread requirements-dev.txt Outdated

tburt-nv requested a review from litaotju May 14, 2026 09:20

tburt-nv approved these changes May 14, 2026

fixup! [None][fix] Reuse prior-attempt passes when infra retry fires

5596c4f

Signed-off-by: Derek Pitman <dpitman@nvidia.com>

dpitman-nvda merged commit ed89fb6 into NVIDIA:main May 14, 2026
7 checks passed

This was referenced May 18, 2026

[None][fix] Prevent SLURM dispatcher retry duplicate-upload error #14269

Merged

[TRTLLMINF-89][feat] Make L0 retries timeout-budget aware #14323

Merged

5 participants
