[None][fix] Reuse prior-attempt passes when infra retry fires #14002

Merged
dpitman-nvda merged 2 commits into NVIDIA:main from dpitman-nvda:fix/reuse-passed-tests-on-retry
May 14, 2026
Conversation

@dpitman-nvda
Collaborator

@dpitman-nvda dpitman-nvda commented May 11, 2026

Summary by CodeRabbit

  • Chores
    • Enhanced infrastructure retry handling with improved recovery mechanism that reuses previously passing tests from earlier pipeline attempts
    • Added capability to extract and identify passed test cases from JUnit XML result files for better test tracking

Description

When the infra-failure retry loop fires (e.g. a stage hits the SLURM walltime "DUE TO TIME LIMIT" pattern, the K8s pod gets evicted with "Reason: Evicted", etc.), each retry attempt today runs the test list from scratch even though the prior attempt may have completed many passes before the failure. The existing reusePassedTestResults() only queries OpenSearch, which is populated after a pipeline run completes -- it has no visibility into passes from the current run's earlier attempts because those attempts never reached completion.

Observed in a real build of L0_Test-SBSA-Multi-GPU #974: the first attempt timed out partway through; the retry attempt re-ran the entire test list including everything that had already passed.

Fix: reusePassedTestResults() now also downloads any prior-attempt tarballs from this build's own Artifactory upload path and parses their results.xml for passes, merging them with the OpenSearch list before appending the union to waives.txt as SKIPs.

  • reusePassedTestResults() takes a new postTag arg.
  • New helper priorAttemptTags(postTag) decodes the suffix and returns the postTag values used by earlier attempts of the same stage in this build.
  • For each prior tag, wget the tarball from ${UPLOAD_PATH}/test-results/. 404 is benign and skipped.
  • Untar, find results.xml, feed into a new test_rerun.py extract_passed_tests mode that walks JUnit XML and emits passed test names via the existing parse_name helper.
  • Merge with OpenSearch results, dedupe, append to waives.txt with the same SKIP-reason marker -- downstream pytest waive-list handling is unchanged.
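The JUnit-parsing step above can be sketched roughly as follows. This is a minimal illustration, not the code from test_rerun.py: the function name `passed_test_names` and the `classname::name` naming scheme are assumptions made for the example (the real script derives names with its existing parse_name helper), and the standard-library parser is used only to keep the sketch self-contained.

```python
import xml.etree.ElementTree as ET

# Child elements that mark a testcase as not passed in JUnit XML.
NOT_PASSED_TAGS = ("failure", "error", "skipped")


def passed_test_names(xml_text):
    """Return de-duplicated names of passed testcases, in document order."""
    root = ET.fromstring(xml_text)
    seen = set()
    passed = []
    for case in root.iter("testcase"):
        # A testcase counts as passed only if it has none of these children.
        if any(case.find(tag) is not None for tag in NOT_PASSED_TAGS):
            continue
        # Hypothetical naming scheme; the real script uses parse_name.
        name = f"{case.get('classname', '')}::{case.get('name', '')}"
        if name not in seen:
            seen.add(name)
            passed.append(name)
    return passed


SAMPLE = """<testsuites><testsuite>
  <testcase classname="test_a" name="test_ok"/>
  <testcase classname="test_a" name="test_bad"><failure message="boom"/></testcase>
  <testcase classname="test_a" name="test_skip"><skipped/></testcase>
  <testcase classname="test_a" name="test_ok"/>
</testsuite></testsuites>"""

print(passed_test_names(SAMPLE))  # ['test_a::test_ok']
```

Treating skipped testcases as not passed matters here: a test that was waived in the earlier attempt did not actually run, so it should not be reported as a reusable pass.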

Threading:

  • runLLMTestlistOnPlatformImpl(): new postTag parameter.
  • runLLMTestlistOnPlatform() and executeLLMTestOnSlurm() pass it through.
  • The two reusePassedTestResults() call sites (SLURM path at L1298, K8s path at L3176) now pass postTag.

If REUSE_TEST is explicitly disabled, the whole function is short-circuited at the call site, so the new behaviour is opt-in via the existing flag.

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@dpitman-nvda
Collaborator Author

/bot run

@coderabbitai
Contributor

coderabbitai Bot commented May 11, 2026

Walkthrough

This PR extends the Jenkins CI pipeline's test reuse logic to recover passed tests from previous infra retry attempts. When a pipeline retries due to infrastructure failures, it now extracts passed tests from prior attempt artifacts, deduplicates them, and reuses them. A new test extraction utility parses JUnit XML, a helper decodes retry tags, and the core reuse function orchestrates recovery across SLURM and Kubernetes platforms.
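As a rough illustration of the dedupe-and-reuse step, the sketch below merges two lists of passed tests and formats them as waive entries. The helper names (`merge_passed`, `waive_lines`, `append_waives`) are hypothetical, and the `<test name> <marker>` line layout is an assumption; only the marker text itself is taken from this PR.

```python
# Marker text quoted in this PR; the surrounding line layout is assumed.
SKIP_REASON = "SKIP (Reused from previous pipeline attempt)"


def merge_passed(opensearch_passed, prior_attempt_passed):
    """Union of both sources; first occurrence wins and order is preserved."""
    return list(dict.fromkeys([*opensearch_passed, *prior_attempt_passed]))


def waive_lines(test_names):
    """Format each test name as one waive entry (assumed layout)."""
    return [f"{name} {SKIP_REASON}" for name in test_names]


def append_waives(path, test_names):
    """Append the waive entries to an existing waives file."""
    with open(path, "a", encoding="utf-8") as f:
        for line in waive_lines(test_names):
            f.write(line + "\n")


merged = merge_passed(["t1", "t2"], ["t2", "t3"])
print(merged)  # ['t1', 't2', 't3']
```

Using `dict.fromkeys` keeps the merge order-stable, which makes the appended block in the waives file deterministic across retries.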

Changes

Prior Attempt Test Recovery

| Layer / File(s) | Summary |
| --- | --- |
| Test Extraction Utility (jenkins/scripts/test_rerun.py) | New extract_passed_tests(output_file, xml_filenames) parses JUnit XML files, identifies testcases with no failure/error/skipped elements, deduplicates test names, and writes to output file. CLI subcommand extract_passed_tests added with --output-file and --input-files arguments. |
| Prior Attempt Tag Decoder (jenkins/L0_Test.groovy) | New priorAttemptTags(String postTag) helper function decodes retry attempt suffix format (e.g., -attempt-N) and returns prior attempt tags to search for recovery; returns empty list when no retry context detected. |
| Reuse Logic with Prior Attempt Recovery (jenkins/L0_Test.groovy) | reusePassedTestResults(llmSrc, stageName, waivesTxt, String postTag) rewritten to accept postTag parameter. When postTag indicates an infra retry, downloads prior attempt tarballs (results-${stageName}${priorTag}.tar.gz) from Artifactory, extracts passed tests via test_rerun.py extract_passed_tests, deduplicates across sources (OpenSearch + prior attempts), and appends to waives file with SKIP (Reused from previous pipeline attempt) marker. |
| Function Signature Extension (jenkins/L0_Test.groovy) | runLLMTestlistOnPlatformImpl signature extended with String postTag="" trailing parameter to enable threading of retry attempt tag through test execution path. |
| Call Site Wiring (jenkins/L0_Test.groovy) | Four call sites updated to pass postTag: (1) SLURM runner in executeLLMTestOnSlurm passes typeCheck and postTag to runLLMTestlistOnPlatformImpl; (2) SLURM-sbatch path passes postTag to reusePassedTestResults; (3) Kubernetes/non-sbatch path passes postTag to reusePassedTestResults; (4) Main wrapper finally block forwards typeCheck and postTag into platform implementation. |
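To make the tag decoding concrete, here is a Python sketch of what a helper like priorAttemptTags might do. The PR only states that the suffix looks like -attempt-N; the numbering scheme assumed below (attempts 1 through N-1 all carry an explicit suffix) and the function names are assumptions for illustration, and the real Groovy helper may treat the first attempt differently.

```python
import re

# Assumed suffix shape, based on the "-attempt-N" example in the summary.
ATTEMPT_SUFFIX = re.compile(r"^(?P<base>.*)-attempt-(?P<n>\d+)$")


def prior_attempt_tags(post_tag):
    """Return tags of earlier attempts, or [] when there is no retry suffix."""
    match = ATTEMPT_SUFFIX.match(post_tag)
    if match is None:
        return []
    base, n = match.group("base"), int(match.group("n"))
    # Assumption: every earlier attempt k (1 <= k < n) used "-attempt-k".
    return [f"{base}-attempt-{k}" for k in range(1, n)]


def prior_tarball_names(stage_name, post_tag):
    """Build tarball names following results-${stageName}${priorTag}.tar.gz."""
    return [
        f"results-{stage_name}{tag}.tar.gz"
        for tag in prior_attempt_tags(post_tag)
    ]


print(prior_attempt_tags("-attempt-3"))  # ['-attempt-1', '-attempt-2']
print(prior_tarball_names("StageA", "-attempt-2"))  # ['results-StageA-attempt-1.tar.gz']
```

Returning an empty list for a tag without a retry suffix keeps the first attempt on the original OpenSearch-only path.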

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: reusing passed tests from prior attempts when infrastructure retries occur. |
| Description check | ✅ Passed | The PR description provides comprehensive explanation of the problem, solution, and implementation details including threading of parameters. However, the Test Coverage section is empty, which is a required section in the template. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@jenkins/L0_Test.groovy`:
- Around line 2825-2831: Only treat a genuine HTTP 404 from the prior-artifact
fetch as benign: change the wget + fetchStatus logic to capture wget/curl
response and inspect for a 404 string (or use curl -fS to get HTTP status) and
only on an explicit 404 return/skip; for any other non-zero fetchStatus
(network/auth/other HTTP errors) fail the build or surface the error instead of
silently skipping. Also remove the unconditional "|| true" from the tar
extraction so that tar failures surface; locate the variables/commands
fetchStatus, priorDir, tarUrl, tarName and the wget/tar invocations in this
block and implement the conditional 404 check and fail-on-other-errors behavior.

In `@jenkins/scripts/test_rerun.py`:
- Around line 19-33: The extract_passed_tests function is parsing untrusted
JUnit XML artifacts with xml.etree.ElementTree (vulnerable to XXE/XML bombs);
switch to a hardened parser by replacing uses/imports of xml.etree.ElementTree
with defusedxml.ElementTree (e.g., import defusedxml.ElementTree as ET) in
jenkins/scripts/test_rerun.py so ET.parse(...) uses the defused implementation,
and add defusedxml to the project's requirements.txt so the package is
installed.
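The first finding (treat only a genuine 404 as benign) can be illustrated with a short Python sketch. This is not the pipeline's Groovy/wget code and not necessarily what the PR ended up doing; `fetch_prior_tarball` and its `opener` parameter are hypothetical names, with `opener` injected only so the behavior can be exercised without a network.

```python
import urllib.error
import urllib.request


def fetch_prior_tarball(url, dest_path, opener=urllib.request.urlopen):
    """Download url to dest_path.

    Returns True on success and False only when the server answers 404
    (no artifact from that attempt). Any other failure propagates so the
    caller sees it instead of silently skipping.
    """
    try:
        with opener(url) as resp, open(dest_path, "wb") as out:
            out.write(resp.read())
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False  # benign: that attempt uploaded nothing
        raise  # auth, server, or other HTTP errors must surface
    return True
```

Network-level failures raise `urllib.error.URLError`, which is not caught here, so they also propagate rather than being mistaken for a missing artifact.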

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 2ed10fac-b4bb-4d46-bb89-ab81b15aedb3

📥 Commits

Reviewing files that changed from the base of the PR and between f3570a8 and 8ceb366.

📒 Files selected for processing (2)
  • jenkins/L0_Test.groovy
  • jenkins/scripts/test_rerun.py

Comment thread jenkins/L0_Test.groovy Outdated
Comment thread jenkins/scripts/test_rerun.py
@tensorrt-cicd
Collaborator

PR_Github #47766 [ run ] triggered by Bot. Commit: 8ceb366 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47766 [ run ] completed with state ABORTED. Commit: 8ceb366

Link to invocation

@dpitman-nvda dpitman-nvda force-pushed the fix/reuse-passed-tests-on-retry branch from 8ceb366 to 4a432e7 Compare May 11, 2026 16:16
@dpitman-nvda dpitman-nvda requested a review from a team as a code owner May 11, 2026 16:16
@dpitman-nvda
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #47774 [ run ] triggered by Bot. Commit: 698d449 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47774 [ run ] completed with state SUCCESS. Commit: 698d449
/LLM/main/L0_MergeRequest_PR pipeline #37666 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@dpitman-nvda
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #47787 [ run ] triggered by Bot. Commit: 698d449 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47787 [ run ] completed with state SUCCESS. Commit: 698d449
/LLM/main/L0_MergeRequest_PR pipeline #37678 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Comment thread requirements.txt Outdated
@dpitman-nvda dpitman-nvda force-pushed the fix/reuse-passed-tests-on-retry branch from 698d449 to d13e931 Compare May 12, 2026 15:57
@dpitman-nvda
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #47996 [ run ] triggered by Bot. Commit: e54f9db Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47996 [ run ] completed with state FAILURE. Commit: e54f9db
/LLM/main/L0_MergeRequest_PR pipeline #37833 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@dpitman-nvda
Collaborator Author

/bot run

Signed-off-by: Derek Pitman <dpitman@nvidia.com>
@dpitman-nvda dpitman-nvda force-pushed the fix/reuse-passed-tests-on-retry branch from e54f9db to 313b2a2 Compare May 13, 2026 14:50
@dpitman-nvda
Collaborator Author

/bot skip --comment "We got through single-GPU runs already, this should be good to go"

@tensorrt-cicd
Collaborator

PR_Github #48199 [ skip ] triggered by Bot. Commit: 313b2a2 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48199 [ skip ] completed with state SUCCESS. Commit: 313b2a2
Skipping testing for commit 313b2a2

Link to invocation

@tburt-nv tburt-nv requested a review from yiqingy0 May 14, 2026 09:19
Comment thread requirements-dev.txt Outdated
Comment thread requirements-dev.txt Outdated
@tburt-nv tburt-nv requested a review from litaotju May 14, 2026 09:20
Signed-off-by: Derek Pitman <dpitman@nvidia.com>
@dpitman-nvda dpitman-nvda dismissed litaotju’s stale review May 14, 2026 14:56

Obsolete given the dependency has been dropped

@dpitman-nvda
Collaborator Author

/bot skip --comment "No material changes since last update"

@tensorrt-cicd
Collaborator

PR_Github #48382 [ skip ] triggered by Bot. Commit: 5596c4f Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #48382 [ skip ] completed with state SUCCESS. Commit: 5596c4f
Skipping testing for commit 5596c4f

Link to invocation

@dpitman-nvda dpitman-nvda merged commit ed89fb6 into NVIDIA:main May 14, 2026
7 checks passed

5 participants