[None][fix] Reuse prior-attempt passes when infra retry fires#14002
Conversation
|
/bot run |
📝 WalkthroughWalkthroughThis PR extends the Jenkins CI pipeline's test reuse logic to recover passed tests from previous infra retry attempts. When a pipeline retries due to infrastructure failures, it now extracts passed tests from prior attempt artifacts, deduplicates them, and reuses them. A new test extraction utility parses JUnit XML, a helper decodes retry tags, and the core reuse function orchestrates recovery across SLURM and Kubernetes platforms. ChangesPrior Attempt Test Recovery
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@jenkins/L0_Test.groovy`:
- Around line 2825-2831: Only treat a genuine HTTP 404 from the prior-artifact
fetch as benign: change the wget + fetchStatus logic to capture wget/curl
response and inspect for a 404 string (or use curl -fS to get HTTP status) and
only on an explicit 404 return/skip; for any other non-zero fetchStatus
(network/auth/other HTTP errors) fail the build or surface the error instead of
silently skipping. Also remove the unconditional "|| true" from the tar
extraction so that tar failures surface; locate the variables/commands
fetchStatus, priorDir, tarUrl, tarName and the wget/tar invocations in this
block and implement the conditional 404 check and fail-on-other-errors behavior.
In `@jenkins/scripts/test_rerun.py`:
- Around line 19-33: The extract_passed_tests function is parsing untrusted
JUnit XML artifacts with xml.etree.ElementTree (vulnerable to XXE/XML bombs);
switch to a hardened parser by replacing uses/imports of xml.etree.ElementTree
with defusedxml.ElementTree (e.g., import defusedxml.ElementTree as ET) in
jenkins/scripts/test_rerun.py so ET.parse(...) uses the defused implementation,
and add defusedxml to the project's requirements.txt so the package is
installed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 2ed10fac-b4bb-4d46-bb89-ab81b15aedb3
📒 Files selected for processing (2)
jenkins/L0_Test.groovyjenkins/scripts/test_rerun.py
|
PR_Github #47766 [ run ] triggered by Bot. Commit: |
|
PR_Github #47766 [ run ] completed with state |
8ceb366 to
4a432e7
Compare
|
/bot run |
|
PR_Github #47774 [ run ] triggered by Bot. Commit: |
|
PR_Github #47774 [ run ] completed with state
|
|
/bot run |
|
PR_Github #47787 [ run ] triggered by Bot. Commit: |
|
PR_Github #47787 [ run ] completed with state
|
698d449 to
d13e931
Compare
|
/bot run |
|
PR_Github #47996 [ run ] triggered by Bot. Commit: |
|
PR_Github #47996 [ run ] completed with state
|
|
/bot run |
When the infra-failure retry loop fires (e.g. a stage hits the SLURM walltime "DUE TO TIME LIMIT" pattern, the K8s pod gets evicted with "Reason: Evicted", etc.), each retry attempt today runs the test list from scratch even though the prior attempt may have completed many passes before the failure. The existing reusePassedTestResults() only queries OpenSearch, which is populated *after* a pipeline run completes -- it has no visibility into passes from the current run's earlier attempts because those attempts never reached completion. Observed in a real build of L0_Test-SBSA-Multi-GPU NVIDIA#974: the first attempt timed out partway through; the retry attempt re-ran the entire test list including everything that had already passed. Fix: reusePassedTestResults() now also downloads any prior-attempt tarballs from this build's own Artifactory upload path and parses their results.xml for passes, merging them with the OpenSearch list before appending the union to waives.txt as SKIPs. - reusePassedTestResults() takes a new postTag arg. - New helper priorAttemptTags(postTag) decodes the suffix and returns the postTag values used by earlier attempts of the same stage in this build. - For each prior tag, wget the tarball from ${UPLOAD_PATH}/test-results/. 404 is benign and skipped. - Untar, find results.xml, feed into a new test_rerun.py extract_passed_tests mode that walks JUnit XML and emits passed test names via the existing parse_name helper. - Merge with OpenSearch results, dedupe, append to waives.txt with the same SKIP-reason marker -- downstream pytest waive-list handling is unchanged. Threading: - runLLMTestlistOnPlatformImpl(): new postTag parameter. - runLLMTestlistOnPlatform() and executeLLMTestOnSlurm() pass it through. - The two reusePassedTestResults() call sites (SLURM path at L1298, K8s path at L3176) now pass postTag. If REUSE_TEST is explicitly disabled, the whole function is short- circuited at the call site, so the new behaviour is opt-in via the existing flag. Signed-off-by: Derek Pitman <dpitman@nvidia.com>
e54f9db to
313b2a2
Compare
|
/bot skip --comment "We got through single-GPU runs already, this should be good to go" |
|
PR_Github #48199 [ skip ] triggered by Bot. Commit: |
|
PR_Github #48199 [ skip ] completed with state |
Signed-off-by: Derek Pitman <dpitman@nvidia.com>
Obsolete given the dependency has been dropped
|
/bot skip --comment "No material changes since last update" |
|
PR_Github #48382 [ skip ] triggered by Bot. Commit: |
|
PR_Github #48382 [ skip ] completed with state |
Summary by CodeRabbit
Description
When the infra-failure retry loop fires (e.g. a stage hits the SLURM walltime "DUE TO TIME LIMIT" pattern, the K8s pod gets evicted with "Reason: Evicted", etc.), each retry attempt today runs the test list from scratch even though the prior attempt may have completed many passes before the failure. The existing reusePassedTestResults() only queries OpenSearch, which is populated after a pipeline run completes -- it has no visibility into passes from the current run's earlier attempts because those attempts never reached completion.
Observed in a real build of L0_Test-SBSA-Multi-GPU #974: the first attempt timed out partway through; the retry attempt re-ran the entire test list including everything that had already passed.
Fix: reusePassedTestResults() now also downloads any prior-attempt tarballs from this build's own Artifactory upload path and parses their results.xml for passes, merging them with the OpenSearch list before appending the union to waives.txt as SKIPs.
Threading:
If REUSE_TEST is explicitly disabled, the whole function is short- circuited at the call site, so the new behaviour is opt-in via the existing flag.
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.