fix(e2e): add gotestsum --rerun-fails to retry flaky e2e test failures #8300
Closed
hsubramanianaks wants to merge 1 commit into Azure:main from
Force-pushed from 28e4d09 to cc1e823
Add --rerun-fails=1 to the gotestsum command so that failed tests are automatically rerun once before reporting failure. This handles transient infrastructure issues like systemd bookkeeping races (e.g., cloud-init temp mount units entering failed state) without requiring a full pipeline re-run. Only failed tests are rerun, not the entire suite, so the cost is minimal.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Force-pushed from cc1e823 to 0ecb590
Contributor
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a lightweight retry mechanism for flaky e2e failures by rerunning only failed Go tests once via gotestsum.
Changes:
- Adds --rerun-fails=1 to the gotestsum invocation in the e2e pipeline script
- Documents the rationale for the rerun behavior inline (transient/flaky infra issues)
Contributor
Author
Closing: PR validation requires branch to be in Azure/AgentBaker, not from a fork. Will re-create from an internal branch.
 # Run the tests! Yey!
 test_exit_code=0
-./bin/gotestsum --format testdox --junitfile "${BUILD_SRC_DIR}/e2e/report.xml" --jsonfile "${BUILD_SRC_DIR}/e2e/test-log.json" -- -parallel 150 -timeout "${E2E_GO_TEST_TIMEOUT}" || test_exit_code=$?
+./bin/gotestsum --format testdox --rerun-fails=1 --junitfile "${BUILD_SRC_DIR}/e2e/report.xml" --jsonfile "${BUILD_SRC_DIR}/e2e/test-log.json" -- -parallel 150 -timeout "${E2E_GO_TEST_TIMEOUT}" || test_exit_code=$?
Contributor
not sure about directly enabling retries like this - might be worth gating this behind a pipeline flag or something
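A minimal sketch of what such gating could look like, not something taken from the PR: a small helper emits the rerun flags only when a pipeline variable asks for them. The names E2E_RERUN_FAILS and rerun_flags are hypothetical. The --packages value is a placeholder; as far as I understand the gotestsum README, --packages has to be given when --rerun-fails is combined with go test args after --, so treat that part as an assumption to verify against the installed version.

```shell
#!/bin/sh
# Sketch only. E2E_RERUN_FAILS and rerun_flags are hypothetical names,
# not variables or helpers that exist in the repository.

# Print the extra gotestsum flags for the requested rerun count.
# A count of 0 (or no argument) prints nothing, so retries stay off.
rerun_flags() {
  count="${1:-0}"
  case "${count}" in
    *[!0-9]*)
      echo "rerun count must be a non-negative integer: ${count}" >&2
      return 1
      ;;
  esac
  if [ "${count}" -gt 0 ]; then
    # Assumption: --packages is required alongside --rerun-fails when
    # go test args follow "--"; "./..." is a placeholder package list.
    printf '%s\n' "--rerun-fails=${count}" "--packages=./..."
  fi
}

# Intended call site, left as a comment so the sketch runs without gotestsum.
# The unquoted $(...) is deliberate: each flag must become its own word.
#
# ./bin/gotestsum --format testdox $(rerun_flags "${E2E_RERUN_FAILS:-0}") \
#   --junitfile "${BUILD_SRC_DIR}/e2e/report.xml" \
#   --jsonfile "${BUILD_SRC_DIR}/e2e/test-log.json" \
#   -- -parallel 150 -timeout "${E2E_GO_TEST_TIMEOUT}" || test_exit_code=$?
```

With the variable unset or set to 0, the command line is the same as the current one, so retries would be opt-in per pipeline run.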
Collaborator
we want quick feedback, and we want to fix flakiness, not hide it.
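If reruns were ever enabled, one way to keep flaky tests visible is to surface every rerun as a pipeline warning. The sketch below is an illustration under assumptions, not part of the PR: it assumes gotestsum was also run with its --rerun-fails-report flag pointing at a file, that the file lists one rerun test per line, and that the pipeline is Azure DevOps so the task.logissue logging command applies. The function name warn_on_reruns and the warning wording are hypothetical.

```shell
#!/bin/sh
# Sketch only. warn_on_reruns is a hypothetical helper; the report file is
# assumed to come from a gotestsum run with --rerun-fails-report="<path>".

# Emit one Azure DevOps warning per non-empty line of the rerun report.
warn_on_reruns() {
  report_file="$1"
  # A missing or empty report means no test was rerun.
  [ -s "${report_file}" ] || return 0
  # The "|| [ -n ... ]" keeps a final line that lacks a trailing newline.
  while IFS= read -r line || [ -n "${line}" ]; do
    [ -n "${line}" ] || continue
    echo "##vso[task.logissue type=warning]flaky e2e test was rerun: ${line}"
  done < "${report_file}"
}
```

The run can still pass after a successful retry, but each retried test leaves a warning on the build, which gives a list of flaky tests to fix.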
Contributor
Author
Superseded by #8308 (created from internal branch).
What
Add automatic retry for failed e2e tests using gotestsum --rerun-fails=1.
Why
E2e tests have no retry mechanism today. When a transient infrastructure issue causes a single test to fail (e.g., a cloud-init temp mount entering failed systemd state), the entire pipeline fails and requires a manual re-run.
The VHD build pipelines already have retryCountOnTaskFailure: 3, but the e2e pipeline has neither an ADO-level retry nor a Go-level retry.
Example flaky failure (Build 160089239)
One test failed because a transient run-cloud\x2dinit-tmp-tmpde1rbvp9.mount systemd unit entered the failed state. All 159 other tests passed. A retry would likely have passed.
Changes
- Add --rerun-fails=1 to the gotestsum command in .pipelines/scripts/e2e_run.sh
Risk
🟢 Low: gotestsum --rerun-fails is a well-established feature. It reruns only failed tests, does not affect passing tests, and the JUnit report correctly reflects the final outcome.