fix(e2e): add gotestsum --rerun-fails to retry flaky e2e test failures by hsubramanianaks · Pull Request #8300 · Azure/AgentBaker

hsubramanianaks · 2026-04-14T14:53:20Z

What

Add automatic retry for failed e2e tests using gotestsum --rerun-fails=1.

Why

E2e tests have no retry mechanism today. When a transient infrastructure issue causes a single test to fail (e.g., a cloud-init temp mount entering failed systemd state), the entire pipeline fails and requires manual re-run.

The VHD build pipelines already have retryCountOnTaskFailure: 3, but the e2e pipeline has nothing — no ADO-level retry and no Go-level retry.

Example flaky failure (Build 160089239)

DONE 160 tests, 68 skipped, 1 failure in 528.088s

One test failed because a transient run-cloud\x2dinit-tmp-tmpde1rbvp9.mount systemd unit entered the failed state. None of the other 159 tests failed (68 were skipped), so a single retry would likely have passed.

Changes

Added --rerun-fails=1 to the gotestsum command in .pipelines/scripts/e2e_run.sh
Only failed tests are rerun (not the entire suite), so overhead is minimal
If the test passes on retry, the suite passes — consistent with how gotestsum handles flaky tests
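
A minimal sketch of how the argument list from this change might be assembled, assuming the script runs from the e2e module root so that ./... names the packages under test. The function name build_gotestsum_args and its max-failures parameter are illustrative and not part of the PR. As far as the gotestsum README describes, --rerun-fails-max-failures skips the rerun when more tests fail than the given limit, and --packages is expected when --rerun-fails is combined with go test arguments after the -- separator; both are worth verifying against the gotestsum version the pipeline pins.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the gotestsum arguments one per line instead of running the tool,
# so the sketch has no side effects.
# Assumptions (not from the PR): the function name, the max-failures
# parameter, and ./... as the package list.
build_gotestsum_args() {
  local report_dir="$1" timeout="$2" max_failures="${3:-10}"
  local args=(
    --format testdox
    --rerun-fails=1                               # rerun each failed test at most once
    --rerun-fails-max-failures="${max_failures}"  # skip reruns when many tests fail
    --packages=./...                              # assumed package list
    --junitfile "${report_dir}/report.xml"
    --jsonfile "${report_dir}/test-log.json"
    --
    -parallel 150
    -timeout "${timeout}"
  )
  printf '%s\n' "${args[@]}"
}
```

Capping reruns with a max-failures limit keeps the retry from masking a broad regression: a run with many failures is reported as failed without spending time on reruns.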

Risk

🟢 Low — gotestsum --rerun-fails is a well-established feature. It only reruns failed tests, does not affect passing tests, and the JUnit report correctly reflects the final outcome.

Add --rerun-fails=1 to the gotestsum command so that failed tests are automatically rerun once before reporting failure. This handles transient infrastructure issues like systemd bookkeeping races (e.g., cloud-init temp mount units entering failed state) without requiring a full pipeline re-run. Only failed tests are rerun, not the entire suite, so the cost is minimal.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a lightweight retry mechanism for flaky e2e failures by rerunning only failed Go tests once via gotestsum.

Changes:

Adds --rerun-fails=1 to the gotestsum invocation in the e2e pipeline script
Documents the rationale for the rerun behavior inline (transient/flaky infra issues)

hsubramanianaks · 2026-04-14T15:21:22Z

Closing: PR validation requires branch to be in Azure/AgentBaker, not from a fork. Will re-create from an internal branch.

cameronmeissner · 2026-04-14T19:08:19Z

.pipelines/scripts/e2e_run.sh

 # Run the tests! Yey!
 test_exit_code=0
-./bin/gotestsum --format testdox --junitfile "${BUILD_SRC_DIR}/e2e/report.xml" --jsonfile "${BUILD_SRC_DIR}/e2e/test-log.json" -- -parallel 150 -timeout "${E2E_GO_TEST_TIMEOUT}" || test_exit_code=$?
+./bin/gotestsum --format testdox --rerun-fails=1 --junitfile "${BUILD_SRC_DIR}/e2e/report.xml" --jsonfile "${BUILD_SRC_DIR}/e2e/test-log.json" -- -parallel 150 -timeout "${E2E_GO_TEST_TIMEOUT}" || test_exit_code=$?


not sure about directly enabling retries like this - might be worth gating this behind a pipeline flag or something

we want quick feedback, and we want to fix flakiness, not hide it.
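
One way the gating suggested above could look, as a sketch only: reruns stay off unless a pipeline variable opts in. E2E_RERUN_FAILS and rerun_flag are hypothetical names that do not appear in the PR or in AgentBaker.

```shell
#!/usr/bin/env bash
set -euo pipefail

# E2E_RERUN_FAILS is a hypothetical pipeline variable (an assumption):
# unset or 0 disables reruns; a positive integer is the rerun attempt count.
rerun_flag() {
  local attempts="${E2E_RERUN_FAILS:-0}"
  # Accept only plain non-negative integers without leading zeros.
  if ! [[ "${attempts}" =~ ^(0|[1-9][0-9]*)$ ]]; then
    echo "E2E_RERUN_FAILS must be a non-negative integer, got '${attempts}'" >&2
    return 1
  fi
  # Print the flag only when reruns were explicitly requested.
  if (( attempts > 0 )); then
    printf -- '--rerun-fails=%s\n' "${attempts}"
  fi
}
```

With this shape, default runs behave exactly as before and still surface flaky tests immediately, while a pipeline that sets the variable would append the printed flag to the existing gotestsum invocation.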

hsubramanianaks · 2026-04-14T19:12:42Z

Superseded by #8308 (created from internal branch).

Copilot AI review requested due to automatic review settings April 14, 2026 14:53

hsubramanianaks temporarily deployed to test April 14, 2026 14:53 — with GitHub Actions Inactive

hsubramanianaks force-pushed the fix/e2e-retry-flaky-tests branch from 28e4d09 to cc1e823 Compare April 14, 2026 14:54

hsubramanianaks requested review from AbelHu, Devinwong, SriHarsha001, awesomenix, calvin197, cameronmeissner, djsly, ganeshkumarashok, junjiezhang1997, lilypan26, mxj220, pdamianov-dev, phealy, r2k1, sulixu, surajssd, timmy-wright and zachary-bailey as code owners April 14, 2026 14:54

hsubramanianaks temporarily deployed to test April 14, 2026 14:55 — with GitHub Actions Inactive

hsubramanianaks force-pushed the fix/e2e-retry-flaky-tests branch from cc1e823 to 0ecb590 Compare April 14, 2026 14:58

hsubramanianaks temporarily deployed to test April 14, 2026 14:58 — with GitHub Actions Inactive

Copilot AI reviewed Apr 14, 2026


hsubramanianaks closed this Apr 14, 2026

Copilot started reviewing on behalf of hsubramanianaks April 14, 2026 15:26

hsubramanianaks reopened this Apr 14, 2026

hsubramanianaks temporarily deployed to test April 14, 2026 15:26 — with GitHub Actions Inactive

cameronmeissner reviewed Apr 14, 2026


hsubramanianaks temporarily deployed to test April 14, 2026 19:12 — with GitHub Actions Inactive

hsubramanianaks closed this Apr 14, 2026

hsubramanianaks mentioned this pull request Apr 14, 2026

fix(e2e): add gotestsum --rerun-fails to retry flaky e2e test failures #8308

Closed

hsubramanianaks wants to merge 1 commit into Azure:main from hsubramanianaks:fix/e2e-retry-flaky-tests
