fix(e2e): wait on RegionalReplicationStatus before declaring SIG image usable #8373
Conversation
…ions E2E was reporting "Replication took: 12m" via `ensureReplication`, then immediately failing on VMSS create with `404 GalleryImageNotFound` for ~10 more minutes.

Root cause: `replicatedToCurrentRegion` only checks `PublishingProfile.TargetRegions` (intent), and the parent LRO returns Succeeded before per-region replicas are actually serving traffic. The `vmss.go` retry loop (10 x 5s) was far too short to absorb the gap and surfaced the infra failure as `failed to create VMSS after 10 retries`, indistinguishable from a real test failure.

Fix:
- `ensureReplication` now polls Get with `Expand=ReplicationStatus` until `RegionalReplicationStatus.State == Completed` (20 min timeout, 15s interval). Fails fast on terminal `Failed`.
- `CreateVMSSWithRetry` no longer burns fixed retries on 404 `GalleryImageNotFound`. It calls the replication poller, which either returns quickly when the region is `Completed` (fabric-eventual-consistency race -> retry once), blocks until `Completed`, or fails fast on `Failed` with the real cause attached.
- Add a public `WaitForImageVersionReplicatedToRegion` entrypoint and unit tests for the region-name normalization in `findRegionalReplicationStatus`.

This should eliminate the `GalleryImageNotFound` class of false E2E failures across Linux VHD, Windows VHD, GPU, and base AgentBaker E2E pipelines.
Superseded by #8374 — recreated from a branch on Azure/AgentBaker so validate-pull-request-source passes (this repo blocks PRs from forks).
Pull request overview
This PR improves AgentBaker E2E reliability by ensuring Azure Compute Gallery (SIG) images are actually serving in a target region (via RegionalReplicationStatus.State == Completed) before the framework proceeds to VMSS creation, reducing noisy GalleryImageNotFound cascades.
Changes:
- Update SIG replication logic to wait on per-region `ReplicationStatus` rather than relying on the parent provisioning LRO.
- Enhance VMSS create retry handling to consult live replication state when `GalleryImageNotFound` occurs.
- Add unit tests for region/status lookup normalization and nil-tolerance.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `e2e/config/azure.go` | Adds regional replication polling and exposes a public helper for VMSS retry logic to confirm regional availability. |
| `e2e/vmss.go` | Classifies GalleryImageNotFound and consults the replication-state poller before retrying VMSS creation. |
| `e2e/config/azure_replication_test.go` | Adds table-driven tests for locating regional replication status across casing/spacing variants and nil entries. |
```go
if isGalleryImageMissing && s.VHD != nil && s.Runtime != nil && s.Runtime.Cluster != nil && s.Runtime.Cluster.Model != nil && s.Runtime.Cluster.Model.Location != nil {
	location := *s.Runtime.Cluster.Model.Location
	toolkit.Logf(ctx, "GalleryImageNotFound on VMSS create in %s; consulting regional replication status before retrying", location)
```
The long nil-check chain before calling WaitForImageVersionReplicatedToRegion is effectively redundant here because this function already dereferences s.Runtime.Cluster.Model.Properties.NodeResourceGroup earlier in the loop. Either rely on the existing invariants and simplify this condition, or restructure the code to safely extract location/nodeResourceGroup once up-front (so the nil checks actually add safety).
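The reviewer's "extract once up-front" suggestion can be sketched with a small accessor. The struct shapes below are minimal hypothetical stand-ins for the scenario/cluster model types (not AgentBaker's real definitions); the point is returning a plain `(string, bool)` so the hot path never repeats the pointer chain.

```go
package main

import "fmt"

// Minimal stand-ins for the scenario/cluster model chain referenced in the
// review comment (hypothetical shapes for illustration only).
type ClusterModel struct{ Location *string }
type Cluster struct{ Model *ClusterModel }
type Runtime struct{ Cluster *Cluster }
type Scenario struct{ Runtime *Runtime }

// clusterLocation walks the pointer chain exactly once, up front, so later
// code can branch on a simple ok flag instead of re-checking every link.
func clusterLocation(s *Scenario) (string, bool) {
	if s == nil || s.Runtime == nil || s.Runtime.Cluster == nil ||
		s.Runtime.Cluster.Model == nil || s.Runtime.Cluster.Model.Location == nil {
		return "", false
	}
	return *s.Runtime.Cluster.Model.Location, true
}

func main() {
	loc := "westus2"
	s := &Scenario{Runtime: &Runtime{Cluster: &Cluster{Model: &ClusterModel{Location: &loc}}}}
	if location, ok := clusterLocation(s); ok {
		fmt.Println("location:", location)
	}
	if _, ok := clusterLocation(&Scenario{}); !ok {
		fmt.Println("nil chain handled safely")
	}
}
```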
```go
resp, err := imgVersionClient.Get(ctx, image.Gallery.ResourceGroupName, image.Gallery.Name, image.Name, *version.Name, getOpts)
if err != nil {
	// Transient API errors should not abort the wait; the SDK already retries throttling.
	toolkit.Logf(ctx, "transient error getting image version replication status (will retry): %v", err)
	return false, nil
}
```
waitForRegionalReplicationCompleted currently treats any Get error as transient and keeps polling. This will mask permanent failures (e.g., 401/403 auth/RBAC issues, 404 if the version was deleted, malformed request) and can turn them into a 20-minute timeout with a generic error. Suggest inspecting err as *azcore.ResponseError and failing fast for non-retryable status codes, only continuing on clearly transient cases (e.g., 429/5xx/timeouts).
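The transient-vs-permanent split the reviewer asks for ultimately reduces to classifying the HTTP status. In the real code this would come from unwrapping the error as `*azcore.ResponseError`; the self-contained sketch below (function name hypothetical) shows just the decision logic:

```go
package main

import (
	"fmt"
	"net/http"
)

// isRetryableStatus reports whether an HTTP status from the gallery Get call
// should keep the poll alive. Throttling (429), client timeouts (408), and
// 5xx server errors are worth retrying; auth/RBAC failures (401/403), a hard
// 404 (version deleted), and malformed requests (400) are permanent and
// should fail the wait immediately instead of burning the 20-minute budget.
func isRetryableStatus(code int) bool {
	switch code {
	case http.StatusTooManyRequests, http.StatusRequestTimeout:
		return true
	}
	return code >= 500 // 5xx: transient server-side trouble
}

func main() {
	for _, code := range []int{429, 503, 401, 403, 404} {
		fmt.Printf("%d retryable=%v\n", code, isRetryableStatus(code))
	}
}
```

In the poller, a non-retryable status would translate to `return false, err` (abort with the real cause) rather than `return false, nil` (keep polling).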
```go
if err != nil {
	return fmt.Errorf("create image version client: %w", err)
}
resp, err := imgVersionClient.Get(ctx, image.Gallery.ResourceGroupName, image.Gallery.Name, image.Name, image.Version, nil)
```
The doc comment for WaitForImageVersionReplicatedToRegion says it "fetches the live image version with ReplicationStatus expanded", but the initial Get(...) uses nil options (no Expand). Either update the comment to reflect that the polling Get uses Expand=ReplicationStatus, or pass the expand option to this initial Get to match the comment.
```diff
-resp, err := imgVersionClient.Get(ctx, image.Gallery.ResourceGroupName, image.Gallery.Name, image.Name, image.Version, nil)
+resp, err := imgVersionClient.Get(ctx, image.Gallery.ResourceGroupName, image.Gallery.Name, image.Name, image.Version, &armcompute.GalleryImageVersionsClientGetOptions{
+	Expand: to.Ptr("ReplicationStatus"),
+})
```
```go
	// Surface the precise infra cause instead of a generic "failed after N retries".
	return vm, fmt.Errorf("VMSS create returned GalleryImageNotFound and regional replication is not usable: %w (original error: %v)", waitErr, err)
}
toolkit.Logf(ctx, "Regional replication confirmed Completed; retrying VMSS create immediately")
```
In the GalleryImageNotFound path, the continue bypasses the attempt >= maxAttempts guard and the normal delay backoff. If the compute fabric keeps returning 404 even after replication is Completed, this loop can spin indefinitely and/or hot-loop without sleeping, which is both noisy and can trigger throttling. Consider enforcing the max-attempts check before continue, and apply the normal delay (or explicitly limit this branch to a single immediate retry).
```diff
-toolkit.Logf(ctx, "Regional replication confirmed Completed; retrying VMSS create immediately")
+if attempt >= maxAttempts {
+	return vm, fmt.Errorf("failed to create VMSS after %d retries: %w", maxAttempts, err)
+}
+toolkit.Logf(ctx, "Regional replication confirmed Completed; retrying VMSS create after %v", delay)
+select {
+case <-ctx.Done():
+	return vm, err
+case <-time.After(delay):
+}
```
Summary
Fixes `GalleryImageNotFound` cascading failures in AgentBaker E2E. `ensureReplication` now waits on per-region `RegionalReplicationStatus.State == Completed` instead of trusting the parent provisioning LRO. The `vmss.go` retry loop classifies the 404 against actual replication state — failing fast on terminal `Failed`, blocking on `Replicating`, retrying once on the genuinely transient fabric-consistency case.

Background
On PR #8228's GPU E2E run (build 161141895), 22 of 368 tests failed with the same shape:
The pipeline log shows `ensureReplication` had already emitted `Replication took: 12m2.7s` for that region — i.e. it returned success — and then VMSS create kept 404'ing for ~10 more minutes.

Root cause
`replicatedToCurrentRegion` checks `PublishingProfile.TargetRegions`, which records intent to replicate, not the actual regional state. The Azure Compute Gallery Get API returns `provisioningState=Succeeded` once the parent LRO finishes, but `RegionalReplicationStatus.State` for the new region can remain `Replicating` for several more minutes before it transitions to `Completed`. During that window, the `imageReference` is unresolvable from `Microsoft.Compute/virtualMachineScaleSets` create — yielding the 404.

The existing retry loop in `vmss.go` (10 attempts × 5s = ~50s) was orders of magnitude shorter than that window, and it surfaced the failure as `failed to create VMSS after 10 retries` — visually indistinguishable from a real test failure. PR authors had no way to tell that the change-under-test was never actually exercised.

Why fix here, not in the pipeline YAML
Considered and rejected:
- `t.Skip` on `GalleryImageNotFound` — silently masks real bugs. PR authors would think their change was validated when it wasn't.
- A longer fixed retry/sleep — can't distinguish `Failed` from `Replicating`, wastes CRP quota, and still produces a misleading error message.

The framework already owns "ensure image is usable in this region" (`ensureReplication`). The right fix is to make that primitive honor its contract.

What changed
`e2e/config/azure.go`
- `waitForRegionalReplicationCompleted`: polls the gallery image version with `Expand=ReplicationStatusTypesReplicationStatus` (15s interval, 20 min cap) until `RegionalReplicationStatus.State == Completed`. Treats `Replicating`/`Unknown` as transient. Fails fast with attached `Details` on terminal `Failed`.
- `ensureReplication` now calls it on both paths: when a region is still `Replicating` (this is exactly what bit PR #8228, "feat: skip MIG reboot for H100 and newer GPU architectures"), and when the parent LRO already reports `Succeeded` with the regional state still `Replicating`.
- Public `WaitForImageVersionReplicatedToRegion` for the retry loop in `vmss.go` — fetches the live version and delegates to the same poller.

`e2e/vmss.go`
- `CreateVMSSWithRetry`'s classifier now distinguishes `GalleryImageNotFound` from `AllocationFailed`. On the 404 it consults the actual regional replication state via `WaitForImageVersionReplicatedToRegion`.

`e2e/config/azure_replication_test.go`
- Table-driven tests for `findRegionalReplicationStatus` covering nil status, case/whitespace normalization in both directions (ARM returns `"West US 2"`, our location strings are `"westus2"`), nil-entry tolerance, and a missing region.

Behavior changes
- Before: `failed to create VMSS after 10 retries` failures every time a fresh VHD ran multi-region E2E. Test logs blamed the test code.
- After: `ensureReplication` blocks until the regional replica is actually serving — in the usual case, no retry pressure on CRP at all. If something is genuinely wrong with replication, you get a clear `regional replication of <id> to <region> failed: <Azure-provided details>` instead of `failed to create VMSS after 10 retries`. If it's the rare fabric-eventual-consistency case (LRO and ReplicationStatus both Completed but the compute fabric not yet caught up), the existing retry catches it without a long fixed sleep.

Test plan
- `go build ./...` clean
- `go test ./config/ -run TestFindRegionalReplicationStatus -v -vet=off` passes (vet-off only because of an unrelated pre-existing nit in `config/config.go:182`)
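The normalization those tests exercise — matching ARM's display name `"West US 2"` against the framework's compact `"westus2"` — presumably reduces to lowercasing and stripping spaces. A minimal self-contained sketch (helper name hypothetical, not the PR's actual `findRegionalReplicationStatus` internals):

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeRegion folds an ARM display name or a compact location string to
// one canonical form so the two spellings compare equal.
func normalizeRegion(name string) string {
	return strings.ToLower(strings.ReplaceAll(name, " ", ""))
}

func main() {
	fmt.Println(normalizeRegion("West US 2") == normalizeRegion("westus2")) // true
	fmt.Println(normalizeRegion("East US"))                                 // eastus
}
```

Normalizing both sides (rather than only the ARM value) is what makes the comparison work "both ways", as the test description puts it.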