fix(e2e): wait on RegionalReplicationStatus before declaring SIG image usable #8373
Conversation
…ions E2E was reporting "Replication took: 12m" via `ensureReplication`, then immediately failing on VMSS create with `404 GalleryImageNotFound` for ~10 more minutes.

Root cause: `replicatedToCurrentRegion` only checks `PublishingProfile.TargetRegions` (intent), and the parent LRO returns Succeeded before per-region replicas are actually serving traffic. The `vmss.go` retry loop (10 x 5s) was far too short to absorb the gap and surfaced the infra failure as `failed to create VMSS after 10 retries`, indistinguishable from a real test failure.

Fix:
- `ensureReplication` now polls Get with `Expand=ReplicationStatus` until `RegionalReplicationStatus.State == Completed` (20 min timeout, 15s interval). Fails fast on terminal `Failed`.
- `CreateVMSSWithRetry` no longer burns fixed retries on 404 `GalleryImageNotFound`. It calls the replication poller, which either returns quickly when the region is `Completed` (fabric-eventual-consistency race -> retry once), blocks until `Completed`, or fails fast on `Failed` with the real cause attached.
- Add a public `WaitForImageVersionReplicatedToRegion` entrypoint and unit tests for the region-name normalization in `findRegionalReplicationStatus`.

This should eliminate the `GalleryImageNotFound` class of false E2E failures across Linux VHD, Windows VHD, GPU, and base AgentBaker E2E pipelines.
Superseded by #8374 — recreated from a branch on Azure/AgentBaker so validate-pull-request-source passes (this repo blocks PRs from forks).
Pull request overview
This PR improves AgentBaker E2E reliability by ensuring Azure Compute Gallery (SIG) images are actually serving in a target region (via RegionalReplicationStatus.State == Completed) before the framework proceeds to VMSS creation, reducing noisy GalleryImageNotFound cascades.
Changes:
- Update SIG replication logic to wait on per-region `ReplicationStatus` rather than relying on the parent provisioning LRO.
- Enhance VMSS create retry handling to consult live replication state when `GalleryImageNotFound` occurs.
- Add unit tests for region/status lookup normalization and nil-tolerance.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `e2e/config/azure.go` | Adds regional replication polling and exposes a public helper for VMSS retry logic to confirm regional availability. |
| `e2e/vmss.go` | Classifies GalleryImageNotFound and consults the replication-state poller before retrying VMSS creation. |
| `e2e/config/azure_replication_test.go` | Adds table-driven tests for locating regional replication status across casing/spacing variants and nil entries. |
```go
if isGalleryImageMissing && s.VHD != nil && s.Runtime != nil && s.Runtime.Cluster != nil && s.Runtime.Cluster.Model != nil && s.Runtime.Cluster.Model.Location != nil {
	location := *s.Runtime.Cluster.Model.Location
	toolkit.Logf(ctx, "GalleryImageNotFound on VMSS create in %s; consulting regional replication status before retrying", location)
```
The long nil-check chain before calling WaitForImageVersionReplicatedToRegion is effectively redundant here because this function already dereferences s.Runtime.Cluster.Model.Properties.NodeResourceGroup earlier in the loop. Either rely on the existing invariants and simplify this condition, or restructure the code to safely extract location/nodeResourceGroup once up-front (so the nil checks actually add safety).
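The reviewer's "extract once up-front" suggestion can be sketched with a small accessor. The struct shapes below are minimal hypothetical stand-ins for the scenario/cluster model types (not AgentBaker's real definitions); the point is returning a plain `(string, bool)` so the hot path never repeats the pointer chain.

```go
package main

import "fmt"

// Minimal stand-ins for the scenario/cluster model chain referenced in the
// review comment (hypothetical shapes for illustration only).
type ClusterModel struct{ Location *string }
type Cluster struct{ Model *ClusterModel }
type Runtime struct{ Cluster *Cluster }
type Scenario struct{ Runtime *Runtime }

// clusterLocation walks the pointer chain exactly once, up front, so later
// code can branch on a simple ok flag instead of re-checking every link.
func clusterLocation(s *Scenario) (string, bool) {
	if s == nil || s.Runtime == nil || s.Runtime.Cluster == nil ||
		s.Runtime.Cluster.Model == nil || s.Runtime.Cluster.Model.Location == nil {
		return "", false
	}
	return *s.Runtime.Cluster.Model.Location, true
}

func main() {
	loc := "westus2"
	s := &Scenario{Runtime: &Runtime{Cluster: &Cluster{Model: &ClusterModel{Location: &loc}}}}
	if location, ok := clusterLocation(s); ok {
		fmt.Println("location:", location)
	}
	if _, ok := clusterLocation(&Scenario{}); !ok {
		fmt.Println("nil chain handled safely")
	}
}
```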
```go
resp, err := imgVersionClient.Get(ctx, image.Gallery.ResourceGroupName, image.Gallery.Name, image.Name, *version.Name, getOpts)
if err != nil {
	// Transient API errors should not abort the wait; the SDK already retries throttling.
	toolkit.Logf(ctx, "transient error getting image version replication status (will retry): %v", err)
	return false, nil
}
```
waitForRegionalReplicationCompleted currently treats any Get error as transient and keeps polling. This will mask permanent failures (e.g., 401/403 auth/RBAC issues, 404 if the version was deleted, malformed request) and can turn them into a 20-minute timeout with a generic error. Suggest inspecting err as *azcore.ResponseError and failing fast for non-retryable status codes, only continuing on clearly transient cases (e.g., 429/5xx/timeouts).
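The transient-vs-permanent split the reviewer asks for ultimately reduces to classifying the HTTP status. In the real code this would come from unwrapping the error as `*azcore.ResponseError`; the self-contained sketch below (function name hypothetical) shows just the decision logic:

```go
package main

import (
	"fmt"
	"net/http"
)

// isRetryableStatus reports whether an HTTP status from the gallery Get call
// should keep the poll alive. Throttling (429), client timeouts (408), and
// 5xx server errors are worth retrying; auth/RBAC failures (401/403), a hard
// 404 (version deleted), and malformed requests (400) are permanent and
// should fail the wait immediately instead of burning the 20-minute budget.
func isRetryableStatus(code int) bool {
	switch code {
	case http.StatusTooManyRequests, http.StatusRequestTimeout:
		return true
	}
	return code >= 500 // 5xx: transient server-side trouble
}

func main() {
	for _, code := range []int{429, 503, 401, 403, 404} {
		fmt.Printf("%d retryable=%v\n", code, isRetryableStatus(code))
	}
}
```

In the poller, a non-retryable status would translate to `return false, err` (abort with the real cause) rather than `return false, nil` (keep polling).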
```go
if err != nil {
	return fmt.Errorf("create image version client: %w", err)
}
resp, err := imgVersionClient.Get(ctx, image.Gallery.ResourceGroupName, image.Gallery.Name, image.Name, image.Version, nil)
```
The doc comment for WaitForImageVersionReplicatedToRegion says it "fetches the live image version with ReplicationStatus expanded", but the initial Get(...) uses nil options (no Expand). Either update the comment to reflect that the polling Get uses Expand=ReplicationStatus, or pass the expand option to this initial Get to match the comment.
```diff
-resp, err := imgVersionClient.Get(ctx, image.Gallery.ResourceGroupName, image.Gallery.Name, image.Name, image.Version, nil)
+resp, err := imgVersionClient.Get(ctx, image.Gallery.ResourceGroupName, image.Gallery.Name, image.Name, image.Version, &armcompute.GalleryImageVersionsClientGetOptions{
+	Expand: to.Ptr("ReplicationStatus"),
+})
```
```go
	// Surface the precise infra cause instead of a generic "failed after N retries".
	return vm, fmt.Errorf("VMSS create returned GalleryImageNotFound and regional replication is not usable: %w (original error: %v)", waitErr, err)
}
toolkit.Logf(ctx, "Regional replication confirmed Completed; retrying VMSS create immediately")
```
In the GalleryImageNotFound path, the continue bypasses the attempt >= maxAttempts guard and the normal delay backoff. If the compute fabric keeps returning 404 even after replication is Completed, this loop can spin indefinitely and/or hot-loop without sleeping, which is both noisy and can trigger throttling. Consider enforcing the max-attempts check before continue, and apply the normal delay (or explicitly limit this branch to a single immediate retry).
```diff
-toolkit.Logf(ctx, "Regional replication confirmed Completed; retrying VMSS create immediately")
+if attempt >= maxAttempts {
+	return vm, fmt.Errorf("failed to create VMSS after %d retries: %w", maxAttempts, err)
+}
+toolkit.Logf(ctx, "Regional replication confirmed Completed; retrying VMSS create after %v", delay)
+select {
+case <-ctx.Done():
+	return vm, err
+case <-time.After(delay):
+}
```
Summary
Fixes `GalleryImageNotFound` cascading failures in AgentBaker E2E. `ensureReplication` now waits on per-region `RegionalReplicationStatus.State == Completed` instead of trusting the parent provisioning LRO. The `vmss.go` retry loop classifies the 404 against actual replication state — failing fast on terminal `Failed`, blocking on `Replicating`, retrying once on the genuinely transient fabric-consistency case.

Background
On PR #8228's GPU E2E run (build 161141895), 22 of 368 tests failed with the same shape:
The pipeline log shows `ensureReplication` had already emitted `Replication took: 12m2.7s` for that region — i.e. it returned success — and then VMSS create kept 404'ing for ~10 more minutes.

Root cause
`replicatedToCurrentRegion` checks `PublishingProfile.TargetRegions`, which records intent to replicate, not the actual regional state. The Azure Compute Gallery Get API returns `provisioningState=Succeeded` once the parent LRO finishes, but `RegionalReplicationStatus.State` for the new region can remain `Replicating` for several more minutes before it transitions to `Completed`. During that window, the `imageReference` is unresolvable from `Microsoft.Compute/virtualMachineScaleSets` create — yielding the 404.

The existing retry loop in `vmss.go` (10 attempts × 5s = ~50s) was orders of magnitude shorter than that window, and it surfaced the failure as `failed to create VMSS after 10 retries` — visually indistinguishable from a real test failure. PR authors had no way to tell that the change-under-test was never actually exercised.

Why fix here, not in the pipeline YAML
Considered and rejected:
- `t.Skip` on `GalleryImageNotFound` — silently masks real bugs. PR authors would think their change was validated when it wasn't.
- A longer fixed retry/sleep — can't distinguish `Failed` from `Replicating`, wastes CRP quota, and still produces a misleading error message.

The framework already owns "ensure image is usable in this region" (`ensureReplication`). The right fix is to make that primitive honor its contract.

What changed
`e2e/config/azure.go`
- `waitForRegionalReplicationCompleted`: polls the gallery image version with `Expand=ReplicationStatusTypesReplicationStatus` (15s interval, 20 min cap) until `RegionalReplicationStatus.State == Completed`. Treats `Replicating`/`Unknown` as transient. Fails fast with attached `Details` on terminal `Failed`.
- `ensureReplication` now calls it on both paths: when a region is still `Replicating` (this is exactly what bit PR #8228, "feat: skip MIG reboot for H100 and newer GPU architectures"), and when the parent LRO already reports `Succeeded` with the regional state still `Replicating`.
- Public `WaitForImageVersionReplicatedToRegion` for the retry loop in `vmss.go` — fetches the live version and delegates to the same poller.

`e2e/vmss.go`
- `CreateVMSSWithRetry`'s classifier now distinguishes `GalleryImageNotFound` from `AllocationFailed`. On the 404 it consults the actual regional replication state via `WaitForImageVersionReplicatedToRegion`.

`e2e/config/azure_replication_test.go`
- Table-driven tests for `findRegionalReplicationStatus` covering nil status, case/whitespace normalization in both directions (ARM returns `"West US 2"`, our location strings are `"westus2"`), nil-entry tolerance, and a missing region.

Behavior changes
- Before: `failed to create VMSS after 10 retries` failures every time a fresh VHD ran multi-region E2E. Test logs blamed the test code.
- After: `ensureReplication` blocks until the regional replica is actually serving — in the usual case, no retry pressure on CRP at all. If something is genuinely wrong with replication, you get a clear `regional replication of <id> to <region> failed: <Azure-provided details>` instead of `failed to create VMSS after 10 retries`. If it's the rare fabric-eventual-consistency case (LRO and ReplicationStatus both Completed but the compute fabric not yet caught up), the existing retry catches it without a long fixed sleep.

Test plan
- `go build ./...` clean
- `go test ./config/ -run TestFindRegionalReplicationStatus -v -vet=off` passes (vet-off only because of an unrelated pre-existing nit in `config/config.go:182`)
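The normalization those tests exercise — matching ARM's display name `"West US 2"` against the framework's compact `"westus2"` — presumably reduces to lowercasing and stripping spaces. A minimal self-contained sketch (helper name hypothetical, not the PR's actual `findRegionalReplicationStatus` internals):

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeRegion folds an ARM display name or a compact location string to
// one canonical form so the two spellings compare equal.
func normalizeRegion(name string) string {
	return strings.ToLower(strings.ReplaceAll(name, " ", ""))
}

func main() {
	fmt.Println(normalizeRegion("West US 2") == normalizeRegion("westus2")) // true
	fmt.Println(normalizeRegion("East US"))                                 // eastus
}
```

Normalizing both sides (rather than only the ARM value) is what makes the comparison work "both ways", as the test description puts it.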