fix(e2e): retry VMSS creation on GalleryImageNotFound error #8239

Merged: ganeshkumarashok merged 2 commits into main from suraj/retry-on-gallery-image-not-found on Apr 3, 2026
@surajssd (Member) commented Apr 3, 2026

What this PR does / why we need it:

Adds GalleryImageNotFound as a retryable error in CreateVMSSWithRetry (e2e/vmss.go).

After VHD image replication to a non-default region completes (e.g., uaenorth for H100 GPU tests), Azure's gallery API reports success, but the compute fabric in the target region may not have fully propagated the image yet. This causes VMSS creation to fail with a transient GalleryImageNotFound error.

Previously, the retryOn closure in CreateVMSSWithRetry only retried on AllocationFailed errors. This change expands it to also retry on GalleryImageNotFound, allowing the existing retry loop (10 attempts, 5s delay) to handle the eventual consistency gap. The closure is also refactored with an early return for non-API errors for better readability.

This was observed in PR #8227's GPU E2E pipeline, where Test_Ubuntu2404_GPU_H100 and Test_Ubuntu2404_NvidiaDevicePluginRunning_MIG_H100_NoReboot failed after an 8-minute replication to uaenorth completed successfully but the subsequent VMSS PUT returned 404.

Copilot AI (Contributor) left a comment

Pull request overview

This PR improves the reliability of E2E VMSS provisioning by expanding the set of transient Azure API errors that trigger retries during VMSS creation, addressing eventual-consistency delays after SIG image replication to non-default regions.

Changes:

  • Refactors the retryOn predicate in CreateVMSSWithRetry to early-return on non-Azure-API errors.
  • Adds GalleryImageNotFound as a retryable Azure API error code alongside AllocationFailed.

surajssd added 2 commits April 3, 2026 13:22
- Add `GalleryImageNotFound` as a retryable error in `CreateVMSSWithRetry`
to handle Azure eventual consistency after image replication completes
- Refactor `retryOn` closure for clarity with early return on non-API errors

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
…ry check

Narrow the retry predicate to also verify HTTP 404 status code,
preventing retries on unexpected status codes for the same error code.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
@surajssd surajssd force-pushed the suraj/retry-on-gallery-image-not-found branch from 74d63b7 to 0c1ed59 Compare April 3, 2026 20:37
@ganeshkumarashok ganeshkumarashok merged commit 2e4c9fe into main Apr 3, 2026
26 of 31 checks passed
@ganeshkumarashok ganeshkumarashok deleted the suraj/retry-on-gallery-image-not-found branch April 3, 2026 21:46
3 participants