Skip to content

feat(e2e): dynamically fetch VM extension version with caching, timeouts, and fallback#8064

Merged
ganeshkumarashok merged 12 commits intomainfrom
suraj/optimize-extensio-version-lookup
Apr 1, 2026
Merged

feat(e2e): dynamically fetch VM extension version with caching, timeouts, and fallback#8064
ganeshkumarashok merged 12 commits intomainfrom
suraj/optimize-extensio-version-lookup

Conversation

@surajssd
Copy link
Copy Markdown
Member

@surajssd surajssd commented Mar 10, 2026

What this PR does / why we need it:

Replaces the hardcoded VM extension version "1.413" in createVMExtensionLinuxAKSNode with a dynamic Azure API lookup that fetches the latest version at runtime, cached to avoid redundant API calls across parallel tests.

Dynamic version discovery

  • Add CachedGetLatestVMExtensionImageVersion in e2e/cache.go with a GetLatestExtensionVersionRequest struct key supporting Location, ExtType, and Publisher
  • Update createVMExtensionLinuxAKSNode in e2e/test_helpers.go to accept a context.Context, resolve location from config.Config.DefaultLocation, and call the cached version lookup
  • Update all 7 callers in e2e/scenario_gpu_managed_experience_test.go and e2e/scenario_test.go to pass t.Context()

Robustness

  • Add a 30s context.WithTimeout in createVMExtensionLinuxAKSNode to prevent ListVersions from blocking indefinitely and consuming the global 90-minute test timeout
  • Fall back to the known-good hardcoded version "1.413" with a warning log when the Azure API call fails (timeout, throttling, network error), instead of failing the test
  • Fix nil-pointer panic in toolkit.Logf: add missing return after the nil guard's log.Printf fallback, matching the sibling Log function's pattern — without this, any toolkit.Logf call from a context without testing.TB (e.g. t.Context()) dereferences a nil t

Refactoring

  • Rename VMExtenstionVersionvmExtensionVersion (unexported, fixing the typo)
  • Replace sort.Slice + last-element with slices.MaxFunc for O(n) version lookup
  • Rename lesscmp using cmp.Compare from stdlib, returning -1/0/1
  • Add nil-name guard in parseVersion as an early return, and on the slices.MaxFunc result in GetLatestVMExtensionImageVersion
  • Extract testable getLatestVMExtensionImageVersion function behind a vmExtensionImageVersionLister interface
  • Fix %w%v in fmt.Sprintf panic call in e2e/config/config.go (%w is only valid in fmt.Errorf)

Tests

  • e2e/config/azure_vmext_test.go — 26 table-driven unit tests:
    • Test_parseVersion: 10 cases (three-part, two-part, single-part, nil name, non-numeric, partially numeric, empty string, extra parts, large numbers, leading zeros)
    • Test_vmExtensionVersion_cmp: 9 cases (equal, major/minor/patch comparisons, zero values)
    • Test_getLatestVMExtensionImageVersion: 7 cases (multiple versions, single version, two-part versions, API errors, empty list, nil names, mixed valid/malformed)
  • e2e/cache_test.go — 5 unit tests for cachedFunc: consistent results, warm-vs-cold performance, per-key isolation, error caching, struct key support
  • e2e/scenario_gpu_managed_experience_test.goTest_CreateVMExtensionLinuxAKSNode_Timing integration test with testing.Short() skip guard and deterministic cache consistency assertions

Testing Done

cd e2e
go test -run Test_CreateVMExtensionLinuxAKSNode_Timing -v -count=1
go test -run "Test_cachedFunc*" -v

cd config
go test -run Test_parseVersion -v
go test -run Test_vmExtensionVersion_cmp -v
go test -run Test_getLatestVMExtensionImageVersion -v

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the e2e VMSS model generation to dynamically resolve the latest Compute.AKS.Linux.AKSNode VM extension version via the Azure API and memoize the result to avoid repeated lookups.

Changes:

  • Add an e2e cache key + cached wrapper for GetLatestVMExtensionImageVersion.
  • Update createVMExtensionLinuxAKSNode to accept a context.Context and use the cached Azure lookup instead of a hardcoded version.
  • Optimize GetLatestVMExtensionImageVersion to request only the top result using Top + Orderby.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
e2e/test_helpers.go Switch AKSNode extension creation to use cached latest-version lookup; update one call site for new signature.
e2e/scenario_test.go Update scenario to pass a context into createVMExtensionLinuxAKSNode.
e2e/scenario_gpu_managed_experience_test.go Update multiple GPU scenarios to pass a context into createVMExtensionLinuxAKSNode.
e2e/config/azure.go Change extension version lookup to a server-side “top 1, order by name desc” query.
e2e/cache.go Add cached wrapper + cache-key struct for VM extension latest-version lookups.

@surajssd surajssd force-pushed the suraj/optimize-extensio-version-lookup branch from 8982d50 to a9e9293 Compare March 10, 2026 22:29
@surajssd surajssd force-pushed the suraj/optimize-extensio-version-lookup branch from a9e9293 to 7400daf Compare March 11, 2026 05:22
Copilot AI review requested due to automatic review settings March 11, 2026 05:22
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

@surajssd surajssd changed the title test(e2e): add timing test for createVMExtensionLinuxAKSNode feat(e2e): dynamically fetch VM extension version with caching, timeouts, and fallback Mar 11, 2026
@surajssd surajssd force-pushed the suraj/optimize-extensio-version-lookup branch from 0e66f03 to cfc176f Compare March 26, 2026 21:42
@surajssd surajssd requested a review from SriHarsha001 as a code owner March 26, 2026 21:42
Copilot AI review requested due to automatic review settings March 26, 2026 23:38
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Comment on lines +32 to +52
func Test_cachedFunc_warm_call_is_faster_than_cold(t *testing.T) {
fn := cachedFunc(func(ctx context.Context, key string) (string, error) {
// simulate a slow operation like a network call
time.Sleep(10 * time.Millisecond)
return "result", nil
})

ctx := context.Background()

start := time.Now()
_, err := fn(ctx, "key")
coldDuration := time.Since(start)
require.NoError(t, err)

start = time.Now()
_, err = fn(ctx, "key")
warmDuration := time.Since(start)
require.NoError(t, err)

assert.Less(t, warmDuration, coldDuration, "warm (cached) call should be faster than cold call")
}
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test_cachedFunc_warm_call_is_faster_than_cold compares wall-clock durations to assert caching, which can be flaky due to scheduler noise (warm call can occasionally be slower). A more deterministic assertion is to track underlying function invocations (e.g., via an atomic counter) and assert it’s called exactly once per key, optionally keeping the sleep to make the cold path slower but not asserting on timing.

Copilot uses AI. Check for mistakes.
@surajssd surajssd force-pushed the suraj/optimize-extensio-version-lookup branch from bb38191 to e3dded0 Compare March 27, 2026 17:08
surajssd added 12 commits March 30, 2026 10:12
Replace hardcoded VM extension version "1.406" with a cached Azure API
call that fetches the latest version at runtime. Add a caching layer
via CachedGetLatestVMExtensionImageVersion to avoid redundant API calls
across tests. Update all callers to pass context for proper propagation.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
…ookup

Replace `sort.Slice` + last-element pattern with `slices.MaxFunc` in
`GetLatestVMExtensionImageVersion` for `O(n)` instead of `O(n log n)`.
Rename the less method to `cmp` using `cmp.Compare` from stdlib, and
add a nil name guard before accessing the result.

Move the nil name check in `parseVersion` to an early return before
splitting.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Replace the observation-only Test_CreateVMExtensionLinuxAKSNode_Timing
from the GPU scenario file with proper unit tests in cache_test.go that
assert caching correctness, warm call performance, per-key isolation,
error caching, and struct key support.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Prevent the ListVersions Azure API call from blocking indefinitely
when the API hangs, which would cause the global 90-minute test
timeout to panic and kill all parallel tests.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
…ilure

Instead of failing the test when the Azure API call to discover the
latest VM extension version fails, fall back to a known-good hardcoded
version (1.406) and log a warning. This prevents transient API issues
like network errors or throttling from causing test failures.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
…ison

Extract GetLatestVMExtensionImageVersion logic into a testable unexported
function accepting a vmExtensionImageVersionLister interface. Add 26
table-driven tests covering parseVersion, vmExtensionVersion.cmp, and
getLatestVMExtensionImageVersion including edge cases for nil names,
malformed versions, API errors, and empty responses.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
fmt.Sprintf does not support the error-wrapping directive %w, which
caused a build failure. The %w verb is only valid in fmt.Errorf.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Replace fragile wall-clock timing assertions with deterministic checks
for the createVMExtensionLinuxAKSNode cache test. The previous 10s/1s
thresholds tested CI/network performance rather than caching logic.

- Add testing.Short() skip guard for integration test
- Add nil/empty checks on returned extensions and properties
- Replace assert with require for fail-fast behavior
- Keep cache consistency check and duration logging for diagnostics
- Remove unused assert import

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
The nil guard in Logf fell through to t.Helper() and t.Logf() when
the context had no testing.TB, causing a nil-pointer dereference panic.
Add the missing return to match the sibling Log function's pattern.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Update the fallback AKS VM extension version from 1.406 to 1.413 and
uncomment the assertion that verifies the resolved version differs from
the hardcoded fallback, ensuring the Azure API is actually reached.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
…LinuxAKSNode

The context passed to this function does not contain a testing.TB
value set via toolkit.ContextWithT(), causing toolkit.Logf to emit
a spurious warning concatenated into the log output.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
There is a race condition between NPD startup and the creation of the managed
GPU experience marker file during CSE provisioning. This condition only
manifests in the AB e2e test because we enable the extension at the time of
provisioning the VMSS. Restart NPD in GPU e2e tests before validating NPD
conditions to ensure it re-reads the marker file and loads device-plugin, DCGM,
and GRID license checks.

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Copilot AI review requested due to automatic review settings March 30, 2026 17:12
@surajssd surajssd force-pushed the suraj/optimize-extensio-version-lookup branch from e3dded0 to 7eb2c58 Compare March 30, 2026 17:12
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.

@ganeshkumarashok ganeshkumarashok merged commit 0746fd9 into main Apr 1, 2026
33 of 35 checks passed
@ganeshkumarashok ganeshkumarashok deleted the suraj/optimize-extensio-version-lookup branch April 1, 2026 21:03
timmy-wright pushed a commit that referenced this pull request Apr 2, 2026
…uts, and fallback (#8064)

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants