feat(e2e): dynamically fetch VM extension version with caching, timeouts, and fallback#8064
Conversation
There was a problem hiding this comment.
Pull request overview
This PR updates the e2e VMSS model generation to dynamically resolve the latest Compute.AKS.Linux.AKSNode VM extension version via the Azure API and memoize the result to avoid repeated lookups.
Changes:
- Add an e2e cache key + cached wrapper for
GetLatestVMExtensionImageVersion. - Update
createVMExtensionLinuxAKSNodeto accept acontext.Contextand use the cached Azure lookup instead of a hardcoded version. - Optimize
GetLatestVMExtensionImageVersionto request only the top result usingTop+Orderby.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| e2e/test_helpers.go | Switch AKSNode extension creation to use cached latest-version lookup; update one call site for new signature. |
| e2e/scenario_test.go | Update scenario to pass a context into createVMExtensionLinuxAKSNode. |
| e2e/scenario_gpu_managed_experience_test.go | Update multiple GPU scenarios to pass a context into createVMExtensionLinuxAKSNode. |
| e2e/config/azure.go | Change extension version lookup to a server-side “top 1, order by name desc” query. |
| e2e/cache.go | Add cached wrapper + cache-key struct for VM extension latest-version lookups. |
8982d50 to
a9e9293
Compare
a9e9293 to
7400daf
Compare
0e66f03 to
cfc176f
Compare
| func Test_cachedFunc_warm_call_is_faster_than_cold(t *testing.T) { | ||
| fn := cachedFunc(func(ctx context.Context, key string) (string, error) { | ||
| // simulate a slow operation like a network call | ||
| time.Sleep(10 * time.Millisecond) | ||
| return "result", nil | ||
| }) | ||
|
|
||
| ctx := context.Background() | ||
|
|
||
| start := time.Now() | ||
| _, err := fn(ctx, "key") | ||
| coldDuration := time.Since(start) | ||
| require.NoError(t, err) | ||
|
|
||
| start = time.Now() | ||
| _, err = fn(ctx, "key") | ||
| warmDuration := time.Since(start) | ||
| require.NoError(t, err) | ||
|
|
||
| assert.Less(t, warmDuration, coldDuration, "warm (cached) call should be faster than cold call") | ||
| } |
There was a problem hiding this comment.
Test_cachedFunc_warm_call_is_faster_than_cold compares wall-clock durations to assert caching, which can be flaky due to scheduler noise (warm call can occasionally be slower). A more deterministic assertion is to track underlying function invocations (e.g., via an atomic counter) and assert it’s called exactly once per key, optionally keeping the sleep to make the cold path slower but not asserting on timing.
bb38191 to
e3dded0
Compare
Replace hardcoded VM extension version "1.406" with a cached Azure API call that fetches the latest version at runtime. Add a caching layer via CachedGetLatestVMExtensionImageVersion to avoid redundant API calls across tests. Update all callers to pass context for proper propagation. Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
…ookup Replace `sort.Slice` + last-element pattern with `slices.MaxFunc` in `GetLatestVMExtensionImageVersion` for `O(n)` instead of `O(n log n)`. Rename the less method to `cmp` using `cmp.Compare` from stdlib, and add a nil name guard before accessing the result. Move the nil name check in `parseVersion` to an early return before splitting. Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Replace the observation-only Test_CreateVMExtensionLinuxAKSNode_Timing from the GPU scenario file with proper unit tests in cache_test.go that assert caching correctness, warm call performance, per-key isolation, error caching, and struct key support. Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Prevent the ListVersions Azure API call from blocking indefinitely when the API hangs, which would cause the global 90-minute test timeout to panic and kill all parallel tests. Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
…ilure Instead of failing the test when the Azure API call to discover the latest VM extension version fails, fall back to a known-good hardcoded version (1.406) and log a warning. This prevents transient API issues like network errors or throttling from causing test failures. Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
…ison Extract GetLatestVMExtensionImageVersion logic into a testable unexported function accepting a vmExtensionImageVersionLister interface. Add 26 table-driven tests covering parseVersion, vmExtensionVersion.cmp, and getLatestVMExtensionImageVersion including edge cases for nil names, malformed versions, API errors, and empty responses. Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
fmt.Sprintf does not support the error-wrapping directive %w, which caused a build failure. The %w verb is only valid in fmt.Errorf. Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Replace fragile wall-clock timing assertions with deterministic checks for the createVMExtensionLinuxAKSNode cache test. The previous 10s/1s thresholds tested CI/network performance rather than caching logic. - Add testing.Short() skip guard for integration test - Add nil/empty checks on returned extensions and properties - Replace assert with require for fail-fast behavior - Keep cache consistency check and duration logging for diagnostics - Remove unused assert import Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
The nil guard in Logf fell through to t.Helper() and t.Logf() when the context had no testing.TB, causing a nil-pointer dereference panic. Add the missing return to match the sibling Log function's pattern. Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Update the fallback AKS VM extension version from 1.406 to 1.413 and uncomment the assertion that verifies the resolved version differs from the hardcoded fallback, ensuring the Azure API is actually reached. Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
…LinuxAKSNode The context passed to this function does not contain a testing.TB value set via toolkit.ContextWithT(), causing toolkit.Logf to emit a spurious warning concatenated into the log output. Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
There is a race condition between NPD startup and the creation of the managed GPU experience marker file during CSE provisioning. This condition only manifests in the AB e2e test because we enable the extension at the time of provisioning the VMSS. Restart NPD in GPU e2e tests before validating NPD conditions to ensure it re-reads the marker file and loads device-plugin, DCGM, and GRID license checks. Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
e3dded0 to
7eb2c58
Compare
…uts, and fallback (#8064) Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
What this PR does / why we need it:
Replaces the hardcoded VM extension version
"1.413"increateVMExtensionLinuxAKSNodewith a dynamic Azure API lookup that fetches the latest version at runtime, cached to avoid redundant API calls across parallel tests.Dynamic version discovery
CachedGetLatestVMExtensionImageVersionine2e/cache.gowith aGetLatestExtensionVersionRequeststruct key supportingLocation,ExtType, andPublishercreateVMExtensionLinuxAKSNodeine2e/test_helpers.goto accept acontext.Context, resolve location fromconfig.Config.DefaultLocation, and call the cached version lookupe2e/scenario_gpu_managed_experience_test.goande2e/scenario_test.goto passt.Context()Robustness
context.WithTimeoutincreateVMExtensionLinuxAKSNodeto preventListVersionsfrom blocking indefinitely and consuming the global 90-minute test timeout"1.413"with a warning log when the Azure API call fails (timeout, throttling, network error), instead of failing the testtoolkit.Logf: add missingreturnafter the nil guard'slog.Printffallback, matching the siblingLogfunction's pattern — without this, anytoolkit.Logfcall from a context withouttesting.TB(e.g.t.Context()) dereferences a niltRefactoring
VMExtenstionVersion→vmExtensionVersion(unexported, fixing the typo)sort.Slice+ last-element withslices.MaxFuncfor O(n) version lookupless→cmpusingcmp.Comparefrom stdlib, returning-1/0/1parseVersionas an early return, and on theslices.MaxFuncresult inGetLatestVMExtensionImageVersiongetLatestVMExtensionImageVersionfunction behind avmExtensionImageVersionListerinterface%w→%vinfmt.Sprintfpanic call ine2e/config/config.go(%wis only valid infmt.Errorf)Tests
e2e/config/azure_vmext_test.go— 26 table-driven unit tests:Test_parseVersion: 10 cases (three-part, two-part, single-part, nil name, non-numeric, partially numeric, empty string, extra parts, large numbers, leading zeros)Test_vmExtensionVersion_cmp: 9 cases (equal, major/minor/patch comparisons, zero values)Test_getLatestVMExtensionImageVersion: 7 cases (multiple versions, single version, two-part versions, API errors, empty list, nil names, mixed valid/malformed)e2e/cache_test.go— 5 unit tests forcachedFunc: consistent results, warm-vs-cold performance, per-key isolation, error caching, struct key supporte2e/scenario_gpu_managed_experience_test.go—Test_CreateVMExtensionLinuxAKSNode_Timingintegration test withtesting.Short()skip guard and deterministic cache consistency assertionsTesting Done