fix(e2e): unblock Windows sysprep when VMAgentDisabler.dll load stalls #8544
Open
r2k1 wants to merge 4 commits into
Windows2022 VHDCaching scenarios have been failing at the Sysprep /generalize step in PR check-in runs since ~May 9 2026. The Sysprep RunCommand never completes within the test's vmssCtx budget (TestTimeoutVMSS - prepareAKSNode time, ~14m), and the validation step fails with 'context deadline exceeded'.

Root cause: VMAgentDisabler.dll is a Sysprep provider shipped by the Windows Azure Guest Agent. The agent self-updates from Azure fabric on every boot, and in Jan 2026 added a WDAC catalog file install feature (msazure ADO PR 14499782) for the DLL. The feature had bugs (hotfixes 14889344 / 14901019) and rolled out unevenly Feb-May 2026. On hosts where the catalog install failed, Code Integrity cannot validate the DLL and LoadLibrary stalls long enough to exhaust our test timeout. This matches a 2020 incident (ICM 210726081); the existing vhdbuilder/packer/windows/sysprep.ps1 already has the same workaround during VHD bake.

Causal proof: on a healthy Win2022 host where sysprep normally completes in ~10s, renaming VMAgentDisabler.dll while leaving the SysPrepExternal\Generalize registry entry intact reproduces the stall.

Fix (e2e/test_helpers.go):
- New windowsSysprepScript that removes any SysPrepExternal\Generalize registry value pointing at VMAgentDisabler.dll before invoking Sysprep, then polls ImageState until generalization completes.
- Replaces the inline sysprep invocation in CreateImage; reads res.Output / res.Error instead of marshaling JSON.

Migrate RunCommand from v1 (VMSSVM.BeginRunCommand) to v2 (VMSSVMRunCommands.BeginCreateOrUpdate). v2 is the supported path going forward and matches the migration done in aks-rp PR 15721814 to avoid the 'Keyset does not exist' failure mode of the v1 extension on newer Windows hosts. Two call sites in validators.go refactored to use the new wrapper.

Verified: Test_Windows2022_VHDCaching_LegacyTLSBootstrap passes end-to-end in ~9m36s with sysprep completing in ~1m, vs hanging out the full vmssCtx on broken hosts before this change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR updates the e2e harness to mitigate a recurring Windows2022 VHDCaching flake where Sysprep /generalize can hang when the SysPrepExternal\Generalize registry points at VMAgentDisabler.dll, and modernizes the test harness to use the VMSS RunCommand v2 API surface for script execution.
Changes:
- Introduces a VMSS RunCommand v2 wrapper that uses `VirtualMachineRunCommand` (v2) and fetches the `instanceView` for stdout/stderr.
- Adds a Windows sysprep script that removes `SysPrepExternal\Generalize` entries referencing `VMAgentDisabler.dll` and polls `ImageState` until generalize completion.
- Refactors Linux SSH-related validators to consume `stdout`/`stderr` directly from the new RunCommand wrapper instead of marshaling full JSON.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| e2e/test_helpers.go | Adds RunCommand v2 wrapper and a Windows sysprep script with registry cleanup + ImageState polling; updates CreateImage to use it. |
| e2e/validators.go | Refactors validator RunCommand call sites to use the new wrapper and parse stdout/stderr directly. |
Comments suppressed due to low confidence (1)
e2e/test_helpers.go:733
- CreateImage only checks `err` from RunCommand; with RunCommand v2 the ARM operation can succeed even when the guest script fails (non-zero exit code / error output). Please fail fast here based on the runcommand instance view result (exit code/execution state), otherwise the test may proceed to capture a non-generalized disk and produce confusing downstream failures.

		if stderr != "" {
			s.T.Logf("Sysprep stderr: %s", stderr)
		}
		require.NoErrorf(s.T, err, "failed to run sysprep on Windows VM for image creation")
	}
Comment on lines +621 to +624
+	// VirtualMachineRunCommand resources persist on the VM until explicitly deleted;
+	// use a unique name per call so concurrent / repeated calls don't collide.
+	runCommandName := fmt.Sprintf("e2e-runcmd-%d", time.Now().UnixNano())
+
Comment on lines +651 to +655
+	if getResp.Properties == nil || getResp.Properties.InstanceView == nil {
+		return armcompute.VirtualMachineRunCommandInstanceView{}, errors.New("RunCommand result missing instance view")
+	}
+	return *getResp.Properties.InstanceView, nil
 }
Previously the poll wrote a line every 10s for up to 10 min (~60 lines). Log only when ImageState changes (typically 2-3 lines for a normal sysprep run) to stay well under RunCommand's stdout cap and keep the test log readable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
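The log-on-change idea in this commit can be sketched in Go as follows. This is only an illustration: the PR's actual poll runs inside the PowerShell sysprep script, and `pollUntilState`, its parameters, and the sample state sequence are hypothetical names chosen for this sketch. The target state used in `main` is likewise an assumption, not taken from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// pollUntilState calls read once per attempt until it returns want or the
// attempt budget is spent. It logs only when the observed state differs from
// the previous reading, so a long wait produces a few lines rather than one
// line per tick. All names here are hypothetical, not the PR's identifiers.
func pollUntilState(read func() string, want string, attempts int, interval time.Duration, logf func(string, ...any)) error {
	last := ""
	for i := 0; i < attempts; i++ {
		state := read()
		if state != last {
			logf("ImageState changed: %q -> %q", last, state)
			last = state
		}
		if state == want {
			return nil
		}
		time.Sleep(interval)
	}
	return errors.New("timed out waiting for state " + want)
}

func main() {
	// Simulated readings; a real caller would query the guest instead.
	readings := []string{
		"IMAGE_STATE_COMPLETE",
		"IMAGE_STATE_COMPLETE",
		"IMAGE_STATE_COMPLETE",
		"IMAGE_STATE_GENERALIZE_RESEAL_TO_OOBE",
	}
	i := 0
	read := func() string {
		s := readings[i]
		if i < len(readings)-1 {
			i++
		}
		return s
	}
	logf := func(format string, args ...any) {
		fmt.Println(fmt.Sprintf(format, args...))
	}
	// Assumed target state for this sketch only.
	err := pollUntilState(read, "IMAGE_STATE_GENERALIZE_RESEAL_TO_OOBE", 10, time.Millisecond, logf)
	if err != nil {
		fmt.Println("error:", err)
	}
}
```

With four simulated readings the loop emits two log lines, one per distinct state, instead of one per poll.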
The ARM CreateOrUpdate operation reports success when the RunCommand extension successfully runs the script, regardless of whether the script itself succeeded. A non-zero exit, PowerShell throw, or timeout inside the script only shows up in InstanceView.ExecutionState / ExitCode (per https://learn.microsoft.com/en-us/azure/virtual-machines/windows/run-command-managed). Without this check the helper returns nil err on a failed script, and callers like CreateImage proceed to capture a non-generalized VM, the exact silent-failure mode our sysprep poll throw was designed to catch. Return a descriptive error including ExecutionState / ExitCode / stdout / stderr so require.NoError fails with actionable info.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
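A minimal sketch of the kind of check this commit describes, assuming the helper inspects the execution state and exit code. The body of the PR's `runCommandScriptError` is not shown on this page, so `scriptError` and the local `instanceView` struct below are hypothetical stand-ins rather than the armcompute types. The state strings follow the Azure execution-state values.

```go
package main

import "fmt"

// instanceView is a local stand-in for the fields of the run-command instance
// view that matter here. It is NOT the armcompute SDK type.
type instanceView struct {
	ExecutionState string // e.g. "Succeeded", "Failed", "TimedOut"
	ExitCode       *int32 // nil when the service did not report one
	Output         string // captured stdout
	Error          string // captured stderr
}

// scriptError returns nil only when the script itself succeeded. A successful
// ARM operation says nothing about the script, so both the execution state and
// the exit code (when present) are checked.
func scriptError(v instanceView) error {
	stateOK := v.ExecutionState == "Succeeded"
	exitOK := v.ExitCode == nil || *v.ExitCode == 0
	if stateOK && exitOK {
		return nil
	}
	exit := "unknown"
	if v.ExitCode != nil {
		exit = fmt.Sprint(*v.ExitCode)
	}
	return fmt.Errorf("RunCommand script failed: state=%s exitCode=%s stdout=%q stderr=%q",
		v.ExecutionState, exit, v.Output, v.Error)
}

func main() {
	code := int32(1)
	failed := instanceView{
		ExecutionState: "Failed",
		ExitCode:       &code,
		Error:          "sysprep did not reach the expected ImageState",
	}
	fmt.Println(scriptError(failed))
	fmt.Println(scriptError(instanceView{ExecutionState: "Succeeded"}))
}
```

Treating a missing exit code as acceptable when the state is "Succeeded" is a design choice of this sketch; a stricter helper could require both.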
Comment on lines 614 to 617
 	start := time.Now()
 	defer func() {
-		elapsed := time.Since(start)
-		toolkit.Logf(ctx, "Command %q took %s", command, elapsed)
+		toolkit.Logf(ctx, "Command %q took %s", command, time.Since(start))
 	}()
Comment on lines +621 to +655
+	// VirtualMachineRunCommand resources persist on the VM until explicitly deleted;
+	// use a unique name per call so concurrent / repeated calls don't collide.
+	runCommandName := fmt.Sprintf("e2e-runcmd-%d", time.Now().UnixNano())
+
+	runCmd := armcompute.VirtualMachineRunCommand{
+		Location: to.Ptr(s.Location),
+		Properties: &armcompute.VirtualMachineRunCommandProperties{
+			Source: &armcompute.VirtualMachineRunCommandScriptSource{
+				Script: to.Ptr(command),
+			},
+			AsyncExecution: to.Ptr(false),
+		},
+	}
+
+	poller, err := config.Azure.VMSSVMRunCommands.BeginCreateOrUpdate(ctx, rg, s.Runtime.VMSSName, instanceID, runCommandName, runCmd, nil)
 	if err != nil {
-		return armcompute.RunCommandResult{}, fmt.Errorf("failed to run command on Windows VM for image creation: %w", err)
+		return armcompute.VirtualMachineRunCommandInstanceView{}, fmt.Errorf("failed to start RunCommand on VMSS VM: %w", err)
 	}
+	if _, err := poller.PollUntilDone(ctx, nil); err != nil {
+		return armcompute.VirtualMachineRunCommandInstanceView{}, fmt.Errorf("failed to wait for RunCommand on VMSS VM: %w", err)
+	}

-	runResp, err := runPoller.PollUntilDone(ctx, nil)
+	// The CreateOrUpdate response doesn't always include the InstanceView; fetch it
+	// explicitly so we get stdout/stderr/exit code.
+	getResp, err := config.Azure.VMSSVMRunCommands.Get(ctx, rg, s.Runtime.VMSSName, instanceID, runCommandName, &armcompute.VirtualMachineScaleSetVMRunCommandsClientGetOptions{
+		Expand: to.Ptr("instanceView"),
+	})
 	if err != nil {
-		return runResp.RunCommandResult, fmt.Errorf("failed to run command on Windows VM for image creation: %w", err)
+		return armcompute.VirtualMachineRunCommandInstanceView{}, fmt.Errorf("failed to get RunCommand instance view: %w", err)
 	}
+	if getResp.Properties == nil || getResp.Properties.InstanceView == nil {
+		return armcompute.VirtualMachineRunCommandInstanceView{}, errors.New("RunCommand result missing instance view")
+	}
+	view := *getResp.Properties.InstanceView
+	return view, runCommandScriptError(view)
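The two-step shape of this hunk (create and wait, then read the resource again with the instance view expanded, guarding against a missing view) can be exercised without Azure by putting the calls behind a small interface. The sketch below is an assumption-laden illustration: `runCommandAPI`, `runScript`, `fakeAPI`, and the stand-in structs are hypothetical names, not part of the PR or the SDK, and context plumbing is omitted for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// instanceView and runCommand are local stand-ins for the SDK response shapes.
type instanceView struct {
	Output string
	Error  string
}

type runCommand struct {
	InstanceView *instanceView // nil unless the read expanded it
}

// runCommandAPI is a hypothetical seam over the two service calls.
type runCommandAPI interface {
	CreateAndWait(name, script string) error
	Get(name string, expandInstanceView bool) (*runCommand, error)
}

// runScript creates the run command, waits for it, then fetches the instance
// view explicitly because the create response may not carry it.
func runScript(api runCommandAPI, name, script string) (instanceView, error) {
	if err := api.CreateAndWait(name, script); err != nil {
		return instanceView{}, fmt.Errorf("failed to run command: %w", err)
	}
	res, err := api.Get(name, true)
	if err != nil {
		return instanceView{}, fmt.Errorf("failed to get instance view: %w", err)
	}
	if res == nil || res.InstanceView == nil {
		return instanceView{}, errors.New("result missing instance view")
	}
	return *res.InstanceView, nil
}

// fakeAPI is an in-memory implementation used only to drive the sketch.
type fakeAPI struct {
	failCreate bool
	stored     *runCommand
}

func (f fakeAPI) CreateAndWait(name, script string) error {
	if f.failCreate {
		return errors.New("simulated ARM failure")
	}
	return nil
}

func (f fakeAPI) Get(name string, expandInstanceView bool) (*runCommand, error) {
	if !expandInstanceView {
		return &runCommand{}, nil // view omitted when not expanded
	}
	return f.stored, nil
}

func main() {
	api := fakeAPI{stored: &runCommand{InstanceView: &instanceView{Output: "hello"}}}
	view, err := runScript(api, "e2e-runcmd-example", "echo hello")
	fmt.Println(view.Output, err)
}
```

A seam like this would let the nil-guard and error-wrapping paths be unit tested separately from the e2e run, though whether the harness wants that indirection is a judgment call for the authors.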
What this PR does / why we need it:

`Test_Windows2022_VHDCaching_*` scenarios have been failing first-attempt on PR check-in runs: the Sysprep RunCommand on the test VM never returns within the test's `vmssCtx`, and validation fails with `context deadline exceeded`.

Repro: on a Win2022 test VM where `Sysprep /generalize` normally completes in ~10s, renaming `C:\Windows\system32\VMAgentDisabler.dll` while leaving the `SysPrepExternal\Generalize` registry entry pointing at it makes sysprep stall past the entire vmssCtx budget. Same symptom as the CI failures.

`vhdbuilder/packer/windows/sysprep.ps1` has stripped that registry entry since 2020 (PR #429) for the production VHD-bake path. The e2e `CreateImage` helper was added later (PR #4631) and never inherited the workaround; it invokes Sysprep directly via RunCommand. This PR brings the e2e path to parity.

Also migrates RunCommand from v1 (`VMSSVM.BeginRunCommand`) to v2 (`VMSSVMRunCommands.BeginCreateOrUpdate`), the same migration aks-rp made in PR 15721814 to avoid the v1 extension's `Keyset does not exist` failure on newer Windows hosts. Two call sites in `validators.go` updated.

Verification
- `go build ./...` clean (e2e module)
- `Test_Windows2022_VHDCaching_LegacyTLSBootstrap` passes locally in ~9m36s, sysprep ~1m, on a VM where the prior code hung the full vmssCtx

Which issue(s) this PR fixes:
Fixes #