fix(e2e): unblock Windows sysprep when VMAgentDisabler.dll load stalls by r2k1 · Pull Request #8544 · Azure/AgentBaker

r2k1 · 2026-05-20T21:18:09Z

What this PR does / why we need it:

Test_Windows2022_VHDCaching_* scenarios have been failing on the first attempt in PR check-in runs: the Sysprep RunCommand on the test VM never returns within the test's vmssCtx, and validation fails with context deadline exceeded.

Repro: on a Win2022 test VM where Sysprep /generalize normally completes in ~10s, renaming C:\Windows\system32\VMAgentDisabler.dll while leaving the SysPrepExternal\Generalize registry entry pointing at it makes sysprep stall past the entire vmssCtx budget. Same symptom as the CI failures.

vhdbuilder/packer/windows/sysprep.ps1 has stripped that registry entry since 2020 (PR #429) for the production VHD-bake path. The e2e CreateImage helper was added later (PR #4631) and never inherited the workaround — it invokes Sysprep directly via RunCommand. This PR brings the e2e path to parity.

Also migrates RunCommand from v1 (VMSSVM.BeginRunCommand) to v2 (VMSSVMRunCommands.BeginCreateOrUpdate) — same migration aks-rp made in PR 15721814 to avoid the v1 extension's Keyset does not exist failure on newer Windows hosts. Two call sites in validators.go updated.

Verification

go build ./... clean (e2e module)
Test_Windows2022_VHDCaching_LegacyTLSBootstrap passes locally in ~9m36s, sysprep ~1m, on a VM where the prior code hung the full vmssCtx

Which issue(s) this PR fixes:
Fixes #

Windows2022 VHDCaching scenarios have been failing at the Sysprep /generalize step in PR check-in runs since ~May 9 2026. The Sysprep RunCommand never completes within the test's vmssCtx budget (TestTimeoutVMSS - prepareAKSNode time, ~14m), and the validation step fails with 'context deadline exceeded'.

Root cause: VMAgentDisabler.dll is a Sysprep provider shipped by the Windows Azure Guest Agent. The agent self-updates from Azure fabric on every boot, and in Jan 2026 added a WDAC catalog file install feature (msazure ADO PR 14499782) for the DLL. The feature had bugs (hotfixes 14889344 / 14901019) and rolled out unevenly Feb-May 2026. On hosts where the catalog install failed, Code Integrity cannot validate the DLL and LoadLibrary stalls long enough to exhaust our test timeout. This matches a 2020 incident (ICM 210726081); the existing vhdbuilder/packer/windows/sysprep.ps1 already has the same workaround during VHD bake.

Causal proof: on a healthy Win2022 host where sysprep normally completes in ~10s, renaming VMAgentDisabler.dll while leaving the SysPrepExternal\Generalize registry entry intact reproduces the stall.

Fix (e2e/test_helpers.go):
- New windowsSysprepScript that removes any SysPrepExternal\Generalize registry value pointing at VMAgentDisabler.dll before invoking Sysprep, then polls ImageState until generalization completes.
- Replaces the inline sysprep invocation in CreateImage; reads res.Output / res.Error instead of marshaling JSON.

Migrate RunCommand from v1 (VMSSVM.BeginRunCommand) to v2 (VMSSVMRunCommands.BeginCreateOrUpdate). v2 is the supported path going forward and matches the migration done in aks-rp PR 15721814 to avoid the 'Keyset does not exist' failure mode of the v1 extension on newer Windows hosts. Two call sites in validators.go refactored to use the new wrapper.

Verified: Test_Windows2022_VHDCaching_LegacyTLSBootstrap passes end-to-end in ~9m36s with sysprep completing in ~1m, vs hanging out the full vmssCtx on broken hosts before this change.
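The registry cleanup described in the fix can be illustrated with a small Go sketch. This is not the PR's windowsSysprepScript (which is PowerShell and not shown on this page); it only models the selection step. The function name `staleSysprepProviders`, the in-memory map standing in for the values under SysPrepExternal\Generalize, and the sample value data are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// staleSysprepProviders returns the names of registry values whose data
// references the given DLL. The comparison is case-insensitive because
// Windows paths are case-insensitive. The map is a hypothetical in-memory
// model of the values under the key, not a real registry read.
func staleSysprepProviders(values map[string]string, dll string) []string {
	var names []string
	needle := strings.ToLower(dll)
	for name, data := range values {
		if strings.Contains(strings.ToLower(data), needle) {
			names = append(names, name)
		}
	}
	sort.Strings(names) // deterministic order for logging
	return names
}

func main() {
	// Sample data is invented; real value names and data may differ.
	values := map[string]string{
		"0": `C:\Windows\System32\VMAgentDisabler.dll,SomeEntryPoint`,
		"1": `C:\Windows\System32\other.dll,SomeEntryPoint`,
	}
	for _, name := range staleSysprepProviders(values, "VMAgentDisabler.dll") {
		fmt.Printf("would remove value %q before running Sysprep\n", name)
	}
}
```

Matching on the DLL name rather than on a fixed value name keeps the cleanup narrow: only entries that point at the problematic provider are selected, and any other Sysprep providers stay registered.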

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR updates the e2e harness to mitigate a recurring Windows2022 VHDCaching flake where Sysprep /generalize can hang when the SysPrepExternal\Generalize registry points at VMAgentDisabler.dll, and modernizes the test harness to use the VMSS RunCommand v2 API surface for script execution.

Changes:

Introduces a VMSS RunCommand v2 wrapper that uses VirtualMachineRunCommand (v2) and fetches the instanceView for stdout/stderr.
Adds a Windows sysprep script that removes SysPrepExternal\Generalize entries referencing VMAgentDisabler.dll and polls ImageState until generalize completion.
Refactors Linux SSH-related validators to consume stdout/stderr directly from the new RunCommand wrapper instead of marshaling full JSON.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
e2e/test_helpers.go	Adds RunCommand v2 wrapper and a Windows sysprep script with registry cleanup + ImageState polling; updates CreateImage to use it.
e2e/validators.go	Refactors validator RunCommand call sites to use the new wrapper and parse stdout/stderr directly.

Comments suppressed due to low confidence (1)

e2e/test_helpers.go:733

CreateImage only checks err from RunCommand; with RunCommand v2 the ARM operation can succeed even when the guest script fails (non-zero exit code / error output). Please fail fast here based on the runcommand instance view result (exit code/execution state), otherwise the test may proceed to capture a non-generalized disk and produce confusing downstream failures.

		if stderr != "" {
			s.T.Logf("Sysprep stderr: %s", stderr)
		}
		require.NoErrorf(s.T, err, "failed to run sysprep on Windows VM for image creation")
	}

+	// VirtualMachineRunCommand resources persist on the VM until explicitly deleted;
+	// use a unique name per call so concurrent / repeated calls don't collide.
+	runCommandName := fmt.Sprintf("e2e-runcmd-%d", time.Now().UnixNano())
+


+	if getResp.Properties == nil || getResp.Properties.InstanceView == nil {
+		return armcompute.VirtualMachineRunCommandInstanceView{}, errors.New("RunCommand result missing instance view")
+	}
+	return *getResp.Properties.InstanceView, nil
+}


Previously the poll wrote a line every 10s for up to 10 min (~60 lines). Log only when ImageState changes (typically 2-3 lines for a normal sysprep run) to stay well under RunCommand's stdout cap and keep the test log readable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
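The log-on-change idea from this commit can be sketched in Go as follows. The PR implements it inside the PowerShell sysprep script, which is not shown here, so the function name `stateChanges` and the sample state sequence are assumptions for illustration only.

```go
package main

import "fmt"

// stateChanges returns one log line per transition in the observed states,
// so a long poll that sees the same state many times emits only a few lines.
func stateChanges(observed []string) []string {
	var lines []string
	last := ""
	for i, s := range observed {
		if i == 0 || s != last {
			lines = append(lines, fmt.Sprintf("ImageState=%s", s))
			last = s
		}
	}
	return lines
}

func main() {
	// Sample sequence: six poll iterations, but only two distinct states.
	observed := []string{
		"IMAGE_STATE_COMPLETE",
		"IMAGE_STATE_COMPLETE",
		"IMAGE_STATE_COMPLETE",
		"IMAGE_STATE_GENERALIZE_RESEAL_TO_OOBE",
		"IMAGE_STATE_GENERALIZE_RESEAL_TO_OOBE",
		"IMAGE_STATE_GENERALIZE_RESEAL_TO_OOBE",
	}
	for _, line := range stateChanges(observed) {
		fmt.Println(line)
	}
}
```

With this approach the output volume depends on the number of state transitions rather than on how long the poll runs, which is what keeps it under an output cap.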

The ARM CreateOrUpdate operation reports success once the RunCommand extension has executed the script, regardless of whether the script itself succeeded. A non-zero exit, PowerShell throw, or timeout inside the script only shows up in InstanceView.ExecutionState / ExitCode (per https://learn.microsoft.com/en-us/azure/virtual-machines/windows/run-command-managed). Without this check the helper returns nil err on a failed script, and callers like CreateImage proceed to capture a non-generalized VM, which is the exact silent-failure mode our sysprep poll throw was designed to catch. Return a descriptive error including ExecutionState / ExitCode / stdout / stderr so require.NoError fails with actionable info.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
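A minimal sketch of such a check is shown below. It is not the PR's runCommandScriptError, whose body does not appear on this page. To stay self-contained it uses a hypothetical `instanceView` struct in place of armcompute.VirtualMachineRunCommandInstanceView, assumes "Succeeded" is the success value of ExecutionState, and assumes a missing exit code can be treated as zero.

```go
package main

import "fmt"

// instanceView is a hypothetical stand-in for the fields of the SDK's
// run command instance view that matter for this check.
type instanceView struct {
	ExecutionState string
	ExitCode       *int32
	Output         string
	Error          string
}

// scriptError returns nil only when the script finished in the "Succeeded"
// state with a zero exit code. A nil ExitCode is treated as zero here,
// which is an assumption of this sketch.
func scriptError(v instanceView) error {
	exit := int32(0)
	if v.ExitCode != nil {
		exit = *v.ExitCode
	}
	if v.ExecutionState == "Succeeded" && exit == 0 {
		return nil
	}
	return fmt.Errorf("RunCommand script failed: state=%q exitCode=%d stdout=%q stderr=%q",
		v.ExecutionState, exit, v.Output, v.Error)
}

func main() {
	one := int32(1)
	failed := instanceView{ExecutionState: "Failed", ExitCode: &one, Error: "sysprep did not generalize"}
	fmt.Println(scriptError(failed))
	fmt.Println(scriptError(instanceView{ExecutionState: "Succeeded"}))
}
```

Checking both the state and the exit code covers the case where the extension reports a completed run but the script exited non-zero, so a caller using require.NoError would stop before capturing the disk.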

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

 	start := time.Now()
 	defer func() {
-		elapsed := time.Since(start)
-		toolkit.Logf(ctx, "Command %q took %s", command, elapsed)
+		toolkit.Logf(ctx, "Command %q took %s", command, time.Since(start))
 	}()


+	// VirtualMachineRunCommand resources persist on the VM until explicitly deleted;
+	// use a unique name per call so concurrent / repeated calls don't collide.
+	runCommandName := fmt.Sprintf("e2e-runcmd-%d", time.Now().UnixNano())
+
+	runCmd := armcompute.VirtualMachineRunCommand{
+		Location: to.Ptr(s.Location),
+		Properties: &armcompute.VirtualMachineRunCommandProperties{
+			Source: &armcompute.VirtualMachineRunCommandScriptSource{
+				Script: to.Ptr(command),
+			},
+			AsyncExecution: to.Ptr(false),
+		},
+	}
+
+	poller, err := config.Azure.VMSSVMRunCommands.BeginCreateOrUpdate(ctx, rg, s.Runtime.VMSSName, instanceID, runCommandName, runCmd, nil)
 	if err != nil {
-		return armcompute.RunCommandResult{}, fmt.Errorf("failed to run command on Windows VM for image creation: %w", err)
+		return armcompute.VirtualMachineRunCommandInstanceView{}, fmt.Errorf("failed to start RunCommand on VMSS VM: %w", err)
+	}
+	if _, err := poller.PollUntilDone(ctx, nil); err != nil {
+		return armcompute.VirtualMachineRunCommandInstanceView{}, fmt.Errorf("failed to wait for RunCommand on VMSS VM: %w", err)
 	}

-	runResp, err := runPoller.PollUntilDone(ctx, nil)
+	// The CreateOrUpdate response doesn't always include the InstanceView; fetch it
+	// explicitly so we get stdout/stderr/exit code.
+	getResp, err := config.Azure.VMSSVMRunCommands.Get(ctx, rg, s.Runtime.VMSSName, instanceID, runCommandName, &armcompute.VirtualMachineScaleSetVMRunCommandsClientGetOptions{
+		Expand: to.Ptr("instanceView"),
+	})
 	if err != nil {
-		return runResp.RunCommandResult, fmt.Errorf("failed to run command on Windows VM for image creation: %w", err)
+		return armcompute.VirtualMachineRunCommandInstanceView{}, fmt.Errorf("failed to get RunCommand instance view: %w", err)
+	}
+	if getResp.Properties == nil || getResp.Properties.InstanceView == nil {
+		return armcompute.VirtualMachineRunCommandInstanceView{}, errors.New("RunCommand result missing instance view")
+	}
+	view := *getResp.Properties.InstanceView
+	return view, runCommandScriptError(view)


Copilot AI review requested due to automatic review settings May 20, 2026 21:18

r2k1 requested review from AbelHu, Devinwong, SriHarsha001, awesomenix, calvin197, cameronmeissner, djsly, ganeshkumarashok, junjiezhang1997, lilypan26, mxj220, pdamianov-dev, phealy, sulixu, surajssd, timmy-wright and zachary-bailey as code owners May 20, 2026 21:18

r2k1 temporarily deployed to test May 20, 2026 21:18 — with GitHub Actions Inactive

Copilot started reviewing on behalf of r2k1 May 20, 2026 21:19 View session

trim speculation from sysprep comment

f525089

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

r2k1 temporarily deployed to test May 20, 2026 21:24 — with GitHub Actions Inactive

Copilot AI reviewed May 20, 2026


r2k1 temporarily deployed to test May 21, 2026 00:36 — with GitHub Actions Inactive

Copilot AI review requested due to automatic review settings May 21, 2026 00:40

r2k1 temporarily deployed to test May 21, 2026 00:40 — with GitHub Actions Inactive

Copilot started reviewing on behalf of r2k1 May 21, 2026 00:41 View session

Copilot AI reviewed May 21, 2026


r2k1 wants to merge 4 commits into main from akhantimirov/fix-windows-sysprep-vmagentdisabler-flake
