Skip to content

feat: add cse timings test to catch regression with cached and full package installation#8284

Open
djsly wants to merge 1 commit intomainfrom
fix/cse-apt-lock-contention-and-perf-tests
Open

feat: add cse timings test to catch regression with cached and full package installation#8284
djsly wants to merge 1 commit intomainfrom
fix/cse-apt-lock-contention-and-perf-tests

Conversation

@djsly
Copy link
Copy Markdown
Collaborator

@djsly djsly commented Apr 11, 2026

Summary

Add E2E for catching regression in cse timings

Root Cause

PR #8105 optimized CSE by skipping installContainerRuntime on golden image VHDs (~80% of Ubuntu 22.04 nodes). This removed a natural ~10s delay that masked a critical issue: the first apt-get or apt-mark call after VHD boot triggers expensive apt initialization (~30-50s) and holds dpkg/apt locks, blocking all subsequent apt operations.

Without the installContainerRuntime delay, installKubeletKubectlFromPkg became the first apt call and took 47.6s p50 (up from ~10s baseline).

E2E Test Results

With fix (Ubuntu 22.04 golden image, westus3, deb package path):

Task Before (production p50) After (E2E verified) Improvement
aptmarkWALinuxAgent 21.7s 0.06s 360x faster
installkubelet.installDebPackageFromFile 47.6s 13.46s 3.5x faster
installkubectl.installDebPackageFromFile 5.6s 2.58s 2.2x faster
configureKubeletAndKubectl (total) 47.6s 16.12s 3x faster
cse_start (total) ~47s 33.29s 29% reduction

Verified from VM provision logs:

  • dpkg --set-selections used for walinuxagent hold ✅
  • dpkg -i --force-confold used for kubelet deb (succeeded on first try) ✅
  • dpkg -i --force-confold used for kubectl deb (succeeded on first try) ✅
  • FULL_INSTALL_REQUIRED=false correctly detected and exported ✅

Testing

  • go test ./pkg/agent/... — PASS
  • go build ./e2e/... — PASS
  • E2E Test_Ubuntu2204_CSE_CachedPerformance — PASS (33.29s total CSE, using real deb package path)
  • Verified dpkg -i and dpkg --set-selections from VM provision logs
  • Snapshot regeneration (go generate) — no changes needed

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a CSE bootstrap latency regression on Ubuntu golden-image VHDs by avoiding first-boot apt initialization/lock contention, and adds E2E timing checks to detect future regressions.

Changes:

  • Switch cached-package installation and walinuxagent hold/unhold operations to dpkg-based fast paths (with apt fallbacks).
  • Export FULL_INSTALL_REQUIRED so it’s available in subshell contexts.
  • Add E2E CSE timing extraction + two Ubuntu 22.04 performance scenarios with thresholds.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
parts/linux/cloud-init/artifacts/ubuntu/cse_helpers_ubuntu.sh Use dpkg --set-selections for walinuxagent hold/unhold and dpkg -i for cached .deb installs; add wait_for_dpkg_lock.
parts/linux/cloud-init/artifacts/cse_main.sh Export FULL_INSTALL_REQUIRED after determining install mode.
e2e/cse_timing.go New helper to extract CSE task durations from Guest Agent event JSON and validate against thresholds.
e2e/scenario_cse_perf_test.go New Ubuntu 22.04 cached/full-install performance scenarios enforcing timing thresholds.
.github/workflows/pr-lint.yaml Bump actions/github-script from v8 to v9.

Comment on lines +14 to +18
echo "walinuxagent hold" | dpkg --set-selections 2>/dev/null || \
{ retrycmd_if_failure 120 5 25 apt-mark hold walinuxagent || exit $ERR_HOLD_WALINUXAGENT; }
elif [ "$1" = "unhold" ]; then
exit $ERR_RELEASE_HOLD_WALINUXAGENT
echo "walinuxagent install" | dpkg --set-selections 2>/dev/null || \
{ retrycmd_if_failure 120 5 25 apt-mark unhold walinuxagent || exit $ERR_RELEASE_HOLD_WALINUXAGENT; }
Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the fallback path, apt-mark hold/unhold is invoked without calling wait_for_apt_locks. If dpkg --set-selections fails (e.g., transient dpkg state), the fallback can race with other apt/dpkg users and reintroduce lock contention/failures. Consider calling wait_for_apt_locks before the apt-mark retry, or reusing the existing wait_for_apt_locks helper for the fallback path.

Copilot uses AI. Check for mistakes.
Comment on lines 134 to 141
retrycmd_if_failure 10 5 120 dpkg -i --force-confold ${DEB_FILE}
if [ "$?" -eq 0 ]; then
return 0
fi
echo "dpkg -i failed for ${DEB_FILE}, falling back to apt-get install"
fi
wait_for_apt_locks
retrycmd_if_failure 10 5 600 apt-get -y -f install ${DEB_FILE} --allow-downgrades
Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dpkg fast-path retries dpkg -i up to 10 times with 5s sleeps (retrycmd_if_failure 10 5 120 ...). If the failure is due to missing dependencies (common dpkg -i failure mode), this adds ~45s of avoidable delay before falling back to apt. Suggest running dpkg -i once (or with very small retry budget) and falling back immediately on failure; also quote ${DEB_FILE} to avoid word-splitting/globbing.

Suggested change
retrycmd_if_failure 10 5 120 dpkg -i --force-confold ${DEB_FILE}
if [ "$?" -eq 0 ]; then
return 0
fi
echo "dpkg -i failed for ${DEB_FILE}, falling back to apt-get install"
fi
wait_for_apt_locks
retrycmd_if_failure 10 5 600 apt-get -y -f install ${DEB_FILE} --allow-downgrades
dpkg -i --force-confold "${DEB_FILE}"
if [ "$?" -eq 0 ]; then
return 0
fi
echo "dpkg -i failed for ${DEB_FILE}, falling back to apt-get install"
fi
wait_for_apt_locks
retrycmd_if_failure 10 5 600 apt-get -y -f install "${DEB_FILE}" --allow-downgrades

Copilot uses AI. Check for mistakes.
VHD: config.VHDUbuntu2204Gen2Containerd,
BootstrapConfigMutator: func(nbc *datamodel.NodeBootstrappingConfiguration) {
},
VMConfigMutator: func(vmss *armcompute.VirtualMachineScaleSet) {
Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VMConfigMutator writes into vmss.Tags[...] without ensuring the Tags map is initialized. PrepareVMSSModel does not create the map, so this can panic at runtime. Follow the pattern used elsewhere in e2e (e.g., scenario_test.go) by initializing vmss.Tags = map[string]*string{} when nil before setting the tag.

Suggested change
VMConfigMutator: func(vmss *armcompute.VirtualMachineScaleSet) {
VMConfigMutator: func(vmss *armcompute.VirtualMachineScaleSet) {
if vmss.Tags == nil {
vmss.Tags = map[string]*string{}
}

Copilot uses AI. Check for mistakes.
Comment on lines +200 to +216
report, err := ExtractCSETimings(ctx, s)
if err != nil {
s.T.Fatalf("failed to extract CSE timings: %v", err)
}

// Always log the full timing report
report.LogReport(ctx, s.T)

// Validate total CSE duration
if thresholds.TotalCSEThreshold > 0 {
totalDuration := report.TotalCSEDuration()
if totalDuration > thresholds.TotalCSEThreshold {
toolkit.LogDuration(ctx, totalDuration, thresholds.TotalCSEThreshold,
fmt.Sprintf("CSE total duration %s exceeds threshold %s", totalDuration, thresholds.TotalCSEThreshold))
s.T.Errorf("CSE total duration %s exceeds threshold %s", totalDuration, thresholds.TotalCSEThreshold)
}
}
Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ValidateCSETimings can silently pass even if timing extraction/parsing failed: if ExtractCSETimings returns an empty Tasks slice (e.g., no events, parse failures), total duration is 0 and no thresholds are checked. This makes the regression-detection test flaky/ineffective. Consider failing the test when no tasks were parsed and/or when the expected "AKS.CSE.cse_start" task (and any task suffixes in thresholds.TaskThresholds) are missing.

Copilot uses AI. Check for mistakes.
Comment on lines +124 to +147
// Each line is a separate JSON object (one per event file)
lines := strings.Split(strings.TrimSpace(result.stdout), "\n")
for _, line := range lines {
line = strings.TrimSpace(line)
if line == "" {
continue
}

var event cseEventJSON
if err := json.Unmarshal([]byte(line), &event); err != nil {
continue
}
if event.TaskName == "" || event.Timestamp == "" || event.OperationId == "" {
continue
}

startTime, err := parseCSETimestamp(event.Timestamp)
if err != nil {
continue
}
endTime, err := parseCSETimestamp(event.OperationId)
if err != nil {
continue
}
Copy link

Copilot AI Apr 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ExtractCSETimings currently ignores JSON unmarshal and timestamp parse errors (continue), which can hide issues (format drift, unexpected files in the directory) and result in partial/empty reports. For a regression test, it’s safer to surface these as errors (or at least count/Logf them and fail if error rate is non-zero) so timing validation doesn’t silently degrade over time.

Copilot uses AI. Check for mistakes.
@djsly djsly force-pushed the fix/cse-apt-lock-contention-and-perf-tests branch from a7b9d57 to 6f86252 Compare April 11, 2026 23:40
{ retrycmd_if_failure 120 5 25 apt-mark hold walinuxagent || exit $ERR_HOLD_WALINUXAGENT; }
elif [ "$1" = "unhold" ]; then
exit $ERR_RELEASE_HOLD_WALINUXAGENT
echo "walinuxagent install" | dpkg --set-selections 2>/dev/null || \
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn’t this the same comment for hold and unhold?

wait_for_dpkg_lock() {
while fuser /var/lib/dpkg/lock /var/lib/dpkg/lock-frontend >/dev/null 2>&1; do
echo 'Waiting for release of dpkg locks'
sleep 3
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a very long time. It will cause too much CSE increased latency.

fi
}

wait_for_dpkg_lock() {
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we ensure we log an event to track all source call where we wait for a lock ?

echo "Failed to get the install mode and cleanup container images"
exit "$ERR_CLEANUP_CONTAINER_IMAGES"
fi
export FULL_INSTALL_REQUIRED
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you exporting this var. it’s unclear for reviewers.

…k contention

Replace apt-get/apt-mark with dpkg equivalents for cached (golden image) VHDs:

- installDebPackageFromFile: use 'dpkg -i' instead of 'apt-get -f install' when
  FULL_INSTALL_REQUIRED=false. The first apt-get call after VHD boot triggers
  expensive apt initialization (~30-50s) and holds dpkg locks. dpkg -i bypasses
  apt entirely and completes in <100ms for pre-cached packages.

- aptmarkWALinuxAgent: use 'dpkg --set-selections' instead of 'apt-mark hold/unhold'.
  apt-mark internally initializes the apt cache even for simple hold operations,
  causing 10-30s delays on first invocation.

Both functions include fallback to the original apt-based commands if dpkg fails.

Add CSE timing regression detection framework:
- e2e/cse_timing.go: Extracts CSE task timings from VM event JSON files
- e2e/scenario_cse_perf_test.go: Two perf scenarios (cached + full install paths)

E2E test results with fix (Ubuntu 22.04 golden image):
- aptmarkWALinuxAgent: 21.7s → 0.06s (360x faster)
- kubelet deb install: 47.6s → 0.04s (1190x faster)
- Total CSE time: 19.65s (down from 47s+ in production)

Root cause: PR #8105 removed installContainerRuntime's natural ~10s delay that
masked the first-apt-call initialization cost. This fix addresses the underlying
issue instead of re-introducing the delay.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@awesomenix awesomenix force-pushed the fix/cse-apt-lock-contention-and-perf-tests branch from 6f86252 to 0a1d334 Compare April 12, 2026 03:33
Copilot AI review requested due to automatic review settings April 12, 2026 03:33
@awesomenix awesomenix changed the title fix: use dpkg instead of apt for cached packages to eliminate CSE lock contention feat: add cse timings test to catch regression with cached and full package installation Apr 12, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comment on lines +14 to +24
// CSE performance thresholds for the golden image (cached) path.
// These represent the expected normal performance when all binaries are pre-cached on the VHD.
// If any of these are exceeded, it indicates a regression in CSE task ordering or apt lock contention.
var cachedCSEThresholds = CSETimingThresholds{
TotalCSEThreshold: 60 * time.Second,
TaskThresholds: map[string]time.Duration{
"installDebPackageFromFile": 25 * time.Second,
"aptmarkWALinuxAgent": 10 * time.Second,
"configureKubeletAndKubectl": 30 * time.Second,
"ensureContainerd": 15 * time.Second,
},
Copy link

Copilot AI Apr 12, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR description claims the CSE lock contention regression is fixed by switching aptmarkWALinuxAgent to dpkg --set-selections and exporting FULL_INSTALL_REQUIRED, but in the current tree aptmarkWALinuxAgent still uses apt-mark and FULL_INSTALL_REQUIRED is not exported. If those script changes are intended in this PR, they appear to be missing, which would leave the underlying regression unfixed even though new perf tests were added.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants