perf: trim GPU provisioning critical path (skip redundant pull, async cleanup, defer DCGM) by ganeshkumarashok · Pull Request #8613 · Azure/AgentBaker

ganeshkumarashok · 2026-05-31T19:16:17Z

Summary

Three small, independent CSE-time optimizations that trim the GPU node provisioning critical path on Ubuntu. None change the default driver install behavior; each is low-risk and self-contained.

1. Skip the redundant driver image pull (`configGPUDrivers`)

The aks-gpu-cuda image is normally pre-pulled into the VHD, but `configGPUDrivers()` unconditionally ran `ctr image pull` at boot: a wasted manifest/layer round trip to MCR (and exposure to MCR throttling). Now we only pull when the image is genuinely absent locally; otherwise we go straight to `ctr run`.
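A minimal sketch of the presence check, under assumptions: the helper names `image_ref_in_list` and `pull_if_missing` are hypothetical and are not functions in `cse_config.sh`. It uses `grep -F` so the dots in a registry hostname are matched literally, which is a small variation on the `grep -qx` check in the PR diff rather than a copy of it.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical, not AgentBaker functions.

# Succeeds when the exact image reference appears in a newline-separated list on stdin.
# -F matches the reference literally (dots are not wildcards);
# -x requires a whole-line match, so "repo:1" does not match "repo:1.0".
image_ref_in_list() {
    grep -qxF -- "$1"
}

# Pull only when the reference is absent from the local containerd store (k8s.io namespace).
pull_if_missing() {
    local ref="$1"
    if ctr -n k8s.io images ls -q | image_ref_in_list "$ref"; then
        echo "image already present, skipping pull: $ref"
    else
        ctr -n k8s.io images pull "$ref"
    fi
}
```

A caller would invoke it as `pull_if_missing "$NVIDIA_DRIVER_IMAGE:$NVIDIA_DRIVER_IMAGE_TAG"`. One thing to watch if the surrounding script enables `set -o pipefail`: `grep -q` exits on the first match, so a producer that is still writing can be terminated by a closed pipe and make the pipeline report failure even though the image was found.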

2. Async image cleanup (drop `--sync`)

The post-install `ctr images rm --sync` blocked CSE waiting for containerd GC to finish. Dropping `--sync` removes the image reference immediately and lets GC reclaim space asynchronously: same disk outcome, no blocking.
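The two cleanup modes can be compared side by side in a small sketch. `cleanup_image` is a hypothetical helper written for illustration; only the `ctr images rm` command and its `--sync` flag come from the description above.

```shell
#!/usr/bin/env bash
# Sketch only: cleanup_image is a hypothetical helper, not an AgentBaker function.

# Remove an image reference from the k8s.io namespace.
#   mode=async (default): return once the reference is deleted; containerd GC reclaims space later.
#   mode=sync:            pass --sync so ctr also waits for the associated resources to be removed.
cleanup_image() {
    local ref="$1" mode="${2:-async}"
    if [ "$mode" = "sync" ]; then
        ctr -n k8s.io images rm --sync "$ref"
    else
        ctr -n k8s.io images rm "$ref"
    fi
}
```

With this shape, the change in this PR corresponds to switching the post-install call from `cleanup_image "$ref" sync` to `cleanup_image "$ref"`.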

3. Defer DCGM telemetry off the critical path

`nvidia-dcgm` and `nvidia-dcgm-exporter` are monitoring only and don't gate GPU workload scheduling, yet they were started with the blocking `systemctlEnableAndStart` and hard-exited CSE on a slow/failed start. They now use `systemctlEnableAndStartNoBlock` and are non-fatal. The `nvidia-device-plugin` start stays blocking and fatal because it gates the node advertising GPUs to the scheduler.
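The blocking/non-blocking split could look roughly like the sketch below. The bodies of `systemctlEnableAndStart` and `systemctlEnableAndStartNoBlock` are not shown in this thread, so `start_blocking`, `start_no_block`, and `start_gpu_services` are hypothetical stand-ins built on plain `systemctl`; any retry or timeout handling in the real helpers is not reproduced here.

```shell
#!/usr/bin/env bash
# Sketch only: all three functions are hypothetical stand-ins for the AgentBaker helpers.

# Enable the unit and wait for the start job to finish.
start_blocking() {
    systemctl enable "$1" && systemctl start "$1"
}

# Enable the unit and only enqueue the start job; --no-block returns without waiting for it.
start_no_block() {
    systemctl enable "$1" && systemctl start --no-block "$1"
}

start_gpu_services() {
    local svc
    # The device plugin gates GPU advertisement to the scheduler: blocking and fatal.
    if ! start_blocking nvidia-device-plugin; then
        echo "nvidia-device-plugin failed to start" >&2
        return 1
    fi
    # Telemetry only: enqueue the start and log a warning instead of failing provisioning.
    for svc in nvidia-dcgm nvidia-dcgm-exporter; do
        start_no_block "$svc" || echo "warning: could not enqueue $svc; continuing" >&2
    done
    return 0
}
```

The trade-off is that a DCGM failure now surfaces only as a warning in the provisioning log, so it would need to be caught by monitoring rather than by a failed node bring-up.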

Tests

New shellspec coverage for startNvidiaManagedExpServices: asserts device-plugin stays blocking while dcgm/dcgm-exporter are enqueued off the critical path and don't fail provisioning. (7/7 GPU-service examples pass.)
go test ./pkg/agent/... — pass.
make generate — no snapshot diffs.

Risk / behavior notes

#1 keeps a full fallback pull when the image is missing, so there is no regression on non-prebaked images.
#2 still removes the image; only the synchronous wait is gone.
#3 makes DCGM/exporter start non-fatal. This is the intended trade-off (telemetry should never block or fail node provisioning); reviewers who want DCGM failures to remain fatal should flag it.

Coordination

Touches configGPUDrivers(), which the held PR #8612 (prebuild GPU kernel module) also edits — these two will need a light rebase against each other whenever both land. This PR is independent and can merge on its own.

… image cleanup, defer DCGM)

Three low-risk CSE-time optimizations for GPU nodes, none of which change the default driver install behavior:

1. Skip the redundant `ctr image pull` in configGPUDrivers() when the driver image is already present locally. The image is normally pre-pulled into the VHD, so the boot-time pull was paying a wasted manifest/layer round trip to MCR; we still pull as a fallback when the image is genuinely missing.

2. Drop `--sync` from the post-install `ctr images rm` so containerd garbage collection happens asynchronously instead of blocking provisioning. The image reference is still removed to reclaim disk.

3. Start nvidia-dcgm and nvidia-dcgm-exporter with systemctlEnableAndStartNoBlock and treat a slow/failed start as non-fatal. These are telemetry only and do not gate GPU workload scheduling. The nvidia-device-plugin start stays blocking and fatal because it gates the node advertising GPUs to the scheduler.

Adds shellspec coverage for startNvidiaManagedExpServices asserting the device-plugin stays blocking while dcgm/dcgm-exporter are enqueued off the critical path and do not fail provisioning.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR optimizes Ubuntu GPU node provisioning in the Linux CSE path by avoiding redundant container image pulls, reducing post-install blocking work, and moving non-critical DCGM telemetry startup off the provisioning critical path.

Changes:

Skip pulling the NVIDIA driver container image when it’s already present in containerd.
Make driver image cleanup non-blocking by dropping ctr images rm --sync.
Start nvidia-dcgm and nvidia-dcgm-exporter asynchronously and treat enqueue failures as non-fatal, while keeping nvidia-device-plugin blocking/fatal.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| parts/linux/cloud-init/artifacts/cse_config.sh | Adds a local-image presence check before pulling, makes image cleanup async, and defers DCGM/exporter startup off the critical path with non-fatal handling. |
| spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh | Adds ShellSpec coverage ensuring device-plugin remains blocking while DCGM/exporter are started via the no-block path and failures don’t fail provisioning. |

-        ctr -n k8s.io image pull $NVIDIA_DRIVER_IMAGE:$NVIDIA_DRIVER_IMAGE_TAG
+        # The driver image is normally pre-pulled into the VHD; only hit the registry when it is
+        # actually missing so provisioning doesn't pay a redundant manifest/layer round trip.
+        if ! ctr -n k8s.io images ls -q | grep -qx "$NVIDIA_DRIVER_IMAGE:$NVIDIA_DRIVER_IMAGE_TAG"; then


ganeshkumarashok · 2026-06-01T17:21:40Z

Closing in favor of #8615, which is the identical change from a same-repo branch. AgentBaker requires PRs from Azure/AgentBaker branches (fork PRs don't get CI/secrets).

Copilot AI review requested due to automatic review settings May 31, 2026 19:16

ganeshkumarashok requested review from AbelHu, Devinwong, SriHarsha001, awesomenix, calvin197, cameronmeissner, djsly, lilypan26, mxj220, pdamianov-dev, phealy, r2k1, sulixu, surajssd, timmy-wright and zachary-bailey as code owners May 31, 2026 19:16

Copilot started reviewing on behalf of ganeshkumarashok May 31, 2026 19:16

Copilot AI reviewed May 31, 2026


ganeshkumarashok mentioned this pull request Jun 1, 2026

refactor: trim GPU provisioning critical path (skip redundant pull, async cleanup, defer DCGM) #8615

Open

ganeshkumarashok closed this Jun 1, 2026

ganeshkumarashok wants to merge 1 commit into Azure:main from ganeshkumarashok:gpu-provisioning-boot-path

2 participants
