perf: trim GPU provisioning critical path (skip redundant pull, async cleanup, defer DCGM)#8613

Closed
ganeshkumarashok wants to merge 1 commit into
Azure:mainfrom
ganeshkumarashok:gpu-provisioning-boot-path
Conversation

@ganeshkumarashok
Summary

Three small, independent CSE-time optimizations that trim the GPU node provisioning critical path on Ubuntu. None change the default driver install behavior; each is low-risk and self-contained.

1. Skip the redundant driver image pull (configGPUDrivers)

The aks-gpu-cuda image is normally pre-pulled into the VHD, but configGPUDrivers() unconditionally ran ctr image pull at boot — a wasted manifest/layer round trip to MCR (and exposure to MCR throttling). Now we only pull when the image is genuinely absent locally; otherwise we go straight to ctr run.
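The presence check described above can be sketched as a small helper. This is a minimal illustration, not the PR's actual code: `ensure_driver_image` is a hypothetical name, and unlike the condition quoted in the review excerpt it passes `-F` to `grep` so that dots in the image reference are matched literally.

```shell
#!/usr/bin/env bash
# Sketch only: ensure_driver_image is a hypothetical helper, not the function in cse_config.sh.
ensure_driver_image() {
    local ref="$1"
    # `images ls -q` prints one image reference per line.
    # -x requires a whole-line match; -F treats the reference as a literal string.
    if ctr -n k8s.io images ls -q | grep -qxF "$ref"; then
        echo "present: $ref"
        return 0
    fi
    # Fallback: the image is missing locally, so pull it from the registry.
    echo "pulling: $ref"
    ctr -n k8s.io image pull "$ref"
}
```

A caller would pass the full reference, for example `ensure_driver_image "$NVIDIA_DRIVER_IMAGE:$NVIDIA_DRIVER_IMAGE_TAG"`, before proceeding to `ctr run`.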

2. Async image cleanup (drop --sync)

The post-install ctr images rm --sync blocked CSE waiting for containerd GC to finish. Dropping --sync removes the image reference immediately and lets GC reclaim space asynchronously — same disk outcome, no blocking.
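As a rough before/after, the cleanup change amounts to removing one flag. The wrapper name `remove_driver_image` below is hypothetical and exists only to make the sketch self-contained.

```shell
#!/usr/bin/env bash
# Sketch only: remove_driver_image is a hypothetical wrapper, not a function from cse_config.sh.
remove_driver_image() {
    local ref="$1"
    # Before (blocking): ctr -n k8s.io images rm --sync "$ref"
    # After: drop the reference now and let containerd GC reclaim the layers later.
    ctr -n k8s.io images rm "$ref"
}
```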

3. Defer DCGM telemetry off the critical path

nvidia-dcgm and nvidia-dcgm-exporter are monitoring only and don't gate GPU workload scheduling, yet they were started with the blocking systemctlEnableAndStart and hard-exited CSE on a slow/failed start. They now use systemctlEnableAndStartNoBlock and are non-fatal. The nvidia-device-plugin start stays blocking and fatal because it gates the node advertising GPUs to the scheduler.
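The split between blocking and deferred starts might look roughly like the following. This is a sketch under assumptions: `start_blocking` and `start_no_block` are hypothetical stand-ins for the repository's `systemctlEnableAndStart` and `systemctlEnableAndStartNoBlock` helpers (whose real signatures and retry behavior are not shown in this PR page), and a plain `return 1` stands in for however CSE reports a fatal error.

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for the AgentBaker helpers; the real ones may retry or take timeouts.
start_blocking() { systemctl enable "$1" && systemctl start "$1"; }
start_no_block() { systemctl enable "$1" && systemctl start --no-block "$1"; }

start_gpu_services() {
    # The device plugin gates GPU advertisement to the scheduler: wait for it, fail hard.
    start_blocking nvidia-device-plugin || return 1

    # DCGM services are telemetry only: enqueue the start job and warn on failure.
    local svc
    for svc in nvidia-dcgm nvidia-dcgm-exporter; do
        start_no_block "$svc" || echo "warning: could not enqueue $svc; continuing" >&2
    done
    return 0
}
```

With `--no-block`, `systemctl start` returns once the job is queued rather than when the unit is active, so a slow DCGM start no longer extends provisioning time.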

Tests

  • New shellspec coverage for startNvidiaManagedExpServices: asserts device-plugin stays blocking while dcgm/dcgm-exporter are enqueued off the critical path and don't fail provisioning. (7/7 GPU-service examples pass.)
  • go test ./pkg/agent/... — pass.
  • make generate — no snapshot diffs.

Risk / behavior notes

Coordination

Touches configGPUDrivers(), which the held PR #8612 (prebuild GPU kernel module) also edits — these two will need a light rebase against each other whenever both land. This PR is independent and can merge on its own.

… image cleanup, defer DCGM)

Three low-risk CSE-time optimizations for GPU nodes, none of which change the
default driver install behavior:

1. Skip the redundant `ctr image pull` in configGPUDrivers() when the driver
   image is already present locally. The image is normally pre-pulled into the
   VHD, so the boot-time pull was paying a wasted manifest/layer round trip to
   MCR; we still pull as a fallback when the image is genuinely missing.

2. Drop `--sync` from the post-install `ctr images rm` so containerd garbage
   collection happens asynchronously instead of blocking provisioning. The
   image reference is still removed to reclaim disk.

3. Start nvidia-dcgm and nvidia-dcgm-exporter with
   systemctlEnableAndStartNoBlock and treat a slow/failed start as non-fatal.
   These are telemetry only and do not gate GPU workload scheduling. The
   nvidia-device-plugin start stays blocking and fatal because it gates the
   node advertising GPUs to the scheduler.

Adds shellspec coverage for startNvidiaManagedExpServices asserting the
device-plugin stays blocking while dcgm/dcgm-exporter are enqueued off the
critical path and do not fail provisioning.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI left a comment

Pull request overview

This PR optimizes Ubuntu GPU node provisioning in the Linux CSE path by avoiding redundant container image pulls, reducing post-install blocking work, and moving non-critical DCGM telemetry startup off the provisioning critical path.

Changes:

  • Skip pulling the NVIDIA driver container image when it’s already present in containerd.
  • Make driver image cleanup non-blocking by dropping ctr images rm --sync.
  • Start nvidia-dcgm and nvidia-dcgm-exporter asynchronously and treat enqueue failures as non-fatal, while keeping nvidia-device-plugin blocking/fatal.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `parts/linux/cloud-init/artifacts/cse_config.sh` | Adds a local-image presence check before pulling, makes image cleanup async, and defers DCGM/exporter startup off the critical path with non-fatal handling. |
| `spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh` | Adds ShellSpec coverage ensuring device-plugin remains blocking while DCGM/exporter are started via the no-block path and failures don’t fail provisioning. |

ctr -n k8s.io image pull $NVIDIA_DRIVER_IMAGE:$NVIDIA_DRIVER_IMAGE_TAG
# The driver image is normally pre-pulled into the VHD; only hit the registry when it is
# actually missing so provisioning doesn't pay a redundant manifest/layer round trip.
if ! ctr -n k8s.io images ls -q | grep -qx "$NVIDIA_DRIVER_IMAGE:$NVIDIA_DRIVER_IMAGE_TAG"; then
@ganeshkumarashok
Closing in favor of #8615, which is the identical change from a same-repo branch. AgentBaker requires PRs from Azure/AgentBaker branches (fork PRs don't get CI/secrets).
