feat(gpu): use aks-gpu-cuda-lts (R580 LTS) for the managed CUDA driver#8811

Merged
ganeshkumarashok merged 2 commits into main from ganesh/gpu-cuda-lts
Jul 1, 2026

@ganeshkumarashok ganeshkumarashok commented Jul 1, 2026

Contributor

What this PR does / why we need it:

Switches the managed CUDA GPU driver image from aks-gpu-cuda to the R580 LTS variant aks-gpu-cuda-lts (580.159.04-20260629214430).

Why: Enabling the CUDA driver prebake (bake #8786, flag #8803) requires an aks-gpu image that supports the build-only action (aks-gpu #162). The only build-only-capable aks-gpu-cuda builds are on the R595 line — and NVIDIA R595 drops Volta/V100 (R590+ deprecation). Fleet telemetry (AKSprod AgentPoolSnapshot) shows that a significant number of managed-GPU nodes, spanning many customer subscriptions, still run on V100 (NC*_v3, ND40rs_v2) — and nearly all of them rely on the managed driver (only a small fraction opt out via --gpu-driver None). So bumping to R595 (the approach in #8810) would break them.

aks-gpu already publishes exactly what we need: aks-gpu-cuda-lts = the NVIDIA R580 Long Term Support branch (driver_config.yml: "supported through Aug 2028"), which keeps V100 and whose post-#162 builds already have build-only. It also covers every other managed CUDA SKU (T4/A100/H100/H200). This keeps V100 working and unblocks the prebake — with no aks-gpu change. It's a move within the 580 line (580.126.09 → 580.159.04), not a driver-branch jump.

Wiring (traced end-to-end):

| Area | Change |
| --- | --- |
| parts/common/components.json | aks-gpu-cuda → aks-gpu-cuda-lts (repo, renovateTag, version) |
| pkg/agent/datamodel/gpu_components.go | LoadConfig case → aks-gpu-cuda-lts (sets CUDA driver version/suffix for VHD build and runtime) |
| pkg/agent/baker.go GetGPUDriverType | modern CUDA SKUs → "cuda-lts" (selects aks-gpu-cuda-lts via cse_helpers.sh aks-gpu-${type}); legacy NCv1/K80 stays "cuda" with its pinned R470 driver |
| parts/.../cse_config.sh logGPUDriverPrebakeReadiness | map driver-type → aks-gpu marker driver_kind (cuda-lts → cuda, grid-v20 → grid) so a CUDA-prebaked marker matches a cuda-lts node |
| vhdbuilder/packer/install-dependencies.sh | VHD-build prebake/caching image selection → aks-gpu-cuda-lts |
| ACL / Mariner cse_install_*.sh | comment accuracy; their grid-vs-non-grid sysext logic already handles cuda-lts |
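
The driver-type selection described in the wiring above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual `GetGPUDriverType` code: the names `driverTypeForSKU`, `imageNameFor`, and `legacyNCv1`, and the SKU list itself, are hypothetical, and GRID SKUs are left out. Only the `"cuda"` / `"cuda-lts"` strings and the `aks-gpu-${type}` naming come from this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// legacyNCv1 lists K80-based sizes that stay on the pinned R470 "cuda" image.
// The exact SKU list is an assumption made for this sketch.
var legacyNCv1 = map[string]bool{
	"standard_nc6":   true,
	"standard_nc12":  true,
	"standard_nc24":  true,
	"standard_nc24r": true,
}

// driverTypeForSKU is a hypothetical stand-in for GetGPUDriverType.
// Legacy NCv1 sizes keep "cuda"; every other CUDA size gets "cuda-lts".
func driverTypeForSKU(size string) string {
	if legacyNCv1[strings.ToLower(size)] {
		return "cuda"
	}
	return "cuda-lts"
}

// imageNameFor mirrors the aks-gpu-${type} naming mentioned for cse_helpers.sh.
func imageNameFor(driverType string) string {
	return "aks-gpu-" + driverType
}

func main() {
	for _, size := range []string{"Standard_NC6s_v3", "Standard_NC6"} {
		t := driverTypeForSKU(size)
		fmt.Printf("%s -> %s (%s)\n", size, t, imageNameFor(t))
	}
}
```

Under these assumptions a V100 size such as Standard_NC6s_v3 resolves to aks-gpu-cuda-lts, while a K80 size such as Standard_NC6 stays on aks-gpu-cuda.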

Scope note: ACL/AzureLinux install GPU drivers from OS sysext images, not the aks-gpu container, so this is effectively Ubuntu-scoped.

Validation: go test ✅ · make generate-testdata (no drift) ✅ · shellcheck (changed scripts clean) ✅ · shellspec 751/0 (incl. new cuda-lts → cuda match test) ✅ · make validate-components ✅

Relationship to other PRs:

Which issue(s) this PR fixes: Fixes #

Requirements

  • Conventional-commit title · DCO signed-off · GPG-verified · branch on Azure/AgentBaker
  • make generate-testdata clean; go test, shellspec, validate-components pass

Switch the managed CUDA GPU driver image from aks-gpu-cuda to the R580 LTS
variant aks-gpu-cuda-lts (580.159.04-20260629214430).

Why: enabling the CUDA driver prebake (#8786 / #8803) needs an aks-gpu image
that supports the `build-only` action (aks-gpu #162). The only build-only-capable
aks-gpu-cuda images are on the R595 line, which drops NVIDIA Volta/V100 support --
and ~487 managed-GPU nodes across ~293 subscriptions still run on V100 (NC*_v3,
ND40rs_v2). aks-gpu already ships aks-gpu-cuda-lts: the NVIDIA R580 Long Term
Support branch (supported through Aug 2028), which keeps V100 AND whose post-#162
builds have build-only, and which also covers every other managed CUDA SKU
(T4/A100/H100/H200). So this keeps V100 working and unblocks the prebake with no
aks-gpu change. It is a move within the 580 line (580.126.09 -> 580.159.04), not
a driver-branch jump.

Wiring:
- components.json: aks-gpu-cuda -> aks-gpu-cuda-lts (repo, renovateTag, version).
- gpu_components.go: LoadConfig case -> aks-gpu-cuda-lts (drives the CUDA driver
  version/suffix used by both VHD build and runtime install).
- baker.go GetGPUDriverType: modern CUDA SKUs -> "cuda-lts" (selects the
  aks-gpu-cuda-lts image via cse_helpers.sh `aks-gpu-${type}`); legacy NCv1 (K80)
  stays on "cuda" with its pinned R470 driver.
- cse_config.sh logGPUDriverPrebakeReadiness: map the driver-type to the aks-gpu
  marker's driver_kind (cuda-lts -> cuda, grid-v20 -> grid) so a CUDA-prebaked
  marker matches a cuda-lts node.
- install-dependencies.sh: VHD-build prebake/caching image selection -> aks-gpu-cuda-lts.
- ACL/Mariner comments updated; their grid-vs-non-grid sysext logic already
  handles "cuda-lts". ACL/AzureLinux install drivers from OS sysext images (not
  the aks-gpu container), so this change is effectively Ubuntu-scoped.

Validation: go test, make generate-testdata (no drift), shellcheck, shellspec
(751/0), make validate-components all pass.

Supersedes #8810 (which bumped aks-gpu-cuda to R595 and would have dropped V100).

Signed-off-by: Ganeshkumar Ashokavardhanan <aganeshkumar@microsoft.com>
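
The driver-type to marker-kind mapping from the commit message above can be illustrated with a short sketch. The real logic lives in shell (`logGPUDriverPrebakeReadiness` in cse_config.sh); this Go version only restates the idea. The function names `markerKindFor` and `markerMatches` are hypothetical, and passing unlisted driver types through unchanged is an assumption.

```go
package main

import "fmt"

// markerKindFor maps a node's driver type to the driver_kind recorded in the
// aks-gpu prebake marker. Only the two mappings named in the commit message
// are special-cased; the pass-through default is an assumption.
func markerKindFor(driverType string) string {
	switch driverType {
	case "cuda-lts":
		return "cuda"
	case "grid-v20":
		return "grid"
	default:
		return driverType
	}
}

// markerMatches reports whether a prebaked marker's kind fits the node's driver type.
func markerMatches(markerKind, nodeDriverType string) bool {
	return markerKind == markerKindFor(nodeDriverType)
}

func main() {
	fmt.Println(markerMatches("cuda", "cuda-lts")) // true: a CUDA-prebaked marker fits a cuda-lts node
	fmt.Println(markerMatches("grid", "cuda-lts")) // false: a GRID marker does not
}
```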

Copilot AI left a comment

Contributor

Pull request overview

This PR updates the managed NVIDIA CUDA driver container image selection so GPU nodes use the R580 LTS–based aks-gpu-cuda-lts image (to retain Volta/V100 support while enabling build-only prebake), and wires the new driver type through AgentBaker’s Go selection logic, VHD build caching, and CSE observability/tests.

Changes:

  • Switch GPU container image metadata in parts/common/components.json from aks-gpu-cuda to aks-gpu-cuda-lts (including renovate tag + version).
  • Update AgentBaker Go GPU component loading and driver-type selection to use cuda-lts for modern CUDA SKUs (while keeping legacy NCv1 on cuda).
  • Update VHD build pre-pull logic and add ShellSpec coverage for the prebake marker kind mapping.

Package Update Analysis: aks-gpu-cuda-lts

Version change: 580.126.09-20260126030251 → 580.159.04-20260629214430 (minor update within the 580 driver branch)
OS variants affected: Ubuntu VHD build/runtime managed-driver path (container-based install)
OS variants NOT updated: AzureLinux/Mariner/ACL (they install via sysext/RPM paths, not the aks-gpu CUDA container)

Changes between versions: No reliably citable upstream release notes were found for these exact driver point releases during this review. Manual validation (e2e GPU scenarios + targeted V100/NCv3 coverage) is recommended before merge.

Overall Risk: Medium
Justification: Although the driver stays on the R580 line (not a major branch jump), the PR also changes the image repository and introduces a new driver-type string (cuda-lts) that must remain compatible with legacy selection paths (notably NCv1/K80).

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| parts/common/components.json | Switch CUDA GPU image metadata to aks-gpu-cuda-lts and bump version. |
| pkg/agent/datamodel/gpu_components.go | Load CUDA driver version/suffix from aks-gpu-cuda-lts repo entry. |
| pkg/agent/datamodel/gpu_components_test.go | Update repo parsing test to expect aks-gpu-cuda-lts. |
| pkg/agent/baker.go | Return cuda-lts for modern CUDA SKUs; keep legacy NCv1 on cuda. |
| pkg/agent/baker_test.go | Update expectations for cuda-lts and add explicit NCv1 legacy coverage. |
| vhdbuilder/packer/install-dependencies.sh | Pre-pull aks-gpu-cuda-lts during Ubuntu VHD build and update related error text. |
| parts/linux/cloud-init/artifacts/cse_config.sh | Map driver-type to marker driver_kind for prebake readiness logging (cuda-lts → cuda). |
| spec/parts/linux/cloud-init/artifacts/cse_config_spec.sh | Add ShellSpec test for cuda-lts marker-kind matching behavior. |
| parts/linux/cloud-init/artifacts/mariner/cse_install_mariner.sh | Update comments describing cuda-lts vs legacy cuda behavior (no logic change). |
| parts/linux/cloud-init/artifacts/acl/cse_install_acl.sh | Update comments describing cuda-lts vs legacy cuda behavior (no logic change). |

Comment thread pkg/agent/baker.go
Comment thread parts/common/components.json
The custom versioning rule that lets Renovate parse the driver image's
"<major.minor.patch>-<timestamp>" tag matched "aks/aks-gpu-cuda"; after moving
the managed CUDA driver to aks-gpu-cuda-lts, retarget the rule (and groupName)
so the LTS repo is version-tracked. Also flip automerge to false: this is now
the V100-critical managed driver, so driver bumps should be reviewed (matching
the aks-gpu-grid / grid-v20 rules) rather than auto-merged.

Signed-off-by: Ganeshkumar Ashokavardhanan <aganeshkumar@microsoft.com>
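
To make the "<major.minor.patch>-<timestamp>" tag shape concrete, here is a minimal parsing sketch. It does not reproduce Renovate's versioning rule; the `driverTag` type and the `parseDriverTag` and `sameBranch` helpers are hypothetical names introduced for illustration. The two tags are the ones quoted in this PR.

```go
package main

import (
	"fmt"
	"strings"
)

// driverTag holds the pieces of a "<major.minor.patch>-<timestamp>" image tag.
type driverTag struct {
	Version   string // e.g. "580.159.04"
	Timestamp string // e.g. "20260629214430"
	Branch    string // major component, e.g. "580"
}

// parseDriverTag splits a tag at the first "-" and checks the version has three parts.
func parseDriverTag(tag string) (driverTag, error) {
	version, timestamp, ok := strings.Cut(tag, "-")
	if !ok || version == "" || timestamp == "" {
		return driverTag{}, fmt.Errorf("tag %q is not <version>-<timestamp>", tag)
	}
	parts := strings.Split(version, ".")
	if len(parts) != 3 {
		return driverTag{}, fmt.Errorf("version %q is not major.minor.patch", version)
	}
	return driverTag{Version: version, Timestamp: timestamp, Branch: parts[0]}, nil
}

// sameBranch reports whether two tags belong to the same driver branch.
func sameBranch(a, b driverTag) bool {
	return a.Branch == b.Branch
}

func main() {
	old, err := parseDriverTag("580.126.09-20260126030251")
	if err != nil {
		panic(err)
	}
	cur, err := parseDriverTag("580.159.04-20260629214430")
	if err != nil {
		panic(err)
	}
	fmt.Println(old.Branch, cur.Branch, sameBranch(old, cur)) // prints "580 580 true"
}
```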
Comment thread pkg/agent/baker.go
@ganeshkumarashok ganeshkumarashok merged commit 70d4fff into main Jul 1, 2026
37 of 46 checks passed
@ganeshkumarashok ganeshkumarashok deleted the ganesh/gpu-cuda-lts branch July 1, 2026 23:48
ganeshkumarashok added a commit that referenced this pull request Jul 2, 2026
#8811 moved the managed CUDA driver from aks-gpu-cuda to aks-gpu-cuda-lts
(R580 LTS, V100-capable). This keeps the pre-LTS aks-gpu-cuda image tracked in
the component manifest during the transition, pinned to the R580 line.

Minimal footprint -- components.json + renovate only; no Go/behavior change:
- components.json: add an aks-gpu-cuda entry pinned to the R580 line
  (580.126.09), NOT the R595 line that drops Volta/V100. LoadConfig has no case
  for it, so it touches no driver-version global (avoiding a clobber of the
  aks-gpu-cuda-lts render values -- they share the NvidiaCudaDriverVersion /
  AKSGPUCudaVersionSuffix globals).
- renovate.json: constrain aks-gpu-cuda to /^580\./ so it never bumps to the
  V100-dropping R595 line.

Deliberately inert: not baked into the VHD (install-dependencies.sh only
pre-pulls aks-gpu-cuda-lts) and not a render target (GetGPUDriverType returns
"cuda-lts"). Old-VHD / version-skewed nodes that target aks-gpu-cuda already
resolve it at boot via the hardened registry pull (#8821), served by
required-MCR egress or the wildcard network-isolated ACR cache. This just keeps
it a recognized, V100-safe, renovate-managed reference during the transition.

Signed-off-by: Ganeshkumar Ashokavardhanan <aganeshkumar@microsoft.com>
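
The effect of the `/^580\./` constraint can be checked with a small sketch. This only demonstrates the regular expression itself, not how Renovate evaluates it; `allowedByPin` is a hypothetical helper, and the 595 tag below is an invented example value, not a real published tag.

```go
package main

import (
	"fmt"
	"regexp"
)

// r580Only is the same pattern as the renovate.json constraint /^580\./.
var r580Only = regexp.MustCompile(`^580\.`)

// allowedByPin reports whether a tag stays on the R580 line.
func allowedByPin(tag string) bool {
	return r580Only.MatchString(tag)
}

func main() {
	tags := []string{
		"580.126.09-20260126030251",
		"580.159.04-20260629214430",
		"595.0.0-20260701000000", // hypothetical R595 tag for illustration
	}
	for _, tag := range tags {
		fmt.Printf("%s allowed=%v\n", tag, allowedByPin(tag))
	}
}
```

Because the pattern includes the escaped dot, a tag whose major version merely starts with the digits 580 (for example 5800) would not match.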
ganeshkumarashok added a commit that referenced this pull request Jul 3, 2026
…a-lts

#8811 moved the managed CUDA driver from aks-gpu-cuda to aks-gpu-cuda-lts,
reusing the NvidiaCudaDriverVersion / AKSGPUCudaVersionSuffix globals for the
LTS image. This restores a first-class `case "aks-gpu-cuda"` in LoadConfig so
the pre-LTS image's version is loaded and available if a SKU is ever routed to
the "cuda" image in CSE again -- without disturbing today's render.

- components.json: add an aks-gpu-cuda entry pinned to the R580 line
  (580.126.09), NOT the R595 line that drops Volta/V100.
- gpu_components.go: aks-gpu-cuda reclaims NvidiaCudaDriverVersion /
  AKSGPUCudaVersionSuffix (its pre-#8811 names); aks-gpu-cuda-lts moves to
  NvidiaCudaLTSDriverVersion / AKSGPUCudaLTSVersionSuffix. Mirrors the existing
  base-vs-variant naming (NvidiaGridDriverVersion vs NvidiaGridV20DriverVersion)
  and avoids clobbering a shared global.
- baker.go: GetGPUDriverVersion / GetAKSGPUImageSHA render the LTS globals for
  modern CUDA SKUs, so rendered output is byte-identical (verified: zero
  testdata drift). aks-gpu-cuda is loaded but not the default render target.
- renovate.json: constrain aks-gpu-cuda to /^580\./ so it never bumps to R595.

Still not baked into the VHD (install-dependencies.sh only pre-pulls
aks-gpu-cuda-lts). Old-VHD / skewed nodes that target aks-gpu-cuda resolve it at
boot via the hardened pull (#8821), served by required-MCR or the wildcard
network-isolated ACR cache.

Signed-off-by: Ganeshkumar Ashokavardhanan <aganeshkumar@microsoft.com>
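
The reason for splitting the version globals can be sketched as follows. This is a simplified model, not the real `LoadConfig`: the `component` and `gpuVersions` types and the `loadVersions` function are hypothetical, the values are struct fields here rather than package globals, and matching on a bare image name is an assumption. The field names and the 580.126.09 / 580.159.04 versions come from the commit messages above; the timestamp used for aks-gpu-cuda is illustrative.

```go
package main

import (
	"fmt"
	"strings"
)

// component is a simplified stand-in for one GPU image entry in components.json.
type component struct {
	Name string // image name, e.g. "aks-gpu-cuda-lts"
	Tag  string // "<driver version>-<suffix>"
}

// gpuVersions groups values that the real code keeps as package-level globals.
type gpuVersions struct {
	NvidiaCudaDriverVersion    string
	AKSGPUCudaVersionSuffix    string
	NvidiaCudaLTSDriverVersion string
	AKSGPUCudaLTSVersionSuffix string
}

// loadVersions gives each image its own pair of fields, so load order cannot
// make one image overwrite the other's values.
func loadVersions(components []component) gpuVersions {
	var v gpuVersions
	for _, c := range components {
		version, suffix, ok := strings.Cut(c.Tag, "-")
		if !ok {
			continue // tags without a suffix are skipped in this sketch
		}
		switch c.Name {
		case "aks-gpu-cuda":
			v.NvidiaCudaDriverVersion, v.AKSGPUCudaVersionSuffix = version, suffix
		case "aks-gpu-cuda-lts":
			v.NvidiaCudaLTSDriverVersion, v.AKSGPUCudaLTSVersionSuffix = version, suffix
		}
	}
	return v
}

func main() {
	v := loadVersions([]component{
		{Name: "aks-gpu-cuda-lts", Tag: "580.159.04-20260629214430"},
		{Name: "aks-gpu-cuda", Tag: "580.126.09-20260126030251"}, // timestamp is illustrative
	})
	fmt.Println("cuda:", v.NvidiaCudaDriverVersion, "cuda-lts:", v.NvidiaCudaLTSDriverVersion)
}
```

With a single shared pair of fields, whichever entry loaded last would win; separate fields keep the LTS render values stable regardless of entry order.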

Labels

components This pull request updates cached components on Linux or Windows VHDs

4 participants