feat(gpu): bump aks-gpu-cuda to 595.71.05 (build-only-capable; prereq for CUDA prebake) #8810

Closed
ganeshkumarashok wants to merge 1 commit into main from ganesh/bump-aks-gpu-cuda-buildonly


@ganeshkumarashok

What this PR does / why we need it:

Bumps the cached CUDA driver image aks-gpu-cuda in parts/common/components.json:

580.126.09-20260126030251  (Jan 2026)  →  595.71.05-20260623180420  (Jun 2026)

This is the prerequisite that unblocks the CUDA driver prebake (bake merged in #8786, flag enabled in #8803). The prebake runs the aks-gpu container with /entrypoint.sh build-only, an action added in aks-gpu #162 (merged June 2026). I verified against MCR that no 580.126.09 image supports build-only (newest 580 build is 2026-04-30, pre-#162) — only the 595.71.05 family (May–June builds) does. Without this bump the prebake fails at VHD build:

/opt/actions/install.sh: line 4: /opt/gpu/config.sh: No such file or directory  → exit 1

(observed in PR-gate build 170469320 for #8803). 595.71.05-20260623180420 is the exact image validated green in the prebake GPU e2e.

Only aks-gpu-cuda is bumped; aks-gpu-grid / aks-gpu-grid-v20 are not baked by the enablement (#8803) and are left unchanged.
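To make the size of the jump concrete, here is a minimal Python sketch that splits the two tags into a driver version and a build timestamp and checks whether the driver branch changed. It assumes the suffix after the hyphen is a `YYYYMMDDHHMMSS` build timestamp (consistent with the "Jan 2026" and "Jun 2026" annotations above, but not confirmed by the aks-gpu repo); `GpuImageTag` and `parse_tag` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class GpuImageTag:
    driver_version: str  # e.g. "595.71.05"
    built_at: datetime   # parsed from the 14-digit suffix (assumed YYYYMMDDHHMMSS)

    @property
    def branch(self) -> int:
        # The leading component is the NVIDIA driver branch, e.g. 595 -> R595.
        return int(self.driver_version.split(".")[0])


def parse_tag(tag: str) -> GpuImageTag:
    """Split '<driver>-<timestamp>' into its two parts."""
    driver_version, _, stamp = tag.partition("-")
    return GpuImageTag(driver_version, datetime.strptime(stamp, "%Y%m%d%H%M%S"))


old = parse_tag("580.126.09-20260126030251")
new = parse_tag("595.71.05-20260623180420")

# A differing branch number means this is a driver-branch bump, not a patch update.
print(old.branch, "->", new.branch, "branch change:", old.branch != new.branch)
```

A check like this could flag branch-crossing bumps for extra review, since a new branch can change which GPU architectures the driver supports.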


🔴 BLOCKER — this drops V100 / Volta support (why it's a draft)

Version change: aks-gpu-cuda 580.126.09 → 595.71.05 (major driver-branch bump)
OS variants affected: Ubuntu 22.04 / 24.04 gen2 x86 (shared image used by all CUDA GPU SKUs)

| Change | Description | Risk |
|---|---|---|
| Breaking | NVIDIA R590/R595 branches no longer load on Tesla V100 (Volta/GV100); R580 is the last branch supporting it (NVIDIA fora, 590 deprecation) | 🔴 High |
| Feature | Adds build-only entrypoint (aks-gpu #162) → enables VHD-time DKMS prebake | 🟢 Low |
| Feature | Newer CUDA driver for Ampere/Hopper/Ada/Blackwell | 🟡 Medium |

AgentBaker e2e still actively tests V100 via Standard_NC6s_v3:
Test_Ubuntu2204_GPUNC, Test_ACL_GPUNC, Test_Ubuntu2204_GPUNoDriver(_Scriptless), Test_AzureLinuxV3_GPU. This bump will break those scenarios and any NC-v3 / V100 customer nodes on the shared image.

This is almost certainly why components.json has been held at 580.126.09 despite 595.71.05 being available since May.

Do NOT merge until one of these is resolved

  1. V100 / NC-v3 EOL sign-off — confirm Volta is out of support on the affected node images (then drop/skip the GPUNC e2e), or
  2. aks-gpu backports build-only to a V100-capable 580.x image — keeps V100 and enables prebake (no driver-branch jump), or
  3. Split the CUDA driver — V100 stays on 580.x (fallback compile, no prebake) while newer GPUs get 595.71.05 prebaked (needs driver-selection logic; larger change).
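As a rough illustration of what the driver-selection logic in option 3 might look like, here is a Python sketch. Everything in it is hypothetical: the `VOLTA_SKUS` set, the `select_cuda_driver` function, and the returned dictionary shape are not AgentBaker APIs. Only the Standard_NC6s_v3 to V100 association and the two image tags come from this PR; `Standard_NC24ads_A100_v4` is used purely as an example of a non-Volta SKU.

```python
# Hypothetical set of VM SKUs backed by Volta GPUs. Only Standard_NC6s_v3 is
# named in this PR; a real implementation would need the full NC-v3 list.
VOLTA_SKUS = {"standard_nc6s_v3"}

LEGACY_TAG = "580.126.09-20260126030251"  # R580: last branch that loads on V100
CURRENT_TAG = "595.71.05-20260623180420"  # R595: supports the build-only prebake


def select_cuda_driver(vm_sku: str) -> dict:
    """Pick a driver image tag and whether the prebaked modules can be used."""
    if vm_sku.lower() in VOLTA_SKUS:
        # V100 stays on 580.x and compiles the driver at provisioning time.
        return {"tag": LEGACY_TAG, "prebaked": False}
    # Newer architectures get the prebaked 595.71.05 driver.
    return {"tag": CURRENT_TAG, "prebaked": True}


print(select_cuda_driver("Standard_NC6s_v3"))
print(select_cuda_driver("Standard_NC24ads_A100_v4"))
```

The sketch also shows why this option is the larger change: both images would have to be cached on the VHD, and the SKU-to-architecture mapping would need to be kept up to date.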

Overall Risk: 🔴 High

Justification: Major GPU driver-branch bump on the shared image that removes support for a still-tested, still-shipping GPU architecture (V100/NC-v3).
Recommendation: Hold as draft pending the V100 disposition above. Sequenced before #8803.

Which issue(s) this PR fixes: Fixes #

Requirements

  • Conventional-commit title · DCO signed-off · GPG-verified · branch on Azure/AgentBaker
  • make generate-testdata clean (GPU image version isn't snapshotted); make validate-components passes

… for CUDA prebake)

Bump the cached CUDA driver image aks-gpu-cuda from 580.126.09-20260126030251
(Jan 2026) to 595.71.05-20260623180420.

Prerequisite for the CUDA driver prebake (install-dependencies.sh, merged in
#8786; flag enabled in #8803). The prebake runs the aks-gpu container with
`/entrypoint.sh build-only`, an action added in aks-gpu #162 (June 2026). No
580.126.09 image supports build-only (newest 580 build is 2026-04-30, pre-#162);
only the 595.71.05 family (May-June builds) does. Without this bump the prebake
fails at VHD build: install.sh:4 `source /opt/gpu/config.sh: No such file`.
This exact image was validated green via the prebake GPU e2e.

WARNING -- drops V100/Volta support. NVIDIA R590/R595 branches no longer load on
Tesla V100 (Volta); R580 is the last branch supporting it. AgentBaker e2e still
exercises V100 via Standard_NC6s_v3 (Test_Ubuntu2204_GPUNC, Test_ACL_GPUNC,
GPUNoDriver, AzureLinux GPU), so this will break those scenarios and any NC-v3/
V100 nodes on the shared image. Do not merge without V100/NC-v3 EOL sign-off (or
an aks-gpu build-only backport to a V100-capable 580.x image). Kept as draft.

Signed-off-by: Ganeshkumar Ashokavardhanan <aganeshkumar@microsoft.com>
Copilot AI review requested due to automatic review settings July 1, 2026 20:36
@github-actions github-actions Bot added the components This pull request updates cached components on Linux or Windows VHDs label Jul 1, 2026

Copilot AI left a comment

Pull request overview

This PR bumps the cached AKS CUDA GPU driver container image tag (aks-gpu-cuda) in parts/common/components.json, intended to unblock the Ubuntu VHD-time CUDA driver prebake path (which requires the newer build-only entrypoint behavior in the aks-gpu image family).

Changes:

  • Update aks-gpu-cuda image tag from 580.126.09-20260126030251 to 595.71.05-20260623180420.

Package Update Analysis: aks-gpu-cuda

Version change: 580.126.09 → 595.71.05 (major driver-branch bump)
OS variants affected: Ubuntu 22.04 gen2 x86, Ubuntu 24.04 gen2 x86 (this image is pre-pulled for Ubuntu during VHD build)
OS variants NOT updated: None (single shared tag in components.json)

Changes between 580.126.09 and 595.71.05

| Change | Description | Risk |
|---|---|---|
| Breaking | R595/R590 branches no longer support Tesla V100 (Volta); R580 is the legacy branch for Volta | 🔴 High |
| Feature | build-only / related entrypoint modes needed for VHD prebake workflows | 🟢 Low |
| Driver branch update | Newer datacenter driver branch with additional fixes/features vs R580 | 🟡 Medium |

Overall Risk: 🔴 High

Justification: The repo still has active V100/NCsv3 coverage (e.g., Standard_NC6s_v3 GPU E2E scenarios), and an R595 bump on the shared CUDA driver image risks breaking those nodes/tests without an explicit V100/NCsv3 support/EOL decision or a split/backport strategy.
Recommendation: Hold until V100 disposition is confirmed (EOL vs backport vs split driver selection).

Comment on lines 739 to 743

      "downloadURL": "mcr.microsoft.com/aks/aks-gpu-cuda:*",
      "gpuVersion": {
        "renovateTag": "registry=https://mcr.microsoft.com, name=aks/aks-gpu-cuda",
    -   "latestVersion": "580.126.09-20260126030251"
    +   "latestVersion": "595.71.05-20260623180420"
      }
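For readers unfamiliar with this entry, the Python sketch below shows how such a fragment could resolve to a concrete image reference. It assumes the `*` in `downloadURL` is a placeholder that gets replaced by `latestVersion`; that substitution rule and the `image_ref` helper are illustrative guesses, not AgentBaker's actual resolution code. The fragment reproduces only the fields shown in the diff, not the full `components.json` schema.

```python
import json

# Minimal fragment mirroring the fields in the diff above (post-bump value).
fragment = json.loads("""
{
  "downloadURL": "mcr.microsoft.com/aks/aks-gpu-cuda:*",
  "gpuVersion": {
    "renovateTag": "registry=https://mcr.microsoft.com, name=aks/aks-gpu-cuda",
    "latestVersion": "595.71.05-20260623180420"
  }
}
""")


def image_ref(entry: dict) -> str:
    """Substitute the '*' placeholder with the pinned version (assumed convention)."""
    return entry["downloadURL"].replace("*", entry["gpuVersion"]["latestVersion"])


print(image_ref(fragment))
```

Under that assumption the one-line change in this PR swaps the image that is pulled during VHD build, which is why a single tag edit affects every CUDA GPU SKU on the shared image.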
@ganeshkumarashok

Superseded by #8811. Rather than bump aks-gpu-cuda to R595 (which drops Volta/V100 — ~487 managed nodes across ~293 subscriptions per AKSprod AgentPoolSnapshot), #8811 switches the managed CUDA driver to the R580 LTS image aks-gpu-cuda-lts (580.159.04, supported through Aug 2028). That keeps V100 and enables the build-only prebake, with no aks-gpu change. Closing this in favor of #8811.