feat: add GRID v20 and CUDA LTS driver containers #158

Merged
ganeshkumarashok merged 1 commit into main from support-grid-v20
May 29, 2026
@ganeshkumarashok
Collaborator

@ganeshkumarashok ganeshkumarashok commented May 27, 2026

Adds two new NVIDIA driver container variants, following the same matrix-build pattern as the existing aks-gpu-cuda / aks-gpu-cuda-arm64 split:

  • aks-gpu-grid-v20 — NVIDIA GRID v20 (driver 595.58.03). Required for the RTX PRO 6000 Blackwell Server Edition v6 SKU family. Existing aks-gpu-grid (v18, 570.211.01) is unchanged.
  • aks-gpu-cuda-lts and aks-gpu-cuda-lts-arm64 — NVIDIA R580 LTSB (580.159.04). A Long Term Support Branch variant alongside the existing Production Branch aks-gpu-cuda (595.71.05). R580 LTSB is supported by NVIDIA through Aug 2028.

Changes

  • driver_config.yml: add grid_v20 and cuda_lts blocks.
  • main.yaml / ci.yaml: matrix-include both branches in each of the grid, cuda, and cuda-arm64 jobs; DRIVER_KIND is passed as the literal grid or cuda; cache keys and image tags are scoped by image repo.
  • justfile: add buildgridv20/pushgridv20 and buildcudalts/pushcudalts.

Design trade-off: separate MCR repo vs. shared repo with version-prefix tags

Two designs were considered:

Chosen (this PR): new MCR repo per driver branch — e.g. aks-gpu-grid (570.x), aks-gpu-grid-v20 (595.x), aks-gpu-cuda (PB), aks-gpu-cuda-lts (LTSB).

Alternative considered: keep one MCR repo per driver kind, distinguish branches by tag prefix — e.g. aks-gpu-grid:570.* and aks-gpu-grid:595.* both pushed from this same workflow.

Recurring cost vs. one-time cost

The biggest asymmetry between the two designs is when the complexity is paid:

  • Chosen design — recurring cost per new branch. Each new major (a hypothetical future grid_v21, next CUDA LTSB after R580, etc.) requires a new MCR repo onboarding, a new datamodel constant in AgentBaker, a new line in the SKU → repo selection logic, and a new Renovate rule. Version is encoded in the image name, so every branch bump requires coordinating a new image name across repos.
  • Alternative — one-time cost up front, then cheap. New majors are just a new entry in driver_config.yml and a new tag on the existing repo; the image name does not change. Version lives in the tag, which is its natural place. But: the schema/parser/renovate complexity to support multiple branches in one repo has to be paid once up front, and it brings a silent-regression risk that has to be permanently guarded against.

Where the alternative pushes complexity in AgentBaker

  1. components.json schema disambiguation. With separate repos, each entry has a unique downloadURL. With a shared repo, both entries would have the same downloadURL (aks-gpu-grid:*), forcing a new disambiguator field (e.g. driverBranch: "570" / "595") or a change to the :* placeholder convention.
  2. gpu_components.go parser rework. Today it does switch strings.TrimSuffix(image.DownloadURL, ":*"), so two entries pointing to the same repo collapse into the same case and the second silently overwrites the first. Disambiguation requires switching on (repo, branch).
  3. Renovate config — silent-regression foot-gun. Highest-sorting tag wins by default (595.x > 570.x), which would silently overwrite the v18 entry with a 595.x tag, regressing every v18 GRID node. Preventing this requires matchCurrentValue: "/^570\\./" and /^595\\./ regex pins on each rule — permanent complexity plus an ambient "latest tag" ambiguity for anyone outside AgentBaker pulling naively.
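The collapse described in item 2 can be modeled in a few lines. This is a minimal sketch in Python, not the actual gpu_components.go code: the entry shape, the `tag` field, and both function names are assumptions made for illustration, and the driver versions stand in for real image tags.

```python
# Hypothetical entries for the shared-repo alternative: both driver
# branches share one downloadURL and differ only by tag.
ENTRIES = [
    {"downloadURL": "aks-gpu-grid:*", "tag": "570.211.01"},
    {"downloadURL": "aks-gpu-grid:*", "tag": "595.58.03"},
]


def strip_placeholder(download_url):
    """Drop a trailing ':*', mirroring strings.TrimSuffix(url, ":*")."""
    return download_url[:-2] if download_url.endswith(":*") else download_url


def index_by_repo(entries):
    """Key entries by repo only; a later entry replaces an earlier one."""
    index = {}
    for entry in entries:
        index[strip_placeholder(entry["downloadURL"])] = entry["tag"]
    return index


def index_by_repo_and_branch(entries):
    """Key entries by (repo, major branch) so both branches survive."""
    index = {}
    for entry in entries:
        repo = strip_placeholder(entry["downloadURL"])
        branch = entry["tag"].split(".", 1)[0]
        index[(repo, branch)] = entry["tag"]
    return index
```

With the repo-only key, the 570.x entry is lost without any error; with the (repo, branch) key, both entries are kept. Separate repos avoid the problem because each downloadURL is already unique.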

Verdict

The chosen design is simpler to reason about today but pays its cost every time a new branch is added. The alternative is more natural for versions-as-tags and cheaper per new branch, but has a permanent silent-regression risk in Renovate that has to be actively guarded against.

Given the cadence of new driver branches (a couple per year between GRID and CUDA LTSBs) and the existing precedent of aks-gpu-cuda / aks-gpu-cuda-arm64 already being separate repos, the chosen design is the more compatible step. If we end up onboarding many more branches over time, switching to the tag-based model later is a reasonable migration target.

Out of scope

  • auto_update.py: new variants are pinned manually for now, same flow as the existing v18 pin (PR #154). A focused follow-up can extend the updater to handle each branch within its own major.
  • AgentBaker consumption of the new images: separate PR.
  • Mariner/Azure Linux/ACL GRID install paths: these use separate package/sysext flows and are unchanged.

Operational note

New MCR paths (aks-gpu-grid-v20, aks-gpu-cuda-lts, aks-gpu-cuda-lts-arm64) need standard onboarding before downstream consumers can pull them.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@ganeshkumarashok
Collaborator Author

Operational follow-ups before downstream rollout

A rubber-duck pass surfaced two items worth tracking. Both are operational, not code blockers for this PR.

1. MCR onboarding for the new repo path (action required)

The new image path needs to be onboarded to MCR before AgentBaker can pull it. Confirmed with a current check:

GET https://mcr.microsoft.com/v2/aks/aks-gpu-grid/tags/list       -> 200
GET https://mcr.microsoft.com/v2/aks/aks-gpu-grid-v20/tags/list   -> 404

Without the registration/ACL/replication setup, the first push from this workflow will succeed against ACR but the public MCR path will still 404. Please trigger the standard new-MCR-repo onboarding for public/aks/aks-gpu-grid-v20 before the AgentBaker PR cuts over.
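The check above can be scripted so it is easy to re-run after onboarding. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and it assumes an anonymous GET on the tags/list endpoint returns 200 for an onboarded repo and 404 otherwise, as observed in the two requests above.

```python
import urllib.error
import urllib.request

MCR_HOST = "https://mcr.microsoft.com"


def tags_list_url(repo):
    """Build the registry v2 tags/list URL for a repo such as 'aks/aks-gpu-grid'."""
    return f"{MCR_HOST}/v2/{repo}/tags/list"


def fetch_status(url, timeout=10):
    """Return the HTTP status code for a GET on url, including error codes."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def is_onboarded(repo, get_status=fetch_status):
    """True when the public tags/list endpoint answers 200 for repo."""
    return get_status(tags_list_url(repo)) == 200
```

Calling `is_onboarded("aks/aks-gpu-grid-v20")` performs the real request; the `get_status` parameter exists only so the logic can be exercised without network access.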

2. Runtime validation on RTX PRO 6000 BSE v6 hardware

Blackwell GPUs require the open-source NVIDIA kernel module. The 595.x branch should default to the open kmod at install time (default since 560.x), so the existing install.sh flags are expected to work, but this has not been validated on actual Blackwell hardware. Recommend a one-time runtime check (nvidia-smi on a real node) before promoting GA SKUs to this image.
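A possible shape for that one-time check is sketched below. It is an illustration, not a validated procedure: the function names are hypothetical, and it only confirms that nvidia-smi can see the GPUs and that they report the expected driver branch. It does not by itself prove that the open kernel module is the one loaded.

```python
import subprocess

# Query GPU name and driver version in headerless CSV form.
QUERY = ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"]


def parse_gpus(output):
    """Turn 'name, version' CSV lines into a list of (name, version) tuples."""
    gpus = []
    for line in output.strip().splitlines():
        name, version = (field.strip() for field in line.rsplit(",", 1))
        gpus.append((name, version))
    return gpus


def on_expected_branch(gpus, branch="595"):
    """True when at least one GPU is listed and all report the given major branch."""
    return bool(gpus) and all(v.split(".")[0] == branch for _, v in gpus)


def check_node(branch="595"):
    """Run nvidia-smi on this node; raises if the command is missing or fails."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return on_expected_branch(parse_gpus(out), branch)
```

Running `check_node()` on a real RTX PRO 6000 BSE v6 node would be the actual validation; a failure of the nvidia-smi call itself (non-zero exit) is also a useful signal that the driver did not load.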

@djsly
Collaborator

djsly commented May 28, 2026

Could you also extend the logic at the same time to support two CUDA drivers in AKS-GPU?

Adds two new NVIDIA driver container images alongside the existing ones,
following the same matrix-build pattern as the existing aks-gpu-cuda /
aks-gpu-cuda-arm64 split:

- `aks-gpu-grid-v20` (GRID v20, driver `595.58.03`): required for the
  RTX PRO 6000 Blackwell Server Edition v6 SKU family. Existing
  `aks-gpu-grid` (v18, 570.211.01) is unchanged.
- `aks-gpu-cuda-lts` and `aks-gpu-cuda-lts-arm64` (CUDA LTS, NVIDIA
  R580 LTSB `580.159.04`): a Long Term Support Branch variant alongside
  the existing Production Branch `aks-gpu-cuda` (595.71.05). R580 LTSB
  is supported by NVIDIA through Aug 2028.

Files:
- `driver_config.yml`: add `grid_v20` and `cuda_lts` blocks.
- `main.yaml` / `ci.yaml`: matrix-include both branches in each of the
  grid, cuda, and cuda-arm64 jobs. `DRIVER_KIND=grid` / `cuda` is
  literal; cache keys and image tags scope by image repo.
- `justfile`: add `buildgridv20` / `pushgridv20` and
  `buildcudalts` / `pushcudalts`.

Out of scope:
- `auto_update.py` — new variants pinned manually for now, same flow
  as the existing v18 pin (PR #154). A focused follow-up can extend the
  updater to handle each branch within its own major.
- AgentBaker consumption of the new images — separate PR.
- Mariner/Azure Linux/ACL GRID install paths — use separate
  package/sysext flows, unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@ganeshkumarashok ganeshkumarashok changed the title from "feat: add GRID v20 driver container for RTX PRO 6000 BSE v6 SKUs" to "feat: add GRID v20 and CUDA LTS driver containers" on May 28, 2026
Collaborator

@sulixu sulixu left a comment

LGTM

@ganeshkumarashok ganeshkumarashok merged commit 511547c into main May 29, 2026
7 checks passed

3 participants