chore(deps): update dependency nvidia-cuda-driver to v595 (#149)
Testing nvidia-cuda-driver 580.126.09 -> 595.71.05
Step 1: Login and set up environment

source az-login.sh env azcore-linux-k8s-dev
export INDEX="1"
export AZURE_RESOURCE_GROUP=""
export AZURE_REGION="eastus2"
export CLUSTER_NAME="aks-${INDEX}"
export NODE_POOL_VM_SIZE="Standard_NC24ads_A100_v4"
export NODE_POOL_NAME="gpu"
export NODE_POOL_NODE_COUNT=1
AZURE_SUBSCRIPTION_ID=$(az account show --query id --output tsv)
export AZURE_SUBSCRIPTION_ID
export K8S_VERSION="1.34"
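As an optional sanity check (not part of the original flow), confirm the login landed on the expected subscription before anything gets created:

# Optional: verify the subscription and region the remaining steps will use.
az account show --query '{name:name, id:id}' --output table
echo "Using subscription ${AZURE_SUBSCRIPTION_ID} in ${AZURE_REGION}"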
Step 2: Deploy AKS cluster with GPU node pool (driver=none)

Uses the aks-rdma-infiniband helper scripts to create the cluster and add a GPU node pool with the managed NVIDIA driver install skipped (--gpu-driver=none), so the driver from this PR can be installed manually.

git clone https://github.com/Azure/aks-rdma-infiniband
cd aks-rdma-infiniband
./tests/setup-infra/deploy-aks.sh deploy-aks &&
./tests/setup-infra/deploy-aks.sh add-nodepool --gpu-driver=none
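Optionally, fetch credentials and confirm the GPU node registered with the accelerator=nvidia label that Step 4 selects on (hedged: the helper scripts may already have fetched the kubeconfig):

# Optional: confirm the GPU node joined and is Ready before installing the driver.
az aks get-credentials --resource-group "${AZURE_RESOURCE_GROUP}" --name "${CLUSTER_NAME}"
kubectl get nodes -l accelerator=nvidia -o wide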
Step 3: Build aks-gpu image from PR #149 branch

Check out the PR branch and build a custom aks-gpu image with the new driver version.

gh pr checkout 149
DRIVER_VERSION="$(yq '.cuda.version' ./driver_config.yml)"
IMG="quay.io/surajd/aks-gpu:${DRIVER_VERSION}"
export DRIVER_VERSION
docker build --push \
--build-arg DRIVER_VERSION="${DRIVER_VERSION}" \
-t "$IMG" .Step 4: Install the driver on the GPU nodeGet a shell on the GPU node and run the aks-gpu container to install the NVIDIA driver. This replicates what AgentBaker does via GPU_NODE=$(kubectl get nodes -l accelerator=nvidia -o name)
Step 4: Install the driver on the GPU node

Get a shell on the GPU node and run the aks-gpu container to install the NVIDIA driver. This replicates what AgentBaker does during node provisioning.

GPU_NODE=$(kubectl get nodes -l accelerator=nvidia -o name)
kubectl debug "${GPU_NODE}" --image=ubuntu --profile=sysadmin -it -- chroot /host /bin/bash

Once on the node:

mkdir -p /opt/{actions,gpu}
IMG="quay.io/surajd/aks-gpu:595.71.05"
ctr image pull "$IMG"
ctr run --privileged \
--net-host \
--with-ns pid:/proc/1/ns/pid \
--mount type=bind,src=/opt/gpu,dst=/mnt/gpu,options=rbind \
--mount type=bind,src=/opt/actions,dst=/mnt/actions,options=rbind \
-t "$IMG" \
  gpuinstall /entrypoint.sh install
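Before reconfiguring containerd, it can be worth confirming the install step left the userspace tools on the host. What lands where depends on the aks-gpu image, so treat this as a hedged check rather than a contract:

ls /opt/gpu /opt/actions           # artifacts bind-mounted out of the installer container
which nvidia-smi nvidia-modprobe   # driver userspace tools should now be on the host PATH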
Step 5: Configure containerd to use nvidia-container-runtime

The aks-gpu container installs the driver together with the NVIDIA container toolkit (nvidia-container-runtime, nvidia-ctk). Still on the GPU node (debug pod / chroot):

which nvidia-container-runtime
nvidia-modprobe -u -c0
ldconfig

Use nvidia-ctk to make nvidia-container-runtime the default containerd runtime:
nvidia-ctk runtime configure --runtime=containerd --set-as-default

Restart containerd and kubelet to pick up the new config:

systemctl restart containerd kubelet
systemctl status containerd
systemctl status kubelet

Verify the drop-in config:

cat /etc/containerd/conf.d/99-nvidia.toml
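The exact contents nvidia-ctk writes depend on the containerd version on the node, but on a containerd 1.x (config version 2) host the drop-in looks roughly like the illustrative sketch below: an nvidia runtime entry pointing at nvidia-container-runtime, made the default by --set-as-default:

# Illustrative sketch only; the generated file varies by containerd version.
version = 2
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"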
Step 6: Verify driver and nvidia-ctk installation

nvidia-smi
nvidia-smi | grep 'Driver Version:'
# Expected driver version: 595.71.05
nvidia-ctk --version
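Two further hedged checks that don't go through nvidia-smi; both should report the same 595.71.05 build:

cat /proc/driver/nvidia/version    # kernel module build string
modinfo nvidia | grep '^version'   # expect 595.71.05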
Fabric Manager and Persistenced (NVSwitch SKUs only)

systemctl start nvidia-fabricmanager.service
systemctl status nvidia-fabricmanager.service
journalctl -u nvidia-fabricmanager.service
systemctl start nvidia-persistenced.service
systemctl status nvidia-persistenced.service
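If persistenced came up, persistence mode should flip to Enabled; a hedged way to confirm:

nvidia-smi --query-gpu=persistence_mode --format=csv,noheader   # expect Enabled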
Step 7: Test GPU workload

From outside the debug pod, deploy the NVIDIA device plugin and a test GPU pod.

Deploy the NVIDIA device plugin DaemonSet

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-resources
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: gpu-resources
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.19.1
        name: nvidia-device-plugin-ctr
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
EOF
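Before scheduling the test pod, confirm the plugin registered the GPU with kubelet (hedged: give the DaemonSet a moment to roll out; the custom-columns path escapes the dots in nvidia.com/gpu):

kubectl -n gpu-resources rollout status ds/nvidia-device-plugin-daemonset --timeout=2m
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'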
Deploy a test GPU pod

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: test-gpu
spec:
  restartPolicy: Never
  tolerations:
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: test-gpu
    image: pytorch/pytorch:latest
    command: ["/usr/bin/nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

Verify

kubectl wait --for=condition=Ready pod/test-gpu --timeout=5m || kubectl describe pod test-gpu
kubectl logs test-gpu

Expected output:

➜ kubectl logs test-gpu
Thu Apr 30 18:59:57 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.71.05 Driver Version: 595.71.05 CUDA Version: 13.2 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000001:00:00.0 Off | 0 |
| N/A 36C P0 43W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+

Cleanup

kubectl delete pod test-gpu --ignore-not-found
az group delete --name "${AZURE_RESOURCE_GROUP}" --yes --no-wait
This PR contains the following updates:

nvidia-cuda-driver: 580.126.09 → 595.71.05

Configuration
📅 Schedule: (UTC)
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.