
chore(deps): update dependency nvidia-cuda-driver to v595 (#149)

Merged
sulixu merged 1 commit into main from renovate/nvidia-cuda-driver-595.x
Apr 30, 2026

Conversation

@renovate bot (Contributor) commented Apr 29, 2026

This PR contains the following updates:

Package              Update   Change
nvidia-cuda-driver   major    580.126.09 -> 595.71.05

Configuration

📅 Schedule: (UTC)

  • Branch creation
    • At any time (no schedule defined)
  • Automerge
    • At any time (no schedule defined)

🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.

Rebasing: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.

🔕 Ignore: Close this PR and you won't be reminded about this update again.


  • If you want to rebase/retry this PR, check this box

This PR was generated by Mend Renovate. View the repository job log.

renovate bot force-pushed the renovate/nvidia-cuda-driver-595.x branch 4 times, most recently from eb282a8 to d29b382 on April 29, 2026 at 21:32, and then from d29b382 to 5f7aaab on April 30, 2026 at 04:04.
@surajssd (Member) commented:

Testing nvidia-cuda-driver 580.126.09 -> 595.71.05

  • Key verification: Whether the new 595.x driver installs correctly and GPU workloads function as expected.

Step 1: Login and set up environment

source az-login.sh env azcore-linux-k8s-dev

export INDEX="1"
export AZURE_RESOURCE_GROUP=""
export AZURE_REGION="eastus2"
export CLUSTER_NAME="aks-${INDEX}"
export NODE_POOL_VM_SIZE="Standard_NC24ads_A100_v4"
export NODE_POOL_NAME="gpu"
export NODE_POOL_NODE_COUNT=1
AZURE_SUBSCRIPTION_ID=$(az account show --query id --output tsv)
export AZURE_SUBSCRIPTION_ID
export K8S_VERSION="1.34"
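Before running the deployment scripts, a quick guard against empty variables can save a failed run. This is a minimal sketch; `require_env` is a hypothetical helper written for illustration, not part of the aks-rdma-infiniband repo:

```shell
# Fail fast if any required deployment variable is unset or empty.
# (Hypothetical helper; uses bash indirect expansion ${!var}.)
require_env() {
    local missing=0 var
    for var in "$@"; do
        if [ -z "${!var:-}" ]; then
            echo "ERROR: \$${var} is not set" >&2
            missing=1
        fi
    done
    return "$missing"
}

require_env AZURE_SUBSCRIPTION_ID AZURE_RESOURCE_GROUP AZURE_REGION \
    CLUSTER_NAME NODE_POOL_VM_SIZE NODE_POOL_NAME \
    || echo "Fix the variables above before continuing."
```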

Step 2: Deploy AKS cluster with GPU node pool (driver=none)

Uses the aks-rdma-infiniband helper scripts to create the cluster and add a GPU node pool with --gpu-driver=none so we can manually install the driver from the aks-gpu PR branch.

git clone https://github.com/Azure/aks-rdma-infiniband
cd aks-rdma-infiniband

./tests/setup-infra/deploy-aks.sh deploy-aks &&
    ./tests/setup-infra/deploy-aks.sh add-nodepool --gpu-driver=none

Step 3: Build aks-gpu image from PR #149 branch

Check out the PR branch and build a custom aks-gpu image with the new driver version.

gh pr checkout 149

DRIVER_VERSION="$(yq '.cuda.version' ./driver_config.yml)"
IMG="quay.io/surajd/aks-gpu:${DRIVER_VERSION}"
export DRIVER_VERSION
docker build --push \
    --build-arg DRIVER_VERSION="${DRIVER_VERSION}" \
    -t "$IMG" .

Step 4: Install the driver on the GPU node

Get a shell on the GPU node and run the aks-gpu container to install the NVIDIA driver. This replicates what AgentBaker does via configGPUDrivers() in cse_config.sh.

GPU_NODE=$(kubectl get nodes -l accelerator=nvidia -o name)
kubectl debug "${GPU_NODE}" --image=ubuntu --profile=sysadmin -it -- chroot /host /bin/bash

Once on the node:

mkdir -p /opt/{actions,gpu}
IMG="quay.io/surajd/aks-gpu:595.71.05"
ctr image pull "$IMG"
ctr run --privileged \
    --net-host \
    --with-ns pid:/proc/1/ns/pid \
    --mount type=bind,src=/opt/gpu,dst=/mnt/gpu,options=rbind \
    --mount type=bind,src=/opt/actions,dst=/mnt/actions,options=rbind \
    -t "$IMG" \
    gpuinstall /entrypoint.sh install

Step 5: Configure containerd to use nvidia-container-runtime

The aks-gpu container installs the driver and nvidia-container-runtime binary, but does not update containerd's config. In AgentBaker this is done beforehand via a pre-rendered containerd config template (containerd.toml.gtpl). Without this step, GPU workload pods will fail with:

exec /usr/bin/nvidia-smi: no such file or directory

Still on the GPU node (debug pod / chroot):

which nvidia-container-runtime
nvidia-modprobe -u -c0
ldconfig

Use nvidia-ctk to configure containerd. This creates a drop-in file at /etc/containerd/conf.d/99-nvidia.toml that sets nvidia as the default runtime.

WARNING: Do not use sed to patch config.toml directly. The AKS containerd config uses imports = ["/etc/containerd/conf.d/*.toml"] and raw edits can break the CRI plugin, causing unknown service runtime.v1.RuntimeService errors that take the node offline.

nvidia-ctk runtime configure --runtime=containerd --set-as-default

Restart containerd and kubelet to pick up the new config:

systemctl restart containerd kubelet
systemctl status containerd
systemctl status kubelet

Verify the drop-in config:

cat /etc/containerd/conf.d/99-nvidia.toml
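For reference, the generated drop-in should look roughly like the following. This is a sketch of typical nvidia-ctk output, not verbatim contents; the exact file varies with the nvidia-ctk and containerd versions:

```toml
version = 2

[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
```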

Step 6: Verify driver and nvidia-ctk installation

nvidia-smi
nvidia-smi | grep 'Driver Version:'
# Expected driver version: 595.71.05

nvidia-ctk --version
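To turn the manual version check into a pass/fail assertion, a small wrapper helps. A sketch: the expected value comes from driver_config.yml in Step 3, and `--query-gpu=driver_version` is a standard nvidia-smi query flag:

```shell
# Compare the installed driver version against the one under test.
check_driver_version() {
    local expected="$1" actual
    actual=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
    if [ "$actual" = "$expected" ]; then
        echo "driver OK: $actual"
    else
        echo "driver mismatch: expected $expected, got $actual" >&2
        return 1
    fi
}

check_driver_version "595.71.05" || true
```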

Fabric Manager and Persistenced (NVSwitch SKUs only)

NOTE: Fabric Manager only works on NVSwitch SKUs (ND96 H100, ND96 A100, etc.). On PCIe single-GPU SKUs like Standard_NC24ads_A100_v4 it will fail with NV_WARN_NOTHING_TO_DO -- that is expected and can be ignored.

NOTE: nvidia-persistenced is only available on NVSwitch SKUs (ND-series). On PCIe single-GPU SKUs the service does not exist.

systemctl start nvidia-fabricmanager.service
systemctl status nvidia-fabricmanager.service
journalctl -u nvidia-fabricmanager.service

systemctl start nvidia-persistenced.service
systemctl status nvidia-persistenced.service

Step 7: Test GPU workload

From outside the debug pod, deploy the NVIDIA device plugin and a test GPU pod.

Deploy the NVIDIA device plugin DaemonSet

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-resources
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: gpu-resources
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.19.1
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
EOF

Deploy a test GPU pod

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: test-gpu
spec:
  restartPolicy: Never
  tolerations:
    - key: "sku"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
  containers:
  - name: test-gpu
    image: pytorch/pytorch:latest
    command: ["/usr/bin/nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

Verify

kubectl wait --for=condition=Ready pod/test-gpu --timeout=5m || kubectl describe pod test-gpu
kubectl logs test-gpu

Expected output:


Thu Apr 30 18:59:57 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.71.05              Driver Version: 595.71.05      CUDA Version: 13.2     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          Off |   00000001:00:00.0 Off |                    0 |
| N/A   36C    P0             43W /  300W |       0MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
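Rather than eyeballing the table, the same check can be scripted. A sketch, reusing the pod name and driver version from the steps above:

```shell
# Assert that the test pod's nvidia-smi output reports the expected driver.
expect_driver_in_logs() {
    local pod="$1" version="$2"
    kubectl logs "$pod" | grep -q "Driver Version: ${version}"
}

if expect_driver_in_logs test-gpu 595.71.05; then
    echo "PASS: pod reports driver 595.71.05"
else
    echo "FAIL: expected driver version not found in pod logs" >&2
fi
```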

Cleanup

kubectl delete pod test-gpu --ignore-not-found
az group delete --name "${AZURE_RESOURCE_GROUP}" --yes --no-wait

sulixu merged commit 45135be into main on Apr 30, 2026
4 checks passed
renovate bot deleted the renovate/nvidia-cuda-driver-595.x branch on April 30, 2026 at 19:26