chore(deps): update dependency nvidia-cuda-driver to v595 (#149)
Testing nvidia-cuda-driver 580.126.09 -> 595.71.05
Step 1: Login and set up environment

source az-login.sh env azcore-linux-k8s-dev
export INDEX="1"
export AZURE_RESOURCE_GROUP=""
export AZURE_REGION="eastus2"
export CLUSTER_NAME="aks-${INDEX}"
export NODE_POOL_VM_SIZE="Standard_NC24ads_A100_v4"
export NODE_POOL_NAME="gpu"
export NODE_POOL_NODE_COUNT=1
AZURE_SUBSCRIPTION_ID=$(az account show --query id --output tsv)
export AZURE_SUBSCRIPTION_ID
export K8S_VERSION="1.34"
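As an optional sanity check (not part of the original flow), confirm the login landed on the expected subscription before anything gets created:

# Optional: verify the subscription and region the remaining steps will use.
az account show --query '{name:name, id:id}' --output table
echo "Using subscription ${AZURE_SUBSCRIPTION_ID} in ${AZURE_REGION}"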
Step 2: Deploy AKS cluster with GPU node pool (driver=none)

Uses the aks-rdma-infiniband helper scripts to create the cluster and add a GPU node pool with the managed NVIDIA driver install skipped (--gpu-driver=none), so the driver from this PR can be installed manually.

git clone https://github.com/Azure/aks-rdma-infiniband
cd aks-rdma-infiniband
./tests/setup-infra/deploy-aks.sh deploy-aks &&
./tests/setup-infra/deploy-aks.sh add-nodepool --gpu-driver=none
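Optionally, fetch credentials and confirm the GPU node registered with the accelerator=nvidia label that Step 4 selects on (hedged: the helper scripts may already have fetched the kubeconfig):

# Optional: confirm the GPU node joined and is Ready before installing the driver.
az aks get-credentials --resource-group "${AZURE_RESOURCE_GROUP}" --name "${CLUSTER_NAME}"
kubectl get nodes -l accelerator=nvidia -o wide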
Step 3: Build aks-gpu image from PR #149 branch

Check out the PR branch and build a custom aks-gpu image with the new driver version.

gh pr checkout 149
DRIVER_VERSION="$(yq '.cuda.version' ./driver_config.yml)"
IMG="quay.io/surajd/aks-gpu:${DRIVER_VERSION}"
export DRIVER_VERSION
docker build --push \
--build-arg DRIVER_VERSION="${DRIVER_VERSION}" \
-t "$IMG" .Step 4: Install the driver on the GPU nodeGet a shell on the GPU node and run the aks-gpu container to install the NVIDIA driver. This replicates what AgentBaker does via GPU_NODE=$(kubectl get nodes -l accelerator=nvidia -o name)
Step 4: Install the driver on the GPU node

Get a shell on the GPU node and run the aks-gpu container to install the NVIDIA driver. This replicates what AgentBaker does during node provisioning.

GPU_NODE=$(kubectl get nodes -l accelerator=nvidia -o name)
kubectl debug "${GPU_NODE}" --image=ubuntu --profile=sysadmin -it -- chroot /host /bin/bash

Once on the node:

mkdir -p /opt/{actions,gpu}
IMG="quay.io/surajd/aks-gpu:595.71.05"
ctr image pull "$IMG"
ctr run --privileged \
--net-host \
--with-ns pid:/proc/1/ns/pid \
--mount type=bind,src=/opt/gpu,dst=/mnt/gpu,options=rbind \
--mount type=bind,src=/opt/actions,dst=/mnt/actions,options=rbind \
-t "$IMG" \
  gpuinstall /entrypoint.sh install
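Before reconfiguring containerd, it can be worth confirming the install step left the userspace tools on the host. What lands where depends on the aks-gpu image, so treat this as a hedged check rather than a contract:

ls /opt/gpu /opt/actions           # artifacts bind-mounted out of the installer container
which nvidia-smi nvidia-modprobe   # driver userspace tools should now be on the host PATH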
Step 5: Configure containerd to use nvidia-container-runtime

The aks-gpu container installs the driver together with the NVIDIA container toolkit (nvidia-container-runtime, nvidia-ctk). Still on the GPU node (debug pod / chroot):

which nvidia-container-runtime
nvidia-modprobe -u -c0
ldconfig

Use nvidia-ctk to make nvidia-container-runtime the default containerd runtime:
nvidia-ctk runtime configure --runtime=containerd --set-as-default

Restart containerd and kubelet to pick up the new config:

systemctl restart containerd kubelet
systemctl status containerd
systemctl status kubelet

Verify the drop-in config:

cat /etc/containerd/conf.d/99-nvidia.toml
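The exact contents nvidia-ctk writes depend on the containerd version on the node, but on a containerd 1.x (config version 2) host the drop-in looks roughly like the illustrative sketch below: an nvidia runtime entry pointing at nvidia-container-runtime, made the default by --set-as-default:

# Illustrative sketch only; the generated file varies by containerd version.
version = 2
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"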
Step 6: Verify driver and nvidia-ctk installation

nvidia-smi
nvidia-smi | grep 'Driver Version:'
# Expected driver version: 595.71.05
nvidia-ctk --version
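Two further hedged checks that don't go through nvidia-smi; both should report the same 595.71.05 build:

cat /proc/driver/nvidia/version    # kernel module build string
modinfo nvidia | grep '^version'   # expect 595.71.05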
Fabric Manager and Persistenced (NVSwitch SKUs only)

systemctl start nvidia-fabricmanager.service
systemctl status nvidia-fabricmanager.service
journalctl -u nvidia-fabricmanager.service
systemctl start nvidia-persistenced.service
systemctl status nvidia-persistenced.service
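If persistenced came up, persistence mode should flip to Enabled; a hedged way to confirm:

nvidia-smi --query-gpu=persistence_mode --format=csv,noheader   # expect Enabled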
Step 7: Test GPU workload

From outside the debug pod, deploy the NVIDIA device plugin and a test GPU pod.

Deploy the NVIDIA device plugin DaemonSet

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-resources
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: gpu-resources
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.19.1
        name: nvidia-device-plugin-ctr
        env:
        - name: FAIL_ON_INIT_ERROR
          value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
EOF
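Before scheduling the test pod, confirm the plugin registered the GPU with kubelet (hedged: give the DaemonSet a moment to roll out; the custom-columns path escapes the dots in nvidia.com/gpu):

kubectl -n gpu-resources rollout status ds/nvidia-device-plugin-daemonset --timeout=2m
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'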
Deploy a test GPU pod

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: test-gpu
spec:
  restartPolicy: Never
  tolerations:
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"
  containers:
  - name: test-gpu
    image: pytorch/pytorch:latest
    command: ["/usr/bin/nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

Verify

kubectl wait --for=condition=Ready pod/test-gpu --timeout=5m || kubectl describe pod test-gpu
kubectl logs test-gpu

Expected output:

➜ kubectl logs test-gpu
Thu Apr 30 18:59:57 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.71.05 Driver Version: 595.71.05 CUDA Version: 13.2 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A100 80GB PCIe Off | 00000001:00:00.0 Off | 0 |
| N/A 36C P0 43W / 300W | 0MiB / 81920MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+

Cleanup

kubectl delete pod test-gpu --ignore-not-found
az group delete --name "${AZURE_RESOURCE_GROUP}" --yes --no-wait
This PR contains the following updates:

nvidia-cuda-driver: 580.126.09 → 595.71.05

Configuration
📅 Schedule: (UTC)
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.