Description
Hello,
During E2E testing of the changes in the GPU Operator to support COS (https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/1061), I found that for the container to discover the NVIDIA libraries, specific PATH/LD_LIBRARY_PATH values are required in the pod spec.
After the pods are running:
$ kubectl get pods -n gpu-operator
NAME                                                       READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-rr2x2                                1/1     Running     0          4h16m
gpu-operator-66575c8958-sslch                              1/1     Running     0          4h16m
noperator-node-feature-discovery-gc-6968c7c64-g7w7r        1/1     Running     0          4h16m
noperator-node-feature-discovery-master-749679f664-dvs48   1/1     Running     0          4h16m
noperator-node-feature-discovery-worker-glhxw              1/1     Running     0          4h16m
nvidia-container-toolkit-daemonset-wvpvx                   1/1     Running     0          4h16m
nvidia-cuda-validator-z84ks                                0/1     Completed   0          4h15m
nvidia-dcgm-exporter-9r87v                                 1/1     Running     0          4h16m
nvidia-device-plugin-daemonset-fp7hm                       1/1     Running     0          4h16m
nvidia-operator-validator-hstkb                            1/1     Running     0          4h16m
Then deploy the GPU workload:
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-base-ubuntu20.04
    command: ["bash", "-c"]
    args:
    - |-
      # export PATH="$PATH:/home/kubernetes/bin/nvidia/bin";
      # export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/kubernetes/bin/nvidia/lib64;
      nvidia-smi;
    resources:
      limits:
        nvidia.com/gpu: "1"
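(For what it's worth, the same workaround can be written declaratively via the pod's env field instead of inline exports. This is only a sketch of the workaround, not a fix for the underlying issue; the PATH has to be spelled out in full because an env entry cannot reference the image's own PATH:)
apiVersion: v1
kind: Pod
metadata:
  name: my-gpu-pod-env   # hypothetical variant of the pod above
spec:
  containers:
  - name: my-gpu-container
    image: nvidia/cuda:11.0.3-base-ubuntu20.04
    command: ["bash", "-c"]
    args: ["nvidia-smi"]
    env:
    # Workaround only: point the container at the NVIDIA binaries and
    # libraries installed on the COS host.
    - name: PATH
      value: "/home/kubernetes/bin/nvidia/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
    - name: LD_LIBRARY_PATH
      value: "/home/kubernetes/bin/nvidia/lib64"
    resources:
      limits:
        nvidia.com/gpu: "1"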
I looked at the OCI spec of the container; the PATH is PATH=/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin, so /home/kubernetes/bin/nvidia/bin is not on it.
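(For anyone reproducing this, one way to inspect the container's OCI process environment on the node, assuming a CRI runtime with crictl installed; the exact JSON layout varies by runtime:)
$ crictl ps | grep my-gpu-container        # find the container ID
$ crictl inspect <container-id> | grep -A 15 '"env"'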
In the case of GKE's device plugin, we expect the nvidia bin to be under /usr/local (https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/145797868c0f6bd6a0f37c0295f06dfe5fa94265/cmd/nvidia_gpu/nvidia_gpu.go#L42).
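For context, the kubelet device plugin API (v1beta1) lets a plugin return Mounts in its Allocate response, which appears to be how GKE's plugin exposes the host's NVIDIA directory at a well-known container path. A minimal, hypothetical sketch of that mechanism — not the actual k8s-device-plugin code; the plugin type and paths below are only illustrative, taken from the GKE example:

// Hypothetical sketch: a device plugin can ask the kubelet to bind-mount
// a host directory into every allocated container by returning Mounts in
// its Allocate response (k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1).
package sketch

import (
	"context"

	pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

type nvidiaPlugin struct{}

// Allocate is called by the kubelet when a container requests nvidia.com/gpu.
func (p *nvidiaPlugin) Allocate(ctx context.Context, req *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
	resp := &pluginapi.AllocateResponse{}
	for range req.ContainerRequests {
		resp.ContainerResponses = append(resp.ContainerResponses, &pluginapi.ContainerAllocateResponse{
			// Expose the host's NVIDIA install dir inside the container at a
			// path that is already on the image's default PATH.
			Mounts: []*pluginapi.Mount{{
				ContainerPath: "/usr/local/nvidia",
				HostPath:      "/home/kubernetes/bin/nvidia", // COS host path
				ReadOnly:      true,
			}},
		})
	}
	return resp, nil
}

With a mount like this, /usr/local/nvidia/bin would sit on the image's default PATH shown above, so the per-pod exports become unnecessary.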
Is there something similar we can configure in the k8s device plugin as well, so that a container path under /usr/local is mounted from the NVIDIA bin directory on the host, which is /home/kubernetes/bin/nvidia? Thanks