1. Issue or feature description

When running with containerd, the `k8s-device-plugin` requires the runtime to be set to `nvidia-container-runtime`. If it is set to `runc`, the daemonset deploys but sees 0 graphics cards and therefore sets `nvidia.com/gpu: 0` on the node.

I'm assuming this is the auto-detection looking at which runtime is configured (maybe a hangover from the way Docker works, where the runtimes have to be defined in advance).
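For anyone reproducing this, one way to check what the node is advertising (the node name below is a placeholder) is:

```
kubectl describe node <node-name> | grep -i 'nvidia.com/gpu'
```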
Setting `nvidia-container-runtime` as the runtime breaks other consumers of containerd, such as `ctr`:
```
$ sudo ctr run --rm --gpus 0 docker.io/nvidia/cuda:9.0-base nvidia-smi nvidia-smi
ctr: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=9.0 --pid=26228 /run/containerd/io.containerd.runtime.v1.linux/default/nvidia-smi/rootfs]\\\\nnvidia-container-cli: mount error: file creation failed: /run/containerd/io.containerd.runtime.v1.linux/default/nvidia-smi/rootfs/run/nvidia-persistenced/socket: no such device or address\\\\n\\\"\"": unknown
```
Setting it back to `runc` fixes `ctr`, but then breaks the `k8s-device-plugin` as explained above:
```
$ sudo ctr run --rm --gpus 0 docker.io/nvidia/cuda:9.0-base nvidia-smi nvidia-smi
Wed Apr  3 11:00:43 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.04    Driver Version: 418.40.04    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   49C    P8    27W / 149W |      0MiB / 11441MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
I believe the fix here is to find a better way for the `k8s-device-plugin` to detect the presence of GPUs on the node. From my understanding, containerd handling GPUs this way with the `runc` runtime is by design.
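As an illustration only (not the plugin's actual code), a runtime-independent check could look for NVIDIA device nodes under `/dev` instead of inferring anything from the configured OCI runtime. `hasNvidiaGPU` below is a hypothetical helper; real detection would more likely go through NVML:

```go
package main

import (
	"fmt"
	"path/filepath"
)

// hasNvidiaGPU is a hypothetical, illustrative check: it looks for NVIDIA
// device nodes (/dev/nvidia0, /dev/nvidia1, ...) rather than relying on which
// OCI runtime containerd is configured with.
func hasNvidiaGPU() (bool, error) {
	matches, err := filepath.Glob("/dev/nvidia[0-9]*")
	if err != nil {
		return false, err
	}
	return len(matches) > 0, nil
}

func main() {
	ok, err := hasNvidiaGPU()
	if err != nil {
		fmt.Println("error checking for GPUs:", err)
		return
	}
	fmt.Println("GPU present:", ok)
}
```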
2. Steps to reproduce the issue
1. Install containerd with the default config.
2. Start kubelet, bound to containerd.
3. Deploy the `k8s-device-plugin` daemonset.
4. See that `nvidia.com/gpu` is always set to 0 on the node.
5. Switch the runtime: `sudo sed -i 's/runtime = "runc"/runtime = "nvidia-container-runtime"/g' /etc/containerd/config.toml` (the config snippet after these steps shows the line this rewrites).
6. Restart containerd and kubelet.
7. See that `nvidia.com/gpu` is now set to 1 (in my case, with 1 GPU).
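For reference, this is the line the sed command in step 5 rewrites. I'm assuming the containerd 1.2.x default config layout here; the section names and keys may differ on other versions:

```toml
# /etc/containerd/config.toml (containerd 1.2.x default layout; may differ by version)
[plugins.linux]
  shim = "containerd-shim"
  # rewritten from "runc" by the sed command in step 5
  runtime = "nvidia-container-runtime"
  runtime_root = ""
  no_shim = false
  shim_debug = false
```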