Requires nvidia-container-runtime to be set as containerd runtime #107

Closed

joedborg opened this issue Apr 3, 2019 · 2 comments

joedborg commented Apr 3, 2019

1. Issue or feature description

When running with containerd, the k8s-device-plugin requires the runtime to be set to nvidia-container-runtime. If it's set to runc, the daemonset deploys but detects 0 GPUs and therefore advertises nvidia.com/gpu: 0 on the node.
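
You can confirm what the node is advertising with kubectl (the node name below is a placeholder); with the runtime left as runc, both Capacity and Allocatable report 0:

$ kubectl describe node <node-name> | grep nvidia.com/gpu
  nvidia.com/gpu:  0
  nvidia.com/gpu:  0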

I'm assuming this is the auto-detection looking at which runtime is configured (maybe a hangover from the way Docker works, where you have to define the runtimes in advance).

When setting nvidia-container-runtime as the runtime, it breaks other consumers of containerd, such as ctr:

$ sudo ctr run --rm --gpus 0 docker.io/nvidia/cuda:9.0-base nvidia-smi nvidia-smi
ctr: OCI runtime create failed: container_linux.go:345: starting container process caused "process_linux.go:424: container init caused \"process_linux.go:407: running prestart hook 1 caused \\\"error running hook: exit status 1, stdout: , stderr: exec command: [/usr/bin/nvidia-container-cli --load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=9.0 --pid=26228 /run/containerd/io.containerd.runtime.v1.linux/default/nvidia-smi/rootfs]\\\\nnvidia-container-cli: mount error: file creation failed: /run/containerd/io.containerd.runtime.v1.linux/default/nvidia-smi/rootfs/run/nvidia-persistenced/socket: no such device or address\\\\n\\\"\"": unknown

Setting it back to runc fixes this (but then breaks the k8s-device-plugin, as explained above):

$ sudo ctr run --rm --gpus 0 docker.io/nvidia/cuda:9.0-base nvidia-smi nvidia-smi
Wed Apr  3 11:00:43 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.04    Driver Version: 418.40.04    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   49C    P8    27W / 149W |      0MiB / 11441MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I believe the fix here is to find a better way for the k8s-device-plugin to detect the presence of GPUs on the nodes. From my understanding, containerd handling GPUs through the runc runtime (via the --gpus flag) is by design.
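
Something runtime-independent would be enough here, e.g. checking the host for NVIDIA device nodes or asking the driver to enumerate GPUs (rough sketch, output illustrative; both only need the NVIDIA driver on the host and don't care which containerd runtime is configured):

$ ls /dev/nvidia*
/dev/nvidia0  /dev/nvidiactl  /dev/nvidia-uvm
$ nvidia-smi -L
GPU 0: Tesla K80 (UUID: GPU-...)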

2. Steps to reproduce the issue

  1. Install containerd with default config.
  2. Start kubelet, bind to containerd.
  3. Deploy daemonset.
  4. See that nvidia.com/gpu is always set to 0.
  5. sudo sed -i 's/runtime = "runc"/runtime = "nvidia-container-runtime"/g' /etc/containerd/config.toml (resulting config sketched after this list)
  6. Restart containerd and kubelet.
  7. See that nvidia.com/gpu is set to 1 (in my case, with 1 GPU).
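
For reference, the line that the sed in step 5 rewrites lives in containerd's default config; the exact section varies by containerd version, but with a 1.2-era default config it ends up looking roughly like this:

$ grep -A 2 '\[plugins.linux\]' /etc/containerd/config.toml
[plugins.linux]
  shim = "containerd-shim"
  runtime = "nvidia-container-runtime"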

joedborg commented Apr 3, 2019

Bug also logged against containerd: containerd/containerd#3151

RenaudWasTaken (Contributor) commented

Closing this, as the original problem was solved.
