
OCI runtime create failed #68

Open
hdwmp123 opened this issue Mar 6, 2023 · 6 comments

@hdwmp123

hdwmp123 commented Mar 6, 2023

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/moby/665f824c082a491ade73297c4deaa21441b43ffa827367e3302d5efaa332fade/log.json: no such file or directory): fork/exec /tmp/.X11-unix: permission denied: <nil>: unknown.
ubuntu 20.04
CUDA Version: 11.2
NVIDIA-SMI 460.106.00
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.106.00   Driver Version: 460.106.00   CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 308...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   41C    P0    29W /  N/A |     10MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1255      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      2426      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
sudo nvidia-docker run -tid -p 8888:8888 \
    --hostname deepfakes-gpu --name deepfakes-gpu \
    -v /home/administrator/data/deepfakes:/root/faceswap \
    -v /tmp/.X11-unix:/tmp/.X11-unix \
    -e DISPLAY=unix$DISPLAY \
    -e AUDIO_GID=`getent group audio | cut -d: -f3` \
    -e VIDEO_GID=`getent group video | cut -d: -f3` \
    -e GID=`id -g` \
    -e UID=`id -u` \
    deepfakes-gpu
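The permission-denied error points at the bind-mounted X11 socket directory. A quick pre-flight check before running the command above might look like this (check_x11_socket is a hypothetical helper name, not part of any NVIDIA tooling):

```shell
# Sketch: verify the X11 socket directory exists before bind-mounting it
# into the container. This is an illustrative check, not official tooling.
check_x11_socket() {
    dir="${1:-/tmp/.X11-unix}"
    if [ -d "$dir" ]; then
        ls -ld "$dir"    # typically drwxrwxrwt, owned by root
    else
        echo "$dir not found; is an X server running?" >&2
        return 1
    fi
}
check_x11_socket || echo "consider dropping the -v /tmp/.X11-unix mount"
```

If the directory is missing or has restrictive permissions, the bind mount inside the container will not be usable by the unprivileged container user.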
@elezar
Member

elezar commented Mar 6, 2023

Hi @hdwmp123. Which version of the NVIDIA Container Toolkit are you using?

@MaxiBoether

MaxiBoether commented May 24, 2023

Hi @elezar

I am currently facing the same, or at least a very similar, error in a rootless Docker setup:

docker run --runtime=nvidia hello-world
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/user/5004/docker/containerd/daemon/io.containerd.runtime.v2.task/moby/41d7c05c72f1d1113886eede371c1cdd745c6e3bf36011c38b40c2f50e547459/log.json: no such file or directory): /usr/bin/nvidia-container-runtime did not terminate successfully: exit status 1: unknown.
ERRO[0000] error waiting for container:

When using runc as the runtime, the hello world container works.

We are using version 1.13. When running /usr/bin/nvidia-container-runtime run hello-world (I am not sure what this does), we get the error ERRO[0000] runc run failed: JSON specification file config.json not found. If we create a config.json file with {} as its content in the current working directory, we get ERRO[0000] runc run failed: process property must not be empty. I don't know exactly what is going wrong here; maybe you have an idea?

@elezar
Member

elezar commented May 24, 2023

@MaxiBoether the nvidia-container-runtime is a shim for runc or another OCI-compliant runtime and does not implement the docker CLI.
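To illustrate why the bare invocation fails: runc run (which the shim wraps) expects an OCI bundle directory containing a config.json with a populated process section, not an empty JSON object. A sketch, assuming runc is installed and using an illustrative bundle path:

```shell
# Sketch: a minimal OCI bundle layout as consumed by runc (and hence by the
# nvidia-container-runtime shim). `runc spec` writes a template config.json
# containing the process section that an empty {} file is missing.
bundle=./oci-bundle-demo
mkdir -p "$bundle/rootfs"
if command -v runc >/dev/null 2>&1; then
    (cd "$bundle" && runc spec && grep -m1 '"args"' config.json)
else
    echo "runc not installed; skipping spec generation"
fi
```

A real bundle would additionally need a populated rootfs before the container could actually run.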

Please provide the following:

  1. More information about your platform including the output from nvidia-smi on the host
  2. Enable debug logging by uncommenting / modifying the #debug = lines in the /etc/nvidia-container-runtime/config.toml file. You can also bump the runtime's log-level to "debug" to produce more logs. Please attach the generated nvidia-container-toolkit.log and nvidia-container-runtime.log files to the issue.
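On a scratch copy, step 2 might look like this (the real file lives at /etc/nvidia-container-runtime/config.toml; the snippet below is a minimal illustration, and sed -i assumes GNU sed on Linux):

```shell
# Sketch: enable debug logging in a scratch copy of config.toml.
cfg=./config.toml.demo
cat > "$cfg" <<'EOF'
[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
EOF
# Uncomment the debug line and raise the log level to "debug".
sed -i 's|^#debug|debug|; s|^log-level = "info"|log-level = "debug"|' "$cfg"
grep -E '^(debug|log-level)' "$cfg"
```

After making the equivalent change to the real file, re-run the failing container so the runtime writes fresh log entries.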

@MaxiBoether

MaxiBoether commented May 24, 2023

@elezar

Yes, I understand that it does not implement the docker CLI. I only ran that experiment because the output of docker run --runtime=nvidia hello-world (given above) mentioned exit status 1 from /usr/bin/nvidia-container-runtime, and I wanted to investigate since there was no clear error message.

maxilocal4@sgs-gpu04:~$ nvidia-smi
Wed May 24 14:34:09 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090         On | 00000000:23:00.0 Off |                  N/A |
|  0%   62C    P2              247W / 350W|  14231MiB / 24576MiB |     59%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090         On | 00000000:41:00.0 Off |                  N/A |
|  0%   58C    P2              243W / 350W|  14357MiB / 24576MiB |     75%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090         On | 00000000:A1:00.0 Off |                  N/A |
|  0%   62C    P2              283W / 350W|  14587MiB / 24576MiB |     71%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090         On | 00000000:C1:00.0 Off |                  N/A |
|  0%   59C    P2              254W / 350W|  14587MiB / 24576MiB |     66%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

maxilocal4@sgs-gpu04:~$ uname -a
Linux sgs-gpu04.ethz.ch 5.15.0-67-generic #74-Ubuntu SMP Wed Feb 22 14:14:39 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

This is the output of nvidia-smi and uname -a on the host machine. I hope that helps.

  2. My config.toml looks like this:

disable-require = false

[nvidia-container-cli]
environment = []
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
no-cgroups = true

[nvidia-container-runtime]
debug = "/local/home/maxilocal4/log/nvidia-container-runtime.log"
# levels => debug, info, warning, error
log-level = "debug"

runtimes = ["runc"]
mode = "auto"

[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

The log files are not created for me, probably because the containers never start in the first place?

Note that this specifically affects rootless Docker: running Docker as root works, and rootless Docker with the runc runtime also works.

@MaxiBoether

Okay, we figured out what the problem was: the debug log file was not writable due to filesystem permissions. Maybe it would be good to add a more verbose error message when writing to the log file fails?
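Until such a message exists, a pre-flight check of the configured debug path can catch this class of failure (the path below is illustrative, not the real default):

```shell
# Sketch: verify the debug log path from config.toml is writable before
# starting containers. The path here is illustrative only.
logfile=./demo-nvidia-container-runtime.log
if touch "$logfile" 2>/dev/null; then
    echo "log path OK: $logfile"
else
    echo "warning: cannot write $logfile; fix permissions or disable debug" >&2
fi
```

In a rootless setup, the configured path must be writable by the unprivileged user running the Docker daemon, not just by root.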

@elezar
Member

elezar commented May 25, 2023

@MaxiBoether do you mean that the original error was caused by the log file not being writable, or the fact that the log wasn't being generated?

Update: I have created https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/merge_requests/404 that ignores the error when opening / creating any log files. Would you be able to test these changes and verify that they stop the behaviour you were seeing?

@elezar elezar transferred this issue from NVIDIA/nvidia-docker Jun 7, 2023