
OCI runtime create failed #68

Open
hdwmp123 opened this issue Mar 6, 2023 · 6 comments

@hdwmp123

hdwmp123 commented Mar 6, 2023

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/containerd/io.containerd.runtime.v2.task/moby/665f824c082a491ade73297c4deaa21441b43ffa827367e3302d5efaa332fade/log.json: no such file or directory): fork/exec /tmp/.X11-unix: permission denied: <nil>: unknown.
ubuntu 20.04
CUDA Version: 11.2
NVIDIA-SMI 460.106.00
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.106.00   Driver Version: 460.106.00   CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 308...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   41C    P0    29W /  N/A |     10MiB / 16125MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1255      G   /usr/lib/xorg/Xorg                  4MiB |
|    0   N/A  N/A      2426      G   /usr/lib/xorg/Xorg                  4MiB |
+-----------------------------------------------------------------------------+
sudo nvidia-docker run -tid -p 8888:8888 \
    --hostname deepfakes-gpu --name deepfakes-gpu \
    -v /home/administrator/data/deepfakes:/root/faceswap \
    -v /tmp/.X11-unix:/tmp/.X11-unix \
    -e DISPLAY=unix$DISPLAY \
    -e AUDIO_GID=`getent group audio | cut -d: -f3` \
    -e VIDEO_GID=`getent group video | cut -d: -f3` \
    -e GID=`id -g` \
    -e UID=`id -u` \
    deepfakes-gpu
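The permission-denied error points at the bind-mounted X11 socket directory. A quick pre-flight check before running the command above might look like this (check_x11_socket is a hypothetical helper name, not part of any NVIDIA tooling):

```shell
# Sketch: verify the X11 socket directory exists before bind-mounting it
# into the container. This is an illustrative check, not official tooling.
check_x11_socket() {
    dir="${1:-/tmp/.X11-unix}"
    if [ -d "$dir" ]; then
        ls -ld "$dir"    # typically drwxrwxrwt, owned by root
    else
        echo "$dir not found; is an X server running?" >&2
        return 1
    fi
}
check_x11_socket || echo "consider dropping the -v /tmp/.X11-unix mount"
```

If the directory is missing or has restrictive permissions, the bind mount inside the container will not be usable by the unprivileged container user.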
@elezar
Member

elezar commented Mar 6, 2023

Hi @hdwmp123. Which version of the NVIDIA Container Toolkit are you using?

@MaxiBoether

MaxiBoether commented May 24, 2023

Hi @elezar

I am currently facing the same, or at least a very similar, error in a rootless Docker setup:

docker run --runtime=nvidia hello-world
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: unable to retrieve OCI runtime error (open /run/user/5004/docker/containerd/daemon/io.containerd.runtime.v2.task/moby/41d7c05c72f1d1113886eede371c1cdd745c6e3bf36011c38b40c2f50e547459/log.json: no such file or directory): /usr/bin/nvidia-container-runtime did not terminate successfully: exit status 1: unknown.
ERRO[0000] error waiting for container:

When using runc as the runtime, the hello world container works.

We are using version 1.13. When running /usr/bin/nvidia-container-runtime run hello-world (I am not sure what this does), we get the error ERRO[0000] runc run failed: JSON specification file config.json not found. If we create a config.json file with {} as its content in the current working directory, we get ERRO[0000] runc run failed: process property must not be empty. I don't know exactly what is going wrong here; maybe you have an idea?

@elezar
Member

elezar commented May 24, 2023

@MaxiBoether the nvidia-container-runtime is a shim for runc or another OCI-compliant runtime and does not implement the docker CLI.
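To illustrate why the bare invocation fails: runc run (which the shim wraps) expects an OCI bundle directory containing a config.json with a populated process section, not an empty JSON object. A sketch, assuming runc is installed and using an illustrative bundle path:

```shell
# Sketch: a minimal OCI bundle layout as consumed by runc (and hence by the
# nvidia-container-runtime shim). `runc spec` writes a template config.json
# containing the process section that an empty {} file is missing.
bundle=./oci-bundle-demo
mkdir -p "$bundle/rootfs"
if command -v runc >/dev/null 2>&1; then
    (cd "$bundle" && runc spec && grep -m1 '"args"' config.json)
else
    echo "runc not installed; skipping spec generation"
fi
```

A real bundle would additionally need a populated rootfs before the container could actually run.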

Please provide the following:

  1. More information about your platform including the output from nvidia-smi on the host
  2. Enable debug logging by uncommenting / modifying the #debug = lines in the /etc/nvidia-container-runtime/config.toml file. You can also bump the runtime's log-level to "debug" to produce more logs. Please attach the generated nvidia-container-toolkit.log and nvidia-container-runtime.log files to the issue.
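On a scratch copy, step 2 might look like this (the real file lives at /etc/nvidia-container-runtime/config.toml; the snippet below is a minimal illustration, and sed -i assumes GNU sed on Linux):

```shell
# Sketch: enable debug logging in a scratch copy of config.toml.
cfg=./config.toml.demo
cat > "$cfg" <<'EOF'
[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"
EOF
# Uncomment the debug line and raise the log level to "debug".
sed -i 's|^#debug|debug|; s|^log-level = "info"|log-level = "debug"|' "$cfg"
grep -E '^(debug|log-level)' "$cfg"
```

After making the equivalent change to the real file, re-run the failing container so the runtime writes fresh log entries.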

@MaxiBoether

MaxiBoether commented May 24, 2023

@elezar

Yes, I understand that it does not implement the docker CLI. I only ran that experiment because the output of docker run --runtime=nvidia hello-world (given above) mentioned exit status 1 from /usr/bin/nvidia-container-runtime, and I wanted to investigate since there was no clear error message.

maxilocal4@sgs-gpu04:~$ nvidia-smi
Wed May 24 14:34:09 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090         On | 00000000:23:00.0 Off |                  N/A |
|  0%   62C    P2              247W / 350W|  14231MiB / 24576MiB |     59%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 3090         On | 00000000:41:00.0 Off |                  N/A |
|  0%   58C    P2              243W / 350W|  14357MiB / 24576MiB |     75%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  NVIDIA GeForce RTX 3090         On | 00000000:A1:00.0 Off |                  N/A |
|  0%   62C    P2              283W / 350W|  14587MiB / 24576MiB |     71%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  NVIDIA GeForce RTX 3090         On | 00000000:C1:00.0 Off |                  N/A |
|  0%   59C    P2              254W / 350W|  14587MiB / 24576MiB |     66%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

maxilocal4@sgs-gpu04:~$ uname -a
Linux sgs-gpu04.ethz.ch 5.15.0-67-generic #74-Ubuntu SMP Wed Feb 22 14:14:39 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

This is the output of nvidia-smi and uname -a on the host machine. I hope that helps.

  2. My config.toml looks like this:

disable-require = false

[nvidia-container-cli]
environment = []
ldconfig = "@/sbin/ldconfig.real"
load-kmods = true
no-cgroups = true

[nvidia-container-runtime]
debug = "/local/home/maxilocal4/log/nvidia-container-runtime.log"
# levels => debug, info, warning, error
log-level = "debug"

runtimes = ["runc"]
mode = "auto"

[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

The log files are not created for me, probably because the containers never start in the first place?

Note that this specifically affects rootless Docker: running Docker as root works, and rootless Docker with the runc runtime also works.

@MaxiBoether

Okay, we figured out what the problem was: the debug log file was not writable due to filesystem permissions. Maybe it would be good to add a more verbose error message when writing to the log file fails?
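Until such a message exists, a pre-flight check of the configured debug path can catch this class of failure (the path below is illustrative, not the real default):

```shell
# Sketch: verify the debug log path from config.toml is writable before
# starting containers. The path here is illustrative only.
logfile=./demo-nvidia-container-runtime.log
if touch "$logfile" 2>/dev/null; then
    echo "log path OK: $logfile"
else
    echo "warning: cannot write $logfile; fix permissions or disable debug" >&2
fi
```

In a rootless setup, the configured path must be writable by the unprivileged user running the Docker daemon, not just by root.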

@elezar
Member

elezar commented May 25, 2023

@MaxiBoether do you mean that the original error was caused by the log file not being writable, or the fact that the log wasn't being generated?

Update: I have created https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/merge_requests/404 that ignores the error when opening / creating any log files. Would you be able to test these changes and verify that they stop the behaviour you were seeing?

@elezar elezar transferred this issue from NVIDIA/nvidia-docker Jun 7, 2023