OCI runtime error: crun: error executing hook using podman –userns keep-id #46

osiler · 2023-01-30T07:41:45Z

Versions:

	podman -v
            podman version 4.3.1

	buildah -v
	buildah version 1.28.0 (image-spec 1.0.2-dev, runtime-spec 1.0.2-dev)

	nvidia-container-toolkit -version
	NVIDIA Container Runtime Hook version 1.12.0-rc.3
	commit: 14e587d55f2a4dc2e047a88e9acc2be72cb45af8

I am attempting to get containers to run with access to the GPU with rootless podman and the --userns keep-id flag. My current steps include:

Generating the cdi spec via:

	nvidia-ctk cdi generate > nvidia.yaml && sudo mkdir /etc/cdi && sudo mv nvidia.yaml /etc/cdi

Attempt 1: Fails

podman run --rm --device nvidia.com/gpu=gpu0 docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"

Error: setting up CDI devices: failed to inject devices: failed to stat CDI host device "/dev/dri/controlD69": no such file or directory

I then removed references to the following in the devices section of the /etc/cdi/nvidia.yaml:

    - path: /dev/dri/card5 
    - path: /dev/dri/controlD69
    - path: /dev/dri/renderD129

and removed the create symlink hooks in the devices section.

    hooks:
    - args:
      - nvidia-ctk
      - hook
      - create-symlinks
      - --link
      - ../card5::/dev/dri/by-path/pci-0000:58:00.0-card
      - --link
      - ../renderD129::/dev/dri/by-path/pci-0000:58:00.0-render
      hookName: createContainer
      path: nvidia-ctk

Finally I also removed the nvidia-ctk hook that changes the ownership of the /dev/dri path.

  - args:
    - nvidia-ctk
    - hook
    - chmod
    - --mode
    - "755"
    - --path
    - /dev/dri
    hookName: createContainer
    path: /usr/bin/nvidia-ctk

Attempt 2: Pass (missing selinux modules)

podman run --rm --device nvidia.com/gpu=gpu0 docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 341, in get_device_name
    return get_device_properties(device).name
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 371, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 229, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available

I am not concerned about this error, I believe I need to just amend some policy modules as specified here.

However if I attempt to run the above with the –userns keep-id flag.

Attempt 3: Fail

podman run --rm --device nvidia.com/gpu=gpu0 --userns keep-id docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"
Error: OCI runtime error: crun: error executing hook `/usr/bin/nvidia-container-runtime-hook` (exit code: 1)

I have also tried the different combinations for the flags of load-kmods and no-cgroups in /etc/nvidia-container-runtime/config.toml.

A lot of this trouble shooting has been directed from the following links.

I am unsure on the of the lifecycle of the permissions when running these hooks however it looks like the first issue where the mapped permissions may not add up is here.

elezar · 2023-01-30T10:35:48Z

Thanks for the detailed report. The final error that I see:

podman run --rm --device nvidia.com/gpu=gpu0 --userns keep-id docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"
Error: OCI runtime error: crun: error executing hook `/usr/bin/nvidia-container-runtime-hook` (exit code: 1)

indicates that the original hook is still being detected and injected. When using CDI it's important that this is not the case. Please remove the installed hook and repeat the run.

With regards to the /dev/dri/ devices that cannot be found. Do the devices exist on your system and what are their permissions? The error you are seeing indicates that our detection logic around them is not as robust as what it should be and we will work on getting a fix out. Would you be willing to test a build that would address this -- assuming that we can get the container running with your modified CDI specification?

osiler · 2023-01-30T11:00:11Z

Thanks looks like it was left behind during the attempts to get this working with the various package versions. It would be a nice inclusion to remove this hook (or alert the user of the duality) during update.

Attempt 4: Pass (missing selinux modules)

cd /usr/share/containers/oci/hooks.d && sudo rm oci-nvidia-hook.json

podman run --rm --device nvidia.com/gpu=gpu0 --userns keep-id docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 341, in get_device_name
    return get_device_properties(device).name
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 371, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/opt/conda/lib/python3.10/site-packages/torch/cuda/__init__.py", line 229, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available

For /dev/dri I have various /dev/dri/cardX with owned by root, within the video group, and /dev/dri/renderXXXX with root owner and render goup. The /dev/dri by path is all root.

I am happy to assist with release testing.

elezar · 2023-01-30T12:19:21Z

I have looked into the issue with the /dev/dri nodes and noted that it should be changed as of https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/merge_requests/260 which is available on main.

Would you be able to repeat your experiments with a CDI spec generated from the HEAD of main? It should then not be required to modify the spec at all since only the /dev/dri/ device nodes actuallty present on your host should be added to the spec. Note that for crun in particular the creation of the /dev/dri folder in the container may be required in this case. There was an issue with how "nested" device nodes such as this were being handled.

Also with regards to:

Attempt 4: Pass (missing selinux modules)

What do you mean by missing selinux modules? Any information you could provide here as to how you are able to work around this would be much appreciated.

osiler · 2023-02-05T10:39:18Z

I checked out HEAD but have run into trouble getting a build to work correctly with podman and podman-docker as the runner. Currently I do not have a native docker install on my dev machine and have not dug into some of the issues of running both alongside each other as specefied here. I am assuming you currently run the build script with docker as the runner?

emanuelbuholzer · 2023-05-19T14:06:26Z

With the latest version of Podman and the NVIDIA Container Toolkit 1.13.1 this runs now just fine on my machine:

podman run --rm --device nvidia.com/gpu=gpu0 --userns keep-id docker.io/pytorch/pytorch python -c "import torch; print(torch.cuda.get_device_name(0))"

joefiorini · 2023-06-12T13:30:13Z

I had this issue as well on Fedora Kinoite (Silverblue). @emanuelbuholzer's command did not work for me immediately, kept getting:

Error: setting up CDI devices: unresolvable CDI devices nvidia.com/gpu=gpu0

I could not find the actual name of the gpu's device file, but I did figure out that I could use nvidia.com/gpu=all, which then turned the error into:

Error: OCI runtime error: crun: error executing hook `/usr/bin/nvidia-container-runtime-hook` (exit code: 1)

According to nvidia's documentation this is because CDI and the nvidia-ctk runtime hook are incompatible. To disable it I added --runtime crun --hooks-dir "" to the podman command. Now I was down to a Python stack trace:

RuntimeError: No CUDA GPUs are available

To fix this I had to disable selinux (or lower security settings, something like that) with --security-opt label=disable. Altogether the final command that worked for me was:

podman --runtime crun --hooks-dir "" run --rm --security-opt label=disable --device nvidia.com/gpu=all --userns keep-id devel-jupyterlab python -c "import torch; print(torch.cuda.get_device_name(0))"

osiler mentioned this issue Jan 30, 2023

OCI runtime error: crun: error executing hook using podman –userns keep-id NVIDIA/nvidia-container-runtime#182

Closed

osiler closed this as completed Jun 4, 2023

osiler reopened this Jun 4, 2023

osiler closed this as completed Jun 4, 2023

juntao mentioned this issue Feb 15, 2024

feat: Crun + GGML plugin tracking issue WasmEdge/WasmEdge#3217

Closed

7 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OCI runtime error: crun: error executing hook using podman –userns keep-id #46

OCI runtime error: crun: error executing hook using podman –userns keep-id #46

osiler commented Jan 30, 2023 •

edited

Loading

elezar commented Jan 30, 2023

osiler commented Jan 30, 2023

elezar commented Jan 30, 2023

osiler commented Feb 5, 2023

emanuelbuholzer commented May 19, 2023

joefiorini commented Jun 12, 2023

OCI runtime error: crun: error executing hook using podman –userns keep-id #46

OCI runtime error: crun: error executing hook using podman –userns keep-id #46

Comments

osiler commented Jan 30, 2023 • edited Loading

elezar commented Jan 30, 2023

osiler commented Jan 30, 2023

elezar commented Jan 30, 2023

osiler commented Feb 5, 2023

emanuelbuholzer commented May 19, 2023

joefiorini commented Jun 12, 2023

osiler commented Jan 30, 2023 •

edited

Loading