Can't run container using nvidia-docker - libnvidia-ml.so.1 #219

Open
pfcouto opened this issue Apr 18, 2023 · 8 comments

Comments

pfcouto commented Apr 18, 2023

1. Issue or feature description

When I run the command docker run --privileged --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi, I get the following error:

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.

2. Steps to reproduce the issue

I installed the NVIDIA driver by following the Fedora docs, not NVIDIA's instructions, so, for example, nvcc --version fails because the nvcc command is not recognized, but I can run nvidia-smi on my host machine.

The commands I used to install nvidia are the following:

sudo dnf install akmod-nvidia
sudo dnf install xorg-x11-drv-nvidia-cuda

As the following image shows, I am able to run nvidia-smi on my host machine:

image

I followed this guide on how to install nvidia-docker and did the following:

curl -s -L https://nvidia.github.io/libnvidia-container/centos8/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
##############################
sudo dnf install nvidia-docker2
# Edit /etc/nvidia-container-runtime/config.toml and disable cgroups:
no-cgroups = true

sudo reboot
##############################
sudo systemctl start docker.service
##############################
docker run --privileged --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

Upon running this docker command, I get the error shown in section 1.
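The config edit in the steps above is done by hand. A minimal sketch of the same change as a script is shown below. The function name set_no_cgroups is hypothetical, the sed expression assumes GNU sed, and it assumes the option appears on its own line in the file (commented out or not).

```shell
#!/bin/sh
# Hypothetical helper (not part of any NVIDIA tooling): force
# `no-cgroups = true` in the given config file, whether the existing
# line is commented out or set to another value. Assumes GNU sed.
set_no_cgroups() {
    config=$1
    sed -i -E 's/^[[:space:]]*#?[[:space:]]*no-cgroups[[:space:]]*=.*/no-cgroups = true/' "$config"
}

# Usage on a real system (path taken from the steps above; the file is
# root-owned, so run this from a root shell):
#   set_no_cgroups /etc/nvidia-container-runtime/config.toml
```

If the option is missing from the file entirely, this sketch changes nothing, so it is worth checking the result with grep afterwards.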

The thing is, I have the file that it says it is missing (check the following image), so maybe it is looking for it in a different directory?

image
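For readers who want to check the same thing without a screenshot, here is a rough sketch for locating the library on the host. The function name find_lib and the default directory list are assumptions for illustration. Note that this only shows where the file lives on the host; it says nothing about where nvidia-container-cli is actually running or which paths it searches.

```shell
#!/bin/sh
# Hypothetical helper: print every file named $1 found under the
# remaining directory arguments. If no directories are given, a few
# common library directories are searched (an assumption, not a
# complete list).
find_lib() {
    name=$1
    shift
    [ "$#" -gt 0 ] || set -- /usr/lib /usr/lib64 /lib /lib64
    for dir in "$@"; do
        if [ -d "$dir" ]; then
            find "$dir" -name "$name" 2>/dev/null || true
        fi
    done
}

find_lib libnvidia-ml.so.1

# The dynamic linker cache can be inspected as well:
#   ldconfig -p | grep libnvidia-ml
```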

3. Information to attach (optional if deemed irrelevant)

uname -a:

Linux fedora 6.2.10-200.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Apr  6 23:30:41 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

docker version

Client: Docker Engine - Community
 Cloud integration: v1.0.31
 Version:           23.0.3
 API version:       1.41 (downgraded from 1.42)
 Go version:        go1.19.7
 Git commit:        3e7cbfd
 Built:             Tue Apr  4 22:10:33 2023
 OS/Arch:           linux/amd64
 Context:           desktop-linux

Server: Docker Desktop 4.18.0 (104112)
 Engine:
  Version:          20.10.24
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.19.7
  Git commit:       5d6db84
  Built:            Tue Apr  4 18:18:42 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.18
  GitCommit:        2456e983eb9e37e47538f59ea18f2043c9a73640
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

rpm -qa '*nvidia*'

 nvidia-gpu-firmware-20230310-148.fc37.noarch
xorg-x11-drv-nvidia-kmodsrc-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-cuda-libs-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-libs-530.41.03-1.fc37.x86_64
nvidia-settings-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-power-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-530.41.03-1.fc37.x86_64
akmod-nvidia-530.41.03-1.fc37.x86_64
kmod-nvidia-6.2.9-200.fc37.x86_64-530.41.03-1.fc37.x86_64
nvidia-persistenced-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-cuda-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-libs-530.41.03-1.fc37.i686
xorg-x11-drv-nvidia-cuda-libs-530.41.03-1.fc37.i686
kmod-nvidia-6.2.10-200.fc37.x86_64-530.41.03-1.fc37.x86_64
nvidia-container-toolkit-base-1.13.0-1.x86_64
libnvidia-container1-1.13.0-1.x86_64
libnvidia-container-tools-1.13.0-1.x86_64
nvidia-container-toolkit-1.13.0-1.x86_64
nvidia-docker2-2.13.0-1.noarch

nvidia-container-cli -V

cli-version: 1.13.0
lib-version: 1.13.0
build date: 2023-03-31T13:12+00:00
build revision: 20823911e978a50b33823a5783f92b6e345b241a
build compiler: gcc 8.5.0 20210514 (Red Hat 8.5.0-18)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

Thanks for your help!

@pfcouto
Author

pfcouto commented Apr 23, 2023

Hello again, can anyone help me out? @elezar, do you know how I can fix this issue? I believe that if I point nvidia-docker at one of the locations where the file is present on my PC, it should fix this issue. However, I don't know how to do that. Thanks.

@pfcouto pfcouto changed the title Can't run container using nvidia-docker Can't run container using nvidia-docker - libnvidia-ml.so.1 Apr 23, 2023
@albert-queralto

Hi, I am experiencing the same issue. It is worth mentioning that I have Docker Desktop installed. When I build and run an image using the sudo docker command, the container recognizes the GPU and the nvidia-smi command works correctly inside it. However, this does not work for my development environment, because an image built that way is not seen by VS Code.

On the other hand, when I build the image using docker build (without sudo) and then run it, I get the error mentioned by the OP:
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.

From what I have seen, it may be an issue with Docker Desktop:
#229

Any way to solve this?

Thank you very much.

@elezar
Member

elezar commented Jun 20, 2023

We do not support Docker Desktop. Please use docker-ce instead.
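The docker version output in the original report shows Server: Docker Desktop 4.18.0 and Context: desktop-linux. As a rough sketch, a check like the following can tell whether the docker CLI is currently talking to Docker Desktop. The function name is_docker_desktop is hypothetical, and the match relies on the "Server: Docker Desktop" line format seen in that report, which may differ across versions.

```shell
#!/bin/sh
# Hypothetical helper: reads `docker version` text on stdin and prints
# "docker-desktop" if the server line names Docker Desktop, else "other".
# The pattern is based on the output quoted in this issue.
is_docker_desktop() {
    if grep -q '^Server: Docker Desktop'; then
        echo "docker-desktop"
    else
        echo "other"
    fi
}

# Usage (requires a reachable docker daemon):
#   docker version | is_docker_desktop
#   docker context ls    # lists contexts; the active one is marked with *
```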

@pfcouto
Author

pfcouto commented Jun 20, 2023

Hi @elezar, in my case the nvidia-smi command doesn't even run in the container. I have been waiting for help since I created this issue and haven't received any. I believe the screenshots and output I posted should help, if you could take a look.

@elezar
Member

elezar commented Jun 21, 2023

@pfcouto since you are using rootless docker, would using podman be an option for you? We have recently added support for generating CDI specifications using the nvidia-ctk cdi generate command. Furthermore, specifying CDI devices on the command line is natively supported as of podman 4.1.0.

Please see https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html#container-device-interface-cdi-support for information on generating CDI specifications and consuming them using podman.

If podman is not an option, the NVIDIA Container Runtime can be configured in CDI mode to allow CDI specifications to be consumed by docker (also in rootless mode) until support is available. This is expected as an experimental feature in the Docker 25 release.
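As a rough sketch of the podman route described above, the two steps could look like the following. The output path /etc/cdi/nvidia.yaml, the ubuntu image, and the --security-opt=label=disable flag (for SELinux hosts such as Fedora) are assumptions based on the linked install guide rather than anything stated in this thread. The helper names run and cdi_podman_smoke_test are hypothetical. By default the script only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/bin/sh
# Sketch only: prints the commands by default (DRY_RUN=1).
DRY_RUN=${DRY_RUN:-1}

# Hypothetical wrapper: echo the command in dry-run mode, else run it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

cdi_podman_smoke_test() {
    # Generate a CDI specification for the GPUs on this host.
    # The output path is an assumption taken from the install guide.
    run sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
    # Request all GPUs by CDI device name (podman 4.1.0 or newer).
    # The image name and the SELinux option are assumptions.
    run podman run --rm --device nvidia.com/gpu=all \
        --security-opt=label=disable ubuntu nvidia-smi -L
}

cdi_podman_smoke_test
```

If the second command lists the GPUs, the CDI specification is being picked up by podman.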

@SajjadMzf

For me, the cause of the error was a corrupted installation of docker-ce, so I removed Docker completely and reinstalled it following the latest instructions in the official documentation.

@olariuromeo

You have to build and start the container with sudo when using docker-ce.

@NinjaPerson24119

NinjaPerson24119 commented Feb 18, 2024

On EndeavourOS (Arch), I solved this by removing docker-desktop with my package manager: yay -R docker-desktop

Unfortunately, this means that I can't run anything that expects docker desktop...

That's a really dumb incompatibility.
