
Error occurring in release 1.14.4: load library failed: libnvidia-ml.so.1: cannot open shared object file #305

Closed
bawee opened this issue Jan 27, 2024 · 13 comments
bawee commented Jan 27, 2024

Hello, I'm getting a load library failed error, as in a previous issue (unsure whether it's related, hence the new issue), when running a Nextflow pipeline with Docker that uses the NVIDIA Container Toolkit. The error seems to be present only in the new version of nvidia-container-toolkit (1.14.4); it does not occur on an identical computer running version 1.14.3, which I had set up only a few days prior.

Command error: docker: Error response from daemon: failed to create task for container: failed to create a shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown. time="2024-01-24T14:05:51Z" level=error msg="error waiting for container: "

nvidia-container-toolkit was installed using apt on Ubuntu 22.04.3, following the instructions at https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

I tried installing the old version instead (sudo apt install nvidia-container-toolkit=1.14.3-1), but it was unsuccessful due to an unavailable dependency.

Thanks in advance!

Originally posted by @bawee in #302 (comment)

bawee changed the title to "Error occurring in release 1.14.4: load library failed: libnvidia-ml.so.1: cannot open shared object file" Jan 27, 2024
elezar commented Jan 27, 2024

I will be able to check what the source of this could be on Monday. For now, you should be able to downgrade by specifying the versions of all packages:

sudo apt-get install nvidia-container-toolkit=1.14.3-1 \
        nvidia-container-toolkit-base=1.14.3-1 \
        libnvidia-container-tools=1.14.3-1 \
        libnvidia-container1=1.14.3-1
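If the pinned downgrade goes through, a package hold stops a later apt-get upgrade from pulling 1.14.4 straight back in. This is only a sketch of a follow-up step, not part of the original instructions (undo it later with apt-mark unhold):

```shell
# Pin the downgraded packages so a routine `apt-get upgrade`
# does not immediately reinstall 1.14.4
# (release the hold afterwards with `sudo apt-mark unhold ...`)
sudo apt-mark hold nvidia-container-toolkit \
        nvidia-container-toolkit-base \
        libnvidia-container-tools \
        libnvidia-container1
```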

elezar self-assigned this Jan 27, 2024
elezar commented Jan 29, 2024

@bawee could you provide more information on your setup? How are you running containers? How is the NVIDIA Container Toolkit installed and configured to be used with Docker?

bawee commented Jan 29, 2024

Hi @elezar,
Docker was installed as follows:

sudo apt update
sudo apt install -y docker.io
sudo usermod -aG docker ${USER}
sudo systemctl restart docker

The NVIDIA toolkit was installed as follows with instructions from https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

Then configured using:

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
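For reference, nvidia-ctk runtime configure adds a runtime stanza to /etc/docker/daemon.json. A sketch of the expected shape, written to a temp file here so it can be compared against the real file on the host (values are the defaults; verify against your actual /etc/docker/daemon.json):

```shell
# Expected shape of the stanza `nvidia-ctk runtime configure` adds
# (defaults; compare against your real /etc/docker/daemon.json)
cat <<'EOF' > /tmp/daemon.json.example
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
EOF
grep -q '"nvidia"' /tmp/daemon.json.example && echo "runtime stanza present"
```

After the daemon restart, the runtime should also show up in docker info under Runtimes.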

The containers are run using a nextflow pipeline documented here: https://labs.epi2me.io/workflows/wf-basecalling/

Rolling back to the previous version of nvidia-container-toolkit using your instructions above did not help; the error still appeared with v1.14.3-1. The identical machine running v1.14.3 that I set up last week is still working fine.

I also tried completely removing Docker with sudo apt-get autoremove -y --purge docker.io and reinstalling it.

I hope that is at all helpful. Please let me know if I need to provide more info. Thank you!

elezar commented Jan 29, 2024

Note that docker.io is listed as a conflicting package here: https://docs.docker.com/engine/install/ubuntu/

Would you be able to install docker-ce instead?

bawee commented Jan 29, 2024

Hi Evan,

Thank you for pointing that out. I had not seen that. I followed the instructions on https://docs.docker.com/engine/install/ubuntu/ and replaced docker.io with docker-ce.

The error message is still the same, unfortunately:

Command error:
  docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
  nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
  time="2024-01-29T18:00:47Z" level=error msg="error waiting for container: context canceled"

Here is some more information:

$ dpkg -l | grep -i docker
ii  docker-buildx-plugin                       0.12.1-1~ubuntu.22.04~jammy             amd64        Docker Buildx cli plugin.
ii  docker-ce                                  5:25.0.1-1~ubuntu.22.04~jammy           amd64        Docker: the open-source application container engine
ii  docker-ce-cli                              5:25.0.1-1~ubuntu.22.04~jammy           amd64        Docker CLI: the open-source application container engine
ii  docker-ce-rootless-extras                  5:25.0.1-1~ubuntu.22.04~jammy           amd64        Rootless support for Docker.
ii  docker-compose-plugin                      2.24.2-1~ubuntu.22.04~jammy             amd64        Docker Compose (V2) plugin for the Docker CLI.

$ nvidia-container-cli info
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory
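That nvidia-container-cli failure is the dynamic linker not finding the driver's NVML library on the host. As a quick check, a sketch using the standard glibc linker cache (libnvidia-ml.so.1 ships with the NVIDIA driver, not with the container toolkit, so its absence points at the driver):

```shell
# libnvidia-ml.so.1 is installed by the NVIDIA driver packages;
# if it is missing from the linker cache, the toolkit's hook fails
# exactly as shown above
if ldconfig -p | grep -q 'libnvidia-ml\.so\.1'; then
    echo "driver library present"
else
    echo "libnvidia-ml.so.1 missing: install the NVIDIA driver"
fi
```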

bawee commented Jan 29, 2024

Hi @elezar, it turns out the solution was to run sudo ubuntu-drivers install which installed nvidia-driver-535. My fresh install of 22.04 had the open source X.org driver by default (please see screenshot below).

[Screenshot from 2024-01-29 21-53-56]

Thanks for helping to troubleshoot, and apologies: I did not see the driver requirement in the documentation at https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
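For anyone landing here later, a short smoke test after installing the driver (and rebooting). The host-side check runs on its own; the container-side check is left as a comment because it needs Docker and a GPU, and the CUDA image tag is only an example:

```shell
# Host side: lists GPUs if the driver and NVML are working
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L
else
    echo "nvidia-smi not found: driver not installed"
fi

# Container side (needs Docker, a GPU, and the toolkit configured;
# the image tag is only an example):
#   sudo docker run --rm --gpus all nvidia/cuda:12.3.1-base-ubuntu22.04 nvidia-smi
```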

elezar commented Jan 30, 2024

Ah. Thanks. Yes, we should definitely include the output of nvidia-smi -L on the host in our issue template. The NVIDIA driver is required for the container toolkit to function.

Can we close this issue then?

bawee commented Jan 30, 2024

Yes, thank you very much.

mr-ryan-james commented Jun 14, 2024

I had this same issue, by the way, and fixed it with the same solution (installing the drivers based on the docs here: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#driver-installation).

The reason I had this problem is that I thought I had already installed the drivers: the CUDA download instructions very much imply they are installed in the last step, and I went with the legacy non-open "flavor".

https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_network

But the "download" instructions do not seem to mention the sudo apt-get install cuda-drivers-535 step from the "CUDA installation guide". I am a bit new to CUDA and eager to just get machine learning running on my newly rented GPU server, so I can't say I fully understand the difference between cuda-drivers-535 from the installation guide and sudo apt-get install -y cuda-drivers from the CUDA download instructions.
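A hedged note on that difference: cuda-drivers is a metapackage that follows NVIDIA's latest driver branch, while cuda-drivers-535 pins the 535 branch. On an apt system with the NVIDIA CUDA repository configured, the relationship can be inspected like this (a sketch; output depends on the configured repositories):

```shell
# Which candidate version each package resolves to, and what the
# metapackage depends on (requires the NVIDIA CUDA apt repository)
apt-cache policy cuda-drivers cuda-drivers-535
apt-cache depends cuda-drivers
```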

@bawee thanks for posting this question, saved me a lot of time!

amirian commented Jun 18, 2024

Hi @elezar, I am trying to run an application on Ubuntu 24.04 that needs the NVIDIA container runtime, but it stopped with the same issue. I have no NVIDIA hardware installed, and I wish there were a solution such as GPU simulation.

amirian commented Jun 24, 2024

No comments, @elezar? How can I run the application without a GPU?

zxdreamer commented

> Hi @elezar, it turns out the solution was to run sudo ubuntu-drivers install which installed nvidia-driver-535. My fresh install of 22.04 had the open source X.org driver by default.

Thanks, I ran into the same problem and solved it the same way.

Saurav3108 commented

After reinstalling the CUDA drivers, the CUDA toolkit, and the container toolkit, I was still getting the error. The issue was resolved by simply reinstalling docker-ce: sudo apt-get install --reinstall docker-ce
