
Error occurring in release 1.14.4: load library failed: libnvidia-ml.so.1: cannot open shared object file #305

Closed
bawee opened this issue Jan 27, 2024 · 13 comments
bawee commented Jan 27, 2024

Hello, I'm getting a load library failed error, as in a previous issue (unsure whether it's related, hence the new issue), when running a Nextflow pipeline with Docker that uses the NVIDIA Container Toolkit. The error seems to be present only in the new version of nvidia-container-toolkit (1.14.4); it does not occur on an identical computer running version 1.14.3, which I had set up only a few days prior.

Command error: docker: Error response from daemon: failed to create task for container: failed to create a shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown. time="2024-01-24T14:05:51Z" level=error msg="error waiting for container: "

nvidia-container-toolkit was installed using apt on Ubuntu 22.04.3, following the instructions at https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

I tried installing the old version instead (sudo apt install nvidia-container-toolkit=1.14.3-1), but it was unsuccessful due to an unavailable dependency.

Thanks in advance!

Originally posted by @bawee in #302 (comment)

bawee changed the title to "Error occurring in release 1.14.4: load library failed: libnvidia-ml.so.1: cannot open shared object file" Jan 27, 2024
elezar commented Jan 27, 2024

I will be able to check what the source of this could be on Monday. For now, you should be able to downgrade by specifying the versions of all packages:

sudo apt-get install nvidia-container-toolkit=1.14.3-1 \
        nvidia-container-toolkit-base=1.14.3-1 \
        libnvidia-container-tools=1.14.3-1 \
        libnvidia-container1=1.14.3-1
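If the pinned downgrade goes through, a package hold stops a later apt-get upgrade from pulling 1.14.4 straight back in. This is only a sketch of a follow-up step, not part of the original instructions (undo it later with apt-mark unhold):

```shell
# Pin the downgraded packages so a routine `apt-get upgrade`
# does not immediately reinstall 1.14.4
# (release the hold afterwards with `sudo apt-mark unhold ...`)
sudo apt-mark hold nvidia-container-toolkit \
        nvidia-container-toolkit-base \
        libnvidia-container-tools \
        libnvidia-container1
```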

elezar self-assigned this Jan 27, 2024
elezar commented Jan 29, 2024

@bawee could you provide more information on your setup? How are you running containers? How is the NVIDIA Container Toolkit installed and configured to be used with Docker?

bawee commented Jan 29, 2024

Hi @elezar,
Docker was installed as follows:

sudo apt update
sudo apt install -y docker.io
sudo usermod -aG docker ${USER}
sudo systemctl restart docker

The NVIDIA toolkit was installed as follows with instructions from https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

Then configured using:

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
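For reference, nvidia-ctk runtime configure adds a runtime stanza to /etc/docker/daemon.json. A sketch of the expected shape, written to a temp file here so it can be compared against the real file on the host (values are the defaults; verify against your actual /etc/docker/daemon.json):

```shell
# Expected shape of the stanza `nvidia-ctk runtime configure` adds
# (defaults; compare against your real /etc/docker/daemon.json)
cat <<'EOF' > /tmp/daemon.json.example
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
EOF
grep -q '"nvidia"' /tmp/daemon.json.example && echo "runtime stanza present"
```

After the daemon restart, the runtime should also show up in docker info under Runtimes.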

The containers are run using a nextflow pipeline documented here: https://labs.epi2me.io/workflows/wf-basecalling/

Rolling back to the previous version of nvidia-container-toolkit using your instructions above did not help; the error still appeared with v1.14.3-1. The identical machine running v1.14.3 that I set up last week is still working fine.

I also tried completely removing Docker with sudo apt-get autoremove -y --purge docker.io and reinstalling it.

I hope that is at all helpful. Please let me know if I need to provide more info. Thank you!

elezar commented Jan 29, 2024

Note that docker.io is listed as a conflicting package here: https://docs.docker.com/engine/install/ubuntu/

Would you be able to install docker-ce instead?

bawee commented Jan 29, 2024

Hi Evan,

Thank you for pointing that out. I had not seen that. I followed the instructions on https://docs.docker.com/engine/install/ubuntu/ and replaced docker.io with docker-ce.

The error message is still the same, unfortunately:

Command error:
  docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
  nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
  time="2024-01-29T18:00:47Z" level=error msg="error waiting for container: context canceled"

Here is some more information:

$ dpkg -l | grep -i docker
ii  docker-buildx-plugin                       0.12.1-1~ubuntu.22.04~jammy             amd64        Docker Buildx cli plugin.
ii  docker-ce                                  5:25.0.1-1~ubuntu.22.04~jammy           amd64        Docker: the open-source application container engine
ii  docker-ce-cli                              5:25.0.1-1~ubuntu.22.04~jammy           amd64        Docker CLI: the open-source application container engine
ii  docker-ce-rootless-extras                  5:25.0.1-1~ubuntu.22.04~jammy           amd64        Rootless support for Docker.
ii  docker-compose-plugin                      2.24.2-1~ubuntu.22.04~jammy             amd64        Docker Compose (V2) plugin for the Docker CLI.

$ nvidia-container-cli info
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory
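That nvidia-container-cli failure is the dynamic linker not finding the driver's NVML library on the host. As a quick check, a sketch using the standard glibc linker cache (libnvidia-ml.so.1 ships with the NVIDIA driver, not with the container toolkit, so its absence points at the driver):

```shell
# libnvidia-ml.so.1 is installed by the NVIDIA driver packages;
# if it is missing from the linker cache, the toolkit's hook fails
# exactly as shown above
if ldconfig -p | grep -q 'libnvidia-ml\.so\.1'; then
    echo "driver library present"
else
    echo "libnvidia-ml.so.1 missing: install the NVIDIA driver"
fi
```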

bawee commented Jan 29, 2024

Hi @elezar, it turns out the solution was to run sudo ubuntu-drivers install which installed nvidia-driver-535. My fresh install of 22.04 had the open source X.org driver by default (please see screenshot below).

[Screenshot from 2024-01-29 21-53-56]

Thanks for helping to troubleshoot, and apologies: I did not see the driver requirement in the documentation at https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html
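For anyone landing here later, a short smoke test after installing the driver (and rebooting). The host-side check runs on its own; the container-side check is left as a comment because it needs Docker and a GPU, and the CUDA image tag is only an example:

```shell
# Host side: lists GPUs if the driver and NVML are working
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L
else
    echo "nvidia-smi not found: driver not installed"
fi

# Container side (needs Docker, a GPU, and the toolkit configured;
# the image tag is only an example):
#   sudo docker run --rm --gpus all nvidia/cuda:12.3.1-base-ubuntu22.04 nvidia-smi
```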

elezar commented Jan 30, 2024

Ah. Thanks. Yes, we should definitely include the output of nvidia-smi -L on the host in our issue template. The NVIDIA driver is required for the container toolkit to function.

Can we close this issue then?

bawee commented Jan 30, 2024

Yes, thank you very much.

mr-ryan-james commented Jun 14, 2024

I had this same issue, by the way, and fixed it with the same solution (installing the drivers based on the docs here: https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#driver-installation).

The reason I had this problem is that I thought I had already installed the drivers: the CUDA download instructions very much imply they are installed in the last step, and I went with the legacy non-open "flavor".

https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Ubuntu&target_version=22.04&target_type=deb_network

But the "download" instructions do not seem to mention the sudo apt-get install cuda-drivers-535 step from the "CUDA installation guide". I am a bit new to CUDA and eager to just get machine learning running on my newly rented GPU server, so I can't say I fully understand the difference between cuda-drivers-535 from the installation guide and sudo apt-get install -y cuda-drivers from the CUDA download instructions.
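A hedged note on that difference: cuda-drivers is a metapackage that follows NVIDIA's latest driver branch, while cuda-drivers-535 pins the 535 branch. On an apt system with the NVIDIA CUDA repository configured, the relationship can be inspected like this (a sketch; output depends on the configured repositories):

```shell
# Which candidate version each package resolves to, and what the
# metapackage depends on (requires the NVIDIA CUDA apt repository)
apt-cache policy cuda-drivers cuda-drivers-535
apt-cache depends cuda-drivers
```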

@bawee thanks for posting this question, saved me a lot of time!

amirian commented Jun 18, 2024

Hi @elezar, I am trying to run an application on Ubuntu 24.04 that needs the NVIDIA container runtime, but it stopped with the same issue. I have no NVIDIA hardware installed, and I wish there were a solution such as GPU simulation.

amirian commented Jun 24, 2024

No comments, @elezar? How can I run the application without a GPU?

zxdreamer commented

> Hi @elezar, it turns out the solution was to run sudo ubuntu-drivers install which installed nvidia-driver-535. My fresh install of 22.04 had the open source X.org driver by default.

Thanks, I ran into the same problem and solved it the same way.

Saurav3108 commented

After reinstalling the CUDA drivers, the CUDA toolkit, and the container toolkit, I was still getting the error. The issue was resolved by simply reinstalling docker-ce: sudo apt-get install --reinstall docker-ce
