Trouble Running NVIDIA GPU Containers, ldconfig failed #1516

Closed
Nauman3S opened this issue Mar 31, 2024 · 5 comments

Comments

@Nauman3S

I'm unable to run NVIDIA GPU containers: every container that uses the GPU fails to start with an error.

Issue Reproduction Steps:

Configuring the container runtime:

sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
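
To confirm that the configure step actually changed something, one could check whether the containerd config mentions an NVIDIA runtime. This is only a minimal sketch: the path /etc/containerd/config.toml is the usual default and is an assumption here (the report does not state it), and the helper name has_nvidia_runtime is hypothetical.

```sh
# Hypothetical helper: succeeds if the given containerd config file
# mentions "nvidia" anywhere (a loose check, not a TOML parser).
# The default path is an assumption; pass another path as $1 if needed.
has_nvidia_runtime() {
    config="${1:-/etc/containerd/config.toml}"
    [ -r "$config" ] && grep -q 'nvidia' "$config"
}
```

For example, `has_nvidia_runtime && echo "nvidia runtime entry found"` prints the message only when the entry is present.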

Pulling images for testing:

sudo ctr images pull docker.io/nvidia/cuda:12.0.0-runtime-ubuntu20.04
sudo ctr images pull docker.io/nvidia/cuda:12.0.0-runtime-ubi8
sudo ctr images pull docker.io/nvidia/cuda:12.0.0-base-ubuntu20.04
sudo ctr images pull docker.io/nvidia/cuda:12.0.0-base-ubi8

Running a container with GPU:

sudo ctr run --rm --gpus 0 --runtime io.containerd.runc.v1 --privileged docker.io/nvidia/cuda:12.0.0-runtime-ubuntu20.04 test nvidia-smi

Error Message:

ctr: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: ldcache error: process /sbin/ldconfig.real failed with error code: 1: unknown

This error persists across all pulled NVIDIA images (non-Ubuntu-based images show the same error, but with /sbin/ldconfig instead of /sbin/ldconfig.real). However, non-GPU containers (e.g., docker.io/macabees/neofetch:latest) work without issues.
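
When comparing the failure across images, it can help to pull just the ldconfig path and exit code out of the long runtime error. The following is a rough sketch that assumes the message keeps the exact wording quoted above; the helper name parse_ldcache_error is hypothetical.

```sh
# Hypothetical helper: read a runtime error message on stdin and print
# "<ldconfig path> <exit code>" if it contains an ldcache failure.
# Prints nothing when the message has a different shape.
parse_ldcache_error() {
    sed -n 's/.*ldcache error: process \([^ ]*\) failed with error code: \([0-9][0-9]*\).*/\1 \2/p'
}
```

The error text can be piped in from a saved log, e.g. `parse_ldcache_error < error.log`, where error.log is a placeholder file name.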

Further Details:

Running ldconfig -p shows 264 libs found, including various NVIDIA libraries, and running ldconfig by itself outputs no error.
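
The cache check can be made a little more precise by counting the NVIDIA-related entries instead of eyeballing the list. This is a sketch only: matching on "nvidia" or "cuda" is an assumed heuristic, and the helper name count_nvidia_libs is hypothetical.

```sh
# Hypothetical helper: count lines that mention nvidia or cuda
# (case-insensitive), reading `ldconfig -p` output from stdin.
# `|| true` keeps the exit status zero when the count is 0.
count_nvidia_libs() {
    grep -c -i -E 'nvidia|cuda' || true
}
```

Typical usage would be `ldconfig -p | count_nvidia_libs`.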

Output from sudo nvidia-container-cli -k -d /dev/tty info includes warnings about missing libraries and compat32 libraries, although nvidia-smi shows the GPU is recognized correctly.

Attempted Solutions:

I verified that all NVIDIA driver and toolkit components are correctly installed. I also made sure that the ldconfig cache is current and includes paths to the NVIDIA libraries, and that /sbin/ldconfig.real is a symlink to /sbin/ldconfig.
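
The symlink part of that check can be scripted. A minimal sketch, assuming `readlink -f` is available on the target (it is in GNU coreutils and BusyBox); the helper name is_symlink_to is hypothetical, and its defaults simply mirror the two paths named above.

```sh
# Hypothetical helper: succeeds if $1 is a symlink that resolves to the
# same file as $2. Defaults mirror the paths named in the report.
is_symlink_to() {
    link="${1:-/sbin/ldconfig.real}"
    target="${2:-/sbin/ldconfig}"
    [ -L "$link" ] && [ "$(readlink -f "$link")" = "$(readlink -f "$target")" ]
}
```

For example, `is_symlink_to && echo "ldconfig.real points at ldconfig"` reports success only when the link is in place.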

Despite these efforts, the error persists, and GPU containers fail to start. I'm seeking advice on resolving this ldcache and container initialization error to run NVIDIA GPU containers.

@dwalkes
Member

dwalkes commented Mar 31, 2024

Hi,

You can see the tests we run on meta-tegra images in the test spreadsheet.

@Nauman3S
Author

Nauman3S commented Mar 31, 2024

Hi,

  • It's image-full, the branch is mickledore, and it is orion.
  • I do have some unrelated layers in my final build, like neofetch, but they are not interfering with any other layers.

The issue is that I need to use containerd instead of Docker, so I removed the Docker recipe(s) from the build. With containerd I am getting this error, although nothing related to the kernel or the NVIDIA drivers has changed.

@ichergui
Member

ichergui commented Apr 4, 2024

Hi @Nauman3S

Could you please use the nanbield branch instead of mickledore?
mickledore is a deprecated branch.

Please share any findings when you are able to test with the nanbield branch.

@ichergui
Member

Hi @Nauman3S
Any update on this issue?

@ichergui
Member

Closing this issue since no updates were provided.
Feel free to open a new issue.
