Fail to start on second run. libs being set to 0 size #224

Open
dbkinghorn opened this issue Oct 5, 2023 · 2 comments

dbkinghorn commented Oct 5, 2023

I've hit a showstopper: I basically cannot use enroot with the current driver and libnvidia-container*.

Ubuntu Server 22.04
Driver Version: 535.113.01
nvidia-container-cli --version
cli-version: 1.14.2
lib-version: 1.14.2
enroot version 3.4.1

Example:
enroot import docker://nvcr.io#nvidia/cuda:12.2.0-runtime-ubuntu22.04
enroot create --name cuda12.2 nvidia+cuda+12.2.0-runtime-ubuntu22.04.sqsh
enroot start cuda12.2 # runs correctly

In the container on first run:

/lib/x86_64-linux-gnu$ ls -l | grep nvidia
lrwxrwxrwx  1 kinghorn kinghorn       33 Oct  5 11:27 libnvidia-allocator.so.1 -> libnvidia-allocator.so.535.113.01
-rw-r--r--  1 nobody   nogroup    160552 Sep 25 02:45 libnvidia-allocator.so.535.113.01
lrwxrwxrwx  1 kinghorn kinghorn       27 Oct  5 11:27 libnvidia-cfg.so.1 -> libnvidia-cfg.so.535.113.01
-rw-r--r--  1 nobody   nogroup    270840 Sep 25 02:45 libnvidia-cfg.so.535.113.01
lrwxrwxrwx  1 kinghorn kinghorn       26 Oct  5 11:27 libnvidia-ml.so.1 -> libnvidia-ml.so.535.113.01
-rw-r--r--  1 nobody   nogroup   1819968 Sep 25 02:45 libnvidia-ml.so.535.113.01
lrwxrwxrwx  1 kinghorn kinghorn       28 Oct  5 11:27 libnvidia-nvvm.so.4 -> libnvidia-nvvm.so.535.113.01
-rw-r--r--  1 nobody   nogroup  86140736 Sep 25 02:45 libnvidia-nvvm.so.535.113.01
lrwxrwxrwx  1 kinghorn kinghorn       30 Oct  5 11:27 libnvidia-opencl.so.1 -> libnvidia-opencl.so.535.113.01
-rw-r--r--  1 nobody   nogroup  24224408 Sep 25 02:45 libnvidia-opencl.so.535.113.01
-rw-r--r--  1 nobody   nogroup     10176 Sep 25 02:45 libnvidia-pkcs11-openssl3.so.535.113.01
lrwxrwxrwx  1 kinghorn kinghorn       38 Oct  5 11:27 libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.535.113.01
-rw-r--r--  1 nobody   nogroup  23348992 Sep 25 02:45 libnvidia-ptxjitcompiler.so.535.113.01

On the host, the same libraries in the container's rootfs have already been truncated to 0 bytes:

~/.local/share/enroot/cuda12.2/lib/x86_64-linux-gnu$ ls -l | grep nvidia
lrwxrwxrwx  1 kinghorn kinghorn      33 Oct  5 11:10 libnvidia-allocator.so.1 -> libnvidia-allocator.so.535.113.01
-rw-r--r--  1 kinghorn kinghorn       0 Oct  5 11:10 libnvidia-allocator.so.535.113.01
lrwxrwxrwx  1 kinghorn kinghorn      27 Oct  5 11:10 libnvidia-cfg.so.1 -> libnvidia-cfg.so.535.113.01
-rw-r--r--  1 kinghorn kinghorn       0 Oct  5 11:10 libnvidia-cfg.so.535.113.01
lrwxrwxrwx  1 kinghorn kinghorn      26 Oct  5 11:10 libnvidia-ml.so.1 -> libnvidia-ml.so.535.113.01
-rw-r--r--  1 kinghorn kinghorn       0 Oct  5 11:10 libnvidia-ml.so.535.113.01
lrwxrwxrwx  1 kinghorn kinghorn      28 Oct  5 11:10 libnvidia-nvvm.so.4 -> libnvidia-nvvm.so.535.113.01
-rw-r--r--  1 kinghorn kinghorn       0 Oct  5 11:10 libnvidia-nvvm.so.535.113.01
lrwxrwxrwx  1 kinghorn kinghorn      30 Oct  5 11:10 libnvidia-opencl.so.1 -> libnvidia-opencl.so.535.113.01
-rw-r--r--  1 kinghorn kinghorn       0 Oct  5 11:10 libnvidia-opencl.so.535.113.01
-rw-r--r--  1 kinghorn kinghorn       0 Oct  5 11:10 libnvidia-pkcs11-openssl3.so.535.113.01
lrwxrwxrwx  1 kinghorn kinghorn      38 Oct  5 11:10 libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.535.113.01
-rw-r--r--  1 kinghorn kinghorn       0 Oct  5 11:10 libnvidia-ptxjitcompiler.so.535.113.01

exit
enroot start cuda12.2 # fails

nvidia-container-cli: initialization error: load library failed: /home/kinghorn/.local/share/enroot/cuda12.2/lib/x86_64-linux-gnu/libnvidia-ml.so.1: file too short
[ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1

I'm not sure this is an enroot issue. Should I be reporting it somewhere else?
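
A quick way to confirm the symptom before restarting a container is to look for empty NVIDIA libraries in its rootfs. The following is only a sketch: the function name `find_empty_nvidia_libs` is made up for illustration, and the path in the usage comment is the one from the host listing above.

```shell
#!/bin/sh
# Hypothetical helper (not part of enroot): print every regular file named
# libnvidia-*.so.* in the given directory whose size is 0 bytes.
find_empty_nvidia_libs() {
    libdir="$1"
    # -type f skips the symlinks (e.g. libnvidia-ml.so.1);
    # -size 0 matches only empty files.
    find "$libdir" -maxdepth 1 -type f -name 'libnvidia-*.so.*' -size 0
}

# Usage, with the rootfs path from the listing above (adjust the container name):
# find_empty_nvidia_libs "$HOME/.local/share/enroot/cuda12.2/lib/x86_64-linux-gnu"
```

Any path it prints is a file that would trigger the "file too short" error when nvidia-container-cli tries to load it.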

3XX0 (Member) commented Oct 5, 2023

Did you end up doing #222 (comment)?
This might be the culprit, since libnvidia-container will attempt to load NVML while the driver is not yet mounted.
Not sure what you can do until #222 is fixed.
Maybe LD_PRELOAD instead of LD_LIBRARY_PATH would do it:

export LD_PRELOAD="${ENROOT_ROOTFS}/lib/x86_64-linux-gnu/libnvidia-container-go.so.1"
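
If you try the `LD_PRELOAD` route, it may be worth guarding the export so it is only set when the library is actually present and non-empty. This is a sketch under assumptions: `preload_if_present` is a hypothetical helper, and the thread does not say where the export should live (hook, environment file, or interactive shell).

```shell
#!/bin/sh
# Hypothetical helper: export LD_PRELOAD only when the target library
# exists and is non-empty ([ -s ] checks both, following symlinks).
preload_if_present() {
    lib="$1"
    if [ -s "$lib" ]; then
        export LD_PRELOAD="$lib"
    else
        echo "skipping LD_PRELOAD: $lib is missing or empty" >&2
        return 1
    fi
}

# Usage with the path suggested above (ENROOT_ROOTFS must already be set):
# preload_if_present "${ENROOT_ROOTFS}/lib/x86_64-linux-gnu/libnvidia-container-go.so.1"
```

The check avoids pointing the loader at one of the 0-byte files described earlier in this issue.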

dbkinghorn (Author) commented

Ahhh! When I initially tested #222 I made a mistake! I tried it again and it does take care of my test case.
