
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1 #154

Open
Dan-Burns opened this issue Dec 2, 2022 · 47 comments

Comments

@Dan-Burns

Hello,

I tried the various combinations of conda and pip packages that people suggest to get TensorFlow running on the RTX 30 series. I thought it was working after exercising the GPU with the Keras tutorial code, but when I moved to a different type of model something apparently broke.

Now I'm trying the docker route.
docker run --gpus all -it --rm nvcr.io/nvidia/tensorflow:22.11-tf2-py3
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy' nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
There seem to be a lot of missing libraries.

3. Information to attach (optional if deemed irrelevant)

  • [x] Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
  • I1202 15:15:34.407243 26518 nvc.c:376] initializing library context (version=1.11.0, build=)
    I1202 15:15:34.407353 26518 nvc.c:350] using root /
    I1202 15:15:34.407365 26518 nvc.c:351] using ldcache /etc/ld.so.cache
    I1202 15:15:34.407377 26518 nvc.c:352] using unprivileged user 1000:1000
    I1202 15:15:34.407426 26518 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
    I1202 15:15:34.408137 26518 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
    W1202 15:15:34.411623 26519 nvc.c:273] failed to set inheritable capabilities
    W1202 15:15:34.411736 26519 nvc.c:274] skipping kernel modules load due to failure
    I1202 15:15:34.412602 26520 rpc.c:71] starting driver rpc service
    I1202 15:15:34.433974 26521 rpc.c:71] starting nvcgo rpc service
    I1202 15:15:34.438005 26518 nvc_info.c:766] requesting driver information with ''
    I1202 15:15:34.445181 26518 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.520.56.06
    I1202 15:15:34.445313 26518 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.520.56.06
    I1202 15:15:34.445952 26518 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.520.56.06
    I1202 15:15:34.446254 26518 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.520.56.06
    I1202 15:15:34.446554 26518 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.520.56.06
    I1202 15:15:34.446877 26518 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.520.56.06
    I1202 15:15:34.447241 26518 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.520.56.06
    I1202 15:15:34.447301 26518 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.520.56.06
    I1202 15:15:34.447405 26518 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.520.56.06
    I1202 15:15:34.447490 26518 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.520.56.06
    I1202 15:15:34.447550 26518 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.520.56.06
    I1202 15:15:34.447813 26518 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.520.56.06
    I1202 15:15:34.448099 26518 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.520.56.06
    I1202 15:15:34.448197 26518 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.520.56.06
    I1202 15:15:34.448693 26518 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.520.56.06
    I1202 15:15:34.448755 26518 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.520.56.06
    I1202 15:15:34.449075 26518 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.520.56.06
    I1202 15:15:34.449417 26518 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.520.56.06
    I1202 15:15:34.450211 26518 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcudadebugger.so.520.56.06
    I1202 15:15:34.450273 26518 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.520.56.06
    I1202 15:15:34.450625 26518 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.520.56.06
    I1202 15:15:34.450896 26518 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.520.56.06
    I1202 15:15:34.451174 26518 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.520.56.06
    I1202 15:15:34.451236 26518 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.520.56.06
    I1202 15:15:34.451580 26518 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-tls.so.520.56.06
    I1202 15:15:34.451929 26518 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ptxjitcompiler.so.520.56.06
    I1202 15:15:34.452169 26518 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opticalflow.so.520.56.06
    I1202 15:15:34.452413 26518 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opencl.so.520.56.06
    I1202 15:15:34.452680 26518 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ml.so.520.56.06
    I1202 15:15:34.452975 26518 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glvkspirv.so.520.56.06
    I1202 15:15:34.453288 26518 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glsi.so.520.56.06
    I1202 15:15:34.453571 26518 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glcore.so.520.56.06
    I1202 15:15:34.453833 26518 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-fbc.so.520.56.06
    I1202 15:15:34.454141 26518 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-encode.so.520.56.06
    I1202 15:15:34.454359 26518 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-eglcore.so.520.56.06
    I1202 15:15:34.455059 26518 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-compiler.so.520.56.06
    I1202 15:15:34.455764 26518 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-allocator.so.520.56.06
    I1202 15:15:34.456075 26518 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvcuvid.so.520.56.06
    I1202 15:15:34.456395 26518 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libcuda.so.520.56.06
    I1202 15:15:34.456750 26518 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLX_nvidia.so.520.56.06
    I1202 15:15:34.457050 26518 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLESv2_nvidia.so.520.56.06
    I1202 15:15:34.457314 26518 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLESv1_CM_nvidia.so.520.56.06
    I1202 15:15:34.457580 26518 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libEGL_nvidia.so.520.56.06
    W1202 15:15:34.457645 26518 nvc_info.c:399] missing library libnvidia-nscq.so
    W1202 15:15:34.457659 26518 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so
    W1202 15:15:34.457678 26518 nvc_info.c:399] missing library libnvidia-pkcs11.so
    W1202 15:15:34.457694 26518 nvc_info.c:399] missing library libvdpau_nvidia.so
    W1202 15:15:34.457709 26518 nvc_info.c:399] missing library libnvidia-ifr.so
    W1202 15:15:34.457722 26518 nvc_info.c:399] missing library libnvidia-cbl.so
    W1202 15:15:34.457740 26518 nvc_info.c:403] missing compat32 library libnvidia-cfg.so
    W1202 15:15:34.457753 26518 nvc_info.c:403] missing compat32 library libnvidia-nscq.so
    W1202 15:15:34.457768 26518 nvc_info.c:403] missing compat32 library libcudadebugger.so
    W1202 15:15:34.457780 26518 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so
    W1202 15:15:34.457792 26518 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so
    W1202 15:15:34.457808 26518 nvc_info.c:403] missing compat32 library libnvidia-ngx.so
    W1202 15:15:34.457828 26518 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so
    W1202 15:15:34.457843 26518 nvc_info.c:403] missing compat32 library libnvidia-ifr.so
    W1202 15:15:34.457860 26518 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so
    W1202 15:15:34.457880 26518 nvc_info.c:403] missing compat32 library libnvoptix.so
    W1202 15:15:34.457894 26518 nvc_info.c:403] missing compat32 library libnvidia-cbl.so
    I1202 15:15:34.460121 26518 nvc_info.c:299] selecting /usr/bin/nvidia-smi
    I1202 15:15:34.460197 26518 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump
    I1202 15:15:34.460243 26518 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced
    I1202 15:15:34.460336 26518 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control
    I1202 15:15:34.460409 26518 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server
    W1202 15:15:34.460616 26518 nvc_info.c:425] missing binary nv-fabricmanager
    I1202 15:15:34.460810 26518 nvc_info.c:343] listing firmware path /usr/lib/firmware/nvidia/520.56.06/gsp.bin
    I1202 15:15:34.460876 26518 nvc_info.c:529] listing device /dev/nvidiactl
    I1202 15:15:34.460891 26518 nvc_info.c:529] listing device /dev/nvidia-uvm
    I1202 15:15:34.460904 26518 nvc_info.c:529] listing device /dev/nvidia-uvm-tools
    I1202 15:15:34.460915 26518 nvc_info.c:529] listing device /dev/nvidia-modeset
    I1202 15:15:34.460980 26518 nvc_info.c:343] listing ipc path /run/nvidia-persistenced/socket
    W1202 15:15:34.461036 26518 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket
    W1202 15:15:34.461083 26518 nvc_info.c:349] missing ipc path /tmp/nvidia-mps
    I1202 15:15:34.461100 26518 nvc_info.c:822] requesting device information with ''
    I1202 15:15:34.468056 26518 nvc_info.c:713] listing device /dev/nvidia0 (GPU-ba9fdcdb-8a2b-d2b6-f69c-5f2ac08dde8b at 00000000:01:00.0)
    NVRM version: 520.56.06
    CUDA version: 11.8

Device Index: 0
Device Minor: 0
Model: NVIDIA GeForce RTX 3090 Ti
Brand: GeForce
GPU UUID: GPU-ba9fdcdb-8a2b-d2b6-f69c-5f2ac08dde8b
Bus Location: 00000000:01:00.0
Architecture: 8.6
I1202 15:15:34.468151 26518 nvc.c:434] shutting down library context
I1202 15:15:34.468317 26521 rpc.c:95] terminating nvcgo rpc service
I1202 15:15:34.469397 26518 rpc.c:132] nvcgo rpc service terminated successfully
I1202 15:15:34.474156 26520 rpc.c:95] terminating driver rpc service
I1202 15:15:34.474599 26518 rpc.c:132] driver rpc service terminated successfully

Timestamp : Fri Dec 2 09:17:13 2022
Driver Version : 520.56.06
CUDA Version : 11.8

Attached GPUs : 1
GPU 00000000:01:00.0
Product Name : NVIDIA GeForce RTX 3090 Ti
Product Brand : GeForce
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Enabled
Persistence Mode : Enabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-ba9fdcdb-8a2b-d2b6-f69c-5f2ac08dde8b
Minor Number : 0
VBIOS Version : 94.02.A0.00.2D
MultiGPU Board : No
Board ID : 0x100
GPU Part Number : N/A
Module ID : 0
Inforom Version
Image Version : G002.0000.00.03
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x220310DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x88701043
GPU Link Info
PCIe Generation
Max : 4
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 1000 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 0 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 24564 MiB
Reserved : 310 MiB
Used : 510 MiB
Free : 23742 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 13 MiB
Free : 243 MiB
Compute Mode : Default
Utilization
Gpu : 6 %
Memory : 5 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : Disabled
Pending : Disabled
ECC Errors
Volatile
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Aggregate
SRAM Correctable : N/A
SRAM Uncorrectable : N/A
DRAM Correctable : N/A
DRAM Uncorrectable : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 192 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 36 C
GPU Shutdown Temp : 97 C
GPU Slowdown Temp : 94 C
GPU Max Operating Temp : 92 C
GPU Target Temperature : 83 C
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : Supported
Power Draw : 32.45 W
Power Limit : 480.00 W
Default Power Limit : 480.00 W
Enforced Power Limit : 480.00 W
Min Power Limit : 100.00 W
Max Power Limit : 516.00 W
Clocks
Graphics : 210 MHz
SM : 210 MHz
Memory : 405 MHz
Video : 555 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 2115 MHz
SM : 2115 MHz
Memory : 10501 MHz
Video : 1950 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : 740.000 mV
Processes
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 2283
Type : G
Name : /usr/lib/xorg/Xorg
Used GPU Memory : 259 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 2441
Type : G
Name : /usr/bin/gnome-shell
Used GPU Memory : 52 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 3320
Type : G
Name : /opt/docker-desktop/Docker Desktop --type=gpu-process --enable-crashpad --enable-crash-reporter=46721d59-e3cc-4241-8f96-57bab71f8674,no_channel --user-data-dir=/home/kanaka/.config/Docker Desktop --gpu-preferences=WAAAAAAAAAAgAAAIAAAAAAAAAAAAAAAAAABgAAAAAAA4AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAIAAAAAAAAAABAAAAAAAAAAgAAAAAAAAACAAAAAAAAAAIAAAAAAAAAA== --shared-files --field-trial-handle=0,i,777493636119283380,17735576311253417080,131072 --disable-features=SpareRendererForSitePerProcess
Used GPU Memory : 27 MiB
GPU instance ID : N/A
Compute instance ID : N/A
Process ID : 4402
Type : C+G
Name : /opt/google/chrome/chrome --type=gpu-process --enable-crashpad --crashpad-handler-pid=4367 --enable-crash-reporter=, --change-stack-guard-on-fork=enable --gpu-preferences=WAAAAAAAAAAgAAAIAAAAAAAAAAAAAAAAAABgAAEAAAA4AAAAAAAAAAEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACAAAAAAAAAAIAAAAAAAAAABAAAAAAAAAAgAAAAAAAAACAAAAAAAAAAIAAAAAAAAAA== --shared-files --field-trial-handle=0,i,1352372760819385498,10632265477078674372,131072
Used GPU Memory : 166 MiB

  • [x] Docker version from docker version
  • Client: Docker Engine - Community
    Cloud integration: v1.0.29
    Version: 20.10.21
    API version: 1.41
    Go version: go1.18.7
    Git commit: baeda1f
    Built: Tue Oct 25 18:01:58 2022
    OS/Arch: linux/amd64
    Context: desktop-linux
    Experimental: true

Server: Docker Desktop 4.15.0 (93002)
Engine:
Version: 20.10.21
API version: 1.41 (minimum version 1.12)
Go version: go1.18.7
Git commit: 3056208
Built: Tue Oct 25 18:00:19 2022
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.6.10
GitCommit: 770bd0108c32f3fb5c73ae1264f7e503fe7b2661
runc:
Version: 1.1.4
GitCommit: v1.1.4-0-g5fd4c4d
docker-init:
Version: 0.19.0
GitCommit: de40ad0

  • [x] NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
    -Desired=Unknown/Install/Remove/Purge/Hold
    | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
    |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
    ||/ Name Version Architecture Description
    +++-===================================-============================-============-========================================================================
    un libgldispatch0-nvidia (no description available)
    ii libnvidia-cfg1-515:amd64 520.56.06-0lambda0.22.04.3 amd64 Transitional package for libnvidia-cfg1-520
    ii libnvidia-cfg1-520:amd64 520.56.06-0lambda0.22.04.3 amd64 NVIDIA binary OpenGL/GLX configuration library
    un libnvidia-cfg1-any (no description available)
    un libnvidia-common (no description available)
    ii libnvidia-common-515 520.56.06-0lambda0.22.04.3 all Transitional package for libnvidia-common-520
    ii libnvidia-common-520 520.56.06-0lambda0.22.04.3 all Shared files used by the NVIDIA libraries
    un libnvidia-compute (no description available)
    ii libnvidia-compute-515:amd64 520.56.06-0lambda0.22.04.3 amd64 Transitional package for libnvidia-compute-520
    ii libnvidia-compute-515:i386 520.56.06-0lambda0.22.04.3 i386 Transitional package for libnvidia-compute-520
    ii libnvidia-compute-520:amd64 520.56.06-0lambda0.22.04.3 amd64 NVIDIA libcompute package
    ii libnvidia-compute-520:i386 520.56.06-0lambda0.22.04.3 i386 NVIDIA libcompute package
    ii libnvidia-container-tools 1.11.0+dfsg-0lambda0.22.04.1 amd64 Package for configuring containers with NVIDIA hardware (CLI tool)
    ii libnvidia-container1:amd64 1.11.0+dfsg-0lambda0.22.04.1 amd64 Package for configuring containers with NVIDIA hardware (shared library)
    un libnvidia-decode (no description available)
    ii libnvidia-decode-515:amd64 520.56.06-0lambda0.22.04.3 amd64 Transitional package for libnvidia-decode-520
    ii libnvidia-decode-515:i386 520.56.06-0lambda0.22.04.3 i386 Transitional package for libnvidia-decode-520
    ii libnvidia-decode-520:amd64 520.56.06-0lambda0.22.04.3 amd64 NVIDIA Video Decoding runtime libraries
    ii libnvidia-decode-520:i386 520.56.06-0lambda0.22.04.3 i386 NVIDIA Video Decoding runtime libraries
    ii libnvidia-egl-wayland1:amd64 1:1.1.9-1.1 amd64 Wayland EGL External Platform library -- shared library
    un libnvidia-encode (no description available)
    ii libnvidia-encode-515:amd64 520.56.06-0lambda0.22.04.3 amd64 Transitional package for libnvidia-encode-520
    ii libnvidia-encode-515:i386 520.56.06-0lambda0.22.04.3 i386 Transitional package for libnvidia-encode-520
    ii libnvidia-encode-520:amd64 520.56.06-0lambda0.22.04.3 amd64 NVENC Video Encoding runtime library
    ii libnvidia-encode-520:i386 520.56.06-0lambda0.22.04.3 i386 NVENC Video Encoding runtime library
    un libnvidia-encode1 (no description available)
    un libnvidia-extra (no description available)
    ii libnvidia-extra-515:amd64 520.56.06-0lambda0.22.04.3 amd64 Transitional package for libnvidia-extra-520
    ii libnvidia-extra-520:amd64 520.56.06-0lambda0.22.04.3 amd64 Extra libraries for the NVIDIA driver
    ii libnvidia-extra-520:i386 520.56.06-0lambda0.22.04.3 i386 Extra libraries for the NVIDIA driver
    un libnvidia-fbc1 (no description available)
    ii libnvidia-fbc1-515:amd64 520.56.06-0lambda0.22.04.3 amd64 Transitional package for libnvidia-fbc1-520
    ii libnvidia-fbc1-515:i386 520.56.06-0lambda0.22.04.3 i386 Transitional package for libnvidia-fbc1-520
    ii libnvidia-fbc1-520:amd64 520.56.06-0lambda0.22.04.3 amd64 NVIDIA OpenGL-based Framebuffer Capture runtime library
    ii libnvidia-fbc1-520:i386 520.56.06-0lambda0.22.04.3 i386 NVIDIA OpenGL-based Framebuffer Capture runtime library
    un libnvidia-gl (no description available)
    un libnvidia-gl-390 (no description available)
    un libnvidia-gl-410 (no description available)
    un libnvidia-gl-470 (no description available)
    un libnvidia-gl-495 (no description available)
    ii libnvidia-gl-515:amd64 520.56.06-0lambda0.22.04.3 amd64 Transitional package for libnvidia-gl-520
    ii libnvidia-gl-515:i386 520.56.06-0lambda0.22.04.3 i386 Transitional package for libnvidia-gl-520
    ii libnvidia-gl-520:amd64 520.56.06-0lambda0.22.04.3 amd64 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
    ii libnvidia-gl-520:i386 520.56.06-0lambda0.22.04.3 i386 NVIDIA OpenGL/GLX/EGL/GLES GLVND libraries and Vulkan ICD
    un libnvidia-legacy-390xx-egl-wayland1 (no description available)
    un libnvidia-ml1 (no description available)
    un nvidia-common (no description available)
    un nvidia-compute-utils (no description available)
    ii nvidia-compute-utils-515 520.56.06-0lambda0.22.04.3 amd64 Transitional package for nvidia-compute-utils-520
    ii nvidia-compute-utils-520 520.56.06-0lambda0.22.04.3 amd64 NVIDIA compute utilities
    un nvidia-contaienr-runtime (no description available)
    un nvidia-container-runtime (no description available)
    un nvidia-container-runtime-hook (no description available)
    ii nvidia-container-toolkit 1.11.0-0lambda0.22.04.1 amd64 OCI hook for configuring containers for NVIDIA hardware
    ii nvidia-container-toolkit-base 1.11.0-0lambda0.22.04.1 amd64 OCI hook for configuring containers for NVIDIA hardware
    ii nvidia-dkms-515 520.56.06-0lambda0.22.04.3 amd64 Transitional package for nvidia-dkms-520
    ii nvidia-dkms-520 520.56.06-0lambda0.22.04.3 amd64 NVIDIA DKMS package
    un nvidia-dkms-kernel (no description available)
    un nvidia-driver (no description available)
    ii nvidia-driver-515 520.56.06-0lambda0.22.04.3 amd64 Transitional package for nvidia-driver-520
    ii nvidia-driver-520 520.56.06-0lambda0.22.04.3 amd64 NVIDIA driver metapackage
    un nvidia-driver-binary (no description available)
    un nvidia-egl-wayland-common (no description available)
    un nvidia-kernel-common (no description available)
    ii nvidia-kernel-common-515 520.56.06-0lambda0.22.04.3 amd64 Transitional package for nvidia-kernel-common-520
    ii nvidia-kernel-common-520 520.56.06-0lambda0.22.04.3 amd64 Shared files used with the kernel module
    un nvidia-kernel-source (no description available)
    ii nvidia-kernel-source-515 520.56.06-0lambda0.22.04.3 amd64 Transitional package for nvidia-kernel-source-520
    ii nvidia-kernel-source-520 520.56.06-0lambda0.22.04.3 amd64 NVIDIA kernel source package
    un nvidia-libopencl1-dev (no description available)
    un nvidia-opencl-icd (no description available)
    un nvidia-persistenced (no description available)
    ii nvidia-prime 0.8.17.1 all Tools to enable NVIDIA's Prime
    ii nvidia-settings 510.47.03-0ubuntu1 amd64 Tool for configuring the NVIDIA graphics driver
    un nvidia-settings-binary (no description available)
    un nvidia-smi (no description available)
    un nvidia-utils (no description available)
    ii nvidia-utils-515 520.56.06-0lambda0.22.04.3 amd64 Transitional package for nvidia-utils-520
    ii nvidia-utils-520 520.56.06-0lambda0.22.04.3 amd64 NVIDIA driver support binaries
    ii xserver-xorg-video-nvidia-515 520.56.06-0lambda0.22.04.3 amd64 Transitional package for xserver-xorg-video-nvidia-520
    ii xserver-xorg-video-nvidia-520 520.56.06-0lambda0.22.04.3 amd64 NVIDIA binary Xorg driver

  • [x] NVIDIA container library version from nvidia-container-cli -V

  • cli-version: 1.11.0
    lib-version: 1.11.0
    build date: 2022-10-25T22:10+00:00
    build revision:
    build compiler: x86_64-linux-gnu-gcc-11 11.3.0
    build platform: x86_64
    build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -Wdate-time -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -g -O2 -ffile-prefix-map=/build/libnvidia-container-956QFy/libnvidia-container-1.11.0+dfsg=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections -Wl,-Bsymbolic-functions -flto=auto -ffat-lto-objects -flto=auto -Wl,-z,relro

@elezar
Member

elezar commented Dec 2, 2022

The toolkit explicitly looks for libnvidia-ml.so.1, which should be a symlink to libnvidia-ml.so.<DRIVER_VERSION> after running ldconfig on your host. Since nvidia-smi works (and also uses libnvidia-ml.so.1), I would not expect that link to be missing.
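A quick way to check this on the host, as a sketch (the paths assume a Debian/Ubuntu multiarch layout and a 520-series driver; adjust for your system):

$ ldconfig -p | grep libnvidia-ml
$ ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
# expected: libnvidia-ml.so.1 -> libnvidia-ml.so.520.56.06 (or whatever your driver version is)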

How is docker installed, could it be that it is installed as a snap and cannot load the system libraries because of this?

@Dan-Burns
Author

Dan-Burns commented Dec 2, 2022

I installed docker-desktop after following the "docker engine" link on https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tensorflow

@johndpope

johndpope commented Dec 5, 2022

Same problem on Ubuntu 22.04.

Linux msi 5.15.0-56-generic #62-Ubuntu SMP Tue Nov 22 19:54:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

docker desktop

can you unpack this?

The toolkit explicitly looks for libnvidia-ml.so.1, which should be a symlink to libnvidia-ml.so.<DRIVER_VERSION> after running ldconfig on your host. Since nvidia-smi works (and also uses libnvidia-ml.so.1), I would not expect that link to be missing.

How is docker installed, could it be that it is installed as a snap and cannot load the system libraries because of this?

I installed

sudo apt-get install -y nvidia-docker2

successfully
nvidia-docker2 is already the newest version (2.11.0-1).

Mon Dec  5 18:59:03 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11    Driver Version: 525.60.11    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   57C    P8    29W / 370W |   1010MiB / 24576MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1515      G   /usr/lib/xorg/Xorg                548MiB |
|    0   N/A  N/A      1649      G   /usr/bin/gnome-shell              234MiB |
|    0   N/A  N/A     19695      G   ...RendererForSitePerProcess       32MiB |
|    0   N/A  N/A     19769    C+G   ...192290595412440874,131072      191MiB |
+-----------------------------------------------------------------------------+




nvidia-container-cli -k -d /dev/tty info

-- WARNING, the following logs are for debugging purposes only --

I1205 08:00:00.132727 24945 nvc.c:376] initializing library context (version=1.11.0, build=c8f267be0bac1c654d59ad4ea5df907141149977)
I1205 08:00:00.132797 24945 nvc.c:350] using root /
I1205 08:00:00.132806 24945 nvc.c:351] using ldcache /etc/ld.so.cache
I1205 08:00:00.132819 24945 nvc.c:352] using unprivileged user 29999:29999
I1205 08:00:00.132844 24945 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I1205 08:00:00.133009 24945 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W1205 08:00:00.134346 24946 nvc.c:273] failed to set inheritable capabilities
W1205 08:00:00.134424 24946 nvc.c:274] skipping kernel modules load due to failure
I1205 08:00:00.134891 24947 rpc.c:71] starting driver rpc service
I1205 08:00:00.142782 24948 rpc.c:71] starting nvcgo rpc service
I1205 08:00:00.143811 24945 nvc_info.c:766] requesting driver information with ''
I1205 08:00:00.145644 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.525.60.11
I1205 08:00:00.145731 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.525.60.11
I1205 08:00:00.145778 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.525.60.11
I1205 08:00:00.145821 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.525.60.11
I1205 08:00:00.145877 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.525.60.11
I1205 08:00:00.145930 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.525.60.11
I1205 08:00:00.145970 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.525.60.11
I1205 08:00:00.146007 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.525.60.11
I1205 08:00:00.146066 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.525.60.11
I1205 08:00:00.146105 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.525.60.11
I1205 08:00:00.146144 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.525.60.11
I1205 08:00:00.146183 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.525.60.11
I1205 08:00:00.146236 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.525.60.11
I1205 08:00:00.146288 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.525.60.11
I1205 08:00:00.146325 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.525.60.11
I1205 08:00:00.146366 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.525.60.11
I1205 08:00:00.146418 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.525.60.11
I1205 08:00:00.146475 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.525.60.11
I1205 08:00:00.146752 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcudadebugger.so.525.60.11
I1205 08:00:00.146788 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.525.60.11
I1205 08:00:00.146943 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.525.60.11
I1205 08:00:00.146977 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.525.60.11
I1205 08:00:00.147011 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.525.60.11
I1205 08:00:00.147046 24945 nvc_info.c:173] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.525.60.11
I1205 08:00:00.147106 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-tls.so.525.60.11
I1205 08:00:00.147140 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ptxjitcompiler.so.525.60.11
I1205 08:00:00.147186 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opticalflow.so.525.60.11
I1205 08:00:00.147236 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-opencl.so.525.60.11
I1205 08:00:00.147271 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-ml.so.525.60.11
I1205 08:00:00.147319 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glvkspirv.so.525.60.11
I1205 08:00:00.147350 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glsi.so.525.60.11
I1205 08:00:00.147385 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-glcore.so.525.60.11
I1205 08:00:00.147417 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-fbc.so.525.60.11
I1205 08:00:00.147465 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-encode.so.525.60.11
I1205 08:00:00.147515 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-eglcore.so.525.60.11
I1205 08:00:00.147547 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvidia-compiler.so.525.60.11
I1205 08:00:00.147582 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libnvcuvid.so.525.60.11
I1205 08:00:00.147649 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libcuda.so.525.60.11
I1205 08:00:00.147707 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLX_nvidia.so.525.60.11
I1205 08:00:00.147741 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLESv2_nvidia.so.525.60.11
I1205 08:00:00.147775 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libGLESv1_CM_nvidia.so.525.60.11
I1205 08:00:00.147811 24945 nvc_info.c:173] selecting /usr/lib/i386-linux-gnu/libEGL_nvidia.so.525.60.11
W1205 08:00:00.147830 24945 nvc_info.c:399] missing library libnvidia-nscq.so
W1205 08:00:00.147836 24945 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so
W1205 08:00:00.147842 24945 nvc_info.c:399] missing library libnvidia-pkcs11.so
W1205 08:00:00.147847 24945 nvc_info.c:399] missing library libvdpau_nvidia.so
W1205 08:00:00.147854 24945 nvc_info.c:399] missing library libnvidia-ifr.so
W1205 08:00:00.147859 24945 nvc_info.c:399] missing library libnvidia-cbl.so
W1205 08:00:00.147867 24945 nvc_info.c:403] missing compat32 library libnvidia-cfg.so
W1205 08:00:00.147873 24945 nvc_info.c:403] missing compat32 library libnvidia-nscq.so
W1205 08:00:00.147878 24945 nvc_info.c:403] missing compat32 library libcudadebugger.so
W1205 08:00:00.147887 24945 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so
W1205 08:00:00.147893 24945 nvc_info.c:403] missing compat32 library libnvidia-allocator.so
W1205 08:00:00.147899 24945 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so
W1205 08:00:00.147904 24945 nvc_info.c:403] missing compat32 library libnvidia-ngx.so
W1205 08:00:00.147910 24945 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so
W1205 08:00:00.147916 24945 nvc_info.c:403] missing compat32 library libnvidia-ifr.so
W1205 08:00:00.147921 24945 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so
W1205 08:00:00.147926 24945 nvc_info.c:403] missing compat32 library libnvoptix.so
W1205 08:00:00.147932 24945 nvc_info.c:403] missing compat32 library libnvidia-cbl.so
I1205 08:00:00.148532 24945 nvc_info.c:299] selecting /usr/bin/nvidia-smi
I1205 08:00:00.148551 24945 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump
I1205 08:00:00.148569 24945 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced
I1205 08:00:00.148598 24945 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control
I1205 08:00:00.148615 24945 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server
W1205 08:00:00.148707 24945 nvc_info.c:425] missing binary nv-fabricmanager
W1205 08:00:00.148735 24945 nvc_info.c:349] missing firmware path /lib/firmware/nvidia/525.60.11/gsp.bin
I1205 08:00:00.148762 24945 nvc_info.c:529] listing device /dev/nvidiactl
I1205 08:00:00.148767 24945 nvc_info.c:529] listing device /dev/nvidia-uvm
I1205 08:00:00.148775 24945 nvc_info.c:529] listing device /dev/nvidia-uvm-tools
I1205 08:00:00.148781 24945 nvc_info.c:529] listing device /dev/nvidia-modeset
I1205 08:00:00.148809 24945 nvc_info.c:343] listing ipc path /run/nvidia-persistenced/socket
W1205 08:00:00.148831 24945 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket
W1205 08:00:00.148847 24945 nvc_info.c:349] missing ipc path /tmp/nvidia-mps
I1205 08:00:00.148851 24945 nvc_info.c:822] requesting device information with ''
I1205 08:00:00.155221 24945 nvc_info.c:713] listing device /dev/nvidia0 (GPU-94c5d11e-e574-eefc-2db6-08e204f9e1a4 at 00000000:01:00.0)
NVRM version:   525.60.11
CUDA version:   12.0

Device Index:   0
Device Minor:   0
Model:          NVIDIA GeForce RTX 3090
Brand:          GeForce
GPU UUID:       GPU-94c5d11e-e574-eefc-2db6-08e204f9e1a4
Bus Location:   00000000:01:00.0
Architecture:   8.6
I1205 08:00:00.155235 24945 nvc.c:434] shutting down library context
I1205 08:00:00.155296 24948 rpc.c:95] terminating nvcgo rpc service
I1205 08:00:00.155542 24945 rpc.c:135] nvcgo rpc service terminated successfully
I1205 08:00:00.156623 24947 rpc.c:95] terminating driver rpc service
I1205 08:00:00.156671 24945 rpc.c:135] driver rpc service terminated successfully

@johndpope

johndpope commented Dec 5, 2022

Screenshot from 2022-12-05 22-35-34
Not sure if it helps - I had originally installed the driver from the CUDA 11.8 package, but when I did the nvidia-docker2 install the driver broke, so I reverted back to the system (auto-installed) driver.

UPDATE

UPDATE: reading through the docs at
https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html

this command works fine:


sudo docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
Unable to find image 'nvidia/cuda:11.8.0-base-ubuntu22.04' locally
11.8.0-base-ubuntu22.04: Pulling from nvidia/cuda
301a8b74f71f: Already exists 
35985d37d899: Already exists 
5b7513e7876e: Already exists 
bbf319bc026c: Already exists 
da5c9c5d5ac3: Already exists 
Digest: sha256:83493b3f150cc23f91fb0d2509e491204e33f062355d401662389a80a9091b82
Status: Downloaded newer image for nvidia/cuda:11.8.0-base-ubuntu22.04
Mon Dec  5 23:05:46 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11    Driver Version: 525.60.11    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   44C    P8    25W / 370W |    995MiB / 24576MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+


OK -

so it's basically only a problem when running without sudo...

docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi 
Unable to find image 'nvidia/cuda:11.8.0-base-ubuntu22.04' locally
11.8.0-base-ubuntu22.04: Pulling from nvidia/cuda
301a8b74f71f: Already exists 
35985d37d899: Already exists 
5b7513e7876e: Already exists 
bbf319bc026c: Already exists 
da5c9c5d5ac3: Already exists 
Digest: sha256:83493b3f150cc23f91fb0d2509e491204e33f062355d401662389a80a9091b82
Status: Downloaded newer image for nvidia/cuda:11.8.0-base-ubuntu22.04
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.

UPDATE - FIXED.
I don't know if this helps, but on my installation I had
cudnn-local-repo-ubuntu2204-8.6.0.163_1.0-1_amd64.deb + CUDA 11.8
which is incorrect. I was using cog, and it didn't surface the error - it just assumed everything was working correctly.
Updating to the latest cuDNN resolved my original issue:
cudnn-local-repo-ubuntu2204-8.7.0.84_1.0-1_amd64.deb
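For anyone checking the same thing, a hedged way to list which cuDNN packages are installed on a Debian/Ubuntu host:

$ dpkg -l | grep -i cudnn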

@groucho64738

groucho64738 commented Dec 12, 2022

I'm having a similar issue on a system I'm using for K8s: no containers that require the NVIDIA drivers can run, and they fail with the same error (about libnvidia-ml.so.1). I'm not sure which specific step broke it for me, though. I was able to reproduce the error message on the command line by running the CUDA container directly on our node: docker run --gpus=all --runtime=nvidia nvidia/cuda:11.8.0-base-ubuntu20.04 nvidia-smi

I created a debug log for nvidia-container-toolkit:

I1212 19:31:12.254613 192312 nvc.c:376] initializing library context (version=1.10.0, build=395fd41701117121f1fd04ada01e1d7e006a37ae)
I1212 19:31:12.254655 192312 nvc.c:350] using root /run/nvidia/driver
I1212 19:31:12.254660 192312 nvc.c:351] using ldcache /etc/ld.so.cache
I1212 19:31:12.254665 192312 nvc.c:352] using unprivileged user 65534:65534
I1212 19:31:12.254683 192312 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I1212 19:31:12.254778 192312 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
I1212 19:31:12.267767 192321 nvc.c:278] loading kernel module nvidia
E1212 19:31:12.267872 192321 nvc.c:280] could not load kernel module nvidia
I1212 19:31:12.267883 192321 nvc.c:296] loading kernel module nvidia_uvm
E1212 19:31:12.267904 192321 nvc.c:298] could not load kernel module nvidia_uvm
I1212 19:31:12.267914 192321 nvc.c:305] loading kernel module nvidia_modeset
E1212 19:31:12.267934 192321 nvc.c:307] could not load kernel module nvidia_modeset
I1212 19:31:12.268200 192322 rpc.c:71] starting driver rpc service
I1212 19:31:12.268825 192312 rpc.c:135] driver rpc service terminated with signal 15
I1212 19:31:12.268870 192312 nvc.c:434] shutting down library context

Not a lot of help there. If I run nvidia-container-cli -k -d /dev/tty info I get a list of all of the modules and libraries, so that part functions. I've also tried running the container in privileged mode and still get the same error. Each time I'm root when trying to kick off the container, to eliminate that piece as well.

This is an Ubuntu 20.04 system running Docker 20.10.18. I've followed the directions in the install guide (pretty straightforward to follow).

If there are any suggestions of what else to try to debug, I'm willing to give them a try. This has been a real headache.

@groucho64738

I actually managed to fix this. At some point we had uncommented the option root = "/run/nvidia/driver" in /etc/nvidia-container-runtime/config.toml (we must have seen directions on this somewhere). My best guess is that we updated something on the system that made this no longer a viable option, and after a reboot everything stopped working. I commented that option back out and everything popped up.

To find it, I created a wrapper around nvidia-container-cli:

#!/bin/bash

echo "$@" > /var/tmp/debuginfo
/usr/bin/nvidia-container-cli.real "$@"
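
For reference, one way such a wrapper could be put in place - a sketch that assumes the real binary is moved aside first and that /usr/bin/nvidia-container-cli is the path the runtime hook invokes:

$ sudo mv /usr/bin/nvidia-container-cli /usr/bin/nvidia-container-cli.real
$ sudo tee /usr/bin/nvidia-container-cli > /dev/null <<'EOF'
#!/bin/bash
# log the arguments the hook passes, then hand off to the real CLI
echo "$@" > /var/tmp/debuginfo
exec /usr/bin/nvidia-container-cli.real "$@"
EOF
$ sudo chmod +x /usr/bin/nvidia-container-cli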

That showed me the options that were being passed on a working and on a non-working system.

Not working:

--root=/run/nvidia/driver --load-kmods --debug=/var/log/nvidia-container-toolkit.log configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=11.8 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=unknown,driver>=510,driver<511 brand=nvidia,driver>=510,driver<511 brand=nvidiartx,driver>=510,driver<511 brand=geforce,driver>=510,driver<511 brand=geforcertx,driver>=510,driver<511 brand=quadro,driver>=510,driver<511 brand=quadrortx,driver>=510,driver<511 brand=titan,driver>=510,driver<511 brand=titanrtx,driver>=510,driver<511 brand=unknown,driver>=515,driver<516 brand=nvidia,driver>=515,driver<516 brand=nvidiartx,driver>=515,driver<516 brand=geforce,driver>=515,driver<516 brand=geforcertx,driver>=515,driver<516 brand=quadro,driver>=515,driver<516 brand=quadrortx,driver>=515,driver<516 brand=titan,driver>=515,driver<516 brand=titanrtx,driver>=515,driver<516 --pid=3895576 /var/lib/docker/overlay2/47f7deb4479aa6b8c26f3b6e3ad4a2cd9bd86304736bf9aed68ed4127fbc0d00/merged

Working:

--load-kmods configure --ldconfig=@/sbin/ldconfig.real --device=all --compute --utility --require=cuda>=11.8 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=unknown,driver>=510,driver<511 brand=nvidia,driver>=510,driver<511 brand=nvidiartx,driver>=510,driver<511 brand=geforce,driver>=510,driver<511 brand=geforcertx,driver>=510,driver<511 brand=quadro,driver>=510,driver<511 brand=quadrortx,driver>=510,driver<511 brand=titan,driver>=510,driver<511 brand=titanrtx,driver>=510,driver<511 brand=unknown,driver>=515,driver<516 brand=nvidia,driver>=515,driver<516 brand=nvidiartx,driver>=515,driver<516 brand=geforce,driver>=515,driver<516 brand=geforcertx,driver>=515,driver<516 brand=quadro,driver>=515,driver<516 brand=quadrortx,driver>=515,driver<516 brand=titan,driver>=515,driver<516 brand=titanrtx,driver>=515,driver<516 --pid=2830327 /var/lib/docker/overlay2/59206c16f5a12eadbe2e42287a7ff6aa3559b0666048d7578b29df90e3755d50/merged

@johndpope

From the look of it, the first one is driver 470 and the second is 511. It does seem like everything can be working fine and then Ubuntu automatically changes the driver (leaving a broken system). I recommend using Timeshift to create a snapshot whenever everything is working (after a new driver / CUDA update, etc.) - https://github.com/linuxmint/timeshift - it's trivial to roll back to a working snapshot and you won't lose any personal files.

@ThatCooperLewis

ThatCooperLewis commented Dec 14, 2022

Trying to build containers on Arch here. I installed Docker through docker-desktop originally, but I've also installed nvidia-docker, cuda, cuda-tools, cudnn, and nvidia-container-toolkit on the host machine in an attempt to resolve this.

The only workaround I've found so far is to run docker as root. That resolves this specific issue, but of course I'd rather not be forced to run all my docker commands via sudo (and Docker Desktop fails to recognize those containers/images).

Some relevant outputs from host machine:

$ ldconfig -p | grep cuda         
        
        libicudata.so.72 (libc6,x86-64) => /usr/lib/libicudata.so.72
        libicudata.so.72 (ELF) => /usr/lib32/libicudata.so.72
        libicudata.so (libc6,x86-64) => /usr/lib/libicudata.so
        libicudata.so (ELF) => /usr/lib32/libicudata.so
        libcuda.so.1 (libc6,x86-64) => /usr/lib/libcuda.so.1
        libcuda.so.1 (libc6) => /usr/lib32/libcuda.so.1
        libcuda.so (libc6,x86-64) => /usr/lib/libcuda.so
        libcuda.so (libc6) => /usr/lib32/libcuda.so
$ nvidia-smi
Wed Dec 14 15:20:49 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.11    Driver Version: 525.60.11    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:0C:00.0  On |                  N/A |
|  0%   32C    P8    30W / 370W |    937MiB / 10240MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
$ uname -a
6.0.12-arch1-1 #1 SMP PREEMPT_DYNAMIC Thu, 08 Dec 2022 11:03:38 +0000 x86_64 GNU/Linux

@ThatCooperLewis

ThatCooperLewis commented Dec 15, 2022

I changed distros and still had a very similar issue with Docker Desktop + nvidia-docker together. But adding this workaround to the nvidia runtime config seemed to fix things for me. [UPDATE: It does not]

$ vi /etc/nvidia-container-runtime/config.toml

no-cgroups = true

Unsure of whether this was the cause in my old distro (EndeavorOS), but I will try to confirm later.

@shiwakant

All instructions were helpful, but I had to start docker, docker build, and docker run with root privileges to make it work.
Even after repeated attempts, I was unable to run it with user-level permissions.

@pochoi

pochoi commented Jan 8, 2023

All instructions were helpful, but I had to start docker, docker build, and docker run with root privileges to make it work. Even after repeated attempts, I was unable to run it with user-level permissions.

I have the same error (nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1) when running docker without sudo.

Is there a way to get this working without sudo?

@ThatCooperLewis

@shiwakant @pochoi You can get this working by avoiding Docker Desktop and instead setting up Docker Rootless Mode
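For anyone going that route, a hedged sketch of the rootless setup (it assumes the docker-ce-rootless-extras package is installed, that the runtime config lives at /etc/nvidia-container-runtime/config.toml with the stock commented "#no-cgroups = false" line, and that your UID owns the user session; see the official rootless docs for the authoritative steps):

$ dockerd-rootless-setuptool.sh install
$ export DOCKER_HOST=unix:///run/user/$(id -u)/docker.sock
# rootless Docker cannot manage device cgroups, so the NVIDIA runtime needs cgroups disabled:
$ sudo sed -i 's/^#no-cgroups = false/no-cgroups = true/' /etc/nvidia-container-runtime/config.toml
$ docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi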

@lbadi

lbadi commented Jan 19, 2023

I know this might be a dumb answer, but I was having the same issue and it got fixed after I logged in with docker login ghcr.io -u *** --password-stdin.

@turboazot

turboazot commented Mar 24, 2023

From my side I used:

sudo ldconfig

That worked for me. But if you are using a Docker image with dind and the nvidia-docker integration in it, execute this in the entrypoint script, otherwise it may not work.
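As an illustration of that, a minimal entrypoint sketch for a dind image with the NVIDIA runtime baked in (the layout is an assumption, not taken from the original setup):

#!/bin/sh
# refresh the linker cache so libnvidia-ml.so.1 can be resolved
# before any GPU container is started
ldconfig
exec "$@"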

@justmiles

I ran into this as well and was simply missing the nvidia-driver-<version> and nvidia-dkms-<version> packages. It would be worth double-checking that the actual NVIDIA drivers are installed.
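A hedged way to do that check on Debian/Ubuntu (the driver version below is only an example):

$ dpkg -l 'nvidia-driver-*' 'nvidia-dkms-*'
# install them if they are missing, e.g.:
$ sudo apt-get install nvidia-driver-525 nvidia-dkms-525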

@pfcouto

pfcouto commented Apr 23, 2023

Hello, can I bring up this topic again?

1. Issue or feature description

Upon running the command docker run --privileged --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi I get the error

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.

2. Steps to reproduce the issue

I installed the NVIDIA driver following the Fedora docs, not NVIDIA's, so, for example, nvcc --version outputs an error saying that the nvcc command is not recognized, but on my host machine I can run nvidia-smi.

The commands I used to install nvidia are the following:

sudo dnf install akmod-nvidia
sudo dnf install xorg-x11-drv-nvidia-cuda

And as visible in the following image, I am able to run the command nvidia-smi on my host machine.

image

I followed this guide on how to install nvidia-docker - - and did the following:

curl -s -L https://nvidia.github.io/libnvidia-container/centos8/libnvidia-container.repo | sudo tee /etc/yum.repos.d/nvidia-container-toolkit.repo
##############################
sudo dnf install nvidia-docker2
# Edit /etc/nvidia-container-runtime/config.toml and disable cgroups:
no-cgroups = true

sudo reboot
##############################
sudo systemctl start docker.service
##############################
docker run --privileged --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

and upon running this docker command I get the error shown in section 1.

The thing is, I have the file that it says it is missing (check the following image), so maybe it is looking for it in a different directory?

image

3. Information to attach (optional if deemed irrelevant)

uname -a:

Linux fedora 6.2.10-200.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Apr  6 23:30:41 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

docker version

Client: Docker Engine - Community
 Cloud integration: v1.0.31
 Version:           23.0.3
 API version:       1.41 (downgraded from 1.42)
 Go version:        go1.19.7
 Git commit:        3e7cbfd
 Built:             Tue Apr  4 22:10:33 2023
 OS/Arch:           linux/amd64
 Context:           desktop-linux

Server: Docker Desktop 4.18.0 (104112)
 Engine:
  Version:          20.10.24
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.19.7
  Git commit:       5d6db84
  Built:            Tue Apr  4 18:18:42 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.18
  GitCommit:        2456e983eb9e37e47538f59ea18f2043c9a73640
 runc:
  Version:          1.1.4
  GitCommit:        v1.1.4-0-g5fd4c4d
 docker-init:
  Version:          0.19.0
  GitCommit:        de40ad0

rpm -qa '*nvidia*'

 nvidia-gpu-firmware-20230310-148.fc37.noarch
xorg-x11-drv-nvidia-kmodsrc-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-cuda-libs-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-libs-530.41.03-1.fc37.x86_64
nvidia-settings-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-power-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-530.41.03-1.fc37.x86_64
akmod-nvidia-530.41.03-1.fc37.x86_64
kmod-nvidia-6.2.9-200.fc37.x86_64-530.41.03-1.fc37.x86_64
nvidia-persistenced-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-cuda-530.41.03-1.fc37.x86_64
xorg-x11-drv-nvidia-libs-530.41.03-1.fc37.i686
xorg-x11-drv-nvidia-cuda-libs-530.41.03-1.fc37.i686
kmod-nvidia-6.2.10-200.fc37.x86_64-530.41.03-1.fc37.x86_64
nvidia-container-toolkit-base-1.13.0-1.x86_64
libnvidia-container1-1.13.0-1.x86_64
libnvidia-container-tools-1.13.0-1.x86_64
nvidia-container-toolkit-1.13.0-1.x86_64
nvidia-docker2-2.13.0-1.noarch

nvidia-container-cli -V

cli-version: 1.13.0
lib-version: 1.13.0
build date: 2023-03-31T13:12+00:00
build revision: 20823911e978a50b33823a5783f92b6e345b241a
build compiler: gcc 8.5.0 20210514 (Red Hat 8.5.0-18)
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

Thanks for your help!

@elezar
Member

elezar commented Apr 24, 2023

@pfcouto and others that show this behaviour. Please enable debug logging for the nvidia-container-cli in the /etc/nvidia-container-toolkit/config.toml by uncommenting the #debug = line in that section.

Running a container should then generate a log at /var/log/nvidia-container-toolkit.log which may help to further debug this.
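For illustration, after uncommenting, the relevant section should look roughly like this (a sketch; on some installations the file is at /etc/nvidia-container-runtime/config.toml instead, as noted elsewhere in this thread):

[nvidia-container-cli]
debug = "/var/log/nvidia-container-toolkit.log"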

Note that the NVIDIA Container CLI needs to load libnvidia-ml.so.1 to retrieve the required information about the GPUs in the system. We have seen this behaviour when Docker Desktop is used, for example, since the hook is then executed in a VM that does not have access to the libraries and devices on the host. How is docker installed in this case?

Note, if you're able to install a recent version of podman, this could be an alternative as a CDI specification could be generated instead of relying on the nvidia-container-cli-based injection.
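A hedged sketch of that CDI route (it assumes a recent NVIDIA Container Toolkit that ships the nvidia-ctk binary and a CDI-capable podman):

$ sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
$ podman run --rm --device nvidia.com/gpu=all ubuntu nvidia-smi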

@pfcouto
Copy link

pfcouto commented Apr 24, 2023

Hello @elezar, I will do what you said. Thanks! I think one of the issues is that, since I installed the NVIDIA drivers through RPM Fusion, the file is not in the default location: nvidia-docker is looking for it in one place, while I have it in another. How can I change the Docker image so that it can access my file in that different location?

As shown in the picture, when I installed the NVIDIA drivers through RPM it also installed a Flatpak, but the file is present. The thing is, the file is not where nvidia-docker expects it to be.

Can I create my own Dockerfile, like:

FROM nvidia/docker
COPY (my lib file location) (Docker image location)

Or should I just change the default location on my local machine to the correct one, just to test whether it works?

image

@elezar
Copy link
Member

elezar commented Apr 24, 2023

@pfcouto if the drivers are at a different location than expected, you could look at setting the root option in the config.toml. We use this setting when running the toolkit with our driver container. In that case we install the driver (and device nodes) to /run/nvidia/driver and root = /run/nvidia/driver is specified in the config.
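
As a rough sketch, the relevant setting would be (the /run/nvidia/driver path is only an example for a driver-container installation; point it at whatever prefix your libnvidia-ml.so.1 actually lives under):

[nvidia-container-cli]
root = "/run/nvidia/driver"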

@pfcouto
Copy link

pfcouto commented Apr 24, 2023

Hello @elezar. I do not have the folder /etc/nvidia-container-toolkit/ that you mentioned. However, I do have a folder nvidia-container-runtime which has a config.toml file, as shown in the picture. Is it OK for me to change in this file what you said to change in the other? Thanks!

image

@pfcouto
Copy link

pfcouto commented Apr 24, 2023

I changed the file as shown in the first picture: I uncommented the debug line and changed root to a directory where I have the libnvidia-ml.so.1 file (I don't know if I should have changed this, but I did). I then ran the command docker run --privileged --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi and it output the same error:

Unable to find image 'nvidia/cuda:11.0.3-base-ubuntu20.04' locally
11.0.3-base-ubuntu20.04: Pulling from nvidia/cuda
d7bfe07ed847: Pull complete 
75eccf561042: Pull complete 
191419884744: Pull complete 
a17a942db7e1: Pull complete 
16156c70987f: Pull complete 
Digest: sha256:57455121f3393b7ed9e5a0bc2b046f57ee7187ea9ec562a7d17bf8c97174040d
Status: Downloaded newer image for nvidia/cuda:11.0.3-base-ubuntu20.04
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
ERRO[0003] error waiting for container:

Then I went ahead and tried to look into the log file but it was not created...
cat /var/log/nvidia-container-toolkit.log:

cat: /var/log/nvidia-container-toolkit.log: No such file or directory

image

@pfcouto
Copy link

pfcouto commented Apr 27, 2023

Hi again @elezar, also, I don't have the folder /run/nvidia/driver

image

@lishoulong
Copy link

lishoulong commented May 2, 2023

Maybe just don't install CUDA and the NVIDIA container packages?

@elezar
Copy link
Member

elezar commented May 2, 2023

Hi again @elezar, also, I don't have the folder /run/nvidia/driver

Sorry for the lack of clarity. I was using /run/nvidia/driver as an example of a path we use when installing the driver using our driver container. The NVIDIA Container Toolkit considers the root setting when looking for libnvidia-ml.so.1 (the standard lib paths are prepended), and if your installation has these libraries at a non-standard location this setting will help to locate them.

Since your output in one of your comments does show /usr/lib64/libnvidia-ml.so.1, could you confirm where this symlink points? (Your output also shows a Flatpak location.)

Could you link the Fedora docs you used to install the driver?
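
For example, the symlink target and the ldcache entries can be checked with:

readlink -f /usr/lib64/libnvidia-ml.so.1
ldconfig -p | grep libnvidia-ml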

@AskAlice
Copy link

AskAlice commented Sep 1, 2023

I have this issue unless I run as root. I'm using docker-desktop.

❯ stat /usr/lib/libnvidia-ml.*
  File: /usr/lib/libnvidia-ml.so -> libnvidia-ml.so.1
  Size: 17              Blocks: 8          IO Block: 4096   symbolic link
Device: 0,26    Inode: 1753370     Links: 1
Access: (0777/lrwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2023-09-01 13:46:31.971672402 -0600
Modify: 2023-08-22 11:37:15.000000000 -0600
Change: 2023-08-26 22:24:35.978217529 -0600
 Birth: 2023-08-26 22:24:35.978217529 -0600
  File: /usr/lib/libnvidia-ml.so.1 -> libnvidia-ml.so.535.104.05
  Size: 26              Blocks: 8          IO Block: 4096   symbolic link
Device: 0,26    Inode: 1753371     Links: 1
Access: (0777/lrwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2023-09-01 13:46:31.971672402 -0600
Modify: 2023-08-22 11:37:15.000000000 -0600
Change: 2023-08-26 22:24:35.978217529 -0600
 Birth: 2023-08-26 22:24:35.978217529 -0600
  File: /usr/lib/libnvidia-ml.so.535.104.05
  Size: 1815872         Blocks: 3552       IO Block: 4096   regular file
Device: 0,26    Inode: 1753372     Links: 1
Access: (0777/-rwxrwxrwx)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2023-09-01 14:25:25.495551694 -0600
Modify: 2023-08-22 11:37:15.000000000 -0600
Change: 2023-09-01 14:24:06.427728737 -0600
 Birth: 2023-08-26 22:24:35.978217529 -0600

It also seems to be reproducible with the PKGBUILD I created here: https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/issues/17#note_1530784413

here is my config.toml

disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
#root = "/run/nvidia/driver"
path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
ldcache = "/etc/ld.so.cache"
load-kmods = true
no-cgroups = false
#user = "root:video"
ldconfig = "/sbin/ldconfig"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"

# Specify the runtimes to consider. This list is processed in order and the PATH
# searched for matching executables unless the entry is an absolute path.
runtimes = [
    "docker-runc",
    "runc",
]

mode = "auto"

    [nvidia-container-runtime.modes.csv]

    mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"

@bkocis
Copy link

bkocis commented Sep 28, 2023

I had the same issue. For me, a reinstall of Docker fixed it.

I ran this as a bash script:

sudo apt-get update

sudo apt install apt-transport-https ca-certificates curl software-properties-common

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu focal stable"

apt-cache policy docker-ce

sudo apt install docker-ce

@RobQuistNL
Copy link

RobQuistNL commented Oct 25, 2023

Looks like this just doesn't work with Docker Desktop.

When you run the script that @bkocis shared, you're installing docker-ce, most likely alongside Docker Desktop. So the sudo version of docker runs the CE engine, while the regular one uses your Docker Desktop engine.

At least, this is what happens for me :)

$ docker run --privileged --gpus all nvidia/cuda:12.2.2-runtime-ubuntu22.04 nvidia-smi
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
ERRO[0000] error waiting for container:                 
$ sudo docker run --privileged --gpus all nvidia/cuda:12.2.2-runtime-ubuntu22.04 nvidia-smi

==========
== CUDA ==
==========

CUDA Version 12.2.2

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

Wed Oct 25 18:34:56 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.125.06   Driver Version: 525.125.06   CUDA Version: 12.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  Off |
|  0%   51C    P0    72W / 450W |   1493MiB / 24564MiB |      4%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Before installing docker-ce, you'd get this error:

docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.
See 'docker run --help'.
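
A quick way to confirm which engine each invocation talks to is to compare the contexts (the default context points at the local docker-ce socket, while Docker Desktop uses its own):

docker context ls
sudo docker context ls
# optionally switch the non-root client back to the local engine:
docker context use default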

@JosephKuchar
Copy link

Hi all,

I'm having this same issue. It's perplexing because everything was working as of a few weeks ago, but it seems that since we rebooted the machine, Docker's ability to access the GPU has somehow broken. In my case, running docker with sudo does not make a difference.

sudo docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.

The output of nvidia-smi is the following:

Wed Oct 25 16:30:24 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Quadro RTX 5000                 On | 00000000:B3:00.0 Off |                  Off |
| 33%   28C    P8               13W / 230W|     71MiB / 16384MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2463      G   /usr/lib/xorg/Xorg                           63MiB |
|    0   N/A  N/A      2957      G   /usr/bin/gnome-shell                          5MiB |
+---------------------------------------------------------------------------------------+

I also edited /etc/nvidia-container-toolkit/config.toml by uncommenting the #debug = line in that section. The output suggests it's not able to find the NVIDIA devices:

I1026 12:51:01.432712 1726142 nvc.c:376] initializing library context (version=1.12.0, build=7678e1af094d865441d0bc1b97>
I1026 12:51:01.432857 1726142 nvc.c:350] using root /
I1026 12:51:01.432876 1726142 nvc.c:351] using ldcache /etc/ld.so.cache
I1026 12:51:01.432891 1726142 nvc.c:352] using unprivileged user 65534:65534
I1026 12:51:01.432931 1726142 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for>
I1026 12:51:01.433368 1726142 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
W1026 12:51:01.436341 1726142 nvc.c:258] failed to detect NVIDIA devices
I1026 12:51:01.436817 1726149 nvc.c:278] loading kernel module nvidia
I1026 12:51:01.437436 1726149 nvc.c:282] running mknod for /dev/nvidiactl
I1026 12:51:01.437525 1726149 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps
I1026 12:51:01.453626 1726149 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabi>
I1026 12:51:01.453787 1726149 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabi>
I1026 12:51:01.456764 1726149 nvc.c:296] loading kernel module nvidia_uvm
I1026 12:51:01.456957 1726149 nvc.c:300] running mknod for /dev/nvidia-uvm
I1026 12:51:01.457046 1726149 nvc.c:305] loading kernel module nvidia_modeset
I1026 12:51:01.457288 1726149 nvc.c:309] running mknod for /dev/nvidia-modeset
I1026 12:51:01.458019 1726150 rpc.c:71] starting driver rpc service
I1026 12:51:01.459088 1726142 rpc.c:132] driver rpc service terminated with signal 15
I1026 12:51:01.459205 1726142 nvc.c:434] shutting down library context

As I said, this was working a few weeks ago, so I'm not sure what's changed. We haven't updated any drivers or anything of that nature that I'm aware of. Any help appreciated!

@bkocis
Copy link

bkocis commented Oct 30, 2023

@JosephKuchar try reinstalling Docker. I had a similar problem, the issue being a missing runtime (see docker info).
The solution for me was to reinstall Docker: NVIDIA/nvidia-docker#1648 (comment)
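
To see whether the nvidia runtime is actually registered with the daemon, something like this works:

docker info --format '{{json .Runtimes}}'
# or simply:
docker info | grep -i -A3 runtimes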

@destefy
Copy link

destefy commented Nov 8, 2023

Thanks @bkocis! Worked like a charm!

@elezar elezar transferred this issue from NVIDIA/nvidia-docker Nov 19, 2023
@archenroot
Copy link

archenroot commented Dec 7, 2023

Hi guys,

I hit the same issue on Ubuntu 22.04 LTS. I followed the instructions below to reinstall (note that I had also initially installed docker-desktop):

sudo apt-get purge docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin docker-ce-rootless-extras
sudo rm -rf /var/lib/docker
sudo rm -rf /var/lib/containerd
for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do sudo apt-get remove $pkg; done
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo systemctl restart docker

And I was able to run the container from the comment above:
docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark
and I was finally able to run RAPIDS:
docker run --gpus all --pull always --rm -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p 8888:8888 -p 8787:8787 -p 8786:8786 rapidsai/notebooks:23.12a-cuda11.2-py3.10

At the moment docker-desktop is uninstalled. I will try to install it again and run the tests.

@archenroot
Copy link

I additionally installed docker-desktop again and rebooted, and the GPU in containers still works.

@goldwater668
Copy link

@johndpope
I installed docker-desktop on Windows 10. The graphics card driver is 546.01 and the following error is reported. How should I solve it?
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.

@archenroot
Copy link

archenroot commented Dec 8, 2023

@goldwater668 It seems that reinstalling the Docker engine helps on Linux machines. As far as I understand, the root cause may be somewhere in docker-desktop, but I didn't find it myself.

@goldwater668
Copy link

@elezar nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.

@johndpope
Copy link

@goldwater668 - I see you're on Windows; try running things as administrator.

Starting Docker image cog-cog-svd-base and running setup()...
docker: Error response from daemon: failed to create task for container: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
ⅹ Failed to start container: exit status 127

Running with sudo fixed this for me.

@lihkinVerma
Copy link

lihkinVerma commented Dec 18, 2023

Hi guys,

I hit same issue on Ubuntu 22.04 LTS, I followed instructions to reinstall as bellow (note I have installed as well docker-desktop initially)

sudo apt-get purge docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin docker-ce-rootless-extras
sudo rm -rf /var/lib/docker
sudo rm -rf /var/lib/containerd
for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do sudo apt-get remove $pkg; done
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo systemctl restart docker

And I was able to run this machine from above comment: docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark and I was finally capable to run RAPIDS: docker run --gpus all --pull always --rm -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p 8888:8888 -p 8787:8787 -p 8786:8786 rapidsai/notebooks:23.12a-cuda11.2-py3.10

At moment the docker-desktop is uninstalled. I will try to install it again and run the tests.

This worked for me. Thankyou so much

@combofish
Copy link

Hi guys,
I hit same issue on Ubuntu 22.04 LTS, I followed instructions to reinstall as bellow (note I have installed as well docker-desktop initially)

sudo apt-get purge docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin docker-ce-rootless-extras
sudo rm -rf /var/lib/docker
sudo rm -rf /var/lib/containerd
for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do sudo apt-get remove $pkg; done
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo systemctl restart docker

And I was able to run this machine from above comment: docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark and I was finally capable to run RAPIDS: docker run --gpus all --pull always --rm -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p 8888:8888 -p 8787:8787 -p 8786:8786 rapidsai/notebooks:23.12a-cuda11.2-py3.10
At moment the docker-desktop is uninstalled. I will try to install it again and run the tests.

This worked for me. Thankyou so much

@huangpan2507
Copy link

huangpan2507 commented Jan 4, 2024

Hi guys,

I hit same issue on Ubuntu 22.04 LTS, I followed instructions to reinstall as bellow (note I have installed as well docker-desktop initially)

sudo apt-get purge docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin docker-ce-rootless-extras
sudo rm -rf /var/lib/docker
sudo rm -rf /var/lib/containerd
for pkg in docker.io docker-doc docker-compose docker-compose-v2 podman-docker containerd runc; do sudo apt-get remove $pkg; done
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo systemctl restart docker

And I was able to run this machine from above comment: docker run --rm -it --gpus=all nvcr.io/nvidia/k8s/cuda-sample:nbody nbody -gpu -benchmark and I was finally capable to run RAPIDS: docker run --gpus all --pull always --rm -it --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 -p 8888:8888 -p 8787:8787 -p 8786:8786 rapidsai/notebooks:23.12a-cuda11.2-py3.10

At moment the docker-desktop is uninstalled. I will try to install it again and run the tests.

Thanks, I had the same issue and this solution helped me! But please note that all your Docker images will be deleted because of sudo rm -rf /var/lib/docker!

@zebin-huang
Copy link

I had the same issue. For me a reinstall of docker fixed the issue:

I run as a bash script:

sudo apt-get update

sudo apt install apt-transport-https ca-certificates curl software-properties-common

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu focal stable"

apt-cache policy docker-ce

sudo apt install docker-ce

This works for me!

@sjmach
Copy link

sjmach commented Feb 8, 2024

I had the same issue. For me a reinstall of docker fixed the issue:

I run as a bash script:

sudo apt-get update

sudo apt install apt-transport-https ca-certificates curl software-properties-common

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu focal stable"

apt-cache policy docker-ce

sudo apt install docker-ce

This works for me too. Please substitute the appropriate Ubuntu release in the fourth line; run lsb_release -a in your terminal first to get the codename string.
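
For example, the codename can be substituted directly instead of hard-coding focal (assuming an Ubuntu host):

sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"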

@clh15683
Copy link

If you encounter this with Docker Desktop, make sure that you enable the WSL integration for your distribution under Settings -> Resources -> WSL Integration. It seems that Docker Desktop occasionally forgets this setting on updates.

@GabrielDornelles
Copy link

I followed this conversation last night, turned off the PC, and everything was good. Today I went back to work and it wasn't working anymore, with the same error about not finding libnvidia-ml.so.1.

I don't really know how to solve this; it's a packaging issue, as pointed out by others. What I had to do to make it work again was:

sudo snap remove --purge docker

That removes the snap Docker packages (the previous long shell command is what worked before, but now it doesn't).

Then reinstalling everything again following the Docker instructions:

# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

@gerardo8a
Copy link

gerardo8a commented Mar 6, 2024

I had the issue with the NVIDIA library as well. After looking at one of my working nodes, the /usr/local/nvidia/toolkit/.config/nvidia-container-runtime/config.toml was different: the following two values were set differently, and that was the root of the error.

working

....
[nvidia-container-cli]
  environment = []
  ldconfig = "@/sbin/ldconfig.real"
  load-kmods = true
  path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
  root = "/"
...

Not working

...
[nvidia-container-cli]
  environment = []
  ldconfig = "@/run/nvidia/driver/sbin/ldconfig.real"
  load-kmods = true
  path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
  root = "/run/nvidia/driver"
...
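
A quick sanity check for which of the two configurations applies on a given node is whether the containerized driver root actually exists, e.g.:

ls /run/nvidia/driver 2>/dev/null || echo "no containerized driver here; root = \"/\" is probably what you want"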

@elezar
Copy link
Member

elezar commented Mar 7, 2024

@gerardo8a how was your NVIDIA Container Toolkit and the NVIDIA driver installed? Your non-working config seems to reference a containerized driver installation (usually under the GPU Operator), whereas your working config references a driver installation on the host (note the root and ldconfig values).

@iganev
Copy link

iganev commented Apr 30, 2024

I followed this conversation last night, turned off the pc and good. Today I went back to work, and it wasnt working anymore, same error occuring of not finding libnvidia-ml.so.1.

I don't really know how to solve this, its package issues as pointed by others. What I had to do to make it work again was

sudo snap remove --purge docker

Removing the docker stuff (the previous long shell command is what worked before, but now it doesnt).

Then re installing everything again from docker instructions:

# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

As Gabe describes here, if reinstalling Docker (apt reinstall docker-ce) fixes your issue only temporarily, make sure you don't have another Docker installed through snap (snap list).
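
A quick check for a second Docker installation, for example:

snap list | grep -i docker
which -a docker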

@zhangxianwei2015
Copy link

In my case, configuring the container runtime for Docker running in [Rootless mode](https://docs.docker.com/engine/security/rootless/) worked for me.
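
Roughly, the rootless setup looks like the following (a sketch based on the toolkit's rootless-mode instructions; verify against the current docs):

nvidia-ctk runtime configure --runtime=docker --config=$HOME/.config/docker/daemon.json
systemctl --user restart docker
sudo nvidia-ctk config --set nvidia-container-cli.no-cgroups --in-place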

aelovikov-intel pushed a commit to intel/llvm that referenced this issue Jun 14, 2024
Updating the docker to 12.5 led to the below described problems. Since
the testing output of 12.1 matches 12.5, and we don't actually use any
cuda features later than 12.1 (which are minor updates) in the compiler,
this PR reverts back to the 12.1 image.
We can update the docker later only when we really need to (probably
when cuda 13 is released). For the purposes of intel/llvm CI 12.1 is
sufficient.
This fixes the "latest" docker image, allowing other updates to the
docker image to be made in the future.

CUDA docker issues:

Depending on the host setup of the runners, there are various issues on
recent nvidia docker images related to interactions with the host,
whereby nvidia devices are not visible.

- NVIDIA/nvidia-container-toolkit#48
- NVIDIA/nvidia-docker#1671
- NVIDIA/nvidia-container-toolkit#154

Signed-off-by: JackAKirk <jack.kirk@codeplay.com>