Description
Summary:
After a recent system upgrade on my Arch-based distribution (CachyOS), containers using the NVIDIA container runtime began failing to run nvidia-smi because the symlinks for libnvidia-ml.so are missing. Although the actual shared library (libnvidia-ml.so.570.153.02) is correctly injected into the container by the NVIDIA container runtime, the expected symlinks (libnvidia-ml.so.1 and libnvidia-ml.so, pointing at the versioned library) are no longer created automatically, causing dynamic linking failures.
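A quick way to confirm that the container's dynamic linker never learned about the injected library (a sketch, assuming ldconfig is available in the image):

docker run --rm --runtime=nvidia --gpus all \
  nvidia/cuda:12.2.0-base-ubuntu22.04 \
  bash -c 'ldconfig -p | grep libnvidia-ml || echo "libnvidia-ml not in linker cache"'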
As part of this system upgrade, the NVIDIA container runtime stack was specifically updated from:
nvidia-container-toolkit 1.17.6-1.1 → 1.17.7-1.1
libnvidia-container 1.17.6-1.1 → 1.17.7-1.1
This strongly suggests a regression or behavior change introduced between these two versions, affecting symlink resolution or runtime library injection.
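To confirm the regression, one option is to temporarily downgrade just the container stack from the pacman cache and retest (the package filenames below are assumptions based on the versions above and may differ locally):

sudo pacman -U /var/cache/pacman/pkg/nvidia-container-toolkit-1.17.6-1.1-x86_64.pkg.tar.zst \
               /var/cache/pacman/pkg/libnvidia-container-1.17.6-1.1-x86_64.pkg.tar.zst
sudo systemctl restart docker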
Reproduction
Step-by-step:
Ensure a valid NVIDIA driver and container stack is installed:
nvidia, nvidia-utils, nvidia-container-toolkit, etc.
Run:
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
Observe:
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system.
Verify presence of the target .so:
docker run --rm --runtime=nvidia --gpus all \
nvidia/cuda:12.2.0-base-ubuntu22.04 \
bash -c 'ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so*'
Output:
-rwxr-xr-x 1 root root ... libnvidia-ml.so.570.153.02
Manually fix it inside the container:
cd /usr/lib/x86_64-linux-gnu
ln -s libnvidia-ml.so.570.153.02 libnvidia-ml.so.1
ln -s libnvidia-ml.so.1 libnvidia-ml.so
nvidia-smi
Now it works as expected.
Expected Behavior
The container runtime should:
Detect the injected libnvidia-ml.so.*
Automatically create the necessary symlinks (.so.1 and .so) so that dynamic linking works (see the reference layout below)
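For reference, a working container would typically end up with a chain like the following (layout illustrative; sizes and dates omitted, and the unversioned .so link mostly matters for development use):

ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so*
lrwxrwxrwx 1 root root libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx 1 root root libnvidia-ml.so.1 -> libnvidia-ml.so.570.153.02
-rwxr-xr-x 1 root root libnvidia-ml.so.570.153.02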
What Changed?
This was working fine until a recent upgrade:
Key package changes (from pacman.log):
[2025-05-21] upgraded nvidia-utils (570.144-5 → 570.153.02-3)
[2025-05-21] upgraded lib32-nvidia-utils, opencl-nvidia, nvidia, nvidia-container-toolkit (1.17.6 → 1.17.7)
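These entries can be pulled from pacman's log with something along these lines:

grep -E 'upgraded.*(nvidia|container)' /var/log/pacman.log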
I suspect changes in:
Driver packaging (nvidia-utils)
libnvidia-container runtime hooks
Path handling on Arch-based systems (a quick check is sketched below)
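As far as I understand, the .so.1 symlinks are normally produced by an ldconfig pass over the injected libraries, so one thing worth checking on the host is how the toolkit is configured to invoke ldconfig (paths below are the usual defaults and should be treated as assumptions):

grep -n ldconfig /etc/nvidia-container-runtime/config.toml
ls -l /sbin/ldconfig /usr/bin/ldconfig
nvidia-container-cli --version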
System Info
Component          Version
OS                 CachyOS (Arch-based, rolling)
GPU                NVIDIA GeForce RTX 4090
Driver version     570.153.02
CUDA (host)        12.8
Container image    nvidia/cuda:12.2.0-base-ubuntu22.04
Docker runtime     nvidia
Toolkit version    nvidia-container-toolkit 1.17.7
Kernel             6.14.7-5-cachyos
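The values above were collected roughly as follows (a sketch; adjust package names as needed):

pacman -Q nvidia-utils nvidia-container-toolkit libnvidia-container
uname -r
docker info 2>/dev/null | grep -i runtime
nvidia-smi --query-gpu=name,driver_version --format=csv,noheader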
Test Command
docker run --rm --runtime=nvidia --gpus all \
  nvidia/cuda:12.2.0-base-ubuntu22.04 \
  nvidia-smi
Workaround
Manually run inside the container:
cd /usr/lib/x86_64-linux-gnu
ln -s libnvidia-ml.so.570.153.02 libnvidia-ml.so.1
ln -s libnvidia-ml.so.1 libnvidia-ml.so
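The same fix as a single throwaway run (the version string is hard-coded, so this breaks on the next driver bump):

docker run --rm --runtime=nvidia --gpus all \
  nvidia/cuda:12.2.0-base-ubuntu22.04 \
  bash -c 'cd /usr/lib/x86_64-linux-gnu &&
           ln -sf libnvidia-ml.so.570.153.02 libnvidia-ml.so.1 &&
           ln -sf libnvidia-ml.so.1 libnvidia-ml.so &&
           nvidia-smi'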
Or create a custom Docker image with this baked in.
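A minimal sketch of such an image (the image tag is made up; the symlinks are dangling at build time and only resolve once the runtime injects the library, and the pinned version needs updating on every driver upgrade):

cat > Dockerfile <<'EOF'
FROM nvidia/cuda:12.2.0-base-ubuntu22.04
# Recreate the symlinks the runtime no longer provides (driver version is an assumption)
RUN cd /usr/lib/x86_64-linux-gnu && \
    ln -sf libnvidia-ml.so.570.153.02 libnvidia-ml.so.1 && \
    ln -sf libnvidia-ml.so.1 libnvidia-ml.so
EOF
docker build -t cuda-nvml-links .
docker run --rm --runtime=nvidia --gpus all cuda-nvml-links nvidia-smi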
Additional Context
This issue also broke model loading in LocalAI Docker setups that rely on GPU inference. Only after digging deeper did I discover that the failing nvidia-smi, caused by the missing symlinks, was the root of the problem. I had also installed podman and performed some driver upgrades around the same time, which might have influenced path behavior or the runtime configuration.