
Missing libnvidia-ml.so symlinks in containers after host library updates #1099


Description

@mgeldi

Summary:

After a recent system upgrade on my Arch-based distribution (CachyOS), containers using the NVIDIA container runtime began failing in nvidia-smi due to missing symlinks for libnvidia-ml.so. Although the actual shared library (libnvidia-ml.so.570.153.02) is correctly injected into the container by the NVIDIA container runtime, the expected symlinks (libnvidia-ml.so.1 and libnvidia-ml.so, pointing at the versioned library) are no longer created automatically, causing dynamic-linking failures.
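
To make the dynamic-linking failure concrete, here is a quick diagnostic sketch (paths are from my setup: Arch keeps the driver libraries in /usr/lib, the Ubuntu-based image uses /usr/lib/x86_64-linux-gnu; adjust as needed):

# On the host, the driver package ships the versioned library plus its SONAME link:
ls -l /usr/lib/libnvidia-ml.so*
# Inside the container, only the fully versioned file is present, so the loader
# has nothing that resolves the SONAME libnvidia-ml.so.1:
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 \
  bash -c 'ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml*'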

As part of this system upgrade, the NVIDIA container runtime stack was specifically updated from:
nvidia-container-toolkit 1.17.6-1.1 → 1.17.7-1.1
libnvidia-container 1.17.6-1.1 → 1.17.7-1.1

This strongly suggests a regression or behavior change introduced between these two versions, affecting symlink resolution or runtime library injection.
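
One way to confirm the regression hypothesis is to roll just these two packages back and retest. A minimal sketch, assuming the previous package files are still in the pacman cache (the exact file names below are guesses based on the versions above and may differ on your machine):

pacman -Q nvidia-container-toolkit libnvidia-container
sudo pacman -U \
  /var/cache/pacman/pkg/nvidia-container-toolkit-1.17.6-1.1-x86_64.pkg.tar.zst \
  /var/cache/pacman/pkg/libnvidia-container-1.17.6-1.1-x86_64.pkg.tar.zst
sudo systemctl restart docker
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi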

Reproduction

Step-by-step:
Ensure a valid NVIDIA driver and container stack are installed:

    nvidia, nvidia-utils, nvidia-container-toolkit, etc.

Run:

docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

Observe:

NVIDIA-SMI couldn't find libnvidia-ml.so library in your system.

Verify presence of the target .so:

docker run --rm --runtime=nvidia --gpus all \
  nvidia/cuda:12.2.0-base-ubuntu22.04 \
  bash -c 'ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so*'

Output:

-rwxr-xr-x 1 root root ... libnvidia-ml.so.570.153.02

Manually fix it inside the container:

cd /usr/lib/x86_64-linux-gnu
ln -s libnvidia-ml.so.570.153.02 libnvidia-ml.so.1
ln -s libnvidia-ml.so.1 libnvidia-ml.so
nvidia-smi

Now it works as expected.

Expected Behavior

The container runtime should:

Detect the injected libnvidia-ml.so.*

Automatically create the necessary symlinks (.so.1, .so) so that dynamic linking works (a sketch of the usual mechanism follows below)
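
As far as I understand, the runtime normally achieves this by running ldconfig inside the container after injecting the libraries (configured via the ldconfig entry in /etc/nvidia-container-runtime/config.toml, where a leading '@' means "use the host's ldconfig"), and that ldconfig pass is what creates the .so.1 SONAME link. A hedged sketch for checking the setting and confirming that a manual ldconfig run restores the links; this reflects my understanding of the mechanism and may be off:

# Host side: which ldconfig is the runtime configured to run?
grep -n ldconfig /etc/nvidia-container-runtime/config.toml
# Container side: if a manual ldconfig pass fixes nvidia-smi, the runtime's
# ldconfig step is most likely what is being skipped or failing.
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 \
  bash -c 'ldconfig && nvidia-smi'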

What Changed?

This was working fine until a recent upgrade:
Key package changes (from pacman.log):

[2025-05-21] upgraded nvidia-utils (570.144-5 → 570.153.02-3)
[2025-05-21] upgraded lib32-nvidia-utils, opencl-nvidia, nvidia, nvidia-container-toolkit (1.17.6 → 1.17.7)
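
For completeness, the relevant entries can be pulled straight out of the log (Arch's default log path assumed):

grep upgraded /var/log/pacman.log | grep -i nvidia | tail -n 20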

I suspect changes in:

Driver packaging (nvidia-utils)

libnvidia-container runtime hooks (see the check after this list)

Path handling on Arch-based systems
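
A quick way to see what libnvidia-container itself resolves on the host, independent of Docker, is to ask it directly (sketch; output format may vary between toolkit versions):

# The versioned libnvidia-ml library should show up here even though the
# symlinks end up missing in the container.
nvidia-container-cli list | grep -i libnvidia-ml
# Driver/toolkit summary, handy for before/after comparisons across upgrades.
nvidia-container-cli info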

System Info

Component          Version
OS                 CachyOS (Arch-based, rolling)
GPU                NVIDIA GeForce RTX 4090
Driver version     570.153.02
CUDA (host)        12.8
Container image    nvidia/cuda:12.2.0-base-ubuntu22.04
Docker runtime     nvidia
Toolkit version    nvidia-container-toolkit 1.17.7
Kernel             6.14.7-5-cachyos

Test Command

docker run --rm --runtime=nvidia --gpus all \
  nvidia/cuda:12.2.0-base-ubuntu22.04 \
  nvidia-smi

Workaround

Run the following manually inside the container:

cd /usr/lib/x86_64-linux-gnu
ln -s libnvidia-ml.so.570.153.02 libnvidia-ml.so.1
ln -s libnvidia-ml.so.1 libnvidia-ml.so

Or create a custom Docker image with this baked in.
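
A minimal sketch of that "baked in" option; the image tag is made up, and the driver version is hard-coded, so the link targets must match the host driver (which makes this brittle by design):

# Build a throwaway image with the symlinks pre-created; the links are dangling
# at build time and only resolve once the runtime injects the versioned library.
docker build -t cuda-nvml-links - <<'EOF'
FROM nvidia/cuda:12.2.0-base-ubuntu22.04
RUN ln -sf libnvidia-ml.so.570.153.02 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 && \
    ln -sf libnvidia-ml.so.1 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
EOF
docker run --rm --runtime=nvidia --gpus all cuda-nvml-links nvidia-smi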

Additional Context

This issue also broke model loading in LocalAI Docker setups that rely on GPU inference. Only after digging deeper did I discover that the nvidia-smi failure, caused by the missing symlinks, was the underlying problem. I had also installed podman and made some driver upgrades around the same time, which might have influenced path behavior or runtime configuration.
