
Missing libnvidia-ml.so symlinks in containers after host library updates #1099


Description

@mgeldi

Summary:

After a recent system upgrade on my Arch-based distribution (CachyOS), containers using the NVIDIA container runtime began failing in nvidia-smi due to missing symlinks for libnvidia-ml.so. Although the actual shared library (libnvidia-ml.so.570.153.02) is correctly injected into the container by the NVIDIA container runtime, the expected symlinks (libnvidia-ml.so.1 and libnvidia-ml.so, pointing at the versioned library) are no longer created automatically, causing dynamic-linking failures.
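
To make the dynamic-linking failure concrete, here is a quick diagnostic sketch (paths are from my setup: Arch keeps the driver libraries in /usr/lib, the Ubuntu-based image uses /usr/lib/x86_64-linux-gnu; adjust as needed):

# On the host, the driver package ships the versioned library plus its SONAME link:
ls -l /usr/lib/libnvidia-ml.so*
# Inside the container, only the fully versioned file is present, so the loader
# has nothing that resolves the SONAME libnvidia-ml.so.1:
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 \
  bash -c 'ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml*'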

As part of this system upgrade, the NVIDIA container runtime stack was specifically updated from:
nvidia-container-toolkit 1.17.6-1.1 → 1.17.7-1.1
libnvidia-container 1.17.6-1.1 → 1.17.7-1.1

This strongly suggests a regression or behavior change introduced between these two versions, affecting symlink resolution or runtime library injection.
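
One way to confirm the regression hypothesis is to roll just these two packages back and retest. A minimal sketch, assuming the previous package files are still in the pacman cache (the exact file names below are guesses based on the versions above and may differ on your machine):

pacman -Q nvidia-container-toolkit libnvidia-container
sudo pacman -U \
  /var/cache/pacman/pkg/nvidia-container-toolkit-1.17.6-1.1-x86_64.pkg.tar.zst \
  /var/cache/pacman/pkg/libnvidia-container-1.17.6-1.1-x86_64.pkg.tar.zst
sudo systemctl restart docker
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi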

Reproduction

Step-by-step:
Ensure a valid NVIDIA driver and container stack are installed:

    nvidia, nvidia-utils, nvidia-container-toolkit, etc.

Run:

docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi

Observe:

NVIDIA-SMI couldn't find libnvidia-ml.so library in your system.

Verify presence of the target .so:

docker run --rm --runtime=nvidia --gpus all \
  nvidia/cuda:12.2.0-base-ubuntu22.04 \
  bash -c 'ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so*'

Output:

-rwxr-xr-x 1 root root ... libnvidia-ml.so.570.153.02

Manually fix it inside the container:

cd /usr/lib/x86_64-linux-gnu
ln -s libnvidia-ml.so.570.153.02 libnvidia-ml.so.1
ln -s libnvidia-ml.so.1 libnvidia-ml.so
nvidia-smi

Now it works as expected.

Expected Behavior

The container runtime should:

Detect the injected libnvidia-ml.so.*

Automatically create the necessary symlinks (.so.1, .so) so that dynamic linking works (a sketch of the usual mechanism follows below)
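
As far as I understand, the runtime normally achieves this by running ldconfig inside the container after injecting the libraries (configured via the ldconfig entry in /etc/nvidia-container-runtime/config.toml, where a leading '@' means "use the host's ldconfig"), and that ldconfig pass is what creates the .so.1 SONAME link. A hedged sketch for checking the setting and confirming that a manual ldconfig run restores the links; this reflects my understanding of the mechanism and may be off:

# Host side: which ldconfig is the runtime configured to run?
grep -n ldconfig /etc/nvidia-container-runtime/config.toml
# Container side: if a manual ldconfig pass fixes nvidia-smi, the runtime's
# ldconfig step is most likely what is being skipped or failing.
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 \
  bash -c 'ldconfig && nvidia-smi'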

What Changed?

This was working fine until a recent upgrade:
Key package changes (from pacman.log):

[2025-05-21] upgraded nvidia-utils (570.144-5 → 570.153.02-3)
[2025-05-21] upgraded lib32-nvidia-utils, opencl-nvidia, nvidia, nvidia-container-toolkit (1.17.6 → 1.17.7)
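
For completeness, the relevant entries can be pulled straight out of the log (Arch's default log path assumed):

grep upgraded /var/log/pacman.log | grep -i nvidia | tail -n 20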

I suspect changes in:

Driver packaging (nvidia-utils)

libnvidia-container runtime hooks (see the check after this list)

Path handling on Arch-based systems
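
A quick way to see what libnvidia-container itself resolves on the host, independent of Docker, is to ask it directly (sketch; output format may vary between toolkit versions):

# The versioned libnvidia-ml library should show up here even though the
# symlinks end up missing in the container.
nvidia-container-cli list | grep -i libnvidia-ml
# Driver/toolkit summary, handy for before/after comparisons across upgrades.
nvidia-container-cli info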

System Info

Component          Version
OS                 CachyOS (Arch-based, rolling)
GPU                NVIDIA GeForce RTX 4090
Driver version     570.153.02
CUDA (host)        12.8
Container image    nvidia/cuda:12.2.0-base-ubuntu22.04
Docker runtime     nvidia
Toolkit version    nvidia-container-toolkit 1.17.7
Kernel             6.14.7-5-cachyos

Test Command

docker run --rm --runtime=nvidia --gpus all \
  nvidia/cuda:12.2.0-base-ubuntu22.04 \
  nvidia-smi

Workaround

Run the following manually inside the container:

cd /usr/lib/x86_64-linux-gnu
ln -s libnvidia-ml.so.570.153.02 libnvidia-ml.so.1
ln -s libnvidia-ml.so.1 libnvidia-ml.so

Or create a custom Docker image with this baked in.
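
A minimal sketch of that "baked in" option; the image tag is made up, and the driver version is hard-coded, so the link targets must match the host driver (which makes this brittle by design):

# Build a throwaway image with the symlinks pre-created; the links are dangling
# at build time and only resolve once the runtime injects the versioned library.
docker build -t cuda-nvml-links - <<'EOF'
FROM nvidia/cuda:12.2.0-base-ubuntu22.04
RUN ln -sf libnvidia-ml.so.570.153.02 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 && \
    ln -sf libnvidia-ml.so.1 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
EOF
docker run --rm --runtime=nvidia --gpus all cuda-nvml-links nvidia-smi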

Additional Context

This issue also broke model loading in LocalAI Docker setups that rely on GPU inference. Only after digging deeper did I discover that the nvidia-smi failure, caused by the missing symlinks, was the underlying problem. I had also installed podman and made some driver upgrades around the same time, which might have influenced path behavior or runtime configuration.
