
Failed to initialize NVML: Unknown Error After "systemctl daemon-reload" #1227

@TaylorTrz

Description


1. Summary

On a GPU worker node, running systemctl daemon-reload causes every running GPU container to lose access to its GPU devices.

(container) $ nvidia-smi -L
Failed to initialize NVML: Unknown Error

while everything looks fine outside the container:

(host) # nvidia-smi -L
GPU 0: NVIDIA A10 (UUID: GPU-f15ae561-21b1-6003-c768-861fc285203b)
GPU 1: NVIDIA A10 (UUID: GPU-466be1f8-6148-1a27-a450-592604f3df54)
GPU 2: Tesla T4 (UUID: GPU-bf507abe-ec25-2053-5435-1c82511a3ae8)
GPU 3: Tesla T4 (UUID: GPU-a1329f2e-1e73-88f2-4d4b-1ce18debbc6d)

2. Environment

cmd: hostnamectl  | egrep 'Operating System|Kernel'
--------------------------------------------------
Operating System: Ubuntu 22.04 LTS
          Kernel: Linux 5.15.0-25-generic

cmd: systemctl --version | head -n1
--------------------------------------------------
systemd 249 (249.11-0ubuntu3.11)

cmd: nvidia-smi  | head -n 4
--------------------------------------------------
Mon Jul 14 16:25:58 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+

cmd: dpkg -l | egrep 'container-toolkit|libnvidia-container'
--------------------------------------------------
ii  libnvidia-container-tools             1.17.8-1                            amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64            1.17.8-1                            amd64        NVIDIA container runtime library
ii  nvidia-container-toolkit              1.17.8-1                            amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base         1.17.8-1                            amd64        NVIDIA Container Toolkit Base

cmd: runc --version
--------------------------------------------------
runc version 1.2.5-0ubuntu1~22.04.1
spec: 1.2.0
go: go1.22.2
libseccomp: 2.5.3

cmd: containerd -v
--------------------------------------------------
containerd github.com/containerd/containerd 1.7.27

cmd: docker version
--------------------------------------------------
Client:
 Version:           27.5.1
 API version:       1.47
 Go version:        go1.22.2
 Git commit:        27.5.1-0ubuntu3~22.04.2
 Built:             Mon Jun  2 12:18:38 2025
 OS/Arch:           linux/amd64
 Context:           default

Server:
 Engine:
  Version:          27.5.1
  API version:      1.47 (minimum version 1.24)
  Go version:       go1.22.2
  Git commit:       27.5.1-0ubuntu3~22.04.2
  Built:            Mon Jun  2 12:18:38 2025
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.27
  GitCommit:
 nvidia:
  Version:          1.2.5-0ubuntu1~22.04.1
 docker-init:
  Version:          0.19.0
  GitCommit:

cmd: docker info | egrep -i 'runtime|cgroup'
--------------------------------------------------
 Cgroup Driver: systemd
 Cgroup Version: 2
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: nvidia
  cgroupns

cmd: stat -fc %T /sys/fs/cgroup
--------------------------------------------------
cgroup2fs

3. How to reproduce

3.1 Start a GPU container

$ docker run -it --rm --runtime=nvidia --gpus all \
   ubuntu:latest  bash -c "while true; do nvidia-smi -L ; sleep 1; done"

GPU 0: NVIDIA A10 (UUID: GPU-f15ae561-21b1-6003-c768-861fc285203b)
GPU 1: NVIDIA A10 (UUID: GPU-466be1f8-6148-1a27-a450-592604f3df54)
GPU 2: Tesla T4 (UUID: GPU-bf507abe-ec25-2053-5435-1c82511a3ae8)
GPU 3: Tesla T4 (UUID: GPU-a1329f2e-1e73-88f2-4d4b-1ce18debbc6d)

3.2 Trigger the failure with:

# systemctl daemon-reload

3.3 The container started in 3.1 now reports:

Failed to initialize NVML: Unknown Error
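
A quick way to confirm that the container has lost access to the device nodes (a sketch; per section 4 the failing syscall returns EPERM, which surfaces as "Operation not permitted"):

(container) $ cat /dev/nvidiactl > /dev/null
cat: /dev/nvidiactl: Operation not permitted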

3.4 If the device nodes are passed explicitly with --device, only the explicitly mounted GPU keeps reporting after daemon-reload:

$ docker run -it --rm --runtime=nvidia --gpus all \
    --device=/dev/nvidiactl \
    --device=/dev/nvidia0 \
   ubuntu:latest  bash -c "while true; do nvidia-smi -L ; sleep 1; done"

# systemctl daemon-reload

$
... 
GPU 0: NVIDIA A10 (UUID: GPU-f15ae561-21b1-6003-c768-861fc285203b)

4. Related observations

4.1 Syscalls return EPERM

Trace the syscalls made by nvidia-smi -L: the left side runs inside a container started as in 3.1, the right side inside a container started as in 3.4.

It turns out that without --device=/dev/nvidiactl --device=/dev/nvidia0, the syscalls return EPERM once daemon-reload has been triggered:

[screenshot: syscall trace of nvidia-smi -L, 3.1 container (left) vs 3.4 container (right)]
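
The comparison can be reproduced with strace inside each container (a sketch; strace must be available in the image, and exact call names may vary by libc and driver version):

(container) $ strace -f -e trace=openat nvidia-smi -L 2>&1 | grep /dev/nvidia

In the 3.1 container the opens of the NVIDIA device nodes come back with EPERM after daemon-reload, while in the 3.4 container with the explicit --device flags they succeed.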

4.2 nvidia-container-runtime generates an incomplete OCI spec

Start a container as in 3.1, then check the log from the NVIDIA container runtime at /var/log/nvidia-container-runtime.log:

[screenshot: /var/log/nvidia-container-runtime.log excerpt]

The generated OCI config.json does not contain any NVIDIA devices:

{
  "linux": {
    "resources": {
      "devices": [
        {
          "allow": false,
          "access": "rwm"
        },
        {
          "allow": true,
          "type": "c",
          "major": 1,
          "minor": 5,
          "access": "rwm"
        },
        {
          "allow": true,
          "type": "c",
          "major": 1,
          "minor": 3,
          "access": "rwm"
        },
        {
          "allow": true,
          "type": "c",
          "major": 1,
          "minor": 9,
          "access": "rwm"
        },
        {
          "allow": true,
          "type": "c",
          "major": 1,
          "minor": 8,
          "access": "rwm"
        },
        {
          "allow": true,
          "type": "c",
          "major": 5,
          "minor": 0,
          "access": "rwm"
        },
        {
          "allow": true,
          "type": "c",
          "major": 5,
          "minor": 1,
          "access": "rwm"
        },
        {
          "allow": false,
          "type": "c",
          "major": 10,
          "minor": 229,
          "access": "rwm"
        },
        {
          "allow": false,
          "access": "rwm"
        },
        {
          "allow": true,
          "type": "c",
          "major": 1,
          "minor": 5,
          "access": "rwm"
        },
        {
          "allow": true,
          "type": "c",
          "major": 1,
          "minor": 3,
          "access": "rwm"
        },
        {
          "allow": true,
          "type": "c",
          "major": 1,
          "minor": 9,
          "access": "rwm"
        },
        {
          "allow": true,
          "type": "c",
          "major": 1,
          "minor": 8,
          "access": "rwm"
        },
        {
          "allow": true,
          "type": "c",
          "major": 5,
          "minor": 0,
          "access": "rwm"
        },
        {
          "allow": true,
          "type": "c",
          "major": 5,
          "minor": 1,
          "access": "rwm"
        },
        {
          "allow": false,
          "type": "c",
          "major": 10,
          "minor": 229,
          "access": "rwm"
        }
      ]
    },
    "cgroupsPath": "system.slice:docker:cad78cd224a912e88bb1af5495f50077c8a844f035e9cf331f75e8f8d18c5c21"
  }
}
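
(For reference, the spec above can be dumped from the live OCI bundle; the bundle path below assumes Docker's default containerd v2 shim layout and that jq is installed, so treat it as a sketch.)

# CID=cad78cd224a912e88bb1af5495f50077c8a844f035e9cf331f75e8f8d18c5c21
# jq '.linux.resources.devices' /run/containerd/io.containerd.runtime.v2.task/moby/$CID/config.json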

Related nvidia char devices:

root:~# ls -l /dev/char/* | grep nvidia
lrwxrwxrwx 1 root root 10 Aug  7 15:17 /dev/char/195:0 -> ../nvidia0
lrwxrwxrwx 1 root root 10 Aug  7 15:17 /dev/char/195:1 -> ../nvidia1
lrwxrwxrwx 1 root root 10 Aug  7 15:17 /dev/char/195:2 -> ../nvidia2
lrwxrwxrwx 1 root root 12 Aug  7 15:17 /dev/char/195:255 -> ../nvidiactl
lrwxrwxrwx 1 root root 10 Aug  7 15:17 /dev/char/195:3 -> ../nvidia3
lrwxrwxrwx 1 root root 26 Aug  7 15:17 /dev/char/234:1 -> ../nvidia-caps/nvidia-cap1
lrwxrwxrwx 1 root root 26 Aug  7 15:17 /dev/char/234:2 -> ../nvidia-caps/nvidia-cap2
lrwxrwxrwx 1 root root 13 Aug  7 15:17 /dev/char/504:0 -> ../nvidia-uvm
lrwxrwxrwx 1 root root 19 Aug  7 15:17 /dev/char/504:1 -> ../nvidia-uvm-tools

The cgroup is delegated to systemd, and 50-DeviceAllow.conf does not list any NVIDIA char devices.
When daemon-reload is triggered, systemd rebuilds the device cgroup from these DeviceAllow entries, which I think is the core reason the GPUs disappear inside the container.

# systemctl  status docker-cad78cd224a912e88bb1af5495f50077c8a844f035e9cf331f75e8f8d18c5c21.scope
  Transient: yes
    Drop-In: /run/systemd/transient/docker-cad78cd224a912e88bb1af5495f50077c8a844f035e9cf331f75e8f8d18c5c21.scope.d
             └─50-DevicePolicy.conf, 50-DeviceAllow.conf
     Active: active (running) since Thu 2025-08-07 15:11:03 CST; 8min ago
         IO: 252.0K read, 0B written
      Tasks: 2 (limit: 618450)
     Memory: 1.7M
        CPU: 2.831s
     CGroup: /system.slice/docker-cad78cd224a912e88bb1af5495f50077c8a844f035e9cf331f75e8f8d18c5c21.scope
             ├─1128567 bash -c " while [ true ]; do nvidia-smi -L; date && sleep 5; done"
             └─1129186 sleep 5

# cat /run/systemd/transient/docker-cad78cd224a912e88bb1af5495f50077c8a844f035e9cf331f75e8f8d18c5c21.scope.d/50-DeviceAllow.conf
# This is a drop-in unit file extension, created via "systemctl set-property"
# or an equivalent operation. Do not edit.
[Scope]
DeviceAllow=
DeviceAllow=char-pts rwm
DeviceAllow=/dev/char/10:200 rwm
DeviceAllow=/dev/char/5:2 rwm
DeviceAllow=/dev/char/5:1 rwm
DeviceAllow=/dev/char/5:0 rwm
DeviceAllow=/dev/char/1:9 rwm
DeviceAllow=/dev/char/1:8 rwm
DeviceAllow=/dev/char/1:7 rwm
DeviceAllow=/dev/char/1:5 rwm
DeviceAllow=/dev/char/1:3 rwm
DeviceAllow=char-* m
DeviceAllow=block-* m
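
If this theory holds, appending the NVIDIA char devices to the scope's DeviceAllow list should let the container survive a reload. A minimal verification sketch (not a fix; the device numbers are taken from the /dev/char listing above, and --runtime keeps the change transient):

# systemctl set-property --runtime docker-cad78cd224a912e88bb1af5495f50077c8a844f035e9cf331f75e8f8d18c5c21.scope \
    "DeviceAllow=/dev/char/195:255 rwm" "DeviceAllow=/dev/char/195:0 rwm" \
    "DeviceAllow=/dev/char/195:1 rwm"   "DeviceAllow=/dev/char/195:2 rwm" \
    "DeviceAllow=/dev/char/195:3 rwm"   "DeviceAllow=/dev/char/504:0 rwm"
# systemctl daemon-reload

After this, nvidia-smi -L inside the container should keep working across the reload.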

5. References

nvidia-container-toolkit issue #48: missing /dev/char symlinks for NVIDIA devices

nvidia-container-toolkit issue #192: wrong cgroup_device eBPF program attached

runc issue opencontainers/runc#3708: runc breaks support for NVIDIA GPUs

Any idea why nvidia-container-runtime generates an incomplete OCI spec (config.json)? I can attach further details if anyone is interested.
