
Failed to initialize NVML: Unknown Error After "systemctl daemon-reload" #1227

@TaylorTrz

Description


1. Summary

On a GPU worker node, running systemctl daemon-reload causes every running GPU container to lose access to its GPU devices.

(container) $ nvidia-smi -L
Failed to initialize NVML: Unknown Error

while everything looks fine outside the container:

(host) # nvidia-smi -L
GPU 0: NVIDIA A10 (UUID: GPU-f15ae561-21b1-6003-c768-861fc285203b)
GPU 1: NVIDIA A10 (UUID: GPU-466be1f8-6148-1a27-a450-592604f3df54)
GPU 2: Tesla T4 (UUID: GPU-bf507abe-ec25-2053-5435-1c82511a3ae8)
GPU 3: Tesla T4 (UUID: GPU-a1329f2e-1e73-88f2-4d4b-1ce18debbc6d)

2. Environment

cmd: hostnamectl  | egrep 'Operating System|Kernel'
--------------------------------------------------
Operating System: Ubuntu 22.04 LTS
          Kernel: Linux 5.15.0-25-generic

cmd: systemctl --version | head -n1
--------------------------------------------------
systemd 249 (249.11-0ubuntu3.11)

cmd: nvidia-smi  | head -n 4
--------------------------------------------------
Mon Jul 14 16:25:58 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+

cmd: dpkg -l | egrep 'container-toolkit|libnvidia-container'
--------------------------------------------------
ii  libnvidia-container-tools             1.17.8-1                            amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64            1.17.8-1                            amd64        NVIDIA container runtime library
ii  nvidia-container-toolkit              1.17.8-1                            amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base         1.17.8-1                            amd64        NVIDIA Container Toolkit Base

cmd: runc --version
--------------------------------------------------
runc version 1.2.5-0ubuntu1~22.04.1
spec: 1.2.0
go: go1.22.2
libseccomp: 2.5.3

cmd: containerd -v
--------------------------------------------------
containerd github.com/containerd/containerd 1.7.27

cmd: docker version
--------------------------------------------------
Client:
 Version:           27.5.1
 API version:       1.47
 Go version:        go1.22.2
 Git commit:        27.5.1-0ubuntu3~22.04.2
 Built:             Mon Jun  2 12:18:38 2025
 OS/Arch:           linux/amd64
 Context:           default

Server:
 Engine:
  Version:          27.5.1
  API version:      1.47 (minimum version 1.24)
  Go version:       go1.22.2
  Git commit:       27.5.1-0ubuntu3~22.04.2
  Built:            Mon Jun  2 12:18:38 2025
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.7.27
  GitCommit:
 nvidia:
  Version:          1.2.5-0ubuntu1~22.04.1
 docker-init:
  Version:          0.19.0
  GitCommit:

cmd: docker info | egrep -i 'runtime|cgroup'
--------------------------------------------------
 Cgroup Driver: systemd
 Cgroup Version: 2
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: nvidia
  cgroupns

cmd: stat -fc %T /sys/fs/cgroup
--------------------------------------------------
cgroup2fs

3. How to reproduce

3.1 Start a GPU container

$ docker run -it --rm --runtime=nvidia --gpus all \
   ubuntu:latest  bash -c "while true; do nvidia-smi -L ; sleep 1; done"

GPU 0: NVIDIA A10 (UUID: GPU-f15ae561-21b1-6003-c768-861fc285203b)
GPU 1: NVIDIA A10 (UUID: GPU-466be1f8-6148-1a27-a450-592604f3df54)
GPU 2: Tesla T4 (UUID: GPU-bf507abe-ec25-2053-5435-1c82511a3ae8)
GPU 3: Tesla T4 (UUID: GPU-a1329f2e-1e73-88f2-4d4b-1ce18debbc6d)

3.2 Trigger the failure with:

# systemctl daemon-reload

3.3 The container started in 3.1 now reports:

Failed to initialize NVML: Unknown Error
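
A quick way to confirm that the container has lost access to the device nodes (a sketch; per section 4 the failing syscall returns EPERM, which surfaces as "Operation not permitted"):

(container) $ cat /dev/nvidiactl > /dev/null
cat: /dev/nvidiactl: Operation not permitted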

3.4 If the device nodes are passed explicitly with --device, only the explicitly mounted GPU keeps reporting after daemon-reload:

$ docker run -it --rm --runtime=nvidia --gpus all \
    --device=/dev/nvidiactl \
    --device=/dev/nvidia0 \
   ubuntu:latest  bash -c "while true; do nvidia-smi -L ; sleep 1; done"

# systemctl daemon-reload

$
... 
GPU 0: NVIDIA A10 (UUID: GPU-f15ae561-21b1-6003-c768-861fc285203b)

4. Related observations

4.1 Syscalls return EPERM

Trace the syscalls made by nvidia-smi -L: the left side runs inside a container started as in 3.1, the right side inside a container started as in 3.4.

It turns out that without --device=/dev/nvidiactl --device=/dev/nvidia0, the syscalls return EPERM once daemon-reload has been triggered:

[screenshot: syscall trace of nvidia-smi -L, 3.1 container (left) vs 3.4 container (right)]
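
The comparison can be reproduced with strace inside each container (a sketch; strace must be available in the image, and exact call names may vary by libc and driver version):

(container) $ strace -f -e trace=openat nvidia-smi -L 2>&1 | grep /dev/nvidia

In the 3.1 container the opens of the NVIDIA device nodes come back with EPERM after daemon-reload, while in the 3.4 container with the explicit --device flags they succeed.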

4.2 nvidia-container-runtime generates an incomplete OCI spec

Start a container as in 3.1, then check the log from the NVIDIA container runtime at /var/log/nvidia-container-runtime.log:

[screenshot: /var/log/nvidia-container-runtime.log excerpt]

The generated OCI config.json does not contain any NVIDIA devices:

{
  "linux": {
    "resources": {
      "devices": [
        {
          "allow": false,
          "access": "rwm"
        },
        {
          "allow": true,
          "type": "c",
          "major": 1,
          "minor": 5,
          "access": "rwm"
        },
        {
          "allow": true,
          "type": "c",
          "major": 1,
          "minor": 3,
          "access": "rwm"
        },
        {
          "allow": true,
          "type": "c",
          "major": 1,
          "minor": 9,
          "access": "rwm"
        },
        {
          "allow": true,
          "type": "c",
          "major": 1,
          "minor": 8,
          "access": "rwm"
        },
        {
          "allow": true,
          "type": "c",
          "major": 5,
          "minor": 0,
          "access": "rwm"
        },
        {
          "allow": true,
          "type": "c",
          "major": 5,
          "minor": 1,
          "access": "rwm"
        },
        {
          "allow": false,
          "type": "c",
          "major": 10,
          "minor": 229,
          "access": "rwm"
        },
        {
          "allow": false,
          "access": "rwm"
        },
        {
          "allow": true,
          "type": "c",
          "major": 1,
          "minor": 5,
          "access": "rwm"
        },
        {
          "allow": true,
          "type": "c",
          "major": 1,
          "minor": 3,
          "access": "rwm"
        },
        {
          "allow": true,
          "type": "c",
          "major": 1,
          "minor": 9,
          "access": "rwm"
        },
        {
          "allow": true,
          "type": "c",
          "major": 1,
          "minor": 8,
          "access": "rwm"
        },
        {
          "allow": true,
          "type": "c",
          "major": 5,
          "minor": 0,
          "access": "rwm"
        },
        {
          "allow": true,
          "type": "c",
          "major": 5,
          "minor": 1,
          "access": "rwm"
        },
        {
          "allow": false,
          "type": "c",
          "major": 10,
          "minor": 229,
          "access": "rwm"
        }
      ]
    },
    "cgroupsPath": "system.slice:docker:cad78cd224a912e88bb1af5495f50077c8a844f035e9cf331f75e8f8d18c5c21"
  }
}
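
(For reference, the spec above can be dumped from the live OCI bundle; the bundle path below assumes Docker's default containerd v2 shim layout and that jq is installed, so treat it as a sketch.)

# CID=cad78cd224a912e88bb1af5495f50077c8a844f035e9cf331f75e8f8d18c5c21
# jq '.linux.resources.devices' /run/containerd/io.containerd.runtime.v2.task/moby/$CID/config.json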

Related nvidia char devices:

root:~# ls -l /dev/char/* | grep nvidia
lrwxrwxrwx 1 root root 10 Aug  7 15:17 /dev/char/195:0 -> ../nvidia0
lrwxrwxrwx 1 root root 10 Aug  7 15:17 /dev/char/195:1 -> ../nvidia1
lrwxrwxrwx 1 root root 10 Aug  7 15:17 /dev/char/195:2 -> ../nvidia2
lrwxrwxrwx 1 root root 12 Aug  7 15:17 /dev/char/195:255 -> ../nvidiactl
lrwxrwxrwx 1 root root 10 Aug  7 15:17 /dev/char/195:3 -> ../nvidia3
lrwxrwxrwx 1 root root 26 Aug  7 15:17 /dev/char/234:1 -> ../nvidia-caps/nvidia-cap1
lrwxrwxrwx 1 root root 26 Aug  7 15:17 /dev/char/234:2 -> ../nvidia-caps/nvidia-cap2
lrwxrwxrwx 1 root root 13 Aug  7 15:17 /dev/char/504:0 -> ../nvidia-uvm
lrwxrwxrwx 1 root root 19 Aug  7 15:17 /dev/char/504:1 -> ../nvidia-uvm-tools

The cgroup is delegated to systemd, and 50-DeviceAllow.conf does not list any NVIDIA char devices.
When daemon-reload is triggered, systemd rebuilds the device cgroup from these DeviceAllow entries, which I think is the core reason the GPUs disappear inside the container.

# systemctl  status docker-cad78cd224a912e88bb1af5495f50077c8a844f035e9cf331f75e8f8d18c5c21.scope
  Transient: yes
    Drop-In: /run/systemd/transient/docker-cad78cd224a912e88bb1af5495f50077c8a844f035e9cf331f75e8f8d18c5c21.scope.d
             └─50-DevicePolicy.conf, 50-DeviceAllow.conf
     Active: active (running) since Thu 2025-08-07 15:11:03 CST; 8min ago
         IO: 252.0K read, 0B written
      Tasks: 2 (limit: 618450)
     Memory: 1.7M
        CPU: 2.831s
     CGroup: /system.slice/docker-cad78cd224a912e88bb1af5495f50077c8a844f035e9cf331f75e8f8d18c5c21.scope
             ├─1128567 bash -c " while [ true ]; do nvidia-smi -L; date && sleep 5; done"
             └─1129186 sleep 5

# cat /run/systemd/transient/docker-cad78cd224a912e88bb1af5495f50077c8a844f035e9cf331f75e8f8d18c5c21.scope.d/50-DeviceAllow.conf
# This is a drop-in unit file extension, created via "systemctl set-property"
# or an equivalent operation. Do not edit.
[Scope]
DeviceAllow=
DeviceAllow=char-pts rwm
DeviceAllow=/dev/char/10:200 rwm
DeviceAllow=/dev/char/5:2 rwm
DeviceAllow=/dev/char/5:1 rwm
DeviceAllow=/dev/char/5:0 rwm
DeviceAllow=/dev/char/1:9 rwm
DeviceAllow=/dev/char/1:8 rwm
DeviceAllow=/dev/char/1:7 rwm
DeviceAllow=/dev/char/1:5 rwm
DeviceAllow=/dev/char/1:3 rwm
DeviceAllow=char-* m
DeviceAllow=block-* m
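
If this theory holds, appending the NVIDIA char devices to the scope's DeviceAllow list should let the container survive a reload. A minimal verification sketch (not a fix; the device numbers are taken from the /dev/char listing above, and --runtime keeps the change transient):

# systemctl set-property --runtime docker-cad78cd224a912e88bb1af5495f50077c8a844f035e9cf331f75e8f8d18c5c21.scope \
    "DeviceAllow=/dev/char/195:255 rwm" "DeviceAllow=/dev/char/195:0 rwm" \
    "DeviceAllow=/dev/char/195:1 rwm"   "DeviceAllow=/dev/char/195:2 rwm" \
    "DeviceAllow=/dev/char/195:3 rwm"   "DeviceAllow=/dev/char/504:0 rwm"
# systemctl daemon-reload

After this, nvidia-smi -L inside the container should keep working across the reload.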

5. References

nvidia-container-toolkit issue #48: missing /dev/char symlinks for NVIDIA devices

nvidia-container-toolkit issue #192: wrong cgroup_device eBPF program attached

runc issue opencontainers/runc#3708: runc breaks support for NVIDIA GPUs

Any idea why nvidia-container-runtime generates an incomplete OCI spec (config.json)? I can attach further details if anyone is interested.
