1. Summary
On a GPU worker node, running systemctl daemon-reload causes all running GPU containers to lose access to their GPU devices.
(container) $ nvidia-smi -L
Failed to initialize NVML: Unknown Error
Meanwhile, everything looks fine outside the container:
(host) # nvidia-smi -L
GPU 0: NVIDIA A10 (UUID: GPU-f15ae561-21b1-6003-c768-861fc285203b)
GPU 1: NVIDIA A10 (UUID: GPU-466be1f8-6148-1a27-a450-592604f3df54)
GPU 2: Tesla T4 (UUID: GPU-bf507abe-ec25-2053-5435-1c82511a3ae8)
GPU 3: Tesla T4 (UUID: GPU-a1329f2e-1e73-88f2-4d4b-1ce18debbc6d)
2. Environment
cmd: hostnamectl | egrep 'Operating System|Kernel'
--------------------------------------------------
Operating System: Ubuntu 22.04 LTS
Kernel: Linux 5.15.0-25-generic
cmd: systemctl --version | head -n1
--------------------------------------------------
systemd 249 (249.11-0ubuntu3.11)
cmd: nvidia-smi | head -n 4
--------------------------------------------------
Mon Jul 14 16:25:58 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20 Driver Version: 570.133.20 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
cmd: dpkg -l | egrep 'container-toolkit|libnvidia-container'
--------------------------------------------------
ii libnvidia-container-tools 1.17.8-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.17.8-1 amd64 NVIDIA container runtime library
ii nvidia-container-toolkit 1.17.8-1 amd64 NVIDIA Container toolkit
ii nvidia-container-toolkit-base 1.17.8-1 amd64 NVIDIA Container Toolkit Base
cmd: runc --version
--------------------------------------------------
runc version 1.2.5-0ubuntu1~22.04.1
spec: 1.2.0
go: go1.22.2
libseccomp: 2.5.3
cmd: containerd -v
--------------------------------------------------
containerd github.com/containerd/containerd 1.7.27
cmd: docker version
--------------------------------------------------
Client:
Version: 27.5.1
API version: 1.47
Go version: go1.22.2
Git commit: 27.5.1-0ubuntu3~22.04.2
Built: Mon Jun 2 12:18:38 2025
OS/Arch: linux/amd64
Context: default
Server:
Engine:
Version: 27.5.1
API version: 1.47 (minimum version 1.24)
Go version: go1.22.2
Git commit: 27.5.1-0ubuntu3~22.04.2
Built: Mon Jun 2 12:18:38 2025
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.7.27
GitCommit:
nvidia:
Version: 1.2.5-0ubuntu1~22.04.1
docker-init:
Version: 0.19.0
GitCommit:
cmd: docker info | egrep -i 'runtime|cgroup'
--------------------------------------------------
Cgroup Driver: systemd
Cgroup Version: 2
Runtimes: io.containerd.runc.v2 nvidia runc
Default Runtime: nvidia
cgroupns
cmd: stat -fc %T /sys/fs/cgroup
--------------------------------------------------
cgroup2fs
3. How to reproduce
3.1 Start a GPU container
$ docker run -it --rm --runtime=nvidia --gpus all \
ubuntu:latest bash -c "while true; do nvidia-smi -L ; sleep 1; done"
GPU 0: NVIDIA A10 (UUID: GPU-f15ae561-21b1-6003-c768-861fc285203b)
GPU 1: NVIDIA A10 (UUID: GPU-466be1f8-6148-1a27-a450-592604f3df54)
GPU 2: Tesla T4 (UUID: GPU-bf507abe-ec25-2053-5435-1c82511a3ae8)
GPU 3: Tesla T4 (UUID: GPU-a1329f2e-1e73-88f2-4d4b-1ce18debbc6d)
3.2 Trigger the failure from the host with:
# systemctl daemon-reload
3.3 The previously started container now returns an error:
Failed to initialize NVML: Unknown Error
3.4 If one device is explicitly passed with --device and daemon-reload is triggered, only that one device remains visible (a sketch that passes all device nodes explicitly follows this output):
$ docker run -it --rm --runtime=nvidia --gpus all \
--device=/dev/nvidiactl \
--device=/dev/nvidia0 \
ubuntu:latest bash -c "while true; do nvidia-smi -L ; sleep 1; done"
# systemctl daemon-reload
$
...
GPU 0: NVIDIA A10 (UUID: GPU-f15ae561-21b1-6003-c768-861fc285203b)
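Based on this observation, a minimal workaround sketch, assuming the goal is only to keep all four GPUs usable across daemon-reload: pass every NVIDIA device node explicitly so each one ends up in the OCI spec and, in turn, in the scope's DeviceAllow= drop-in. The device list below matches this host's layout and is otherwise an assumption (other hosts may also need /dev/nvidia-modeset):
$ docker run -it --rm --runtime=nvidia --gpus all \
  --device=/dev/nvidiactl \
  --device=/dev/nvidia-uvm \
  --device=/dev/nvidia-uvm-tools \
  --device=/dev/nvidia0 --device=/dev/nvidia1 \
  --device=/dev/nvidia2 --device=/dev/nvidia3 \
  ubuntu:latest bash -c "while true; do nvidia-smi -L ; sleep 1; done"
Docker's --device-cgroup-rule='c 195:* rwm' flag is another commonly suggested variant, but I have not verified how it interacts with the systemd cgroup driver in this setup.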
4. Related observations
Syscalls return EPERM
Tracing the syscalls made by nvidia-smi -L shows the difference: the left side is a container started as in 3.1, the right side is one started as in 3.4. Without --device=/dev/nvidiactl --device=/dev/nvidia0, the syscalls on the NVIDIA device nodes return EPERM once daemon-reload has been triggered:
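For reference, a sketch of how to reproduce the trace inside the container (strace is not included in ubuntu:latest, so it has to be installed first; older docker releases may additionally need --cap-add=SYS_PTRACE):
(container) $ apt-get update && apt-get install -y strace
(container) $ strace -f -e trace=open,openat nvidia-smi -L 2>&1 | grep /dev/nvidia
After daemon-reload, the opens of /dev/nvidiactl and /dev/nvidia0-3 are the calls that come back with -1 EPERM in the container from 3.1.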
nvidia-container-runtime starts the container with an incomplete OCI spec
Start a container with the command from 3.1, then check the NVIDIA container runtime log at /var/log/nvidia-container-runtime.log. The generated OCI config.json does not contain any NVIDIA devices in linux.resources.devices (a jq check over the full file follows this excerpt):
{
"linux": {
"resources": {
"devices": [
{
"allow": false,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 1,
"minor": 5,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 1,
"minor": 3,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 1,
"minor": 9,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 1,
"minor": 8,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 5,
"minor": 0,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 5,
"minor": 1,
"access": "rwm"
},
{
"allow": false,
"type": "c",
"major": 10,
"minor": 229,
"access": "rwm"
},
{
"allow": false,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 1,
"minor": 5,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 1,
"minor": 3,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 1,
"minor": 9,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 1,
"minor": 8,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 5,
"minor": 0,
"access": "rwm"
},
{
"allow": true,
"type": "c",
"major": 5,
"minor": 1,
"access": "rwm"
},
{
"allow": false,
"type": "c",
"major": 10,
"minor": 229,
"access": "rwm"
}
]
},
"cgroupsPath": "system.slice:docker:cad78cd224a912e88bb1af5495f50077c8a844f035e9cf331f75e8f8d18c5c21"
}
}
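A quick way to double-check this against the live container's bundle; the bundle path below is where docker-on-containerd keeps it on this host, so treat it as an assumption for other setups:
# CID=cad78cd224a912e88bb1af5495f50077c8a844f035e9cf331f75e8f8d18c5c21
# jq '[.linux.resources.devices[] | select(.major == 195)] | length' \
    /run/containerd/io.containerd.runtime.v2.task/moby/$CID/config.json
0
Zero rules for char major 195 (and none for 234 or 504 either), which matches the excerpt above: the device nodes are injected into the container outside of the spec, without matching device-cgroup rules.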
Related nvidia char devices:
root:~# ls -l /dev/char/* | grep nvidia
lrwxrwxrwx 1 root root 10 Aug 7 15:17 /dev/char/195:0 -> ../nvidia0
lrwxrwxrwx 1 root root 10 Aug 7 15:17 /dev/char/195:1 -> ../nvidia1
lrwxrwxrwx 1 root root 10 Aug 7 15:17 /dev/char/195:2 -> ../nvidia2
lrwxrwxrwx 1 root root 12 Aug 7 15:17 /dev/char/195:255 -> ../nvidiactl
lrwxrwxrwx 1 root root 10 Aug 7 15:17 /dev/char/195:3 -> ../nvidia3
lrwxrwxrwx 1 root root 26 Aug 7 15:17 /dev/char/234:1 -> ../nvidia-caps/nvidia-cap1
lrwxrwxrwx 1 root root 26 Aug 7 15:17 /dev/char/234:2 -> ../nvidia-caps/nvidia-cap2
lrwxrwxrwx 1 root root 13 Aug 7 15:17 /dev/char/504:0 -> ../nvidia-uvm
lrwxrwxrwx 1 root root 19 Aug 7 15:17 /dev/char/504:1 -> ../nvidia-uvm-tools
The container's cgroup is delegated to systemd, and 50-DeviceAllow.conf does not list any of the NVIDIA char devices.
After daemon-reload is triggered, systemd rebuilds the device cgroup (the cgroup v2 device eBPF filter) from DeviceAllow=, and I think that is the core reason the GPUs disappear inside the container (a verification/mitigation sketch follows the drop-in contents below).
# systemctl status docker-cad78cd224a912e88bb1af5495f50077c8a844f035e9cf331f75e8f8d18c5c21.scope
Transient: yes
Drop-In: /run/systemd/transient/docker-cad78cd224a912e88bb1af5495f50077c8a844f035e9cf331f75e8f8d18c5c21.scope.d
└─50-DevicePolicy.conf, 50-DeviceAllow.conf
Active: active (running) since Thu 2025-08-07 15:11:03 CST; 8min ago
IO: 252.0K read, 0B written
Tasks: 2 (limit: 618450)
Memory: 1.7M
CPU: 2.831s
CGroup: /system.slice/docker-cad78cd224a912e88bb1af5495f50077c8a844f035e9cf331f75e8f8d18c5c21.scope
├─1128567 bash -c " while [ true ]; do nvidia-smi -L; date && sleep 5; done"
└─1129186 sleep 5
# cat /run/systemd/transient/docker-cad78cd224a912e88bb1af5495f50077c8a844f035e9cf331f75e8f8d18c5c21.scope.d/50-DeviceAllow.conf
# This is a drop-in unit file extension, created via "systemctl set-property"
# or an equivalent operation. Do not edit.
[Scope]
DeviceAllow=
DeviceAllow=char-pts rwm
DeviceAllow=/dev/char/10:200 rwm
DeviceAllow=/dev/char/5:2 rwm
DeviceAllow=/dev/char/5:1 rwm
DeviceAllow=/dev/char/5:0 rwm
DeviceAllow=/dev/char/1:9 rwm
DeviceAllow=/dev/char/1:8 rwm
DeviceAllow=/dev/char/1:7 rwm
DeviceAllow=/dev/char/1:5 rwm
DeviceAllow=/dev/char/1:3 rwm
DeviceAllow=char-* m
DeviceAllow=block-* m
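Two things follow from this, sketched below under the assumption that touching the transient scope by hand is acceptable as a stop-gap (I have not seen this documented as an official mitigation): the missing entries can be confirmed with systemctl show, and they can be appended at runtime with systemctl set-property so that systemd keeps allowing the NVIDIA nodes on the next reload.
# systemctl show docker-cad78cd224a912e88bb1af5495f50077c8a844f035e9cf331f75e8f8d18c5c21.scope \
    -p DevicePolicy -p DeviceAllow
# systemctl set-property --runtime \
    docker-cad78cd224a912e88bb1af5495f50077c8a844f035e9cf331f75e8f8d18c5c21.scope \
    DeviceAllow="/dev/nvidiactl rw" DeviceAllow="/dev/nvidia0 rw" \
    DeviceAllow="/dev/nvidia1 rw" DeviceAllow="/dev/nvidia2 rw" \
    DeviceAllow="/dev/nvidia3 rw" DeviceAllow="/dev/nvidia-uvm rw" \
    DeviceAllow="/dev/nvidia-uvm-tools rw"
Even if that works, the underlying question remains why the spec is generated without these rules in the first place, since anything re-added by hand is lost as soon as the container is recreated.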
5. References
nvidia-container-toolkit issue #48: lack of char device symlinks
nvidia-container-toolkit issue #192: wrong cgroup_device eBPF program attached
runc issue #3708: runc breaks support for NVIDIA GPUs (opencontainers/runc#3708)
Any idea why nvidia-container-runtime generates an incomplete OCI spec config.json? I can attach further details if anyone is interested.