This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

Failed to initialize NVML: Unknown Error without any kubelet update (cpu-manager-policy is default none) #1618

Closed
ReyRen opened this issue Mar 3, 2022 · 8 comments

Comments


ReyRen commented Mar 3, 2022

1. Issue or feature description

Yes, @klueska already described this in #1469, and I have really tried all of those methods, including using nvidia-device-plugin-compat-with-cpumanager.yml, but the error is still there. So let me give more details.

Failed to initialize NVML: Unknown Error does not occur when the NVIDIA container is first created, nor within a couple of seconds (my Kubernetes config file uses the default nodeStatusUpdateFrequency, which is 10s); it happens after a couple of days (sometimes after a few hours).

  • No kubelet update configuration is set
  • cpu-manager-policy is not set (the default is none)
  • docker inspect shows all devices mounted with rw permission, but /sys/fs/cgroup/devices/devices.list inside the container only shows m (the commands I use to compare the two are sketched after this list)
  • If I create the container directly with nvidia-docker in the Kubernetes environment, the error also occurs
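
For reference, this is roughly how I compare what docker inspect reports with what the device cgroup actually allows (a sketch only; gpu-test is a placeholder container name, and the devices.list path depends on your cgroup driver):

# gpu-test is just an example name; substitute your own container
CID=$(docker inspect --format '{{.Id}}' gpu-test)
docker inspect --format '{{json .HostConfig.Devices}}' gpu-test
# cgroupfs driver:
cat /sys/fs/cgroup/devices/docker/$CID/devices.list
# systemd driver:
cat /sys/fs/cgroup/devices/system.slice/docker-$CID.scope/devices.list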

2. What I Found

The broken container:

cat /sys/fs/cgroup/devices/devices.list
...                                                                                                                                                         
c 195:* m
c 507:* m
...
root@node1:/workspace# ll /dev/ | grep nvidia
crw-rw-rw- 1 root root 195, 254 Feb 23 06:28 nvidia-modeset
crw-rw-rw- 1 root root 507,   0 Dec  1 09:36 nvidia-uvm
crw-rw-rw- 1 root root 507,   1 Dec  1 09:36 nvidia-uvm-tools
crw-rw-rw- 1 root root 195,   0 Dec  1 07:09 nvidia0
crw-rw-rw- 1 root root 195, 255 Dec  1 07:09 nvidiactl

The healthy container:

cat /sys/fs/cgroup/devices/devices.list
c 136:* rwm
c 10:200 rwm
c 5:2 rwm
c 5:1 rwm
c 5:0 rwm
c 1:9 rwm
c 1:8 rwm
c 1:7 rwm
c 1:5 rwm
c 1:3 rwm
b *:* m
c *:* m
c 195:1 rw
c 195:254 rw
c 195:255 rw
c 507:0 rw
c 507:1 rw
root@node1:/workspace# ll /dev/ | grep nvidia
crw-rw-rw- 1 root root 195, 254 Mar  3 03:06 nvidia-modeset
crw-rw-rw- 1 root root 507,   0 Dec  1 09:36 nvidia-uvm
crw-rw-rw- 1 root root 507,   1 Dec  1 09:36 nvidia-uvm-tools
crw-rw-rw- 1 root root 195,   1 Dec  1 07:09 nvidia1
crw-rw-rw- 1 root root 195, 255 Dec  1 07:09 nvidiactl
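
So the difference is that in the broken container the NVIDIA device majors (195 and 507) have dropped from rw to m (mknod only), which appears to be exactly when NVML stops working. A rough sketch of how I catch the moment it breaks, run inside the GPU container (the 60-second interval is arbitrary):

# poll until NVML access is lost, then dump the NVIDIA device cgroup entries
while nvidia-smi > /dev/null 2>&1; do sleep 60; done
echo "NVML access lost at $(date)"
grep -E '^c (195|507):' /sys/fs/cgroup/devices/devices.list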

3. Information to attach (optional if deemed irrelevant)

  • Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
  • Kernel version from uname -a
  • Any relevant kernel output lines from dmesg
  • Driver information from nvidia-smi -a
  • Docker version from docker version
    Client: Docker Engine - Community
     Version: 20.10.5
     API version: 1.41
     Go version: go1.13.15
     Git commit: 55c4c88
     Built: Tue Mar 2 20:18:05 2021
     OS/Arch: linux/amd64
     Context: default
     Experimental: true

    Server: Docker Engine - Community
     Engine:
      Version: 20.10.5
      API version: 1.41 (minimum version 1.12)
      Go version: go1.13.15
      Git commit: 363e9a8
      Built: Tue Mar 2 20:16:00 2021
      OS/Arch: linux/amd64
      Experimental: false
     containerd:
      Version: 1.4.3
      GitCommit: 269548fa27e0089a8b8278fc4fc781d7f65a939b
     nvidia:
      Version: 1.0.0-rc92
      GitCommit: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
     docker-init:
      Version: 0.19.0
      GitCommit: de40ad0

  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
    Desired=Unknown/Install/Remove/Purge/Hold
    | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
    |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
    ||/ Name Version Architecture Description
    +++-=====================================-=======================-=======================-===============================================================================
    un  libgldispatch0-nvidia          (no description available)
    ii  libnvidia-container-tools      1.6.0~rc.2-1       amd64  NVIDIA container runtime library (command-line tools)
    ii  libnvidia-container1:amd64     1.6.0~rc.2-1       amd64  NVIDIA container runtime library
    un  nvidia-304                     (no description available)
    un  nvidia-340                     (no description available)
    un  nvidia-384                     (no description available)
    un  nvidia-common                  (no description available)
    ii  nvidia-container-runtime       3.6.0~rc.1-1       amd64  NVIDIA container runtime
    un  nvidia-container-runtime-hook  (no description available)
    ii  nvidia-container-toolkit       1.6.0~rc.2-1       amd64  NVIDIA container runtime hook
    un  nvidia-docker                  (no description available)
    ii  nvidia-docker2                 2.7.0~rc.2-1       all    nvidia-docker CLI wrapper
    ii  nvidia-prime                   0.8.16~0.18.04.1   all    Tools to enable NVIDIA's Prime
  • NVIDIA container library version from nvidia-container-cli -V
    version: 1.6.0~rc.2
    build date: 2021-11-05T14:19+00:00
    build revision: badec1fa4a2c085aa9396f95b6bb1d69f1c7996b
    build compiler: x86_64-linux-gnu-gcc-7 7.5.0
    build platform: x86_64
    build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
  • NVIDIA container library logs (see troubleshooting)
  • Docker command, image and tag used

ReyRen commented Mar 11, 2022

I'd really appreciate any help so I can dig into this further.


t-jonesy commented Apr 9, 2022

@ReyRen did you try adding systemd.unified_cgroup_hierarchy=0 to your boot cmdline? That solved the issue for me.
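
In case it helps, a rough sketch of how to add that parameter on an Ubuntu host that boots with GRUB (assumes the stock /etc/default/grub layout; back the file up first):

# prepend the parameter to GRUB_CMDLINE_LINUX, regenerate the config, and reboot
sudo sed -i 's/GRUB_CMDLINE_LINUX="/GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=0 /' /etc/default/grub
sudo update-grub
sudo reboot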


ReyRen commented Apr 14, 2022

@su-tjones Thanks for your suggestion, and I will give it a try. But I really want to know what systemd.unified_cgroup_hierarchy=0 means and why it can solve this problem.

kentwelcome commented

unified_cgroup_hierarchy=0 means switching the cgroup from v2 to v1.
By default, the version of cgroup is v2.
Ref: https://wiki.archlinux.org/title/Cgroups
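
A quick way to check which hierarchy a host is actually running (a one-line sketch):

stat -fc %T /sys/fs/cgroup/    # cgroup2fs => v2, tmpfs => v1/hybrid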


ReyRen commented Apr 21, 2022

@kentwelcome Thanks.
I found that I am using cgroup v1 on the host (checked per enumerate-cgroups):

root@master:~# cat /sys/fs/cgroup/cgroup.controllers
cat: /sys/fs/cgroup/cgroup.controllers: No such file or directory

and cgroup v1 in docker

# docker info
...
Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 1
...

The host kernel is 5.4.0-90-generic #101~18.04.1-Ubuntu, and the Docker version is 20.10.5.


klueska commented Apr 25, 2022

I've actually heard that switching to cgroupv2 (i.e. flipping unified_cgroup_hierarchy=1) helps to resolve this issue because devices are treated differently under cgroupv2, such that an update on other cgroup subsystems no longer affects devices.


ReyRen commented Apr 26, 2022

@klueska Thanks a lot. I'll give it a shot


ReyRen commented May 7, 2022

It worked for me!!! Thanks everyone, especially @klueska.
Here are some notes:
I changed the host system cgroup from v1 to v2 as described in #10505.
And you have to make sure the libnvidia-container version you installed is recent enough to support cgroup v2: see #1549.
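
Roughly what the switch and the follow-up checks looked like on my Ubuntu 18.04 host with GRUB (a sketch; the 1.8.x cutoff for cgroup v2 support is my understanding from the linked issue, so double-check it for your setup):

sudo sed -i 's/GRUB_CMDLINE_LINUX="/GRUB_CMDLINE_LINUX="systemd.unified_cgroup_hierarchy=1 /' /etc/default/grub
sudo update-grub && sudo reboot
# after the reboot:
stat -fc %T /sys/fs/cgroup/        # should now print cgroup2fs
dpkg -l 'libnvidia-container*'     # should be a release with cgroup v2 support (1.8.x or newer, as I understand it)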
