This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

nvidia-smi command in container returns "Failed to initialize NVML: Unknown Error" after couple of times #1678

Closed
yuan6711043 opened this issue Sep 14, 2022 · 8 comments

Comments

@yuan6711043

yuan6711043 commented Sep 14, 2022

1. Issue or feature description

The NVIDIA GPUs work fine right after the container starts, but after the container has been running for a while (sometimes several days), the GPUs mounted by the NVIDIA container runtime become unusable. Running nvidia-smi inside the container returns "Failed to initialize NVML: Unknown Error", while it still works fine on the host machine.

[Screenshot: nvidia-smi inside the container returning "Failed to initialize NVML: Unknown Error"]

nvidia-smi looks fine on the host, and we can still see the training process in the host's nvidia-smi output. However, if we stop the training process at this point, it can no longer be restarted.

[Screenshot: nvidia-smi on the host working normally and showing the training process]
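For reference, this is a minimal sketch of the check (the container name is just an example; any container started with GPU access shows the same thing):

# on the host: works as expected
nvidia-smi
# inside an affected container after it has been running for a while: fails
docker exec -it <training-container> nvidia-smi
# Failed to initialize NVML: Unknown Error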

Following the solution from issue #1618, we tried switching to cgroup v2, but it did not help.

[Screenshot: cgroup v2 configuration on the host]

Surprisingly, we cannot find the devices.list file mentioned in #1618 anywhere in the container.

[Screenshot: no devices.list file found inside the container]
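For completeness, this is roughly how we check the cgroup version and the device cgroup (a sketch, assuming a standard cgroup mount under /sys/fs/cgroup):

# on the host: prints cgroup2fs on cgroup v2, tmpfs on cgroup v1
stat -fc %T /sys/fs/cgroup/
# inside the container: devices.list only exists under cgroup v1
cat /sys/fs/cgroup/devices/devices.list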

2. Steps to reproduce the issue

We found that this issue can be reproduced by running systemctl daemon-reload on the host, although we have not run any such command in our production environment.

[Screenshot: reproducing the error by running systemctl daemon-reload on the host]
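To make the reproduction concrete, a rough sketch of the steps (assuming the NVIDIA runtime is configured for Docker; the image tag and container name are only examples):

# 1. start a GPU container and confirm nvidia-smi works inside it
docker run -d --gpus all --name gpu-test nvidia/cuda:11.4.3-base-ubuntu20.04 sleep infinity
docker exec gpu-test nvidia-smi
# 2. trigger a systemd reload on the host
sudo systemctl daemon-reload
# 3. re-check inside the same container; it may now fail with
#    "Failed to initialize NVML: Unknown Error"
docker exec gpu-test nvidia-smi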

Can anyone suggest how to narrow down the root cause of this problem?

3. Information to attach (optional if deemed irrelevant)

docker: 20.10.7

k8s: v1.22.5

nvidia driver version: 470.103.01

nvidia-container-runtime: 3.8.1-1

containerd: 1.5.5-0ubuntu3~20.04.2
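These were collected roughly as follows (a sketch; the exact commands depend on the distribution and install method):

nvidia-smi --query-gpu=driver_version --format=csv,noheader   # NVIDIA driver version
docker version --format '{{.Server.Version}}'                 # docker
kubectl version --short                                       # k8s client/server versions
dpkg -l | grep nvidia-container                               # nvidia-container-runtime package version
containerd --version                                          # containerd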

@mbentley

mbentley commented Sep 14, 2022

I've noticed the same behavior for some time on Debian 11, at least since March, which is when I started regularly checking that nvidia-smi still works in containers. Thanks for calling out systemctl daemon-reload as something that triggers it. In my case, I have automatic updates enabled in Debian using unattended-upgrades, and your mention of daemon-reload makes me think the package updates may be triggering a daemon-reload event. I only update packages from the Debian repos automatically; nvidia-docker and third-party repo updates are applied manually.

Here is an example from today where I can see the auto updates happening. In this case telegraf is being updated and then a daemon-reload occurs, or at least that is what I believe the systemd[1]: Reloading. lines indicate, based on the output I see when I manually run systemctl daemon-reload:

Sep 14 06:46:17 athena systemd[1]: Starting Daily apt upgrade and clean activities...
Sep 14 06:46:50 athena systemd[1]: Stopping The plugin-driven server agent for reporting metrics into InfluxDB...
Sep 14 06:46:52 athena telegraf[6829]: 2022-09-14T10:46:52Z I! [agent] Hang on, flushing any cached metrics before shutdown
Sep 14 06:46:52 athena telegraf[6829]: 2022-09-14T10:46:52Z I! [agent] Stopping running outputs
Sep 14 06:46:52 athena systemd[1]: telegraf.service: Succeeded.
Sep 14 06:46:52 athena systemd[1]: Stopped The plugin-driven server agent for reporting metrics into InfluxDB.
Sep 14 06:46:52 athena systemd[1]: telegraf.service: Consumed 14h 46min 50.974s CPU time.
Sep 14 06:46:54 athena systemd[1]: Reloading.
Sep 14 06:46:54 athena systemd[1]: /lib/systemd/system/plymouth-start.service:16: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
Sep 14 06:46:54 athena systemd[1]: nut-monitor.service: Supervising process 7162 which is not our child. We'll most likely not notice when it exits.
Sep 14 06:46:54 athena systemd[1]: Reloading.
Sep 14 06:46:55 athena systemd[1]: /lib/systemd/system/plymouth-start.service:16: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
Sep 14 06:46:55 athena systemd[1]: nut-monitor.service: Supervising process 7162 which is not our child. We'll most likely not notice when it exits.
Sep 14 06:46:55 athena systemd[1]: Starting The plugin-driven server agent for reporting metrics into InfluxDB...
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z W! DeprecationWarning: Option "perdevice" of plugin "inputs.docker" deprecated since version 1.18.0 and will be removed in 2.0.0: use 'perdevice_include' instead
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z I! Starting Telegraf 1.24.0
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z I! Available plugins: 222 inputs, 9 aggregators, 26 processors, 20 parsers, 57 outputs
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z I! Loaded inputs: cpu disk diskio docker exec file ipmi_sensor kernel mem net netstat nvidia_smi processes smart swap system zfs
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z I! Loaded aggregators:
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z I! Loaded processors:
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z I! Loaded outputs: influxdb_v2
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z I! Tags enabled: host=athena
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z W! Deprecated inputs: 0 and 1 options
Sep 14 06:46:55 athena telegraf[4068755]: 2022-09-14T10:46:55Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"athena", Flush Interval:10s
Sep 14 06:46:55 athena systemd[1]: Started The plugin-driven server agent for reporting metrics into InfluxDB.
Sep 14 06:47:14 athena systemd[1]: apt-daily-upgrade.service: Succeeded.
Sep 14 06:47:14 athena systemd[1]: Finished Daily apt upgrade and clean activities.
Sep 14 06:47:14 athena systemd[1]: apt-daily-upgrade.service: Consumed 58.400s CPU time.
Sep 14 06:47:29 athena systemd[1]: Starting Cleanup of Temporary Directories...
Sep 14 06:47:29 athena systemd[1]: systemd-tmpfiles-clean.service: Succeeded.
Sep 14 06:47:29 athena systemd[1]: Finished Cleanup of Temporary Directories.
Sep 14 06:50:00 athena systemd[1]: Starting system activity accounting tool...
Sep 14 06:50:00 athena systemd[1]: sysstat-collect.service: Succeeded.
Sep 14 06:50:00 athena systemd[1]: Finished system activity accounting tool.
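For anyone who wants to check whether the same thing is happening on their host, something like this should surface the reload events around the time containers lose their GPUs (a sketch; adjust the time window as needed):

# show systemd reload events and apt upgrade activity from today
journalctl --since today | grep -E 'systemd\[1\]: Reloading|apt-daily-upgrade'
# or follow the upgrade unit live
journalctl -f -u apt-daily-upgrade.service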

@elezar
Member

elezar commented Sep 14, 2022

Do the packages in the Debian repositories include the NVIDIA drivers?

@mbentley

mbentley commented Sep 14, 2022

Fair point and callout - they do. Looking back and cross-checking the times when I detected the issue (and sent myself a notification) against the packages that were upgraded at those times, I recorded the following:

2022-09-14:

google-chrome-stable:amd64
telegraf:amd64
handbrake-cli:amd64
handbrake:amd64
handbrake-gtk:amd64

2022-08-17:

epiphany-browser-data:amd64
libjavascriptcoregtk-4.0-18:amd64
libsnmp40:amd64
libsnmp-base:amd64
telegraf:amd64
google-chrome-stable:amd64
epiphany-browser:amd64

2022-08-13:

python3-samba:amd64
libldb2:amd64
samba-vfs-modules:amd64
samba:amd64
libwbclient0:amd64
libsmbclient:amd64
samba-dsdb-modules:amd64
samba-common-bin:amd64
python3-ldb:amd64
samba-libs:amd64
samba-common:amd64

2022-07-27:

linux-kbuild-5.10:amd64
linux-compiler-gcc-10-x86:amd64
telegraf:amd64
linux-libc-dev:amd64
libcpupower1:amd64

2022-07-13:

telegraf:amd64

2022-07-06:

google-chrome-stable:amd64
telegraf:amd64

2022-05-29:

rsyslog:amd64

2022-05-20:

libldap-common:amd64
ldap-utils:amd64
libldap-2.4-2:amd64
libldap-2.4-2:i386

2022-05-17:

telegraf:amd64

2022-04-29:

telegraf:amd64

2022-04-27:

telegraf:amd64

2022-04-20:

libnvpair3linux:amd64
libuutil3linux:amd64
zfs-dkms:amd64
libzpool5linux:amd64
libzfs4linux:amd64
zfsutils-linux:amd64

While I see telegraf frequently, it's not consistent. I may be reading too much into the daemon-reload behavior, but in almost every case I can see that a package with a systemd unit was upgraded, which I would expect triggers a daemon-reload to handle the update. Unfortunately I do not have syslog logs going back far enough to confirm this in every case, but I can see that it does not seem to correspond to driver package updates.
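One way to reconstruct this kind of upgrade history after the fact is from the apt logs, which makes the cross-check against reload events easier (a sketch; the log paths are the Debian defaults):

# list upgrade transactions with their timestamps
zgrep -hE 'Start-Date:|Upgrade:' /var/log/apt/history.log*
# unattended-upgrades also keeps its own log, if enabled
grep 'Packages that will be upgraded' /var/log/unattended-upgrades/unattended-upgrades.log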

@iFede94

iFede94 commented Sep 15, 2022

I'm encountering the same issue. I'm currently testing some solutions proposed in NVIDIA/nvidia-container-toolkit#251 and #1671 and I will let you know if something works for me.

@yuan6711043
Author

@mbentley Thanks for the reminder. I will check whether any packages are auto-upgraded in our production environments.

@mbentley

mbentley commented Sep 20, 2022

At least in my case, where telegraf is a big culprit, I can see that its post-install script does call systemctl daemon-reload, which matches the behavior I've been seeing.

Same for rsyslog (not sure where the Debian packaging lives source-code wise, but here is the postinst script).

So far I haven't seen any instances where driver upgrades have impacted running containers, but the drivers have only been updated once (on 9/12), so there is just a sample size of one to go on from my logs. It would be easy enough to add the NVIDIA drivers to the package blacklist if they were causing the issue, but from what I can tell, they do not seem to be the trigger.
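If anyone else wants to check which installed packages do this, the maintainer scripts are on disk, so a quick scan works (a sketch; the path is the dpkg default on Debian):

# list installed packages whose post-install script calls daemon-reload
grep -l 'daemon-reload' /var/lib/dpkg/info/*.postinst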

@dcarrion87
Copy link

We recently had to solve this for an interactive runc issue. E.g.:

We only just realised we're hitting this now for GPUs dropping out in containers too.

@elezar
Member

elezar commented Nov 27, 2023

I am closing this as a duplicate of NVIDIA/nvidia-container-toolkit#48 -- a known issue with certain runc / systemd version combinations. Please see the steps to address this there or create a new issue against https://github.com/NVIDIA/nvidia-container-toolkit if you are still having problems.

@elezar elezar closed this as completed Nov 27, 2023