cgroup failures with v1.8.0 #158

Closed · NHellFire opened this issue Feb 6, 2022 · 6 comments

@NHellFire

Since v1.8.0, I'm unable to start any docker containers that use the GPU:

$ docker run --rm --gpus all ubuntu nvidia-smi
docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: container error: failed to get device cgroup mount path: unable to create cgroupv0 interface: invalid version: unknown.
$ dpkg -l libnvidia-container1
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                       Version      Architecture Description
+++-==========================-============-============-=================================
ii  libnvidia-container1:amd64 1.8.0-1      amd64        NVIDIA container runtime library

I had a look around and found that the original error message was being overwritten, so after making GetDeviceCGroupVersion print the underlying error itself, I get:

docker: Error response from daemon: OCI runtime create failed: container_linux.go:380: starting container process caused: process_linux.go:545: container init caused: Running hook #0:: error running hook: exit status 1, stdout: failed to open cgroup path for pid '2259321': open /proc/2259321/cgroup: no such file or directory
, stderr: nvidia-container-cli: container error: failed to get device cgroup mount path: unable to create cgroupv0 interface: invalid version: unknown.
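
For context, the version probe that fails here ultimately comes down to reading /proc/<pid>/cgroup for the container's init PID. The real check lives in the Go nvcgo module, so the standalone C sketch below is only illustrative of the shape of the logic; the function name and exact heuristic are not the library's.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Illustrative only: roughly the kind of check nvcgo performs when the
 * hook asks for the device cgroup version of a given PID. Returns 1 or 2
 * on success, or -1 if /proc/<pid>/cgroup cannot be opened -- which is
 * the "no such file or directory" failure shown above. */
static int guess_device_cgroup_version(pid_t pid)
{
        char path[64], line[256];
        FILE *fp;
        int version = 2;        /* assume the unified hierarchy by default */

        snprintf(path, sizeof(path), "/proc/%d/cgroup", (int)pid);
        if ((fp = fopen(path, "r")) == NULL) {
                perror(path);   /* hidepid=2 surfaces as ENOENT here */
                return -1;
        }
        while (fgets(line, sizeof(line), fp) != NULL) {
                /* v1 entries look like "N:controller:/path"; the unified
                 * (v2) hierarchy is the single entry "0::/path". */
                if (strncmp(line, "0::", 3) != 0)
                        version = 1;
        }
        fclose(fp);
        return version;
}

int main(int argc, char **argv)
{
        pid_t pid = (argc > 1) ? (pid_t)atoi(argv[1]) : getpid();
        printf("device cgroup version for pid %d: %d\n",
               (int)pid, guess_device_cgroup_version(pid));
        return 0;
}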

Downgrading to v1.7.0 allows me to run containers again:

$ sudo dpkg -i dist/ubuntu18.04/amd64/libnvidia-container1_1.7.0-1_amd64.deb
dpkg: warning: downgrading libnvidia-container1:amd64 from 1.8.0-1 to 1.7.0-1
(Reading database ... 513159 files and directories currently installed.)
Preparing to unpack .../libnvidia-container1_1.7.0-1_amd64.deb ...
Unpacking libnvidia-container1:amd64 (1.7.0-1) over (1.8.0-1) ...
Setting up libnvidia-container1:amd64 (1.7.0-1) ...
Processing triggers for libc-bin (2.31-0ubuntu9.3) ...
$ sudo dpkg -i dist/ubuntu18.04/amd64/libnvidia-container-tools_1.7.0-1_amd64.deb
dpkg: warning: downgrading libnvidia-container-tools from 1.8.0-1 to 1.7.0-1
(Reading database ... 513157 files and directories currently installed.)
Preparing to unpack .../libnvidia-container-tools_1.7.0-1_amd64.deb ...
Unpacking libnvidia-container-tools (1.7.0-1) over (1.8.0-1) ...
Setting up libnvidia-container-tools (1.7.0-1) ...
$ docker run --rm --gpus all ubuntu nvidia-smi
Sun Feb  6 20:53:40 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 495.46       Driver Version: 495.46       CUDA Version: 11.5     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA T400         Off  | 00000000:81:00.0  On |                  N/A |
| 38%   34C    P8    N/A /  31W |    144MiB /  1873MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Docker info:

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Docker Buildx (Docker Inc., v0.7.1-docker)
  scan: Docker Scan (Docker Inc., v0.12.0)

Server:
 Containers: 12
  Running: 0
  Paused: 0
  Stopped: 12
 Images: 17
 Server Version: 20.10.12
 Storage Driver: zfs
  Zpool: rpool
  Zpool Health: ONLINE
  Parent Dataset: rpool/ROOT/ubuntu_lxg58y/var/lib
  Space Used By Parent: 140127559680
  Space Available: 1258287800320
  Parent Quota: no
  Compression: lz4
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: active
  NodeID: m7ifhapo10a7redthuyd9x1hx
  Is Manager: false
  Node Address: 10.0.0.1
  Manager Addresses:
   192.168.2.100:2377
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 7b11cfaabd73bb80907dd23182b9347b4245eb5d
 runc version: v1.0.2-0-g52b36a2
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.13.0-28-generic
 Operating System: Ubuntu 20.04.3 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 56
 Total Memory: 125.5GiB
 Name: localhost
 ID: 3CO2:XYSI:74R7:SNDX:FM2Z:VEIR:3JTT:BTNM:L57Z:SOTJ:MCZC:3SCE
 Docker Root Dir: /var/lib/docker
 Debug Mode: true
  File Descriptors: 32
  Goroutines: 89
  System Time: 2022-02-06T20:47:43.939569497Z
  EventsListeners: 0
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Other installed nvidia packages:

$ dpkg -l '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                             Version      Architecture Description
+++-================================-============-============-=====================================================
un  libgldispatch0-nvidia            <none>       <none>       (no description available)
ii  libnvidia-container-tools        1.7.0-1      amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64       1.7.0-1      amd64        NVIDIA container runtime library
un  libnvidia-encode1                <none>       <none>       (no description available)
un  nvidia-common                    <none>       <none>       (no description available)
ii  nvidia-container-runtime         3.8.0-1      all          NVIDIA container runtime
un  nvidia-container-runtime-hook    <none>       <none>       (no description available)
ii  nvidia-container-toolkit         1.8.0-1      amd64        NVIDIA container runtime hook
un  nvidia-docker                    <none>       <none>       (no description available)
rc  nvidia-docker2                   2.9.0-1      all          nvidia-docker CLI wrapper
un  nvidia-legacy-304xx-vdpau-driver <none>       <none>       (no description available)
un  nvidia-legacy-340xx-vdpau-driver <none>       <none>       (no description available)
un  nvidia-libopencl1-dev            <none>       <none>       (no description available)
un  nvidia-opencl-icd                <none>       <none>       (no description available)
un  nvidia-prime                     <none>       <none>       (no description available)
un  nvidia-vdpau-driver              <none>       <none>       (no description available)
@NHellFire (Author)

I tracked this down to /proc being mounted with hidepid=2. CAP_SYS_PTRACE needs to be allowed in init to get an unrestricted view of /proc; I've opened #159 to add it.
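
To reproduce the visibility problem in isolation, the small sketch below (not project code) can be run as an unprivileged user against a PID owned by a different user on a /proc mounted with hidepid=2: stat() fails with ENOENT even though the process exists, whereas a caller holding CAP_SYS_PTRACE in its effective set sees the path normally.

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Minimal reproducer for the hidepid=2 behaviour: with /proc mounted as
 * "proc /proc proc rw,hidepid=2 0 0", another user's /proc/<pid> entries
 * are reported as missing (ENOENT) unless the caller owns the process or
 * has CAP_SYS_PTRACE in its effective set. */
int main(int argc, char **argv)
{
        char path[64];
        struct stat st;

        if (argc < 2) {
                fprintf(stderr, "usage: %s <pid>\n", argv[0]);
                return 1;
        }
        snprintf(path, sizeof(path), "/proc/%s/cgroup", argv[1]);
        if (stat(path, &st) < 0) {
                perror(path);   /* "No such file or directory" under hidepid=2 */
                return 1;
        }
        puts("visible");
        return 0;
}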

@klueska (Contributor) commented Feb 7, 2022

Hmm. I’d rather not have to add this CAP. Maybe we can find a better way of determining the cgroup version.
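
For reference, one /proc-free way to tell cgroup v1 from v2 is to statfs() the cgroup mount point. This only reflects the caller's own mount namespace rather than the target container's, so take it purely as a sketch of the idea and not as the fix that was ultimately adopted.

#include <stdio.h>
#include <sys/statfs.h>
#include <linux/magic.h>        /* CGROUP2_SUPER_MAGIC */

/* Sketch of a /proc-free heuristic: inspect what is mounted at
 * /sys/fs/cgroup. On a pure v2 system it is a cgroup2 filesystem; on v1
 * (or hybrid) systems it is a tmpfs holding per-controller mounts. */
int main(void)
{
        struct statfs fs;

        if (statfs("/sys/fs/cgroup", &fs) < 0) {
                perror("/sys/fs/cgroup");
                return 1;
        }
        if (fs.f_type == CGROUP2_SUPER_MAGIC)
                puts("cgroup v2 (unified hierarchy)");
        else
                puts("cgroup v1 (or hybrid: tmpfs with per-controller mounts)");
        return 0;
}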

@klueska (Contributor) commented Feb 7, 2022

I spent a little time thinking about this, and I think the correct fix is as follows. It ensures we have the same set of caps in effect (at the same time) as we did in the old cgroups implementation.

diff --git a/src/cgroup.c b/src/cgroup.c
index 187c7dbd..4f088602 100644
--- a/src/cgroup.c
+++ b/src/cgroup.c
@@ -54,6 +54,17 @@ nvcgo_get_device_cgroup_version_1_svc(ptr_t ctxptr, char *proc_root, pid_t pid,

         memset(res, 0, sizeof(*res));

+        // Explicitly set CAP_EFFECTIVE to NVC_CONTAINER across the
+        // 'GetDeviceCGroupVersion()' call.  This is only done because we
+        // happen to know these are the effective capabilities set by the
+        // nvidia-container-cli (i.e. the only known user of this library)
+        // anytime this RPC handler is invoked. In the future we should
+        // consider setting effective capabilities on the server to match
+        // whatever capabilities were in effect in the client when the RPC call
+        // was made.
+        if (perm_set_capabilities(err, CAP_EFFECTIVE, ecaps[NVC_CONTAINER], ecaps_size(NVC_CONTAINER)) < 0)
+                goto fail;
+
         if ((rv = nvcgo->api.GetDeviceCGroupVersion(proc_root, pid, &version, &rerr) < 0)) {
                 error_setx(err, "failed to get device cgroup version: %s", rerr);
                 goto fail;
@@ -63,6 +74,7 @@ nvcgo_get_device_cgroup_version_1_svc(ptr_t ctxptr, char *proc_root, pid_t pid,
         rv = 0;

  fail:
+        perm_set_capabilities(err, CAP_EFFECTIVE, NULL, 0);
         free(rerr);
         if (rv < 0)
                 error_to_xdr(err, res);
@@ -103,6 +115,17 @@ nvcgo_find_device_cgroup_path_1_svc(ptr_t ctxptr, int dev_cg_version, char *proc

         memset(res, 0, sizeof(*res));

+        // Explicitly set CAP_EFFECTIVE to NVC_CONTAINER across the
+        // 'GetDeviceCGroupMountPath()' and 'GetDeviceCGroupRootPath()' calls.
+        // This is only done because we happen to know these are the effective
+        // capabilities set by the nvidia-container-cli (i.e. the only known
+        // user of this library) anytime this RPC handler is invoked. In the
+        // future we should consider setting effective capabilities on the
+        // server to match whatever capabilities were in effect in the client
+        // when the RPC call was made.
+        if (perm_set_capabilities(err, CAP_EFFECTIVE, ecaps[NVC_CONTAINER], ecaps_size(NVC_CONTAINER)) < 0)
+                goto fail;
+
         if ((rv = nvcgo->api.GetDeviceCGroupMountPath(dev_cg_version, proc_root, mp_pid, &cgroup_mount, &rerr)) < 0) {
                 error_setx(err, "failed to get device cgroup mount path: %s", rerr);
                 goto fail;
@@ -127,6 +150,7 @@ nvcgo_find_device_cgroup_path_1_svc(ptr_t ctxptr, int dev_cg_version, char *proc
         rv = 0;

  fail:
+        perm_set_capabilities(err, CAP_EFFECTIVE, NULL, 0);
         free(rerr);
         free(cgroup_mount);
         free(cgroup_root);
@@ -192,13 +216,10 @@ nvcgo_setup_device_cgroup_1_svc(ptr_t ctxptr, int dev_cg_version, char *dev_cg,
                 goto fail;
         }

-        // Reset the effective capabilities to NULL.
-        if (perm_set_capabilities(err, CAP_EFFECTIVE, NULL, 0) < 0)
-                goto fail;
-
         rv = 0;

 fail:
+        perm_set_capabilities(err, CAP_EFFECTIVE, NULL, 0);
         free(rerr);
         if (rv < 0)
                 error_to_xdr(err, res);
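
For readers unfamiliar with the perm_* helpers, the pattern in the patch is the usual libcap raise-then-drop sequence around a privileged operation. A rough standalone equivalent using plain libcap is sketched below; perm_set_capabilities() and the ecaps table are project helpers, and CAP_SYS_PTRACE is used here only as an example capability.

#include <stdio.h>
#include <sys/capability.h>     /* libcap; link with -lcap */

/* Set the effective capability set to exactly the given list (an empty
 * list clears it). The capabilities must already be in the permitted set
 * for the raise to succeed, which mirrors how the perm_* helpers are used. */
static int set_effective(cap_value_t *caps, int ncaps)
{
        cap_t state;
        int rv = -1;

        if ((state = cap_get_proc()) == NULL)
                return -1;
        if (cap_clear_flag(state, CAP_EFFECTIVE) < 0)
                goto done;
        if (ncaps > 0 && cap_set_flag(state, CAP_EFFECTIVE, ncaps, caps, CAP_SET) < 0)
                goto done;
        rv = cap_set_proc(state);
done:
        cap_free(state);
        return rv;
}

int main(void)
{
        cap_value_t needed[] = { CAP_SYS_PTRACE };

        if (set_effective(needed, 1) < 0)       /* raise across the call... */
                perror("raising CAP_SYS_PTRACE");
        /* ... perform the /proc or cgroup work here ... */
        set_effective(NULL, 0);                 /* ...then drop everything again */
        return 0;
}

The important detail, and what the patch restores, is that the raise and the matching drop bracket every call path, including the error paths reached via goto fail.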

@elezar (Member) commented Feb 14, 2022

Hi @NHellFire. We have just published NVIDIA Container Toolkit v1.8.1, which should address this issue. Please upgrade to the new version and let us know if the problem persists; otherwise, feel free to close this issue.

@NHellFire (Author)

@elezar Tested and working, thanks!
