Using nvidia runtime with buildkit #122

Description

@rmax

I'm trying to make the NVIDIA driver available during the build. This works with the default docker build command, but not when using BuildKit.

I have this minimal Dockerfile:

FROM nvidia/cuda:11.1-base
RUN ls /dev/nvidia*
RUN nvidia-smi                                                                                                                                                                                         

which I can build as follows:

❯ docker build -f Dockerfile . --no-cache
Sending build context to Docker daemon  5.167kB
Step 1/3 : FROM nvidia/cuda:11.1-base
 ---> 287475453634
Step 2/3 : RUN ls /dev/nvidia*
 ---> Running in e8c12b8a398b
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia0
/dev/nvidiactl
Removing intermediate container e8c12b8a398b
 ---> 17884a1f0b6a
Step 3/3 : RUN nvidia-smi
 ---> Running in 4a5bbd2337c0
Thu Sep 23 14:08:59 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.119.03   Driver Version: 450.119.03   CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   35C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Removing intermediate container 4a5bbd2337c0
 ---> 308dfa443901
Successfully built 308dfa443901

But when using BuildKit I get:

❯ DOCKER_BUILDKIT=1 docker build -f Dockerfile .
[+] Building 0.4s (5/6)
 => [internal] load build definition from Dockerfile                                                                                                                                     0.0s
 => => transferring dockerfile: 116B                                                                                                                                                                   0.0s
 => [internal] load .dockerignore                                                                                                                                                                      0.0s
 => => transferring context: 2B                                                                                                                                                                        0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:11.1-base                                                                                                                                       0.0s
 => CACHED [1/3] FROM docker.io/nvidia/cuda:11.1-base                                                                                                                                                  0.0s
 => ERROR [2/3] RUN ls /dev/nvidia*                                                                                                                                                                    0.3s
------
 > [2/3] RUN ls /dev/nvidia*:
#5 0.299 ls: cannot access '/dev/nvidia*': No such file or directory
------
executor failed running [/bin/sh -c ls /dev/nvidia*]: exit code: 2

Then I figured out that I have to use RUN --security=insecure and build with docker buildx, as follows:

# Dockerfile.buildkit
# syntax = docker/dockerfile:experimental
FROM nvidia/cuda:11.1-base
RUN --security=insecure nvidia-smi

I create the builder:

❯ docker buildx create --driver docker-container --name local --buildkitd-flags '--allow-insecure-entitlement security.insecure' --use
local
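To double-check that the builder actually picked up the entitlement flag, it can be inspected (a quick sanity check, assuming the builder name local from above; the exact output format varies by buildx version):

```shell
# Bootstrap the builder container and print its configuration;
# the daemon flags should include the security.insecure entitlement.
docker buildx inspect local --bootstrap
```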

Then I build the image as follows:

❯ docker buildx build -f Dockerfile.buildkit . --allow security.insecure
WARN[0000] No output specified for docker-container driver. Build result will only remain in the build cache. To push result image into registry use --push or to load image into docker use --load
[+] Building 0.7s (9/9) FINISHED
 => [internal] load build definition from Dockerfile.buildkit                                                                                                                                          0.0s
 => => transferring dockerfile: 150B                                                                                                                                                                   0.0s
 => [internal] load .dockerignore                                                                                                                                                                      0.0s
 => => transferring context: 2B                                                                                                                                                                        0.0s
 => resolve image config for docker.io/docker/dockerfile:experimental                                                                                                                                  0.1s
 => CACHED docker-image://docker.io/docker/dockerfile:experimental@sha256:600e5c62eedff338b3f7a0850beb7c05866e0ef27b2d2e8c02aa468e78496ff5                                                             0.0s
 => => resolve docker.io/docker/dockerfile:experimental@sha256:600e5c62eedff338b3f7a0850beb7c05866e0ef27b2d2e8c02aa468e78496ff5                                                                        0.0s
 => [internal] load .dockerignore                                                                                                                                                                      0.0s
 => [internal] load build definition from Dockerfile.buildkit                                                                                                                                          0.0s
 => [internal] load metadata for docker.io/nvidia/cuda:11.1-base                                                                                                                                       0.1s
 => CACHED [1/2] FROM docker.io/nvidia/cuda:11.1-base@sha256:c6bb47a62ad020638aeaf66443de9c53c6dc8a0376e97b2d053ac774560bd0fa                                                                          0.0s
 => => resolve docker.io/nvidia/cuda:11.1-base@sha256:c6bb47a62ad020638aeaf66443de9c53c6dc8a0376e97b2d053ac774560bd0fa                                                                                 0.0s
 => ERROR [2/2] RUN --security=insecure nvidia-smi                                                                                                                                                     0.1s
------
 > [2/2] RUN --security=insecure nvidia-smi:
#8 0.074 /bin/sh: 1: nvidia-smi: not found
------
Dockerfile.buildkit:3
--------------------
   1 |     # syntax = docker/dockerfile:experimental
   2 |     FROM nvidia/cuda:11.1-base
   3 | >>> RUN --security=insecure nvidia-smi
   4 |
--------------------
error: failed to solve: process "/bin/sh -c nvidia-smi" did not complete successfully: exit code: 127

I know the insecure flag works, because another command that requires privileges (e.g. mount --bind /dev /tmp) succeeds with it.
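That check can be written as a minimal Dockerfile (a sketch of the test described above; built the same way as Dockerfile.buildkit, with --allow security.insecure):

```dockerfile
# syntax = docker/dockerfile:experimental
FROM nvidia/cuda:11.1-base
# mount needs extra privileges, so this step only succeeds
# when the security.insecure entitlement is actually granted
RUN --security=insecure mount --bind /dev /tmp
```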

This is my daemon.json:

{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
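As a sanity check of this configuration, the daemon can be asked directly which runtime it defaults to (a sketch; requires the Docker CLI pointed at this daemon):

```shell
# Print the daemon's default runtime via a Go template;
# with the daemon.json above this should report "nvidia".
docker info --format '{{.DefaultRuntime}}'
```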

The output of docker info

Client:
 Context:    default
 Debug Mode: false
 Plugins:
  app: Docker App (Docker Inc., v0.9.1-beta3)
  buildx: Build with BuildKit (Docker Inc., v0.6.1-docker)
  scan: Docker Scan (Docker Inc., v0.8.0)

Server:
 Containers: 10
  Running: 1
  Paused: 0
  Stopped: 9
 Images: 191
 Server Version: 20.10.8
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc io.containerd.runc.v2 io.containerd.runtime.v1.linux nvidia
 Default Runtime: nvidia
 Init Binary: docker-init
 containerd version: e25210fe30a0a703442421b0f60afac609f950a3
 runc version: v1.0.1-0-g4144b63
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.4.0-1056-aws
 Operating System: Ubuntu 18.04.5 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 59.86GiB
 Name: ip-172-31-13-189
 ID: JSBF:DURT:RVBM:P7XL:YIWL:IKJU:3WIS:25N6:UH72:ALJA:XDRO:R35Q
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support

The output of nvidia-container-cli -k -d /dev/tty info


-- WARNING, the following logs are for debugging purposes only --

I0923 14:17:11.705485 3265 nvc.c:372] initializing library context (version=1.4.0, build=704a698b7a0ceec07a48e56c37365c741718c2df)
I0923 14:17:11.705537 3265 nvc.c:346] using root /
I0923 14:17:11.705556 3265 nvc.c:347] using ldcache /etc/ld.so.cache
I0923 14:17:11.705571 3265 nvc.c:348] using unprivileged user 1000:1000
I0923 14:17:11.705600 3265 nvc.c:389] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0923 14:17:11.705791 3265 nvc.c:391] dxcore initialization failed, continuing assuming a non-WSL environment
W0923 14:17:11.711421 3266 nvc.c:269] failed to set inheritable capabilities
W0923 14:17:11.711475 3266 nvc.c:270] skipping kernel modules load due to failure
I0923 14:17:11.711714 3267 driver.c:101] starting driver service
I0923 14:17:11.715296 3265 nvc_info.c:676] requesting driver information with ''
I0923 14:17:11.717376 3265 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.450.119.03
I0923 14:17:11.717652 3265 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvoptix.so.450.119.03
I0923 14:17:11.717748 3265 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.450.119.03
I0923 14:17:11.717819 3265 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.450.119.03
I0923 14:17:11.717897 3265 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.450.119.03
I0923 14:17:11.718008 3265 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.450.119.03
I0923 14:17:11.718111 3265 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.450.119.03
I0923 14:17:11.718180 3265 nvc_info.c:171] skipping /usr/lib/x86_64-linux-gnu/libnvidia-nscq-dcgm.so.450.51.06
I0923 14:17:11.718256 3265 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.450.119.03
I0923 14:17:11.718322 3265 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.450.119.03
I0923 14:17:11.718430 3265 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-ifr.so.450.119.03
I0923 14:17:11.718548 3265 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.450.119.03
I0923 14:17:11.718625 3265 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.450.119.03
I0923 14:17:11.718713 3265 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.450.119.03
I0923 14:17:11.718794 3265 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.450.119.03
I0923 14:17:11.718902 3265 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.450.119.03
I0923 14:17:11.719021 3265 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.450.119.03
I0923 14:17:11.719108 3265 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.450.119.03
I0923 14:17:11.719184 3265 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.450.119.03
I0923 14:17:11.719293 3265 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-cbl.so.450.119.03
I0923 14:17:11.719381 3265 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.450.119.03
I0923 14:17:11.719472 3265 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libnvcuvid.so.450.119.03
I0923 14:17:11.719897 3265 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libcuda.so.450.119.03
I0923 14:17:11.720133 3265 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.450.119.03
I0923 14:17:11.720217 3265 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.450.119.03
I0923 14:17:11.720300 3265 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.450.119.03
I0923 14:17:11.720380 3265 nvc_info.c:169] selecting /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.450.119.03
W0923 14:17:11.720453 3265 nvc_info.c:350] missing library libnvidia-nscq.so
W0923 14:17:11.720469 3265 nvc_info.c:350] missing library libnvidia-fatbinaryloader.so
W0923 14:17:11.720491 3265 nvc_info.c:354] missing compat32 library libnvidia-ml.so
W0923 14:17:11.720501 3265 nvc_info.c:354] missing compat32 library libnvidia-cfg.so
W0923 14:17:11.720516 3265 nvc_info.c:354] missing compat32 library libnvidia-nscq.so
W0923 14:17:11.720534 3265 nvc_info.c:354] missing compat32 library libcuda.so
W0923 14:17:11.720540 3265 nvc_info.c:354] missing compat32 library libnvidia-opencl.so
W0923 14:17:11.720547 3265 nvc_info.c:354] missing compat32 library libnvidia-ptxjitcompiler.so
W0923 14:17:11.720561 3265 nvc_info.c:354] missing compat32 library libnvidia-fatbinaryloader.so
W0923 14:17:11.720567 3265 nvc_info.c:354] missing compat32 library libnvidia-allocator.so
W0923 14:17:11.720582 3265 nvc_info.c:354] missing compat32 library libnvidia-compiler.so
W0923 14:17:11.720600 3265 nvc_info.c:354] missing compat32 library libnvidia-ngx.so
W0923 14:17:11.720612 3265 nvc_info.c:354] missing compat32 library libvdpau_nvidia.so
W0923 14:17:11.720628 3265 nvc_info.c:354] missing compat32 library libnvidia-encode.so
W0923 14:17:11.720642 3265 nvc_info.c:354] missing compat32 library libnvidia-opticalflow.so
W0923 14:17:11.720658 3265 nvc_info.c:354] missing compat32 library libnvcuvid.so
W0923 14:17:11.720667 3265 nvc_info.c:354] missing compat32 library libnvidia-eglcore.so
W0923 14:17:11.720674 3265 nvc_info.c:354] missing compat32 library libnvidia-glcore.so
W0923 14:17:11.720686 3265 nvc_info.c:354] missing compat32 library libnvidia-tls.so
W0923 14:17:11.720693 3265 nvc_info.c:354] missing compat32 library libnvidia-glsi.so
W0923 14:17:11.720710 3265 nvc_info.c:354] missing compat32 library libnvidia-fbc.so
W0923 14:17:11.720725 3265 nvc_info.c:354] missing compat32 library libnvidia-ifr.so
W0923 14:17:11.720740 3265 nvc_info.c:354] missing compat32 library libnvidia-rtcore.so
W0923 14:17:11.720755 3265 nvc_info.c:354] missing compat32 library libnvoptix.so
W0923 14:17:11.720774 3265 nvc_info.c:354] missing compat32 library libGLX_nvidia.so
W0923 14:17:11.720792 3265 nvc_info.c:354] missing compat32 library libEGL_nvidia.so
W0923 14:17:11.720799 3265 nvc_info.c:354] missing compat32 library libGLESv2_nvidia.so
W0923 14:17:11.720807 3265 nvc_info.c:354] missing compat32 library libGLESv1_CM_nvidia.so
W0923 14:17:11.720818 3265 nvc_info.c:354] missing compat32 library libnvidia-glvkspirv.so
W0923 14:17:11.720823 3265 nvc_info.c:354] missing compat32 library libnvidia-cbl.so
I0923 14:17:11.721761 3265 nvc_info.c:276] selecting /usr/bin/nvidia-smi
I0923 14:17:11.721799 3265 nvc_info.c:276] selecting /usr/bin/nvidia-debugdump
I0923 14:17:11.721837 3265 nvc_info.c:276] selecting /usr/bin/nvidia-persistenced
I0923 14:17:11.721887 3265 nvc_info.c:276] selecting /usr/bin/nv-fabricmanager
I0923 14:17:11.721930 3265 nvc_info.c:276] selecting /usr/bin/nvidia-cuda-mps-control
I0923 14:17:11.721972 3265 nvc_info.c:276] selecting /usr/bin/nvidia-cuda-mps-server
I0923 14:17:11.722021 3265 nvc_info.c:438] listing device /dev/nvidiactl
I0923 14:17:11.722043 3265 nvc_info.c:438] listing device /dev/nvidia-uvm
I0923 14:17:11.722049 3265 nvc_info.c:438] listing device /dev/nvidia-uvm-tools
I0923 14:17:11.722056 3265 nvc_info.c:438] listing device /dev/nvidia-modeset
W0923 14:17:11.722103 3265 nvc_info.c:321] missing ipc /var/run/nvidia-persistenced/socket
W0923 14:17:11.722155 3265 nvc_info.c:321] missing ipc /var/run/nvidia-fabricmanager/socket
W0923 14:17:11.722193 3265 nvc_info.c:321] missing ipc /tmp/nvidia-mps
I0923 14:17:11.722213 3265 nvc_info.c:733] requesting device information with ''
I0923 14:17:11.729066 3265 nvc_info.c:623] listing device /dev/nvidia0 (GPU-f82fe76f-d403-d34b-8b80-9a9316b19b18 at 00000000:00:1e.0)
NVRM version:   450.119.03
CUDA version:   11.0

Device Index:   0
Device Minor:   0
Model:          Tesla K80
Brand:          Tesla
GPU UUID:       GPU-f82fe76f-d403-d34b-8b80-9a9316b19b18
Bus Location:   00000000:00:1e.0
Architecture:   3.7
I0923 14:17:11.729156 3265 nvc.c:423] shutting down library context
I0923 14:17:11.729661 3267 driver.c:163] terminating driver service
I0923 14:17:11.730042 3265 driver.c:203] driver service terminated successfully

It looks like the builder is not using the nvidia runtime. What am I missing?
