This repository has been archived by the owner on Jan 22, 2024. It is now read-only.

nvidia driver version baked into docker build #856

Closed
douglas-gibbons opened this issue Nov 5, 2018 · 13 comments

@douglas-gibbons

If I build a Docker image on Machine A, I should be able to run it on Machine B even if the NVIDIA drivers are different minor versions. Instead, I'm getting errors building a C++ app because libcuda.so points to the wrong version.

C++ error

Linking CXX executable example-app
/usr/lib/gcc/x86_64-linux-gnu/5/../../../x86_64-linux-gnu/libcuda.so: file not recognized: File truncated
collect2: error: ld returned 1 exit status

It seems the root problem is that the NVIDIA driver library symlinks from our build machines are leaking into our Docker images. See the example Dockerfile.nvidia below, stripped down to the minimum needed to reproduce, built on Machine A with driver version libcuda.so.390.77. Running the resulting image on Machine B with driver version libcuda.so.390.87 demonstrates the issue.

Machine A

$ ls -lah /usr/lib/x86_64-linux-gnu/libcuda*
lrwxrwxrwx 1 root root   12 Jul 17 03:00 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root   17 Jul 17 03:00 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.390.77
-rw-r--r-- 1 root root 9.6M Jul 10 22:18 /usr/lib/x86_64-linux-gnu/libcuda.so.390.77
$ cat Dockerfile.nvidia 
FROM nvidia/cudagl:9.0-devel-ubuntu16.04

RUN echo "Hello Docker!!"
$ docker build -t my.registry.com/nvidia:driver_test -f ./Dockerfile.nvidia ./
Sending build context to Docker daemon   5.12kB
Step 1/2 : FROM nvidia/cudagl:9.0-devel-ubuntu16.04
 ---> 6c072aad8335
Step 2/2 : RUN echo "Hello Docker!!"
 ---> Using cache
 ---> 1e2a278b1d9f
Successfully built 1e2a278b1d9f
Successfully tagged my.registry.com/nvidia:driver_test

Machine B

$ ls -lah /usr/lib/x86_64-linux-gnu/libcuda*
lrwxrwxrwx 1 root root   12 Aug 27 11:24 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root   17 Aug 27 11:24 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.390.87
-rw-r--r-- 1 root root 9.6M Aug 21 16:24 /usr/lib/x86_64-linux-gnu/libcuda.so.390.87
$ docker run -it --rm --runtime nvidia my.registry.com/nvidia:driver_test bash -c "ls -lah /usr/lib/x86_64-linux-gnu/libcuda*"
lrwxrwxrwx 1 root root   17 Nov  5 19:13 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.390.77
lrwxrwxrwx 1 root root   17 Nov  5 19:19 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.390.87
-rw-r--r-- 1 root root    0 Nov  5 19:13 /usr/lib/x86_64-linux-gnu/libcuda.so.390.77
-rw-r--r-- 1 root root 9.6M Aug 21 23:24 /usr/lib/x86_64-linux-gnu/libcuda.so.390.87

Why do I have residual symlinks for the NVIDIA driver (/usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.390.77) in the Docker image built on Machine A, and how do I remove them?

lrwxrwxrwx 1 root root 17 Nov 5 19:13 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.390.77
-rw-r--r-- 1 root root 0 Nov 5 19:13 /usr/lib/x86_64-linux-gnu/libcuda.so.390.77

Is this expected behavior? We want to support client machines with different minor versions. How do we fix this?

Detailed Information below

Machine A

$ dpkg -l "*cuda*" | grep ii
ii  libcuda1-390   390.77-0ubuntu0~gpu16.04.1 amd64        NVIDIA CUDA runtime library
$ dpkg -l "*nvidia*" | grep ii
ii  libnvidia-container-tools        1.0.0~rc.2-1               amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64       1.0.0~rc.2-1               amd64        NVIDIA container runtime library
ii  nvidia-390                       390.77-0ubuntu0~gpu16.04.1 amd64        NVIDIA binary driver - version 390.77
ii  nvidia-390-dev                   390.77-0ubuntu0~gpu16.04.1 amd64        NVIDIA binary Xorg driver development files
ii  nvidia-container-runtime         2.0.0+docker18.03.1-1      amd64        NVIDIA container runtime
ii  nvidia-container-runtime-hook    1.4.0-1                    amd64        NVIDIA container runtime hook
ii  nvidia-docker2                   2.0.3+docker18.03.1-1      all          nvidia-docker CLI wrapper
ii  nvidia-opencl-icd-390            390.77-0ubuntu0~gpu16.04.1 amd64        NVIDIA OpenCL ICD
ii  nvidia-prime                     0.8.2                      amd64        Tools to enable NVIDIA's Prime
ii  nvidia-settings                  396.24-0ubuntu0~gpu16.04.1 amd64        Tool for configuring the NVIDIA graphics driver
$ docker version
Client:
 Version:      18.03.1-ce
 API version:  1.37
 Go version:   go1.9.5
 Git commit:   9ee9f40
 Built:        Thu Apr 26 07:17:20 2018
 OS/Arch:      linux/amd64
 Experimental: false
 Orchestrator: swarm

Server:
 Engine:
  Version:      18.03.1-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.5
  Git commit:   9ee9f40
  Built:        Thu Apr 26 07:15:30 2018
  OS/Arch:      linux/amd64
  Experimental: false
$ nvidia-container-cli -V
version: 1.0.0
build date: 2018-06-11T22:51+00:00
build revision: e3a2035da5a44b8a83d9568b91a8a0b542ee15d5
build compiler: gcc-5 5.4.0 20160609
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

Machine B

$ dpkg -l "*cuda*" | grep ii
ii  libcuda1-390   390.87-0ubuntu0~gpu16.04.1 amd64        NVIDIA CUDA runtime library
$ dpkg -l "*nvidia*" | grep ii
ii  libnvidia-container-tools        1.0.0-1                    amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64       1.0.0-1                    amd64        NVIDIA container runtime library
ii  nvidia-390                       390.87-0ubuntu0~gpu16.04.1 amd64        NVIDIA binary driver - version 390.87
ii  nvidia-390-dev                   390.87-0ubuntu0~gpu16.04.1 amd64        NVIDIA binary Xorg driver development files
ii  nvidia-container-runtime         2.0.0+docker18.06.1-1      amd64        NVIDIA container runtime
ii  nvidia-container-runtime-hook    1.4.0-1                    amd64        NVIDIA container runtime hook
ii  nvidia-docker2                   2.0.3+docker18.06.1-1      all          nvidia-docker CLI wrapper
ii  nvidia-modprobe                  361.28-1                   amd64        utility to load NVIDIA kernel modules and create device nodes
ii  nvidia-prime                     0.8.2                      amd64        Tools to enable NVIDIA's Prime
ii  nvidia-settings                  410.73-0ubuntu0~gpu16.04.1 amd64        Tool for configuring the NVIDIA graphics driver
$ docker version
Client:
 Version:           18.06.1-ce
 API version:       1.38
 Go version:        go1.10.3
 Git commit:        e68fc7a
 Built:             Tue Aug 21 17:24:56 2018
 OS/Arch:           linux/amd64
 Experimental:      false

Server:
 Engine:
  Version:          18.06.1-ce
  API version:      1.38 (minimum version 1.12)
  Go version:       go1.10.3
  Git commit:       e68fc7a
  Built:            Tue Aug 21 17:23:21 2018
  OS/Arch:          linux/amd64
  Experimental:     false
$ nvidia-container-cli -V
version: 1.0.0
build date: 2018-09-20T20:18+00:00
build revision: 881c88e2e5bb682c9bb14e68bd165cfb64563bb1
build compiler: gcc-5 5.4.0 20160609
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
3XX0 (Member) commented Nov 5, 2018

You are most likely building on a machine which is configured with the nvidia runtime as the default Docker runtime.
This is not recommended; you should build your image without GPU support (see https://github.com/NVIDIA/nvidia-docker/wiki/Frequently-Asked-Questions#can-i-use-the-gpu-during-a-container-build-ie-docker-build).
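
For reference, a minimal sketch of how to check and fix this on the build machine, assuming the usual nvidia-docker2 setup with /etc/docker/daemon.json (adjust for your install):

# Check which runtime docker build steps will use on the build machine.
docker info 2>/dev/null | grep -i 'default runtime'

# If it reports "nvidia", remove the "default-runtime": "nvidia" entry from
# /etc/docker/daemon.json (keep the "runtimes" section so docker run
# --runtime nvidia still works), then restart the daemon:
sudo systemctl restart docker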

bhack commented Nov 5, 2018

@3XX0 So you think it is currently not using the stub libcuda.so because the GPU is enabled at build time, right?

3XX0 (Member) commented Nov 5, 2018

Yes, it's definitely not using the stub for linking, otherwise you wouldn't see such an error.

Usually, if you see empty driver files in your final build image, it means that you used GPU support at build time. These files should be harmless, but can cause confusion if you are not expecting them.

Also, if you see something like libcuda.so -> libcuda.so.<DRIVER_VERSION>, you're probably using libnvidia-container < 1.0 (in this specific case, Machine A needs to be updated). There was an issue with symlinks not pointing to sonames, which has since been fixed.
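
A quick way to check an image for these leftovers (using the image name from the report above, and the default runc runtime so the NVIDIA hook doesn't mount the host driver into the container being inspected):

# Inspect the image without --runtime nvidia; a cleanly built image shows
# nothing here, while zero-byte libcuda.so.<version> files or version-specific
# symlinks mean GPU support was active at build time.
docker run --rm my.registry.com/nvidia:driver_test \
    bash -c "ls -lah /usr/lib/x86_64-linux-gnu/libcuda*"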

3XX0 (Member) commented Nov 5, 2018

So @douglas-gibbons, TL;DR:

  • Update libnvidia-container on Machine A
  • Remove default-runtime=nvidia from your Docker config on your build machine (Machine A?)
  • Make sure you link against the libcuda stub library (see the sketch after this list)
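
For the last point, a hedged sketch of what linking against the stub looks like inside a CUDA devel image (the stub path is the one shipped in the nvidia/cuda images; example.cpp and example-app are placeholder names):

# Link against the driver stub from the CUDA toolkit instead of the host
# driver library; the stub only satisfies the link step, and the real
# libcuda.so.1 is injected by the NVIDIA runtime when the container runs.
g++ example.cpp -o example-app \
    -I/usr/local/cuda/include \
    -L/usr/local/cuda/lib64/stubs -lcuda

# Rough CMake equivalent:
#   link_directories(/usr/local/cuda/lib64/stubs)
#   target_link_libraries(example-app cuda)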

bhack commented Nov 5, 2018

@3XX0 In the image I already see /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.<DRIVER_VERSION>, and the stub is at /usr/local/cuda/lib64/stubs/libcuda.so.
But even when linking without default-runtime=nvidia, it was still linking against /usr/lib/x86_64-linux-gnu/libcuda.so. Who could have introduced this wrong symlink?

bhack commented Nov 5, 2018

Of course, <DRIVER_VERSION> mismatches exactly like the Machine B example from @douglas-gibbons: I see /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.390.77 but the machine has 390.87.

3XX0 (Member) commented Nov 5, 2018

Who could have introduced this wrong symlink?

Whoever built the image you're using as a base.

bhack commented Nov 5, 2018

@3XX0 OK, I'll try to rebuild the full chain myself.

bhack commented Nov 6, 2018

@3XX0 OK, rebuilding the full chain with runc, the stub is fine.
@douglas-gibbons, try to rebuild with the default runtime.

bhack commented Nov 7, 2018

@3XX0 Sometimes I am building on a host that uses Docker Compose, and Docker Compose doesn't support runtime selection anymore (see docker/app#241).
Do you think that removing the NVIDIA driver symbolic links, e.g. /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.390.77, as the last build step in the Dockerfile will have any side effects?
I know it is not an elegant solution, but I don't see any alternative in this specific case.

bhack commented Nov 7, 2018

@3XX0 Do you think that setting NVIDIA_VISIBLE_DEVICES to void could be a valid workaround at build time?
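
A sketch of what that would look like, based on the minimal Dockerfile from the original report; whether it fully avoids the leak is exactly the open question here:

FROM nvidia/cudagl:9.0-devel-ubuntu16.04

# NVIDIA_VISIBLE_DEVICES set to void (or empty) is supposed to make the
# nvidia runtime behave like plain runc, so build-time RUN steps should not
# get the host driver injected even when nvidia is the default runtime.
ENV NVIDIA_VISIBLE_DEVICES=void

RUN echo "Hello Docker!!"

# Caveat: the variable is baked into the final image, so GPU containers
# started from it need it overridden again, e.g.:
#   docker run --runtime nvidia -e NVIDIA_VISIBLE_DEVICES=all ...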

douglas-gibbons (Author) commented

Many thanks @3XX0 - I've been trying to use the nvidia runtime as an all-purpose workhorse. You've made me reconsider that approach. While I go about fixing things properly, for now I've just (forgive me) added this to the end of the Dockerfile:

 RUN rm -f /usr/lib/x86_64-linux-gnu/libcuda.so /usr/lib/x86_64-linux-gnu/libcuda.so.1 

...and that has solved the immediate issue.

bhack commented Nov 9, 2018

@3XX0 @flx42 Same here. I don't think docker/app#241 will be resolved in the mid term, so it is a pain to switch the default runtime between docker build and Docker Compose. If you have any elegant workaround, let me know.
