tensorflow: libcuda not discovered. #65515

Closed
c00w opened this issue Jul 28, 2019 · 8 comments · Fixed by #65584

Comments

@c00w
Contributor

c00w commented Jul 28, 2019

Describe the bug
TensorFlow does not detect GPUs in python3. This appears to be caused by a failure to load libcuda.so.1.

To Reproduce
Steps to reproduce the behavior:
% cat gpu.py
#! /usr/bin/env nix-shell
#! nix-shell -i python3 -p "with python3Packages; [Keras tensorflowWithCuda]"
import tensorflow
print(tensorflow.test.is_gpu_available())

  1. chmod +x gpu.py && ./gpu.py

Expected behavior
The script should print True and emit no warnings about libcuda.

Screenshots

 % ./gpu.py                                                    ~/brood/src/ml
2019-07-28 13:45:18.710887: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-07-28 13:45:18.736256: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2592000000 Hz
2019-07-28 13:45:18.736869: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0xc4b5c0 executing computations on platform Host. Devices:
2019-07-28 13:45:18.736902: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-07-28 13:45:18.738233: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2019-07-28 13:45:18.738252: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: UNKNOWN ERROR (303)
2019-07-28 13:45:18.738271: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: galaxy
2019-07-28 13:45:18.738278: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: galaxy
2019-07-28 13:45:18.738303: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
2019-07-28 13:45:18.738336: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 430.34.0
False
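
The dlopen failure in the log above can be checked outside TensorFlow. The following is a minimal sketch, not TensorFlow's own loader: `try_dlopen` is a hypothetical helper that asks the dynamic loader for a library by name through `ctypes`. It uses the search path of the Python process (LD_LIBRARY_PATH plus the loader defaults), which may differ from the RUNPATH baked into TensorFlow's shared objects, so a result here is only a hint.

```python
import ctypes


def try_dlopen(name):
    """Try to load a shared library by name via the dynamic loader.

    Returns (True, None) on success, or (False, error message) on failure.
    """
    try:
        ctypes.CDLL(name)
    except OSError as exc:
        # The message is the loader's error string, e.g. "cannot open shared object file".
        return False, str(exc)
    return True, None
```

Calling `try_dlopen("libcuda.so.1")` inside the same nix-shell shows whether the driver library is reachable from the process environment at all.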

Additional context

% nvidia-smi                                                  ~/brood/src/ml
Sun Jul 28 13:46:00 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.34       Driver Version: 430.34       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:09:00.0  On |                  N/A |
|  0%   47C    P0    29W / 180W |    125MiB /  6078MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0       907      G   ...i2cqbhmv1bgxp5-xorg-server-1.20.5/bin/X   123MiB |
+-----------------------------------------------------------------------------+

Metadata
Please run nix run nixpkgs.nix-info -c nix-info -m and paste the result.

% nix run nixpkgs.nix-info -c nix-info -m                     ~/brood/src/ml
 - system: `"x86_64-linux"`
 - host os: `Linux 4.19.60, NixOS, 19.09pre-git (Loris)`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.2.2`
 - channels(colin): `""`
 - channels(root): `"nixos-19.09pre186563.b5f5c97f7d6"`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`

Maintainer information:

# a list of nixpkgs attributes affected by the problem
attribute:
- python3Packages.tensorflowWithCuda
# a list of nixos modules affected by the problem
module:
@c00w
Contributor Author

c00w commented Jul 28, 2019

libcuda.so.1 does appear to be in the correct directory:

% ls /var/run/opengl-driver/lib/                              ~/brood/src/ml
libcuda.so                     libnvidia-compiler.so                libnvidia-ifr.so
libcuda.so.1                   libnvidia-compiler.so.1              libnvidia-ifr.so.1
libcuda.so.430.34              libnvidia-compiler.so.430.34         libnvidia-ifr.so.430.34
libEGL_nvidia.so               libnvidia-eglcore.so                 libnvidia-ml.so
libEGL_nvidia.so.0             libnvidia-eglcore.so.1               libnvidia-ml.so.1
libEGL_nvidia.so.430.34        libnvidia-eglcore.so.430.34          libnvidia-ml.so.430.34
libGLESv1_CM_nvidia.so         libnvidia-egl-wayland.so             libnvidia-opencl.so
libGLESv1_CM_nvidia.so.1       libnvidia-egl-wayland.so.1           libnvidia-opencl.so.1
libGLESv1_CM_nvidia.so.430.34  libnvidia-egl-wayland.so.1.1.2       libnvidia-opencl.so.430.34
libGLESv2_nvidia.so            libnvidia-encode.so                  libnvidia-opticalflow.so
libGLESv2_nvidia.so.1          libnvidia-encode.so.1                libnvidia-opticalflow.so.1
libGLESv2_nvidia.so.430.34     libnvidia-encode.so.430.34           libnvidia-opticalflow.so.430.34
libGLX_nvidia.so               libnvidia-fatbinaryloader.so         libnvidia-ptxjitcompiler.so
libGLX_nvidia.so.0             libnvidia-fatbinaryloader.so.1       libnvidia-ptxjitcompiler.so.1
libGLX_nvidia.so.430.34        libnvidia-fatbinaryloader.so.430.34  libnvidia-ptxjitcompiler.so.430.34
libglxserver_nvidia.so         libnvidia-fbc.so                     libnvidia-rtcore.so
libglxserver_nvidia.so.1       libnvidia-fbc.so.1                   libnvidia-rtcore.so.1
libglxserver_nvidia.so.430.34  libnvidia-fbc.so.430.34              libnvidia-rtcore.so.430.34
libnvcuvid.so                  libnvidia-glcore.so                  libnvidia-tls.so
libnvcuvid.so.1                libnvidia-glcore.so.1                libnvidia-tls.so.1
libnvcuvid.so.430.34           libnvidia-glcore.so.430.34           libnvidia-tls.so.430.34
libnvidia-cbl.so               libnvidia-glsi.so                    libnvoptix.so
libnvidia-cbl.so.1             libnvidia-glsi.so.1                  libnvoptix.so.1
libnvidia-cbl.so.430.34        libnvidia-glsi.so.430.34             libnvoptix.so.430.34
libnvidia-cfg.so               libnvidia-glvkspirv.so               vdpau
libnvidia-cfg.so.1             libnvidia-glvkspirv.so.1
libnvidia-cfg.so.430.34        libnvidia-glvkspirv.so.430.34

@c00w
Contributor Author

c00w commented Jul 28, 2019

I can confirm that reverting back to 19.03 from unstable fixes this.

@gloaming
Contributor

We don't set LD_LIBRARY_PATH=/run/opengl-driver/lib any more. You can set it on the command line as a workaround.

I'm not sure why it's not working, though. The fix is to use patchelf to fix the library path, but it's already been done:

postFixup = let
  rpath = stdenv.lib.makeLibraryPath
    ([ stdenv.cc.cc.lib zlib ] ++ lib.optionals cudaSupport [ cudatoolkit_joined cudnn nvidia_x11 ]);
in
lib.optionalString (stdenv.isLinux) ''
  rrPath="$out/${python.sitePackages}/tensorflow/:$out/${python.sitePackages}/tensorflow/contrib/tensor_forest/:${rpath}"
  internalLibPath="$out/${python.sitePackages}/tensorflow/python/_pywrap_tensorflow_internal.so"
  find $out -name '*${stdenv.hostPlatform.extensions.sharedLibrary}' -exec patchelf --set-rpath "$rrPath" {} \;
'';

Can you use strace to see which folders it's looking in?

cc @ambrop72

(It looks like patchelf also needs to be applied in pkgs/development/python-modules/tensorflow/default.nix, although nothing is using it AFAICT.)
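
The LD_LIBRARY_PATH workaround mentioned at the top of this comment can also be applied from a wrapper script. This is a rough sketch under the assumption that /run/opengl-driver/lib is the right directory on the machine in question; `with_driver_libs` and `run_with_driver_libs` are hypothetical helper names, not part of nixpkgs or TensorFlow.

```python
import os
import subprocess

# Directory quoted in the comment above; an assumption for any other system.
DRIVER_LIB_DIR = "/run/opengl-driver/lib"


def with_driver_libs(env, driver_dir=DRIVER_LIB_DIR):
    """Return a copy of env with driver_dir prepended to LD_LIBRARY_PATH."""
    new_env = dict(env)
    existing = new_env.get("LD_LIBRARY_PATH", "")
    new_env["LD_LIBRARY_PATH"] = driver_dir + (":" + existing if existing else "")
    return new_env


def run_with_driver_libs(argv):
    """Run argv with the driver directory on LD_LIBRARY_PATH; return the exit code."""
    return subprocess.run(argv, env=with_driver_libs(os.environ)).returncode
```

For example, `run_with_driver_libs(["./gpu.py"])` starts the reproduction script with the variable set for that one process only, leaving the calling environment untouched.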

@ambrop72
Contributor

The problem is that lib/python3.7/site-packages/tensorflow/libtensorflow_framework.so.1 is not matched by the find, which only matches *.so, so it never gets its RUNPATH set. You can verify this with readelf -d.
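
To make the readelf -d check repeatable, something like the following could be used. It is a sketch that assumes the usual binutils output format, where the dynamic section prints lines such as `(RUNPATH)  Library runpath: [dir1:dir2]`; `parse_runpath` and `runpath_of` are hypothetical helper names.

```python
import re
import subprocess

# Matches dynamic-section lines such as:
#  0x000000000000001d (RUNPATH)            Library runpath: [/a/lib:/b/lib]
_RUNPATH_RE = re.compile(r"\((RUNPATH|RPATH)\)\s+Library r(?:un)?path: \[(.*)\]")


def parse_runpath(readelf_output):
    """Return the RUNPATH/RPATH directories listed in `readelf -d` output."""
    dirs = []
    for line in readelf_output.splitlines():
        match = _RUNPATH_RE.search(line)
        if match:
            dirs.extend(d for d in match.group(2).split(":") if d)
    return dirs


def runpath_of(path):
    """Run `readelf -d` on a shared object and return its RUNPATH entries."""
    result = subprocess.run(
        ["readelf", "-d", path], check=True, capture_output=True, text=True
    )
    return parse_runpath(result.stdout)
```

An empty list from `runpath_of` for libtensorflow_framework.so.1 would be consistent with the diagnosis above.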

However, even if I run it with LD_LIBRARY_PATH=/run/opengl-driver/lib ./gpu.py, it fails later on because libcudart.so.10.0 cannot be loaded (which I think comes from cudatoolkit, not from the driver):

$ LD_LIBRARY_PATH=/run/opengl-driver/lib ./gpu.py    
2019-07-29 17:09:28.219361: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-07-29 17:09:28.244423: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2808000000 Hz
2019-07-29 17:09:28.244957: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0xe27d40 executing computations on platform Host. Devices:
2019-07-29 17:09:28.244995: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-07-29 17:09:28.246734: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-07-29 17:09:28.324738: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-29 17:09:28.325203: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1dd87c0 executing computations on platform CUDA. Devices:
2019-07-29 17:09:28.325228: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce GTX 1060, Compute Capability 6.1
2019-07-29 17:09:28.325381: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-29 17:09:28.325958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: GeForce GTX 1060 major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:01:00.0
2019-07-29 17:09:28.326019: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /run/opengl-driver/lib
2019-07-29 17:09:28.326059: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /run/opengl-driver/lib
2019-07-29 17:09:28.326078: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /run/opengl-driver/lib
2019-07-29 17:09:28.326097: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /run/opengl-driver/lib
2019-07-29 17:09:28.326117: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /run/opengl-driver/lib
2019-07-29 17:09:28.326136: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /run/opengl-driver/lib
2019-07-29 17:09:28.326156: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /run/opengl-driver/lib
2019-07-29 17:09:28.326164: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2019-07-29 17:09:28.326178: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-29 17:09:28.326186: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2019-07-29 17:09:28.326193: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
False
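
The list of libraries that failed to load can be pulled out of a log like the one above with a short script. This is a sketch that assumes the message wording shown in this log ("Could not dlopen library '...'"); other TensorFlow versions may phrase it differently, and `missing_libraries` is a hypothetical helper name.

```python
import re

_DLOPEN_FAIL_RE = re.compile(r"Could not dlopen library '([^']+)'")


def missing_libraries(log_text):
    """Return library names from dlopen-failure log lines, in order, without duplicates."""
    seen = []
    for name in _DLOPEN_FAIL_RE.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen
```

Feeding it the captured stderr of the run gives a compact list to compare against the directories in the RUNPATH.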

Finally, I think it is wrong that the package refers directly to nvidia-x11. This will be a problem if the system is configured to use a different version of the driver instead of the default, such as one of the legacy drivers or a beta driver. It should be fixed to put ${addOpenGLRunpath.driverLink}/lib into RUNPATH instead of nvidia-x11.

@ambrop72
Contributor

Did anyone else try with LD_LIBRARY_PATH? Are you getting the same error? It looks unrelated to the LD_LIBRARY_PATH removal.

@gloaming
Contributor

Nice! Cool bug. I don't have an nvidia gpu, so I can't test it on my machine.

@c00w
Contributor Author

c00w commented Jul 30, 2019

The relevant lines from strace appear to be:

openat(AT_FDCWD, "/nix/store/kvagf73z1v6y0wqfvqm04f5h02mdgsfi-python3.7-tensorflow-1.14.0/lib/python3.7/site-packages/tensorflow/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/nix/store/iykxb0bmfjmi7s53kfg6pjbfpd8jmza6-glibc-2.27/lib/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=3536, ...}) = 0
write(2, "2019-07-29 20:11:16.731597: I te"..., 2142019-07-29 20:11:16.731597: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
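
The directories the loader actually tried can be extracted from strace output like the lines above. This is a sketch that assumes the `openat(AT_FDCWD, "path", flags) = -1 ENOENT` line shape shown here; `searched_dirs` is a hypothetical helper name.

```python
import os
import re

# Captures the path argument of openat() calls that failed with ENOENT.
_OPENAT_ENOENT_RE = re.compile(r'openat\(AT_FDCWD, "([^"]+)", [^)]*\) = -1 ENOENT')


def searched_dirs(strace_text, lib_name):
    """Return the directories where lib_name was looked up and not found."""
    dirs = []
    for path in _OPENAT_ENOENT_RE.findall(strace_text):
        if os.path.basename(path) == lib_name:
            dirs.append(os.path.dirname(path))
    return dirs
```

For the two openat lines quoted above, this would return the tensorflow site-packages directory and the glibc lib directory. Neither is /run/opengl-driver/lib.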

@c00w
Contributor Author

c00w commented Jul 30, 2019

I can confirm the following patch fixes it:

 % git diff                                          ~/nixpkgs/pkgs/development/python-modules/tensorflow
diff --git a/pkgs/development/python-modules/tensorflow/bin.nix b/pkgs/development/python-modules/tensorflow/bin.nix
index d02a4e1b9f2..9affbac9142 100644
--- a/pkgs/development/python-modules/tensorflow/bin.nix
+++ b/pkgs/development/python-modules/tensorflow/bin.nix
@@ -92,6 +92,7 @@ in buildPythonPackage rec {
     rrPath="$out/${python.sitePackages}/tensorflow/:$out/${python.sitePackages}/tensorflow/contrib/tensor_forest/:${rpath}"
     internalLibPath="$out/${python.sitePackages}/tensorflow/python/_pywrap_tensorflow_internal.so"
     find $out -name '*${stdenv.hostPlatform.extensions.sharedLibrary}' -exec patchelf --set-rpath "$rrPath" {} \;
+    find $out -name '*${stdenv.hostPlatform.extensions.sharedLibrary}.1' -exec patchelf --set-rpath "$rrPath" {} \;
   '';
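
Why the second find is needed can be illustrated with glob matching. This is a sketch that uses Python's `fnmatch` as an approximation of the pattern matching done by find -name, and assumes the shared-library extension expands to .so on Linux, as stated earlier in the thread; `matched` is a hypothetical helper name.

```python
from fnmatch import fnmatchcase


def matched(names, patterns):
    """Return the names that match at least one of the glob patterns."""
    return [n for n in names if any(fnmatchcase(n, p) for p in patterns)]


# File names taken from the thread.
libs = ["_pywrap_tensorflow_internal.so", "libtensorflow_framework.so.1"]

before_patch = matched(libs, ["*.so"])           # pattern of the original find
after_patch = matched(libs, ["*.so", "*.so.1"])  # patterns after the patch
```

With the original pattern only the first file is selected, so libtensorflow_framework.so.1 is skipped by patchelf. With both patterns, both files are selected.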
