tensorflow: libcuda not discovered. #65515

c00w · 2019-07-28T17:46:50Z

Describe the bug
TensorFlow does not detect GPUs under python3. This appears to be caused by a failure to load libcuda.so.

To Reproduce
Steps to reproduce the behavior:
% cat gpu.py ~/brood/src/ml
#! /usr/bin/env nix-shell
#! nix-shell -i python3 -p "with python3Packages; [Keras tensorflowWithCuda]"
import tensorflow
print(tensorflow.test.is_gpu_available())

chmod +x gpu.py && ./gpu.py

Expected behavior
The script should print True and emit no warnings about libcuda.

Screenshots

 % ./gpu.py                                                    ~/brood/src/ml
2019-07-28 13:45:18.710887: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-07-28 13:45:18.736256: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2592000000 Hz
2019-07-28 13:45:18.736869: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0xc4b5c0 executing computations on platform Host. Devices:
2019-07-28 13:45:18.736902: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-07-28 13:45:18.738233: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2019-07-28 13:45:18.738252: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: UNKNOWN ERROR (303)
2019-07-28 13:45:18.738271: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: galaxy
2019-07-28 13:45:18.738278: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: galaxy
2019-07-28 13:45:18.738303: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: Not found: was unable to find libcuda.so DSO loaded into this program
2019-07-28 13:45:18.738336: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 430.34.0
False

Additional context

% nvidia-smi                                                  ~/brood/src/ml
Sun Jul 28 13:46:00 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.34       Driver Version: 430.34       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106...  Off  | 00000000:09:00.0  On |                  N/A |
|  0%   47C    P0    29W / 180W |    125MiB /  6078MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0       907      G   ...i2cqbhmv1bgxp5-xorg-server-1.20.5/bin/X   123MiB |
+-----------------------------------------------------------------------------+

Metadata
Please run nix run nixpkgs.nix-info -c nix-info -m and paste the result.

% nix run nixpkgs.nix-info -c nix-info -m                     ~/brood/src/ml
 - system: `"x86_64-linux"`
 - host os: `Linux 4.19.60, NixOS, 19.09pre-git (Loris)`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.2.2`
 - channels(colin): `""`
 - channels(root): `"nixos-19.09pre186563.b5f5c97f7d6"`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`

Maintainer information:

# a list of nixpkgs attributes affected by the problem
attribute:
- python3Packages.tensorflowWithCuda
# a list of nixos modules affected by the problem
module:


c00w · 2019-07-28T17:47:18Z

libcuda.so.1 does appear to be in the correct directory

% ls /var/run/opengl-driver/lib/                              ~/brood/src/ml
libcuda.so                     libnvidia-compiler.so                libnvidia-ifr.so
libcuda.so.1                   libnvidia-compiler.so.1              libnvidia-ifr.so.1
libcuda.so.430.34              libnvidia-compiler.so.430.34         libnvidia-ifr.so.430.34
libEGL_nvidia.so               libnvidia-eglcore.so                 libnvidia-ml.so
libEGL_nvidia.so.0             libnvidia-eglcore.so.1               libnvidia-ml.so.1
libEGL_nvidia.so.430.34        libnvidia-eglcore.so.430.34          libnvidia-ml.so.430.34
libGLESv1_CM_nvidia.so         libnvidia-egl-wayland.so             libnvidia-opencl.so
libGLESv1_CM_nvidia.so.1       libnvidia-egl-wayland.so.1           libnvidia-opencl.so.1
libGLESv1_CM_nvidia.so.430.34  libnvidia-egl-wayland.so.1.1.2       libnvidia-opencl.so.430.34
libGLESv2_nvidia.so            libnvidia-encode.so                  libnvidia-opticalflow.so
libGLESv2_nvidia.so.1          libnvidia-encode.so.1                libnvidia-opticalflow.so.1
libGLESv2_nvidia.so.430.34     libnvidia-encode.so.430.34           libnvidia-opticalflow.so.430.34
libGLX_nvidia.so               libnvidia-fatbinaryloader.so         libnvidia-ptxjitcompiler.so
libGLX_nvidia.so.0             libnvidia-fatbinaryloader.so.1       libnvidia-ptxjitcompiler.so.1
libGLX_nvidia.so.430.34        libnvidia-fatbinaryloader.so.430.34  libnvidia-ptxjitcompiler.so.430.34
libglxserver_nvidia.so         libnvidia-fbc.so                     libnvidia-rtcore.so
libglxserver_nvidia.so.1       libnvidia-fbc.so.1                   libnvidia-rtcore.so.1
libglxserver_nvidia.so.430.34  libnvidia-fbc.so.430.34              libnvidia-rtcore.so.430.34
libnvcuvid.so                  libnvidia-glcore.so                  libnvidia-tls.so
libnvcuvid.so.1                libnvidia-glcore.so.1                libnvidia-tls.so.1
libnvcuvid.so.430.34           libnvidia-glcore.so.430.34           libnvidia-tls.so.430.34
libnvidia-cbl.so               libnvidia-glsi.so                    libnvoptix.so
libnvidia-cbl.so.1             libnvidia-glsi.so.1                  libnvoptix.so.1
libnvidia-cbl.so.430.34        libnvidia-glsi.so.430.34             libnvoptix.so.430.34
libnvidia-cfg.so               libnvidia-glvkspirv.so               vdpau
libnvidia-cfg.so.1             libnvidia-glvkspirv.so.1
libnvidia-cfg.so.430.34        libnvidia-glvkspirv.so.430.34

c00w · 2019-07-28T18:29:21Z

I can confirm that reverting from unstable back to 19.03 fixes this.

gloaming · 2019-07-29T11:12:27Z

We no longer set LD_LIBRARY_PATH=/run/opengl-driver/lib. As a workaround, you can set it on the command line.
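A quick way to check whether the workaround takes effect is to try loading the driver library from Python directly. The following is only a minimal sketch: the helper name `can_dlopen` is made up for illustration, and it tests the search path seen by the Python process (LD_LIBRARY_PATH plus the loader defaults), not the RUNPATH baked into TensorFlow's own shared objects.

```python
import ctypes


def can_dlopen(soname):
    """Return True if the dynamic loader can open `soname` from this process.

    Hypothetical helper: it only reflects the search path of the calling
    Python process, not the RUNPATH of TensorFlow's shared libraries.
    """
    try:
        ctypes.CDLL(soname)
    except OSError:
        # dlopen failed, e.g. "cannot open shared object file"
        return False
    return True


if __name__ == "__main__":
    print("libcuda.so.1 loadable:", can_dlopen("libcuda.so.1"))
```

Running it once with and once without LD_LIBRARY_PATH=/run/opengl-driver/lib should show whether the variable is what makes the library visible.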

I'm not sure why it's not working, though. The proper fix is to use patchelf to set the library path, but that has already been done:

nixpkgs/pkgs/development/python-modules/tensorflow/bin.nix

Lines 87 to 95 in cf82a58

    
  postFixup = let
    rpath = stdenv.lib.makeLibraryPath
      ([ stdenv.cc.cc.lib zlib ] ++ lib.optionals cudaSupport [ cudatoolkit_joined cudnn nvidia_x11 ]);
  in
  lib.optionalString (stdenv.isLinux) ''
    rrPath="$out/${python.sitePackages}/tensorflow/:$out/${python.sitePackages}/tensorflow/contrib/tensor_forest/:${rpath}"
    internalLibPath="$out/${python.sitePackages}/tensorflow/python/_pywrap_tensorflow_internal.so"
    find $out -name '*${stdenv.hostPlatform.extensions.sharedLibrary}' -exec patchelf --set-rpath "$rrPath" {} \;
  '';

Can you use strace to see which folders it's looking in?

cc @ambrop72

(It looks like patchelf also needs to be applied in pkgs/development/python-modules/tensorflow/default.nix, although nothing is using it AFAICT.)

ambrop72 · 2019-07-29T15:13:28Z

The problem is that lib/python3.7/site-packages/tensorflow/libtensorflow_framework.so.1 is not matched by the find, which only matches *.so, and doesn't get the RUNPATH set. See for yourself using readelf -d.
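The glob mismatch can be reproduced without building anything. This is a rough sketch, assuming that `find -name` behaves like shell-style globbing on the file name; `unmatched_shared_libs` is a hypothetical helper, and the file list contains only the two names mentioned in this thread.

```python
from fnmatch import fnmatchcase


def unmatched_shared_libs(filenames, patterns=("*.so",)):
    """Return names that look like shared libraries but match no pattern.

    Hypothetical helper: mimics `find -name PATTERN` with shell-style globs.
    """
    return [
        name
        for name in filenames
        if ".so" in name and not any(fnmatchcase(name, p) for p in patterns)
    ]


files = ["_pywrap_tensorflow_internal.so", "libtensorflow_framework.so.1"]

# With only '*.so', the versioned library is skipped and never patched.
print(unmatched_shared_libs(files))
# Adding '*.so.1' covers it as well.
print(unmatched_shared_libs(files, patterns=("*.so", "*.so.1")))
```

The first call lists libtensorflow_framework.so.1 as unmatched, and the second returns an empty list.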

However, even if I run it with LD_LIBRARY_PATH=/run/opengl-driver/lib ./gpu.py, it fails later on due to failing to load libcudart.so.10.0 (which I think is not from the driver but from cudatoolkit):

$ LD_LIBRARY_PATH=/run/opengl-driver/lib ./gpu.py    
2019-07-29 17:09:28.219361: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-07-29 17:09:28.244423: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2808000000 Hz
2019-07-29 17:09:28.244957: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0xe27d40 executing computations on platform Host. Devices:
2019-07-29 17:09:28.244995: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>
2019-07-29 17:09:28.246734: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcuda.so.1
2019-07-29 17:09:28.324738: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-29 17:09:28.325203: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1dd87c0 executing computations on platform CUDA. Devices:
2019-07-29 17:09:28.325228: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce GTX 1060, Compute Capability 6.1
2019-07-29 17:09:28.325381: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1005] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-07-29 17:09:28.325958: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1640] Found device 0 with properties: 
name: GeForce GTX 1060 major: 6 minor: 1 memoryClockRate(GHz): 1.6705
pciBusID: 0000:01:00.0
2019-07-29 17:09:28.326019: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /run/opengl-driver/lib
2019-07-29 17:09:28.326059: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /run/opengl-driver/lib
2019-07-29 17:09:28.326078: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /run/opengl-driver/lib
2019-07-29 17:09:28.326097: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /run/opengl-driver/lib
2019-07-29 17:09:28.326117: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /run/opengl-driver/lib
2019-07-29 17:09:28.326136: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /run/opengl-driver/lib
2019-07-29 17:09:28.326156: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /run/opengl-driver/lib
2019-07-29 17:09:28.326164: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...
2019-07-29 17:09:28.326178: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1181] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-07-29 17:09:28.326186: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1187]      0 
2019-07-29 17:09:28.326193: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1200] 0:   N 
False

Finally, I think it is wrong that the package refers directly to nvidia-x11. This will be a problem if the system is configured to use a driver version other than the default, such as one of the legacy drivers or a beta driver. It should be fixed to put ${addOpenGLRunpath.driverLink}/lib into RUNPATH instead of nvidia-x11.

ambrop72 · 2019-07-29T15:14:32Z

Did anyone else try with LD_LIBRARY_PATH? Are you getting the same error? It looks unrelated to the LD_LIBRARY_PATH removal.

gloaming · 2019-07-29T17:36:23Z

Nice! Cool bug. I don't have an nvidia gpu, so I can't test it on my machine.

c00w · 2019-07-30T00:16:32Z

The relevant lines from the strace output appear to be:

openat(AT_FDCWD, "/nix/store/kvagf73z1v6y0wqfvqm04f5h02mdgsfi-python3.7-tensorflow-1.14.0/lib/python3.7/site-packages/tensorflow/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/nix/store/iykxb0bmfjmi7s53kfg6pjbfpd8jmza6-glibc-2.27/lib/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
stat("/etc/localtime", {st_mode=S_IFREG|0444, st_size=3536, ...}) = 0
write(2, "2019-07-29 20:11:16.731597: I te"..., 2142019-07-29 20:11:16.731597: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
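For longer traces, the directories that were searched can be pulled out mechanically. The sketch below is only an illustration: `searched_dirs` and the regular expression are made up here, and they assume `openat(AT_FDCWD, "...", flags) = result` lines shaped like the ones above.

```python
import posixpath
import re

# Matches: openat(AT_FDCWD, "<path>", <flags>) = <ret> [<ERRNO>]
OPENAT_RE = re.compile(
    r'openat\(AT_FDCWD, "([^"]+)", [^)]*\)\s*=\s*(-?\d+)(?:\s+(\w+))?'
)


def searched_dirs(strace_text, library):
    """Return (directory, status) pairs for every attempt to open `library`.

    Hypothetical helper; status is the errno name (e.g. ENOENT) or "ok".
    """
    results = []
    for line in strace_text.splitlines():
        match = OPENAT_RE.search(line)
        if match is None:
            continue
        path, _ret, errno_name = match.groups()
        if posixpath.basename(path) != library:
            continue
        results.append((posixpath.dirname(path), errno_name or "ok"))
    return results


sample = (
    'openat(AT_FDCWD, "/some/store/path/lib/libcuda.so.1", '
    "O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)\n"
    'stat("/etc/localtime", {st_mode=S_IFREG|0444, ...}) = 0\n'
)
for directory, status in searched_dirs(sample, "libcuda.so.1"):
    print(status, directory)
```

The `sample` string uses a placeholder path. Feeding in a real trace would show that /run/opengl-driver/lib is absent from the searched directories.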

c00w · 2019-07-30T00:23:58Z

I can confirm that the following patch fixes it:

 % git diff                                          ~/nixpkgs/pkgs/development/python-modules/tensorflow
diff --git a/pkgs/development/python-modules/tensorflow/bin.nix b/pkgs/development/python-modules/tensorflow/bin.nix
index d02a4e1b9f2..9affbac9142 100644
--- a/pkgs/development/python-modules/tensorflow/bin.nix
+++ b/pkgs/development/python-modules/tensorflow/bin.nix
@@ -92,6 +92,7 @@ in buildPythonPackage rec {
     rrPath="$out/${python.sitePackages}/tensorflow/:$out/${python.sitePackages}/tensorflow/contrib/tensor_forest/:${rpath}"
     internalLibPath="$out/${python.sitePackages}/tensorflow/python/_pywrap_tensorflow_internal.so"
     find $out -name '*${stdenv.hostPlatform.extensions.sharedLibrary}' -exec patchelf --set-rpath "$rrPath" {} \;
+    find $out -name '*${stdenv.hostPlatform.extensions.sharedLibrary}.1' -exec patchelf --set-rpath "$rrPath" {} \;
   '';

c00w added the 0.kind: bug label Jul 28, 2019

c00w mentioned this issue Jul 30, 2019

pythonPackages.tensorflow: Hardcode a second search class. #65584

Merged

10 tasks

abbradar closed this as completed in #65584 Jul 31, 2019
