This repository has been archived by the owner on May 27, 2021. It is now read-only.

Error 999 on bumblebee with wrong libGL activated #50

Closed
gdkrmr opened this issue Jul 4, 2017 · 24 comments

Comments

@gdkrmr

gdkrmr commented Jul 4, 2017

Trying to build CUDAdrv I get the following error, could this be, because I am running on a Laptop with bumblebee and two graphics cards? I used optirun julia and did Pkg.checkout("CUDAdrv"); bumblebee is working for other programs:

               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: https://docs.julialang.org
   _ _   _| |_  __ _   |  Type "?help" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.6.1-pre.0 (2017-06-19 13:06 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit dcf39a1* (15 days old release-0.6)
|__/                   |  x86_64-linux-gnu

julia> Pkg.build("CUDAdrv")
INFO: Building CUDAdrv
=========================================[ ERROR: CUDAdrv ]==========================================

LoadError: Initializing CUDA driver failed: unknown error (code 999).
while loading /home/gkraemer/.julia/v0.6/CUDAdrv/deps/build.jl, in expression starting on line 119

=====================================================================================================

==========================================[ BUILD ERRORS ]===========================================

WARNING: CUDAdrv had build errors.

 - packages with build errors remain installed in /home/gkraemer/.julia/v0.6
 - build the package(s) and all dependencies with `Pkg.build("CUDAdrv")`
 - build a single package by running its `deps/build.jl` script

=====================================================================================================

@maleadt
Member

maleadt commented Jul 4, 2017

could this be, because I am running on a Laptop with bumblebee and two graphics cards?

No idea, you'll need to provide more information for any diagnosis.

Did you see the suggestions in the documentation?

unknown error (code 999): this often indicates that your set-up is broken, eg. because you didn't load the correct, or any, kernel module. Please verify your set-up, on Linux by executing nvidia-smi or on other platforms by compiling and running CUDA C code using nvcc.

@gdkrmr
Author

gdkrmr commented Jul 5, 2017

Digging a little bit deeper, I found that tensorflow is not working either, probably because the provided binaries are for CUDA 7.5 and I am on 8.0 (I haven't rebuilt tensorflow yet to check this). Could that be an issue? What nvidia driver versions do you support, I currently have 375 installed?

@maleadt
Member

maleadt commented Jul 5, 2017

What nvidia driver versions do you support, I currently have 375 installed?

Shouldn't be a problem, I'm currently working with (and hence support):

1. 375.39 (installed)
2. 375.66: long-lived (installed)
3. 378.13 (installed)
4. 381.09 (installed)
5. 381.22: short-lived (installed)
6. 384.47: beta (installed)

but in general, our support depends on the CUDA API level exported by the driver library, which is currently 8.0. So unless you have access to the CUDA 9 beta, driver support shouldn't be an issue.

Again, can you execute nvidia-smi and compile & execute regular CUDA C code using nvcc?

@gdkrmr
Author

gdkrmr commented Jul 5, 2017

I could compile all the samples that came with cuda (the folder /usr/share/cuda-8.0/samples) but they wouldn't run.
Then I changed

update-alternatives --config x86_64-linux-gnu_gl_conf 

to auto (which is using the nvidia driver) and suddenly I can build CUDAdrv and CUDArt. I hope I did not break anything else. I will let you know if I run into more trouble, thanks for the help!
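To record which GL configuration is active before and after such a change, the alternative can be queried non-interactively. The following is a minimal sketch, assuming a Debian/Ubuntu `update-alternatives` that supports `--query`; the function name `gl_conf_summary` is made up for illustration.

```shell
#!/bin/sh
# Summarise `update-alternatives --query` output read from stdin.
# Prints the selection mode (auto/manual) and the currently selected path.
gl_conf_summary() {
    awk -F': ' '
        $1 == "Status" { status = $2 }
        # Keep only the first "Value:" field, which belongs to the link group header.
        $1 == "Value" && value == "" { value = $2 }
        END { printf "status=%s value=%s\n", status, value }
    '
}

# Usage (no root needed for a query):
#   update-alternatives --query x86_64-linux-gnu_gl_conf | gl_conf_summary
```

Saving this one-line summary for both the mesa and the nvidia setting makes later reports easier to compare.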

@maleadt
Member

maleadt commented Jul 5, 2017

That is strange, using NVIDIA's libgl shouldn't impact raw usage of libcuda.so.
Again, did you run nvidia-smi, and did it work? Assuming it failed, did you see any error in dmesg?
Maybe you have multiple libcuda.so files in your system, and that update-alternatives call cascaded into changing the active symlink to a different one of those?

@gdkrmr
Author

gdkrmr commented Jul 5, 2017

I could run nvidia-smi before and after (with optirun). I had cuda 7.5 before, but it got removed from the system when I installed cuda 8.0. Graphical applications also worked before with optirun. Compiling cuda applications worked before as well; they just didn't want to run, complaining about not finding a device.

@maleadt
Member

maleadt commented Jul 5, 2017

That is all very confusing... can't really use the info to improve the build system. Glad it's working now though!
Next time you run into the issue, would you mind gathering as much information as possible? eg. running Pkg.build with DEBUG=1 (if that option still exists by then, as it is bound to change, just check the documentation at that point), run a compiled CUDA application through strace and ldd to see exactly what it picks up, etc. Thanks!

@maleadt maleadt closed this as completed Jul 5, 2017
@gdkrmr
Author

gdkrmr commented Jul 6, 2017

julia> ENV["DEBUG"] = "1"
julia> Pkg.build("CUDAdrv")
INFO: Building CUDAdrv
DEBUG: Found libcuda at /usr/lib/x86_64-linux-gnu/libcuda.so
DEBUG: Vendor: NVIDIA
===============================[ ERROR: CUDAdrv ]===============================

LoadError: CUDA error 999 calling cuInit
while loading /home/gkraemer/.julia/v0.5/CUDAdrv/deps/build.jl, in expression starting on line 107

================================================================================

================================[ BUILD ERRORS ]================================

WARNING: CUDAdrv had build errors.

 - packages with build errors remain installed in /home/gkraemer/.julia/v0.5
 - build the package(s) and all dependencies with `Pkg.build("CUDAdrv")`
 - build a single package by running its `deps/build.jl` script

================================================================================

@gdkrmr
Author

gdkrmr commented Jul 6, 2017

The good part is that I can control the error now :D

@maleadt
Member

maleadt commented Jul 6, 2017

OK, we can simplify all that to the following now:

$ julia -e 'ccall((:cuInit, "/usr/lib/x86_64-linux-gnu/libcuda.so"), Cint, (Cint,), 0)' 

This still produces 999, right?

What does ldd /usr/lib/x86_64-linux-gnu/libcuda.so produce?

If you compile the following file, test.cu:

#include <cuda_runtime.h>
int main() { cudaFree(0); return 0; }

with:

nvcc test.cu -o test

what libraries does it open:

strace ./test |& grep libcuda

Of course, add optirun wherever necessary, I'm not familiar with bumblebee.

@maleadt maleadt reopened this Jul 6, 2017
@maleadt maleadt changed the title from "error 999 when building" to "Error 999 on bumblebee with wrong libGL activated" Jul 6, 2017
@gdkrmr
Author

gdkrmr commented Jul 6, 2017

all of this is with

update-alternatives --config x86_64-linux-gnu_gl_conf

set to the mesa driver

OK, we can simplify all that to the following now:

$ julia -e 'ccall((:cuInit, "/usr/lib/x86_64-linux-gnu/libcuda.so"), Cint, (Cint,), 0)'

This still produces 999, right?

no, does not produce any error message

What does ldd /usr/lib/x86_64-linux-gnu/libcuda.so produce?

$ ldd /usr/lib/x86_64-linux-gnu/libcuda.so
	linux-vdso.so.1 =>  (0x00007ffd39bd5000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb49289b000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb4924d0000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fb4922cc000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb4920af000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fb491ea6000)
	libnvidia-fatbinaryloader.so.375.66 => /usr/lib/nvidia-375/libnvidia-fatbinaryloader.so.375.66 (0x00007fb491c5a000)
	/lib64/ld-linux-x86-64.so.2 (0x00005560e05de000)

If you compile the following file, test.cu:

#include <cuda_runtime.h>
int main() { cudaFree(0); return 0; }

with:

nvcc test.cu -o test

$ nvcc test.cu -o test
nvcc warning : The 'compute_20', 'sm_20', and 'sm_21' architectures are deprecated, and may be removed in a future release (Use -Wno-deprecated-gpu-targets to suppress warning).

what libraries does it open:

strace ./test |& grep libcuda

$ strace ./test |&grep libcuda
open("/home/gkraemer/progs/deeplearning/torch/install/lib/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib/x86_64-linux-gnu/mesa/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/local/lib/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib/nvidia-375/tls/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib/nvidia-375/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/local/cuda-8.0/lib64/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib/x86_64-linux-gnu/libcuda.so.1", O_RDONLY|O_CLOEXEC) = 3
$ optirun strace ./test |&grep libcuda
open("/usr/lib/x86_64-linux-gnu/primus/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib/nvidia-375/tls/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib/nvidia-375/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib32/nvidia-375/tls/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib32/nvidia-375/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/home/gkraemer/progs/deeplearning/torch/install/lib/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib/x86_64-linux-gnu/mesa/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/local/lib/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/local/cuda-8.0/lib64/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib/x86_64-linux-gnu/libcuda.so.1", O_RDONLY|O_CLOEXEC) = 3

Of course, add optirun wherever necessary, I'm not familiar with bumblebee.

now setting

update-alternatives --config x86_64-linux-gnu_gl_conf

to auto (the nvidia driver)

optirun julia -e 'Pkg.build("CUDAdrv")'

works fine

julia -e 'Pkg.build("CUDAdrv")'

gives the same error with code 999 as before
test.cu compiles fine, same as above

the strace outputs are a little bit different:

$ strace ./test |&grep libcuda
open("/home/gkraemer/progs/deeplearning/torch/install/lib/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib/x86_64-linux-gnu/mesa/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/local/lib/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib/nvidia-375/tls/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib/nvidia-375/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/local/cuda-8.0/lib64/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib/x86_64-linux-gnu/libcuda.so.1", O_RDONLY|O_CLOEXEC) = 3
$ optirun strace ./test |&grep libcuda
open("/usr/lib/x86_64-linux-gnu/primus/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib/nvidia-375/tls/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib/nvidia-375/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib32/nvidia-375/tls/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib32/nvidia-375/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/home/gkraemer/progs/deeplearning/torch/install/lib/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib/x86_64-linux-gnu/mesa/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/local/lib/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/local/cuda-8.0/lib64/libcuda.so.1", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("/usr/lib/x86_64-linux-gnu/libcuda.so.1", O_RDONLY|O_CLOEXEC) = 3
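The traces above are long, but in all four of them the only successful open is /usr/lib/x86_64-linux-gnu/libcuda.so.1. A small filter makes that easier to see. This is a sketch only; the function name `opened_libcuda` is hypothetical, and it assumes strace prints failed calls with a `= -1` result as in the output above.

```shell
#!/bin/sh
# Read strace output on stdin and print the path of every libcuda
# open()/openat() call that succeeded (i.e. did not return -1).
opened_libcuda() {
    grep 'libcuda' | grep -v '= -1 ' | sed -n 's/^[^"]*"\([^"]*\)".*/\1/p'
}

# Usage (2>&1 | is the portable spelling of bash's |&):
#   strace ./test 2>&1 | opened_libcuda
#   optirun strace ./test 2>&1 | opened_libcuda
```

If both invocations print the same path, the difference between the working and failing case is not which libcuda file gets loaded.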

@gdkrmr
Author

gdkrmr commented Jul 6, 2017

and:

$ optirun julia -e 'ccall((:cuInit, "/usr/lib/x86_64-linux-gnu/libcuda.so"), Cint, (Cint,), 0)'
$ julia -e 'ccall((:cuInit, "/usr/lib/x86_64-linux-gnu/libcuda.so"), Cint, (Cint,), 0)'

simply return without error messages

@maleadt
Member

maleadt commented Jul 6, 2017

Right, add a @show or something similar to display the returned value.

@gdkrmr
Author

gdkrmr commented Jul 6, 2017

update-alternatives --config x86_64-linux-gnu_gl_conf

to auto (the nvidia driver)

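gkraemer@laaja:~$ julia -e '@show ccall((:cuInit, "/usr/lib/x86_64-linux-gnu/libcuda.so"), Cint, (Cint,), 0)'
ccall((:cuInit,"/usr/lib/x86_64-linux-gnu/libcuda.so"),Cint,(Cint,),0) = 999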
gkraemer@laaja:~$ optirun julia -e '@show ccall((:cuInit, "/usr/lib/x86_64-linux-gnu/libcuda.so"), Cint, (Cint,), 0)'
ccall((:cuInit,"/usr/lib/x86_64-linux-gnu/libcuda.so"),Cint,(Cint,),0) = 0

set to mesa:

gkraemer@laaja:~$ optirun julia -e '@show ccall((:cuInit, "/usr/lib/x86_64-linux-gnu/libcuda.so"), Cint, (Cint,), 0)'
ccall((:cuInit,"/usr/lib/x86_64-linux-gnu/libcuda.so"),Cint,(Cint,),0) = 999
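For readers comparing these return values: cuInit returns a CUresult, where 0 is success and 999 is the driver API's generic "unknown" error. A minimal lookup sketch follows; the helper name `curesult_name` is made up for illustration, and the table lists only a few initialisation-related codes from the CUDA driver API header, not the full enum.

```shell
#!/bin/sh
# Map a CUDA driver API return code (CUresult) to its symbolic name.
# Only a handful of codes relevant to initialisation are listed.
curesult_name() {
    case "$1" in
        0)   echo CUDA_SUCCESS ;;
        3)   echo CUDA_ERROR_NOT_INITIALIZED ;;
        100) echo CUDA_ERROR_NO_DEVICE ;;
        101) echo CUDA_ERROR_INVALID_DEVICE ;;
        999) echo CUDA_ERROR_UNKNOWN ;;
        *)   echo "unlisted code $1" ;;
    esac
}

# Usage: curesult_name 999
```

Note that 999 differs from 100 (no device): the driver library loaded and ran, but could not say what went wrong.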

@maleadt
Member

maleadt commented Jul 6, 2017

Thanks for the details.

/usr/lib/x86_64-linux-gnu/libcuda.so isn't a symlink, is it?

What I'm gathering from this, and some posts on the internet, is that optirun enables/disables your NVIDIA GPU, but shouldn't impact CUDA in any other way. That would explain the error 999, but shouldn't be impacted by the libGL change. Also, nvidia-smi does run without optirun, and you mention other CUDA applications, when run without optirun, erroring out with no device...

Debugging this remotely is going to be annoying. I might try to replicate your set-up; I take it you're running Ubuntu? Which versions? Any peculiarities, on eg. the bumblebee set-up?

@maleadt maleadt closed this as completed Jul 6, 2017
@maleadt maleadt reopened this Jul 6, 2017
@gdkrmr
Author

gdkrmr commented Jul 6, 2017

it is a symlink, it links to libcuda.so.1 which links to libcuda.so.375.66, all are in /usr/lib/x86_64-linux-gnu/
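A chain like this can be printed hop by hop, which helps when checking whether a configuration change repointed any link. This is a sketch under the assumption that `readlink` is available (it is part of GNU coreutils and busybox); the function name `link_chain` is hypothetical.

```shell
#!/bin/sh
# Print every hop of a symlink chain, one path per line, ending at the
# final target. Relative targets are resolved against the link's directory.
link_chain() {
    path=$1
    hops=0
    echo "$path"
    # The hop limit guards against cyclic links.
    while [ -L "$path" ] && [ "$hops" -lt 40 ]; do
        target=$(readlink "$path")
        case "$target" in
            /*) path=$target ;;
            *)  path=$(dirname "$path")/$target ;;
        esac
        echo "$path"
        hops=$((hops + 1))
    done
}

# Usage: link_chain /usr/lib/x86_64-linux-gnu/libcuda.so
```

Running it once with the mesa setting and once with the nvidia setting would show directly whether the update-alternatives call changes any of the links.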

@gdkrmr
Author

gdkrmr commented Jul 6, 2017

Getting bumblebee to run is quite annoying. I will try to give you the steps for it as well as I remember; I am sure that there are some details missing.

3d accelerated programs should work on the GPU now if run with optirun.
In case you lost 3d acceleration from the Intel card you have to set $LD_LIBRARY_PATH to include the mesa drivers (see: Bumblebee-Project/Bumblebee#869).

You probably want to add /usr/local/cuda-8.0/bin and /usr/lib/nvidia-375/bin to your $PATH.

@gdkrmr
Author

gdkrmr commented Jul 7, 2017

@dfdx

dfdx commented Aug 22, 2017

Just fixed the same error by upgrading the driver from version 375 to the latest 384.

For reference, here's my setup:

  • GeForce GTX 960M
  • Ubuntu 16.04
  • NVidia driver 384 installed/upgraded using built-in "Additional drivers" application
  • CUDA 8
  • no Bumblebee / Optirun

Before upgrading (driver version 375), running:

$ julia -e '@show ccall((:cuInit, "/usr/lib/x86_64-linux-gnu/libcuda.so"), Cint, (Cint,), 0)' 

failed with error 999; CUDA samples compiled, but at run-time failed with error 30, even though nvidia-smi worked fine.

After upgrading (driver version 384) all examples work fine.

@maleadt
Member

maleadt commented Aug 23, 2017

but at run-time failed with error 30

This seems different from the original report here?
Not sure I can do much about that, looks more like a defunct toolkit installation.

I've tested on 375.39 and 375.66, and CUDAdrv works fine in both cases. But then again, the toolkit already worked properly before that... One thing you might want to consider, is to run any of the samples as root. Sometimes, parts of nvidia-uvm aren't initialized properly yet, but only root can do so.
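One quick check related to the nvidia-uvm remark is whether the driver's device nodes exist at all. The sketch below assumes the usual Linux node names (/dev/nvidiactl, /dev/nvidia0, /dev/nvidia-uvm); the function name `check_nvidia_nodes` is made up, and a missing node is only a hint, not a diagnosis.

```shell
#!/bin/sh
# Report which NVIDIA device nodes exist under a /dev-like directory.
# The directory is a parameter so the function can be tried on a test tree.
check_nvidia_nodes() {
    dev_dir=${1:-/dev}
    for node in nvidiactl nvidia0 nvidia-uvm; do
        if [ -e "$dev_dir/$node" ]; then
            echo "$node: present"
        else
            echo "$node: missing"
        fi
    done
}

# Usage: check_nvidia_nodes
```

If nvidia-uvm shows up as missing for a regular user but appears after running a CUDA sample as root, that would be consistent with the initialisation problem described above.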

@dfdx

dfdx commented Aug 23, 2017

This seems different from the original report here?

@gdkrmr didn't report the error code from CUDA samples, so I'm not sure about this one. For cuInit, however, the error was 999, i.e. the same as reported here.

Not sure I can do much about that, looks more like a defunct toolkit installation.

Oh, I didn't mean you should, sorry if I made you think so! At least for me, CUDA on Ubuntu fails every now and then by itself, without any relation to CUDAdrv. I posted the report just for other people who may encounter the same error and have already tried other solutions (like using update-alternatives or running as root, both of which didn't work for me).

@gdkrmr
Author

gdkrmr commented Aug 24, 2017

Just tried with nvidia driver 384 and still the same issue.

@dfdx

dfdx commented Aug 25, 2017

And one more update. It turns out what really caused the error for me was going to sleep mode, and what fixed it was rebooting.


@maleadt
Member

maleadt commented Jan 11, 2018

We just encountered another case of this (or a similar) issue, resolved by loading the nvidia_uvm kernel module.
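Whether that module is loaded can be checked from /proc/modules before reaching for modprobe. This is a minimal sketch; the function name `module_loaded` is hypothetical, and the module name nvidia_uvm is taken from the comment above (distribution packaging may name it differently).

```shell
#!/bin/sh
# Succeed if the named kernel module appears in a /proc/modules-style listing.
# The listing file is a parameter (default /proc/modules) to allow testing.
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# Usage:
#   module_loaded nvidia_uvm || sudo modprobe nvidia_uvm
```

The check matches the module name exactly, so a loaded nvidia module does not count as nvidia_uvm.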
