
find_libcudadevrt doesn't work on our cluster installation #12

Open
simonbyrne opened this issue Apr 12, 2021 · 3 comments
simonbyrne (Contributor) commented Apr 12, 2021

Describe the bug

On our cluster, the CUDA installation (both 10.2 and 11.2) places libcudadevrt.a under targets/x86_64-linux/lib/libcudadevrt.a. find_libcudadevrt doesn't search this directory.

Manually editing deps/discovery.jl to add this directory makes it work, and all other libraries are then found correctly.

Could we either add that directory to the search path, or add an environment variable that lets us specify the path manually?

cc: @jakebolewski
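
For illustration, a minimal sketch of the lookup we'd like. This is a hypothetical standalone version, not the actual code in deps/discovery.jl; the function name and candidate list are assumptions:

```julia
# Hypothetical sketch: search a few candidate directories under a toolkit
# root for the static device runtime. Not the real deps/discovery.jl logic.
function find_libcudadevrt_sketch(toolkit_root::AbstractString)
    candidates = [
        joinpath(toolkit_root, "lib64"),
        joinpath(toolkit_root, "lib"),
        # our cluster layout: no lib64 symlink, the library lives under targets/
        joinpath(toolkit_root, "targets", "x86_64-linux", "lib"),
    ]
    for dir in candidates
        path = joinpath(dir, "libcudadevrt.a")
        isfile(path) && return path
    end
    return nothing  # not found in any candidate directory
end
```

With that extra candidate, `find_libcudadevrt_sketch("/central/software/CUDA/11.2")` finds the library on our nodes.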

@simonbyrne added the `bug` label on Apr 12, 2021
maleadt (Member) commented Apr 13, 2021

libcudadevrt always resides in that directory, but there should be a lib64 -> targets/x86_64-linux/lib/ symlink. Together with the libcuda.so issue, I have a feeling your CUDA distribution is a little messed up.

That said, I'm not really opposed to adding some additional code here: https://github.com/JuliaGPU/CUDA.jl/blob/631e278b56a6355492b4722382c1bec1b323e8af/deps/discovery.jl#L544-L547 (maybe add a comment about the missing symlink, though).
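
As a quick sanity check on a node, something like the following (the fallback toolkit root is an assumption; `islink`/`readlink` are standard Julia Base functions):

```julia
# Check whether lib64 is the expected symlink into targets/x86_64-linux/lib.
root = get(ENV, "CUDA_HOME", "/usr/local/cuda")  # substitute your toolkit root
lib64 = joinpath(root, "lib64")
if islink(lib64)
    println("lib64 -> ", readlink(lib64))
else
    println("lib64 is missing or not a symlink; libcudadevrt may only be under targets/")
end
```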

jakebolewski (Member) commented Apr 13, 2021

I do think the issue is with how this cluster's particular CUDA installation is set up. CUDA_HOME in the module environment points to the globally installed version of the CUDA assets, but on the GPU nodes the shared and static libraries are installed under /usr/lib64 (for the versioned .so) and /usr/local/cuda-11.2/ (for the static library and other supporting libraries).

```
julia> print(ENV["CUDA_HOME"])
/central/software/CUDA/11.2

shell> ls /usr/lib64/libcuda*
libcuda.so.1             libcuda_wrapper.so       libcuda.so.460.32.03     libcuda.so
libcuda_wrapper.la       libcuda_wrapper.so.0     libcuda_wrapper.so.0.0.0

shell> ls /usr/local/cuda-11.2/lib64/libcuda*
libcudart_static.a   libcudart.so          libcudart.so.11.2.72  libcudart.so.11.0     libcudadevrt.a
```

I'm not sure of the best way to resolve this particular setup (maybe just selectively redirect CUDA_HOME on the GPU nodes?), or whether we could add another JULIA_CUDA_ environment variable to inject extra search paths into toolkit_dirs for unusual cluster setups where login, CPU, and GPU nodes differ.
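
As a sketch of what such an escape hatch could look like (the variable name JULIA_CUDA_EXTRA_TOOLKIT_DIRS is hypothetical, not an existing option):

```julia
# Hypothetical: read a colon-separated list of extra toolkit directories
# from the environment and keep only the ones that actually exist.
function extra_toolkit_dirs()
    raw = get(ENV, "JULIA_CUDA_EXTRA_TOOLKIT_DIRS", "")
    isempty(raw) && return String[]
    return [String(dir) for dir in split(raw, ':') if isdir(dir)]
end
```

These directories could then simply be appended to whatever toolkit_dirs already discovers.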

@jakebolewski removed the `bug` label on Apr 13, 2021
maleadt (Member) commented Apr 13, 2021

An alternative would be to remove the local CUDA detection altogether, fully bet on artifacts, and have cluster users provide an Overrides.toml, which should give you the necessary flexibility (albeit at a usability cost). But that requires some additional work on the artifact side (probably including the CUDA version in the triple), so a temporary hack with env vars is OK for now.
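
For reference, such an override would live at ~/.julia/artifacts/Overrides.toml and look roughly like this (the UUID and artifact name below are placeholders, not the real CUDA.jl values):

```toml
# Hypothetical example: map an artifact, keyed by the owning package's UUID
# and the artifact name (both placeholders here), to a local toolkit path.
[00000000-0000-0000-0000-000000000000]
CUDA_toolkit = "/central/software/CUDA/11.2"
```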

@maleadt transferred this issue from JuliaGPU/CUDA.jl on Apr 27, 2024