Improve method finding CUDA libraries #1161

Open
green-br opened this issue Jul 11, 2024 · 2 comments

@green-br

Describe your problem

Whilst installing RELION, I found that the CMake configuration still uses the deprecated FindCUDA module (find_package(CUDA)) to find dependencies. Unfortunately, when using the HPC SDK from NVIDIA, FindCUDA cannot find the dependencies because of the SDK's different file structure. I have hit this issue elsewhere, such as in GROMACS, and the fix requires moving from find_package(CUDA) to find_package(CUDAToolkit), which was introduced in CMake 3.17 (but actually needs CMake 3.26 to work properly with the HPC SDK).
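For readers unfamiliar with the newer module, a minimal sketch of the find_package(CUDAToolkit) style is shown here. The project, target, and source names are hypothetical, and this is not taken from the branch linked below; only `find_package(CUDAToolkit)` and the imported targets `CUDA::cudart` and `CUDA::cufft` come from CMake itself.

```cmake
# Minimal sketch; sketch_app and main.cpp are hypothetical names.
cmake_minimum_required(VERSION 3.17)
project(cudatoolkit_sketch LANGUAGES CXX)

# FindCUDAToolkit locates the toolkit and defines imported targets
# such as CUDA::cudart and CUDA::cufft.
find_package(CUDAToolkit REQUIRED)

add_executable(sketch_app main.cpp)

# Linking imported targets replaces variables such as CUDA_cufft_LIBRARY
# that the deprecated FindCUDA module filled in.
target_link_libraries(sketch_app PRIVATE CUDA::cudart CUDA::cufft)
```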

Having spent some time making changes, I think the following branch may be the beginnings of a solution. It could be tidied up, but I would like some comments on the approach, and on whether raising the required CMake version is acceptable.

https://github.com/green-br/relion/tree/cudatoolkit_update

Environment:

  • OS: OpenSUSE 15
  • MPI runtime: Cray-mpich
  • RELION version 4.0.1
  • Memory: 900GB
  • GPU: GH200

Dataset:

Not a runtime issue.

Job options:

Not a runtime issue.

Error message:

There is no real error message other than CMake failing to find the CUDA libraries, e.g.:

    CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
    Please set them or make sure they are set and tested correctly in the CMake files:
    CUDA_cufft_LIBRARY (ADVANCED)

@biochem-fan
Member

Thank you very much for your contribution. Indeed this has been on our TODO list (#1016) for a long time but we were unable to do anything concrete, so your patch is very useful.

I have several questions:

  1. Is it possible to somehow keep the CUDA variable? I understand this conflicts with the module so it is reasonable to change the internal variable name, but we don't want to change user-facing arguments unless it is absolutely necessary.

  2. "when using the HPC SDK from Nvidia the cmake CUDA package cannot find the dependencies due to the change in file structure"

    Does FindCUDA fail even when CMP0146 is enabled? This is to understand the urgency of the problem.

  3. "which was introduced in cmake 3.17 (but actually needs cmake 3.26 to work properly with HPC SDK)."

    I thought it was introduced in 3.10 (as stated in the above CMP0146 page). Dropping <= 3.9 is probably fine but requiring 3.17 or 3.26 might be too strict. Can we make it compatible with both versions by falling back to FindCUDA when CMake is old?

  4. Did you make sure NVCC compiler flags (e.g. OpenMP) are properly passed? This is critically important; without it, mutex locks in parallelization are disabled and the resulting binary is broken (e.g. Unrecognized OpenMP pragma #1038).
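The fallback asked about in question 3, together with keeping the user-facing switch from question 1, could look roughly like the following. This is only a sketch under assumptions: the `SKETCH_` variable names are hypothetical, the option name `CUDA` is taken from the discussion above, and RELION's real CMake files may be organized differently.

```cmake
# Sketch only; SKETCH_* names are hypothetical.
cmake_minimum_required(VERSION 3.10)
project(cuda_fallback_sketch LANGUAGES CXX)

# Keep the user-facing switch under its existing name.
option(CUDA "Enable CUDA acceleration" ON)

# Internal copy, so the rest of the build does not depend on the name CUDA.
set(SKETCH_USE_CUDA ${CUDA})

if(SKETCH_USE_CUDA)
  if(CMAKE_VERSION VERSION_LESS 3.17)
    # Old CMake: only the deprecated FindCUDA module is available.
    find_package(CUDA REQUIRED)
    set(SKETCH_CUDA_LIBS ${CUDA_LIBRARIES} ${CUDA_CUFFT_LIBRARIES})
  else()
    # Newer CMake: use FindCUDAToolkit and its imported targets.
    # Per the report above, the HPC SDK layout needs CMake 3.26 or later.
    find_package(CUDAToolkit REQUIRED)
    set(SKETCH_CUDA_LIBS CUDA::cudart CUDA::cufft)
  endif()
endif()
```

Targets would then link against `${SKETCH_CUDA_LIBS}` in either branch.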

@green-br
Author

To answer your questions:

  1. It may be possible to accept CUDA, copy its value into another variable, and then unset CUDA. I have pushed a proposed change to the branch and will test that it still works.
  2. Even if CMP0146 is enabled, some of the CUDA libraries are still not found. I am not sure the policy helps beyond retiring the CUDA package.
  3. It seems FindCUDA was deprecated in 3.10, but FindCUDAToolkit seems to have been made available in 3.17, e.g. https://cmake.org/cmake/help/latest/module/FindCUDAToolkit.html - it may be possible to wrap logic around the newer bits to keep the older bits. I will take a look if the old behaviour should stay.
  4. I have just added a possible fix for the OpenMP support - will have to test.
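As an illustration of point 4, one way to forward the host compiler's OpenMP flag to nvcc is sketched here. It is not necessarily what the branch does: the target and source names are hypothetical, it assumes CUDA sources are compiled through CMake's native CUDA language support, and it assumes `OpenMP_CXX_FLAGS` holds a single flag (as with GCC's `-fopenmp`).

```cmake
# Sketch only; sketch_kernels and kernels.cu are hypothetical names.
cmake_minimum_required(VERSION 3.17)
project(nvcc_openmp_sketch LANGUAGES CXX CUDA)

find_package(OpenMP REQUIRED)

add_library(sketch_kernels STATIC kernels.cu)

# nvcc does not accept the host OpenMP flag directly, so pass it through
# -Xcompiler, and only when compiling CUDA sources.
# Assumption: OpenMP_CXX_FLAGS contains exactly one flag.
target_compile_options(sketch_kernels PRIVATE
  $<$<COMPILE_LANGUAGE:CUDA>:-Xcompiler=${OpenMP_CXX_FLAGS}>)
```

Inspecting the generated compile commands (for example with `make VERBOSE=1`) would show whether the flag actually reaches the nvcc command line.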
