Error when building relion with many recent commits that changed CMakeLists #1076

Closed
xeniorn opened this issue Feb 5, 2024 · 3 comments

xeniorn commented Feb 5, 2024

Can't build RELION v5.0 at commits newer than fbf4f71, i.e. starting from fb5d7c9. I'm building with CMake via EasyBuild. The error occurs in the build step and doesn't seem to be related to EasyBuild.

Full log attached:
easybuild-relion-test_ja240205_5.0-beta-fb5d7c9e693c6e9ba1deaac5727e487092d82a40-20240205.184401.cZJzJ.log

Relevant part:

/software/f2022/software/cuda/12.0.0/bin/nvcc /tmp/relion/test_ja240205_5.0-beta-fb5d7c9e693c6e9ba1deaac5727e487092d82a40/foss-2022b/relion-fb5d7c9e693c6e9ba1deaac5727e487092d82a40/src/jaz/cuda/test00.cu -c -o /tmp/relion/test_ja240205_5.0-beta-fb5d7c9e693c6e9ba1deaac5727e487092d82a40/
/software/f2022/software/gcccore/12.2.0/include/c++/12.2.0/type_traits:77:52: error: redefinition of constexpr const _Tp std::integral_constant<_Tp, __v>::value
   77 |   template<typename _Tp, _Tp __v>
      |                                                    ^
/software/f2022/software/gcccore/12.2.0/include/c++/12.2.0/type_traits:64:29: note: constexpr const _Tp value previously declared here
   64 |       static constexpr _Tp                  value = __v;
      |                             ^~~~~

Trying to install the latest commit (b75b38c) gives a different error:

/software/f2022/software/cuda/12.0.0/bin/nvcc -M -D__CUDACC__ /tmp/relion/test_ja240205_5.0-beta-0b4561936134996d5a34d83694e1d9e8ef595dc2/foss-2022b/relion-0b4561936134996d5a34d83694e1d9e8ef595dc2/src/jaz/cuda/test00.cu -o /tmp/relion/test_ja240205_5.0-beta-0b4561936134996d5a34d83694e1d9e8ef595dc2/foss-2022b/easybuild_obj/src/apps/CMakeFiles/relion_jaz_gpu_util.dir/__/jaz/cuda/relion_jaz_gpu_util_generated_test00.cu.o.NVCC-depend -ccbin /software/f2022/software/openmpi/3.1.6-gcc-12.2.0/bin/mpicc -m64 -DINSTALL_LIBRARY_DIR=/tmp/tmp.O7HcagOBao/software/relion/test_ja240205_5.0-beta-0b4561936134996d5a34d83694e1d9e8ef595dc2-foss-2022b/lib/ -DSOURCE_DIR=/tmp/relion/test_ja240205_5.0-beta-0b4561936134996d5a34d83694e1d9e8ef595dc2/foss-2022b/relion-0b4561936134996d5a34d83694e1d9e8ef595dc2/src/ -DACC_HIP=3 -DACC_CUDA=2 -DACC_CPU=1 -D_CUDA_ENABLED -DHAVE_SINCOS -DHAVE_TIFF -DHAVE_PNG -DHAVE_JPEG -Xcompiler ,\"-fPIC\",\"-std=c++14\",\"-pthread\",\"-fopenmp\",\"-O3\",\"-DNDEBUG\" -arch=sm_12.0.0 -D__INTEL_COMPILER --default-stream per-thread --std=c++14 --disable-warnings -DNVCC -I/software/f2022/software/cuda/12.0.0/include -I/tmp/relion/test_ja240205_5.0-beta-0b4561936134996d5a34d83694e1d9e8ef595dc2/foss-2022b/relion-0b4561936134996d5a34d83694e1d9e8ef595dc2 -I/software/f2022/software/fftw.mpi/3.3.10-gompi-2022b/include -I/software/f2022/software/fltk/1.3.8-gcccore-12.2.0/include -I/software/f2022/software/libtiff/4.4.0-gcccore-12.2.0/include -I/software/f2022/software/libpng/1.6.38-gcccore-12.2.0/include -I/software/f2022/software/zlib/1.2.12-gcccore-12.2.0/include -I/software/f2022/software/libjpeg-turbo/2.1.4-gcccore-12.2.0/include
nvcc fatal   : Value 'sm_12.0.0' is not defined for option 'gpu-architecture'
CMake Error at relion_jaz_gpu_util_generated_test00.cu.o.Release.cmake:220 (message):
  Error generating
  /tmp/relion/test_ja240205_5.0-beta-0b4561936134996d5a34d83694e1d9e8ef595dc2/foss-2022b/easybuild_obj/src/apps/CMakeFiles/relion_jaz_gpu_util.dir/__/jaz/cuda/./relion_jaz_gpu_util_generated_test00.cu.o


make[2]: *** [src/apps/CMakeFiles/relion_jaz_gpu_util.dir/__/jaz/cuda/relion_jaz_gpu_util_generated_test00.cu.o] Error 1

Environment:
OS: CentOS7
MPI runtime: OpenMPI 3.1.6
Cmake 3.24.3
CUDA 12.0.0

Please advise

Author

xeniorn commented Feb 5, 2024

Apologies, this was obviously an error on our end: we were erroneously passing the flag -DCUDA_ARCH=12.0.0. I wonder why it ever worked? The build works once the flag is taken out.
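For anyone hitting the same nvcc error: the log above shows the CUDA_ARCH value being appended verbatim after `sm_`, so `12.0.0` (which matches the CUDA toolkit version listed in the environment) became `-arch=sm_12.0.0`. A minimal sketch of a sanity check that could run before invoking CMake is below. The function name `nvcc_arch_flag` is hypothetical and not part of RELION, CMake, or EasyBuild, and the digits-only rule is a simplification that ignores suffixed architectures.

```python
import re


def nvcc_arch_flag(cuda_arch: str) -> str:
    """Return the nvcc -arch flag for a CUDA_ARCH value such as "60" or "80".

    Hypothetical helper for illustration; not part of RELION or CMake.
    """
    # Compute capability X.Y is written as the digits XY with no dots,
    # e.g. 6.0 -> "60". Anything containing dots is rejected.
    if not re.fullmatch(r"[1-9][0-9]{1,2}", cuda_arch):
        raise ValueError(
            f"CUDA_ARCH={cuda_arch!r} does not look like a compute capability "
            "(expected digits only, e.g. 60 for a P100)"
        )
    return f"-arch=sm_{cuda_arch}"


print(nvcc_arch_flag("60"))  # -arch=sm_60

try:
    nvcc_arch_flag("12.0.0")  # the value used by mistake in this issue
except ValueError as err:
    print(err)
```

The check only validates the shape of the value; it does not confirm that the installed nvcc actually supports that architecture.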

Perhaps a follow-up question: is there an advantage to compiling RELION separately for nodes that have different GPUs? We have P100s, V100s, and A100s, which support different compute capabilities. We can go for a single install targeting the lowest common denominator (P100 = sm_60), but would we get better performance on the A100s if we compiled it there with sm_80?

If yes, would multi-node jobs still work between nodes that have RELION compiled for different CUDA compute capabilities?
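To make the "lowest common denominator" choice concrete, here is a minimal sketch that picks the smallest compute capability among a set of GPU models. The table and the function name `lowest_common_arch` are hypothetical; the capability values (6.0, 7.0, 8.0) are the ones NVIDIA publishes for these three cards.

```python
# Compute capabilities (major, minor) of the GPUs mentioned in this thread.
COMPUTE_CAPABILITY = {
    "P100": (6, 0),
    "V100": (7, 0),
    "A100": (8, 0),
}


def lowest_common_arch(gpus):
    """Return the smallest compute capability among `gpus` as digits, e.g. "60".

    Hypothetical helper for illustration; not part of RELION.
    """
    # Tuples compare element by element, so min() picks the oldest architecture.
    major, minor = min(COMPUTE_CAPABILITY[gpu] for gpu in gpus)
    return f"{major}{minor}"


print(lowest_common_arch(["P100", "V100", "A100"]))  # 60
print(lowest_common_arch(["V100", "A100"]))          # 70
```

The returned string is the kind of value CUDA_ARCH expects for a single shared install; a per-node build would instead use each node's own entry from the table.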

@xeniorn xeniorn changed the title Error when building relion with CUDA12 with many recent commits Error when building relion with many recent commits that changed CMakeLists Feb 5, 2024
@biochem-fan
Member

> will we have better performance on A100s if we compile it there with sm80?

Theoretically yes, but when I tested this before, the difference was less than 5%. Of course, this depends on the card and the task (is the GPU really the limiting factor?). You should test on your own hardware and dataset.

> would multi-node jobs still work between nodes that have relion compiled with different cuda compute capabilities?

Yes.

Author

xeniorn commented Feb 5, 2024

OK. Since we have many people running very different projects and can't do focused optimization, we'll probably skip a potential 5% gain.

Many thanks for your input!

@xeniorn xeniorn closed this as completed Feb 5, 2024