Compilation Issues with QMCPACK on Perlmutter Using Updated Modules #4937
Comments
Issue caused by cpe/23.12. We will use it eventually but LLVM on Perlmutter is not ready.
Thanks for reporting this Roman. Ye's fix has the simple update needed to bump the versions of the different pieces of software used in the build script.
Thank you for your quick response. The compilation completed successfully, but when I ran the tests with ctest -J 64 -R deterministic --output-on-failure, many of them failed with the following message:
49/1067 Test #114: deterministic-restart-8-2 ................................................................................***Failed Required regular expression not found. Regex=[QMCPACK execution completed successfully] 158.01 sec
Open MPI's OFI driver detected multiple equidistant NICs from the current process,
Note: This message is displayed only when the OFI component's verbosity level is
Did you run this from within a job? And with the same modules setup? Asking because at most centers, running an MPI job like this from the command line will cause issues; they don't want users running on the login nodes.
Yes, I ran it with the following submission script:
#SBATCH --account=m4290
testdir=/global/cfs/cdirs/m4290/codes/qmcpack/qmcpack/build_perlmutter_Clang17_offload_cuda_cplx
cd $testdir
module load cpe/23.05
module load PrgEnv-llvm/0.5 llvm/17
ctest -J 64 -R deterministic --output-on-failure
OK, thanks. I noticed on another machine recently that the "grep" used by ctest was getting confused by earlier job output (I had MPI logging enabled). Maybe that is happening now on Perlmutter. The runtime for the restart test you include looks OK. If all the tests take between a few seconds and a few minutes, they are likely running OK. Can you manually check for "QMCPACK execution completed successfully" in the output associated with the restart test? We will have to fix this but it would be good to verify the MPI execution is actually OK.
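A minimal shell sketch of that manual check, assuming the test output has been captured to a file. The helper name `check_completed` is made up for illustration; the only thing taken from this thread is the banner string itself.

```shell
# Hypothetical helper (not part of QMCPACK or CTest): report whether a
# captured log contains the success banner that ctest is looking for.
check_completed() {
  # -F treats the banner as a fixed string, -q suppresses grep's own output.
  if grep -qF "QMCPACK execution completed successfully" "$1"; then
    echo "completed"
  else
    echo "missing"
  fi
}

# Example call (the path is an assumption; adjust to wherever the output is):
# check_completed Testing/Temporary/LastTest.log
```

CTest normally writes the combined output of the last run to Testing/Temporary/LastTest.log inside the build directory, so that file is one place to look if the per-test output was not saved separately.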
How many? All? Their names? Need more details to understand the issue.
I ran it for more than an hour and could not get through more than 100 of the tests, because the failing ones took a long time to finish. Of the 100 tests I was able to run, the following failed: 49/1067 Test #114: deterministic-restart-8-2
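To answer the "which tests failed" question from a saved ctest log, something like the following sketch could be used. It assumes the failed lines look like the one quoted in this thread (`Test #N: name ...***Failed`); the function name `list_failed` and the log file name are illustrative only.

```shell
# Hypothetical helper: print the names of tests marked ***Failed in a
# saved ctest log, one per line.
list_failed() {
  # Keep only lines with the failure marker, then extract the test name
  # that follows "Test #<number>:".
  grep -F '***Failed' "$1" | sed -E 's/.*Test +#[0-9]+: +([^ ]+).*/\1/'
}

# Example call (ctest_output.log is an assumed file name):
# list_failed ctest_output.log
```

Once the failing names are known, `ctest --rerun-failed --output-on-failure` reruns only the tests that failed in the previous run, which avoids waiting on the full suite again.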
My error of deterministic-restart-8-2 is different on Perlmutter
This is expected when using OpenMPI and cores are oversubscribed.
The same multi-rank error. My feeling is that your error was caused by mixing Cray MPI and OpenMPI bits. Make sure you build QMCPACK from empty build directories and use the build script from QMCPACK. Here is my script for unit tests
I downloaded the latest version, ran the build script, and then ran the "ctest -L unit" with the submission script you provided in the build folder. I got the same error message as you posted:
With the same script, I also ran "deterministic-restart-8-2" and got precisely the same error as you. I tried to run the script on 4 nodes to satisfy the 16 MPI slots in total, but I got the same error I posted before (more in the attached file). Here is the incomplete ctest run (it is shorter than the one I mentioned because I overwrote the first one; this one ran for just 30 minutes but with the same errors).
Having a headache. Reported an issue to NERSC.
Please try out #4942
I've tested complex and real offload versions, and except for these two tests below, all others passed without problem.
@romanfanta4 the mmap failure will be investigated separately.
Dear QMCPACK Development Team,
I am writing to seek assistance with an issue I've encountered while trying to compile the QMCPACK software on Perlmutter.
I have been following the provided script for installation, but unfortunately, I've run into several roadblocks due to module obsolescence and compatibility issues.
Initially, the script failed because it depends on the cray-hdf5-parallel/1.12.2.3 module, which appears to be obsolete and has been removed from the system. Attempting to substitute this with the newer cray-hdf5-parallel/1.12.2.9 module did not resolve the issue, as the script does not seem to work with this updated version.
To overcome error messages related to environment modules, I tried loading PrgEnv-llvm/0.5, llvm/17.0.6, and cray-libsci/23.09.1.1. While these changes allowed me to progress further in the compilation process, I encountered a failure at around 20% completion with the following error:
Performing C++ SOURCE FILE Test DISABLE_HOST_DEVMEM_WORKS failed with the following output:
Change Dir: /global/cfs/cdirs/m4290/codes/qmcpack_new_version/qmcpack_3.17.1/build_perlmutter_Clang16_offload_cuda_cplx/CMakeFiles/CMakeTmp
... [Error output related to '-fdisable-host-devmem'] ...
clang++: error: unknown argument: '-fdisable-host-devmem'
This issue seems to stem from the use of the -fdisable-host-devmem compiler flag, which is not recognized by clang++ in the environment I am using.
Given these challenges, I am reaching out for your advice on how to proceed.
I appreciate any guidance you can provide. Thank you for your time and support.
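The "unknown argument" message above can be checked outside CMake with a quick probe of the compiler. This is only a sketch: `flag_supported` is a made-up helper, and it assumes the compiler accepts `-x c++ -fsyntax-only` with source on standard input, as clang++ and g++ do.

```shell
# Hypothetical helper: print "yes" if the given compiler accepts the given
# flag when checking a trivial C++ program, "no" otherwise.
flag_supported() {
  # $1 = compiler command, $2 = flag to probe.
  # -fsyntax-only avoids producing an object file; "-" reads from stdin.
  if printf 'int main() { return 0; }\n' |
      "$1" -x c++ -fsyntax-only "$2" - >/dev/null 2>&1; then
    echo "yes"
  else
    echo "no"
  fi
}

# Example call with the flag from the error message above:
# flag_supported clang++ -fdisable-host-devmem
```

Running the probe with the same modules loaded as in the build shows whether the flag is rejected by that particular clang++ or whether the failure comes from somewhere else in the configuration.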