
Compilation Issues with QMCPACK on Perlmutter Using Updated Modules #4937

Closed
romanfanta4 opened this issue Feb 28, 2024 · 14 comments · Fixed by #4938

Comments

@romanfanta4

Dear QMCPACK Development Team,

I am writing to seek assistance with an issue I've encountered while trying to compile QMCPACK on Perlmutter.
I have been following the provided build script, but unfortunately I've run into several roadblocks due to module obsolescence and compatibility issues.

Initially, the script failed because it depends on the cray-hdf5-parallel/1.12.2.3 module, which appears to be obsolete and has been removed from the system. Attempting to substitute the newer cray-hdf5-parallel/1.12.2.9 module did not resolve the issue, as the script does not work with this updated version.
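
For anyone reproducing this, the installed HDF5 module versions can be listed before editing the build script, e.g.:

module avail cray-hdf5-parallel
module load cray-hdf5-parallel/1.12.2.9   # attempted substitute for 1.12.2.3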

To overcome error messages related to environment modules, I tried loading PrgEnv-llvm/0.5, llvm/17.0.6, and cray-libsci/23.09.1.1. While these changes allowed me to progress further in the compilation process, I encountered a failure at around 20% completion with the following error:

Performing C++ SOURCE FILE Test DISABLE_HOST_DEVMEM_WORKS failed with the following output:
Change Dir: /global/cfs/cdirs/m4290/codes/qmcpack_new_version/qmcpack_3.17.1/build_perlmutter_Clang16_offload_cuda_cplx/CMakeFiles/CMakeTmp

... [Error output related to '-fdisable-host-devmem'] ...

clang++: error: unknown argument: '-fdisable-host-devmem'
This issue seems to stem from the use of the -fdisable-host-devmem compiler flag, which is not recognized by clang++ in the environment I am using.
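
For reference, the failing CMake probe can be reproduced by hand against the clang++ in this environment (the throwaway source file below is arbitrary):

echo 'int main() { return 0; }' > /tmp/devmem_check.cpp
clang++ -fdisable-host-devmem /tmp/devmem_check.cpp -o /tmp/devmem_check
# clang++: error: unknown argument: '-fdisable-host-devmem'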

Given these challenges, I am reaching out for your advice on how to proceed.
I appreciate any guidance you can provide. Thank you for your time and support.

@ye-luo
Contributor

ye-luo commented Feb 28, 2024

The issue is caused by cpe/23.12. We will use it eventually, but LLVM on Perlmutter is not ready yet.

@prckent
Contributor

prckent commented Feb 28, 2024

Thanks for reporting this, Roman. Ye's fix has the simple updates needed to bump the versions of the different pieces of software used in the build script.

@romanfanta4
Author

Thank you for your quick response. The compilation completed successfully, but when I ran ctest -J 64 -R deterministic --output-on-failure, many tests failed with the following message:

49/1067 Test #114: deterministic-restart-8-2 ................................................................................***Failed Required regular expression not found. Regex=[QMCPACK execution completed successfully]158.01 sec

Open MPI's OFI driver detected multiple equidistant NICs from the current process,
but had insufficient information to ensure MPI processes fairly pick a NIC for use.
This may negatively impact performance. A more modern PMIx server is necessary to
resolve this issue.

Note: This message is displayed only when the OFI component's verbosity level is
62859480 or higher.

Input file(s): qmc_short.in.xml

14 more processes have sent help message help-common-ofi.txt / package_rank failed

1 more process has sent help message help-common-ofi.txt / package_rank failed

@prckent
Contributor

prckent commented Feb 28, 2024

Did you run this from within a job? And with the same module setup?

Asking because at most centers, running an MPI job like this from the login-node command line will cause issues; they don't want users running on the login nodes.

@romanfanta4
Author

Yes, I ran it with the following submission script:
#!/bin/bash

#SBATCH --account=m4290
#SBATCH --constraint=gpu
#SBATCH --qos=debug
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --time=0:30:00
#SBATCH --job-name=QMCPACK_test_GPU
#SBATCH --output=NERSC_ctest-deter_slurm-%j.output
#SBATCH --error=NERSC_ctest-deter_slurm-%j.error

testdir=/global/cfs/cdirs/m4290/codes/qmcpack/qmcpack/build_perlmutter_Clang17_offload_cuda_cplx

cd $testdir

module load cpe/23.05
module load PrgEnv-gnu
module load cray-libsci
CRAY_LIBSCI_LIB=$CRAY_LIBSCI_PREFIX_DIR/lib/libsci_gnu_mp.so

module load PrgEnv-llvm/0.5 llvm/17
module load cray-fftw/3.3.10.3
module load cray-hdf5-parallel/1.12.2.3
module load cmake/3.24.3

ctest -J 64 -R deterministic --output-on-failure

@prckent
Contributor

prckent commented Feb 28, 2024

OK, thanks. I noticed on another machine recently that the "grep" used by ctest was getting confused by earlier job output (I had MPI logging enabled). Maybe that is happening now on Perlmutter. The runtime for the restart test you included looks OK. If all the tests finish in the few-seconds-to-few-minutes range, they are likely running OK. Can you manually check for "QMCPACK execution completed successfully" in the output associated with the restart test? We will have to fix this, but it would be good to verify that the MPI execution is actually OK.
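
For example, from the build directory used for the tests (the log path below is the standard ctest location):

cd $testdir   # the build directory from your submission script above
grep "QMCPACK execution completed successfully" Testing/Temporary/LastTest.log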

@prckent prckent reopened this Feb 28, 2024
@ye-luo
Contributor

ye-luo commented Feb 28, 2024

many tests failed with the following message

How many? All? Their names? Need more details to understand the issue.

@romanfanta4
Author

romanfanta4 commented Feb 28, 2024

I ran it for more than an hour and was not able to get more than 100 of the tests done, because the failing ones took a long time to finish. The following tests failed out of the ~100 I was able to run:

49/1067 Test #114: deterministic-restart-8-2
50/1067 Test #115: deterministic-restart-8-2-restart
52/1067 Test #117: deterministic-restart-1-16
53/1067 Test #118: deterministic-restart-1-16-restart
55/1067 Test #120: deterministic-restart_batch-8-2
56/1067 Test #121: deterministic-restart_batch-8-2-restart
58/1067 Test #123: deterministic-restart_batch-8-2-exists-qmc_short_batch.s000.config.h5
59/1067 Test #124: deterministic-restart_batch-8-2-exists-qmc_short_batch.s000.random.h5
62/1067 Test #127: deterministic-restart_batch-1-16
63/1067 Test #128: deterministic-restart_batch-1-16-restart
65/1067 Test #130: deterministic-restart_dmc-8-2
66/1067 Test #131: deterministic-restart_dmc-8-2-restart
68/1067 Test #133: deterministic-restart_dmc-1-16
69/1067 Test #134: deterministic-restart_dmc-1-16-restart
71/1067 Test #136: deterministic-restart_dmc_disable_branching-8-2
72/1067 Test #137: deterministic-restart_dmc_disable_branching-8-2-restart
74/1067 Test #139: deterministic-restart_dmc_disable_branching-1-16
75/1067 Test #140: deterministic-restart_dmc_disable_branching-1-16-restart
77/1067 Test #142: deterministic-save_spline_coefs-8-2
78/1067 Test #143: deterministic-save_spline_coefs-8-2-restart
79/1067 Test #144: deterministic-save_spline_coefs-8-2-check
80/1067 Test #145: deterministic-save_spline_coefs-1-16
81/1067 Test #146: deterministic-save_spline_coefs-1-16-restart
82/1067 Test #147: deterministic-save_spline_coefs-1-16-check
83/1067 Test #367: deterministic-heg_14_gamma-sj-batch-1-1
84/1067 Test #368: deterministic-heg_14_gamma-sj-batch-1-1-kinetic
85/1067 Test #369: deterministic-heg_14_gamma-sj-batch-1-1-totenergy
86/1067 Test #370: deterministic-heg_14_gamma-sj-batch-1-1-potential
87/1067 Test #410: deterministic-heg_14_gamma-sjb-1-1
88/1067 Test #411: deterministic-heg_14_gamma-sjb-1-1-kinetic
89/1067 Test #412: deterministic-heg_14_gamma-sjb-1-1-totenergy
90/1067 Test #413: deterministic-heg_14_gamma-sjb-1-1-potential
91/1067 Test #414: deterministic-heg_14_gamma-sjb-opt-1-1
92/1067 Test #415: deterministic-heg_14_gamma-sjb-opt-1-1-check
93/1067 Test #416: deterministic-heg_14_gamma-sjb-opt_vmc-1-1
94/1067 Test #417: deterministic-heg_14_gamma-sjb-opt_vmc-1-1-totenergy
95/1067 Test #420: deterministic-heg_54_J2rpa-1-1
96/1067 Test #421: deterministic-heg_54_J2rpa-1-1-kinetic
97/1067 Test #422: deterministic-heg_54_J2rpa-1-1-totenergy
98/1067 Test #423: deterministic-heg_54_J2rpa-1-1-eeenergy
99/1067 Test #424: deterministic-heg_54_J2rpa-1-1-potential

@ye-luo
Contributor

ye-luo commented Feb 28, 2024

My error for deterministic-restart-8-2 on Perlmutter is different:

There are not enough slots available in the system to satisfy the 8
slots that were requested by the application:

  /global/homes/y/yeluo/opt/qmcpack/build_perlmutter_Clang17_offload_cuda_real_MP/bin/qmcpack

Either request fewer procs for your application, or make more slots
available for use.

A "slot" is the PRRTE term for an allocatable unit where we can
launch a process.  The number of slots available are defined by the
environment in which PRRTE processes are run:

  1. Hostfile, via "slots=N" clauses (N defaults to number of
     processor cores if not provided)
  2. The --host command line parameter, via a ":N" suffix on the
     hostname (N defaults to 1 if not provided)
  3. Resource manager (e.g., SLURM, PBS/Torque, LSF, etc.)
  4. If none of a hostfile, the --host command line parameter, or an
     RM is present, PRRTE defaults to the number of processor cores

In all the above cases, if you want PRRTE to default to the number
of hardware threads instead of the number of processor cores, use the
--use-hwthread-cpus option.

Alternatively, you can use the --map-by :OVERSUBSCRIBE option to ignore the
number of available slots when deciding the number of processes to
launch.

This is expected when using OpenMPI and cores are oversubscribed.
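
For a one-off check outside ctest, the option named in the message can be passed to the launcher directly, e.g. (a sketch only; binary and input taken from the outputs above):

mpirun -np 8 --map-by :OVERSUBSCRIBE \
  /global/homes/y/yeluo/opt/qmcpack/build_perlmutter_Clang17_offload_cuda_real_MP/bin/qmcpack \
  qmc_short.in.xml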

$ ctest -L unit 
...
98% tests passed, 1 tests failed out of 56

Label Time Summary:
deterministic      = 305.95 sec*proc (56 tests)
quality_unknown    = 305.87 sec*proc (48 tests)
unit               = 305.95 sec*proc (56 tests)

Total Test time (real) = 204.90 sec

The following tests FAILED:
	 47 - deterministic-unit_test_new_drivers_mpi-r16 (Failed)

The same multi-rank error.

My feeling is that your error was caused by mixing Cray MPI and OpenMPI bits.
@romanfanta4 could you provide Testing/Temporary/LastTest.log.tmp from your incomplete ctest run?

Make sure you build QMCPACK from an empty build directory and use the build script shipped with QMCPACK.
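
A quick way to spot mixed MPI bits is to filter the ldd output that the script below already collects, e.g.:

ldd bin/qmcpack | grep -i -E 'mpi|fabric'   # a clean build should resolve a single MPI implementation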

Here is my script for unit tests

#!/bin/bash
#SBATCH -A m2113
#SBATCH -C gpu
#SBATCH -q regular
#SBATCH --time=0:60:00
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --cpus-per-task=32
#SBATCH --gpus-per-task=1

module load cpe/23.05
module load PrgEnv-gnu
module load cray-libsci
CRAY_LIBSCI_LIB=$CRAY_LIBSCI_PREFIX_DIR/lib/libsci_gnu_mp.so

module load PrgEnv-llvm/0.5 llvm/17
module load cray-fftw/3.3.10.3
module load cray-hdf5-parallel/1.12.2.3
module load cmake/3.24.3

export SLURM_CPU_BIND="cores"

module list >& modules.txt
ldd bin/qmcpack >& ldd.out
ctest -L unit -j32 --output-on-failure

@romanfanta4
Author

I downloaded the latest version, ran the build script, and then ran "ctest -L unit" in the build folder with the submission script you provided. I got the same error message you posted:

There are not enough slots available in the system to satisfy the 16
...
98% tests passed, 1 tests failed out of 56

Label Time Summary:
deterministic      = 302.97 sec*proc (56 tests)
quality_unknown    = 302.80 sec*proc (48 tests)
unit               = 302.97 sec*proc (56 tests)

Total Test time (real) = 207.61 sec

The following tests FAILED:
         47 - deterministic-unit_test_new_drivers_mpi-r16 (Failed)

With the same script, I also ran it for "deterministic-restart-8-2," and I got precisely the same error as you (There are not enough slots available in the system to satisfy the 8 ...).

I tried to run the script on 4 nodes to satisfy the 16 MPI slots in total, but I got the same error I posted before (more in the attached file).
LastTest_4nodes.log

Here is the incomplete ctest run (it is shorter than the one I mentioned because I overwrote the first one; this run lasted just 30 minutes but showed the same errors).
LastTest.log.tmp.txt

@ye-luo
Contributor

ye-luo commented Feb 29, 2024

This is giving me a headache. I've reported an issue to NERSC.

@ye-luo
Contributor

ye-luo commented Mar 19, 2024

Please try out #4942.
The following settings are required at runtime:

module use /global/common/software/nersc/n9/llvm/modules
module load craype cray-mpich
module load llvm/17.0.6-gpu

export SLURM_CPU_BIND="cores"
export MPICH_GPU_SUPPORT_ENABLED=0
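
For example, a minimal test submission combining these settings with the earlier unit-test run could look like the sketch below (account, QOS, time, and the build directory are illustrative):

#!/bin/bash
#SBATCH -A m4290
#SBATCH -C gpu
#SBATCH -q debug
#SBATCH --nodes=1
#SBATCH --time=0:30:00

cd <build directory>

module use /global/common/software/nersc/n9/llvm/modules
module load craype cray-mpich
module load llvm/17.0.6-gpu

export SLURM_CPU_BIND="cores"
export MPICH_GPU_SUPPORT_ENABLED=0

ctest -L unit -j32 --output-on-failure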

@romanfanta4
Author

I've tested the complex and real offload versions, and except for the two tests below, all others passed without problems.

 101/1067 Test  #426: deterministic-heg2d-4_ae-deriv-1-1-check .................................................................***Failed    0.06 sec
Traceback (most recent call last):
  File "/global/cfs/cdirs/m4290/codes/qmcpack/qmcpack_builds/tests/scripts/check_deriv.py", line 88, in <module>
    mm = mmap(f.fileno(),0)
         ^^^^^^^^^^^^^^^^^^
OSError: [Errno 38] Function not implemented

 103/1067 Test  #428: deterministic-heg2d-4_ae-grad_lap-1-1-check ..............................................................***Failed    0.07 sec
Traceback (most recent call last):
  File "/global/cfs/cdirs/m4290/codes/qmcpack/qmcpack_builds/tests/scripts/check_grad_lap.py", line 113, in <module>
    mm = mmap(f.fileno(),0)
         ^^^^^^^^^^^^^^^^^^
OSError: [Errno 38] Function not implemented

@ye-luo
Contributor

ye-luo commented Mar 20, 2024

@romanfanta4 the mmap failure will be investigated separately.

@ye-luo ye-luo closed this as completed Mar 20, 2024