Compile script for Perlmutter CPU #4398

Merged (3 commits) on Jan 16, 2023


Contributor

aannabe commented Jan 13, 2023

Proposed changes

I've been looking at compiling the CPU version on Perlmutter since Cori will be retired soon. The added script compiles the real and complex versions for CPU-only nodes. Some dependencies, such as libxml2, are handled via spack because they are not provided by default.

I had no luck with the GNU compilers or with the cc and CC wrappers provided by NERSC. However, the mpicc and mpic++ MPI wrappers with the Cray compiler build without problems.

For the real build, all unit tests pass. 99% of the deterministic tests pass; the 5 failures are related to HEG (see attached).

For the complex build, 2 unit tests fail, and 84% of the deterministic tests pass. The failures are related to HEG and Gaussian-basis bulk systems (see attached).

I haven't explored the GPU build yet.

What type(s) of changes does this code introduce?

  • Build script changes

Does this introduce a breaking change?

  • No

What systems has this change been tested on?

NERSC/Perlmutter-CPU

Checklist

  • Yes. This PR is up to date with the current state of 'develop'

cplx_tests.txt
real_tests.txt

Contributor

ye-luo commented Jan 13, 2023

  1. How about asking NERSC to provide libxml2?
  2. I don't have sufficient confidence in the CCE compiler. What is the error with GNU?

Contributor Author

aannabe commented Jan 13, 2023

  1. I already asked NERSC whether they could provide libxml2 as a module, and I was told that it is available in a specific spack environment called e4s. However, for some reason this environment wasn't available to me, so I had to run spack install myself. We can change this once it is available, but this seems to be the easiest workaround until then.
  2. Earlier, I was seeing a buggy glibc problem with GNU. I can't reproduce it now, and the code compiles if I just remove the module load PrgEnv-cray line (GNU is the default). I have now changed the compiler to GNU, and indeed fewer tests fail (the HEG tests now pass; see attached).

cplx_tests.txt
real_tests.txt

Contributor

ye-luo commented Jan 13, 2023

Let us explicitly load PrgEnv-gnu

export CMAKE_PREFIX_PATH=/global/common/software/spackecp/perlmutter/e4s-22.05/78535/spack/opt/spack/cray-sles15-zen3/gcc-11.2.0/libxml2-2.9.13-u2ai4xjq2lmljvej4p3ly7qd6hfbrz7h:$CMAKE_PREFIX_PATH

should give you libxml2.
How did you solve boost?
Please attach full cmake output.

Contributor

prckent commented Jan 13, 2023

I think that a ticket to NERSC is due here. What are they expecting us to do for common libraries such as libxml2 and boost on Perlmutter?

I'll note that the official way to get libxml2 might appear to be

module load spack/e4s-22.05
spack env activate gcc  # S-L-O-W
spack load --first libxml2%gcc@11.2.0

but since this e4s install doesn't provide boost and may only provide libxml2 as a side product of installing other packages, I am not sure how worthwhile this route is. So one unfortunate possibility is that we end up installing our own libxml2 and boost, either directly or via our own spack. This situation is not an improvement for us over previous machines, but perhaps it represents what is maintainable at NERSC.

At an appropriate point we can ask NERSC to install QMCPACK for users, since DOE BES has asked for this to happen. However realistically we'll need a working script before they can make a module for us.

Contributor Author

aannabe commented Jan 14, 2023

I made some changes per the suggestions. libxml2 is now found by pointing CMAKE_PREFIX_PATH at the e4s spack installation, and Boost is found as follows:
-- Found Boost: /usr/include (found suitable version "1.66.0", minimum required is "1.61.0")
@ye-luo, please let me know if this looks reasonable now.

Contributor Author

aannabe commented Jan 14, 2023

Attaching the cmake output for the CPU complex build.

cmake_out.txt

Contributor

ye-luo commented Jan 14, 2023

> At an appropriate point we can ask NERSC to install QMCPACK for users, since DOE BES has asked for this to happen. However realistically we'll need a working script before they can make a module for us.

If I remember correctly, the e4s maintainers asked questions on GitHub about spack/quantum-espresso. I would assume qmcpack can be included in e4s and that it would just pick up our spack/qmcpack.

Contributor

ye-luo commented Jan 14, 2023

@aannabe could you rerun ctest on the complex build with --output-on-failure? The current pass rate is concerning.

Contributor

ye-luo commented Jan 14, 2023

Test this please

ye-luo enabled auto-merge January 14, 2023 01:18
Contributor Author

aannabe commented Jan 14, 2023

Attaching the complex test results with the --output-on-failure flag.

cplx_test_output.txt

Contributor

prckent commented Jan 14, 2023

Looks like this has caught some actual bugs, likely in our use of MPI, or perhaps the MPI wrapper has a problem that only surfaces with this MPICH.

Interesting that these have only just shown up, but it highlights the merits of running on more platforms and with different MPI implementations.

Contributor

ye-luo commented Jan 16, 2023

Catchpoint 1 (exception thrown), __cxxabiv1::__cxa_throw (obj=0x10ab560, tinfo=0xff1cc0 <typeinfo for std::system_error@GLIBCXX_3.4.11>, dest=0x40ab90 <std::system_error::~system_error()@plt>) at ../../../../cpe-gcc-12.1.0-202208101649.1dfb26392197c/libstdc++-v3/libsupc++/eh_throw.cc:80
80	../../../../cpe-gcc-12.1.0-202208101649.1dfb26392197c/libstdc++-v3/libsupc++/eh_throw.cc: No such file or directory.
Missing separate debuginfos, use: zypper install krb5-debuginfo-1.19.2-150400.1.9.x86_64 libbrotlicommon1-debuginfo-1.0.7-3.3.1.x86_64 libbrotlidec1-debuginfo-1.0.7-3.3.1.x86_64 libcom_err2-debuginfo-1.46.4-150400.3.3.1.x86_64 libcurl4-debuginfo-7.79.1-150400.5.3.1.x86_64 libidn2-0-debuginfo-2.2.0-3.6.1.x86_64 libjson-c3-debuginfo-0.13-3.3.1.x86_64 libkeyutils1-debuginfo-1.6.3-5.6.1.x86_64 liblzma5-debuginfo-5.2.3-150000.4.7.1.x86_64 libnghttp2-14-debuginfo-1.40.0-6.1.x86_64 libnl3-200-debuginfo-3.3.0-1.29.x86_64 libopenssl1_1-debuginfo-1.1.1l-150400.7.7.1.x86_64 libpcre1-debuginfo-8.45-150000.20.13.1.x86_64 libpsl5-debuginfo-0.20.1-150000.3.3.1.x86_64 libselinux1-debuginfo-3.1-150400.1.69.x86_64 libssh4-debuginfo-0.9.6-150400.1.5.x86_64 libunistring2-debuginfo-0.9.10-1.1.x86_64 libxml2-2-debuginfo-2.9.14-150400.5.7.1.x86_64 libyaml-0-2-debuginfo-0.1.7-1.17.x86_64 libz1-debuginfo-1.2.11-150000.3.30.1.x86_64 libzstd1-debuginfo-1.5.0-150400.1.71.x86_64
(gdb) bt
#0  __cxxabiv1::__cxa_throw (obj=0x10ab560, tinfo=0xff1cc0 <typeinfo for std::system_error@GLIBCXX_3.4.11>, 
    dest=0x40ab90 <std::system_error::~system_error()@plt>)
    at ../../../../cpe-gcc-12.1.0-202208101649.1dfb26392197c/libstdc++-v3/libsupc++/eh_throw.cc:80
#1  0x00000000004461d9 in boost::mpi3::communicator::broadcast_n<std::complex<double>*, unsigned long> (this=<optimized out>, 
    this=<optimized out>, root=0, count=<optimized out>, first=<optimized out>)
    at /global/homes/y/yeluo/opt/qmcpack/external_codes/mpi_wrapper/mpi3/./../mpi3/error.hpp:57
#2  boost::mpi3::communicator::broadcast_n<std::complex<double>*, unsigned long> (root=0, count=<optimized out>, 
    first=<optimized out>, this=<optimized out>)
    at /global/homes/y/yeluo/opt/qmcpack/external_codes/mpi_wrapper/mpi3/./communicator.hpp:1753
#3  qmcplusplus::LCAOrbitalBuilder::putPBCFromH5 (this=<optimized out>, spo=..., coeff_ptr=<optimized out>)
    at /global/homes/y/yeluo/opt/qmcpack/src/QMCWaveFunctions/LCAO/LCAOrbitalBuilder.cpp:824
#4  0x0000000000897002 in qmcplusplus::LCAOrbitalBuilder::loadMO (this=0x13ae930, spo=..., cur=<optimized out>)
    at /global/homes/y/yeluo/opt/qmcpack/src/QMCWaveFunctions/LCAO/LCAOrbitalBuilder.cpp:613
#5  0x0000000000897ac1 in qmcplusplus::LCAOrbitalBuilder::createSPOSetFromXML (this=0x13ae930, cur=0x13a9b40)
    at /global/homes/y/yeluo/opt/qmcpack/src/QMCWaveFunctions/LCAO/LCAOrbitalBuilder.cpp:483
#6  0x00000000005b56a2 in qmcplusplus::SPOSetBuilder::createSPOSet (this=0x13ae930, cur=0x13a9b40)
    at /global/homes/y/yeluo/opt/qmcpack/src/QMCWaveFunctions/SPOSetBuilder.cpp:80
#7  0x00000000004cf0a5 in qmcplusplus::test_C_diamond() ()
#8  0x000000000051aaa2 in Catch::RunContext::invokeActiveTestCase() ()
#9  0x000000000055b595 in Catch::Session::runInternal() ()
#10 0x000000000055cb8b in Catch::Session::run() ()
#11 0x0000000000471c09 in main ()

@correaa any idea?

Contributor

correaa commented Jan 16, 2023

MPI_Bcast seems to be returning an error code.

The error code has an associated MPI message, which is wrapped into a std::system_error exception. Isn't there a string with a message or error code in the trace?

Would it be possible to wrap the call in a big try/catch for std::exception and print e.what()?
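
The try/catch suggested above could look roughly like the following. This is only a minimal sketch: `describe_failure` is a hypothetical helper, not part of QMCPACK or the mpi3 wrapper, and the callable passed to it stands in for whatever code path triggers the broadcast. It catches std::system_error first so that the error code and its category are reported along with e.what(), then falls back to std::exception.

```cpp
#include <exception>
#include <sstream>
#include <string>
#include <system_error>

// Hypothetical helper (an assumption for illustration, not QMCPACK code):
// runs `body` and returns a description of any exception it throws,
// or an empty string if `body` completes normally.
template <class F>
std::string describe_failure(F&& body) {
  try {
    body();
  } catch (const std::system_error& e) {
    // std::system_error carries an error_code in addition to the message.
    std::ostringstream out;
    out << "system_error: category=" << e.code().category().name()
        << " value=" << e.code().value() << " what=" << e.what();
    return out.str();
  } catch (const std::exception& e) {
    // Any other standard exception: only the message is available.
    return std::string("exception: ") + e.what();
  }
  return {};
}
```

Wrapping the failing call, for example `describe_failure([&] { /* code that ends up in broadcast_n */ })`, and printing the returned string would show the error category and value that the backtrace alone does not reveal.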

Contributor

prckent commented Jan 16, 2023

Test this please

ye-luo merged commit 1465238 into QMCPACK:develop Jan 16, 2023
aannabe deleted the perlmutter_cpu branch January 16, 2023 18:25
Contributor

correaa commented Jan 17, 2023

@ye-luo, was this eventually solved?

Looking again, it could be related to a bug in the support for runtime exceptions in the system or in the debugger. (Is this running in a debugger?)

I am very interested because std::system_error is a common pattern for wrapping C libraries and their error codes. However, it relies on working exception support to produce errors that are understandable.
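
For readers unfamiliar with the pattern mentioned above, here is a minimal sketch of wrapping a C-style error code in std::system_error through a custom std::error_category. Everything named `demo_*` and `checked_broadcast` is hypothetical and stands in for a C library such as MPI; this is not how the mpi3 wrapper is actually implemented.

```cpp
#include <string>
#include <system_error>

// Hypothetical C-style API (an assumption for illustration): returns 0 on
// success and a nonzero status code on failure.
enum demo_status { DEMO_OK = 0, DEMO_ERR_COUNT = 2, DEMO_ERR_ROOT = 7 };

inline int demo_broadcast(int count, int root) {
  if (count < 0) return DEMO_ERR_COUNT;
  if (root < 0) return DEMO_ERR_ROOT;
  return DEMO_OK;
}

// Error category that maps the library's integer codes to readable messages.
class demo_category_t : public std::error_category {
 public:
  const char* name() const noexcept override { return "demo"; }
  std::string message(int code) const override {
    switch (code) {
      case DEMO_OK: return "success";
      case DEMO_ERR_COUNT: return "invalid count";
      case DEMO_ERR_ROOT: return "invalid root";
      default: return "unknown error";
    }
  }
};

inline const std::error_category& demo_category() {
  static const demo_category_t instance{};
  return instance;
}

// C++ wrapper: converts a nonzero status into a std::system_error whose
// what() string combines the context and the category's message.
inline void checked_broadcast(int count, int root) {
  const int status = demo_broadcast(count, root);
  if (status != DEMO_OK)
    throw std::system_error(status, demo_category(), "demo_broadcast");
}
```

With this layout, a caller that catches std::system_error can inspect e.code().value() and e.code().category().name(), or simply print e.what(), which is the information missing from the backtrace in this thread.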


