mpich test failure on s390x #703

Open

opoplawski opened this issue Sep 11, 2023 · 2 comments

@opoplawski
Contributor

Describe the bug
I'm working on a Fedora package for dbcsr. I'm getting test failures with mpich on s390x.

To Reproduce

/usr/bin/ctest --test-dir redhat-linux-build-mpich --output-on-failure --force-new-ctest-process -j3
Internal ctest changing into directory: /builddir/build/BUILD/dbcsr-2.6.0/redhat-linux-build-mpich
Test project /builddir/build/BUILD/dbcsr-2.6.0/redhat-linux-build-mpich
      Start  1: dbcsr_perf:inputs/test_H2O.perf
      Start  2: dbcsr_perf:inputs/test_rect1_dense.perf
      Start  3: dbcsr_perf:inputs/test_rect1_sparse.perf
 1/19 Test  #3: dbcsr_perf:inputs/test_rect1_sparse.perf ..............***Failed    2.10 sec
 DBCSR| CPU Multiplication driver                                           BLAS (D)
 DBCSR| Multrec recursion limit                                              512 (D)
 DBCSR| Multiplication stack size                                           1000 (D)
 DBCSR| Maximum elements for images                                    UNLIMITED (D)
 DBCSR| Multiplicative factor virtual images                                   1 (D)
 DBCSR| Use multiplication densification                                       T (D)
 DBCSR| Multiplication size stacks                                             3 (D)
 DBCSR| Use memory pool for CPU allocation                                     F (D)
 DBCSR| Number of 3D layers                                               SINGLE (D)
 DBCSR| Use MPI memory allocation                                              F (D)
 DBCSR| Use RMA algorithm                                                      F (U)
 DBCSR| Use Communication thread                                               T (D)
 DBCSR| Communication thread load                                            100 (D)
 DBCSR| MPI: My process id                                                     0
 DBCSR| MPI: Number of processes                                               2
 DBCSR| OMP: Current number of threads                                         2
 DBCSR| OMP: Max number of threads                                             2
 DBCSR| Split modifier for TAS multiplication algorithm                  1.0E+00 (D)
 numthreads           2
 numnodes           2
 matrix_sizes        5000        1000        1000
 sparsities  0.90000000000000002       0.90000000000000002       0.90000000000000002     
 trans NN
 symmetries NNN
 type            3
 alpha_in   1.0000000000000000        0.0000000000000000     
 beta_in   1.0000000000000000        0.0000000000000000     
 limits           1        5000           1        1000           1        1000
 retain_sparsity F
 nrep          10
 bs_m           1           5
 bs_n           1           5
 bs_k           1           5
 MPI error 5843983 in mpi_barrier @ mp_sync : Other MPI error, error stack:
 internal_Barrier(84).......................: MPI_Barrier(comm=0x84000001) failed
 MPID_Barrier(167)..........................:
 MPIDI_Barrier_allcomm_composition_json(132):
 MPIDI_POSIX_mpi_bcast(219).................:
 MPIDI_POSIX_mpi_bcast_release_gather(132)..:
 MPIDI_POSIX_mpi_release_gather_release(218): message sizes do not match across
   processes in the collective routine: Received 0 but expected 1
 dbcsr_mpiwrap.F:1186
 ===== Routine Calling Stack ===== 
            4 mp_sync
            3 perf_multiply
            2 dbcsr_perf_multiply_low
            1 dbcsr_performance_driver
Abort(1) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 1
STOP 1

I don't see test failures with openmpi. One difference is that mpich is being built with -DUSE_MPI_F08=ON.
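
For reference, that flag only switches which Fortran binding DBCSR uses for its MPI calls. A rough sketch of the difference (illustration only, not DBCSR's actual code):

```fortran
! Illustration only (not DBCSR source): with -DUSE_MPI_F08=ON the library goes
! through the Fortran 2008 mpi_f08 module, where MPI handles are derived types
! rather than plain integers, instead of the older "use mpi" interface.
subroutine sync_mpi(comm)            ! legacy binding: integer handle
   use mpi
   integer, intent(in) :: comm
   integer :: ierr
   call mpi_barrier(comm, ierr)
end subroutine sync_mpi

subroutine sync_mpi_f08(comm)        ! F08 binding: type(MPI_Comm) handle
   use mpi_f08
   type(MPI_Comm), intent(in) :: comm
   integer :: ierr
   call MPI_Barrier(comm, ierr)
end subroutine sync_mpi_f08
```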

Environment:

  • Operating system & version
    Fedora Rawhide
  • Compiler vendor & version
    gcc 13.2.1
  • Build environment (make or cmake)
    cmake
  • Configuration of DBCSR (either the cmake flags or the Makefile.inc)
    /usr/bin/cmake -S . -B redhat-linux-build-mpich -DCMAKE_C_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_CXX_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_Fortran_FLAGS_RELEASE:STRING=-DNDEBUG -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON -DCMAKE_INSTALL_DO_STRIP:BOOL=OFF -DCMAKE_INSTALL_PREFIX:PATH=/usr -DINCLUDE_INSTALL_DIR:PATH=/usr/include -DLIB_INSTALL_DIR:PATH=/usr/lib64 -DSYSCONF_INSTALL_DIR:PATH=/etc -DSHARE_INSTALL_PREFIX:PATH=/usr/share -DLIB_SUFFIX=64 -DBUILD_SHARED_LIBS:BOOL=ON -DCMAKE_INSTALL_Fortran_MODULES=/usr/lib64/gfortran/modules/mpich -DUSE_MPI_F08=ON -DCMAKE_PREFIX_PATH:PATH=/usr/lib64/mpich -DCMAKE_INSTALL_PREFIX:PATH=/usr/lib64/mpich -DCMAKE_INSTALL_LIBDIR:PATH=lib
  • MPI implementation and version
    mpich 4.1.2
  • If CUDA is being used: CUDA version and GPU architecture
    No CUDA
  • BLAS/LAPACK implementation and version
    flexiblas 3.3.1 -> openblas 0.3.21
@alazzaro
Member

I've realized that we are not testing with MPI_F08 in our CI; however, we did a test here #661 (comment) and it worked. The only difference was GCC 13.1. I will add the test to the CI. In the meantime, I see some actions here:

  1. Could you build with F08 and OpenMPI?
  2. Any chance you can use GCC 13.1 and mpich with F08 in DBCSR?

@opoplawski
Contributor Author

I've enabled -DUSE_MPI_F08=ON for the openmpi builds as well. Scratch builds are here (for a week or two):

F40 - gcc 13.2.1 mpich 4.1.2 - https://koji.fedoraproject.org/koji/taskinfo?taskID=110306721

Tests are still failing.

We are stuck with the version of the compiler in the distribution, which is 13.2.1 in all current Fedora releases.

Interestingly though, the tests are succeeding in F38:

https://koji.fedoraproject.org/koji/taskinfo?taskID=110306885

which is with mpich 4.0.3. So maybe it's more of an mpich issue than a DBCSR one, though mpich's own basic test suite is passing.

Also different:
openblas 0.3.21 -> 0.3.25
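
In case it helps narrow this down, a minimal standalone probe along the lines of the sketch below (hypothetical, not part of DBCSR; the requested thread level and the loop count are just assumptions) could be built with the F08 bindings against both mpich 4.1.2 and 4.0.3 and run with 2 ranks on one node, to see whether the barrier-on-a-duplicated-communicator path misbehaves outside of DBCSR:

```fortran
! Hypothetical probe, not part of DBCSR: exercises the same pattern mp_sync
! uses -- an MPI_Barrier on a duplicated communicator -- via the mpi_f08 module.
program f08_barrier_probe
   use mpi_f08
   implicit none
   type(MPI_Comm) :: comm
   integer :: rank, nranks, provided, i

   ! The requested thread level is an assumption; adjust to match DBCSR's setup.
   call MPI_Init_thread(MPI_THREAD_MULTIPLE, provided)
   call MPI_Comm_dup(MPI_COMM_WORLD, comm)
   call MPI_Comm_rank(comm, rank)
   call MPI_Comm_size(comm, nranks)

   do i = 1, 100
      call MPI_Barrier(comm)   ! the call that aborts in the failing test
   end do

   if (rank == 0) write (*, '(a,i0,a)') 'barrier probe passed on ', nranks, ' ranks'
   call MPI_Comm_free(comm)
   call MPI_Finalize()
end program f08_barrier_probe
```

Something like `mpifort probe.f90 -o probe && mpiexec -n 2 ./probe` would mirror the two-rank layout of the failing test (file name hypothetical; wrapper and launcher names as shipped by the mpich packages).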
