
Detected Inf/NaN in free surface output, depending on the number of nodes #839

Open
@Thomas-Ulrich


Describe the bug
During Texascale, we ran a scenario of the magnitude 7.8 Turkey earthquake.
When using all 8192 nodes of Frontera with 2 ranks per node, we get NaNs right after the free-surface output at time 0.02 is written:

Wed Apr 12 20:28:45, Info:  Writing faultoutput at time 0.
Wed Apr 12 20:28:46, Info:  Writing faultoutput at time 0. Done.
Wed Apr 12 20:28:47, Info:  Waiting for last wave field.
Wed Apr 12 20:28:47, Info:  Writing wave field at time 0.
Wed Apr 12 20:28:47, Info:  Writing wave field at time 0. Done.
Wed Apr 12 20:28:49, Info:  Writing free surface at time 0.
Wed Apr 12 20:28:49, Info:  Writing free surface at time 0. Done.
Wed Apr 12 20:28:49, Info:  Writing energy output at time 0
Wed Apr 12 20:28:51, Info:  Writing energy output at time 0 Done.
Wed Apr 12 20:28:56, Info:  Writing free surface at time 0.02.
Wed Apr 12 20:28:56, Info:  Writing free surface at time 0.02. Done.
Wed Apr 12 20:28:56, Error: Detected Inf/NaN in free surface output. Aborting.
Wed Apr 12 20:28:56, Error: Detected Inf/NaN in free surface output. Aborting.
Wed Apr 12 20:28:56, Error: Detected Inf/NaN in free surface output. Aborting.
Wed Apr 12 20:28:56, Error: Detected Inf/NaN in free surface output. Aborting.
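The error is raised by a check for non-finite values in the free-surface data. Below is a minimal sketch of such a check. It assumes, which this report does not confirm, that each rank validates its own part of the free-surface buffer. Under that assumption, the four identical error lines would come from four different ranks. The build sets LOG_LEVEL to warning for non-master ranks, so their error messages would still be printed. The helpers `first_non_finite` and `check_output` are hypothetical and are not SeisSol's API.

```python
import math
from typing import Iterable, Optional


def first_non_finite(values: Iterable[float]) -> Optional[int]:
    """Return the index of the first Inf/NaN in `values`, or None if all are finite."""
    for i, v in enumerate(values):
        if not math.isfinite(v):
            return i
    return None


def check_output(rank: int, values: Iterable[float]) -> None:
    """Hypothetical per-rank check: raise if this rank's buffer contains Inf/NaN.

    Each rank runs the check on its own data, so a problem affecting several
    ranks yields several identical-looking error messages.
    """
    idx = first_non_finite(values)
    if idx is not None:
        raise FloatingPointError(
            f"rank {rank}: Detected Inf/NaN in free surface output "
            f"(local index {idx}). Aborting."
        )


# Illustrative rank-local buffers; rank 1 contains an injected NaN.
buffers = {0: [0.0, 1.5e-3, -2.0e-4], 1: [0.0, float("nan"), 3.0e-4]}
for rank, values in buffers.items():
    try:
        check_output(rank, values)
    except FloatingPointError as err:
        print(err)
```

Running it prints `rank 1: Detected Inf/NaN in free surface output (local index 1). Aborting.` Reporting the rank and local index, rather than only aborting, would help locate the partition where the NaNs first appear.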

With 8000 nodes, the same setup runs without problems.
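For reference, here is the MPI rank count for the two runs. This assumes the 8000-node run also used 2 ranks per node, which the report does not state explicitly:

```latex
N_\text{ranks} = 2 \times N_\text{nodes}: \qquad
2 \times 8192 = 16384 \;(\text{fails}), \qquad
2 \times 8000 = 16000 \;(\text{runs})
```

With METIS enabled, the mesh is presumably partitioned over all ranks, so the two runs would use different partitions of the same mesh.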

Expected behavior
The results should not depend on the number of nodes used.

To Reproduce
Steps to reproduce the behavior:

  1. Which version do you use? Provide branch and commit id.
    master, 8d6e455
  2. Which build settings do you use? Which compiler version do you use?
    Intel compiler 19.1.1.217 (module intel/19.1.1), Intel MPI (module impi/19.0.9)
 ADDRESS_SANITIZER_DEBUG          OFF
 ASAGI                            ON
 CMAKE_BUILD_TYPE                 Release
 CMAKE_INSTALL_PREFIX             /usr/local
 COMMTHREAD                       ON
 COVERAGE                         OFF
 DEVICE_ARCH                      none
 DEVICE_BACKEND                   none
 DR_QUAD_RULE                     dunavant
 EQUATIONS                        viscoelastic2
 Eigen3_DIR                       /work2/09160/ulrich/frontera/spack/opt/spack/linux-centos7-cascadelake/intel-19.1.1.217/eigen-3.4.0-umuh726eijyrt346dic275mdha6rc7mm/share/eigen3/cmake
 GEMM_TOOLS_LIST                  LIBXSMM,PSpaMM
 HDF5                             ON
 HDF5_DIR                         /work2/09160/ulrich/frontera/spack/opt/spack/linux-centos7-cascadelake/intel-19.1.1.217/hdf5-1.12.2-fzowvyobpiowc3f3hsidtdj3mkxpkm6x/cmake
 HOST_ARCH                        skx
 INTEGRATE_QUANTITIES             OFF
 LIKWID                           OFF
 LOG_LEVEL                        warning
 LOG_LEVEL_MASTER                 info
 Libxsmm_executable_PROGRAM       /work2/09160/ulrich/frontera/spack/opt/spack/linux-centos7-cascadelake/intel-19.1.1.217/libxsmm-1.17-qdddzbw5bx5r6lkzihnuu44ry5uyaavv/bin/libxsmm_gemm_generator
 MEMKIND                          OFF
 MEMORY_LAYOUT                    auto
 METIS                            ON
 MINI_SEISSOL                     ON
 MPI                              ON
 NETCDF                           ON
 NUMA_AWARE_PINNING               ON
 NUMA_ROOT_DIR                    /usr
 NUMBER_OF_FUSED_SIMULATIONS      1
 NUMBER_OF_MECHANISMS             3
 OPENMP                           ON
 ORDER                            6
 PLASTICITY_METHOD                nb
 PRECISION                        double
 PROXY_PYBINDING                  OFF
 PSpaMM_PROGRAM                   /work2/09160/ulrich/frontera/spack/opt/spack/linux-centos7-cascadelake/intel-19.1.1.217/py-pspamm-develop-35zum726336t2gdzfcz6mfzivifra7nd/pspamm.py
 SIONLIB                          OFF
 TESTING                          OFF
 TESTING_GENERATED                OFF
 USE_IMPALA_JIT_LLVM              OFF
 easi_DIR                         /work2/09160/ulrich/frontera/spack/opt/spack/linux-centos7-cascadelake/intel-19.1.1.217/easi-1.2.0-zto7che54ra7wvybw2be6m66yzaisxt5/lib64/cmake/easi
 impalajit_DIR                    /work2/09160/ulrich/frontera/spack/opt/spack/linux-centos7-cascadelake/intel-19.1.1.217/impalajit-main-rdjaykqjjbb645iny6nexrtnup27ejpg/lib64/cmake/impalajit
 netCDF_DIR                       netCDF_DIR-NOTFOUND
 yaml-cpp_DIR                     /work2/09160/ulrich/frontera/spack/opt/spack/linux-centos7-cascadelake/intel-19.1.1.217/yaml-cpp-0.6.2-qszzfashukprv326kpoe2ivdwrfq6a5f/lib/cmake/yaml-cpp
  3. On which machine does your problem occur? If on a cluster: Which modules are loaded?
    Frontera
Currently Loaded Modules:
  1) intel/19.1.1    7) hwloc/1.11.12
  2) impi/19.0.9     8) xalt/2.10.34
  3) git/2.24.1      9) TACC
  4) autotools/1.2  10) python3/3.9.2
  5) cmake/3.24.2   11) seissol-env-develop-intel-19.1.1.217-w2i565p
  6) pmix/3.1.4     12) impalajit-main-intel-19.1.1.217-rdjaykq
  4. Provide parameter/material files.
    /scratch1/09160/ulrich/Turkey-Syria-Earthquakes/SeisSolSetupHeterogeneities/event2

This is probably related to #818.
