Description
Describe the bug
During the Texascale, we ran a scenario of the magnitude 7.8 Turkey earthquake.
When using all 8192 nodes of Frontera with 2 ranks per node, we get NaN when writing the first free surface output:
Wed Apr 12 20:28:45, Info: Writing faultoutput at time 0.
Wed Apr 12 20:28:46, Info: Writing faultoutput at time 0. Done.
Wed Apr 12 20:28:47, Info: Waiting for last wave field.
Wed Apr 12 20:28:47, Info: Writing wave field at time 0.
Wed Apr 12 20:28:47, Info: Writing wave field at time 0. Done.
Wed Apr 12 20:28:49, Info: Writing free surface at time 0.
Wed Apr 12 20:28:49, Info: Writing free surface at time 0. Done.
Wed Apr 12 20:28:49, Info: Writing energy output at time 0
Wed Apr 12 20:28:51, Info: Writing energy output at time 0 Done.
Wed Apr 12 20:28:56, Info: Writing free surface at time 0.02.
Wed Apr 12 20:28:56, Info: Writing free surface at time 0.02. Done.
Wed Apr 12 20:28:56, Error: Detected Inf/NaN in free surface output. Aborting.
Wed Apr 12 20:28:56, Error: Detected Inf/NaN in free surface output. Aborting.
Wed Apr 12 20:28:56, Error: Detected Inf/NaN in free surface output. Aborting.
Wed Apr 12 20:28:56, Error: Detected Inf/NaN in free surface output. Aborting.
With 8000 nodes, the same setup runs without problems.
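To narrow down where the non-finite values first show up, a small post-processing check along the following lines might help. This is only a sketch: it assumes the free surface values have already been loaded into a flat sequence of floats, and the helper name `first_nonfinite` is hypothetical, not part of SeisSol.

```python
import math
from typing import Iterable, Optional, Tuple


def first_nonfinite(values: Iterable[float]) -> Optional[Tuple[int, float]]:
    """Return (index, value) of the first Inf/NaN entry, or None if all are finite."""
    for index, value in enumerate(values):
        if not math.isfinite(value):
            return index, value
    return None


# Illustrative data only, not taken from the failing run.
sample = [0.0, 1.5e-3, float("nan"), 2.0]
hit = first_nonfinite(sample)
if hit is not None:
    print(f"first non-finite value at index {hit[0]}: {hit[1]}")
```

Running such a check on the output of the 8192-node run and of the 8000-node run could show whether the bad values are confined to a few cells or spread over the whole surface.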
Expected behavior
The results should not depend on the number of nodes.
To Reproduce
Steps to reproduce the behavior:
- Which version do you use? Provide branch and commit id.
master, 8d6e455
- Which build settings do you use? Which compiler version do you use?
intel
ADDRESS_SANITIZER_DEBUG OFF
ASAGI ON
CMAKE_BUILD_TYPE Release
CMAKE_INSTALL_PREFIX /usr/local
COMMTHREAD ON
COVERAGE OFF
DEVICE_ARCH none
DEVICE_BACKEND none
DR_QUAD_RULE dunavant
EQUATIONS viscoelastic2
Eigen3_DIR /work2/09160/ulrich/frontera/spack/opt/spack/linux-centos7-cascadelake/intel-19.1.1.217/eigen-3.4.0-umuh726eijyrt346dic275mdha6rc7mm/share/eigen3/cmake
GEMM_TOOLS_LIST LIBXSMM,PSpaMM
HDF5 ON
HDF5_DIR /work2/09160/ulrich/frontera/spack/opt/spack/linux-centos7-cascadelake/intel-19.1.1.217/hdf5-1.12.2-fzowvyobpiowc3f3hsidtdj3mkxpkm6x/cmake
HOST_ARCH skx
INTEGRATE_QUANTITIES OFF
LIKWID OFF
LOG_LEVEL warning
LOG_LEVEL_MASTER info
Libxsmm_executable_PROGRAM /work2/09160/ulrich/frontera/spack/opt/spack/linux-centos7-cascadelake/intel-19.1.1.217/libxsmm-1.17-qdddzbw5bx5r6lkzihnuu44ry5uyaavv/bin/libxsmm_gemm_generator
MEMKIND OFF
MEMORY_LAYOUT auto
METIS ON
MINI_SEISSOL ON
MPI ON
NETCDF ON
NUMA_AWARE_PINNING ON
NUMA_ROOT_DIR /usr
NUMBER_OF_FUSED_SIMULATIONS 1
NUMBER_OF_MECHANISMS 3
OPENMP ON
ORDER 6
PLASTICITY_METHOD nb
PRECISION double
PROXY_PYBINDING OFF
PSpaMM_PROGRAM /work2/09160/ulrich/frontera/spack/opt/spack/linux-centos7-cascadelake/intel-19.1.1.217/py-pspamm-develop-35zum726336t2gdzfcz6mfzivifra7nd/pspamm.py
SIONLIB OFF
TESTING OFF
TESTING_GENERATED OFF
USE_IMPALA_JIT_LLVM OFF
easi_DIR /work2/09160/ulrich/frontera/spack/opt/spack/linux-centos7-cascadelake/intel-19.1.1.217/easi-1.2.0-zto7che54ra7wvybw2be6m66yzaisxt5/lib64/cmake/easi
impalajit_DIR /work2/09160/ulrich/frontera/spack/opt/spack/linux-centos7-cascadelake/intel-19.1.1.217/impalajit-main-rdjaykqjjbb645iny6nexrtnup27ejpg/lib64/cmake/impalajit
netCDF_DIR netCDF_DIR-NOTFOUND
yaml-cpp_DIR /work2/09160/ulrich/frontera/spack/opt/spack/linux-centos7-cascadelake/intel-19.1.1.217/yaml-cpp-0.6.2-qszzfashukprv326kpoe2ivdwrfq6a5f/lib/cmake/yaml-cpp
- On which machine does your problem occur? If on a cluster: Which modules are loaded?
Frontera
Currently Loaded Modules:
1) intel/19.1.1
2) impi/19.0.9
3) git/2.24.1
4) autotools/1.2
5) cmake/3.24.2
6) pmix/3.1.4
7) hwloc/1.11.12
8) xalt/2.10.34
9) TACC
10) python3/3.9.2
11) seissol-env-develop-intel-19.1.1.217-w2i565p
12) impalajit-main-intel-19.1.1.217-rdjaykq
- Provide parameter/material files.
/scratch1/09160/ulrich/Turkey-Syria-Earthquakes/SeisSolSetupHeterogeneities/event2
Probably related to #818.