Detected Inf/NaN in free surface output or not depending on the number of nodes #839

Open
Thomas-Ulrich opened this issue Apr 13, 2023 · 12 comments

@Thomas-Ulrich
Contributor

Describe the bug
During the Texascale days, we ran a scenario of the M 7.8 Turkey earthquake.
When using all nodes of Frontera (8192) with 2 ranks per node, we get NaN at the first free surface output after time 0:

Wed Apr 12 20:28:45, Info:  Writing faultoutput at time 0.
Wed Apr 12 20:28:46, Info:  Writing faultoutput at time 0. Done.
Wed Apr 12 20:28:47, Info:  Waiting for last wave field.
Wed Apr 12 20:28:47, Info:  Writing wave field at time 0.
Wed Apr 12 20:28:47, Info:  Writing wave field at time 0. Done.
Wed Apr 12 20:28:49, Info:  Writing free surface at time 0.
Wed Apr 12 20:28:49, Info:  Writing free surface at time 0. Done.
Wed Apr 12 20:28:49, Info:  Writing energy output at time 0
Wed Apr 12 20:28:51, Info:  Writing energy output at time 0 Done.
Wed Apr 12 20:28:56, Info:  Writing free surface at time 0.02.
Wed Apr 12 20:28:56, Info:  Writing free surface at time 0.02. Done.
Wed Apr 12 20:28:56, Error: Detected Inf/NaN in free surface output. Aborting.
(the error message is repeated by several ranks)

With 8000 nodes, it runs without problems.
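
For context, the "Detected Inf/NaN" abort above comes from a finiteness check on the output values. A minimal sketch of that kind of check, assuming a contiguous buffer of doubles and MPI; the function name and buffer layout are invented here, not SeisSol's actual code:

    #include <cmath>
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    #include <mpi.h>

    // Abort the whole run if any output value is Inf or NaN.
    void checkFreeSurfaceBuffer(const std::vector<double>& values) {
      for (std::size_t i = 0; i < values.size(); ++i) {
        if (!std::isfinite(values[i])) {
          std::fprintf(stderr,
                       "Error: Detected Inf/NaN in free surface output (index %zu). Aborting.\n",
                       i);
          MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
        }
      }
    }

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      std::vector<double> sample = {0.0, 1.5, -2.25};  // all finite: no abort
      checkFreeSurfaceBuffer(sample);
      MPI_Finalize();
      return 0;
    }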

Expected behavior
No dependence of the results on the number of nodes.

To Reproduce
Steps to reproduce the behavior:

  1. Which version do you use? Provide branch and commit id.
    master, 8d6e455
  2. Which build settings do you use? Which compiler version do you use?
    Intel 19.1.1
ADDRESS_SANITIZER_DEBUG          OFF
 ASAGI                            ON
 CMAKE_BUILD_TYPE                 Release
 CMAKE_INSTALL_PREFIX             /usr/local
 COMMTHREAD                       ON
 COVERAGE                         OFF
 DEVICE_ARCH                      none
 DEVICE_BACKEND                   none
 DR_QUAD_RULE                     dunavant
 EQUATIONS                        viscoelastic2
 Eigen3_DIR                       /work2/09160/ulrich/frontera/spack/opt/spack/linux-centos7-cascadelake/intel-19.1.1.217/eigen-3.4.0-umuh726eijyrt346dic275mdha6rc7mm/share/eigen3/cmake
 GEMM_TOOLS_LIST                  LIBXSMM,PSpaMM
 HDF5                             ON
 HDF5_DIR                         /work2/09160/ulrich/frontera/spack/opt/spack/linux-centos7-cascadelake/intel-19.1.1.217/hdf5-1.12.2-fzowvyobpiowc3f3hsidtdj3mkxpkm6x/cmake
 HOST_ARCH                        skx
 INTEGRATE_QUANTITIES             OFF
 LIKWID                           OFF
 LOG_LEVEL                        warning
 LOG_LEVEL_MASTER                 info
 Libxsmm_executable_PROGRAM       /work2/09160/ulrich/frontera/spack/opt/spack/linux-centos7-cascadelake/intel-19.1.1.217/libxsmm-1.17-qdddzbw5bx5r6lkzihnuu44ry5uyaavv/bin/libxsmm_gemm_generator
 MEMKIND                          OFF
 MEMORY_LAYOUT                    auto
 METIS                            ON
 MINI_SEISSOL                     ON
 MPI                              ON
 NETCDF                           ON
 NUMA_AWARE_PINNING               ON
 NUMA_ROOT_DIR                    /usr
 NUMBER_OF_FUSED_SIMULATIONS      1
 NUMBER_OF_MECHANISMS             3
 OPENMP                           ON
 ORDER                            6
 PLASTICITY_METHOD                nb
 PRECISION                        double
 PROXY_PYBINDING                  OFF
 PSpaMM_PROGRAM                   /work2/09160/ulrich/frontera/spack/opt/spack/linux-centos7-cascadelake/intel-19.1.1.217/py-pspamm-develop-35zum726336t2gdzfcz6mfzivifra7nd/pspamm.py
 SIONLIB                          OFF
 TESTING                          OFF
 TESTING_GENERATED                OFF
 USE_IMPALA_JIT_LLVM              OFF
 easi_DIR                         /work2/09160/ulrich/frontera/spack/opt/spack/linux-centos7-cascadelake/intel-19.1.1.217/easi-1.2.0-zto7che54ra7wvybw2be6m66yzaisxt5/lib64/cmake/easi
 impalajit_DIR                    /work2/09160/ulrich/frontera/spack/opt/spack/linux-centos7-cascadelake/intel-19.1.1.217/impalajit-main-rdjaykqjjbb645iny6nexrtnup27ejpg/lib64/cmake/impalajit
 netCDF_DIR                       netCDF_DIR-NOTFOUND
 yaml-cpp_DIR                     /work2/09160/ulrich/frontera/spack/opt/spack/linux-centos7-cascadelake/intel-19.1.1.217/yaml-cpp-0.6.2-qszzfashukprv326kpoe2ivdwrfq6a5f/lib/cmake/yaml-cpp
  3. On which machine does your problem occur? If on a cluster: Which modules are loaded?
    Frontera.
Currently Loaded Modules:
  1) intel/19.1.1    7) hwloc/1.11.12
  2) impi/19.0.9     8) xalt/2.10.34
  3) git/2.24.1      9) TACC
  4) autotools/1.2  10) python3/3.9.2
  5) cmake/3.24.2   11) seissol-env-develop-intel-19.1.1.217-w2i565p
  6) pmix/3.1.4     12) impalajit-main-intel-19.1.1.217-rdjaykq
  4. Provide parameter/material files.
    /scratch1/09160/ulrich/Turkey-Syria-Earthquakes/SeisSolSetupHeterogeneities/event2

Probably related to #818.

@krenzland
Contributor

This is going to be impossible to debug :(
Was this reproducible?

@Thomas-Ulrich
Contributor Author

Ian had the same issue with the two meshes (175M and 470M cells), so yes, it seems reproducible.

@krenzland
Contributor

Okay, so it was always going from 8k nodes to the full machine? That is quite strange. 2 ranks per node on 8192 nodes leads to exactly 2^14 MPI ranks, but 2^14 doesn't scream "overflow" to me.
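
For reference, a quick back-of-the-envelope check of these counts, using the 470M-cell figure for the larger mesh mentioned above; this is just editorial arithmetic, not SeisSol code:

    #include <cstdint>
    #include <iostream>
    #include <limits>

    int main() {
      const std::int64_t ranks = std::int64_t(8192) * 2;  // 2 ranks/node on 8192 nodes = 16384 = 2^14
      const std::int64_t cells = 470000000;               // larger of the two meshes (~470M cells)

      std::cout << "ranks                = " << ranks << "\n";
      std::cout << "cells fit in int32   = "
                << (cells <= std::numeric_limits<std::int32_t>::max()) << "\n";
      std::cout << "cells per rank (avg) = " << cells / ranks << "\n";  // ~29k, far from any 32-bit limit
      return 0;
    }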

@wangyinz

Another note: I was able to run the Palu case on 8254 nodes (2 ranks per node) the last time we did Texascale. The version used was 160c195d. I think that is the largest run we have ever done.

@wangyinz

Just want to follow up here. We are doing the Texascale again, and this time it failed with 8000 nodes but ran through with 8192. Maybe this is random, but we really don't have that many data points at this scale to tell.

@Thomas-Ulrich
Contributor Author

Just to follow up:
I ran the same setup twice, with the same number of nodes
(num_nodes: 8195, ntasks: 16390).
The first run has NaNs, the second works smoothly.

Thu Oct 12 15:14:04, Info:  Finishing initialization...
Thu Oct 12 15:14:04, Info:  Starting simulation.
Thu Oct 12 15:14:04, Info:  Writing faultoutput at time 0.
Thu Oct 12 15:14:04, Info:  Writing faultoutput at time 0. Done.
Thu Oct 12 15:14:04, Info:  Waiting for last wave field.
Thu Oct 12 15:14:04, Info:  Writing wave field at time 0.
Thu Oct 12 15:14:04, Info:  Writing wave field at time 0. Done.
Thu Oct 12 15:14:04, Info:  Writing free surface at time 0.
Thu Oct 12 15:14:04, Info:  Writing free surface at time 0. Done.
Thu Oct 12 15:14:04, Info:  Writing energy output at time 0
Thu Oct 12 15:14:05, Info:  Writing energy output at time 0 Done.
Thu Oct 12 15:14:11, Info:  Writing free surface at time 0.01.
Thu Oct 12 15:14:11, Info:  Writing free surface at time 0.01. Done.
Thu Oct 12 15:14:11, Info:  Writing energy output at time 0.01
Thu Oct 12 15:14:14, Info:  Volume energies skipped at this step
Thu Oct 12 15:14:14, Info:  Frictional work (total, % static, % radiated):  -nan  , -nan  , -nan
Thu Oct 12 15:14:14, Info:  Seismic moment (without plasticity): -nan  Mw: -nan
Thu Oct 12 15:14:14, Info:  Writing energy output at time 0.01 Done.

@davschneller
Contributor

davschneller commented Oct 12, 2023

The partitioning should still be a little different between the runs. ... Do I see correctly that the NaNs always appear already at the first output after 0?

My guess would go to something between LtsLayout and MemoryManager; maybe with cell duplication (or lack thereof).

@Thomas-Ulrich
Contributor Author

yes

@Thomas-Ulrich
Contributor Author

Thomas-Ulrich commented May 28, 2024

The bug is still there with the current master (built 15 May; Spack says the hash is daigfrx, but I don't see this hash in the repository).
I just ran into it with a much smaller setup.
On NG: /hppfs/work/pn49ha/di73yeq4/issue839_nan_very_rarely
It runs on 16 nodes, 2 ranks per node, and failed once on 20 nodes, 2 ranks per node.
I've run ~100 such setups on 20 nodes without problems (with slightly different friction parameters).

EDIT:

  1. I found another one that failed.
  2. I just queued the setup on 20 nodes again and it did not fail.

Thomas.

@davschneller
Contributor

To write it down for once: my current guess (not knowing the exact reason for sure) still goes to the code block around:

if( m_clusteredCopy[l_localClusterId][l_region].first[0] == i_neighboringRank &&
    m_clusteredCopy[l_localClusterId][l_region].first[1] == i_neighboringGlobalClusterId ) {
  // assert ordering is preserved
  assert( *(m_clusteredCopy[l_localClusterId][l_region].second.end()-1) <= i_cellId );

  // only add a cell if not present already
  if( *(m_clusteredCopy[l_localClusterId][l_region].second.end()-1) != i_cellId ) {
    m_clusteredCopy[l_localClusterId][l_region].second.push_back( i_cellId );
  }
  break;
}

The following theory (which will still require some double-checking): we have a copy cell with a DR ghost cell and another non-DR ghost cell, where all three cells are in the same time cluster and the two ghost cells are located on the same neighboring rank. However, looking at this copy cell from the neighboring rank (where it is a ghost cell), we require it to send us both buffers and derivatives, but a ghost cell can only provide one of the two at a time. Somehow, this mismatch leads to the Inf/NaN problem due to mismatching transported data.
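
To make the hypothesized situation easier to picture, here is a small self-contained illustration (all type and variable names below are invented for this sketch; this is not SeisSol's layout code): it records, per (neighboring rank, ghost cell) pair, whether we need a buffer, derivatives, or both from that ghost cell, and flags the problematic "both" case.

    #include <cstddef>
    #include <iostream>
    #include <map>
    #include <utility>

    // What a local copy cell requires from a particular ghost cell.
    struct GhostRequirement {
      bool needsBuffer = false;      // via a regular (non-DR) face
      bool needsDerivatives = false; // via a dynamic-rupture face
    };

    int main() {
      // key: (neighboring rank, ghost cell id) -> requirement
      std::map<std::pair<int, std::size_t>, GhostRequirement> required;

      // Example: one copy cell touches ghost cell 7 on rank 3 twice,
      // once through a DR face and once through a regular face.
      required[{3, 7}].needsDerivatives = true;
      required[{3, 7}].needsBuffer = true;

      for (const auto& [key, req] : required) {
        if (req.needsBuffer && req.needsDerivatives) {
          std::cout << "rank " << key.first << ", ghost cell " << key.second
                    << " would have to provide both a buffer and derivatives\n";
        }
      }
      return 0;
    }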

As a fix, we could try to introduce another copy of the copy cell and see if that resolves the situation, or go the slightly longer way and rewrite (parts of) the LtsLayout entirely. At the latest when we want to add new clusters, we may need to do that anyway.

@Thomas-Ulrich
Contributor Author

Thanks @davschneller,
Where can I add a print statement that would verify whether we are indeed in this situation?
I can then run this setup many times until I get NaNs (this should not take a lot of CPU resources) and confirm or refute your theory.

Thomas.
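
One possible starting point (a guess based only on the snippet quoted above, not on the test branch linked below): print whenever the quoted duplicate-cell check triggers, since that is exactly the case where one copy cell is referenced more than once from the same neighboring rank and cluster. At 16k+ ranks such a print can fire very often, so rate-limiting it keeps the logs readable; a generic, self-contained helper sketch for that (names invented, not part of SeisSol):

    #include <atomic>
    #include <cstdio>

    // Print a diagnostic message at most `maxPrints` times per process.
    inline void limitedWarn(const char* message, int maxPrints = 5) {
      static std::atomic<int> printed{0};
      if (printed.fetch_add(1) < maxPrints) {
        std::fprintf(stderr, "%s\n", message);
      }
    }

    int main() {
      for (int i = 0; i < 100; ++i) {
        limitedWarn("duplicate ghost-cell reference detected");  // printed only 5 times
      }
      return 0;
    }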

@davschneller
Contributor

Hi, you can try https://github.com/SeisSol/SeisSol/tree/davschneller/test-infnan . Unfortunately, it takes more than one print statement to test the hypothesis (and presumably even more to fix it).

(note: I've only tested it a tiny bit so far)
