Detected Inf/NaN in free surface output or not depending on the number of nodes #839
Comments
This is going to be impossible to debug :(
Ian had the same issue with both meshes (175M and 470M), so yes, it seems so.
Okay, it was always going from 8k nodes to the full machine? That is quite strange. 2 ranks per node on 8192 nodes leads to exactly 2^14 MPI ranks. But 2^14 doesn't scream "overflow" to me.
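The arithmetic behind this remark can be checked quickly. The sketch below is only a sanity check, assuming rank IDs would be held in signed 16-bit or 32-bit integers; it says nothing about how SeisSol actually stores them, and the helper name `rank_count` is made up for illustration. The node counts are the ones mentioned in this thread.

```python
# Hypothetical sanity check: do the rank counts seen in this thread come
# anywhere near common signed-integer limits? (Integer widths are an assumption.)
INT16_MAX = 2**15 - 1
INT32_MAX = 2**31 - 1


def rank_count(nodes, ranks_per_node=2):
    """Total number of MPI ranks for a job with a fixed number of ranks per node."""
    return nodes * ranks_per_node


for nodes in (8000, 8192, 8254):
    ranks = rank_count(nodes)
    # Print the rank count and whether it fits into 16-bit / 32-bit signed integers.
    print(nodes, ranks, ranks <= INT16_MAX, ranks <= INT32_MAX)
```

All three rank counts stay below both limits, which is consistent with the impression that a plain counter overflow on the number of ranks is an unlikely cause.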
Another note: I was able to run the Palu case on 8254 nodes (2 ranks per node) the last time we did Texascale. The version used was
Just want to follow up here. We are doing Texascale again, and this time it failed with 8000 nodes but ran through with 8192. Maybe this is random, but we really don't have enough data points at this scale to tell.
just want to follow up:
The partitioning should still be a little different between the runs. ... Do I see correctly that the NaNs always appear at the first output after 0 already? My guess would be something between LtsLayout and MemoryManager, maybe related to cell duplication (or the lack thereof).
yes
The bug is still there with the current master (built 15 May; spack says the hash is daigfrx, but I don't see this hash in the repository), EDIT:
Thomas.
To write it down for once, my current guess (not knowing the exact reason for sure) still goes to the code block around SeisSol/src/Initializer/time_stepping/LtsLayout.cpp Lines 743 to 753 in 5ecc5ff
The theory, which still needs double-checking, is this: suppose a copy cell has one DR ghost cell and one non-DR ghost cell, and all three cells are in the same time cluster. The two ghost cells are located on the same neighboring rank. Seen from that neighboring rank, where the copy cell appears as a ghost cell, we require it to send us both the buffers and the derivatives, but a ghost cell can only provide one of these at a time. This mismatch in the transported data somehow produces the Inf/NaN problem. As a fix, we could try to introduce another copy of the copy cell and see whether that resolves the situation, or go the slightly longer way and just re-write parts of the
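To make the suspected situation concrete, here is a toy model in Python. It is not SeisSol code and does not mirror the data structures in `LtsLayout.cpp`; the names `GhostFace`, `required_kind` and `find_conflicts` are hypothetical. It assumes, following the theory above, that a DR face asks the copy cell for derivatives while a non-DR face in the same time cluster asks for buffers, and it flags copy cells that would have to deliver both kinds to the same rank.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GhostFace:
    """One face of a copy cell whose neighbor lives on another rank (toy model)."""
    copy_cell: int            # local ID of the copy cell
    neighbor_rank: int        # rank that owns the neighboring cell
    time_cluster: int         # time cluster shared by both cells
    is_dynamic_rupture: bool  # True if the face is a dynamic rupture (DR) face


def required_kind(face):
    # Assumption taken from the theory above: DR faces need derivatives,
    # other faces in the same time cluster need time-integrated buffers.
    return "derivatives" if face.is_dynamic_rupture else "buffers"


def find_conflicts(faces):
    """Return the (copy_cell, rank, cluster) keys that need more than one data kind."""
    needs = defaultdict(set)
    for face in faces:
        key = (face.copy_cell, face.neighbor_rank, face.time_cluster)
        needs[key].add(required_kind(face))
    return {key: kinds for key, kinds in needs.items() if len(kinds) > 1}


faces = [
    GhostFace(copy_cell=7, neighbor_rank=3, time_cluster=0, is_dynamic_rupture=True),
    GhostFace(copy_cell=7, neighbor_rank=3, time_cluster=0, is_dynamic_rupture=False),
    GhostFace(copy_cell=8, neighbor_rank=3, time_cluster=0, is_dynamic_rupture=False),
]
conflicts = find_conflicts(faces)
print(conflicts)
```

In this example only copy cell 7 is flagged, because rank 3 would need both its buffers and its derivatives in the same time cluster. A check of this kind in the real layout code would be one way to test the hypothesis before attempting a fix.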
Thanks @davschneller, Thomas.
Hi, you can try https://github.com/SeisSol/SeisSol/tree/davschneller/test-infnan . Unfortunately, it takes more than one print statement to test the hypothesis (and presumably even more to fix it). (Note: I've only tested it a tiny bit so far.)
Describe the bug
During Texascale, we ran a scenario of the magnitude 7.8 Turkey earthquake.
When using all nodes of Frontera (8192) with 2 ranks per node, we get NaN when writing the first surface output:
With 8000 nodes it runs without problems.
Expected behavior
No dependence on the number of nodes.
To Reproduce
Steps to reproduce the behavior:
master, 8d6e455
intel
Frontera
/scratch1/09160/ulrich/Turkey-Syria-Earthquakes/SeisSolSetupHeterogeneities/event2
Probably related to #818.