inf/Nan in receiver output (GPU and CPU) #818
Comments
Hi @francescomosconi, can you recompile SeisSol with double precision and try again? |
Hi @ravil-mobile, thanks for your reply. I tried double precision. The run no longer fails with the inf/NaN error, but at the same timestep where it previously failed we see this strange behaviour of the slip velocity: |
You use Linear Slip Weakening, right? How large is your mesh, and can you share it with me if possible? |
I am also curious whether an older version of SeisSol leads to the same results. Can you also run this simulation on CPUs and compare the results? If we can find an older git commit that gives correct results on GPUs (relative to SeisSol's CPU version), it will be easy to determine which recent SeisSol changes caused the issue. |
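As a general illustration of why switching to double precision can turn an inf/NaN abort into "strange" but finite values, the sketch below lets a quantity grow by a fixed factor per step and reports when it first becomes infinite. The growth factor and step count are arbitrary assumptions, and `to_float32` is a hypothetical helper that emulates single-precision rounding with the standard library; nothing here is SeisSol code or a diagnosis of this issue.

```python
import math
import struct


def to_float32(x: float) -> float:
    """Round a Python float (double) to the nearest single-precision value."""
    try:
        return struct.unpack("f", struct.pack("f", x))[0]
    except OverflowError:
        # Magnitude beyond the float32 range: treat it as overflow to infinity.
        return math.copysign(math.inf, x)


def first_overflow_step(single_precision: bool, factor: float = 10.0, steps: int = 60):
    """Multiply 1.0 by `factor` repeatedly; return the first step that yields inf, else None."""
    x = 1.0
    for step in range(1, steps + 1):
        x = x * factor
        if single_precision:
            x = to_float32(x)
        if math.isinf(x):
            return step
    return None


print(first_overflow_step(single_precision=True))   # 39
print(first_overflow_step(single_precision=False))  # None
```

The same unstable growth is present in both cases; single precision simply runs out of range earlier, so a double-precision run can continue past the point where the single-precision run reports inf.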
What is the maximum slip rate? Can you rescale |
Is this output at |
Is the |
yes, |
Hi @ravil-mobile |
I was running your scenario on a local CPU server, but 850k elements was too much for it. Could you halve the mesh size by coarsening?
Did I understand correctly that your simulation with the coarse mesh (the one you gave me) aborted at 0.54 s with "inf/NaN in receivers output" while running SeisSol on CPUs? Please confirm, because it is important for me. |
Yes, I can confirm that. OK, I will provide you with a coarser mesh. Maybe this is not related, but in a previous issue (https://github.com/SeisSol/SeisSol/issues/546) I saw similar behaviour. Is it possible that the error is due to numerical oscillations? |
Hi @Thomas-Ulrich and @sebwolf-de. This issue also affects the CPU version of SeisSol. Do you have any idea? |
How much simulation time does it take to reach the error? |
Hi @Thomas-Ulrich, I ran an additional test with the same mesh and parameters, imposing a high value of Dc at a radius of 10 m (so the rupture is forced to stop abruptly, which avoids the numerical oscillations that appear when the rupture propagates slowly). With this change the simulation runs to completion. |
I can confirm this error with the small mesh (while the fault output looks OK) on SuperMUC-NG with 16 nodes:
|
Hi, Thomas. |
That doesn't make sense. The fault output interval seems to be 0.002, so changing receiveroutputinterval to 0.01 should not introduce a new sync point.
Actually, I also changed the fault output printtimeinterval_sec to 0.005 s. With this new interval, I get the following error (very similar to the previous one) when ReceiverOutputInterval = 0.01 is not set, and no error when it is set.
|
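The sync-point argument above can be checked with a little arithmetic. The sketch assumes that each output type synchronises at integer multiples of its interval (an assumption about how the intervals are used, not something read from the SeisSol source); `output_times` and `adds_sync_points` are hypothetical helper names. Exact fractions are used so that values such as 0.002 do not suffer from binary rounding.

```python
from fractions import Fraction


def output_times(interval: str, end_time: str) -> set:
    """Times k * interval (k = 1, 2, ...) up to end_time, as exact fractions."""
    step, end = Fraction(interval), Fraction(end_time)
    return {k * step for k in range(1, int(end / step) + 1)}


def adds_sync_points(new_interval: str, existing_interval: str, end_time: str = "1") -> bool:
    """True if the new interval produces output times the existing interval does not."""
    new = output_times(new_interval, end_time)
    existing = output_times(existing_interval, end_time)
    return not new <= existing


print(adds_sync_points("0.01", "0.002"))  # False: 0.01 is 5 * 0.002
print(adds_sync_points("0.01", "0.005"))  # False: 0.01 is 2 * 0.005
print(adds_sync_points("0.01", "0.003"))  # True: 0.01 is not a multiple of 0.003
```

Under that assumption, a receiver interval of 0.01 adds no sync points for either fault output interval mentioned in this thread, which is why the change in behaviour is surprising.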
GTS, same problem:
|
Hi, please let me know if I can help with anything. |
Hi, |
4 order, single precision and |
This is a wild shot, but can you try rerunning one of these simulations with sliprateoutputtype=0? https://seissol.readthedocs.io/en/latest/dynamic-rupture.html#visualisation-sliprateoutputtype-default-1 |
@Thomas-Ulrich Can you maybe check this setup on one node on SuperMUC (fat node if necessary)? |
Hi, |
It seems this problem could have been explained and fixed by #875 |
Describe the bug
The simulation hangs with error inf/NaN in receiver output(10;10;10).
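To pin down when the first bad sample appears, a time series can be scanned for non-finite values. This is a minimal sketch: `first_non_finite` is a hypothetical helper, the sample values are made up for illustration, and reading the actual receiver files is left out because their layout is not shown in this report.

```python
import math


def first_non_finite(times, values):
    """Return the time of the first inf/NaN sample, or None if all samples are finite."""
    for t, v in zip(times, values):
        if not math.isfinite(v):
            return t
    return None


# Made-up samples: a value that blows up and then becomes infinite.
times = [0.52, 0.53, 0.54]
values = [1.2, 3.4e5, float("inf")]
print(first_non_finite(times, values))  # 0.54
```

Running such a check on both CPU and GPU outputs would show whether the two versions fail at the same simulation time.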
Expected behavior
The simulation should run till the end
To Reproduce
Steps to reproduce the behavior:
SeisSol version is 202103_Sumatra-1090-g7ae55cfa (last compilation: 2 weeks ago), master branch
GPU setting with CUDA and MPI compilers. Single precision. We tested polynomial order 4, 6 and 7.
Marconi100:
parameters.txt
Screenshots/Console output
The seissol.err is provided in:
err_rec_output.txt
Visualizing the results in ParaView shows this just before the error occurs:
Additional context
We tested different setups on the cluster, from 24 to 64 nodes, to rule out memory problems, but the error still occurs, at a different simulation time. Varying Dc also changes the time at which the error occurs (we use FL=16).
The mesh was checked, and we also tried decreasing the CFL number from 0.5 to 0.25.
Additionally, we reduced the number of timesteps so that the run ends before the inf/NaN occurs, but then we get this error when hipSYCL shuts down (see seissol_out.txt and seissol_err.txt; it may be related):
from /m100/home/userexternal/rdorozhi/Downloads/gcc11.2.0/hipSYCL/src/runtime/cuda/cuda_allocator.cpp:143 @ query_pointer(): cuda_allocator: query_pointer(): pointer is unknown by backend (error code = CUDA:0)
seissol_err.txt
seissol_out.txt
Many thanks in advance.