inf/Nan in receiver output (GPU and CPU) #818

francescomosconi · 2023-03-14T12:12:16Z

Describe the bug
The simulation stops with the error inf/NaN in receiver output at receiver (10; 10; 10).

Expected behavior
The simulation should run to the end.

To Reproduce
Steps to reproduce the behavior:

Which version do you use? Provide branch and commit id.
SeisSol version is 202103_Sumatra-1090-g7ae55cfa (last compiled 2 weeks ago), master branch.
Which build settings do you use? Which compiler version do you use?
GPU build with CUDA and MPI compilers, single precision. We tested polynomial orders 4, 6 and 7.
On which machine does your problem occur? If on a cluster: Which modules are loaded?
Marconi100:

Provide parameter/material files.
parameters.txt

Screenshots/Console output
The seissol.err is provided in:
err_rec_output.txt
Visualizing the results in ParaView shows this just before the error occurs:

Additional context
We tested different setups on the cluster, from 24 to 64 nodes, to rule out memory problems, but the error still occurs, only at a different simulation time. Varying Dc also changes the time at which the error occurs (we use FL=16).
The mesh was checked, and we also tried decreasing the CFL number from 0.5 to 0.25.

Additionally, we reduced the number of time steps so that the run stops before the inf/NaN occurs, but then we get this error when hipSYCL shuts down (see the seissol_out.txt and seissol_err.txt files; it may be related):
from /m100/home/userexternal/rdorozhi/Downloads/gcc11.2.0/hipSYCL/src/runtime/cuda/cuda_allocator.cpp:143 @ query_pointer(): cuda_allocator: query_pointer(): pointer is unknown by backend (error code = CUDA:0)
seissol_err.txt
seissol_out.txt

Many thanks in advance.

ravil-mobile · 2023-03-14T22:18:23Z

Hi @francescomosconi,

Can you re-compile SeisSol with double precision and try again?

francescomosconi · 2023-03-15T15:39:00Z

Hi @ravil-mobile, thanks for your reply.

I tried with double precision. The process no longer fails with the inf/NaN error, but at the time step where it previously happened we see this strange behaviour of the slip velocity:

ravil-mobile · 2023-03-16T12:16:13Z

You use linear slip weakening, right? How large is your mesh, and could you share it with me, if possible?

ravil-mobile · 2023-03-16T12:22:46Z

I am also curious whether an older version of SeisSol leads to the same results. Can you also try to run this simulation on CPUs and compare the results?
You will need to change the following CMake parameters: -DDEVICE_ARCH=none -DDEVICE_BACKEND=none -DHOST_ARCH=power9

If we can find an older git commit that gives correct results on GPUs (relative to SeisSol's CPU version), then it will be easy to determine which recent changes to SeisSol caused the issue.
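
The search for such a commit is a binary search over the history, which is what `git bisect` automates. A minimal sketch of the idea in Python, assuming a hypothetical predicate `is_bad` that builds a given commit, runs the scenario and reports whether the GPU result is wrong; the commit names are made up, and the approach only helps if some older commit really is good:

```python
from typing import Callable, Sequence


def first_bad_commit(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_bad() is True.

    Assumes commits are ordered oldest to newest, the first one is good,
    the last one is bad, and the behaviour flips exactly once.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] is good, commits[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Hypothetical history (not real SeisSol commits); pretend "c5" introduced the problem.
history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]


def is_bad(commit: str) -> bool:
    # Stand-in for "build this commit, run the scenario, compare against the CPU result".
    return history.index(commit) >= history.index("c5")


print(first_bad_commit(history, is_bad))
```

Each call to `is_bad` stands for a full build and run, so halving the range at every step keeps the number of expensive runs logarithmic in the number of commits.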

ravil-mobile · 2023-03-21T10:27:47Z

Hi @francescomosconi,

What is the maximum slip rate? Can you rescale the SRd field using the auto-scale option?

francescomosconi · 2023-03-21T10:44:53Z

Hi @ravil-mobile

With auto-scale we get a range of values from 1e-50 to 1e50.
This is the snapshot of the last time step:

This is the scale from 0 to 1e10 for the same time:

ravil-mobile · 2023-03-21T11:59:43Z

Is this output at t=0.002 sec?

francescomosconi · 2023-03-21T12:29:02Z

It is t=0.021 s.

ravil-mobile · 2023-03-21T12:49:25Z

Yes, 1e50 is too big. Something is definitely wrong in SeisSol. It is difficult to debug. I am running your coarse mesh on CPUs, with the simulation end time set to 0.9 sec. Then I will do the same on GPUs.
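
To narrow down when a field first goes wrong, it can help to scan an output time series for the first non-finite or implausibly large sample rather than waiting for the abort. A minimal sketch, assuming the values have already been loaded into plain Python lists; the sample numbers and the 1e3 m/s plausibility limit are made up for illustration and are not taken from this simulation:

```python
import math


def first_suspicious_sample(times, values, limit=1.0e3):
    """Return (time, value) of the first non-finite or implausibly large sample, or None."""
    for t, v in zip(times, values):
        if not math.isfinite(v) or abs(v) > limit:
            return t, v
    return None


# Made-up slip-rate samples in m/s (illustrative only).
times = [0.019, 0.020, 0.021, 0.022]
slip_rate = [1.2, 3.5, 1.0e50, float("nan")]

print(first_suspicious_sample(times, slip_rate))
```

A value such as 1e50 is still finite, so a check for NaN or Inf alone would only trigger later; the magnitude limit catches the blow-up earlier.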

francescomosconi · 2023-03-24T12:15:34Z

Hi @ravil-mobile
Do you have any news?
We compiled SeisSol on another small cluster with CPUs and then ran the coarse mesh with an end time of 1.0 s. The same error, inf/NaN in receivers output, occurs at 0.54 s.
Unfortunately, we cannot also test the refined mesh there because of the limited computational power.

ravil-mobile · 2023-03-28T11:09:05Z

Hi @francescomosconi,

I was running your scenario on a local CPU server. 850k elements were too many for it. Could you halve the mesh size by coarsening?

> The same error inf/NaN in receivers output occur at 0.54s.

Did I understand you correctly that your simulation with the coarse mesh (the one you gave me) was aborted at 0.54 s with "inf/NaN in receivers output" while running SeisSol on CPUs? Please confirm whether this is correct, because it is important for me.

francescomosconi · 2023-03-28T14:46:18Z

Hi @ravil-mobile

Yes, I can confirm that.

OK, I will provide you with a coarser mesh.

Maybe this is not related, but in a previous issue (https://github.com/SeisSol/SeisSol/issues/546) I saw similar behavior. Is it possible that the error is due to numerical oscillations?
Below I show P_n during our simulation:

ravil-mobile · 2023-03-28T15:22:16Z

Hi @Thomas-Ulrich and @sebwolf-de. This issue also affects the CPU version of SeisSol. Do you have any idea?

Thomas-Ulrich · 2023-03-30T16:28:22Z

How much simulation time does it take to reach the error?

francescomosconi · 2023-03-31T07:21:54Z

Hi @Thomas-Ulrich
The simulation time at which the process stops (for the CPU version) is 0.54 s out of a total of 1 s.

I ran an additional test with the same mesh and parameters, imposing a high value of Dc at a radius of 10 m (so the rupture is forced to stop abruptly, and we avoid the numerical oscillations that appear when the rupture propagates slowly). In that case the simulation runs to completion.

Thomas-Ulrich · 2023-03-31T07:56:05Z

I can confirm this error with the small mesh (while the fault output looks OK) on SuperMUC-NG with 16 nodes:

0 Thu Mar 30 18:40:46, Info:  Writing faultoutput at time 0.476.
0 Thu Mar 30 18:40:46, Info:  Writing faultoutput at time 0.476. Done.
0 Thu Mar 30 18:40:46, Info:  Performance since the start: 8.17978 TFLOP/s (rank 0: 253.11 GFLOP/s, average over ranks: 255.618 GFLOP/s)
0 Thu Mar 30 18:40:46, Info:  Performance since last sync point: 8.13081 TFLOP/s (rank 0: 251.595 GFLOP/s, average over ranks: 254.088 GFLOP/s)
0 Thu Mar 30 18:40:48, Info:  Writing faultoutput at time 0.477.
0 Thu Mar 30 18:40:48, Info:  Writing faultoutput at time 0.477. Done.
0 Thu Mar 30 18:40:48, Info:  Performance since the start: 8.17907 TFLOP/s (rank 0: 253.088 GFLOP/s, average over ranks: 255.596 GFLOP/s)
0 Thu Mar 30 18:40:48, Info:  Performance since last sync point: 7.85701 TFLOP/s (rank 0: 243.141 GFLOP/s, average over ranks: 245.531 GFLOP/s)
0 Thu Mar 30 18:40:51, Info:  Writing faultoutput at time 0.478.
0 Thu Mar 30 18:40:51, Info:  Writing faultoutput at time 0.478. Done.
0 Thu Mar 30 18:40:51, Info:  Performance since the start: 8.17872 TFLOP/s (rank 0: 253.077 GFLOP/s, average over ranks: 255.585 GFLOP/s)
0 Thu Mar 30 18:40:51, Info:  Performance since last sync point: 8.01478 TFLOP/s (rank 0: 248.006 GFLOP/s, average over ranks: 250.462 GFLOP/s)
0 Thu Mar 30 18:40:52, Error: Detected Inf/NaN in receiver output at 10 , 10 , 10 . Aborting.
Backtrace:
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic(_ZN5utils6LoggerD1Ev+0x477) [0x426d77]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x6f0620]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x7ed854]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x7ea6d0]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x7ea5fe]

Thomas-Ulrich · 2023-04-03T15:37:38Z

Hi,
Somehow, I found a fix to your problem. Use:
ReceiverOutputInterval = 0.01
(see https://seissol.readthedocs.io/en/latest/parameter-file.html)
Don't ask me why, I don't know.

Thomas.

krenzland · 2023-04-04T08:41:47Z

That doesn't make sense. The fault output interval seems to be 0.002; hence, changing ReceiverOutputInterval to 0.01 should not introduce a new sync point.
Thomas, does it work with GTS?
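
The argument here is that every multiple of 0.01 is also a multiple of 0.002, so the receiver interval adds no synchronization times that the fault output does not already impose. A minimal sketch of that arithmetic, under the simplifying assumption that sync points fall exactly on multiples of each output interval up to the 1 s end time (this is a model of the reasoning, not SeisSol's actual scheduling code):

```python
from fractions import Fraction


def sync_points(interval: str, end_time: str) -> set:
    """Multiples of `interval` in (0, end_time], as exact fractions."""
    step, end = Fraction(interval), Fraction(end_time)
    n = int(end / step)
    return {step * k for k in range(1, n + 1)}


fault = sync_points("0.002", "1")     # fault output interval from the thread
receiver = sync_points("0.01", "1")   # ReceiverOutputInterval = 0.01

# Sync times required by the receivers that the fault output does not already require.
new_points = receiver - fault
print(len(fault), len(receiver), len(new_points))
```

Exact fractions are used instead of floats so that, for example, 5 × 0.002 compares equal to 0.01. The difference set is empty, which is why the observed effect of the setting is surprising.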

Thomas-Ulrich · 2023-04-04T09:37:20Z

Actually, I also changed the fault output printtimeinterval_sec to 0.005 s. With this new interval, I get the following error (very similar to the previous one) when ReceiverOutputInterval = 0.01 is not set, and no error when it is set.
(The folder can be found at /hppfs/work/pr45fi/di73yeq4/bug-Mosconi.)

0 Mon Apr 03 17:35:23, Info:  Writing faultoutput at time 0.425. Done.
0 Mon Apr 03 17:35:23, Info:  Performance since the start: 17.047 TFLOP/s (rank 0: 534.371 GFLOP/s, average over ranks: 532.718 GFLOP/s)
0 Mon Apr 03 17:35:23, Info:  Performance since last sync point: 17.0601 TFLOP/s (rank 0: 546.433 GFLOP/s, average over ranks: 533.128 GFLOP/s)
0 Mon Apr 03 17:35:27, Error: Detected Inf/NaN in receiver output at 10 , 10 , 10 . Aborting.

Thomas-Ulrich · 2023-04-04T10:00:34Z

GTS, same problem:

0 Tue Apr 04 11:57:33, Info:  Performance since the start: 32.3926 TFLOP/s (rank 0: 1010.18 GFLOP/s, average over ranks: 1012.27 GFLOP/s)
0 Tue Apr 04 11:57:33, Info:  Performance since last sync point: 32.0423 TFLOP/s (rank 0: 1001.17 GFLOP/s, average over ranks: 1001.32 GFLOP/s)
0 Tue Apr 04 11:57:35, Info:  #max-updates since sync:  100  @  0.475194
0 Tue Apr 04 11:57:35, Info:  #max-updates since sync:  200  @  0.475387
0 Tue Apr 04 11:57:35, Info:  #max-updates since sync:  300  @  0.475581
0 Tue Apr 04 11:57:36, Info:  #max-updates since sync:  400  @  0.475774
0 Tue Apr 04 11:57:36, Info:  #max-updates since sync:  500  @  0.475968
0 Tue Apr 04 11:57:37, Info:  #max-updates since sync:  600  @  0.476161
0 Tue Apr 04 11:57:37, Info:  #max-updates since sync:  700  @  0.476355
0 Tue Apr 04 11:57:37, Info:  #max-updates since sync:  800  @  0.476548
0 Tue Apr 04 11:57:38, Info:  #max-updates since sync:  900  @  0.476742
0 Tue Apr 04 11:57:38, Info:  #max-updates since sync:  1000  @  0.476935
0 Tue Apr 04 11:57:39, Info:  #max-updates since sync:  1100  @  0.477129
0 Tue Apr 04 11:57:39, Info:  #max-updates since sync:  1200  @  0.477322
0 Tue Apr 04 11:57:40, Info:  #max-updates since sync:  1300  @  0.477516
0 Tue Apr 04 11:57:40, Info:  #max-updates since sync:  1400  @  0.477709
0 Tue Apr 04 11:57:40, Info:  #max-updates since sync:  1500  @  0.477903
0 Tue Apr 04 11:57:41, Info:  #max-updates since sync:  1600  @  0.478096
0 Tue Apr 04 11:57:41, Info:  #max-updates since sync:  1700  @  0.47829
0 Tue Apr 04 11:57:42, Info:  #max-updates since sync:  1800  @  0.478483
0 Tue Apr 04 11:57:42, Info:  #max-updates since sync:  1900  @  0.478677
0 Tue Apr 04 11:57:42, Error: Detected Inf/NaN in receiver output at 10 , 10 , 10 . Aborting.
Backtrace:
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic(_ZN5utils6LoggerD1Ev+0x477) [0x426d77]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x6f0620]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x7ed854]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x7ea6d0]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x7ea5fe]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x7f0281]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x6e791e]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x6c853e]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x7167b5]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x691f42]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x418947]
/lib64/libc.so.6(__libc_start_main+0xef) [0x7f324f1032bd]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x41876a]
Abort(134) on node 29 (rank 29 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 134) - process 29
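
When comparing several such runs, it can be convenient to pull the key numbers out of the log automatically. A minimal sketch, assuming the log format matches the lines quoted in this thread (other SeisSol versions may print them differently) and assuming each counted update advances the simulation by one time step, which gives a rough time-step estimate from the `#max-updates` lines within one sync window:

```python
import re

# Patterns follow the log lines quoted above; they are not an official SeisSol log specification.
NUM = r"(\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)"
FAULT_RE = re.compile(r"Writing faultoutput at time " + NUM + r"\. Done\.")
UPDATE_RE = re.compile(r"#max-updates since sync:\s+(\d+)\s+@\s+" + NUM)
ERROR_RE = re.compile(r"Error: (Detected Inf/NaN in .+?)\s*\.\s*Aborting\.")


def summarize(log_text: str) -> dict:
    """Extract the last completed fault output time, a time-step estimate and the first Inf/NaN error."""
    last_fault, updates, error = None, [], None
    for line in log_text.splitlines():
        m = FAULT_RE.search(line)
        if m:
            last_fault = float(m.group(1))
            continue
        m = UPDATE_RE.search(line)
        if m:
            updates.append((int(m.group(1)), float(m.group(2))))
            continue
        m = ERROR_RE.search(line)
        if m and error is None:
            error = m.group(1)
    approx_dt = None
    if len(updates) >= 2 and updates[-1][0] != updates[0][0]:
        (n0, t0), (n1, t1) = updates[0], updates[-1]
        approx_dt = (t1 - t0) / (n1 - n0)  # simulated time per counted update
    return {"last_fault_time": last_fault, "approx_dt": approx_dt, "error": error}


# Excerpt of the GTS log above.
sample = """\
0 Tue Apr 04 11:57:35, Info:  #max-updates since sync:  100  @  0.475194
0 Tue Apr 04 11:57:38, Info:  #max-updates since sync:  1000  @  0.476935
0 Tue Apr 04 11:57:42, Info:  #max-updates since sync:  1900  @  0.478677
0 Tue Apr 04 11:57:42, Error: Detected Inf/NaN in receiver output at 10 , 10 , 10 . Aborting.
"""
summary = summarize(sample)
print(summary)
```

For this excerpt the estimate comes out at roughly 1.9e-6 s per update, and the abort follows the last reported update at 0.478677 s.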

francescomosconi · 2023-04-05T12:45:15Z

Hi,
I tested ReceiverOutputInterval = 0.01 with the refined mesh on the GPU version.
We avoid the inf/NaN in receiver output, but the simulation fails anyway with this error: Detected Inf/NaN in energies. Aborting.
The end time was 0.05 s, while the error occurred at 0.02 s.
I'm running on 48 nodes of the Marconi100 cluster.

This is the last time step:

Please let me know if I can help with anything.

Thomas-Ulrich · 2023-04-05T14:18:19Z

Hi,
I think there might be two issues.
Which order did you use? Single or double precision? What is the name of the mesh? CFL 0.5?

francescomosconi · 2023-04-05T18:01:38Z

Order 4, single precision and CFL = 0.25.
The mesh name is fear_15_500cm.

AliceGabriel · 2023-04-24T09:41:00Z

This is a wild shot, but can you try rerunning one of these simulations with sliprateoutputtype=0? https://seissol.readthedocs.io/en/latest/dynamic-rupture.html#visualisation-sliprateoutputtype-default-1

krenzland · 2023-04-24T15:37:58Z

@Thomas-Ulrich Can you maybe check this setup on one node on SuperMUC (fat node if necessary)?
If this works, then it might be connected to #839; if not, then it's probably unconnected.

francescomosconi · 2023-05-03T15:17:44Z

Hi,
I tried different models with SlipRateOutputType=0.
This resolves some numerical oscillations in the slip velocity, but it does not prevent the inf/NaN error in the receiver output.

Thomas-Ulrich · 2023-06-05T18:41:17Z

> Hi, Somehow, I found a fix to your problem. Use: ReceiverOutputInterval = 0.01 (see https://seissol.readthedocs.io/en/latest/parameter-file.html) Don't ask me why, I don't know.
>
> Thomas.

It seems this problem could have been explained and fixed by #875

francescomosconi added the bug label Mar 14, 2023

Thomas-Ulrich changed the title from "inf/Nan in receiver output (GPU)" to "inf/Nan in receiver output (GPU and CPU)" Apr 13, 2023

Thomas-Ulrich mentioned this issue Apr 13, 2023

Detected Inf/NaN in free surface output or not depending on the number of nodes #839

Open
