
inf/Nan in receiver output (GPU and CPU) #818

Open
francescomosconi opened this issue Mar 14, 2023 · 27 comments

@francescomosconi

francescomosconi commented Mar 14, 2023

Describe the bug
The simulation aborts with the error "inf/NaN in receiver output" at (10; 10; 10).

Expected behavior
The simulation should run to the end.

To Reproduce
Steps to reproduce the behavior:

  1. Which version do you use? Provide branch and commit id.
    SeisSol version is 202103_Sumatra-1090-g7ae55cfa (last compilation: 2 weeks ago), master branch
  2. Which build settings do you use? Which compiler version do you use?
    GPU setting with CUDA and MPI compilers. Single precision. We tested polynomial order 4, 6 and 7.
  3. On which machine does your problem occur? If on a cluster: Which modules are loaded?
    Marconi100:

Schermata 2023-03-14 alle 12 52 27

  4. Provide parameter/material files.
    parameters.txt

Screenshots/Console output
The seissol.err output is provided in:
err_rec_output.txt
Visualizing the results in ParaView, this appears just before the error occurs:
Schermata 2023-03-11 alle 12 12 18

Additional context
We tested different setups on the cluster, from 24 to 64 nodes, to rule out memory problems, but the error still occurs, only at a different simulation time. Varying Dc also changes the time at which the error occurs (we use FL=16).
The mesh was checked, and we also tried decreasing the CFL number from 0.5 to 0.25.
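
For context, a rough sketch of why lowering the CFL number is a usual first stability test. It assumes a time-step estimate of the form dt = CFL * 2 * r_in / ((2N + 1) * v_p), which is a common choice for explicit DG schemes; the function name and all numbers are illustrative assumptions, not values from this setup.

```python
def estimate_time_step(cfl, insphere_radius, p_wave_speed, degree):
    """Estimate a stable time step for one element (assumed, illustrative formula)."""
    # Assumed estimate: dt = CFL * 2 * r_in / ((2N + 1) * v_p)
    return cfl * 2.0 * insphere_radius / ((2 * degree + 1) * p_wave_speed)


# Hypothetical element: 0.5 m insphere radius, v_p = 6000 m/s, polynomial degree 3
dt_half = estimate_time_step(0.5, 0.5, 6000.0, 3)
dt_quarter = estimate_time_step(0.25, 0.5, 6000.0, 3)
print(dt_half, dt_quarter)  # the second value is half of the first
```

Under this estimate, halving CFL halves the time step and doubles the number of updates. If the failure persists, as reported here, a plain time-step violation becomes a less likely explanation, although this sketch alone cannot rule it out.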

Additionally, we shortened the run so that it ends before the inf/NaN appears, but we then get this error when hipSYCL shuts down (see the seissol_out.txt and seissol_err.txt files; it may be related):
from /m100/home/userexternal/rdorozhi/Downloads/gcc11.2.0/hipSYCL/src/runtime/cuda/cuda_allocator.cpp:143 @ query_pointer(): cuda_allocator: query_pointer(): pointer is unknown by backend (error code = CUDA:0)
seissol_err.txt
seissol_out.txt

Many thanks in advance.

@ravil-mobile
Collaborator

Hi @francescomosconi,

Can you re-compile SeisSol with double precision and try again?

@francescomosconi
Author

Hi @ravil-mobile thanks for your reply,

I tried with double precision. The run no longer fails with the inf/NaN error, but at the same time step where it previously failed we see this strange behaviour of the slip velocity:
Schermata 2023-03-15 alle 16 32 55
Schermata 2023-03-15 alle 16 38 33

@ravil-mobile
Collaborator

You use Linear Slip Weakening, right? How large is your mesh, and could you share it with me if possible?

@ravil-mobile
Collaborator

I am also curious whether an older version of SeisSol leads to the same results. Can you also try to run this simulation on CPUs and compare the results?
You will need to change the following CMake parameters: -DDEVICE_ARCH=none -DDEVICE_BACKEND=none -DHOST_ARCH=power9

If we can find an older git commit that gives correct results on GPUs (relative to SeisSol's CPU version), then it will be easy to determine which of SeisSol's recent changes caused the issue.
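
The search for the offending commit can be sketched as a binary search over the history, which is what git bisect automates. This is a minimal illustration: the commit names and the is_good predicate are hypothetical, and in practice is_good would mean "build this commit, run the GPU case, compare against the CPU result".

```python
def first_bad_commit(commits, is_good):
    """Return the first commit for which is_good() is False.

    Assumes commits are ordered oldest to newest, the oldest is good,
    the newest is bad, and the history switches from good to bad once.
    """
    lo, hi = 0, len(commits) - 1  # commits[lo] is good, commits[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid
        else:
            hi = mid
    return commits[hi]


# Hypothetical history in which commit "c5" introduced the regression
history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
culprit = first_bad_commit(history, lambda c: int(c[1:]) < 5)
print(culprit)  # c5
```

With n commits between the known good and known bad versions, this needs about log2(n) builds and runs, which matters when each test is a multi-node simulation.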

@ravil-mobile
Collaborator

Hi @francescomosconi,

What is the maximum slip rate? Can you rescale the SRd field using the auto-scale option?

@francescomosconi
Author

Hi @ravil-mobile

With the auto scale, the values range from 1e-50 to 1e50.
This is the snapshot of the last time step:
Schermata 2023-03-21 alle 11 37 32

This is the same time step with the scale set from 0 to 1e10:
Schermata 2023-03-21 alle 11 44 16

@ravil-mobile
Collaborator

Is this output at t=0.002 sec?

@francescomosconi
Author

It is at t=0.021s.

@ravil-mobile
Collaborator

Yes, 1e50 is too big. Something is definitely wrong in SeisSol. It is difficult to debug. I am running your coarse mesh on CPUs with the end of simulation time set to 0.9 sec. Then I will do the same on GPUs.
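
One possible reading of the single- versus double-precision runs: a magnitude of 1e50 is representable in double precision but lies above the largest finite single-precision value (about 3.4e38), where it would overflow to inf. A minimal sketch of that range check, using only the IEEE 754 limits (the helper name is hypothetical):

```python
import math

# Largest finite IEEE 754 single-precision value: (2 - 2**-23) * 2**127
FLOAT32_MAX = (2.0 - 2.0 ** -23) * 2.0 ** 127


def fits_in_float32(value):
    """Return True if value is finite and within the float32 range."""
    return math.isfinite(value) and abs(value) <= FLOAT32_MAX


print(FLOAT32_MAX)            # about 3.4e38
print(fits_in_float32(1e10))  # True
print(fits_in_float32(1e50))  # False
```

This would be consistent with the double-precision run continuing with huge values while the single-precision run aborts, but it does not explain why the values grow in the first place.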

@francescomosconi
Author

Hi @ravil-mobile
do you have any news?
We compiled SeisSol for CPUs on another small cluster and ran the coarse mesh with an end time of 1.0 s. The same error, inf/NaN in receiver output, occurs at 0.54 s.
Unfortunately, we cannot also test the refined mesh there because of the limited computational power.

@ravil-mobile
Collaborator

Hi @francescomosconi,

I was running your scenario on a local CPU server. 850k elements were too many for it. Could you halve the mesh size by coarsening?

The same error inf/NaN in receivers output occur at 0.54s.

Did I understand you correctly that your simulation with the coarse mesh (the one you gave me) was aborted at 0.54s with "inf/NaN in receivers output" while running SeisSol on CPUs? Please confirm whether this is correct, because it is important for me.

@francescomosconi
Author

Hi @ravil-mobile

yes, I can confirm that.

OK, I will provide you with a coarser mesh.

Maybe this is unrelated, but I saw similar behavior in a previous issue (https://github.com/SeisSol/SeisSol/issues/546). Is it possible that the error is due to numerical oscillations?
Below I show P_n during our simulations:

modello_asperity

modello_injection

@ravil-mobile
Collaborator

Hi, @Thomas-Ulrich and @sebwolf-de . This issue is also related to the SeisSol-CPU version. Do you have any idea?

@Thomas-Ulrich
Contributor

How much simulation time does it take to reach the error?

@francescomosconi
Author

francescomosconi commented Mar 31, 2023

Hi @Thomas-Ulrich
the simulation time at which the process aborts (for the CPU version) is 0.54 s out of the total 1 s.

I made an additional test with the same mesh and parameters, imposing a high value of Dc at a radius of 10 m, so that the rupture is forced to stop abruptly and we avoid the numerical oscillations that appear when the rupture propagates slowly. In that case the simulation runs on and finishes correctly.
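
To illustrate the test described above, here is a minimal sketch of linear slip weakening with a Dc that is small inside a radius and very large outside it. The friction law is the standard linear form; the spatial distribution, the reading of "at a radius of 10 m" as "beyond 10 m", and every numerical value are assumptions for illustration only.

```python
import math


def friction_coefficient(slip, mu_s, mu_d, d_c):
    """Linear slip weakening: friction drops linearly from mu_s to mu_d over d_c."""
    return mu_s - (mu_s - mu_d) * min(slip / d_c, 1.0)


def critical_distance(x, y, radius=10.0, d_c_inside=0.01, d_c_outside=1.0e3):
    """Hypothetical Dc distribution: small inside the radius, very large outside."""
    return d_c_inside if math.hypot(x, y) <= radius else d_c_outside


# Same slip evaluated inside and outside the 10 m radius (illustrative values)
slip = 0.02
mu_in = friction_coefficient(slip, 0.6, 0.2, critical_distance(5.0, 0.0))
mu_out = friction_coefficient(slip, 0.6, 0.2, critical_distance(15.0, 0.0))
print(mu_in, mu_out)  # fully weakened inside, almost static friction outside
```

Outside the radius the fault barely weakens for realistic slip, so the rupture front is arrested instead of creeping forward slowly.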

@Thomas-Ulrich
Contributor

I can confirm this error with the small mesh (while the fault output looks OK) on SuperMUC-NG with 16 nodes:

0 Thu Mar 30 18:40:46, Info:  Writing faultoutput at time 0.476.
0 Thu Mar 30 18:40:46, Info:  Writing faultoutput at time 0.476. Done.
0 Thu Mar 30 18:40:46, Info:  Performance since the start: 8.17978 TFLOP/s (rank 0: 253.11 GFLOP/s, average over ranks: 255.618 GFLOP/s)
0 Thu Mar 30 18:40:46, Info:  Performance since last sync point: 8.13081 TFLOP/s (rank 0: 251.595 GFLOP/s, average over ranks: 254.088 GFLOP/s)
0 Thu Mar 30 18:40:48, Info:  Writing faultoutput at time 0.477.
0 Thu Mar 30 18:40:48, Info:  Writing faultoutput at time 0.477. Done.
0 Thu Mar 30 18:40:48, Info:  Performance since the start: 8.17907 TFLOP/s (rank 0: 253.088 GFLOP/s, average over ranks: 255.596 GFLOP/s)
0 Thu Mar 30 18:40:48, Info:  Performance since last sync point: 7.85701 TFLOP/s (rank 0: 243.141 GFLOP/s, average over ranks: 245.531 GFLOP/s)
0 Thu Mar 30 18:40:51, Info:  Writing faultoutput at time 0.478.
0 Thu Mar 30 18:40:51, Info:  Writing faultoutput at time 0.478. Done.
0 Thu Mar 30 18:40:51, Info:  Performance since the start: 8.17872 TFLOP/s (rank 0: 253.077 GFLOP/s, average over ranks: 255.585 GFLOP/s)
0 Thu Mar 30 18:40:51, Info:  Performance since last sync point: 8.01478 TFLOP/s (rank 0: 248.006 GFLOP/s, average over ranks: 250.462 GFLOP/s)
0 Thu Mar 30 18:40:52, Error: Detected Inf/NaN in receiver output at 10 , 10 , 10 . Aborting.
Backtrace:
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic(_ZN5utils6LoggerD1Ev+0x477) [0x426d77]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x6f0620]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x7ed854]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x7ea6d0]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x7ea5fe]
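
The abort above is triggered by a finiteness check on the receiver samples. As a rough sketch of that kind of check (not SeisSol's actual implementation; the function name and the sample data are hypothetical):

```python
import math


def first_non_finite(samples):
    """Return the index of the first inf/NaN sample, or None if all are finite."""
    for index, value in enumerate(samples):
        if not math.isfinite(value):
            return index
    return None


# Hypothetical receiver trace that blows up on the fourth sample
trace = [0.0, 1.2e-3, 4.5e2, float("inf"), float("nan")]
print(first_non_finite(trace))  # 3
```

A check like this only reports where the non-finite value is first observed in the output; the value may have been produced earlier, between two output times.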

@Thomas-Ulrich
Contributor

Hi,
Somehow, I found a fix to your problem. Use:
ReceiverOutputInterval = 0.01
(see https://seissol.readthedocs.io/en/latest/parameter-file.html)
Don't ask me why, I don't know.

Thomas.

@krenzland
Contributor

That doesn't make sense. The fault output interval seems to be 0.002; hence, changing ReceiverOutputInterval to 0.01 should not introduce a new sync point.
Thomas, does it work with GTS?
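
The argument can be checked with a small sketch. It assumes that sync points fall on integer multiples of each output interval (an assumption about the scheduling, not something confirmed in this thread) and uses exact fractions to avoid floating-point rounding.

```python
from fractions import Fraction


def output_times(interval, end_time):
    """Return the set of output times k * interval up to end_time (exact arithmetic)."""
    step = Fraction(interval)
    end = Fraction(end_time)
    times = set()
    k = 1
    while k * step <= end:
        times.add(k * step)
        k += 1
    return times


fault_times = output_times("0.002", "0.05")
receiver_times = output_times("0.01", "0.05")
print(receiver_times <= fault_times)  # True: every receiver time is already a fault output time
```

Under that assumption, a receiver interval of 0.01 adds no sync point that the 0.002 fault output does not already impose, which is why the reported fix is surprising.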

@Thomas-Ulrich
Contributor

Actually, I also changed the fault output printtimeinterval_sec to 0.005 s. With this new interval, I get the following error (very similar to the previous one) when ReceiverOutputInterval = 0.01 is not set, and no error when it is set.
(The folder can be found at /hppfs/work/pr45fi/di73yeq4/bug-Mosconi)

0 Mon Apr 03 17:35:23, Info:  Writing faultoutput at time 0.425. Done.
0 Mon Apr 03 17:35:23, Info:  Performance since the start: 17.047 TFLOP/s (rank 0: 534.371 GFLOP/s, average over ranks: 532.718 GFLOP/s)
0 Mon Apr 03 17:35:23, Info:  Performance since last sync point: 17.0601 TFLOP/s (rank 0: 546.433 GFLOP/s, average over ranks: 533.128 GFLOP/s)
0 Mon Apr 03 17:35:27, Error: Detected Inf/NaN in receiver output at 10 , 10 , 10 . Aborting.

@Thomas-Ulrich
Contributor

GTS, same problem:

0 Tue Apr 04 11:57:33, Info:  Performance since the start: 32.3926 TFLOP/s (rank 0: 1010.18 GFLOP/s, average over ranks: 1012.27 GFLOP/s)
0 Tue Apr 04 11:57:33, Info:  Performance since last sync point: 32.0423 TFLOP/s (rank 0: 1001.17 GFLOP/s, average over ranks: 1001.32 GFLOP/s)
0 Tue Apr 04 11:57:35, Info:  #max-updates since sync:  100  @  0.475194
0 Tue Apr 04 11:57:35, Info:  #max-updates since sync:  200  @  0.475387
0 Tue Apr 04 11:57:35, Info:  #max-updates since sync:  300  @  0.475581
0 Tue Apr 04 11:57:36, Info:  #max-updates since sync:  400  @  0.475774
0 Tue Apr 04 11:57:36, Info:  #max-updates since sync:  500  @  0.475968
0 Tue Apr 04 11:57:37, Info:  #max-updates since sync:  600  @  0.476161
0 Tue Apr 04 11:57:37, Info:  #max-updates since sync:  700  @  0.476355
0 Tue Apr 04 11:57:37, Info:  #max-updates since sync:  800  @  0.476548
0 Tue Apr 04 11:57:38, Info:  #max-updates since sync:  900  @  0.476742
0 Tue Apr 04 11:57:38, Info:  #max-updates since sync:  1000  @  0.476935
0 Tue Apr 04 11:57:39, Info:  #max-updates since sync:  1100  @  0.477129
0 Tue Apr 04 11:57:39, Info:  #max-updates since sync:  1200  @  0.477322
0 Tue Apr 04 11:57:40, Info:  #max-updates since sync:  1300  @  0.477516
0 Tue Apr 04 11:57:40, Info:  #max-updates since sync:  1400  @  0.477709
0 Tue Apr 04 11:57:40, Info:  #max-updates since sync:  1500  @  0.477903
0 Tue Apr 04 11:57:41, Info:  #max-updates since sync:  1600  @  0.478096
0 Tue Apr 04 11:57:41, Info:  #max-updates since sync:  1700  @  0.47829
0 Tue Apr 04 11:57:42, Info:  #max-updates since sync:  1800  @  0.478483
0 Tue Apr 04 11:57:42, Info:  #max-updates since sync:  1900  @  0.478677
0 Tue Apr 04 11:57:42, Error: Detected Inf/NaN in receiver output at 10 , 10 , 10 . Aborting.
Backtrace:
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic(_ZN5utils6LoggerD1Ev+0x477) [0x426d77]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x6f0620]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x7ed854]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x7ea6d0]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x7ea5fe]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x7f0281]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x6e791e]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x6c853e]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x7167b5]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x691f42]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x418947]
/lib64/libc.so.6(__libc_start_main+0xef) [0x7f324f1032bd]
/hppfs/work/pr45fi/di73yeq4/SeisSol/build/SeisSol_Release_sskx_4_elastic() [0x41876a]
Abort(134) on node 29 (rank 29 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 134) - process 29
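
As a side note, the progress lines in this log allow a rough estimate of the time advanced per update. The sketch below parses three of the lines quoted above; reading one "update" as one global time step is an assumption.

```python
import re

LOG = """\
0 Tue Apr 04 11:57:35, Info:  #max-updates since sync:  100  @  0.475194
0 Tue Apr 04 11:57:35, Info:  #max-updates since sync:  200  @  0.475387
0 Tue Apr 04 11:57:42, Info:  #max-updates since sync:  1900  @  0.478677
"""

PATTERN = re.compile(r"#max-updates since sync:\s+(\d+)\s+@\s+([0-9.]+)")


def mean_time_step(log_text):
    """Estimate the simulated time per update from the first and last progress lines."""
    points = [(int(n), float(t)) for n, t in PATTERN.findall(log_text)]
    (n0, t0), (n1, t1) = points[0], points[-1]
    return (t1 - t0) / (n1 - n0)


dt = mean_time_step(LOG)
print(f"{dt:.3e}")
```

The estimate comes out at roughly 1.9e-6 s per update, so the abort at about 0.4787 s happens after a very large number of updates, not in the first few steps.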

@francescomosconi
Author

Hi,
I tested setting ReceiverOutputInterval = 0.01 with the refined mesh on the GPU version.
This avoids the inf/NaN in receiver output, but the simulation still fails, now with: Error: Detected Inf/NaN in energies. Aborting.
The end time was 0.05 s, and the error occurred at 0.02 s.
I'm running on 48 nodes of the Marconi100 cluster.

Schermata 2023-04-05 alle 14 38 15

This is the last timestep:
Schermata 2023-04-05 alle 14 41 50

Please let me know if I can help with anything.

@Thomas-Ulrich
Contributor

Hi,
I think there might be 2 issues.
Which order did you use? Single or double precision? What is the name of the mesh? Is the CFL 0.5?

@francescomosconi
Author

francescomosconi commented Apr 5, 2023

Order 4, single precision, and CFL = 0.25.
Mesh name is fear_15_500cm

@Thomas-Ulrich changed the title from "inf/Nan in receiver output (GPU)" to "inf/Nan in receiver output (GPU and CPU)" on Apr 13, 2023
@AliceGabriel
Contributor

AliceGabriel commented Apr 24, 2023

This is a wild shot, but can you try rerunning one of these simulations with sliprateoutputtype=0? https://seissol.readthedocs.io/en/latest/dynamic-rupture.html#visualisation-sliprateoutputtype-default-1

@krenzland
Contributor

@Thomas-Ulrich Can you maybe check this setup on one node on SuperMUC (fat node if necessary)?
If this works, then it might be connected to #839; if not, then it's probably unconnected.

@francescomosconi
Author

Hi,
I tried different models with SlipRateOutputType=0.
This resolves some of the numerical oscillations in the slip velocity, but it doesn't avoid the inf/NaN in receiver output error.

@Thomas-Ulrich
Contributor

Thomas-Ulrich commented Jun 5, 2023

Hi, Somehow, I found a fix to your problem. Use: ReceiverOutputInterval = 0.01 (see https://seissol.readthedocs.io/en/latest/parameter-file.html) Don't ask me why, I don't know.

Thomas.

It seems this problem could have been explained and fixed by #875
