
MPI error on SuperMUC-NG #771

Open
NicoSchlw opened this issue Jan 2, 2023 · 14 comments

@NicoSchlw (Contributor)

Describe the bug
The simulation hangs after a certain timestep, but the job is not killed. The error log shows that the problem is MPI-related.

Expected behavior
The simulation should run to the end. If an error occurs, the job should be canceled.

To Reproduce
Steps to reproduce the behavior:

  1. Which version do you use? Provide branch and commit id.
    The error occurred with nico/2nuc (7f8c759) and with fix707-2nuc (a71e59e), without enforcing the state variable to be positive.
  2. Which build settings do you use? Which compiler version do you use?
    viscoelastic2, double precision, Intel MPI compiler.
  3. On which machine does your problem occur? If on a cluster: Which modules are loaded?
    SuperMUC-NG:
    [screenshot of the loaded modules attached: "Screenshot from 2023-01-02 10-10-55"]
  4. Provide parameter/material files.
    parameters.par.txt

Screenshots/Console output
If you suspect a problem in the numerics/physics add a screenshot of your output.

If you encounter any errors/warnings/... during execution please provide the console output.
2528718.ridgecrest.err.txt
2528718.ridgecrest.out.txt

Additional context
Add any other context about the problem here.
The setup had already run successfully with nico/2nuc (7f8c759), but without writing surface/volume output.
2526877.ridgecrest.out.txt

NicoSchlw added the bug label on Jan 2, 2023
@krenzland (Contributor)

Do you have the following line in your submission (job) script:

export I_MPI_SHM_HEAP_VSIZE=8192

See #691
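
For reference, a minimal sketch of where that line goes in a SLURM submission script (the executable name and the other directives are placeholders; the launcher can be srun or mpiexec, as used later in this thread):

    #!/bin/bash
    # ... #SBATCH directives and module loads ...
    export I_MPI_SHM_HEAP_VSIZE=8192   # must be exported before the MPI launcher starts SeisSol
    srun ./SeisSol parameters.par      # placeholder executable name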

@NicoSchlw (Contributor, Author)

Hi, I did not have this line in my submission file when the problem first occurred. I added it afterwards, changed the CFL from 0.45 to 0.35, and used mpiexec instead of srun. Since then the error has not occurred again at that timestep, but it has occurred twice at a later timestep (I did not notice the first occurrence because a node failure overprinted it):

2553177.ridgecrest.out.txt
2553177.ridgecrest.err.txt

@Thomas-Ulrich (Contributor)

Sounds like a memory leak?

@NicoSchlw (Contributor, Author)

Hi, this issue still persists with the latest version of SeisSol (branch nico/2nuc-latest-master, commit 908b691).
3145028.ridgecrest.out.txt
parameters.par.txt

The model failed after simulating 145.3 s. This is the first error message:
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2246: comm->shm_numa_layout[my_numa_node].base_addr

Both times it happened with the Ridgecrest setup, but with very different meshes (smooth vs. rough fault, different topography resolution, different domain size).

@krenzland (Contributor)

As an aside, you should definitely increase your receiver output interval a bit (maybe to 0.1 s, like the free surface output?).

ReceiverOutputInterval = 0.01

Note that this is not the sampling rate but rather how often the files are written!
This will significantly reduce the number of synchronization points. You can see this in the lines that look like this:

Thu Oct 05 22:32:20, Info: Wrote receivers in 3.85e-07 seconds.
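
For concreteness, the suggestion amounts to changing that line in parameters.par to something like (the exact value is a judgment call):

    ReceiverOutputInterval = 0.1   ! write receiver files every 0.1 s; the sampling rate itself is unchanged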

@davschneller (Contributor)

Maybe you'll also benefit from using the features from #952. Currently we have a gradual memory increase from storing profiling data all the time. That may accumulate eventually (cf. #948), though it may matter only with very frequent updates. So far, I don't think we've encountered a problem with that in "production", but are we completely sure about that?

Maybe we should output the resident set size for each thread from time to time, just to get more info on everything? That is, call getrusage from time to time with RUSAGE_THREAD; maybe at each sync point. https://man7.org/linux/man-pages/man2/getrusage.2.html
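
A minimal sketch of that idea (generic Linux/C++ code, not SeisSol code; the function name and log format are made up):

    #include <sys/resource.h>
    #include <cstdio>

    // Log the calling thread's maximum resident set size, e.g. at each sync point.
    // RUSAGE_THREAD is Linux-specific and may require _GNU_SOURCE on some setups;
    // ru_maxrss is reported in kilobytes.
    void logThreadMaxRss(const char* tag) {
      struct rusage usage;
      if (getrusage(RUSAGE_THREAD, &usage) == 0) {
        std::printf("[%s] thread max RSS: %ld kB\n", tag, usage.ru_maxrss);
      }
    }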

@NicoSchlw (Contributor, Author)

Is the frequency at which the profiling data is stored also given by ReceiverOutputInterval?

@davschneller (Contributor)

It is usually not stored at all (unless specified); it is only used for the regression analysis at the end (cf. e.g. https://gitlab.lrz.de/ci_seissol/SeisSol/-/jobs/6740080#L271). However, the samples are accumulated over the course of the program: each time a cluster gets updated, an entry is added.
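
As a generic illustration of why that accumulation grows memory over a long run (hypothetical code, not the actual SeisSol data structure):

    #include <vector>

    // Hypothetical profiling store: one sample is appended per cluster update,
    // so memory grows linearly with the number of updates unless the samples
    // are aggregated or capped.
    struct ProfilingSamples {
      std::vector<double> updateTimes;
      void onClusterUpdate(double seconds) { updateTimes.push_back(seconds); }
    };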

@NicoSchlw (Contributor, Author)

OK, thanks a lot. I will test whether the issue is fixed by #952.

@davschneller (Contributor)

Two more things: there seems to be a similar issue floating around; apparently it is supposed to be fixed by Intel MPI 2023.2 (https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/MPI-program-aborts-with-an-quot-Assertion-failed-in-file-ch4-shm/m-p/1370537). Other than that, I can right now only think of increasing I_MPI_SHM_HEAP_VSIZE even more, or of adding some I_MPI_DEBUG level, if you can sustain much larger logs.
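
For concreteness, both knobs are set in the submission script, e.g. (the values are only illustrative):

    export I_MPI_SHM_HEAP_VSIZE=16384   # illustrative: larger than the 8192 used above
    export I_MPI_DEBUG=5                # higher levels print more per-rank diagnostics and inflate the logs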

@krenzland (Contributor)

David, I've also seen that. Sadly it is not really actionable for us, as only pretty outdated MPI versions are available on SuperMUC :(

@NicoSchlw (Contributor, Author)

Hi, the setup ran successfully after merging #952 into nico/2nuc-latest-master (57bb256) and increasing I_MPI_SHM_HEAP_VSIZE to 32768. Maybe the HEAP_VSIZE was the crucial factor, since increasing it had already delayed the error earlier (quoting my comment above):

"I did not have this line in my submission file when the problem first occurred. I added it afterwards, changed the CFL from 0.45 to 0.35, and used mpiexec instead of srun. Since then the error has not occurred again at that timestep, but it has occurred twice at a later timestep (I did not notice the first occurrence because a node failure overprinted it):"

2553177.ridgecrest.out.txt 2553177.ridgecrest.err.txt

@davschneller (Contributor)

By the way... Did this error occur with viscoelastic2 only, or also with other materials?

@NicoSchlw (Contributor, Author)

It occurred only with viscoelastic2 (but I only ran models with viscoelastic2).
