
MPI error on SuperMUC-NG #771

Open
NicoSchlw opened this issue Jan 2, 2023 · 14 comments

@NicoSchlw (Contributor)

Describe the bug
The simulation hangs after a certain timestep, but the job is not killed. The error log shows that the problem is MPI-related.

Expected behavior
The simulation should run to the end. If an error occurs, the job should be canceled.

To Reproduce
Steps to reproduce the behavior:

  1. Which version do you use? Provide branch and commit id.
    The error occurred with nico/2nuc (7f8c759) and with fix707-2nuc (a71e59e), without enforcing the state variable to be positive.
  2. Which build settings do you use? Which compiler version do you use?
    viscoelastic2, double precision, Intel MPI compiler.
  3. On which machine does your problem occur? If on a cluster: Which modules are loaded?
    SuperMUC-NG:
    [screenshot of the loaded modules attached: "Screenshot from 2023-01-02 10-10-55"]
  4. Provide parameter/material files.
    parameters.par.txt

Screenshots/Console output
If you suspect a problem in the numerics/physics add a screenshot of your output.

If you encounter any errors/warnings/... during execution please provide the console output.
2528718.ridgecrest.err.txt
2528718.ridgecrest.out.txt

Additional context
Add any other context about the problem here.
The setup had already run successfully with nico/2nuc (7f8c759), but without writing surface/volume output.
2526877.ridgecrest.out.txt

NicoSchlw added the bug label on Jan 2, 2023
@krenzland (Contributor)

Do you have the following line in your submission (job) script:

export I_MPI_SHM_HEAP_VSIZE=8192

See #691
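
For reference, a minimal sketch of where that line goes in a SLURM submission script (the executable name and the other directives are placeholders; the launcher can be srun or mpiexec, as used later in this thread):

    #!/bin/bash
    # ... #SBATCH directives and module loads ...
    export I_MPI_SHM_HEAP_VSIZE=8192   # must be exported before the MPI launcher starts SeisSol
    srun ./SeisSol parameters.par      # placeholder executable name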

@NicoSchlw (Contributor, Author)

Hi, I did not have this line in my submission file when the problem first occurred. I added it afterwards, changed the CFL from 0.45 to 0.35, and used mpiexec instead of srun. Since then the error has not occurred again at that timestep, but it has occurred twice at a later timestep (I did not notice the first occurrence because a node failure overprinted it):

2553177.ridgecrest.out.txt
2553177.ridgecrest.err.txt

@Thomas-Ulrich (Contributor)

Sounds like a memory leak?

@NicoSchlw (Contributor, Author)

Hi, this issue still persists with the latest version of SeisSol (branch nico/2nuc-latest-master, commit 908b691).
3145028.ridgecrest.out.txt
parameters.par.txt

The model failed after simulating 145.3 s. This is the first error message:
Assertion failed in file ../../src/mpid/ch4/src/intel/ch4_shm_coll.c at line 2246: comm->shm_numa_layout[my_numa_node].base_addr

Both times it happened with the Ridgecrest setup, but with very different meshes (smooth vs. rough fault, different topography resolution, different domain size).

@krenzland (Contributor)

As an aside, you should definitely increase your receiver output interval a bit (maybe to 0.1 s, like the free surface output?).

ReceiverOutputInterval = 0.01

Note that this is not the sampling rate but rather how often the files are written!
This will significantly reduce the number of synchronization points. You can see this in the lines that look like this:

Thu Oct 05 22:32:20, Info: Wrote receivers in 3.85e-07 seconds.
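
For concreteness, the suggestion amounts to changing that line in parameters.par to something like (the exact value is a judgment call):

    ReceiverOutputInterval = 0.1   ! write receiver files every 0.1 s; the sampling rate itself is unchanged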

@davschneller (Contributor)

Maybe you'll also benefit from using the features from #952. Currently we have a gradual memory increase from storing profiling data all the time. That may accumulate eventually (cf. #948), though it may matter only with very frequent updates. So far, I don't think we've encountered a problem with that in "production", but are we completely sure about that?

Maybe we should output the resident set size for each thread from time to time, just to get more info on everything? That is, call getrusage from time to time with RUSAGE_THREAD; maybe at each sync point. https://man7.org/linux/man-pages/man2/getrusage.2.html
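
A minimal sketch of that idea (generic Linux/C++ code, not SeisSol code; the function name and log format are made up):

    #include <sys/resource.h>
    #include <cstdio>

    // Log the calling thread's maximum resident set size, e.g. at each sync point.
    // RUSAGE_THREAD is Linux-specific and may require _GNU_SOURCE on some setups;
    // ru_maxrss is reported in kilobytes.
    void logThreadMaxRss(const char* tag) {
      struct rusage usage;
      if (getrusage(RUSAGE_THREAD, &usage) == 0) {
        std::printf("[%s] thread max RSS: %ld kB\n", tag, usage.ru_maxrss);
      }
    }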

@NicoSchlw (Contributor, Author)

Is the frequency at which the profiling data is stored also given by ReceiverOutputInterval?

@davschneller (Contributor)

It is usually not stored at all (unless specified); it is only used for the regression analysis at the end (cf. e.g. https://gitlab.lrz.de/ci_seissol/SeisSol/-/jobs/6740080#L271). However, the samples are accumulated over the course of the program: each time a cluster gets updated, an entry is added.
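
As a generic illustration of why that accumulation grows memory over a long run (hypothetical code, not the actual SeisSol data structure):

    #include <vector>

    // Hypothetical profiling store: one sample is appended per cluster update,
    // so memory grows linearly with the number of updates unless the samples
    // are aggregated or capped.
    struct ProfilingSamples {
      std::vector<double> updateTimes;
      void onClusterUpdate(double seconds) { updateTimes.push_back(seconds); }
    };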

@NicoSchlw (Contributor, Author)

OK, thanks a lot. I will test whether the issue is fixed by #952.

@davschneller (Contributor)

Two more things: there seems to be a similar issue floating around; apparently it is supposed to be fixed by Intel MPI 2023.2 (https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/MPI-program-aborts-with-an-quot-Assertion-failed-in-file-ch4-shm/m-p/1370537). Other than that, I can right now only think of increasing I_MPI_SHM_HEAP_VSIZE even more, or of adding some I_MPI_DEBUG level, if you can sustain much larger logs.
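
For concreteness, both knobs are set in the submission script, e.g. (the values are only illustrative):

    export I_MPI_SHM_HEAP_VSIZE=16384   # illustrative: larger than the 8192 used above
    export I_MPI_DEBUG=5                # higher levels print more per-rank diagnostics and inflate the logs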

@krenzland (Contributor)

David, I've also seen that. Sadly it is not really actionable for us, as only pretty outdated MPI versions are available on SuperMUC :(

@NicoSchlw (Contributor, Author)

Hi, the setup ran successfully after merging #952 into nico/2nuc-latest-master (57bb256) and increasing I_MPI_SHM_HEAP_VSIZE to 32768. Maybe the HEAP_VSIZE was the crucial factor, since increasing it had already delayed the error earlier (quoting my comment above):

"I did not have this line in my submission file when the problem first occurred. I added it afterwards, changed the CFL from 0.45 to 0.35, and used mpiexec instead of srun. Since then the error has not occurred again at that timestep, but it has occurred twice at a later timestep (I did not notice the first occurrence because a node failure overprinted it):"

2553177.ridgecrest.out.txt 2553177.ridgecrest.err.txt

@davschneller (Contributor)

By the way... Did this error occur with viscoelastic2 only, or also with other materials?

@NicoSchlw (Contributor, Author)

It occurred only with viscoelastic2 (but I only ran models with viscoelastic2).
