MPI error on SuperMUC-NG #771
Comments
Do you have the following line in your parameter file? See #691.
Hi, I did not have this line in my submission file when the problem first occurred. But I added it afterwards, changed the CFL from 0.45 to 0.35, and used mpiexec instead of srun. Since then the error has not occurred again at this timestep, but it did occur twice at a later timestep (I did not notice the first occurrence because a node failure overwrote it in the log):
Sounds like a memory leak?
Hi, this issue still persists with the latest version of SeisSol (nico/2nuc-latest-master branch). The model failed after simulating 145.3 s. This is the first error message: Both times it happened with the Ridgecrest setup, but with very different meshes (smooth vs. rough fault, different topography resolution, different domain size).
As an aside, you should definitely increase your receiver output interval a bit (maybe 0.1s like the free surface output?).
Note that this is not the sampling rate but rather how often the files are written!
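For illustration, the relevant &Output entries might look roughly like the sketch below. This is only a sketch: the exact parameter names and sensible values should be checked against the SeisSol documentation and your own parameter file.

```
&Output
  ! ... other output settings unchanged ...
  pickdt = 0.005                ! receiver sampling interval (leave as needed)
  ReceiverOutputInterval = 0.1  ! how often the receiver files are written to disk
/
```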
Maybe you'll also benefit from using the features from #952. Currently, we sort of have a memory increase by storing profiling data all the time. That may accumulate eventually (cf. #948), though that may happen only with very frequent updates. So far, I don't think we've encountered a problem with that in "production"—but are we completely sure about that? Maybe we should output the resident set size for each thread from time to time, just to get more info on everything? That is, call
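For what it's worth, here is a minimal Linux-only sketch (not SeisSol's actual code) of such a printout: it reads the current resident set size of a process from /proc/self/statm and prints it per MPI rank. In a real run it would be called periodically, e.g. at each synchronization point or output interval.

```cpp
#include <mpi.h>
#include <unistd.h>
#include <cstdio>
#include <fstream>

// Current resident set size of this process in bytes (Linux only),
// read from /proc/self/statm; returns 0 if the file cannot be parsed.
static long currentRssBytes() {
  std::ifstream statm("/proc/self/statm");
  long sizePages = 0;
  long rssPages = 0;
  if (!(statm >> sizePages >> rssPages)) {
    return 0;
  }
  return rssPages * sysconf(_SC_PAGESIZE);
}

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  // Hypothetical one-off printout; a real integration would trigger this
  // periodically rather than only once.
  std::printf("rank %d: current RSS = %ld MiB\n", rank, currentRssBytes() >> 20);
  MPI_Finalize();
  return 0;
}
```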
Is the profiling data storing frequency also given by
It is usually not stored at all (unless specified), only used for the regression analysis at the end (cf. e.g. here: https://gitlab.lrz.de/ci_seissol/SeisSol/-/jobs/6740080#L271 ). However, the samples are accumulated over the course of the program—that is, each time a cluster gets updated, an entry is added.
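To illustrate the growth pattern (a hypothetical sketch, not SeisSol's actual data structures): if every cluster update appends one sample to a container that is only read at the end of the run, memory usage grows linearly with the number of updates, which can become noticeable for long runs with very frequently updating clusters.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical profiling sample: one entry per time-cluster update.
struct Sample {
  double updateTime;  // wall time spent in one cluster update
  int clusterId;      // which time cluster was updated
};

int main() {
  std::vector<Sample> samples;  // only consumed by the analysis at the very end
  // Small clusters update very frequently, so long runs accumulate many entries.
  for (int step = 0; step < 10000000; ++step) {
    samples.push_back({1.0e-3, step % 8});
  }
  std::printf("stored %zu samples (~%zu MiB)\n", samples.size(),
              samples.size() * sizeof(Sample) / (1024 * 1024));
  return 0;
}
```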
Ok, thanks a lot, I will test if the issue is fixed by #952.
Two more things: there seems to be a similar issue floating around; apparently it's supposed to be fixed by IMPI in version 2023.2 ( https://community.intel.com/t5/Intel-oneAPI-HPC-Toolkit/MPI-program-aborts-with-an-quot-Assertion-failed-in-file-ch4-shm/m-p/1370537 ). Other than that... I can right now only think about increasing
David, I've also seen that. Sadly it is not really actionable for us, as only pretty outdated MPI versions are available on SuperMUC :( |
Hi, the setup did run successfully after merging #952 into nico/2nuc-latest-master (
By the way... Did this error occur with viscoelastic2 only, or also with other materials? |
It occurred only with viscoelastic2 (but I only ran models with viscoelastic2).
Describe the bug
The simulation hangs after a certain timestep, but the job is not killed. The error log shows that this is MPI-related.
Expected behavior
The simulation should run to the end. If an error occurs, the job should be canceled.
To Reproduce
Steps to reproduce the behavior:
Error occurred with nico/2nuc (7f8c759) and fix707-2nuc (a71e59e), without enforcing the state variable to be positive.
Viscoelastic2, double precision, Intel MPI compiler.
SuperMUC-NG:
parameters.par.txt
Screenshots/Console output
If you suspect a problem in the numerics/physics add a screenshot of your output.
If you encounter any errors/warnings/... during execution please provide the console output.
2528718.ridgecrest.err.txt
2528718.ridgecrest.out.txt
Additional context
Add any other context about the problem here.
The setup did already run successfully with nico/2nuc (7f8c759), but without writing surface/volume output.
2526877.ridgecrest.out.txt