Error writing checkpoints at high core counts #548
Comments
I received a similar error running a C360 sim on 1200 cores. The error message I got was
My libraries are
In my run I have NX=10 and NY=120. I used 30 cores per node across 40 nodes with 300 GB of memory per node. Let me know if there's any more information that I can provide.
A couple of things. First, do you set any OMPI_ environment variables or pass any mca options to your launch command? Second, as a test, can you see if adding the WRITE_RESTART_BY_OSERVER setting to your rc file helps?
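For reference, a minimal sketch of what that entry might look like, assuming MAPL's usual `KEY: value` resource-file format; which file it belongs in (for example AGCM.rc for GEOS or GCHP.rc for GCHPctm) depends on the setup and is an assumption here:

```
# Sketch only: the parameter name and the value follow what later comments
# in this thread describe; the exact file placement is an assumption.
WRITE_RESTART_BY_OSERVER: YES
```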
For me,
My only OMPI MCA setting is:
I must admit, I'm not familiar with these settings. Our sysadmin set this and I've used it blindly. For the second point, I'll give it a try.
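For context on the first point, Open MPI MCA parameters are usually set in one of two equivalent ways; the parameter and values below are purely illustrative and are not the site setting referred to above:

```
# Illustrative only, not the actual site configuration.
# As an environment variable picked up by Open MPI at launch:
export OMPI_MCA_pml=ucx
# Or as a flag passed to mpirun (the executable name is a placeholder):
mpirun --mca pml ucx -np 1440 ./model_executable
```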
I've got a rerun in the queue with the suggested setting added. For reference, here is what I actually use:
It looks pretty standard. It looks like you built with UCX instead of verbs, which I think is the current preferred method for InfiniBand. I will note I often have issues myself (see Line 10 in 34eae4b), so I can't complain. If I had a thought from what you've said, it might be to try a newer version of UCX, say one in the 1.8 series or the new 1.9.0. Though maybe that'll just cause different errors...
I checked this morning and my sim is ~3 days in, so it looks like WRITE_RESTART_BY_OSERVER worked! Thanks! What does this switch do?
This is now solved for me as well. My first run kept failing with this error.
I'll ping @weiyuan-jiang on this thread so he can be more specific, but when I was trying to run with Open MPI on Discover, I found that it was taking ages to write out restarts. I think I eventually tracked it down to Open MPI having some bad MPI_GatherV (or Gather? can't remember) timings. Like stupid bad. And guess what calls are used when writing checkpoints/restarts? 😄 So, I asked around and it turns out @weiyuan-jiang added a (somewhat hidden) ability for the IOSERVER to write the restarts instead of the "normal" path. The IOSERVER uses Send/Recv I think, so it bypassed the bad-performing call. Now, I will say that in our GEOSldas @weiyuan-jiang found some sort of oddity happening with it.
This is great. I'll add the line to the default configuration.
@lizziel Note that I only turn this on with Open MPI. I keep our "default" behavior with Intel MPI, etc. because, well, it works, so don't rock the boat. (Well, we do need ...)
I am checking with @bena-nasa. Eventually, we will eliminate the parameter WRITE_RESTART_BY_OSERVER. So far, without this parameter the program goes down a different branch, which may cause problems.
If you do eliminate it, that would probably mean I have to stop using Open MPI on Discover. It is the only way I can write checkpoints, due to the crazy slow MPI_GatherV performance.
Could you update checkpoint writing to be similar to History writing so it avoids the problem? |
Even when we use WRITE_RESTART_BY_OSERVER, we still use mpi_gatherV. I am wondering if that is the problem.
@weiyuan-jiang But isn't it the case that when we use the OSERVER, the gatherv() is over a much smaller set of processes? For the main application there are many cores and therefore many very small messages. On the server there are far fewer cores and thus fewer, larger messages.
The oserver does not have mpi_gatherV. This gatherV happens on the client side, and only in 1D tile space. On the client side, it gathers all the data and then sends it through the oserver. For multi-dimensional fields, WRITE_RESTART_BY_OSERVER bypasses the gatherV. @tclune
@lizziel Do you have any problems after you set WRITE_RESTART_BY_OSERVER to yes?
It looks like you hit an MPI problem in a gatherV. Like Weiyuan said, if you use the write-by-oserver option it bypasses the gatherV and takes a whole different code path to write the checkpoint. So you have sidestepped the problem by not exercising the code that caused it in the first place.
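To make the two code paths described above concrete, here is a minimal, self-contained C/MPI sketch (not MAPL's actual code) contrasting a single MPI_Gatherv collective over every rank with point-to-point Send/Recv to a designated server rank, which is roughly the pattern the oserver route uses according to the comments above:

```c
/* Sketch of the two communication patterns discussed in this thread.
 * Pattern A: one MPI_Gatherv over every application rank (the "normal"
 * checkpoint path, reportedly slow under Open MPI at high core counts).
 * Pattern B: plain Send/Recv to a single server rank (rank 0 stands in
 * for the oserver here), which avoids the collective entirely.        */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int nlocal = 1000;                 /* points owned by this rank */
    double *local = malloc(nlocal * sizeof(double));
    for (int i = 0; i < nlocal; ++i) local[i] = (double)rank;

    int *counts = NULL, *displs = NULL;
    double *global = NULL;
    if (rank == 0) {                         /* receive buffers on root only */
        counts = malloc(size * sizeof(int));
        displs = malloc(size * sizeof(int));
        for (int r = 0; r < size; ++r) { counts[r] = nlocal; displs[r] = r * nlocal; }
        global = malloc((size_t)size * nlocal * sizeof(double));
    }

    /* Pattern A: collective gather of many small contributions to rank 0. */
    MPI_Gatherv(local, nlocal, MPI_DOUBLE,
                global, counts, displs, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Pattern B: explicit point-to-point transfers to the "server" rank. */
    if (rank == 0) {
        memcpy(global, local, nlocal * sizeof(double));   /* own piece */
        for (int r = 1; r < size; ++r)
            MPI_Recv(global + (size_t)r * nlocal, nlocal, MPI_DOUBLE,
                     r, 99, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        /* ...rank 0 would now write the assembled array to the file... */
    } else {
        MPI_Send(local, nlocal, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD);
    }

    free(local); free(counts); free(displs); free(global);
    MPI_Finalize();
    return 0;
}
```

As the earlier comment notes, in the real application the gather spans hundreds or thousands of ranks with many very small messages, whereas the oserver path funnels data through a small set of IO processes, so per-message overhead in the collective matters much more on the default path.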
I have not noticed any run issues after setting WRITE_RESTART_BY_OSERVER to yes. |
I am going to close this issue. Please keep @LiamBindle and me informed if there is a new fix in a future MAPL release, or if this fix is to be retired without a replacement for the problem.
@LiamBindle The other one to watch out for is
Thanks @mathomp4—I'll give that a try too. |
I've been running GCHPctm with MAPL 2.2.7 for various grid resolutions and core counts on the Harvard Cannon cluster. I am encountering an error while writing checkpoint files when running with high core counts, in my case 1440 cores. The error is in UCX, so not MAPL specifically, but it is specific to the MAPL checkpoint files:
My libraries are as follows (plus UCX 1.6.0):
My run is at c180, with NX=16 and NY=90. I am using 24 cores per node across 60 nodes, reserving the full 128 GB of memory on each. Originally I encountered this error at the beginning of the run because I had periodic checkpoints configured (RECORD_* in GCHP.rc), which caused a checkpoint to be written at the start of the run. I turned that off and my run then got to the end and successfully wrote History files, but it then hit the same issue when writing the checkpoint file.
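For anyone reproducing this, a sketch of the kind of RECORD_* entries being referred to, assuming the usual MAPL resource-file syntax; the parameter names follow the RECORD_* convention mentioned above, and the values are illustrative rather than copied from the actual GCHP.rc used here:

```
# Illustrative only: enables periodic (intermediate) checkpoint writing.
RECORD_FREQUENCY: 240000        # interval between checkpoints (HHMMSS)
RECORD_REF_DATE:  20190701      # reference date (YYYYMMDD)
RECORD_REF_TIME:  000000        # reference time (HHMMSS)
```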
@LiamBindle also encountered this problem on a separate compute cluster with c360 using 1200 cores.
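As a quick consistency check on the layouts quoted in this thread (assuming, as is standard for these cubed-sphere runs, that the total core count equals NX × NY): 16 × 90 = 1440 cores for the c180 case, and 10 × 120 = 1200 cores for the c360 case, which matches 40 nodes × 30 cores per node.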
Have you seen this before?