Re-established some albedo IO-optimizations from #400 #885
The level loop in write_mean fires one MPI_Gatherv per level. Without a barrier, every rank rushes ahead and sends levels 2, 3, 4, … before the I/O rank has finished matching level 1. Those early messages pile up in Intel MPI's unexpected-message queue, which is searched linearly. Once it holds thousands of entries, each match costs O(N), the progress engine falls behind, and a few unlucky ranks block for up to an hour waiting for their send buffers to drain. The barrier forces all ranks onto the same level before any gather starts, so only one level of traffic is ever in flight: the queue stays short, matching stays O(1), and there are no stragglers. The mean output time goes up (15.8 → 456.8 s) because every rank now waits for the slowest, but the max collapses (3588 → 457 s) and total runtime drops ~12×. Elsewhere I'd keep it OFF by default. On OpenMPI (Levante) or cray-mpich, or with async I/O threads, the queue pileup doesn't happen the same way and the barrier would be pure overhead. The opt-in via ENABLE_ALBEDO_INTELMPI_WORKAROUNDS is the right granularity: anyone seeing the "small mean, huge max" signature should flip it on, others shouldn't.
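To make the queue argument above concrete, here is a toy model in Python. It is only a sketch under simplifying assumptions, not a description of Intel MPI internals or of the FESOM code: every message is a `(level, rank)` pair, the I/O rank matches them level by level, and each match is a front-to-back scan of one list. The function name `simulate` and the worst-case arrival order (each rank has already posted all of its levels) are made up for illustration.

```python
def simulate(n_ranks, n_levels, barrier):
    """Toy model: return (peak queue depth, total match comparisons) on the I/O rank."""
    queue = []        # stand-in for the unexpected-message queue, scanned front to back
    peak = 0
    comparisons = 0

    def match_level(level):
        # The I/O rank needs one message per rank for this level.
        nonlocal comparisons
        for rank in range(n_ranks):
            idx = queue.index((level, rank))  # linear scan of the queue
            comparisons += idx + 1            # entries inspected until the hit
            queue.pop(idx)

    if barrier:
        # With a barrier, only one level of traffic is in flight at a time.
        for level in range(n_levels):
            queue.extend((level, rank) for rank in range(n_ranks))
            peak = max(peak, len(queue))
            match_level(level)
    else:
        # Assumed worst case: every rank has sent all levels before matching starts.
        for rank in range(n_ranks):
            queue.extend((level, rank) for level in range(n_levels))
        peak = len(queue)
        for level in range(n_levels):
            match_level(level)

    return peak, comparisons


if __name__ == "__main__":
    for barrier in (True, False):
        peak, cost = simulate(n_ranks=8, n_levels=4, barrier=barrier)
        print(f"barrier={barrier}: peak queue depth={peak}, comparisons={cost}")
```

In this model the peak depth is the number of ranks with the barrier and ranks × levels without it, and the matching cost without the barrier grows roughly with the square of the rank count. Real timings depend on the MPI implementation and on how far ranks actually run ahead.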
ogurses
left a comment
Do we have to apply a similar workaround on the NHR architecture?
Hi @ogurses, I don't know. From what @JanStreffing wrote, I would guess that it might help if you are using Intel MPI or any MPI without async I/O threads. What I would recommend, though, is to run the tests Sven ran and check the same timings he checked. If the writing times match the pattern and writing is really slow, you can try adding it. If the problem also exists there, it would be good to know which MPI distro you are using. Maybe the switch can be generalized to something that is not Albedo-specific, but MPI distro/option-specific?
@svenharig @JanStreffing: Maybe we should introduce a warning message for the user here (like other messages we already have) that checks whether the ratio of max/min per-process output wall-clock time exceeds 5 or 10. If it does, print a warning that the ENABLE_ALBEDO_INTELMPI_WORKAROUNDS flag should probably be active, or that its status should at least be checked, together with a bit of explanation. That way most users would be able to help themselves.
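The check proposed above could look roughly like the following Python sketch. It only illustrates the logic; in the model it would live in the Fortran timing report. The function name `output_imbalance_warning`, the default threshold of 5 and the message wording are all hypothetical.

```python
def output_imbalance_warning(min_s, max_s, threshold=5.0):
    """Return a warning string if max/min output time exceeds threshold, else None.

    min_s, max_s: per-task minimum and maximum output runtime in seconds.
    threshold:    hypothetical cutoff; the comment above suggests 5 or 10.
    """
    if min_s <= 0.0:
        return None  # ratio undefined, nothing sensible to report
    ratio = max_s / min_s
    if ratio > threshold:
        return (
            f"WARNING: output time max/min ratio is {ratio:.1f} (> {threshold}). "
            "A few tasks are much slower than the rest; check whether "
            "ENABLE_ALBEDO_INTELMPI_WORKAROUNDS should be enabled."
        )
    return None


if __name__ == "__main__":
    # min/max of "runtime output" from the two runs in this PR
    print(output_imbalance_warning(0.6689, 3588.7000))    # original run
    print(output_imbalance_warning(456.5835, 457.6900))   # with the workaround
```

With the timings reported in this PR, the original run exceeds the threshold by a wide margin, while the run with the workaround stays far below it.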
This PR re-establishes some of the changes dealing with IO issues on albedo that were included in the branch refactoring_albedo_env, see #400.
These entries are not present in FESOM 2.7. However, experiments on albedo show much more regular behavior across the cores in the output phases of the model when the Intel compiler and Intel MPI are used (settings in env/albedo/shell). The MPI_Barriers in the source file io_meandata.F90 seem to be especially relevant in this respect.
As an example, here are the output timings for an experiment on 2592 cores on albedo:
Original:
___MODEL RUNTIME per task [seconds]____mean____________min____________max
runtime output: 15.7956 0.6689 3588.7000
runtime restart: 42.7859 0.0435 6165.9058
================ BENCHMARK RUNTIME ===================
Number of cores : 2592
Runtime for all timesteps : 6181.1978 sec
After re-establishing the entries for ENABLE_ALBEDO_INTELMPI_WORKAROUNDS:
___MODEL RUNTIME per task [seconds]____mean____________min____________max
runtime output: 456.7661 456.5835 457.6900
runtime restart: 30.9291 30.0464 30.9496
================ BENCHMARK RUNTIME ===================
Number of cores : 2592
Runtime for all timesteps : 502.2796 sec
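As a quick cross-check of the two timing blocks above, the following Python sketch parses them and computes the overall speedup. The regular expressions and the helper name `parse_timings` are assumptions tailored to the log format shown here, not part of FESOM.

```python
import re

# Patterns assumed from the log excerpts above.
ROW = re.compile(r"runtime (\w+):\s+([\d.]+)\s+([\d.]+)\s+([\d.]+)")
TOTAL = re.compile(r"Runtime for all timesteps\s*:\s*([\d.]+)\s*sec")

ORIGINAL = """
runtime output: 15.7956 0.6689 3588.7000
runtime restart: 42.7859 0.0435 6165.9058
Runtime for all timesteps : 6181.1978 sec
"""

PATCHED = """
runtime output: 456.7661 456.5835 457.6900
runtime restart: 30.9291 30.0464 30.9496
Runtime for all timesteps : 502.2796 sec
"""


def parse_timings(text):
    """Return ({phase: (mean, min, max)}, total_runtime_seconds)."""
    rows = {
        m.group(1): tuple(float(x) for x in m.groups()[1:])
        for m in ROW.finditer(text)
    }
    total = float(TOTAL.search(text).group(1))
    return rows, total


rows_before, total_before = parse_timings(ORIGINAL)
rows_after, total_after = parse_timings(PATCHED)
speedup = total_before / total_after

if __name__ == "__main__":
    print(f"total runtime speedup: {speedup:.1f}x")
    for phase in rows_before:
        print(f"{phase}: max {rows_before[phase][2]:.1f} s -> {rows_after[phase][2]:.1f} s")
```

The total runtime is dominated by the slowest task in each phase, which is why the run with the higher mean output time still finishes far sooner.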
This behavior was tested in several experiments and has been reproducible so far, though not yet with other compilers or OpenMPI. We suggest considering re-inserting the flags in other branches. Path to the experiments:
/albedo/work/user/sharig/f27_r31_dev/config/bin_fesom2.7_lrg/
@mandresm