
Re-established some albedo IO-optimizations from #400 #885

Merged
svenharig merged 1 commit into fesom2.7-recom3.1 from fesom2.7-recom3.1_dev_albedo
Apr 22, 2026

Conversation

@svenharig
Collaborator

@svenharig svenharig commented Apr 21, 2026

This PR re-establishes some of the changes dealing with IO issues on albedo that were included in the branch refactoring_albedo_env, see #400.

These entries are not present in FESOM 2.7. However, experiments on albedo show much more regular behavior across the cores in the output phases of the model when they are enabled and the Intel compiler and Intel MPI are used (settings in env/albedo/shell). The MPI_Barrier calls in the source file io_meandata.F90 seem to be especially relevant in this respect.

As an example, here are the output timings for an experiment on 2592 cores on albedo:

Original:
___MODEL RUNTIME per task [seconds]____mean____________min____________max

runtime output: 15.7956 0.6689 3588.7000
runtime restart: 42.7859 0.0435 6165.9058

================ BENCHMARK RUNTIME ===================
Number of cores : 2592
Runtime for all timesteps : 6181.1978 sec

After re-establishing the entries for ENABLE_ALBEDO_INTELMPI_WORKAROUNDS:

___MODEL RUNTIME per task [seconds]____mean____________min____________max

runtime output: 456.7661 456.5835 457.6900
runtime restart: 30.9291 30.0464 30.9496

================ BENCHMARK RUNTIME ===================
Number of cores : 2592
Runtime for all timesteps : 502.2796 sec

This behavior was tested in several experiments and has been reproducible so far, but it has not yet been tested with other compilers or OpenMPI. We suggest considering re-inserting the flags in other branches. Path to the experiments:
/albedo/work/user/sharig/f27_r31_dev/config/bin_fesom2.7_lrg/

@mandresm

@JanStreffing
Collaborator

The level loop in write_mean fires one MPI_Gatherv per level. Without a barrier, every rank rushes ahead and sends level 2, 3, 4, … before the I/O rank has finished matching level 1. Those early messages pile up in Intel MPI's unexpected-message queue, which is matched linearly. Once it has thousands of entries, matching goes O(N), the progress engine falls behind, and a few unlucky ranks block for an hour waiting for their send buffer to drain.

The barrier forces all ranks onto the same level before any gather starts, so only one level of traffic is ever in flight. The queue stays short, matching stays O(1), no stragglers.
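The queue argument above can be illustrated with a toy model. This is only a sketch under stated assumptions, not a description of how Intel MPI is implemented: the queue is a plain list searched front to back, messages are assumed to arrive in rank-major order when there is no barrier, and the names `drain` and `match_cost` are hypothetical.

```python
def drain(queue, wanted):
    """Match each wanted message by linear search; return entries inspected."""
    inspected = 0
    for message in wanted:
        position = queue.index(message)  # linear scan from the front
        inspected += position + 1
        del queue[position]
    return inspected


def match_cost(n_ranks, n_levels, barrier):
    """Count queue entries the I/O rank inspects while matching all messages.

    Toy model only: messages are (rank, level) tuples in a plain list.
    """
    if barrier:
        # Barrier per level: only the current level's messages are queued.
        total = 0
        for level in range(n_levels):
            queue = [(rank, level) for rank in range(n_ranks)]
            total += drain(queue, list(queue))
        return total
    # No barrier (assumed worst case): every rank has already sent every
    # level before the I/O rank starts matching level by level.
    queue = [(rank, level) for rank in range(n_ranks) for level in range(n_levels)]
    wanted = [(rank, level) for level in range(n_levels) for rank in range(n_ranks)]
    return drain(queue, wanted)


if __name__ == "__main__":
    for ranks, levels in [(8, 16), (64, 48)]:
        with_barrier = match_cost(ranks, levels, barrier=True)
        without_barrier = match_cost(ranks, levels, barrier=False)
        print(ranks, levels, with_barrier, without_barrier)
```

In this model the barrier case inspects exactly one entry per message, while the no-barrier case has to scan past queued messages from later levels, and the gap widens as ranks and levels grow.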

The mean goes up (15.8 → 456.8 s) because every rank now waits for the slowest, but the max collapses (3588 → 457 s) and total runtime drops ~12×.
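The ratios quoted above can be checked directly from the timings in the PR description. A minimal sketch; the dictionary keys are just illustrative names, the numbers are copied from the benchmark output:

```python
# Timings in seconds, copied from the benchmark output in the PR description.
original = {"output_mean": 15.7956, "output_max": 3588.7000, "total": 6181.1978}
workaround = {"output_mean": 456.7661, "output_max": 457.6900, "total": 502.2796}

# How much the total runtime shrinks with the workaround enabled.
total_speedup = original["total"] / workaround["total"]
# How much the slowest task's output time shrinks.
max_reduction = original["output_max"] / workaround["output_max"]
# How much the mean output time grows, since every rank now waits at the barrier.
mean_increase = workaround["output_mean"] / original["output_mean"]

print(f"total runtime speedup: {total_speedup:.1f}x")
print(f"max output time reduction: {max_reduction:.1f}x")
print(f"mean output time increase: {mean_increase:.1f}x")
```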

On other systems I'd keep it OFF by default. On OpenMPI (Levante) or cray-mpich, or with async I/O threads, the queue pileup doesn't happen the same way and the barrier would be pure overhead. The opt-in via ENABLE_ALBEDO_INTELMPI_WORKAROUNDS is the right granularity: anyone seeing the "small mean, huge max" signature should flip it on, others shouldn't.

Collaborator

@ogurses ogurses left a comment


Do we have to apply a similar workaround for the NHR architecture?

@mandresm
Collaborator

Hi @ogurses, I don't know. From what @JanStreffing wrote, I would guess that it might help if you are using Intel MPI or any MPI without async I/O threads. What I would recommend, though, is to run the tests Sven ran and check the same timings he checked. If the writing times match the pattern and writing is really slow, you can try adding it. If the problem also exists there, it would be good to know which MPI distro you are using. Maybe the switch can be generalized to something that is not Albedo-specific but MPI distro/option-specific?

@svenharig svenharig merged commit 776fafc into fesom2.7-recom3.1 Apr 22, 2026
@svenharig svenharig deleted the fesom2.7-recom3.1_dev_albedo branch April 22, 2026 09:54
@patrickscholz
Contributor

@svenharig @JanStreffing: Maybe we should introduce a warning message for the user here (like other messages we already have), triggered when the ratio of max/min per-process output wall-clock time exceeds 5 or 10. In that case, print a warning that the ENABLE_ALBEDO_INTELMPI_WORKAROUNDS flag should be active, or that its status should at least be checked, together with a bit of explanation. That way, most users would be able to help themselves.
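The proposed check could look roughly like the sketch below. It is only an illustration of the max/min heuristic, not existing FESOM code: the function name `io_imbalance_warning`, the default threshold of 10, and the message wording are all assumptions.

```python
def io_imbalance_warning(t_min, t_max, threshold=10.0):
    """Return a warning string if max/min output time exceeds threshold, else None.

    Hypothetical helper: t_min and t_max are the per-task minimum and maximum
    output runtimes in seconds, as printed in the runtime summary.
    """
    if t_min <= 0.0:
        return None  # no meaningful ratio can be formed
    ratio = t_max / t_min
    if ratio <= threshold:
        return None
    return (
        f"WARNING: output runtime max/min ratio is {ratio:.0f} (> {threshold:g}). "
        "Check whether ENABLE_ALBEDO_INTELMPI_WORKAROUNDS should be active."
    )


if __name__ == "__main__":
    # Output timings from the PR description: original run, then with workaround.
    print(io_imbalance_warning(0.6689, 3588.7000))
    print(io_imbalance_warning(456.5835, 457.6900))
```

With the timings from the PR description, the original run triggers the warning and the run with the workaround does not.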
