Re-established some albedo IO-optimizations from #400 by svenharig · Pull Request #885 · FESOM/fesom2

svenharig · 2026-04-21T10:07:58Z

This PR re-establishes some of the changes that address I/O issues on albedo and were included in the branch refactoring_albedo_env; see #400.

These entries are not present in FESOM 2.7. However, experiments on albedo show much more regular behavior across the cores in the output phases of the model when the Intel compiler and Intel MPI are used (settings in env/albedo/shell). The MPI_Barriers in the source file io_meandata.F90 seem especially relevant in this respect.

As an example, here are the output timings for an experiment on 2592 cores on albedo:

Original:
___MODEL RUNTIME per task [seconds]____mean____________min____________max

runtime output: 15.7956 0.6689 3588.7000
runtime restart: 42.7859 0.0435 6165.9058

================ BENCHMARK RUNTIME ===================
Number of cores : 2592
Runtime for all timesteps : 6181.1978 sec

After re-establishing the entries for ENABLE_ALBEDO_INTELMPI_WORKAROUNDS:

___MODEL RUNTIME per task [seconds]____mean____________min____________max

runtime output: 456.7661 456.5835 457.6900
runtime restart: 30.9291 30.0464 30.9496

================ BENCHMARK RUNTIME ===================
Number of cores : 2592
Runtime for all timesteps : 502.2796 sec

This behavior was tested in several experiments and has so far been reproducible, though not yet with other compilers or OpenMPI. We suggest considering re-inserting the flags in other branches. Path to the experiments:
/albedo/work/user/sharig/f27_r31_dev/config/bin_fesom2.7_lrg/

@mandresm

JanStreffing · 2026-04-21T13:07:54Z

The level loop in write_mean fires one MPI_Gatherv per level. Without a barrier, every rank rushes ahead and sends level 2, 3, 4, … before the I/O rank has finished matching level 1. Those early messages pile up in Intel MPI's unexpected-message queue, which is matched linearly. Once it has thousands of entries, matching goes O(N), the progress engine falls behind, and a few unlucky ranks block for an hour waiting for their send buffer to drain.

The barrier forces all ranks onto the same level before any gather starts, so only one level of traffic is ever in flight. The queue stays short, matching stays O(1), no stragglers.
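To make the queue argument concrete, here is a toy model in plain Python. It is not FESOM code and does not reproduce Intel MPI internals: the function names, the arrival order of the messages, and the rank and level counts are all assumptions chosen for illustration. It only counts how many queue entries a linear matcher inspects when every rank has already sent all levels, compared with one level at a time.

```python
def match_cost(queue, wanted):
    """Scan the queue linearly for each wanted message; return entries inspected."""
    inspected = 0
    for msg in wanted:
        pos = queue.index(msg)   # linear search, like a linearly matched queue
        inspected += pos + 1
        del queue[pos]
    return inspected


def simulate(n_ranks, n_levels, barrier):
    """Return (total entries inspected, peak queue length) for the I/O rank."""
    if barrier:
        # Only one level is in flight at a time (assumed effect of the barrier).
        total, peak = 0, 0
        for level in range(n_levels):
            queue = [(rank, level) for rank in range(n_ranks)]
            peak = max(peak, len(queue))
            total += match_cost(queue, [(rank, level) for rank in range(n_ranks)])
        return total, peak

    # No barrier: assume every rank has already sent all of its levels.
    queue = [(rank, level) for rank in range(n_ranks) for level in range(n_levels)]
    peak = len(queue)
    total = 0
    for level in range(n_levels):
        total += match_cost(queue, [(rank, level) for rank in range(n_ranks)])
    return total, peak


with_barrier = simulate(64, 40, barrier=True)
without_barrier = simulate(64, 40, barrier=False)
print("with barrier    (inspected, peak):", with_barrier)
print("without barrier (inspected, peak):", without_barrier)
```

In this model the peak queue length with the barrier equals the number of ranks, while without it the peak is ranks times levels, and the matching work grows accordingly.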

The mean goes up (15.8 → 456.8 s) because every rank now waits for the slowest, but the max collapses (3588 → 457 s) and total runtime drops ~12×.
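The ratios can be checked directly from the two benchmark logs quoted in the PR description. The short Python sketch below only redoes that arithmetic; the dictionary layout and helper name are illustrative, and the numbers are copied from the logs.

```python
# Timings in seconds, copied from the two benchmark logs in this PR.
# Each tuple is (mean, min, max) per task.
original = {
    "output": (15.7956, 0.6689, 3588.7000),
    "restart": (42.7859, 0.0435, 6165.9058),
    "total": 6181.1978,
}
workaround = {
    "output": (456.7661, 456.5835, 457.6900),
    "restart": (30.9291, 30.0464, 30.9496),
    "total": 502.2796,
}


def max_over_mean(stats):
    """Ratio of the slowest task to the average task."""
    mean, _minimum, maximum = stats
    return maximum / mean


speedup = original["total"] / workaround["total"]
imbalance_before = max_over_mean(original["output"])
imbalance_after = max_over_mean(workaround["output"])

print(f"total runtime speedup: {speedup:.1f}x")
print(f"output max/mean before: {imbalance_before:.1f}")
print(f"output max/mean after:  {imbalance_after:.3f}")
```

A max/mean ratio in the hundreds means a handful of tasks dominate the wall-clock time; a ratio close to 1 means all tasks finish the output phase together.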

On other systems I'd keep it OFF by default. On OpenMPI (Levante) or cray-mpich, or with async I/O threads, the queue pileup doesn't happen the same way and the barrier would be pure overhead. The opt-in via ENABLE_ALBEDO_INTELMPI_WORKAROUNDS is the right granularity: anyone seeing the "small mean, huge max" signature should flip it on, others shouldn't.

ogurses

Do we have to apply a similar workaround for the NHR architecture?

mandresm · 2026-04-22T06:09:03Z

Hi @ogurses, I don't know. From what @JanStreffing wrote, I would guess that it might help if you are using Intel MPI or any MPI without async I/O threads. What I would recommend, though, is to run the tests Sven ran and check the timings he was checking. If the writing times match the pattern and writing is really slow, you can try adding it. If the problem also exists there, it would be good to know which MPI distribution you are using. Maybe the switch can be generalized to something that is not albedo-specific but MPI distribution/option-specific?

patrickscholz · 2026-04-28T08:31:34Z

@svenharig @JanStreffing: Maybe we should introduce a warning message for the user here (like other messages we already have) by checking whether the ratio of max/min per-process output wall-clock time is > 5 or 10. If that is the case, print a warning that the ENABLE_ALBEDO_INTELMPI_WORKAROUNDS flag should preferably be active, or that its status should at least be checked, together with a bit of explanation. That way, most users would be able to help themselves.
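A minimal sketch of the proposed check, written in Python for illustration only. The real check would live in FESOM's Fortran timing report; the function name, the message wording, and the default threshold of 5 are assumptions taken from the suggestion above, not existing code.

```python
def output_imbalance_warning(t_min, t_max, threshold=5.0):
    """Return a warning string if max/min output time exceeds threshold, else None.

    Hypothetical helper: t_min and t_max are the per-task minimum and maximum
    of the "runtime output" line in the timing report, in seconds.
    """
    if t_min <= 0.0:
        raise ValueError("t_min must be positive")
    ratio = t_max / t_min
    if ratio > threshold:
        return (
            f"WARNING: output time max/min ratio is {ratio:.1f} (> {threshold:g}). "
            "Output is strongly imbalanced across tasks; check whether "
            "ENABLE_ALBEDO_INTELMPI_WORKAROUNDS should be enabled."
        )
    return None


# Values from the two benchmark logs in this PR.
print(output_imbalance_warning(0.6689, 3588.7000))    # original run: warns
print(output_imbalance_warning(456.5835, 457.6900))   # with workaround: None
```

With the timings from this PR, the original run exceeds either proposed threshold by a wide margin, while the run with the workaround stays just above 1.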

Re-established some albedo IO-optimizations from #400

2b0b79e

svenharig requested review from JanStreffing, ogurses and patrickscholz April 21, 2026 10:07

svenharig mentioned this pull request Apr 21, 2026

Simulations with FESOM 2.7 are significantly slower than FESOM 2.6 with output/restart enabled #874

Open

ogurses approved these changes Apr 21, 2026

svenharig merged commit 776fafc into fesom2.7-recom3.1 Apr 22, 2026

svenharig deleted the fesom2.7-recom3.1_dev_albedo branch April 22, 2026 09:54
