
Auto merge partitioned netcdf files #5

Merged
alperaltuntas merged 9 commits into dev/ncar from incorporate_mppnccombine
Apr 11, 2025

Conversation

@alperaltuntas (Member) commented Mar 4, 2025

This is one of three PRs to enable automated parallel IO in MOM6 within CESM by making use of the existing parallel IO implementation that comes with FMS. This series of PRs (1) enables FMS parallel IO, (2) ensures that the IO PE layout is compatible with auto land-block elimination (i.e., masking), and (3) ensures that partitioned files are automatically merged after a run completes.

The major change in this PR is the introduction of the mppnccombine tool into the FMS diag_manager. When an IO_LAYOUT is specified, FMS now automatically merges the partitioned files when the diag manager closes the open netCDF files. For domain files, this is done in a round-robin fashion, where each IO PE handles the auto-merging of a history file stream. For section files, the IO PE with the smallest PE index handles the merging. The auto-merge feature is enabled by the newly added auto_merge_nc namelist parameter.
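For illustration, a minimal sketch of how the feature might be switched on in a run. The namelist group that holds auto_merge_nc is not spelled out here, so placing it in diag_manager_nml below is an assumption; IO_LAYOUT is the usual MOM6 runtime parameter, and the 2x2 layout is just an example:

```
! input.nml -- assumed placement of the new flag in the FMS diag_manager namelist
&diag_manager_nml
    auto_merge_nc = .true.
/
```

```
! MOM_input -- a hypothetical 2x2 IO layout: each history stream is written as
! four partitioned files (e.g. foo.nc.0000 .. foo.nc.0003) and then auto-merged
IO_LAYOUT = 2, 2
```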

Testing: aux_mom.derecho
Status: b4b, except for masking in static files due to the changed compute layout.
Performance: varies with resolution, PE count, and IO layout, ranging from no improvement (or slight degradation) to more than 50% improvement.

This PR should be evaluated in conjunction with:
NCAR/MOM6#342
ESCOMP/MOM_interface#235

@alperaltuntas (Member, Author)

Mentioning @mnlevy1981 and @gustavo-marques.

@jedwards4b (Collaborator)

What is the suggested MOM IO strategy for fully coupled compsets? Is this something that can be set as default or does it require user modifications?

@jedwards4b (Collaborator) left a comment

I think that it would have been better to integrate MOM IO with PIO as with other component models, but I understand the motivation for doing it this way.

@alperaltuntas (Member, Author)

@jedwards4b I think this approach is lightweight and more maintainable in the long run. I do have an FMS branch where I integrated PIO into the FMS diagnostic manager. That integration is mostly complete, with two remaining issues: compatibility with auto-masking and handling of section files.

However, that branch introduces significant changes (over 15,000 new lines of code), which makes me concerned about maintainability and potential fragility. Given the scope of those modifications, I opted for this simpler approach. That said, I’m happy to continue pursuing the full PIO integration if I can get support for resolving the remaining issues and, more importantly, assistance with ongoing maintenance, including porting to newer versions of FMS as needed.

Here’s the branch with the PIO integration:
dev/ncar...alperaltuntas:FMS:dev/ncar_add_pio

@mnlevy1981 left a comment

I compared a 1-year G1850MARBL_JRA run on the 2/3° grid with parallel I/O to the current out-of-the-box configuration, and saw a modest speedup with parallel I/O:

  1. The actual runtime of the model (from "case.run starting" to "case.run success" in the CaseStatus file) decreased from 7852 sec (6142 pe-hrs/sim-year, 11 SYPD) to 7594 sec (5940.2 pe-hrs/sim-year, 11.4 SYPD) -- a 3% reduction in cost of the run
  2. The reported Model Cost in the timing file decreased from 5920.28 pe-hrs/simulated_year (11.42 SYPD) to 5644.11 pe-hrs/simulated_year (11.97 SYPD) -- a 5% reduction in cost of the run
  3. Another number of interest: my first test with the parallel I/O enabled was running at a reported 12.3 SYPD; this run failed in the "combine the various netCDF patches" stage, but a rough estimate would be an "actual runtime" (per CaseStatus) of 7395 sec (5785 pe-hrs/sim-year, 11.68 SYPD), or a 5.75% decrease in runtime over the serial I/O.

So I think this is good for a 3%–6% performance boost out of the box with the MARBL tracers active.
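As a rough check of how those cost and throughput numbers hang together, a minimal sketch of the arithmetic. The PE count is an assumption (22 Derecho nodes × 128 cores = 2816 PEs, consistent with the layout discussion below); the CESM timing file is the authoritative source:

```python
# Back-of-the-envelope conversion between wall-clock runtime, model cost,
# and throughput for a 1-simulated-year run.
PES = 22 * 128       # assumed PE count (22 Derecho nodes x 128 cores)
SIM_YEARS = 1.0      # length of the test run in simulated years

def cost_and_throughput(runtime_sec: float) -> tuple[float, float]:
    """Return (pe-hrs per simulated year, simulated years per day)."""
    pe_hrs_per_year = PES * runtime_sec / 3600.0 / SIM_YEARS
    sypd = SIM_YEARS / (runtime_sec / 86400.0)
    return pe_hrs_per_year, sypd

for label, secs in [("serial I/O", 7852.0), ("parallel I/O", 7594.0)]:
    cost, sypd = cost_and_throughput(secs)
    print(f"{label}: {cost:.0f} pe-hrs/sim-year, {sypd:.1f} SYPD")

# relative wall-clock saving
print(f"runtime reduction: {(7852.0 - 7594.0) / 7852.0:.1%}")
```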

@alperaltuntas (Member, Author)

@mnlevy1981, I wonder if MARBL would benefit from increased TARGET_IO_PES. I can run another case based on your one-year run if you share your test directory, unless you'd prefer to experiment with different TARGET_IO_PES settings yourself.

@mnlevy1981

If you don't mind running the tests, my CESM sandbox is /glade/work/mlevy/codes/CESM/cesm3_0_alpha06c_MOMmainPR and there is a cases/ subdirectory that has my control (dev_ncar_PR.pre_MARBL_updates) and my latest (dev_ncar_PR.my_updates).

I didn't fix the PE layout until after running dev_ncar_PR.my_updates, so multiply the reported Model Cost in my runs by 0.55 (22/40; I was running on 40 nodes, but 18 of them were idle, and you should be running on 22 nodes); Model Throughput doesn't need to be adjusted.

@mnlevy1981

Sorry, my last comment pointed to testing for merging the dev/ncar branch of MOM6 into mom-ocean:main; the parallel I/O testing is in /glade/work/mlevy/codes/CESM/cesm3_0_alpha06c_MOMIO. cases/G1850MARBL_JRA.alper_updates is my run with the I/O changes, and the control was from a different alpha06c sandbox: /glade/work/mlevy/codes/CESM/cesm3_0_alpha06c_cupid/cases/G1850MARBL_JRA.no_alper_updates

Commit note (excerpt): …by determining which IO PE combines which file based on the order of IO PEs (as opposed to the IO PE index, which is not guaranteed to be contiguous when land block elimination is enabled).
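A hypothetical sketch of that assignment rule (not the FMS code; the PE indices and file names below are made up): spreading merge work by each I/O PE's position in the I/O-PE list keeps the round-robin balanced even when land-block elimination leaves gaps in the PE indices.

```python
# Illustrative only: assign each partitioned history stream to an I/O PE
# by the PE's *position* in the I/O-PE list, not by its raw index.
io_pes = [0, 3, 5, 9]                         # non-contiguous I/O PE indices (made up)
streams = [f"hist_{n}.nc" for n in range(6)]  # history file streams (made up)

assignment = {f: io_pes[i % len(io_pes)] for i, f in enumerate(streams)}
for f, pe in assignment.items():
    print(f"{f} merged by I/O PE {pe}")
```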
@alperaltuntas merged commit e6daa5f into dev/ncar on Apr 11, 2025
