Parallel IO in CICE #81
CICE6 has the option to perform IO using ParallelIO. This is implemented here: https://github.com/CICE-Consortium/CICE/tree/main/cicecore/cicedyn/infrastructure/io/io_pio2. My understanding is that, when using it, it replaces the serial IO entirely, which is probably why this is not obvious. Note that, currently, the default build option in OM3 is to use PIO (see here).
Thanks Micael. Maybe I misunderstood the changes made to CICE5, and COSIMA/cice5@e9575cd is just about adding the chunking features and some other improvements? But the parallel IO was already working? @aekiss - can you confirm?
@anton-seaice I think the development of PIO support in the COSIMA fork of CICE5 and in CICE6 was done independently. So they might not provide exactly the same features. Still, very likely the existing PIO support in CICE6 is good enough for our needs, although that needs to be tested.
Using the config from ACCESS-NRI/access-om3-configs#17,
It's not clear to me if that is a problem (times are not mutually exclusive), and we might not know until we try the higher resolutions. There are a couple of other issues though: monthly output in OM2 was ~17 MB:
But the OM3 output is ~69 MB. The history output is not chunked.
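For anyone checking this later, a quick way to confirm whether a history file is chunked or compressed is to inspect it with the netCDF4 Python library. This is only a sketch; the filename is hypothetical.

```python
# Sketch: inspect chunking and compression of a CICE history file.
# "iceh.2010-01.nc" is a hypothetical filename - substitute a real OM3 output file.
import netCDF4

with netCDF4.Dataset("iceh.2010-01.nc") as ds:
    for name, var in ds.variables.items():
        if var.ndim < 2:
            continue  # skip scalars and 1-D coordinate variables
        print(name,
              "chunking:", var.chunking(),   # 'contiguous' means no chunking
              "filters:", var.filters())     # shows zlib/shuffle/compression level
```

(`ncdump -hs` gives the same information from the command line.)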
It looks like we need to set But when I do this, I get this error in access-om3.err:
The
The definitions of the spack environments we are using can be found here. For the development version of OM3, we are using this one. HDF5 with MPI support is included by default when compiling netCDF with spack, while pnetcdf is off when building parallelio. If you want, I can try to rebuild parallelio with pnetcdf support.
Possibly relevant:
Thanks - this sounds ok. HDF5 is the one we want, and the ParallelIO library should be backward compatible without pnetcdf. I am still getting the
This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there: https://forum.access-hive.org.au/t/payu-generated-symlinks-dont-work-with-parallelio-library/1617/1
I was way off on a tangent. The ParallelIO library doesn't like using a symlink to the initial conditions file, and this gives the get_stripe failed error.
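A minimal sketch of the workaround, assuming the symlink path below (which is hypothetical): replace the payu-generated symlink with a hard link (or a copy), so the ParallelIO/Lustre get_stripe call sees a real file.

```python
# Sketch: replace a symlinked CICE initial conditions file with a hard link,
# falling back to a copy if the link crosses filesystems.
# "input/iced.1900-01-01-00000.nc" is a hypothetical path.
import os
import shutil

ic_path = "input/iced.1900-01-01-00000.nc"

if os.path.islink(ic_path):
    target = os.path.realpath(ic_path)   # resolve to the real file
    os.unlink(ic_path)                   # remove the symlink itself
    try:
        os.link(target, ic_path)         # hard link avoids duplicating data
    except OSError:
        shutil.copy2(target, ic_path)    # copy if a hard link isn't possible
```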
I raised an issue for the code changes needed for chunking and compression:
For anyone reading later, Dale Roberts and OpenMPI both suggested setting the MPI IO library to romio321, which works and opens files through the symlink, but there is a significant performance hit. Monthly runs (with some daily output) have history timers in the ice.log of approximately double (99 seconds vs 54 seconds; 48 cores, 12 PIO tasks, pio_type=netcdf4p). It looks like there is an open issue with OpenMPI still: open-mpi/ompi#12141
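For reference, a sketch of one way to select ROMIO instead of the default OMPIO, by setting OpenMPI's MCA "io" parameter through the environment before launching. The launch command below is hypothetical (payu normally drives mpirun itself); only the environment variable is the relevant part.

```python
# Sketch: select the ROMIO MPI-IO component via OpenMPI's MCA environment variable.
# The mpirun command line is illustrative only.
import os
import subprocess

env = dict(os.environ)
env["OMPI_MCA_io"] = "romio321"  # use ROMIO instead of the default OMPIO component

subprocess.run(["mpirun", "-n", "48", "./access-om3"], env=env, check=True)
```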
Hi @anton-seaice. Was going to email the following to you, but thought I'd put it here:
Thanks Dale. The other big caveat here is we only have a 1 degree resolution at this point, and in OM2, performance was worse with parallel IO (than without) at 1 degree but better at 0.25 degree. So it may be hard to really get into the details at this point. Lustre stripe count is 1 (files are <100 MB), but I couldn't figure out an easy way to check cb_nodes? CICE uses the NCAR ParallelIO library. The data might be in a somewhat sensible order: each PE would have 10 or so blocks of adjacent data (in a line of constant longitude). If we use the 'box rearranger', then each IO task might end up with adjacent data in latitude too (assuming PEs get assigned sequentially?). Saying that, it looks like using 1 PIO IO task (there are 48 PEs) and the box rearranger is fastest. With 1 PIO task, box rearranger and ompio, the reported history time is ~12 seconds (vs about 15 seconds with romio321). (For reference: config tested)
OpenMPI will fix the bug, so the plan of action is:
Could also be worth discussing with Rui Yang (NCI) - he has a lot of experience with parallel IO.
Would efficient parallel IO also require a chunked NetCDF file, with chunks corresponding to each IO task's set of blocks? Also (as in OM2) we'll probably use different
Possibly - we will have to revisit when the chunking is working, although with neatly organised data (i.e. in 1 degree where blocks are adjacent) it might not matter. If we stick with the box rearranger, then 1 chunk per IO task is worth trying. Of course we need to be mindful of read patterns just as much as write speed though.
Using the box rearranger - this would send all data from one compute task to one IO task - but then the data blocks would be non-contiguous in the output and need multiple calls to the netCDF library. (Presumably set netCDF chunk size = block size.)

Using the subset rearranger - the data from compute tasks would be spread among multiple IO tasks - but then the data blocks would be contiguous for each IO task and require only one call to the netCDF library. (Presumably set netCDF chunk size = 1 chunk per IO task.)

Box would have more IO operations and subset would have more network operations. I don't know how they would balance out (and I'd also guess the results differ depending on whether the tasks are spread across multiple NUMA nodes / real nodes etc.).

NB: The TWG minutes talk about this a lot. The suggestion is actually that one chunk per node will be best!
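To make the two chunking layouts concrete, here is a rough sketch of what "one chunk per IO task" vs "one chunk per node" might look like when rewriting a history file with xarray. The grid size, task count and filenames are assumptions for illustration, not the actual OM3 settings.

```python
# Sketch: write a history file with two candidate chunk layouts.
# Grid size (300 x 360), 12 IO tasks and the filenames are assumed values.
import xarray as xr

ny, nx = 300, 360        # assumed 1-degree global grid (nj, ni)
n_io_tasks = 12          # assumed number of PIO IO tasks

ds = xr.open_dataset("iceh.2010-01.nc")  # hypothetical history file

# one chunk per IO task: split the j dimension into n_io_tasks bands
per_task = {"chunksizes": (1, ny // n_io_tasks, nx), "zlib": True, "complevel": 1}
# one chunk per node: a single chunk covering the whole horizontal grid
per_node = {"chunksizes": (1, ny, nx), "zlib": True, "complevel": 1}

# pick one layout; here we write one chunk per node for all (time, nj, ni) variables
encoding = {v: per_node for v in ds.data_vars if ds[v].ndim == 3}
ds.to_netcdf("iceh.2010-01.rechunked.nc", encoding=encoding)
```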
Creation of user scripts for OM3, see COSIMA/access-om3#182. There are three processes here:
1. Per COSIMA/access-om3#81, the initial conditions file for CICE is converted from a symlink to a file/hardlink
2. CICE daily output is concatenated into one file per month (sketched below)
3. An intake-esm datastore for the run is generated

Co-authored-by: Dougie Squire <42455466+dougiesquire@users.noreply.github.com>
Co-authored-by: dougiesquire <dougiesquire@gmail.com>
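A minimal sketch of process 2, assuming a filename pattern and directory layout that are illustrative only (not the actual user-script implementation):

```python
# Sketch: concatenate CICE daily history output into one file per month.
# The glob pattern and output path are assumptions for illustration.
import glob
import xarray as xr

daily_files = sorted(glob.glob("archive/output000/ice/OUTPUT/iceh.2010-01-??.nc"))

# concatenate along the time dimension and write a single monthly file
ds = xr.concat([xr.open_dataset(f) for f in daily_files], dim="time")
ds.to_netcdf("archive/output000/ice/OUTPUT/iceh.2010-01-daily.nc")

# the daily files could then be removed once the monthly file is verified
```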
In OM2, a fair bit of work was done to add parallel writing of netCDF output, to get around delays writing daily output from CICE:
COSIMA/cice5#34
COSIMA/cice5@e9575cd
The ice_history code between CICE5 and 6 looks largely unchanged, so we will probably need to make similar changes to CICE6?