Parallel IO in CICE #81
CICE6 has the option to perform IO using ParallelIO. This is implemented here: https://github.com/CICE-Consortium/CICE/tree/main/cicecore/cicedyn/infrastructure/io/io_pio2. My understanding is that, when using it, it replaces the serial IO entirely, which is probably why this is not obvious. Note that, currently, the default build option in OM3 is to use PIO (see here).
Thanks Micael. Maybe I misunderstood the changes made to CICE5, and COSIMA/cice5@e9575cd is just about adding the chunking features and some other improvements? But the parallel IO was already working? @aekiss - can you confirm?
@anton-seaice I think the development of PIO support in the COSIMA fork of CICE5 and in CICE6 was done independently. So they might not provide exactly the same features. Still, very likely the existing PIO support in CICE6 is good enough for our needs, although that needs to be tested.
Using the config from ACCESS-NRI/access-om3-configs#17,
It's not clear to me if that is a problem (times are not mutually exclusive), and we might not know until we try the higher resolutions. There are a couple of other issues though: monthly output in OM2 was ~17 MB:
But the OM3 output is ~69 MB. The history output is not chunked.
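For anyone checking this later, a quick way to confirm whether a history file is chunked or compressed is to inspect it with the netCDF4 Python library. This is only a sketch; the filename is hypothetical.

```python
# Sketch: inspect chunking and compression of a CICE history file.
# "iceh.2010-01.nc" is a hypothetical filename - substitute a real OM3 output file.
import netCDF4

with netCDF4.Dataset("iceh.2010-01.nc") as ds:
    for name, var in ds.variables.items():
        if var.ndim < 2:
            continue  # skip scalars and 1-D coordinate variables
        print(name,
              "chunking:", var.chunking(),   # 'contiguous' means no chunking
              "filters:", var.filters())     # shows zlib/shuffle/compression level
```

(`ncdump -hs` gives the same information from the command line.)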
It looks like we need to set But when I do this, I get this error in access-om3.err:
The
The definitions of the spack environments we are using can be found here. For the development version of OM3, we are using this one. HDF5 with MPI support is included by default when compiling netCDF with spack, while pnetcdf is off when building parallelio. If you want, I can try to rebuild parallelio with pnetcdf support.
Possibly relevant:
Thanks - this sounds ok. HDF5 is the one we want, and the ParallelIO library should be backward compatible without pnetcdf. I am still getting the
This issue has been mentioned on ACCESS Hive Community Forum. There might be relevant details there: https://forum.access-hive.org.au/t/payu-generated-symlinks-dont-work-with-parallelio-library/1617/1
I was way off on a tangent. The ParallelIO library doesn't like using a symlink to the initial conditions file, and this gives the get_stripe failed error.
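A minimal sketch of the workaround, assuming the symlink path below (which is hypothetical): replace the payu-generated symlink with a hard link (or a copy), so the ParallelIO/Lustre get_stripe call sees a real file.

```python
# Sketch: replace a symlinked CICE initial conditions file with a hard link,
# falling back to a copy if the link crosses filesystems.
# "input/iced.1900-01-01-00000.nc" is a hypothetical path.
import os
import shutil

ic_path = "input/iced.1900-01-01-00000.nc"

if os.path.islink(ic_path):
    target = os.path.realpath(ic_path)   # resolve to the real file
    os.unlink(ic_path)                   # remove the symlink itself
    try:
        os.link(target, ic_path)         # hard link avoids duplicating data
    except OSError:
        shutil.copy2(target, ic_path)    # copy if a hard link isn't possible
```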
I raised an issue for the code changes needed for chunking and compression:
For anyone reading later, Dale Roberts and OpenMPI both suggested setting the MPI IO library to romio321, which works and opens files through the symlink, but there is a significant performance hit. Monthly runs (with some daily output) have history timers in the ice.log of approximately double (99 seconds vs 54 seconds; 48 cores, 12 PIO tasks, pio_type=netcdf4p). It looks like there is an open issue with OpenMPI still: open-mpi/ompi#12141
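For reference, a sketch of one way to select ROMIO instead of the default OMPIO, by setting OpenMPI's MCA "io" parameter through the environment before launching. The launch command below is hypothetical (payu normally drives mpirun itself); only the environment variable is the relevant part.

```python
# Sketch: select the ROMIO MPI-IO component via OpenMPI's MCA environment variable.
# The mpirun command line is illustrative only.
import os
import subprocess

env = dict(os.environ)
env["OMPI_MCA_io"] = "romio321"  # use ROMIO instead of the default OMPIO component

subprocess.run(["mpirun", "-n", "48", "./access-om3"], env=env, check=True)
```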
Hi @anton-seaice. Was going to email the following to you, but thought I'd put it here:
Thanks Dale. The other big caveat here is we only have a 1 degree resolution at this point, and in OM2, performance was worse with parallel IO (than without) at 1 degree but better at 0.25 degree. So it may be hard to really get into the details at this point. Lustre stripe count is 1 (files are <100 MB), but I couldn't figure out an easy way to check cb_nodes? CICE uses the NCAR ParallelIO library. The data might be in a somewhat sensible order: each PE would have 10 or so blocks of adjacent data (in a line of constant longitude). If we use the 'box rearranger', then each IO task might end up with adjacent data in latitude too (assuming PEs get assigned sequentially?). Saying that, it looks like using 1 PIO IO task (there are 48 PEs) and the box rearranger is fastest. With 1 PIO task, box rearranger and ompio, the reported history time is ~12 seconds (vs about 15 seconds with romio321). (For reference: config tested)
OpenMPI will fix the bug, so the plan of action is:
Could also be worth discussing with Rui Yang (NCI) - he has a lot of experience with parallel IO.
Would efficient parallel IO also require a chunked NetCDF file, with chunks corresponding to each IO task's set of blocks? Also (as in OM2) we'll probably use different
Possibly - we will have to revisit when the chunking is working, although with neatly organised data (i.e. in 1 degree where blocks are adjacent) it might not matter. If we stick with the box rearranger, then 1 chunk per IO task is worth trying. Of course we need to be mindful of read patterns just as much as write speed though.
Using the box rearranger - this would send all data from one compute task to one IO task - but then the data blocks would be non-contiguous in the output and need multiple calls to the netCDF library. (Presumably set netCDF chunk size = block size.)

Using the subset rearranger - the data from compute tasks would be spread among multiple IO tasks - but then the data blocks would be contiguous for each IO task and require only one call to the netCDF library. (Presumably set netCDF chunk size = 1 chunk per IO task.)

Box would have more IO operations and subset would have more network operations. I don't know how they would balance out (and I'd also guess the results differ depending on whether the tasks are spread across multiple NUMA nodes / real nodes etc.).

NB: The TWG minutes talk about this a lot. The suggestion is actually that one chunk per node will be best!
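To make the two chunking layouts concrete, here is a rough sketch of what "one chunk per IO task" vs "one chunk per node" might look like when rewriting a history file with xarray. The grid size, task count and filenames are assumptions for illustration, not the actual OM3 settings.

```python
# Sketch: write a history file with two candidate chunk layouts.
# Grid size (300 x 360), 12 IO tasks and the filenames are assumed values.
import xarray as xr

ny, nx = 300, 360        # assumed 1-degree global grid (nj, ni)
n_io_tasks = 12          # assumed number of PIO IO tasks

ds = xr.open_dataset("iceh.2010-01.nc")  # hypothetical history file

# one chunk per IO task: split the j dimension into n_io_tasks bands
per_task = {"chunksizes": (1, ny // n_io_tasks, nx), "zlib": True, "complevel": 1}
# one chunk per node: a single chunk covering the whole horizontal grid
per_node = {"chunksizes": (1, ny, nx), "zlib": True, "complevel": 1}

# pick one layout; here we write one chunk per node for all (time, nj, ni) variables
encoding = {v: per_node for v in ds.data_vars if ds[v].ndim == 3}
ds.to_netcdf("iceh.2010-01.rechunked.nc", encoding=encoding)
```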
Creation of user scripts for OM3, see COSIMA/access-om3#182. There are three processes here:
1. Per COSIMA/access-om3#81, the initial conditions file for CICE is converted from a symlink to a file/hardlink
2. CICE daily output is concatenated into one file per month (sketched below)
3. An intake-esm datastore for the run is generated

Co-authored-by: Dougie Squire <42455466+dougiesquire@users.noreply.github.com>
Co-authored-by: dougiesquire <dougiesquire@gmail.com>
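A minimal sketch of process 2, assuming a filename pattern and directory layout that are illustrative only (not the actual user-script implementation):

```python
# Sketch: concatenate CICE daily history output into one file per month.
# The glob pattern and output path are assumptions for illustration.
import glob
import xarray as xr

daily_files = sorted(glob.glob("archive/output000/ice/OUTPUT/iceh.2010-01-??.nc"))

# concatenate along the time dimension and write a single monthly file
ds = xr.concat([xr.open_dataset(f) for f in daily_files], dim="time")
ds.to_netcdf("archive/output000/ice/OUTPUT/iceh.2010-01-daily.nc")

# the daily files could then be removed once the monthly file is verified
```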
In OM2, a fair bit of work was done to add parallel writing of netCDF output, to get around delays writing daily output from CICE:
COSIMA/cice5#34
COSIMA/cice5@e9575cd
The ice_history code between CICE5 and 6 looks largely unchanged, so we will probably need to make similar changes to CICE6?