parallel netcdf writes #23

Closed
jswhit2 opened this issue Dec 10, 2019 · 126 comments

@jswhit2
Collaborator

jswhit2 commented Dec 10, 2019

Parallel write capability is needed in module_write_netcdf.F90. The current version of the netcdf library does not support parallel writing of compressed files, but uncompressed parallel writes should work. Here are some steps needed to implement this:

  1. add a flag to model_configure to indicate parallel IO is desired (for now make sure this flag is set to false if compression is enabled).
  2. if parallel IO is enabled, open the file using nf90_create on all tasks and pass the optional mpi_comm and mpi_info arguments.
  3. the nf90_put_var calls need to be modified to write independent slices (defined by istart,jstart,iend,jend,kstart,kend). The ESMF_Gather call should be skipped.
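
A minimal sketch of step 2, assuming the optional arguments are named comm and info in this netcdf-fortran version; the communicator variable wrt_mpi_comm is illustrative:

    use mpi
    use netcdf
    ! open the file for parallel access on every write task (step 2); older
    ! library versions may also need NF90_MPIIO OR'd into cmode
    ncerr = nf90_create(trim(filename), &
            cmode=IOR(NF90_CLOBBER, NF90_NETCDF4), &
            ncid=ncid, comm=wrt_mpi_comm, info=MPI_INFO_NULL); NC_ERR_STOP(ncerr)
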
@junwang-noaa
Collaborator

Jeff,

Just a quick update: I am working on this following the steps you listed here. The coding is mostly done. One thing not listed in your steps and not clear to me is writing attributes; I assume it is OK to call nf90_put_att on all the MPI tasks?

@jswhit
Contributor

jswhit commented Dec 24, 2019

Yes, creating the attributes, dimensions and variables can be done on all tasks. The only thing that needs to change is the nf90_put_var (writing data to the variable).
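
A rough sketch of that pattern for a 2d field decomposed across write tasks; the dimension, variable, and array names here are illustrative, not necessarily the ones used in module_write_netcdf.F90:

    ! metadata: identical calls on every task
    ncerr = nf90_def_dim(ncid, 'grid_xt', imo, im_dimid); NC_ERR_STOP(ncerr)
    ncerr = nf90_def_dim(ncid, 'grid_yt', jmo, jm_dimid); NC_ERR_STOP(ncerr)
    ncerr = nf90_def_var(ncid, 'tmp2m', NF90_FLOAT, (/im_dimid,jm_dimid/), varid); NC_ERR_STOP(ncerr)
    ncerr = nf90_put_att(ncid, varid, 'units', 'K'); NC_ERR_STOP(ncerr)
    ncerr = nf90_enddef(ncid); NC_ERR_STOP(ncerr)
    ! data: each task writes only its own slice (no ESMF_Gather)
    ncerr = nf90_put_var(ncid, varid, local_arr, &
            start=(/istart,jstart/), &
            count=(/iend-istart+1,jend-jstart+1/)); NC_ERR_STOP(ncerr)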

@junwang-noaa
Collaborator

junwang-noaa commented Dec 24, 2019 via email

@jswhit
Contributor

jswhit commented Dec 24, 2019

Jun - one thing to be aware of. There are two parallel IO implementations with the same API - one that works with classic (non-hdf5) files and one that works with hdf5 files. To enable both, you have to build with --enable-parallel4 and --enable-pnetcdf. If Cory did not build with --enable-pnetcdf you will have to make this change

-    if (ideflate == 0) then
-        ncerr = nf90_create(trim(filename), &
-        cmode=IOR(IOR(NF90_CLOBBER,NF90_64BIT_OFFSET),NF90_SHARE), &
-        ncid=ncid); NC_ERR_STOP(ncerr)
-        ncerr = nf90_set_fill(ncid, NF90_NOFILL, oldMode); NC_ERR_STOP(ncerr)
-    else
-        ncerr = nf90_create(trim(filename), cmode=IOR(IOR(NF90_CLOBBER,NF90_NETCDF4),NF90_CLASSIC_MODEL), &
-        ncid=ncid); NC_ERR_STOP(ncerr)
-        ncerr = nf90_set_fill(ncid, NF90_NOFILL, oldMode); NC_ERR_STOP(ncerr)
-    endif
+   ncerr = nf90_create(trim(filename), cmode=IOR(IOR(NF90_CLOBBER,NF90_NETCDF4),NF90_CLASSIC_MODEL), &
+   ncid=ncid); NC_ERR_STOP(ncerr) ! modify if parallel IO needed
+   ncerr = nf90_set_fill(ncid, NF90_NOFILL, oldMode); NC_ERR_STOP(ncerr)

@junwang-noaa
Collaborator

junwang-noaa commented Dec 24, 2019 via email

@climbfuji
Collaborator

For your information, the pnetcdf (parallel-netcdf) library developed at Argonne writes data in the netCDF CDF-2 format (called 64-bit offset in pnetcdf). It is essentially the same as the netCDF 3 format and can be read and written with any Unidata netCDF library from 3.x upwards, but parallel reads/writes require the pnetcdf support in netCDF 4.x that Jeff described above. Note that early versions of netCDF 4 (4.5.x) had some serious bugs when writing CDF-5 files, which in some cases corrupted the data.

For larger files (individual variables larger than 2 GB), the format must be switched to the 64-bit data format (aka CDF-5; NF90_64BIT_DATA instead of NF90_64BIT_OFFSET in the code snippet above), which is incompatible with standard netCDF installations and also requires building newer Unidata netCDF 4.x versions with the flags that Jeff described above. The drawback of this format is that many downstream tools don't recognize it (you will see an error message along the lines of "MPI routine called before MPI init"), unless they were built against the same pnetcdf-enabled version of netCDF 4.x.
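
For reference, a minimal sketch of requesting CDF-5 through the Fortran API, assuming a netCDF build with CDF-5 support (error handling follows the snippet quoted earlier in this thread):

    ! request the 64-bit data (CDF-5) format instead of 64-bit offset (CDF-2)
    ncerr = nf90_create(trim(filename), &
            cmode=IOR(NF90_CLOBBER, NF90_64BIT_DATA), &
            ncid=ncid); NC_ERR_STOP(ncerr)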

The other option uses parallel HDF5 as the backend for so-called netCDF4 or netCDF4-classic files. This doesn't require the parallel-netcdf backend and seems to be easier. BUT: parallel reading and writing of netCDF4 files through HDF5 is several times slower than through parallel-netcdf!

Long story short: for the given "small" global meshes of 13 km resolution, we can use the CDF-2 parallel-netcdf version, which can be read or written with any existing software that can read netCDF3 classic files. We should not use the netCDF4-phdf5 backend, because it is very, very slow. I have some HPC reports (unpublished work) from my previous job, where I was using pnetcdf, netcdf4-phdf5 and SIONlib in MPAS extreme scaling experiments, if you are interested.

@jswhit
Contributor

jswhit commented Dec 24, 2019

We have to use the hdf5 backend to get compression. My experience is that parallel-hdf5 performance has improved quite a bit in recent versions.

@climbfuji
Collaborator

climbfuji commented Dec 24, 2019 via email

@jswhit
Contributor

jswhit commented Dec 24, 2019

We're using a compression level of 1.

@edwardhartnett
Contributor

Howdy!

To clarify a few points:

  • The CDF5 format originally developed by Argonne for pnetcdf is now part of the Unidata netCDF library as well, and is a canonical and supported binary format of netCDF. Older versions of netCDF will not understand it, but all recent and future versions do. So it is safe to use and distribute; however, it offers no compression.

  • I have a PR open on the netcdf-c project that will allow parallel writes with zlib, and that should help.

  • I am going to re-instate the szip filter in the netCDF C library as well. This will allow szip to be easily used. Since szip was supported in netCDF in read-only mode, all existing netCDF installs will be able to read these files, as long as HDF5 was installed with szip capability.

  • I am exploring some new filters, specifically LZ4. It should offer much better read and write performance, at the cost of slightly worse compression. The challenge is making it available to everyone, but this is something we are working out.

  • HDF5 parallel I/O should be reasonably fast, when settings are correct. If you build netcdf-c with the --enable-benchmarks option, you get a program nc_perf/bm_file, which will allow you to test your file read and write time with a variety of chunksize settings, so you can get a feel for how it changes performance. Let me know if you want help.

@climbfuji
Collaborator

Thanks for the update, Ed. Did you ever get to play with the SIONlib backend and test its performance? We had a telecon with the developers in Juelich some time ago, and I didn't have any time to follow up on this, sorry.

@edwardhartnett
Contributor

No, I have not played with SIONlib but would be interested in learning more about it. There was an idea of writing a SIONlib read/write module for netcdf...

@edwardhartnett
Contributor

OK, some more news: it turns out that szip is already enabled in netcdf-c for writes! It's a bit of an undocumented feature.

For this to work, HDF5 must be built with szip (as well as the usual zlib).

Once netcdf-c is built against an HDF5 that supports szip, szip compression may be turned on for a variable like this:

#define HDF5_FILTER_SZIP 4
    /*
     * Set parameters for SZIP compression; check the description of
     * the H5Pset_szip function in the HDF5 Reference Manual for more
     * information.
     */
    szip_params[0] = H5_SZIP_NN_OPTION_MASK;
    szip_params[1] = H5_SZIP_MAX_PIXELS_PER_BLOCK_IN;
    stat = nc_def_var_filter(ncid, varid, HDF5_FILTER_SZIP, 2, szip_params);

Currently this will only work for sequential access. I am working on getting it working for parallel access next. ;-)

@junwang-noaa
Collaborator

junwang-noaa commented Jan 3, 2020 via email

@jswhit2
Collaborator Author

jswhit2 commented Jan 7, 2020

@junwang-noaa - do you have a branch with the parallel-io mods in it that I can play with?

@junwang-noaa
Collaborator

Yes. The code gets compiled on mars, but I haven't tested the parallel netcdf case yet.

[submodule "FV3"]
path = FV3
url = https://github.com/junwang-noaa/fv3atm
branch = netcdf_parallel

The branch also has the code changes for real(8) lon/lat in the netcdf file and a bug fix for post.

@jswhit2
Collaborator Author

jswhit2 commented Jan 8, 2020

Thanks @junwang-noaa, I will give it a spin on hera.

@jswhit2
Collaborator Author

jswhit2 commented Jan 8, 2020

I've made a few changes to @junwang-noaa's netcdf_parallel branch and almost have it running on hera (without compression). The files are created, but the data isn't actually written. Jun - if you give me access to your fork I can push my changes there, or if you prefer I can create my own fork.

@junwang-noaa
Collaborator

junwang-noaa commented Jan 9, 2020 via email

@jswhit2
Collaborator Author

jswhit2 commented Jan 9, 2020

I've updated https://github.com/junwang-noaa/fv3atm/tree/netcdf_parallel, to include bug fixes and changes from PR #18. It now runs on hera, and for uncompressed data shows some significant speedups. Next step is to build the netcdf library with Unidata/netcdf-c#1582 so we can test parallel compressed writes.

@jswhit2
Collaborator Author

jswhit2 commented Jan 9, 2020

BTW - the code is now using independent (not collective) parallel access. Collective access seems to be quite a bit slower in my tests - I don't understand why. Using independent access required changing the time dimension from unlimited to fixed length.

@edwardhartnett
Contributor

Collective access is required for any filters in HDF5, including all compression filters. :-(

@jswhit2
Collaborator Author

jswhit2 commented Jan 9, 2020

OK - I've updated the code to turn on collective access for variables that are compressed.
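
A sketch of what that can look like with the Fortran API, assuming ideflate is the existing compression flag in module_write_netcdf.F90:

    if (ideflate > 0) then
       ! HDF5 requires collective access for any filtered (compressed) variable
       ncerr = nf90_var_par_access(ncid, varid, NF90_COLLECTIVE); NC_ERR_STOP(ncerr)
    else
       ! independent access was faster in these tests, but needs a fixed-size
       ! (not unlimited) time dimension
       ncerr = nf90_var_par_access(ncid, varid, NF90_INDEPENDENT); NC_ERR_STOP(ncerr)
    endif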

@junwang-noaa
Collaborator

junwang-noaa commented Jan 9, 2020 via email

@jswhit2
Collaborator Author

jswhit2 commented Jan 9, 2020

Jun - it's not merged into master yet but we can check out https://github.com/NOAA-GSD/netcdf-c/tree/ejh_parallel_zlib. I'll build this on hera.

@junwang-noaa
Collaborator

junwang-noaa commented Jan 9, 2020 via email

@jswhit2
Collaborator Author

jswhit2 commented Jan 9, 2020

Jun: Here's what I did on hera (I haven't tested it yet):

  • create a directory ${parlibpath}
  • cd ${parlibpath}
  • download hdf5-1.10.6.tar.gz and netcdf-fortran-4.5.2.tar.gz to that directory

To build HDF5:

  • tar -xvzf hdf5-1.10.6.tar.gz
  • cd hdf5-1.10.6
  • ./configure --prefix=${parlibpath} --enable-hl --enable-parallel
  • make
  • make install

To build netcdf-c:

  • git clone https://github.com/NOAA-GSD/netcdf-c
  • cd netcdf-c; git checkout ejh_parallel_zlib
  • autoreconf -i
  • setenv LDFLAGS -L${parlibpath}/lib
  • setenv CPPFLAGS -I${parlibpath}/include
  • ./configure --prefix=${parlibpath} --enable-netcdf-4 --enable-shared --disable-dap --enable-parallel4
  • make
  • make install

To build netcdf-fortran:

  • tar -xvzf netcdf-fortran-4.5.2.tar.gz
  • cd netcdf-fortran-4.5.2
  • setenv FC mpif90
  • setenv CC mpicc
  • setenv LDFLAGS -L${parlibpath}/lib
  • setenv CPPFLAGS -I${parlibpath}/include
  • ./configure --prefix=${parlibpath}
  • make
  • make install

@climbfuji
Collaborator

climbfuji commented Jan 9, 2020 via email

@jswhit2
Collaborator Author

jswhit2 commented Jan 9, 2020

Dom: We need a bleeding-edge version of netcdf-c from https://github.com/NOAA-GSD/netcdf-c/tree/ejh_parallel_zlib compiled with parallel HDF5 support.

@jswhit2
Collaborator Author

jswhit2 commented Jan 9, 2020

@edwardhartnett - with your branch I'm still getting "NetCDF: Invalid argument" from nf90_def_var when I enable compression filters in parallel mode. Is there something that needs to be updated in the Fortran interface?

@jswhit2
Collaborator Author

jswhit2 commented Jan 26, 2020

I ran a C768 test on hera, and I still see a benefit to parallel IO for the 2D files. Using 12 write tasks, with parallel IO for both 2d and 3d files I get

 parallel netcdf      Write Time is   36.91967 at Fcst   03:00
 parallel netcdf      Write Time is   18.39263 at Fcst   03:00
 total                Write Time is   55.56771 at Fcst   03:00

whereas turning off parallel IO for 2d files I see

 parallel netcdf      Write Time is   37.76248 at Fcst   03:00
 netcdf               Write Time is   33.08457 at Fcst   03:00
 total                Write Time is   71.07401 at Fcst   03:00

and without parallel IO for either 2d or 3d files

 netcdf            Write Time is  206.89221 at Fcst   03:00
 netcdf            Write Time is   29.70651 at Fcst   03:00
 total             Write Time is  236.85511 at Fcst   03:00

@jswhit2
Collaborator Author

jswhit2 commented Jan 26, 2020

Seems like the 2D writes are very sensitive to chunksize. If I set the chunksize to be the same as the size of the array on each write task, I get

 parallel netcdf      Write Time is   35.68702 at Fcst   03:03
 parallel netcdf      Write Time is    9.80075 at Fcst   03:03
 total                Write Time is   45.72166 at Fcst   03:03

The optimal chunksize may be platform dependent. I'll look at adding the chunksize as a runtime parameter in model_configure.

@jswhit2
Collaborator Author

jswhit2 commented Jan 26, 2020

Added ichunk2d, jchunk2d parameters in model_configure (can be used to tune the chunksize for best parallel IO performance). Default is the size of the array on each write task.

@jswhit2
Collaborator Author

jswhit2 commented Jan 27, 2020

Also added ichunk3d,jchunk3d,kchunk3d to set 3d variable chunksize. Default is ichunk3d=ichunk2d, jchunk3d=jchunk2d, kchunk3d=nlevs. This results in the fastest writes for me on hera:

 parallel netcdf      Write Time is   24.97020 at Fcst   03:03
 parallel netcdf      Write Time is    9.98413 at Fcst   03:03
 total                Write Time is   35.23754 at Fcst   03:03

To restore the previous behavior, set ichunk3d=imo,jchunk3d=jmo,kchunk3d=1.

Writes are slower for this setting, but reading 2d horizontal slices is much faster.

@junwang-noaa - could you run your test again on WCOSS with these new default chunksizes?

@junwang-noaa
Collaborator

junwang-noaa commented Jan 27, 2020 via email

@jswhit2
Collaborator Author

jswhit2 commented Jan 27, 2020

Did you mean (imo,jmo,nlevs,1) for the 3d chunk size?

@junwang-noaa
Collaborator

junwang-noaa commented Jan 27, 2020 via email

@jswhit2
Collaborator Author

jswhit2 commented Jan 27, 2020

Hmm. I wonder why I'm getting ~4x faster writes on hera with the new default chunksizes and 12 write tasks (total write time of 35 secs vs 140 secs on WCOSS).

@junwang-noaa
Collaborator

junwang-noaa commented Jan 27, 2020 via email

@junwang-noaa
Collaborator

junwang-noaa commented Jan 27, 2020 via email

@jswhit
Contributor

jswhit commented Jan 27, 2020

To recap, netcdf chunksizes (for parallel and serial compressed IO) can be set at runtime by specifying parameters ichunk2d,jchunk2d,ichunk3d,jchunk3d,kchunk3d in model_configure. The default values (if these parameters are not given) are the MPI decomposition size. If the parameters are set to a negative value, then the netcdf C library will choose the chunksize. If compression is turned off, the netcdf C library chooses the chunksize (the chunking parameters are not used).

My tests on hera show that the default values produced the fastest parallel IO throughput. However, this might not be true on other platforms (and Jun's tests suggest it may not be true on WCOSS).

Note that for parallel IO, the chunksize that produces the fastest write speed may not be optimal for read speed (depending on the access pattern). For example, my tests indicate that setting the chunksize for 3d variables to imo,jmo,1,1 greatly speeds up reads for 2d slices (a very common access pattern), but slows down the writes by 25% or so on hera.

In general, the write speeds for parallel IO with compression appear to be quite sensitive to chunksize (much less so for serial IO).

The nccopy utility can be used to change the chunking in an existing netcdf file.
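
For reference, a sketch of how the chunksizes and deflate level are passed when a compressed 3d variable is defined with the Fortran API; the variable and dimension names are illustrative, and the actual call in module_write_netcdf.F90 may differ:

    ! ichunk3d/jchunk3d/kchunk3d come from model_configure; the defaults are the
    ! per-write-task decomposition size and nlevs
    ncerr = nf90_def_var(ncid, 'ugrd', NF90_FLOAT, &
            (/im_dimid, jm_dimid, pfull_dimid, time_dimid/), varid, &
            chunksizes=(/ichunk3d, jchunk3d, kchunk3d, 1/), &
            deflate_level=1); NC_ERR_STOP(ncerr)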

@jswhit
Contributor

jswhit commented Jan 28, 2020

To compile on hera, I made the following change to modulefiles/hera.intel/fv3:

[ufs-weather-model-parnc]$ git diff modulefiles/hera.intel/fv3
diff --git a/modulefiles/hera.intel/fv3 b/modulefiles/hera.intel/fv3
index a369558..362dea3 100644
--- a/modulefiles/hera.intel/fv3
+++ b/modulefiles/hera.intel/fv3
@@ -50,6 +50,8 @@ module load post/8.0.1
 ##
 module use -a /scratch1/NCEPDEV/nems/emc.nemspara/soft/modulefiles
 module load esmf/8.0.0
+module use -a /scratch2/BMC/gsienkf/whitaker/modulefiles/intel
+module load netcdf-parallel/4.7.3

 ##
 ## load cmake

This, and corresponding changes to the wcoss modulefiles, plus the addition of PARALLEL_NETCDF to the cmake files, need to be included in a companion pull request to https://github.com/ufs-community/ufs-weather-model.

@DusanJovic-NOAA
Collaborator

The netcdf-c branch named "ejh_parallel_zlib" described in this comment:

#23 (comment)

does not exist anymore. Which version of netcdf-c should we use?

@edwardhartnett
Contributor

Use the netcdf-c master branch. All my changes have been merged to master.

@DusanJovic-NOAA
Collaborator

https://github.com/NOAA-GSD/netcdf-c
or
https://github.com/Unidata/netcdf-c

I see NOAA-GSD:master is 6 commits behind Unidata:master

@edwardhartnett
Contributor

Use the Unidata master. The NOAA-GSD one is my fork that I use for working on the Unidata one.

@junwang-noaa
Collaborator

Ed,

just to confirm, I am building netcdf-c from Unidata on hera; the revision is 2a34eb2a.

.../nems/emc.nemspara/soft/netcdf_parallel/netcdf-c> git clone https://github.com/Unidata/netcdf-c
.../nems/emc.nemspara/soft/netcdf_parallel/netcdf-c> git branch
* master
.../nems/emc.nemspara/soft/netcdf_parallel/netcdf-c> git log | more

commit 2a34eb2ac5996dc23339bdb72918eb5503393d77
Merge: 2e7234ff 0f5bdafe
Author: Ward Fisher WardF@users.noreply.github.com
Date: Mon Jan 27 17:44:48 2020 -0700

Merge pull request #1603 from NOAA-GSD/ejh_release_notes

updated RELEASE_NOTES to include results of recent PR merges

@junwang-noaa
Collaborator

I have a problem building hdf5/1.10.6 on Cray. I have:

module load PrgEnv-intel
module rm intel
module rm NetCDF-intel-sandybridge
module load intel/16.3.210

export FC=ftn
export CC=cc
export CXX=CC
export LD=ftn
export LDFLAGS=-L${parlibpath}/lib
export CPPFLAGS=-I${parlibpath}/include
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/gpfs/hps3/emc/nems/noscrub/emc.nemspara/soft/netcdf_parallel/lib

but I got this error:
configure:4721: ./conftest

Please verify that both the operating system and the processor support Intel(R) MOVBE, FMA, BMI, LZCNT and AVX2 instructions.

configure:4725: $? = 1
configure:4732: error: in `/gpfs/hps3/emc/nems/noscrub/emc.nemspara/soft/netcdf_parallel/hdf5-1.10.6':
configure:4734: error: cannot run C compiled programs.

Does anybody have any idea?

@edwardhartnett
Contributor

@junwang-noaa you do have the correct netcdf.

However, I don't know what the HDF5 build problem is. Have you asked HDF5 support?

@jswhit
Contributor

jswhit commented Jan 31, 2020

@junwang-noaa - have a look at the config.log file in the build directory after configure fails to find more information as to the cause of the failure. Search for "cannot" and examine the lines preceding it.

@edwardhartnett
Contributor

@jswhit BTW I'm still working on szip and netcdf-fortran. There are some complexities but I hope to have a working branch for you soon...

@junwang-noaa
Collaborator

A detailed configuration testing on wcoss dell is at:

https://docs.google.com/document/d/18vqajgOv3flbS35eNPMnYpFyNiprkJpHkVZpP--dx5o/edit

@junwang-noaa
Collaborator

@jswhit2 The error message I listed above is from config.log. Not sure why the executable compiled from the C test program can't run on Cray.

@junwang-noaa
Collaborator

@edwardhartnett Can you create a tag from the master?

@edwardhartnett
Contributor

@jswhit2 there is now a branch ejh_szip on netcdf-fortran which has the szip code in the fortran APIs.

@junwang-noaa I cannot create a tag. I am not part of Unidata any longer. ;-) However, I believe they will be doing a release in the next few weeks. (No guarantee though.)

@junwang-noaa
Collaborator

Code is committed. I will open a new issue for installing the HDF5/netcdf libraries on Cray.

@edwardhartnett
Contributor

@jswhit2 the szip changes have been merged into master branch on Unidata's netcdf-fortran project. So just grab that, and build it against the netcdf-c master you have already built.

Then you can call nf90_def_var_szip() on a variable. Make sure you also turn off deflate. You can't use both deflate and szip.
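
A sketch of the call, assuming the signature nf90_def_var_szip(ncid, varid, options_mask, pixels_per_block); the values mirror the earlier C example (32 is the nearest-neighbor option mask, with 32 pixels per block):

    ! enable szip on a variable; deflate must be off for this variable
    ncerr = nf90_def_var_szip(ncid, varid, 32, 32); NC_ERR_STOP(ncerr)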
