Replies: 13 comments 1 reply
-
In the presentation (noted above), E3SM atmospheric data contributed to CMIP6 had a compression ratio of 1.8 with lossless compression (netCDF deflate), so the files were roughly half (about 56%) the size of vanilla uncompressed writes. To gauge the impact of lossy compression on data usability, a couple of example cases would be great to test:
ping @juliettelavoie @geo-rao - note the 1-n suggestions above are unrelated to CMOR development, but I figured it would be useful to co-locate this information so that people in discussions outside this repo can start to familiarize themselves with some of the dev discussions
-
I think several criteria will need to be considered before we specify an appropriate truncation of the mantissa of numbers. Here are a few:
-
@taylor13 agreed, it will be useful to capture these comments and redirect them to another place so that testing of the impacts on data usability and access can be undertaken. This particular issue can be closed once we've ascertained how to expose these new netCDF functions through CMOR, and whether that will happen in the CMOR3 or (future) CMOR4 releases. If it is relatively little work, my preference would be to have these available in a soon-to-be-released version.
-
It should be easy to expose nc_def_var_quantize and nc_def_var_zstandard in the same way we do with nc_def_var_deflate and cmor_set_deflate. We might need to add a check for the version of NetCDF4 being used to determine whether the functions are supported, similar to the following code. Lines 25 to 42 in 047fd2c
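The referenced lines are not reproduced in this thread. Separately from them, here is a minimal sketch of what such a version gate could look like. It is written in Python for brevity rather than in CMOR's C, and the names `ASSUMED_MIN_VERSION`, `parse_version` and `feature_supported` are hypothetical. The minimum versions are an assumption and should be checked against the netCDF-C release notes. The input string is meant to resemble what `nc_inq_libvers()` returns.

```python
import re

# Assumed minimum netCDF-C releases for each feature. These numbers are an
# assumption for illustration; verify them against the netCDF-C release notes.
ASSUMED_MIN_VERSION = {
    "quantize": (4, 9, 0),
    "zstandard": (4, 9, 0),
}


def parse_version(text):
    """Extract (major, minor, patch) from a string such as '4.9.2 of Mar 14 2023'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def feature_supported(feature, library_version):
    """Return True if the library version is at least the assumed minimum."""
    return parse_version(library_version) >= ASSUMED_MIN_VERSION[feature]


if __name__ == "__main__":
    for version in ("4.8.1", "4.9.2 of Mar 14 2023"):
        print(version, "->", feature_supported("quantize", version))
```

In the C code the same comparison would more likely be a compile-time check, so that builds against older libraries simply skip the new calls.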
-
Another piece of context here: Baker et al. added some data with lossy compression to an ensemble of CESM simulations. Several of the figures here highlight the benefits of, and the issues created by, introducing it
-
Nice catch! It looks like, depending on what you're calculating (Figures 5, 6, 9, 13), it does matter. Their section 6, "Lessons learned", is certainly worth reading. It notes that relationships between variables matter, focusing on their surface energy balance anomaly, which uses 4 separate variables, and points out that commonly derived variables need to be a primary consideration, as do high-frequency/precipitation extremes and other very data-sensitive analyses. They also point out that how missing/fill values are treated needs to be a consideration. It could be useful to loop around with Allison and Gary (@strandwg) to see if follow-on analyses are available
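To make the derived-variable point concrete, here is a small sketch of why a residual of several large terms is more sensitive than any single term. It uses plain decimal rounding to two significant digits as a stand-in for lossy quantization (this is not the netCDF quantize algorithm), and the four flux values are hypothetical numbers chosen only to keep the arithmetic easy to follow, not values from Baker et al.

```python
def round_significant(value, digits):
    """Round value to the given number of significant decimal digits."""
    return float(f"{value:.{digits - 1}e}")


def relative_error(exact, approx):
    """Absolute difference between approx and exact, as a fraction of exact."""
    return abs(approx - exact) / abs(exact)


def residual(terms):
    """Hypothetical balance: one incoming term minus three outgoing terms."""
    return terms["sw_net"] - terms["lw_net"] - terms["sensible"] - terms["latent"]


# Hypothetical flux terms in W m-2 (illustrative only).
terms = {"sw_net": 161.3, "lw_net": 58.7, "sensible": 21.4, "latent": 79.6}

# Lossy stand-in: keep only two significant decimal digits of every term.
rounded = {name: round_significant(value, 2) for name, value in terms.items()}

worst_term_error = max(relative_error(terms[n], rounded[n]) for n in terms)
residual_error = relative_error(residual(terms), residual(rounded))

if __name__ == "__main__":
    print(f"worst single-term relative error: {worst_term_error:.3%}")
    print(f"relative error of the residual:   {residual_error:.3%}")
```

Each term is off by at most a couple of percent, but the residual is small compared with the terms, so the same absolute errors wipe it out entirely in this example.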
-
Thanks for the link to Baker. Looks like useful information.
-
Allison is definitely the person to talk to.
-
Hey, one reason to stick with lossy compression and skip lossless compression is that the latter is done in hardware on the storage layer and is much more efficient that way, since it saves CPU cycles. Just another consideration; I didn't have a chance to raise it with Charlie yesterday.
-
Is this on the GPFS hardware that you're talking about? So this is infrastructure/hardware dependent, right?
-
Typically the block storage controller. All the US datacenters have it for sure; it's a fairly standard feature that has been around for a while, but true, there's no guarantee that every center will enable it. (But I would be surprised if they haven't by now.)
-
OTOH, the end user will end up with a larger file if it is downloaded locally, so there would be a trade-off (but the downloading tool could then rewrite the file with L3 lossless compression to save space locally; I'd hope that by CMIP7 wget is a relic and most people use a more robust client that would offer such a feature).
-
That's news to me. What type of lossless compression do these block storage controllers implement? Zstandard? DEFLATE? Or...?
-
The latest versions of libnetcdf include new functions to further squash data, including lossy compression; see Charlie Zender: Why & How to Increase Dataset Compression in CMIP7, in particular the quantize (lossy) and zstandard (lossless) operations.
How easy is it to expose this in CMOR 3.9? #725
ping @taylor13 @matthew-mizielinski @sashakames @piotrflorek @czender
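For anyone new to the idea, here is a rough sketch of why quantizing before lossless compression helps. It zeroes the trailing mantissa bits of float32 values (simple bit shaving, which is not one of the netCDF quantize algorithms) and then compresses with `zlib` from the Python standard library as a stand-in for Zstandard. The signal is synthetic, and `keep_bits=10` is an arbitrary choice for illustration.

```python
import math
import struct
import zlib


def pack_float32(values):
    """Pack a sequence of numbers as little-endian float32 bytes."""
    return struct.pack(f"<{len(values)}f", *values)


def shave_mantissa(raw, keep_bits):
    """Zero the trailing (23 - keep_bits) mantissa bits of each float32 in raw."""
    if not 0 <= keep_bits <= 23:
        raise ValueError("keep_bits must be between 0 and 23")
    mask = (0xFFFFFFFF << (23 - keep_bits)) & 0xFFFFFFFF
    count = len(raw) // 4
    words = struct.unpack(f"<{count}I", raw)
    return struct.pack(f"<{count}I", *(word & mask for word in words))


# Synthetic temperature-like signal (illustrative only, not model output).
values = [280.0 + 15.0 * math.sin(0.01 * i) + 0.5 * math.sin(1.7 * i)
          for i in range(20000)]
raw = pack_float32(values)

lossless_size = len(zlib.compress(raw, 6))
lossy_size = len(zlib.compress(shave_mantissa(raw, keep_bits=10), 6))

if __name__ == "__main__":
    print(f"uncompressed:          {len(raw)} bytes")
    print(f"lossless only:         {lossless_size} bytes")
    print(f"shaved, then lossless: {lossy_size} bytes")
```

The trailing mantissa bits of real-valued data look close to random, so a lossless compressor can do little with them. Once they are zeroed, the same compressor has long runs of identical bits to work with.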