compression/shuffle defaults for CMOR3 #403
@taylor13 following Balaji et al., 2018, we should probably amend the above to:
@doutriaux1 thanks, that would be ideal!
I didn't see anything specifically indicating that deflate_level=2 is much better than deflate_level=1. Also, in the study mentioned by Balaji, I didn't see "shuffle" analyzed, so why do you suggest shuffle=1, and how much does it slow things down?
This is based on https://public.tableau.com/profile/balticbirch#!/vizhome/NC4/NetCDF4Deflation
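For reference, the shuffle filter simply regroups the bytes of array elements before deflate sees them. Here is a minimal pure-Python sketch of the idea (not CMOR or netCDF library code; the smooth toy field is an assumption chosen to resemble a geophysical variable):

```python
import struct
import zlib

def shuffle(data: bytes, itemsize: int) -> bytes:
    """Byte-shuffle as in the HDF5 shuffle filter: all first bytes of the
    elements, then all second bytes, and so on."""
    n = len(data) // itemsize
    return bytes(data[j * itemsize + i] for i in range(itemsize) for j in range(n))

# Toy field: slowly varying doubles, so the sign/exponent and high-order
# mantissa bytes are nearly constant from element to element.
values = [280.0 + 0.001 * i for i in range(4096)]
raw = struct.pack("<4096d", *values)

plain = zlib.compress(raw, 1)                 # deflate_level=1, no shuffle
shuffled = zlib.compress(shuffle(raw, 8), 1)  # deflate_level=1, with shuffle

print(len(raw), len(plain), len(shuffled))
```

On data like this the shuffled stream compresses noticeably better, because the near-constant high-order bytes end up in long runs that deflate handles cheaply; on noisy or heavily masked data the filter can be neutral or even counterproductive.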
Also note (from Charlie Zender):
and
I have recommended in the CMIP6 output requirements:
Looking at a default dump, I see chunking is on by default; see:
The _DeflateLevel = 1 is good as a default. [I don't think it is necessary, however, for the coordinates. Doesn't hurt, I suppose.] The _ChunkSizes I think are way too small. I think by default CMOR should rely on the netCDF default chunking strategy. Anyone else care to comment?
It would be nice to check whether turning "shuffle" on degrades performance. Charlie Zender said that when it's on it sometimes significantly reduces storage (and it doesn't hurt performance). Before making shuffle the default, though, let's check.
If further guidance is needed, please ask.
Concerning the "too small" _ChunkSizes mentioned in #403 (comment): the default apparently is to set the chunksize for longitude to the length of the longitude dimension, the chunksize for latitude to the length of the latitude dimension, and chunksize=1 for all other dimensions ("time" in this case). In the example, the length of the longitude and latitude dimensions is 1, so _ChunkSizes = 1, 1, 1 results. This is o.k. as the default behavior. The only thing left to do is to try turning "shuffle" on as the default. We could check how much better the compression is.
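To make the default described above concrete, here is a tiny hypothetical helper (the function name and the dimension-name matching are illustrative, not CMOR's actual code): full chunks along latitude and longitude, chunk size 1 for every other dimension.

```python
def default_chunksizes(dims):
    """dims: ordered (name, length) pairs for a variable, e.g.
    [("time", 120), ("latitude", 1), ("longitude", 1)]."""
    horizontal = {"lat", "latitude", "lon", "longitude"}
    # Full chunk along the horizontal dimensions, chunksize 1 elsewhere.
    return [length if name in horizontal else 1 for name, length in dims]

# The single-point example from the comment above: everything collapses to 1.
print(default_chunksizes([("time", 120), ("latitude", 1), ("longitude", 1)]))
# -> [1, 1, 1]

# On a 1-degree global grid, each chunk is one full horizontal slab per step.
print(default_chunksizes([("time", 120), ("latitude", 180), ("longitude", 360)]))
# -> [1, 180, 360]
```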
I created NetCDF files and profiled their creation time and size with the following script: https://github.com/mauzey1/cmor/blob/96786014b6a00076df5d2eb74d721736db52a294/Test/shuffle_test.py With shuffle enabled, file sizes were reduced by ~17% and the files took 23% less time to create.
Thanks @mauzey1. With the demos, how big are the files? And did you use files that included a large number of missing data (e.g. ocean grids)? It would be interesting to see how this works for ~10 GB files, and for files where ~30% of points are missing.
@durack1 The demo files ranged from 21 MB to 211 MB without shuffle and 17 MB to 175 MB with shuffle. They are using grids that have 360 latitude points, 720 longitude points, and 26 pressure levels. They have 1, 2, 5, and 10 time steps. Are there any examples of cmor using ocean grids with a lot of missing data?
@mauzey1 this test: https://github.com/PCMDI/cmor/blob/master/Test/test_python_CMIP6_projections.py writes data on an ocean grid; you could tweak it to make the grid really big with a bunch of missing values. I think I have a branch/PR somewhere which finishes that script to actually write the data.
I have done some tests using a mask from this file for ocean data: The mask removes about 46% of the data from a 332x362 grid. I made a script that creates data with the same longitude and latitude grid as the above file's mask but with varying numbers of time steps: https://github.com/mauzey1/cmor/blob/shuffle_testing/Test/shuffle_test_missing_data.py Here are some results from that script. Enabling shuffle causes an 11% decrease in file size with a 13% decrease in run time.
I did however find a case where enabling shuffle would cause an increase in file size. I used a file that had a land mask that removes 62% of the data if you mask parts that equal 0.0.
Although there is a small decrease in run time, the files are 2.7% larger with shuffle than without it.
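The masked-data result is easy to reproduce outside netCDF. Below is a pure-Python sketch (the toy mask pattern and fill value are assumptions for illustration, not the grid from the test above) that compresses a roughly 62%-masked field with and without a byte shuffle, and confirms both paths are lossless:

```python
import struct
import zlib

def shuffle(data: bytes, itemsize: int) -> bytes:
    """Group byte 0 of every element, then byte 1, etc. (HDF5-style shuffle)."""
    n = len(data) // itemsize
    return bytes(data[j * itemsize + i] for i in range(itemsize) for j in range(n))

def unshuffle(data: bytes, itemsize: int) -> bytes:
    """Inverse of shuffle: restore the original element-interleaved byte order."""
    n = len(data) // itemsize
    return bytes(data[i * n + j] for j in range(n) for i in range(itemsize))

FILL = 1.0e20  # illustrative missing value, not taken from the script above
# Toy field: ~62% of points masked (13 of every 21), the rest varying smoothly.
values = [FILL if (i * 13) % 21 < 13 else 15.0 + 0.01 * i for i in range(4096)]
raw = struct.pack("<4096d", *values)

plain = zlib.compress(raw, 1)
shuffled = zlib.compress(shuffle(raw, 8), 1)
print(len(plain), len(shuffled))  # which is smaller depends on the byte layout

# Both paths are lossless: decompress (+ unshuffle) restores the exact bytes.
assert zlib.decompress(plain) == raw
assert unshuffle(zlib.decompress(shuffled), 8) == raw
```

Because deflate can back-reference the repeated eight-byte fill pattern directly, scattering fill bytes across shuffled byte planes can hurt compression on heavily masked fields, which is consistent with the 2.7% growth reported above.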
To be sure, in these tests was deflate_level = 1? Also, is there a simple explanation why the time to generate the files should be less when shuffle is turned on? I would have thought the computer had to work harder when shuffling was invoked.
All of my tests use deflate_level 1. I'm not sure why enabling shuffling leads to shorter run times. At first I thought it was caused by reduction in disk writing due to smaller files, but the reduction in time also happens when the files get a little bigger. Could the shuffling cause the compression algorithm to run faster?
Sounds reasonable to me. Let's not pursue it further. I suggest we leave in place the present guidance:
Anyone disagree? Thanks for your careful investigation.
In cmor_reset_variable() (in cmor.c) I find:
Is this initializing the settings for shuffle, deflate and deflate_level? And will those settings remain in effect unless the CMOR table specifies different values? Or do they invariably get ignored?
If these are default settings specified for CMOR3, then all data will be compressed by default, correct?