compression/shuffle defaults for CMOR3 #403

Closed
taylor13 opened this issue Sep 19, 2018 · 19 comments
@taylor13
Collaborator

In cmor_reset_variable() (in cmor.c) I find:

    cmor_vars[var_id].shuffle = 0;
    cmor_vars[var_id].deflate = 1;
    cmor_vars[var_id].deflate_level = 1;

Is this initializing the settings for shuffle, deflate and deflate_level? And will those settings remain in effect unless the CMOR table specifies different values? Or do they invariably get ignored?

If these are default settings specified for CMOR3, then all data will be compressed by default, correct?

@durack1
Contributor

durack1 commented Sep 19, 2018

@taylor13 following Balaji et al., 2018, we should probably amend the above to:

    cmor_vars[var_id].shuffle = 1;
    cmor_vars[var_id].deflate = 1;
    cmor_vars[var_id].deflate_level = 2;
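
For context, these three struct fields correspond to the arguments of cmor_set_deflate(var_id, shuffle, deflate, deflate_level) in the C API, so the proposed settings can already be requested explicitly per variable. A minimal Python sketch, assuming the cmor.set_deflate binding is available and that the CMIP6 tables and the example input JSON from the CMOR Test/ directory sit at the paths shown:

    # Sketch only: request shuffle + deflate explicitly for one variable via the
    # Python API (assumes CMIP6 tables in "Tables/" and the example dataset
    # description "CMOR_input_example.json" from the CMOR Test/ directory).
    import numpy
    import cmor

    cmor.setup(inpath="Tables", netcdf_file_action=cmor.CMOR_REPLACE_4)
    cmor.dataset_json("CMOR_input_example.json")
    cmor.load_table("CMIP6_Amon.json")

    lat = numpy.arange(-89.5, 90.0, 1.0)
    lon = numpy.arange(0.5, 360.0, 1.0)
    ilat = cmor.axis("latitude", units="degrees_north", coord_vals=lat,
                     cell_bounds=numpy.arange(-90.0, 91.0, 1.0))
    ilon = cmor.axis("longitude", units="degrees_east", coord_vals=lon,
                     cell_bounds=numpy.arange(0.0, 361.0, 1.0))
    itime = cmor.axis("time", units="days since 2018-01-01",
                      coord_vals=numpy.array([15.5]),
                      cell_bounds=numpy.array([0.0, 31.0]))

    ts_id = cmor.variable("ts", units="K", axis_ids=[itime, ilat, ilon])

    # shuffle=1, deflate=1 (zlib on), deflate_level=2, i.e. the values proposed above.
    cmor.set_deflate(ts_id, 1, 1, 2)

    cmor.write(ts_id, numpy.full((1, lat.size, lon.size), 280.0, dtype="f"))
    cmor.close()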

@doutriaux1
Collaborator

@taylor13 @durack1 I will run a simple test to see if the output indeed comes back compressed by default, and what the values are.

@durack1
Contributor

durack1 commented Sep 19, 2018

@doutriaux1 thanks that would be ideal!

@taylor13
Collaborator Author

I didn't see anything specifically indicating that deflate_level=2 is much better than deflate_level=1. Also, in the study mentioned by Balaji, I didn't see "shuffle" analyzed, so why do you suggest shuffle=1, and how much does that slow things down?

@durack1
Contributor

durack1 commented Sep 19, 2018

To help inform the discussion about compression, we undertook a systematic study of typical
model output files under lossless compression, the results of which are publicly available [22].
The study indicates that standard zlib compression in the netCDF4 library with the settings
of deflate=2 (relatively modest, and computationally inexpensive), and shuffle (which
ensures better spatiotemporal homogeneity) ensures the best compromise between
increased computational cost and reduced data volume. For an ESM, we expect a total
savings of about 50 %, with ocean, ice, and land realms benefiting most (owing to large
areas of the globe that are masked) and atmospheric data benefiting least. This 50%
estimate has been verified with sample output from one model whose compression
rates should be quite typical.

This is based on https://public.tableau.com/profile/balticbirch#!/vizhome/NC4/NetCDF4Deflation

@taylor13
Collaborator Author

Also note (from Charlie Zender):

FWIW, in my experience deflate_level=1 suffices for 90-99% of lossless
compression, usually 98-99%. Levels > 1 just slow things down:
http://www.geosci-model-dev.net/9/3199/2016

and

In my experience shuffle=on never increases, and can decrease
at little expense in time, compressed filesize.
That's why NCO always turns shuffle=on when compressing. 

I have recommended in the CMIP6 output requirements:

It is recommended that data be compressed by setting the “deflate_level” 
and “shuffle” to values that optimize the balance between reduction of file 
size and degradation in performance. Usually deflate_level=1 will suffice and 
“shuffle” can be turned on with little performance penalty. 

@doutriaux1
Collaborator

Looking at a default dump, I see chunking is on by default; see:

	double lon(lon) ;
		lon:bounds = "lon_bnds" ;
		lon:units = "degrees_east" ;
		lon:axis = "X" ;
		lon:long_name = "Longitude" ;
		lon:standard_name = "longitude" ;
		lon:_Storage = "contiguous" ;
		lon:_Endianness = "little" ;
	double lon_bnds(lon, bnds) ;
		lon_bnds:_Storage = "chunked" ;
		lon_bnds:_ChunkSizes = 1, 2 ;
		lon_bnds:_DeflateLevel = 1 ;
		lon_bnds:_Endianness = "little" ;
	float ts(time, lat, lon) ;
		ts:standard_name = "surface_temperature" ;
		ts:long_name = "Surface Temperature" ;
		ts:comment = "Temperature of the lower boundary of the atmosphere" ;
		ts:units = "K" ;
		ts:cell_methods = "area: time: mean" ;
		ts:cell_measures = "area: areacella" ;
		ts:missing_value = 1.e+20f ;
		ts:_FillValue = 1.e+20f ;
		ts:history = "2018-09-20T16:13:07Z altered by CMOR: Converted type from \'l\' to \'f\'." ;
		ts:_Storage = "chunked" ;
		ts:_ChunkSizes = 1, 1, 1 ;
		ts:_DeflateLevel = 1 ;
		ts:_Endianness = "little" ;
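
The _Storage, _DeflateLevel, _ChunkSizes and _Endianness lines above are the special virtual attributes printed by "ncdump -s"; the same settings can also be read back programmatically. A short sketch with netCDF4-python, using a placeholder filename for the file dumped above:

    # Sketch: query the compression/chunking settings of a CMOR output file.
    # "ts_test.nc" is just a placeholder name for the file shown in the dump.
    import netCDF4

    with netCDF4.Dataset("ts_test.nc") as nc:
        ts = nc.variables["ts"]
        print(ts.filters())   # e.g. {'zlib': True, 'shuffle': False, 'complevel': 1, ...}
        print(ts.chunking())  # e.g. [1, 1, 1], or 'contiguous' for uncompressed coordinates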

@taylor13
Collaborator Author

The _DeflateLevel = 1 is good as a default. [I don't think it is necessary, however, for the coordinates. Doesn't hurt, I suppose.]

The _ChunkSizes I think are way too small. I think by default CMOR should rely on the netCDF default chunking strategy. Anyone else care to comment?
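
For comparison, the "netCDF default chunking strategy" is simply what the library picks when no explicit chunk sizes are passed at variable creation. A quick netCDF4-python sketch, with an arbitrary filename and grid, shows what the library would choose on its own:

    # Sketch: let the netCDF library choose the chunk shape by omitting chunksizes.
    import netCDF4

    with netCDF4.Dataset("chunk_default.nc", "w") as nc:
        nc.createDimension("time", None)   # unlimited record dimension
        nc.createDimension("lat", 180)
        nc.createDimension("lon", 360)
        ts = nc.createVariable("ts", "f4", ("time", "lat", "lon"),
                               zlib=True, complevel=1)
        print(ts.chunking())   # library-chosen chunk sizes (version dependent)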

@taylor13
Collaborator Author

It would be nice to check whether turning "shuffle" on degrades performance. Charlie Zender said that when it's on it sometimes significantly reduces storage (and it doesn't hurt performance). Before making shuffle the default, though, let's check.

@taylor13
Collaborator Author

If further guidance is needed, please ask.

@taylor13
Collaborator Author

Concerning the "too small" _ChunkSizes mentioned in #403 (comment): the default apparently is to set the chunksize for longitude to the length of the longitude dimension, the chunksize for latitude to the length of the latitude dimension, and chunksize=1 for all other dimensions ("time" in this case). In the example the longitude and latitude dimensions have length 1, so _ChunkSizes = 1, 1, 1 results. This is OK as the default behavior.

The only thing left to do is to try turning "shuffle" on as the default. We could then check how much better the compression is.

@mauzey1
Collaborator

mauzey1 commented Nov 2, 2018

I created NetCDF files and profiled their creation time and size with the following script.

https://github.com/mauzey1/cmor/blob/96786014b6a00076df5d2eb74d721736db52a294/Test/shuffle_test.py

With shuffle enabled, the files were ~17% smaller and took 23% less time to create.
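
The linked script drives CMOR itself; the same kind of comparison can also be sketched with plain netCDF4-python. Below is a stripped-down, stand-alone version (not the actual shuffle_test.py), using deflate level 1 in both runs and an arbitrary grid with random test data:

    # Sketch: compare write time and file size with shuffle off vs. on (complevel=1).
    # Random data compresses far worse than real model output; only the relative
    # comparison between the two runs is of interest here.
    import os
    import time
    import numpy
    import netCDF4

    data = numpy.random.rand(2, 26, 360, 720).astype("f4")  # (time, plev, lat, lon)

    for shuffle in (False, True):
        fname = "shuffle_%s.nc" % shuffle
        start = time.time()
        with netCDF4.Dataset(fname, "w") as nc:
            nc.createDimension("time", data.shape[0])
            nc.createDimension("plev", data.shape[1])
            nc.createDimension("lat", data.shape[2])
            nc.createDimension("lon", data.shape[3])
            var = nc.createVariable("ta", "f4", ("time", "plev", "lat", "lon"),
                                    zlib=True, complevel=1, shuffle=shuffle)
            var[:] = data
        elapsed = time.time() - start
        print(fname, round(os.path.getsize(fname) / 1e6, 1), "MB",
              round(elapsed, 2), "s")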

@durack1
Contributor

durack1 commented Nov 2, 2018

Thanks @mauzey1. With the demos, how big are the files? And did you use files that include a large amount of missing data (e.g., ocean grids)? It would be interesting to see how this works for ~10 GB files, and for files where ~30% of the points are missing.

@mauzey1
Collaborator

mauzey1 commented Nov 2, 2018

@durack1 The demo files ranged from 21 MB to 211 MB without shuffle and 17 MB to 175 MB with shuffle. They use grids with 360 latitude points, 720 longitude points, and 26 pressure levels, and have 1, 2, 5, and 10 time steps.

Are there any examples of cmor using ocean grids with a lot of missing data?

@doutriaux1
Collaborator

@mauzey1 this test, https://github.com/PCMDI/cmor/blob/master/Test/test_python_CMIP6_projections.py, writes data on an ocean grid; you could tweak it to make the grid really big with a bunch of missing values.

I think I have a branch/pr somewhere which finishes that script to actually write the data.

@mauzey1
Collaborator

mauzey1 commented Nov 13, 2018

I have done some tests using a mask from this file for ocean data:
https://vesg.ipsl.upmc.fr/thredds/catalog/esgcet/45/CMIP6.CMIP.IPSL.IPSL-CM6A-LR.1pctCO2.r1i1p1f1.Omon.dpco2.gn.v20180727.html?dataset=CMIP6.CMIP.IPSL.IPSL-CM6A-LR.1pctCO2.r1i1p1f1.Omon.dpco2.gn.v20180727.dpco2_Omon_IPSL-CM6A-LR_1pctCO2_r1i1p1f1_gn_185001-199912.nc

The mask removes about 46% of the data from a 332x362 grid. I made a script that creates data with the same longitude and latitude grid as the above file's mask but with varying numbers of time steps.

https://github.com/mauzey1/cmor/blob/shuffle_testing/Test/shuffle_test_missing_data.py

Here are some results from that script. Enabling shuffle causes an 11% decrease in file size and a 13% decrease in run time.

    grid dim         no shuffle (size, time)   shuffle (size, time)     size diff %   time diff %
    332x362x1000      219 MB,  27.179 s         194 MB,  23.322 s       -11.099       -13.776
    332x362x2000      436 MB,  53.107 s         388 MB,  45.851 s       -11.110       -13.646
    332x362x5000     1090 MB, 132.502 s         969 MB, 115.061 s       -11.117       -13.153
    332x362x10000    2179 MB, 267.695 s        1936 MB, 232.012 s       -11.120       -13.325

I did, however, find a case where enabling shuffle increases the file size. I used a file with a land mask that removes 62% of the data if you mask the points equal to 0.0.

https://esgf.nccs.nasa.gov/thredds/catalog/esgcet/47/CMIP6.CMIP.NASA-GISS.GISS-E2-1-G.piControl.r1i1p1f1.fx.sftlf.gn.v20180824.html?dataset=CMIP6.CMIP.NASA-GISS.GISS-E2-1-G.piControl.r1i1p1f1.fx.sftlf.gn.v20180824.sftlf_fx_piControl_GISS-E2-1-G_r1i1p1f1_gn.nc

https://github.com/mauzey1/cmor/blob/7c3f9dce2482528064dab6ee1983186437bf2687/Test/shuffle_test_missing_data2.py

    grid dim         no shuffle (size, time)   shuffle (size, time)     size diff %   time diff %
    90x144x1000        18 MB,  2.695 s           19 MB,  2.412 s         2.721        -10.160
    90x144x2000        36 MB,  5.138 s           37 MB,  4.625 s         2.725         -9.937
    90x144x5000        90 MB, 12.588 s           92 MB, 11.701 s         2.721         -6.988
    90x144x10000      179 MB, 24.945 s          184 MB, 23.566 s         2.721         -5.463
    90x144x20000      358 MB, 48.430 s          368 MB, 45.527 s         2.722         -5.942

Although there is a small decrease in run time, the files are 2.7% larger with shuffle than without it.

@taylor13
Collaborator Author

To be sure, in these tests was deflate_level = 1?

Also, is there a simple explanation for why the time to generate the files should be less when shuffle is turned on? I would have thought the computer had to work harder when shuffling was invoked.

@mauzey1
Collaborator

mauzey1 commented Nov 14, 2018

All of my tests use deflate_level 1.

I'm not sure why enabling shuffling leads to shorter run times. At first I thought it was caused by a reduction in disk writing due to smaller files, but the reduction in time also happens when the files get a little bigger. Could the shuffling cause the compression algorithm to run faster?

@taylor13
Collaborator Author

Sounds reasonable to me. Let's not pursue it further.

I suggest we leave in place the present guidance:

It is recommended that data be compressed by setting the “deflate_level” 
and “shuffle” to values that optimize the balance between reduction of file 
size and degradation in performance. Usually deflate_level=1 will suffice and 
“shuffle” can be turned on with little performance penalty. 

Anyone disagree?

thanks for your careful investigation.
