compression/shuffle defaults for CMOR3 #403

Closed
taylor13 opened this issue Sep 19, 2018 · 19 comments
@taylor13
Collaborator

In cmor_reset_variable() (in cmor.c) I find:

    cmor_vars[var_id].shuffle = 0;
    cmor_vars[var_id].deflate = 1;
    cmor_vars[var_id].deflate_level = 1;

Is this initializing the settings for shuffle, deflate and deflate_level? And will those settings remain in effect unless the CMOR table specifies different values? Or do they invariably get ignored?

If these are default settings specified for CMOR3, then all data will be compressed by default, correct?

@durack1
Contributor

durack1 commented Sep 19, 2018

@taylor13 following Balaji et al., 2018, we should probably amend the above to:

    cmor_vars[var_id].shuffle = 1;
    cmor_vars[var_id].deflate = 1;
    cmor_vars[var_id].deflate_level = 2;
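
For context, these three struct fields correspond to the arguments of cmor_set_deflate(var_id, shuffle, deflate, deflate_level) in the C API, so the proposed settings can already be requested explicitly per variable. A minimal Python sketch, assuming the cmor.set_deflate binding is available and that the CMIP6 tables and the example input JSON from the CMOR Test/ directory sit at the paths shown:

    # Sketch only: request shuffle + deflate explicitly for one variable via the
    # Python API (assumes CMIP6 tables in "Tables/" and the example dataset
    # description "CMOR_input_example.json" from the CMOR Test/ directory).
    import numpy
    import cmor

    cmor.setup(inpath="Tables", netcdf_file_action=cmor.CMOR_REPLACE_4)
    cmor.dataset_json("CMOR_input_example.json")
    cmor.load_table("CMIP6_Amon.json")

    lat = numpy.arange(-89.5, 90.0, 1.0)
    lon = numpy.arange(0.5, 360.0, 1.0)
    ilat = cmor.axis("latitude", units="degrees_north", coord_vals=lat,
                     cell_bounds=numpy.arange(-90.0, 91.0, 1.0))
    ilon = cmor.axis("longitude", units="degrees_east", coord_vals=lon,
                     cell_bounds=numpy.arange(0.0, 361.0, 1.0))
    itime = cmor.axis("time", units="days since 2018-01-01",
                      coord_vals=numpy.array([15.5]),
                      cell_bounds=numpy.array([0.0, 31.0]))

    ts_id = cmor.variable("ts", units="K", axis_ids=[itime, ilat, ilon])

    # shuffle=1, deflate=1 (zlib on), deflate_level=2, i.e. the values proposed above.
    cmor.set_deflate(ts_id, 1, 1, 2)

    cmor.write(ts_id, numpy.full((1, lat.size, lon.size), 280.0, dtype="f"))
    cmor.close()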

@doutriaux1
Collaborator

@taylor13 @durack1 I will run a simple test to see if the output indeed comes back compressed by default, and what the values are.

@durack1
Contributor

durack1 commented Sep 19, 2018

@doutriaux1 thanks that would be ideal!

@taylor13
Collaborator Author

I didn't see anything specifically indicating that deflate_level=2 is much better than deflate_level=1. Also, in the study mentioned by Balaji, I didn't see "shuffle" analyzed, so why do you suggest shuffle=1, and how much does that slow things down?

@durack1
Contributor

durack1 commented Sep 19, 2018

To help inform the discussion about compression, we undertook a systematic study of typical
model output files under lossless compression, the results of which are publicly available [22].
The study indicates that standard zlib compression in the netCDF4 library with the settings
of deflate=2 (relatively modest, and computationally inexpensive), and shuffle (which
ensures better spatiotemporal homogeneity) ensures the best compromise between
increased computational cost and reduced data volume. For an ESM, we expect a total
savings of about 50 %, with ocean, ice, and land realms benefiting most (owing to large
areas of the globe that are masked) and atmospheric data benefiting least. This 50%
estimate has been verified with sample output from one model whose compression
rates should be quite typical.

This is based on https://public.tableau.com/profile/balticbirch#!/vizhome/NC4/NetCDF4Deflation

@taylor13
Collaborator Author

Also note (from Charlie Zender):

FWIW, in my experience deflate_level=1 suffices for 90-99% of lossless
compression, usually 98-99%. Levels > 1 just slow things down:
http://www.geosci-model-dev.net/9/3199/2016

and

In my experience shuffle=on never increases, and can decrease
at little expense in time, compressed filesize.
That's why NCO always turns shuffle=on when compressing. 

I have recommended in the CMIP6 output requirements:

It is recommended that data be compressed by setting the “deflate_level” 
and “shuffle” to values that optimize the balance between reduction of file 
size and degradation in performance. Usually deflate_level=1 will suffice and 
“shuffle” can be turned on with little performance penalty. 

@doutriaux1
Collaborator

Looking at a default dump, I see chunking is on by default; see:

	double lon(lon) ;
		lon:bounds = "lon_bnds" ;
		lon:units = "degrees_east" ;
		lon:axis = "X" ;
		lon:long_name = "Longitude" ;
		lon:standard_name = "longitude" ;
		lon:_Storage = "contiguous" ;
		lon:_Endianness = "little" ;
	double lon_bnds(lon, bnds) ;
		lon_bnds:_Storage = "chunked" ;
		lon_bnds:_ChunkSizes = 1, 2 ;
		lon_bnds:_DeflateLevel = 1 ;
		lon_bnds:_Endianness = "little" ;
	float ts(time, lat, lon) ;
		ts:standard_name = "surface_temperature" ;
		ts:long_name = "Surface Temperature" ;
		ts:comment = "Temperature of the lower boundary of the atmosphere" ;
		ts:units = "K" ;
		ts:cell_methods = "area: time: mean" ;
		ts:cell_measures = "area: areacella" ;
		ts:missing_value = 1.e+20f ;
		ts:_FillValue = 1.e+20f ;
		ts:history = "2018-09-20T16:13:07Z altered by CMOR: Converted type from \'l\' to \'f\'." ;
		ts:_Storage = "chunked" ;
		ts:_ChunkSizes = 1, 1, 1 ;
		ts:_DeflateLevel = 1 ;
		ts:_Endianness = "little" ;
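
The _Storage, _DeflateLevel, _ChunkSizes and _Endianness lines above are the special virtual attributes printed by "ncdump -s"; the same settings can also be read back programmatically. A short sketch with netCDF4-python, using a placeholder filename for the file dumped above:

    # Sketch: query the compression/chunking settings of a CMOR output file.
    # "ts_test.nc" is just a placeholder name for the file shown in the dump.
    import netCDF4

    with netCDF4.Dataset("ts_test.nc") as nc:
        ts = nc.variables["ts"]
        print(ts.filters())   # e.g. {'zlib': True, 'shuffle': False, 'complevel': 1, ...}
        print(ts.chunking())  # e.g. [1, 1, 1], or 'contiguous' for uncompressed coordinates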

@taylor13
Collaborator Author

The _DeflateLevel = 1 is good as a default. [I don't think it is necessary, however, for the coordinates. Doesn't hurt, I suppose.]

The _ChunkSizes I think are way too small. I think by default CMOR should rely on the netCDF default chunking strategy. Anyone else care to comment?
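
For comparison, the "netCDF default chunking strategy" is simply what the library picks when no explicit chunk sizes are passed at variable creation. A quick netCDF4-python sketch, with an arbitrary filename and grid, shows what the library would choose on its own:

    # Sketch: let the netCDF library choose the chunk shape by omitting chunksizes.
    import netCDF4

    with netCDF4.Dataset("chunk_default.nc", "w") as nc:
        nc.createDimension("time", None)   # unlimited record dimension
        nc.createDimension("lat", 180)
        nc.createDimension("lon", 360)
        ts = nc.createVariable("ts", "f4", ("time", "lat", "lon"),
                               zlib=True, complevel=1)
        print(ts.chunking())   # library-chosen chunk sizes (version dependent)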

@taylor13
Collaborator Author

It would be nice to check whether turning "shuffle" on degrades performance. Charlie Zender said that when it's on it sometimes significantly reduces storage (and it doesn't hurt performance). Before making shuffle the default, though, let's check.

@taylor13
Collaborator Author

If further guidance is needed, please ask.

@taylor13
Collaborator Author

Concerning the "too small" _ChunkSizes mentioned in #403 (comment): the default apparently is to set the chunksize for longitude to the length of the longitude dimension, the chunksize for latitude to the length of the latitude dimension, and chunksize=1 for all other dimensions ("time" in this case). In the example the longitude and latitude dimensions have length 1, so _ChunkSizes = 1, 1, 1 results. This is OK as the default behavior.

The only thing left to do is to try turning "shuffle" on as the default. We could then check how much better the compression is.

@mauzey1
Collaborator

mauzey1 commented Nov 2, 2018

I created NetCDF files and profiled their creation time and size with the following script.

https://github.com/mauzey1/cmor/blob/96786014b6a00076df5d2eb74d721736db52a294/Test/shuffle_test.py

With shuffle enabled, the files were ~17% smaller and took 23% less time to create.
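
The linked script drives CMOR itself; the same kind of comparison can also be sketched with plain netCDF4-python. Below is a stripped-down, stand-alone version (not the actual shuffle_test.py), using deflate level 1 in both runs and an arbitrary grid with random test data:

    # Sketch: compare write time and file size with shuffle off vs. on (complevel=1).
    # Random data compresses far worse than real model output; only the relative
    # comparison between the two runs is of interest here.
    import os
    import time
    import numpy
    import netCDF4

    data = numpy.random.rand(2, 26, 360, 720).astype("f4")  # (time, plev, lat, lon)

    for shuffle in (False, True):
        fname = "shuffle_%s.nc" % shuffle
        start = time.time()
        with netCDF4.Dataset(fname, "w") as nc:
            nc.createDimension("time", data.shape[0])
            nc.createDimension("plev", data.shape[1])
            nc.createDimension("lat", data.shape[2])
            nc.createDimension("lon", data.shape[3])
            var = nc.createVariable("ta", "f4", ("time", "plev", "lat", "lon"),
                                    zlib=True, complevel=1, shuffle=shuffle)
            var[:] = data
        elapsed = time.time() - start
        print(fname, round(os.path.getsize(fname) / 1e6, 1), "MB",
              round(elapsed, 2), "s")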

@durack1
Contributor

durack1 commented Nov 2, 2018

Thanks @mauzey1. With the demos, how big are the files? And did you use files that include a large amount of missing data (e.g., ocean grids)? It would be interesting to see how this works for ~10 GB files, and for files where ~30% of the points are missing.

@mauzey1
Collaborator

mauzey1 commented Nov 2, 2018

@durack1 The demo files ranged from 21 MB to 211 MB without shuffle and 17 MB to 175 MB with shuffle. They use grids with 360 latitude points, 720 longitude points, and 26 pressure levels, and have 1, 2, 5, and 10 time steps.

Are there any examples of cmor using ocean grids with a lot of missing data?

@doutriaux1
Collaborator

@mauzey1 this test, https://github.com/PCMDI/cmor/blob/master/Test/test_python_CMIP6_projections.py, writes data on an ocean grid; you could tweak it to make the grid really big with a bunch of missing values.

I think I have a branch/pr somewhere which finishes that script to actually write the data.

@mauzey1
Collaborator

mauzey1 commented Nov 13, 2018

I have done some tests using a mask from this file for ocean data:
https://vesg.ipsl.upmc.fr/thredds/catalog/esgcet/45/CMIP6.CMIP.IPSL.IPSL-CM6A-LR.1pctCO2.r1i1p1f1.Omon.dpco2.gn.v20180727.html?dataset=CMIP6.CMIP.IPSL.IPSL-CM6A-LR.1pctCO2.r1i1p1f1.Omon.dpco2.gn.v20180727.dpco2_Omon_IPSL-CM6A-LR_1pctCO2_r1i1p1f1_gn_185001-199912.nc

The mask removes about 46% of the data from a 332x362 grid. I made a script that creates data with the same longitude and latitude grid as the above file's mask but with varying numbers of time steps.

https://github.com/mauzey1/cmor/blob/shuffle_testing/Test/shuffle_test_missing_data.py

Here are some results from that script. Enabling shuffle causes an 11% decrease in file size and a 13% decrease in run time.

    grid dim         no shuffle (size, time)   shuffle (size, time)     size diff %   time diff %
    332x362x1000      219 MB,  27.179 s         194 MB,  23.322 s       -11.099       -13.776
    332x362x2000      436 MB,  53.107 s         388 MB,  45.851 s       -11.110       -13.646
    332x362x5000     1090 MB, 132.502 s         969 MB, 115.061 s       -11.117       -13.153
    332x362x10000    2179 MB, 267.695 s        1936 MB, 232.012 s       -11.120       -13.325

I did, however, find a case where enabling shuffle increases the file size. I used a file with a land mask that removes 62% of the data if you mask the points equal to 0.0.

https://esgf.nccs.nasa.gov/thredds/catalog/esgcet/47/CMIP6.CMIP.NASA-GISS.GISS-E2-1-G.piControl.r1i1p1f1.fx.sftlf.gn.v20180824.html?dataset=CMIP6.CMIP.NASA-GISS.GISS-E2-1-G.piControl.r1i1p1f1.fx.sftlf.gn.v20180824.sftlf_fx_piControl_GISS-E2-1-G_r1i1p1f1_gn.nc

https://github.com/mauzey1/cmor/blob/7c3f9dce2482528064dab6ee1983186437bf2687/Test/shuffle_test_missing_data2.py

    grid dim         no shuffle (size, time)   shuffle (size, time)     size diff %   time diff %
    90x144x1000        18 MB,  2.695 s           19 MB,  2.412 s         2.721        -10.160
    90x144x2000        36 MB,  5.138 s           37 MB,  4.625 s         2.725         -9.937
    90x144x5000        90 MB, 12.588 s           92 MB, 11.701 s         2.721         -6.988
    90x144x10000      179 MB, 24.945 s          184 MB, 23.566 s         2.721         -5.463
    90x144x20000      358 MB, 48.430 s          368 MB, 45.527 s         2.722         -5.942

Although there is a small decrease in run time, the files are 2.7% larger with shuffle than without it.

@taylor13
Collaborator Author

To be sure, in these tests was deflate_level = 1?

Also, is there a simple explanation for why the time to generate the files should be less when shuffle is turned on? I would have thought the computer had to work harder when shuffling was invoked.

@mauzey1
Collaborator

mauzey1 commented Nov 14, 2018

All of my tests use deflate_level 1.

I'm not sure why enabling shuffling leads to shorter run times. At first I thought it was caused by a reduction in disk writing due to smaller files, but the reduction in time also happens when the files get a little bigger. Could the shuffling cause the compression algorithm to run faster?

@taylor13
Collaborator Author

Sounds reasonable to me. Let's not pursue it further.

I suggest we leave in place the present guidance:

It is recommended that data be compressed by setting the “deflate_level” 
and “shuffle” to values that optimize the balance between reduction of file 
size and degradation in performance. Usually deflate_level=1 will suffice and 
“shuffle” can be turned on with little performance penalty. 

Anyone disagree?

thanks for your careful investigation.
