# NetCDF saving and compression

There are several options to decrease the file size of NetCDF files:
1) Enabling compression
<Br>Basically zipping the variables
2) Enabling compression with the shuffle setting
<Br> A compression setting that usually (but not always) decreases the file size
3) Setting the data type
<Br> Saving data with e.g. single precision instead of double precision, or int8 instead of uint64
4) Setting the precision
<Br> Saving data with e.g. 2 decimals of precision, instead of the full floating point precision
5) Setting the chunk size
<Br> Setting how multi-dimensional arrays are split up and ordered when saved, to reduce the variation between consecutive observations and increase compression efficiency
- ...

Note: options 1, 2 and 5 only affect the NetCDF file size (and the reading and writing speed); they are lossless. Options 3 and 4 also change the data (lossy compression).

These options are demonstrated below, using an ADV output file from the example data on the TUD-COASTAL GitHub.

This file can be downloaded from a public Google Drive: 
https://drive.google.com/drive/folders/1-o7MemJYKXbmKTThVin0ze6XrBeXhEP9?usp=sharing
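
The difference between lossless and lossy options mentioned in the note above can be illustrated without any NetCDF file. The sketch below uses only the Python standard library; the three sample values are made up for illustration. Deflate returns identical bytes, while a round trip through float32 changes values that need more than about 7 significant digits.

```python
import struct
import zlib

values = [1.23456789, 1000.5, -0.000123]          # made-up float64 samples
raw = struct.pack(f"<{len(values)}d", *values)    # the samples as float64 bytes

# Lossless: deflate followed by inflate gives back identical bytes
restored = zlib.decompress(zlib.compress(raw, 4))
lossless_ok = restored == raw

# Lossy: store as float32, read back, and compare with the original values
as_float32 = struct.unpack(f"<{len(values)}f", struct.pack(f"<{len(values)}f", *values))
errors = [abs(a - b) for a, b in zip(values, as_float32)]

print("deflate round trip identical:", lossless_ok)
print("float32 round trip errors   :", errors)
```

Values that happen to be exactly representable in float32 (such as 1000.5) survive unchanged; the others pick up a small error.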



In [3]:
import os
import xarray as xr


In [43]:
# Import the data
ncDir = r"O:\HybridDune experiment\data ADV, OBS\TUD-Coastal"

file_in = r"vec1_pilot.nc"
ds = xr.open_dataset(os.path.join(ncDir, file_in))


In [None]:
# Baseline: export uncompressed
nc_out = os.path.join(ncDir, "vec1_pilot_out0_uncompressed.nc") 
ds.to_netcdf(nc_out)
# Filesize: 531 MB

# drop some of the variables, to make selecting settings per variable easier
ds = ds.drop_vars(['v', 'w', 'anl2', 'a2', 'a3', 'cor2', 'cor3', 'snr2', 'snr3'])
nc_out = os.path.join(ncDir, "vec1_pilot_out0b_uncompressed, selection.nc") 
ds.to_netcdf(nc_out)   # 292 MB
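
The file sizes quoted in the comments can be read from disk instead of from a file browser. A minimal sketch, assuming 1 MB = 1e6 bytes (the figures in this notebook may have been read with a different convention); `file_size_mb` is a hypothetical helper name, and the demo uses a throwaway file rather than the ADV data.

```python
import os
import tempfile

def file_size_mb(path):
    """Return the size of the file at `path` in megabytes (1 MB = 1e6 bytes)."""
    return os.path.getsize(path) / 1e6

# Demo on a throwaway file of 2,500,000 bytes
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.bin")
    with open(demo_path, "wb") as f:
        f.write(b"\0" * 2_500_000)
    demo_size = file_size_mb(demo_path)

print(f"{demo_size:.1f} MB")
```

In this notebook it could be called as `file_size_mb(nc_out)` directly after each `ds.to_netcdf(...)` call.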

In [None]:
# 1) Enabling compression
comp = dict(zlib=True, complevel=4)  # apply deflate compression, level 4
encoding = {var: comp for var in list(ds.data_vars) + list(ds.coords)}  # dictionary with compression settings per variable and coordinate
ds.encoding = encoding  # keep the applied encoding on the Dataset object. Not necessary, but useful to keep track of the applied encoding and retrieve it later on

nc_out = os.path.join(ncDir, "vec1_pilot_out1_compressed.nc")
ds.to_netcdf(nc_out, encoding=ds.encoding)  # NB: the encoding must be passed when calling to_netcdf; ds.encoding is not applied automatically
# File size: 41 MB

# NB: compression should be applied to both the variables and the coordinates. If you forget the coordinates, part of the file remains uncompressed. (This is especially important
# for files with as many coordinate values as samples, e.g. 1,000,000 datetimes with pressure. In this case, with blocked data (t x N, each relatively short), it is less important.)

# Note: deflate compression level 4 is used here. You can increase the number for stronger compression, or lower it for faster (de)compression. The range is 0-9. (De)compression times
# rise much faster than the file size decreases...
# xarray also supports other compression algorithms. These are not advised for general use, as they reduce the compatibility with other software. MATLAB, for instance, supports
# (to my knowledge) only deflate compression.
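
The trade-off between compression level, size and time can be explored with Python's built-in `zlib` module, which implements the same deflate algorithm. This is only a sketch on a made-up signal (a slowly varying wave stored as int16), not the ADV data; the sizes and timings for a real NetCDF file will differ.

```python
import math
import time
import zlib
from array import array

# Made-up test signal: 200,000 int16 samples of a slowly varying wave
samples = array("h", (int(1000 * math.sin(i / 200)) for i in range(200_000)))
raw = samples.tobytes()

results = {}
for level in (1, 4, 9):
    start = time.perf_counter()
    packed = zlib.compress(raw, level)
    elapsed = time.perf_counter() - start
    assert zlib.decompress(packed) == raw   # deflate is lossless at every level
    results[level] = (len(packed), elapsed)
    print(f"level {level}: {len(packed)} bytes, {elapsed * 1000:.1f} ms")

print(f"uncompressed: {len(raw)} bytes")
```

Running the same loop on a representative slice of your own data gives a quick impression of whether a level above 4 is worth the extra time.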


In [None]:
# 2) Enabling compression, setting shuffle
# Shuffle determines how the data is rearranged before compression.
# Technically, it determines the byte order within a chunk: whether all bytes of each data point are stored before the
# next point is stored, or whether instead the first byte of every data point is stored, then the second byte of every data point, etc.
# In practice, it often improves compression efficiency, but sometimes it makes it worse. So it can be worth testing, or even setting it per variable instead of the same for all.
# Note: by default, shuffle is on when deflate compression is applied in xarray exports (when zlib=True, complevel=#). MATLAB's nccreate, for instance, has it off by default.

# Try shuffle on -----------------------------------------
comp = dict(zlib=True, complevel=4, shuffle=True)                       # apply deflate level 4 compression and enable shuffle
encoding = {var: comp for var in list(ds.data_vars) + list(ds.coords)}  # dictionary with compression settings
ds.encoding=encoding                                                    # save applied encoding to netcdf. Useful for retrieval

nc_out = os.path.join(ncDir, "vec1_pilot_out2 compressed_shuffle.nc") 
ds.to_netcdf(nc_out, encoding=ds.encoding)                              # NB: need to pass encoding when calling to_netcdf, ds.encoding is not applied automatically
# filesize = 41 MB

# Try shuffle off -----------------------------------------
comp = dict(zlib=True, complevel=4, shuffle=False)                      # apply deflate level 4 compression and disable shuffle
encoding = {var: comp for var in list(ds.data_vars) + list(ds.coords)}  # dictionary with compression settings
ds.encoding=encoding                                                    # save applied encoding to netcdf. Useful for retrieval

nc_out = os.path.join(ncDir, "vec1_pilot_out2 compressed_no_shuffle.nc") 
ds.to_netcdf(nc_out, encoding=ds.encoding) # NB: need to pass encoding when calling to_netcdf, ds.encoding is not applied automatically
# filesize = 26 MB

# Set shuffle per variable -----------------------------------------
# NB: Only set for the variables with large data size (t x N)

# First define a custom encoding dictionary for the (large) dataset variables, to select the shuffle setting per variable
encoding = {'p': {'shuffle': True},    # shuffle flag per variable, determined by trial and error to find which setting works best
            'u': {'shuffle': False},
            'anl1': {'shuffle': True},
            'a1': {'shuffle': True},
            'cor1': {'shuffle': True},
            'snr1': {'shuffle': False},
            'voltage': {'shuffle': False},
            'heading': {'shuffle': True},
            'burst': {'shuffle': True} }

# Then extend the dictionary: add deflate compression level 4 to all variables and coordinates in netCDF, without overwriting existing keys
compression = {var: {"zlib": True, "complevel": 4} for var in list(ds.data_vars) + list(ds.coords)}  # temporary dict, with only compression settings
for var, comp in compression.items():  # for each variable in the dataset, 
    if var in encoding:                # if the variable already has an encoding, update it with the compression settings
        encoding[var].update(comp)
    else:                              # if the variable does not have an encoding yet, add it 
        encoding[var] = comp
ds.encoding=encoding                   # save applied encoding to netcdf. Useful for retrieval

nc_out = os.path.join(ncDir, "vec1_pilot_out2 compressed_best_shuffle.nc") 
ds.to_netcdf(nc_out, encoding=ds.encoding) # NB: need to pass encoding when calling to_netcdf, ds.encoding is not applied automatically
# filesize = 23 MB
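
To make the shuffle step less abstract, the sketch below regroups the bytes of a made-up float64 series by hand and compresses both versions with `zlib`. It illustrates the idea only and is not the filter implementation used by the NetCDF/HDF5 libraries; `byte_shuffle` and `byte_unshuffle` are hypothetical helper names.

```python
import math
import struct
import zlib

def byte_shuffle(raw, itemsize):
    """Regroup bytes: the first byte of every value, then every second byte, and so on."""
    return b"".join(raw[k::itemsize] for k in range(itemsize))

def byte_unshuffle(shuffled, itemsize):
    """Invert byte_shuffle, restoring the original byte order."""
    n = len(shuffled) // itemsize
    out = bytearray(len(shuffled))
    for k in range(itemsize):
        out[k::itemsize] = shuffled[k * n:(k + 1) * n]
    return bytes(out)

# Made-up series: values around 10 with three decimals, stored as float64 (8 bytes each)
values = [10.0 + 0.001 * round(1000 * math.sin(i / 300)) for i in range(50_000)]
raw = struct.pack(f"<{len(values)}d", *values)
shuffled = byte_shuffle(raw, 8)

print("deflate only     :", len(zlib.compress(raw, 4)), "bytes")
print("shuffle + deflate:", len(zlib.compress(shuffled, 4)), "bytes")
```

Which of the two is smaller depends on the data, which is why the per-variable trial and error above can pay off.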


In [None]:
# 3) Setting data type
# NB: setting the data type CHANGES the data; it REMOVES information/precision.
# Changing the data from double to single precision (float64 to float32) removes some precision and the ability to represent very large values. Make sure
# you understand what you are doing! For integer data types, this is even more important: by definition, they cannot store fractional values, NaN or inf. And they
# have a strictly limited range (e.g. int8 can only store values from -128 to 127).

# Define a dictionary where the data type is set to float32, or an appropriate integer type.
encoding = {'p': {'dtype': 'float32'}, 
            'u': {'dtype': 'float32'},
            'anl1': {'dtype': 'uint16'}, # data consists of integers, 0-65535, so uint16
            'a1': {'dtype': 'float32'},
            'cor1': {'dtype': 'int8'},   # correlation, 0-100% as integers, so int8
            'snr1': {'dtype': 'float32'},
            'voltage': {'dtype': 'float32'},
            'heading': {'dtype': 'float32'},
            'burst': {'dtype': 'float32'} }

# Then extend the dictionary: add deflate compression level 4 to all variables and coordinates in netCDF, without overwriting existing keys. Use no shuffle
compression = {var: {"zlib": True, "complevel": 4, "shuffle": False} for var in list(ds.data_vars) + list(ds.coords)}  # temporary dict, with only compression settings
for var, comp in compression.items():  # for each variable in the dataset, 
    if var in encoding:                # if the variable already has an encoding, update it with the compression settings
        encoding[var].update(comp)
    else:                              # if the variable does not have an encoding yet, add it 
        encoding[var] = comp
ds.encoding=encoding                   # save applied encoding to netcdf. Useful for retrieval

nc_out = os.path.join(ncDir, "vec1_pilot_out3 compressed_single_precision_no_shuffle.nc") 
ds.to_netcdf(nc_out, encoding=ds.encoding) # NB: need to pass encoding when calling to_netcdf, ds.encoding is not applied automatically
# Size 20 MB
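
Before choosing an integer type, it helps to check the range explicitly. A minimal sketch with two hypothetical helpers, `integer_range` and `fits`; in this notebook the minimum and maximum would come from the data itself, e.g. `float(ds['cor1'].min())` and `float(ds['cor1'].max())`.

```python
def integer_range(bits, signed=True):
    """Return (min, max) representable by an integer type of the given width."""
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1

def fits(vmin, vmax, bits, signed=True):
    """True if every value in [vmin, vmax] fits in the integer type."""
    lo, hi = integer_range(bits, signed)
    return lo <= vmin and vmax <= hi

print(integer_range(8))                   # int8 -> (-128, 127)
print(fits(0, 100, 8))                    # correlation 0-100 % in int8 -> True
print(fits(0, 65535, 16, signed=False))   # counts 0-65535 in uint16 -> True
print(fits(0, 65535, 16))                 # the same counts in int16 -> False
```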



In [None]:
# 4) Setting precision of variables. By using the precision you need and not more, you can save storage space. But make sure you understand what you are doing! You need to know both
# the precision you need AND the range of values in the data.

# Precision works by dividing the data by a scale factor (e.g. scale_factor=0.01 for 2 decimals, which is the same as multiplying by 100), rounding it to an integer, and saving it as
# an integer data type. Make sure your integer data type can handle the range of values in your data. For instance, uint8 can only store values from 0 to 255. So if you are saving
# 2 decimals, it can handle values between 0 and 2.55 only.

# Note: this setting can be especially useful to match the precision of original data in raw text files. With e.g. 2 decimals in the original text file, you can use 2 decimals for the
# netcdf to effectively compress losslessly. BUT THIS ONLY WORKS IF you make sure you handle the range, missing data etc. correctly.

# Define a custom dictionary to set the scale factor and data type per variable.
# Note: I have followed the original data precision to avoid loss of information. Alternatively, you could reduce the precision for some variables, according to your needs.
# Note 2: You don't need to do this for all the variables. You can also do it for specific ones that are very large, or suitable for lower precision. Here I defined the precision
# for all variables with t,N data (they are the largest) and skipped the others.
encoding = {'p': { 'scale_factor': 10.0, 'dtype': 'uint16', '_FillValue': 0, 'shuffle': False},      # Original data in multiples of 10, so scale factor 10.0. NB: scale factor 10.0, not 10, to make unpacked p float (able to contain NaN)
            'u': { 'scale_factor': 0.001, 'dtype': 'int16', '_FillValue': -9999, 'shuffle': False},  # Three decimals originally, so scale factor 0.001. Max value is 7 m/s, which with 3 decimals is 7000, so the int16 range of ± 32767 is sufficient
            'anl1': { 'dtype': 'uint16'},     # default flag is shuffle true, can be skipped         # Analog connection, OBS. Original data in counts (int). Possible range 0-65535, so exactly uint16
            'a1': { 'dtype': 'int16', '_FillValue': -9999},                                          # Amplitude beam 1. Original data in ints, seemingly 0-~160. Use int16 to be sure
            'cor1': { 'dtype': 'int8', '_FillValue': -99},                                           # Correlation beam 1. Original data in int, range 0-100 %. So int8 is sufficient
            'snr1': { 'scale_factor': 0.1, 'dtype': 'int16', '_FillValue': -9999, 'shuffle': False}, # SNR beam 1. Original 1 decimal, range 0-65. So int16 is sufficient
            'voltage': { 'scale_factor': 0.1, 'dtype': 'int8', '_FillValue': -99, 'shuffle': False}, # Original 1 decimal, in this case 0-12 V. So int8 is sufficient
            'heading': { 'scale_factor': 0.1, 'dtype': 'int16', '_FillValue': -9999},                # Original 1 decimal, range 0-360. So int16 is sufficient
            'burst': { 'scale_factor': 1.0, 'dtype': 'int16', '_FillValue': -9999} }                 # Burst: original int. int16 chosen, to be able to handle long deployments/short bursts

# Then extend the dictionary: add deflate compression level 4 to all variables and coordinates in netCDF, without overwriting existing keys. 
compression = {var: {"zlib": True, "complevel": 4} for var in list(ds.data_vars) + list(ds.coords)}  # temporary dict, with only compression settings
for var, comp in compression.items():  # for each variable in the dataset, 
    if var in encoding:                # if the variable already has an encoding, update it with the compression settings
        encoding[var].update(comp)
    else:                              # if the variable does not have an encoding yet, add it 
        encoding[var] = comp
ds.encoding=encoding                   # save applied encoding to netcdf. Useful for retrieval

nc_out = os.path.join(ncDir, "vec1_pilot_out4 compressed_set_precision_and_data_type.nc") 
ds.to_netcdf(nc_out, encoding=ds.encoding) # NB: need to pass encoding when calling to_netcdf, ds.encoding is not applied automatically
# 17 MB
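
What `scale_factor` and `_FillValue` do to a single value can be sketched in plain Python. This follows the packing rule of the CF conventions (stored = round((value - add_offset) / scale_factor), unpacked = stored * scale_factor + add_offset); it is an illustration under that assumption, not xarray's internal code, and `pack`/`unpack` are hypothetical helper names.

```python
import math

def pack(value, scale_factor, fill_value, add_offset=0.0):
    """Convert a physical value to the stored integer; NaN becomes fill_value."""
    if math.isnan(value):
        return fill_value
    return round((value - add_offset) / scale_factor)

def unpack(stored, scale_factor, fill_value, add_offset=0.0):
    """Convert a stored integer back to a physical value; fill_value becomes NaN."""
    if stored == fill_value:
        return math.nan
    return stored * scale_factor + add_offset

# Voltage with 1 decimal: scale_factor 0.1, stored as int8 with fill value -99
stored = [pack(v, 0.1, -99) for v in (11.9, 12.0, math.nan)]
print(stored)
print([unpack(s, 0.1, -99) for s in stored])

# A badly chosen fill value collides with real data: 9.9 V is stored as 99,
# which is then read back as NaN
print(unpack(pack(9.9, 0.1, 99), 0.1, 99))
```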



Some remarks on setting precision
- Packing the data only works when storing variables as an integer type. If data is stored as floating point, all decimals are still stored, and nothing changes in the precision or file size.
- _FillValue: any NaN in the data is stored as _FillValue. Without the _FillValue setting, the stored data cannot contain NaNs. Make sure your data does not contain values equal to the fill value.
    - If the voltage would e.g. have scale_factor=0.1 and _FillValue=99, any data point with value 9.9 (so 99 after applying the scale factor) would be read back as NaN! Note: with the settings applied for p, with _FillValue=0, any pressure that was originally zero is replaced by NaN.
    - Without _FillValue, NaNs cannot be stored. This is the case for anl1. THIS IS NOT ADVISED! Python actually gives a warning for this. If possible, use the min or max value of the integer type for NaNs, or consider using a 'higher' type, int32 in this case. This also makes subsequent processing steps safer: NaNs may be needed there to filter data (and people may forget to convert the data to a type able to contain NaNs).
- If data is not centered around zero (e.g. pressure), you can use add_offset to center it around zero, so that a smaller integer data type can be used. See the links below.
- Note that scale_factor and add_offset must be of the same type, and that they determine the type of the unpacked data.
    - If you set scale_factor=1 (don't store decimals), the scale factor is an int. So the data will be unpacked to int. This means it cannot contain NaNs, and observations stored as _FillValue are converted to zero.
    - If you set the scale factor as a float, e.g. 1.0, the data is unpacked as floating point (even though it is stored as int). So anything stored as _FillValue is unpacked as NaN.

For more info, see https://docs.xarray.dev/en/stable/user-guide/io.html (especially the text under "Writing encoded data") and https://cfconventions.org/Data/cf-conventions/cf-conventions-1.11/cf-conventions.html#packed-data (the CF conventions on packed data). The second link also gives some additional options for compression, such as for sparse data. However, these are more difficult to apply and make reading the data harder and software-specific, so I suggest not using them.
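
The effect of add_offset on the required integer type can be checked with a few lines of arithmetic. The sketch below uses a hypothetical pressure record (100000 to 102000 Pa at a resolution of 10 Pa, not taken from the ADV file) and a hypothetical helper `stored_range`.

```python
def stored_range(vmin, vmax, scale_factor, add_offset=0.0):
    """Integer range needed to store values in [vmin, vmax] with the given packing."""
    return (round((vmin - add_offset) / scale_factor),
            round((vmax - add_offset) / scale_factor))

# Hypothetical pressure record: 100000 to 102000 Pa, resolution 10 Pa
vmin, vmax, scale = 100_000.0, 102_000.0, 10.0

without_offset = stored_range(vmin, vmax, scale)
offset = (vmin + vmax) / 2                       # centre the data on zero
with_offset = stored_range(vmin, vmax, scale, add_offset=offset)

print(without_offset)   # (10000, 10200): needs a 16-bit integer
print(with_offset)      # (-100, 100): fits in int8, leaving -128 free as fill value
```

In the encoding dictionary this would correspond to adding `'add_offset': 101000.0` next to `'scale_factor': 10.0`, with both given as floats.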

In [None]:
# 5) Setting chunk size, dim order
# Data is not compressed for an entire dataset at once. It is compressed per chunk. You can see this as writing a tiny zip file for each chunk.
# If you have a t * x dataset of 100 timesteps (rows) * 10 locations (columns), you could pack everything together (1 chunk), split the file per row (100 chunks of 10 values), per column
# (10 chunks of 100 values), or any other combination (e.g. chunks of 2x5 obs). And if a chunk contains multiple dimensions, it matters in which order the dimensions are stored in the
# chunk.

# The chunk size not only affects the compression efficiency (file size), but also the reading and writing speed. This is especially important if you only need a subset of the
# data at once. For example, suppose that for our 100*10 t * x dataset you only need the data for the first timestep, so 10 obs. If the data was chunked per timestep, you only need to
# load one chunk with 10 obs. If it was chunked per location, you need to load 10 chunks of 100 timesteps each, so 1000 obs instead of 10. This is especially important for very large
# datasets that do not fit into memory and are by definition read in parts.

# For demonstration, we use several chunk settings. We will combine them with deflate compression, no shuffle, and no special data types or precision settings.

# First define a custom encoding dictionary for the (large) dataset variables, to set the chunk size per variable
for i in [0, 1]:  # try different chunk sizes and orders; use range(6) to run all six cases below
    if i == 0:
        chunksize_1 = 1     # 1 t per chunk
        chunksize_2 = 9600  # 9600 N per chunk (all)
        filename_out = "vec1_pilot_out5a compressed_chunk_1_9600.nc"  # 27 MB
    elif i == 1:
        chunksize_1 = 1     # 1 t per chunk
        chunksize_2 = 300   # 300 N per chunk
        filename_out = "vec1_pilot_out5b compressed_chunk_1_300.nc"   # 45 MB, so smaller than t-chunks of similar size (5c)
    elif i == 2:
        chunksize_1 = 354  # t: all
        chunksize_2 = 1    # N
        filename_out = "vec1_pilot_out5c compressed_chunk_354_1.nc"   # 57 MB
    elif i == 3:
        chunksize_1 = 59   # t  
        chunksize_2 = 1600 # N
        filename_out = "vec1_pilot_out5d compressed_chunk_59_1600.nc" # 27 MB, so dataset with t,N order is smaller than N,t order (5e)
    elif i == 4:
        # Switch the order of the dimensions t and N around, to show effect on saving
        ds=ds.transpose("N", "t") # syntax: order dimensions in ds: N is the first dim, t the second
        chunksize_1 = 1600 # N
        chunksize_2 = 59   # t
        filename_out = "vec1_pilot_out5e compressed_chunk_59_1600_reordered.nc" # 44 MB
    elif i == 5: 
        ds=ds.transpose("t", "N") # original order of dimensions: t, then N
        chunksize_1 = 354  # t: all
        chunksize_2 = 9600 # N: all
        filename_out = "vec1_pilot_out5f compressed_chunk_354_9600.nc"  # 26 MB

    encoding = {'p': {'chunksizes': (chunksize_1, chunksize_2)}, 
                'u': {'chunksizes': (chunksize_1, chunksize_2)},
                'anl1': {'chunksizes': (chunksize_1, chunksize_2)},
                'a1': {'chunksizes': (chunksize_1, chunksize_2)},
                'cor1': {'chunksizes': (chunksize_1, chunksize_2)},
                'snr1': {'chunksizes': (chunksize_1, chunksize_2)},
                'voltage': {'chunksizes': (chunksize_1, chunksize_2)},
                'heading': {'chunksizes': (chunksize_1, chunksize_2)},
                'burst': {'chunksizes': (chunksize_1, chunksize_2)} }
    
    # Then extend the dictionary: add deflate compression level 4 to all variables and coordinates in netCDF, without overwriting existing keys
    compression = {var: {"zlib": True, "complevel": 4, "shuffle": False} for var in list(ds.data_vars) + list(ds.coords)}  # temporary dict, with only compression settings
    for var, comp in compression.items():  # for each variable in the dataset, 
        if var in encoding:                # if the variable already has an encoding, update it with the compression settings
            encoding[var].update(comp)
        else:                              # if the variable does not have an encoding yet, add it 
            encoding[var] = comp
    ds.encoding=encoding                   # save applied encoding to netcdf. Useful for retrieval

    nc_out = os.path.join(ncDir, filename_out) 
    ds.to_netcdf(nc_out, encoding=ds.encoding) # NB: need to pass encoding when calling to_netcdf, ds.encoding is not applied automatically

# Note: the data varies much more slowly over N (1/16 of a second apart) than over t (10 min apart). Therefore:
# - For storage efficiency, single-dimension chunks should run along the dimension with limited variation (N). See 5b vs 5c. Note: for data usage instead of storage, it depends on your needs...
# - For multi-dimension chunks, the order matters. For xarray, compression is better if the last chunk dimension compresses easily (5d vs 5e).
# Also: larger chunks allow for better compression (5a vs 5b, with 9600 vs 300 N per chunk). Of course, they slow down reading small parts of the data...

# Overall, the important question is which dimension(s) a chunk should contain. How large the chunk is exactly is secondary.
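
The reading cost in the 100 x 10 example at the top of this cell can be counted with a small function. This is a sketch under the assumption that a chunk is always read as a whole; `chunks_touched` is a hypothetical helper, not part of xarray or netCDF4.

```python
def chunks_touched(shape, chunks, selection):
    """Count the chunks, and the values inside them, that must be read for `selection`.

    shape, chunks: sizes per dimension; selection: (start, stop) per dimension.
    """
    n_chunks = 1
    n_values = 1
    for size, chunk, (start, stop) in zip(shape, chunks, selection):
        first = start // chunk
        last = (stop - 1) // chunk
        n_chunks *= last - first + 1
        # the last chunk along a dimension may be only partly filled
        n_values *= min((last + 1) * chunk, size) - first * chunk
    return n_chunks, n_values

shape = (100, 10)                      # 100 timesteps x 10 locations
first_timestep = ((0, 1), (0, 10))     # all locations at t = 0

print(chunks_touched(shape, (1, 10), first_timestep))    # chunked per timestep -> (1, 10)
print(chunks_touched(shape, (100, 1), first_timestep))   # chunked per location -> (10, 1000)
```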



# SUMMARY
There are many ways to compress NetCDF files. Remember, the main goal is retaining all the necessary data. A small file size is secondary. You don't want to introduce clipping, bad precision, loss of infs/NaNs, or other problems with your compression, nor excessively slow exporting/importing (or too much coding effort).

At a minimum, apply compression to the variables *and coordinates* in your file. A quick second step is checking whether a general shuffle=on or shuffle=off is better.

You can play with data types or precision. Make sure you know what you are doing. Concentrate on the variables that require a lot of space; you don't need to do this for all variables.

Changing chunking settings is especially useful for large files. As a quick first step, you can check if the last dimension in the dataset is the dimension with slow changes. And don't forget about data reading: you don't want to read a full file to access a small part of it.
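
The pattern used throughout this notebook (default compression for every variable and coordinate, plus a few per-variable settings) can be wrapped in one small function. A minimal sketch; `build_encoding` is a hypothetical helper, and the names in the demo are only a subset chosen for illustration. Unlike the loops above, per-variable settings here take precedence over the defaults if a key occurs in both.

```python
def build_encoding(names, default, overrides=None):
    """Return {name: settings}: `default` for every name, updated with per-variable overrides."""
    overrides = overrides or {}
    encoding = {}
    for name in names:
        settings = dict(default)                  # copy, so variables do not share one dict
        settings.update(overrides.get(name, {}))  # per-variable settings win over the defaults
        encoding[name] = settings
    return encoding

# Illustrative subset of variable and coordinate names
names = ["p", "u", "snr1", "t", "N"]
encoding = build_encoding(
    names,
    default={"zlib": True, "complevel": 4},
    overrides={"u": {"shuffle": False}, "p": {"dtype": "float32"}},
)
print(encoding["u"])
print(encoding["t"])
```

With a dataset at hand, `names` would be `list(ds.data_vars) + list(ds.coords)`, and the result is passed as `ds.to_netcdf(nc_out, encoding=encoding)`.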


