# Illustrates the usage of chunkindex

In [1]:
# Append the root of the project to the module search path (sys.path)
import sys
sys.path.append('..')

## Dataset creation

Create a dataset in netCDF-4 format. We will use this dataset to illustrate the usage of chunkindex.

Dataset info: 
- File location: data/tmp/ramp.nc
- Variables: 'x'
- Data type: int32
- Dimensions: (600, 600)
- Chunk shape: (300, 300)
- Storage: compression: zlib (level=1), shuffle filter: disabled ('shuffle': False in the encoding below)
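
As a quick sanity check on the figures above, the following sketch works out how many chunks the dataset is split into and how large each chunk is before compression. The variable names are illustrative only and are not part of chunkindex.

```python
import math

shape = (600, 600)        # dataset dimensions
chunk_shape = (300, 300)  # chunk shape
itemsize = 4              # bytes per int32 sample

# Number of chunks along each dimension, and in total
chunks_per_dim = tuple(math.ceil(s / c) for s, c in zip(shape, chunk_shape))
n_chunks = math.prod(chunks_per_dim)

# Uncompressed sizes in bytes
chunk_nbytes = math.prod(chunk_shape) * itemsize
total_nbytes = math.prod(shape) * itemsize

print(chunks_per_dim)  # (2, 2)
print(n_chunks)        # 4
print(chunk_nbytes)    # 360000
print(total_nbytes)    # 1440000
```

Each of the 4 chunks is compressed independently, so without an index, reading a single value means decompressing up to 360000 bytes of chunk data.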
  

In [2]:
from pathlib import Path
import xarray as xr
import numpy as np

# Define the dataset path
dataset_dir = Path('data') / 'tmp'
dataset_dir.mkdir(parents=True, exist_ok=True)
dataset_path = dataset_dir / 'ramp.nc'

# Define the datatype
dtype = 'int32'
sizeof_dtype = np.dtype(dtype).itemsize

# Define the number of samples in the dataset
shape = (600, 600)
n = np.prod(shape)
# Chunk shape: half of the dataset shape along each dimension
chunk_size = tuple(s // 2 for s in shape)

# Create the dataset : a ramp with n samples
x = np.arange(n) #+ 10 * np.random.rand(n)
x = x.reshape(shape).astype(dtype)

# Create a data array with xarray
x_xr = xr.DataArray(x)
# Create a dataset
ds = xr.Dataset({'x': x_xr})

# Define the encoding options
encoding = {
    'x': {
        'dtype': dtype,
        'zlib': True,
        'complevel': 1,
        'shuffle': False,
        'chunksizes': chunk_size
    }
}
# Write the dataset to a netcdf file
ds.to_netcdf(dataset_path, encoding=encoding)
ds

## Create an index with chunkindex

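chunkindex creates a zran index that provides decompression starting points (access points) within each compressed chunk, so that a read does not have to decompress a chunk from its beginning.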

File location: data/tmp/ramp_indexchunk.nc
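
To build intuition for what a decompression starting point provides, here is a minimal sketch that uses only Python's standard zlib module. It is an analogy, not chunkindex's implementation: a zran index (as in the zran.c example shipped with zlib) stores, for each access point, the offsets in the compressed and uncompressed streams together with the 32 KiB window of preceding output, whereas this sketch simply keeps in-memory copies of a decompressor object. The names build_checkpoints, read_range, span and step are hypothetical.

```python
import zlib

# Stand-in for one uncompressed chunk: a ramp of 90000 little-endian int32 values
raw = b''.join(i.to_bytes(4, 'little') for i in range(300 * 300))
compressed = zlib.compress(raw, 1)


def build_checkpoints(compressed, span=64 * 1024, step=4096):
    """Record (uncompressed_offset, compressed_offset, decompressor_copy)
    about every `span` bytes of output."""
    d = zlib.decompressobj()
    checkpoints = [(0, 0, d.copy())]
    out_pos = 0
    for in_pos in range(0, len(compressed), step):
        out_pos += len(d.decompress(compressed[in_pos:in_pos + step]))
        if not d.eof and out_pos - checkpoints[-1][0] >= span:
            checkpoints.append((out_pos, in_pos + step, d.copy()))
    return checkpoints


def read_range(compressed, checkpoints, start, stop):
    """Return uncompressed bytes [start, stop), resuming from the last
    checkpoint located at or before `start`."""
    out_pos, in_pos, state = [c for c in checkpoints if c[0] <= start][-1]
    d = state.copy()  # keep the stored checkpoint reusable
    # Decompress only up to `stop`, then drop the bytes before `start`
    data = d.decompress(compressed[in_pos:], stop - out_pos)
    return data[start - out_pos:]


checkpoints = build_checkpoints(compressed)
part = read_range(compressed, checkpoints, 200000, 200040)
values = [int.from_bytes(part[k:k + 4], 'little') for k in range(0, len(part), 4)]
print(values)  # [50000, 50001, ..., 50009]
```

The amount of data to decompress is then bounded by the spacing between checkpoints rather than by the chunk size, which is the trade-off an index makes: extra storage in exchange for cheaper partial reads.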

In [3]:
import chunkindex
import contextlib
import os
import netCDF4

# Define the index path
index_path = dataset_path.with_name(dataset_path.stem + '_indexchunk.nc')
# Remove it if it already exists
with contextlib.suppress(FileNotFoundError):
    os.remove(index_path)

# Create the zran index for all variables and chunks of the dataset and write it to the netcdf4 file at index_path
chunkindex.create_index(index_path, dataset_path)

# Display the resulting index for one chunk
index_x00 = xr.open_dataset(index_path, group='x/0.0')
index_x00

## Partial access to the compressed data

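Data are accessed through the chunkindex.read_slice() function, which takes the open dataset file, the open index file, the variable name and a tuple of slices.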
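
The cell below requests selections that fall inside one chunk or straddle several. As a rough aid for reading its output, this sketch computes which chunks a selection overlaps and which ramp value to expect at a given position (the dataset is np.arange(n) reshaped to 600 columns, so the value at row i, column j is i * 600 + j). The helpers chunks_touched and ramp_value are hypothetical and not part of chunkindex.

```python
from itertools import product


def chunks_touched(selection, chunk_shape):
    """Return the chunk grid coordinates overlapped by a tuple of slices
    (slices are assumed to have explicit, non-negative start and stop)."""
    ranges = []
    for sl, c in zip(selection, chunk_shape):
        first = sl.start // c
        last = (sl.stop - 1) // c
        ranges.append(range(first, last + 1))
    return list(product(*ranges))


def ramp_value(i, j, ncols=600):
    """Value stored at row i, column j of the ramp dataset."""
    return i * ncols + j


chunk_shape = (300, 300)
print(chunks_touched((slice(0, 1), slice(0, 8)), chunk_shape))          # [(0, 0)]
print(chunks_touched((slice(598, 600), slice(590, 600)), chunk_shape))  # [(1, 1)]
print(chunks_touched((slice(299, 301), slice(290, 350)), chunk_shape))  # [(0, 0), (0, 1), (1, 0), (1, 1)]
print(ramp_value(299, 290))  # 179690
```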

In [4]:
with open(dataset_path, 'rb') as ds:
    with open(index_path, mode='rb') as index:
        # A few values at the very beginning of the first chunk
        print("\nAccess part of a chunk, at the beginning of the first chunk:")
        print(chunkindex.read_slice(ds, index, 'x', (slice(0, 1), slice(0, 8))))

        # Values spread over two rows of the first chunk
        print("\nAccess data at another location in the first chunk:")
        print(chunkindex.read_slice(ds, index, 'x', (slice(0, 2), slice(10, 20))))

        # The last values of the last chunk
        print("\nAccess data in the last chunk:")
        print(chunkindex.read_slice(ds, index, 'x', (slice(598, 600), slice(590, 600))))

        # A selection that straddles the boundaries of all 4 chunks
        print("\nAccess data at the intersection of the 4 chunks:")
        print(chunkindex.read_slice(ds, index, 'x', (slice(299, 301), slice(290, 350))))


Access part of a chunk, at the beginning of the first chunk:
[[0 1 2 3 4 5 6 7]]

Access data at another location in the first chunk:
[[ 10  11  12  13  14  15  16  17  18  19]
 [610 611 612 613 614 615 616 617 618 619]]

Access data in the last chunk:
[[359390 359391 359392 359393 359394 359395 359396 359397 359398 359399]
 [359990 359991 359992 359993 359994 359995 359996 359997 359998 359999]]

Access data at the intersection of the 4 chunks:
[[179690 179691 179692 179693 179694 179695 179696 179697 179698 179699
  179700 179701 179702 179703 179704 179705 179706 179707 179708 179709
  179710 179711 179712 179713 179714 179715 179716 179717 179718 179719
  179720 179721 179722 179723 179724 179725 179726 179727 179728 179729
  179730 179731 179732 179733 179734 179735 179736 179737 179738 179739
  179740 179741 179742 179743 179744 179745 179746 179747 179748 179749]
 [180290 180291 180292 180293 180294 180295 180296 180297 180298 180299
  180300 180301 180302 180303 18