# Illustrates the usage of chunkindex

In [1]:
# Append the root of the project to PYTHONPATH
import sys
sys.path.append('..')

## Dataset creation

Create a dataset in netCDF-4 format. We will use this dataset to illustrate the usage of chunkindex.

Dataset info: 
- File location: data/tmp/ramp.nc
- Variables: 'x'
- Data type: int32
- Dimensions: (600, 600)
- Chunk shape: (300, 300)
- Storage: no shuffle filter, compression: zlib (level=1)
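
As a quick sanity check on these figures, the chunk grid and the uncompressed size of each chunk follow directly from the array shape, the chunk shape and the data type. The sketch below uses only the standard library; the helper name `chunk_layout` is illustrative and is not part of chunkindex.

```python
import math

def chunk_layout(shape, chunk_shape, itemsize):
    """Return (chunks per dimension, total chunks, uncompressed bytes per chunk)."""
    grid = tuple(math.ceil(s / c) for s, c in zip(shape, chunk_shape))
    n_chunks = math.prod(grid)
    chunk_nbytes = math.prod(chunk_shape) * itemsize
    return grid, n_chunks, chunk_nbytes

# int32 values are 4 bytes wide
grid, n_chunks, chunk_nbytes = chunk_layout((600, 600), (300, 300), 4)
print(grid, n_chunks, chunk_nbytes)  # (2, 2) 4 360000
```

Four chunks of 360000 bytes each add up to 1440000 bytes of uncompressed data.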
  

In [2]:
from pathlib import Path
import xarray as xr
import numpy as np

# Define the dataset path
dataset_dir = Path('data') / 'tmp'
dataset_dir.mkdir(parents=True, exist_ok=True)
dataset_path = dataset_dir / 'ramp.nc'

# Define the datatype
dtype = 'int32'
sizeof_dtype = np.dtype(dtype).itemsize

# Define the number of samples in the dataset
shape = (600, 600)
n = np.prod(shape)
# Use chunks half the size of the array along each dimension
chunk_size = tuple(s // 2 for s in shape)

# Create the dataset : a ramp with n samples
x = np.arange(n) #+ 10 * np.random.rand(n)
x = x.reshape(shape).astype(dtype)

# Create a data array with xarray
x_xr = xr.DataArray(x)
# Create a dataset
ds = xr.Dataset({'x': x_xr})

# Define the encoding options
encoding = {
    'x': {
        'dtype': dtype,
        'zlib': True,
        'complevel': 1,
        'shuffle': False,
        'chunksizes': chunk_size
    }
}
# Write the dataset to a netcdf file
ds.to_netcdf(dataset_path, encoding=encoding)
ds

## Create an index with kerchunk

Kerchunk creates an index file that provides the location of the chunks in the netCDF-4/HDF5 file.

File location: data/tmp/ramp_kerchunk.json

In [3]:
import json
import kerchunk.hdf

# Define the path of the json file in which the information
# about the chunks and the dataset structure will be stored.
json_path = dataset_path.parent.joinpath(str(dataset_path.stem) + '_kerchunk.json')

# Create the data structure
h5chunks = kerchunk.hdf.SingleHdf5ToZarr(str(dataset_path)).translate()

# Write it to a json file
with open(json_path, 'w') as f_out:
    f_out.write(json.dumps(h5chunks))

h5chunks

{'version': 1,
 'refs': {'.zgroup': '{"zarr_format":2}',
  'x/.zarray': '{"chunks":[300,300],"compressor":{"id":"zlib","level":1},"dtype":"<i4","fill_value":null,"filters":null,"order":"C","shape":[600,600],"zarr_format":2}',
  'x/.zattrs': '{"_ARRAY_DIMENSIONS":["dim_0","dim_1"]}',
  'x/0.0': ['data/tmp/ramp.nc', 8760, 124951],
  'x/0.1': ['data/tmp/ramp.nc', 133711, 124920],
  'x/1.0': ['data/tmp/ramp.nc', 258631, 124942],
  'x/1.1': ['data/tmp/ramp.nc', 383573, 124938]}}
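
Each chunk entry in `refs` is a `[path, offset, length]` triple: the compressed chunk occupies `length` bytes starting at byte `offset` of the file. Since `filters` is null here, reading that byte range and inflating it with zlib yields the raw chunk. The sketch below shows the principle on a small synthetic file rather than on ramp.nc; the helper `read_reference` and the file layout are assumptions made for illustration.

```python
import os
import tempfile
import zlib

def read_reference(ref):
    """Read and inflate the byte range described by a [path, offset, length] entry."""
    path, offset, length = ref
    with open(path, 'rb') as f:
        f.seek(offset)
        compressed = f.read(length)
    return zlib.decompress(compressed)

# Build a synthetic file: 16 header bytes followed by one zlib-compressed payload
payload = bytes(range(256)) * 4
compressed = zlib.compress(payload, 1)
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b'\x00' * 16 + compressed)
    tmp_path = f.name

# A reference in the same form as the kerchunk entries
ref = [tmp_path, 16, len(compressed)]
chunk = read_reference(ref)
os.remove(tmp_path)
print(chunk == payload)  # True
```

In the kerchunk output, each offset equals the previous offset plus the previous length (8760 + 124951 = 133711, and so on), so the four chunks are stored back to back in the file.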

## Create an index with chunkindex

chunkindex creates a zran index that provides decompression starting points within the chunks.

File location: data/tmp/ramp_indexchunk.nc
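
The idea behind such an index can be sketched with the standard library alone. A zlib stream normally has to be inflated from its first byte; saving the decompressor state at regular intervals gives later reads a place to resume from. The code below keeps these snapshots in memory with `zlib.decompressobj().copy()`. It illustrates the principle only and is not chunkindex's implementation or its on-disk format; the names `build_checkpoints` and `read_range` are hypothetical.

```python
import zlib

def build_checkpoints(compressed, span=64 * 1024, step=4096):
    """Snapshot the decompressor about every `span` uncompressed bytes.

    Returns a list of (compressed offset, uncompressed offset, state) tuples.
    """
    d = zlib.decompressobj()
    checkpoints = [(0, 0, d.copy())]
    out_pos = 0
    for in_pos in range(0, len(compressed), step):
        piece = compressed[in_pos:in_pos + step]
        out_pos += len(d.decompress(piece))
        if not d.eof and out_pos - checkpoints[-1][1] >= span:
            checkpoints.append((in_pos + len(piece), out_pos, d.copy()))
    return checkpoints

def read_range(compressed, checkpoints, start, length):
    """Return `length` uncompressed bytes from `start`, resuming at a checkpoint."""
    # Last checkpoint located at or before the requested position
    in_pos, out_pos, state = [c for c in checkpoints if c[1] <= start][-1]
    d = state.copy()
    skip = start - out_pos
    # Inflate only as far as the end of the requested range
    data = d.decompress(compressed[in_pos:], skip + length)
    return data[skip:skip + length]

# A ramp of 90000 little-endian int32 values, compressed as one zlib stream
raw = b''.join(i.to_bytes(4, 'little') for i in range(300 * 300))
compressed = zlib.compress(raw, 1)
checkpoints = build_checkpoints(compressed)

# Read the last 10 values without inflating the stream from its first byte
tail = read_range(compressed, checkpoints, len(raw) - 40, 40)
print(tail == raw[-40:])  # True
```

A persistent index cannot store live decompressor objects. zlib's own `zran.c` example instead records, for each access point, the bit position in the compressed stream and the preceding 32 KiB of uncompressed data, which deflate needs as its dictionary.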

In [4]:
import chunkindex
import contextlib
import os
import netCDF4

# Define the index path
index_path = dataset_path.parent.joinpath(str(dataset_path.stem) + '_indexchunk.nc')
# Remove it if it already exists
with contextlib.suppress(FileNotFoundError):
    os.remove(index_path)

# Create the zran index for all variables and chunks of the dataset and write it to the netcdf4 file at index_path
chunkindex.create_index(index_path, dataset_path)

# Display the resulting index for one chunk
index_x00 = xr.open_dataset(index_path, group='x/0.0')
index_x00

## Usage of the ZranReferenceFileSystem

The ZranReferenceFileSystem can open the dataset using both the kerchunk and the chunkindex indexes.
It provides the get_partial_values() method to access partial chunks.

We first create a ZranReferenceFileSystem object.

In [5]:
import chunkindex

# Create a file system using ZranReferenceFileSystem
fs = chunkindex.ZranReferenceFileSystem(index=index_path, fo=str(json_path))

# Map it to key/value pairs so we can open it using zarr
m = fs.get_mapper()
list(m.keys())

['.zgroup', 'x/.zarray', 'x/.zattrs', 'x/0.0', 'x/0.1', 'x/1.0', 'x/1.1']
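
The chunk keys follow the zarr v2 naming scheme: `x/i.j` is the chunk at grid position (i, j) of variable `x`. The helper below, whose name `chunk_key_to_slices` is made up for this example, converts a key into the region of the array it covers, assuming a regular chunk grid.

```python
def chunk_key_to_slices(key, chunk_shape, shape):
    """Map a zarr v2 chunk key such as 'x/1.0' to the array region it covers."""
    grid_index = [int(i) for i in key.rsplit('/', 1)[-1].split('.')]
    return tuple(
        slice(i * c, min((i + 1) * c, s))
        for i, c, s in zip(grid_index, chunk_shape, shape)
    )

for key in ['x/0.0', 'x/0.1', 'x/1.0', 'x/1.1']:
    print(key, chunk_key_to_slices(key, (300, 300), (600, 600)))
# x/0.0 (slice(0, 300, None), slice(0, 300, None))
# x/0.1 (slice(0, 300, None), slice(300, 600, None))
# x/1.0 (slice(300, 600, None), slice(0, 300, None))
# x/1.1 (slice(300, 600, None), slice(300, 600, None))
```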

We can then use the mapped filesystem to access metadata and data more easily through the zarr API:

In [6]:
import zarr

# Open the dataset using zarr API
ds = zarr.open(m)

print("\nWe access only metadata:")
print(ds.x.info)

print("\nWe access data in the first chunk:")
print(ds.x[0, :7])

print("\nWe access data in the last chunk:")
print(ds.x[-1, -7:])

print("\nWe access data in all chunks:")
print(ds.x[:])

# get_partial_values() takes a byte offset and a byte length within the uncompressed chunk
print("\nWe access partial chunks at the beginning of the chunk:")
print(np.frombuffer(ds.x.store.fs.get_partial_values('x/0.0', 0, 7*4), dtype=np.int32))

print("\nWe access partial chunks at the end of the chunk:")
print(np.frombuffer(ds.x.store.fs.get_partial_values('x/1.1', (300*300-10)*4, 10*4), dtype=np.int32))


We access only metadata:
Name               : /x
Type               : zarr.core.Array
Data type          : int32
Shape              : (600, 600)
Chunk shape        : (300, 300)
Order              : C
Read-only          : False
Compressor         : Zlib(level=1)
Store type         : zarr.storage.FSStore
No. bytes          : 1440000 (1.4M)
No. bytes stored   : 499940 (488.2K)
Storage ratio      : 2.9
Chunks initialized : 4/4


We access data in the first chunk:
[0 1 2 3 4 5 6]

We access data in the last chunk:
[359993 359994 359995 359996 359997 359998 359999]

We access data in all chunks:
[[     0      1      2 ...    597    598    599]
 [   600    601    602 ...   1197   1198   1199]
 [  1200   1201   1202 ...   1797   1798   1799]
 ...
 [358200 358201 358202 ... 358797 358798 358799]
 [358800 358801 358802 ... 359397 359398 359399]
 [359400 359401 359402 ... 359997 359998 359999]]

We access partial chunks at the beginning of the chunk:
[0 1 2 3 4 5 6]

We access partial chunks at t
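
As the calls above suggest, the offset and length passed to `get_partial_values()` are byte positions in the uncompressed chunk. For a C-ordered chunk, the element at local position (row, col) starts at byte `(row * n_cols + col) * itemsize`. The sketch below works through that arithmetic for the two calls in the cell; `element_offset` is a hypothetical helper, not a chunkindex function.

```python
def element_offset(row, col, chunk_shape, itemsize):
    """Byte offset of element (row, col) in an uncompressed C-ordered chunk."""
    n_rows, n_cols = chunk_shape
    if not (0 <= row < n_rows and 0 <= col < n_cols):
        raise IndexError('position outside the chunk')
    return (row * n_cols + col) * itemsize

chunk_shape, itemsize = (300, 300), 4  # int32

# First 7 values of the chunk: offset 0, length 7 * 4 bytes
print(element_offset(0, 0, chunk_shape, itemsize), 7 * itemsize)       # 0 28

# Last 10 values of the chunk: they start at local position (299, 290)
print(element_offset(299, 290, chunk_shape, itemsize), 10 * itemsize)  # 359960 40
```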

## Usage of the ZranStore

The ZranStore is a zarr store provided to simplify the access to partial chunks.
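
To see what partial access saves, it helps to work out which bytes of which chunks a selection actually touches. The sketch below does this for a 2-D, C-ordered array with a regular chunk grid: it splits a rectangular selection into per-chunk row segments and reports each one as (chunk key, byte offset, byte length) in the uncompressed chunk. The function `selection_to_byte_ranges` is illustrative only and does not reflect how ZranStore or zarr compute their reads.

```python
def selection_to_byte_ranges(rows, cols, chunk_shape, itemsize, name='x'):
    """Split a 2-D selection into (chunk key, byte offset, byte length) runs.

    `rows` and `cols` are (start, stop) pairs in array coordinates.
    """
    ch_r, ch_c = chunk_shape
    ranges = []
    for r in range(rows[0], rows[1]):
        # Walk the selected columns one chunk at a time
        c = cols[0]
        while c < cols[1]:
            stop = min(cols[1], (c // ch_c + 1) * ch_c)
            key = f'{name}/{r // ch_r}.{c // ch_c}'
            offset = ((r % ch_r) * ch_c + c % ch_c) * itemsize
            ranges.append((key, offset, (stop - c) * itemsize))
            c = stop
    return ranges

# Two rows and sixty columns straddling the corner shared by the four chunks
for run in selection_to_byte_ranges((299, 301), (290, 350), (300, 300), 4):
    print(run)
# ('x/0.0', 359960, 40)
# ('x/0.1', 358800, 200)
# ('x/1.0', 1160, 40)
# ('x/1.1', 0, 200)
```

This selection needs 480 bytes in total, taken from four chunks of 360000 uncompressed bytes each.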

In [7]:
import chunkindex

# Create a file system using ZranReferenceFileSystem
fs = chunkindex.ZranReferenceFileSystem(index=index_path, fo=str(json_path))
m = fs.get_mapper()

# Create a ZranStore for zarr
store = chunkindex.ZranStore(m)

# Use this store as the root group
root = zarr.group(store=store)
# Get the variable 'x'
x = root.x

# zarr only implements partial reads for uncompressed data or for data compressed with blosc.
# Fake some attributes so that zarr enters the UncompressedPartialReadBufferV3 class
# from Array._chunk_getitems().
x._partial_decompress = True
x._compressor = None
x._filters = None

print("\nWe access partial chunks at the beginning of the chunk:")
print(x[0, 0:7])

print("\nWe access data in the first chunk at different locations:")
print(x[0:2, 10:20])

print("\nWe access data at the intersection of the 4 chunks:")
print(x[299:301, 290:350])


We access partial chunks at the beginning of the chunk:
[0 1 2 3 4 5 6]

We access data in the first chunk at different locations:
[[ 10  11  12  13  14  15  16  17  18  19]
 [610 611 612 613 614 615 616 617 618 619]]

We access data at the intersection of the 4 chunks:
[[179690 179691 179692 179693 179694 179695 179696 179697 179698 179699
  179700 179701 179702 179703 179704 179705 179706 179707 179708 179709
  179710 179711 179712 179713 179714 179715 179716 179717 179718 179719
  179720 179721 179722 179723 179724 179725 179726 179727 179728 179729
  179730 179731 179732 179733 179734 179735 179736 179737 179738 179739
  179740 179741 179742 179743 179744 179745 179746 179747 179748 179749]
 [180290 180291 180292 180293 180294 180295 180296 180297 180298 180299
  180300 180301 180302 180303 180304 180305 180306 180307 180308 180309
  180310 180311 180312 180313 180314 180315 180316 180317 180318 180319
  180320 180321 180322 180323 180324 180325 180326 180327 180328 180329
  180330 