# Chunking in HDF5

> Objectives:
> * Explain the concept of data chunking
> * Show how to create and read datasets that are chunked
> * Learn how to choose reasonable chunk sizes for your datasets

The HDF5 library supports several layouts for storing datasets.

* Contiguous layout:
  ![Contiguous](img/dset_contiguous4x4.jpg)
  The dataset is stored as a single block in the file.  More compact, and usually faster to read.  Typically used for small datasets (< 1 MB).
  
* Chunked layout:
  ![Chunked](img/dset_chunked4x4.jpg)
  The dataset is split into fixed-size chunks that are stored separately, so it can be enlarged and compressed.  Reads can still be fast when a fast decompressor is used.  Typically used for large datasets.
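
To make the two layouts concrete, here is a minimal sketch of how the layout of a dataset can be inspected from h5py through its `chunks` attribute. The file name `layout_demo.h5` and the dataset names are made up for this illustration, and the file is kept in memory so nothing is written to disk.

```python
import numpy as np
import h5py

# In-memory file (core driver without a backing store): no file is left on disk.
# The file and dataset names are hypothetical, chosen only for this sketch.
with h5py.File("layout_demo.h5", "w", driver="core", backing_store=False) as f:
    data = np.arange(1000, dtype=np.int64)

    # No `chunks` argument: h5py stores the dataset contiguously.
    contiguous = f.create_dataset("contiguous", data=data)

    # Explicit chunk shape: the dataset is split into chunks of 100 elements.
    chunked = f.create_dataset("chunked", data=data, chunks=(100,))

    contiguous_chunks = contiguous.chunks  # None for a contiguous dataset
    chunked_chunks = chunked.chunks        # the chunk shape for a chunked one

print("contiguous:", contiguous_chunks)
print("chunked:   ", chunked_chunks)
```

A `chunks` value of `None` means the dataset is not chunked; otherwise the attribute holds the chunk shape.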

## Creating chunked datasets

In [None]:
import numpy as np
import h5py

In [None]:
import os
import shutil
data_dir = "chunking"
if os.path.exists(data_dir):
    shutil.rmtree(data_dir)
os.mkdir(data_dir)

In [None]:
def create_files(size, chunksize):
    data = np.arange(size, dtype=np.int64)

    # Contiguous array
    with h5py.File(os.path.join(data_dir, "continuous.h5"), "w") as f:
        f.create_dataset("data", data=data, dtype=np.int64)

    # Simple chunking
    with h5py.File(os.path.join(data_dir, "chunked.h5"), "w") as f:
        dset = f.create_dataset("data", (size,), chunks=(chunksize,), dtype=np.int64)
        dset[:] = data

    # Automatic chunking and unlimited resizing
    with h5py.File(os.path.join(data_dir, "automatic.h5"), "w") as f:
        dset = f.create_dataset("data", (0,), chunks=True, maxshape=(None,), dtype=np.int64)
        dset.resize((size,))
        dset[:] = data

In [None]:
create_files(size=1000, chunksize=100)

In [None]:
!h5ls -v {data_dir}/chunked.h5

In [None]:
%ls -l {data_dir}
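
The same information can also be gathered from Python, which may be handy where `h5ls` is not installed. The sketch below is one possible way to do it, not part of the original notebook: the helper `storage_report` is a hypothetical name, it writes to a temporary directory rather than to `data_dir`, and it relies on the low-level `get_storage_size()` call of the dataset identifier to report the bytes allocated for the raw data.

```python
import os
import tempfile

import numpy as np
import h5py

def storage_report(size, chunksize):
    """Return {layout: (bytes allocated for raw data, file size on disk)}."""
    data = np.arange(size, dtype=np.int64)
    report = {}
    with tempfile.TemporaryDirectory() as tmp:
        for layout, chunks in (("contiguous", None), ("chunked", (chunksize,))):
            path = os.path.join(tmp, layout + ".h5")

            # Write the dataset and close the file so everything is on disk.
            with h5py.File(path, "w") as f:
                f.create_dataset("data", data=data, chunks=chunks)

            # Reopen read-only and ask how much space the raw data occupies.
            with h5py.File(path, "r") as f:
                allocated = f["data"].id.get_storage_size()

            report[layout] = (allocated, os.path.getsize(path))
    return report

report = storage_report(size=1000, chunksize=100)
for layout, (allocated, on_disk) in report.items():
    print(f"{layout:>10}: {allocated} bytes of raw data, {on_disk} bytes on disk")
```

The difference between the two numbers for each file is the space taken by HDF5 metadata, such as headers and, for the chunked layout, the chunk index.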

### Exercise 1

In the example above, set the `chunksize` parameter to 99 and re-run it.  How do the sizes of the different files change?  Why?

## Reading chunked datasets

In [None]:
for h5file in ("continuous.h5", "chunked.h5", "automatic.h5"):
    print("reading %s..." % h5file)
    %timeit h5py.File(os.path.join(data_dir, h5file), "r")['data'][:]

### Exercise 2

In the example above, set `size` to 10 million and choose a minimal `chunksize` that offers a reasonable file size and read speed.
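
As a starting point for this exploration, the sketch below measures file size and read time for a few candidate chunk sizes. It is only one way to set up the comparison: `benchmark_chunksizes` is a hypothetical helper, the candidate values are arbitrary, and a smaller `size` than the exercise asks for is used so that it runs quickly. Timings depend on the machine and on the operating system file cache, so treat them as rough indications.

```python
import os
import tempfile
import time

import numpy as np
import h5py

def benchmark_chunksizes(size, chunksizes, repeats=3):
    """Return {chunksize: (file size in bytes, best read time in seconds)}."""
    data = np.arange(size, dtype=np.int64)
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        for chunksize in chunksizes:
            path = os.path.join(tmp, f"chunked_{chunksize}.h5")
            with h5py.File(path, "w") as f:
                f.create_dataset("data", data=data, chunks=(chunksize,))

            # Read the whole dataset several times and keep the best time.
            timings = []
            for _ in range(repeats):
                start = time.perf_counter()
                with h5py.File(path, "r") as f:
                    f["data"][:]
                timings.append(time.perf_counter() - start)

            results[chunksize] = (os.path.getsize(path), min(timings))
    return results

# Arbitrary candidates; adjust `size` and the list for the exercise.
results = benchmark_chunksizes(size=100_000, chunksizes=[100, 1_000, 10_000])
for chunksize, (nbytes, seconds) in results.items():
    print(f"chunksize={chunksize:>6}: {nbytes} bytes, {seconds * 1e3:.2f} ms")
```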