# Chunking in HDF5

The HDF5 library supports several layouts for storing datasets.

* Contiguous layout:
  ![Contiguous](img/dset_contiguous4x4.jpg)
  More compact and usually faster to read.  Typically used for small datasets (< 1 MB).
  
* Chunked layout:
  ![Chunked](img/dset_chunked4x4.jpg)
  Datasets can be enlarged and compressed.  Reads can still be fast when a fast decompressor is used.  Typically used for large datasets.
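
As a quick illustration of the two layouts, the sketch below creates one dataset of each kind and inspects the `chunks` attribute, which h5py reports as `None` for a contiguous dataset and as the chunk shape for a chunked one. The scratch file name and the `(2, 2)` chunk shape are arbitrary choices made for this example, not taken from the figures above.

```python
import os
import tempfile

import h5py
import numpy as np

data = np.arange(16, dtype=np.int64).reshape(4, 4)

# Hypothetical scratch file, used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "layouts.h5")

with h5py.File(path, "w") as f:
    # No chunks argument: h5py stores the dataset contiguously
    contiguous = f.create_dataset("contiguous", data=data)
    # Explicit chunk shape (arbitrary here): the dataset is stored as 2x2 tiles
    chunked = f.create_dataset("chunked", data=data, chunks=(2, 2))

    contiguous_chunks = contiguous.chunks  # None for a contiguous dataset
    chunked_chunks = chunked.chunks        # the chunk shape as a tuple

print(contiguous_chunks, chunked_chunks)
```

Checking `dset.chunks` in this way is a convenient test of which layout an existing dataset uses.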

## Creating datasets

In [6]:
import numpy as np
import h5py

In [37]:
import os
import shutil
if os.path.exists("chunking"):
    shutil.rmtree("chunking")
os.mkdir("chunking")

In [38]:
def create_files(size, chunksize):
    data = np.arange(size, dtype=np.int64)

    # Contiguous array
    with h5py.File("chunking/continuous.h5", "w") as f:
        f.create_dataset("data", data=data, dtype=np.int64)

    # Simple chunking
    with h5py.File("chunking/chunked.h5", "w") as f:
        dset = f.create_dataset("data", (size,), chunks=(chunksize,), dtype=np.int64)
        dset[:] = data

    # Automatic chunking and unlimited resizing
    with h5py.File("chunking/automatic.h5", "w") as f:
        dset = f.create_dataset("data", (0,), chunks=True, maxshape=(None,), dtype=np.int64)
        dset.resize((size,))
        dset[:] = data

In [39]:
create_files(size=1000, chunksize=100)

In [40]:
!ls -l chunking/*.h5

-rw-r--r--  1 faltet  staff  11688 May 11 13:15 chunking/automatic.h5
-rw-r--r--  1 faltet  staff  11496 May 11 13:15 chunking/chunked.h5
-rw-r--r--  1 faltet  staff  10144 May 11 13:15 chunking/continuous.h5


### Exercise 1

In the example above, set the `chunksize` parameter to 99 and re-run it.  How do the sizes of the different files change?  Why?

In [41]:
create_files(size=1000, chunksize=99)
!ls -l chunking/*.h5

-rw-r--r--  1 faltet  staff  11688 May 11 13:16 chunking/automatic.h5
-rw-r--r--  1 faltet  staff  12208 May 11 13:16 chunking/chunked.h5
-rw-r--r--  1 faltet  staff  10144 May 11 13:16 chunking/continuous.h5
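
One way to reason about the change in `chunked.h5` is to count whole chunks. The sketch below assumes that every chunk is allocated in full, even when the last one is only partially filled, and that no compression is applied. The helper `allocated_bytes` is a hypothetical name introduced for this illustration.

```python
import math


def allocated_bytes(size, chunksize, itemsize=8):
    """Bytes needed to hold `size` items in whole chunks of `chunksize` items.

    Assumes every chunk is allocated in full and no compression is used.
    """
    n_chunks = math.ceil(size / chunksize)
    return n_chunks * chunksize * itemsize


bytes_100 = allocated_bytes(1000, 100)  # 10 chunks of 100 int64 items
bytes_99 = allocated_bytes(1000, 99)    # 11 chunks of 99 int64 items
print(bytes_100, bytes_99, bytes_99 - bytes_100)
```

Under these assumptions the 99-item chunking needs 712 bytes more than the 100-item one, which equals the difference between the two `chunked.h5` sizes listed above (12208 - 11496).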


## Reading datasets

In [43]:
for h5file in ("chunking/continuous.h5", "chunking/chunked.h5", "chunking/automatic.h5"):
    print("reading %s..." % h5file)
    %timeit h5py.File(h5file, "r")['data'][:]

reading chunking/continuous.h5...
1000 loops, best of 3: 465 µs per loop
reading chunking/chunked.h5...
1000 loops, best of 3: 482 µs per loop
reading chunking/automatic.h5...
1000 loops, best of 3: 464 µs per loop


### Exercise 2

In the example above, set `size` to 10 million and choose the smallest `chunksize` that gives a reasonable file size and read speed.

In [47]:
create_files(size=1000*1000*10, chunksize=100000)
!ls -lh chunking-*.h5

-rw-r--r--  1 faltet  staff    11K May 11 13:12 chunking-automatic.h5
-rw-r--r--  1 faltet  staff    11K May 11 13:12 chunking-chunked.h5
-rw-r--r--  1 faltet  staff   9.9K May 11 13:12 chunking-continuous.h5
-rw-r--r--  1 faltet  staff    77M May 11 13:08 chunking-unlimited.h5


In [48]:
for h5file in ("chunking/continuous.h5", "chunking/chunked.h5", "chunking/automatic.h5"):
    print("reading %s..." % h5file)
    %timeit h5py.File(h5file, "r")['data'][:]

reading chunking/continuous.h5...
10 loops, best of 3: 39.4 ms per loop
reading chunking/chunked.h5...
10 loops, best of 3: 44.6 ms per loop
reading chunking/automatic.h5...
10 loops, best of 3: 131 ms per loop
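
A possible explanation for the slower read of `automatic.h5` is the chunk shape that h5py picks when `chunks=True`. This is an assumption, not something the timings above establish. The sketch below shows how the chosen shape could be inspected. It uses a hypothetical scratch file and writes no data, and the result depends on the h5py version.

```python
import os
import tempfile

import h5py

# Hypothetical scratch file, used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "automatic.h5")

with h5py.File(path, "w") as f:
    # Same creation pattern as the "automatic" case in create_files()
    dset = f.create_dataset("data", (0,), chunks=True, maxshape=(None,), dtype="int64")
    dset.resize((1000 * 1000 * 10,))

    auto_chunks = dset.chunks  # chunk shape guessed by h5py
    # Number of chunks needed to cover the dataset (ceiling division)
    n_chunks = -(-dset.shape[0] // auto_chunks[0])

print("chunk shape:", auto_chunks, "number of chunks:", n_chunks)
```

If the guessed chunks turn out to be much smaller than the 100000-item chunks used for `chunked.h5`, a full read has to visit many more chunks, which would be consistent with the longer time measured above.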
