# Chunking in HDF5

> Objectives:
> * Explain the concept of data chunking
> * Show how to create and read datasets that are chunked
> * Learn how to choose reasonable chunk sizes for your datasets

The HDF5 library supports several storage layouts for datasets.

* Contiguous layout:
  ![Contiguous](img/dset_contiguous4x4.jpg)
  More compact, and usually faster to read.  Typically used for small datasets (< 1 MB).
  
* Chunked layout:
  ![Chunked](img/dset_chunked4x4.jpg)
  Datasets can be enlarged and compressed.  Reads can still be fast when a fast decompressor is used.  Typically used for large datasets.

## Creating chunked datasets

In [1]:
import numpy as np
import h5py

In [2]:
import os
import shutil
data_dir = "chunking"
if os.path.exists(data_dir):
    shutil.rmtree(data_dir)
os.mkdir(data_dir)

In [3]:
def create_files(size, chunksize):
    data = np.arange(size, dtype=np.int64)

    # Contiguous array
    with h5py.File(os.path.join(data_dir, "continuous.h5"), "w") as f:
        f.create_dataset("data", data=data, dtype=np.int64)

    # Simple chunking
    with h5py.File(os.path.join(data_dir, "chunked.h5"), "w") as f:
        dset = f.create_dataset("data", (size,), chunks=(chunksize,), dtype=np.int64)
        dset[:] = data

    # Automatic chunking and unlimited resizing
    with h5py.File(os.path.join(data_dir, "automatic.h5"), "w") as f:
        dset = f.create_dataset("data", (0,), chunks=True, maxshape=(None,), dtype=np.int64)
        dset.resize((size,))
        dset[:] = data

In [4]:
create_files(size=1000, chunksize=100)

In [5]:
!h5ls -v {data_dir}/chunked.h5

Opened "chunking/chunked.h5" with sec2 driver.
data                     Dataset {1000/1000}
    Location:  1:800
    Links:     1
    Chunks:    {100} 800 bytes
    Storage:   8000 logical bytes, 8000 allocated bytes, 100.00% utilization
    Type:      native long
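
The layout reported by `h5ls` can also be inspected from Python. The following is a minimal sketch, not part of the original notebook: it assumes an in-memory file (the `core` driver with `backing_store=False`) and a hypothetical file name `layouts_demo.h5`, and it reads the `chunks` and `maxshape` attributes that h5py exposes on each dataset.

```python
import h5py
import numpy as np

data = np.arange(1000, dtype=np.int64)

# Assumption for this sketch: an in-memory file, so nothing is written to disk.
with h5py.File("layouts_demo.h5", "w", driver="core", backing_store=False) as f:
    # No `chunks` argument: the dataset uses the contiguous layout.
    contiguous = f.create_dataset("contiguous", data=data)

    # Explicit chunk shape.
    chunked = f.create_dataset("chunked", data=data, chunks=(100,))

    # Chunk shape chosen by h5py; an unlimited axis requires chunking.
    automatic = f.create_dataset(
        "automatic", shape=(0,), maxshape=(None,), chunks=True, dtype=np.int64
    )
    automatic.resize((1000,))
    automatic[:] = data

    # Collect the layout information while the file is still open.
    layout_info = {
        name: (dset.chunks, dset.maxshape)
        for name, dset in (
            ("contiguous", contiguous),
            ("chunked", chunked),
            ("automatic", automatic),
        )
    }

for name, (chunks, maxshape) in layout_info.items():
    print(name, "chunks:", chunks, "maxshape:", maxshape)
```

A contiguous dataset reports `chunks` as `None`, while the resizable dataset reports `None` in the unlimited position of `maxshape`.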


In [6]:
%ls -l chunking

total 72
-rw-r--r--  1 faltet  staff  11688 May 20 13:03 automatic.h5
-rw-r--r--  1 faltet  staff  11496 May 20 13:03 chunked.h5
-rw-r--r--  1 faltet  staff  10144 May 20 13:03 continuous.h5


### Exercise 1

In the example above, set the `chunksize` parameter to 99 and re-run it.  How do the sizes of the different files change?  Why?

In [7]:
create_files(size=1000, chunksize=99)
%ls -l chunking

total 72
-rw-r--r--  1 faltet  staff  11688 May 20 13:03 automatic.h5
-rw-r--r--  1 faltet  staff  12208 May 20 13:03 chunked.h5
-rw-r--r--  1 faltet  staff  10144 May 20 13:03 continuous.h5


In [8]:
!h5ls -v {data_dir}/chunked.h5

Opened "chunking/chunked.h5" with sec2 driver.
data                     Dataset {1000/1000}
    Location:  1:800
    Links:     1
    Chunks:    {99} 792 bytes
    Storage:   8000 logical bytes, 8712 allocated bytes, 91.83% utilization
    Type:      native long
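
The storage figures reported by `h5ls` can be reproduced with a little arithmetic. The sketch below assumes that every chunk of an uncompressed dataset is allocated at its full size, including the partially filled last one; the helper name `chunk_storage` is hypothetical and only serves this illustration.

```python
import math

def chunk_storage(n_items, chunk_items, item_bytes=8):
    """Return (n_chunks, allocated_bytes, utilization_percent) for a 1-D dataset.

    Assumes uncompressed chunks that are always allocated at full size.
    """
    n_chunks = math.ceil(n_items / chunk_items)
    allocated = n_chunks * chunk_items * item_bytes
    logical = n_items * item_bytes
    return n_chunks, allocated, 100.0 * logical / allocated

for chunk_items in (100, 99):
    n_chunks, allocated, utilization = chunk_storage(1000, chunk_items)
    print(
        f"chunks of {chunk_items}: {n_chunks} chunks, "
        f"{allocated} allocated bytes, {utilization:.2f}% utilization"
    )
```

With 99 items per chunk, 1000 items no longer divide evenly: 11 chunks are needed and the last one is mostly empty, which accounts for the 8712 allocated bytes and the larger `chunked.h5` file.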


## Reading chunked datasets

In [9]:
for h5file in ("continuous.h5", "chunked.h5", "automatic.h5"):
    print("reading %s..." % h5file)
    %timeit h5py.File(os.path.join(data_dir, h5file), "r")['data'][:]

reading continuous.h5...
The slowest run took 4.04 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 451 µs per loop
reading chunked.h5...
1000 loops, best of 3: 496 µs per loop
reading automatic.h5...
1000 loops, best of 3: 476 µs per loop


### Exercise 2

In the example above, set `size` to 10 million and choose the smallest `chunksize` that still gives a reasonable file size and read speed.

In [10]:
create_files(size=1000*1000*10, chunksize=100000)
%ls -lh chunking/*.h5

-rw-r--r--  1 faltet  staff    77M May 20 13:03 chunking/automatic.h5
-rw-r--r--  1 faltet  staff    76M May 20 13:03 chunking/chunked.h5
-rw-r--r--  1 faltet  staff    76M May 20 13:03 chunking/continuous.h5


In [11]:
for h5file in ("continuous.h5", "chunked.h5", "automatic.h5"):
    print("reading %s..." % h5file)
    %timeit h5py.File(os.path.join(data_dir, h5file), "r")['data'][:]

reading continuous.h5...
10 loops, best of 3: 37.2 ms per loop
reading chunked.h5...
10 loops, best of 3: 41.1 ms per loop
reading automatic.h5...
10 loops, best of 3: 124 ms per loop


### Exercise 3

Using the 10-million-element datasets above, retrieve just a small slice (say `[10000:30000]`) from each and time how long the read takes.  Do you think the whole dataset has to be read in each case?

In [12]:
for h5file in ("continuous.h5", "chunked.h5", "automatic.h5"):
    print("reading %s..." % h5file)
    %timeit h5py.File(os.path.join(data_dir, h5file), "r")['data'][10000:30000]

reading continuous.h5...
1000 loops, best of 3: 482 µs per loop
reading chunked.h5...
1000 loops, best of 3: 656 µs per loop
reading automatic.h5...
1000 loops, best of 3: 590 µs per loop
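
The timings above suggest that only a small part of each file is touched. As a rough sketch, assuming the library reads whole chunks (the usual model for chunked storage), the chunks overlapping a 1-D slice can be computed as follows; `chunks_touched` is a hypothetical helper, not an h5py function.

```python
def chunks_touched(start, stop, chunk_items):
    """Return the indices of the 1-D chunks that overlap the slice [start:stop]."""
    if stop <= start:
        return range(0)  # empty slice: no chunk is needed
    first = start // chunk_items
    last = (stop - 1) // chunk_items  # `stop` is exclusive
    return range(first, last + 1)

# Slice from Exercise 3 with the chunk size used for chunked.h5 in Exercise 2.
chunk_items = 100_000
touched = chunks_touched(10_000, 30_000, chunk_items)
bytes_read = len(touched) * chunk_items * 8  # int64 items, 8 bytes each

print("chunks touched:", list(touched))
print("bytes read under the whole-chunk assumption:", bytes_read)
```

Under that assumption the slice falls entirely inside the first chunk, so about 800 KB is read instead of the full 80 MB of logical data. The chunk shape picked automatically for `automatic.h5` is not shown in the output above, so the same estimate for that file would need its actual `chunks` value.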
