# 3 Datasets

Objectives
 * Use the h5pyd package to connect with the HDF Lab
 * Explore characteristics of Datasets (with HDF5Lib and HSDS)
 * Look at different ways of reading/writing to datasets
 * Examine how chunking works with HSDS
 * Tricks for best performance

In [None]:
USE_H5PY = True  # set to False to use HSDS instead
if USE_H5PY:
    import h5py
    WORK_DIR="."  # this directory
else:
    import h5pyd as h5py
    WORK_DIR="hdf5://home/test_user1/"
import os.path as op

In [None]:
filepath = op.join(WORK_DIR, "03.h5")
print(f"creating HDF5 file here: {filepath}")
f = h5py.File(filepath, 'w')
list(f)

Note: The 'w' mode removes an existing file (if any) and creates a new empty file.
Other modes supported are:
 * 'r': Open as read only, file must exist
 * 'r+': Read/write, file must exist
 * 'x': Create file, fail if it exists
 * 'a': Read/write if exists, otherwise create
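
The modes above can be tried out on a scratch file. This is a minimal sketch that assumes the local HDF5 library (`USE_H5PY = True`); the temporary directory and the file name `modes_demo.h5` are made up for illustration.

```python
import os
import tempfile

import h5py  # assumption: local HDF5 library, not HSDS

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "modes_demo.h5")  # hypothetical scratch file

    # 'w' creates the file (or replaces an existing one)
    with h5py.File(path, "w") as f:
        f.create_dataset("d", (2,), dtype="i8")

    # 'x' refuses to touch a file that already exists
    try:
        h5py.File(path, "x")
        x_failed = False
    except OSError:
        x_failed = True

    # 'a' opens the existing file read/write and keeps its contents
    with h5py.File(path, "a") as f:
        names_after_append = list(f)

    # 'w' again starts over with an empty file
    with h5py.File(path, "w") as f:
        names_after_truncate = list(f)

print(x_failed, names_after_append, names_after_truncate)
```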

In [None]:
# The only object currently in the new file is the root group; we can get its id like this
root = f['/']
root.id.id

In [None]:
# create a new dataset.  Pass in name, shape, and type
f.create_dataset("test1", (3,4), dtype='i8')  # we've created a dataset!

In [None]:
# now something shows up if we list the contents of the file
list(f)

In [None]:
# The dataset type is fixed at creation time
dset = f['test1']
dset.dtype

In [None]:
# in this case the shape is fixed at creation time, though we'll see later that it is
# possible to create extensible datasets
dset.shape

In [None]:
# you can read all the elements of a dataset using the ellipsis operator
out = dset[...]
out

In [None]:
# you can update portions of the dataset using numpy-like syntax
dset[0,0:4] = [1,2,3,4]
dset[...]
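
A few more selection patterns follow the same numpy-like syntax. The sketch below is self-contained and assumes h5py with the in-memory "core" driver, so nothing is written to disk; the file name `slicing_demo.h5` and the dataset name `demo` are placeholders, not part of this notebook's file.

```python
import h5py  # assumption: local HDF5 library; the "core" driver is h5py-specific

# driver="core" with backing_store=False keeps the file in memory only
with h5py.File("slicing_demo.h5", "w", driver="core", backing_store=False) as f:
    demo = f.create_dataset("demo", (3, 4), dtype="i8")
    demo[0, 0:4] = [1, 2, 3, 4]   # write the first row
    demo[:, 3] = 9                # broadcast a scalar down the last column
    first_row = demo[0, :]        # reads come back as NumPy arrays
    corner = demo[1:3, 2:4]       # 2x2 block from the lower right

print(first_row)
print(corner.shape)
```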

In [None]:
# only portions of the dataset that actually get written are stored
# create a really big dataset
f.create_dataset("big_data", (1024,1024,1024), dtype='f4')  # 4 GB dataset!
dset = f['big_data']
dset[512,512,512] = 3.12  # write one element

In [None]:
# read back a small region
dset[510:514,512,512]
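
For reference, the logical size of this dataset can be worked out from its shape and element size. This is plain arithmetic and says nothing about how much storage has actually been allocated.

```python
import math

shape = (1024, 1024, 1024)
itemsize = 4  # bytes per 'f4' (32-bit float) element

# logical size = number of elements * bytes per element
logical_bytes = math.prod(shape) * itemsize
logical_gib = logical_bytes / 2**30

print(logical_bytes)  # 4294967296
print(logical_gib)    # 4.0
```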

Problem: Use h5stat (or hsstat for HSDS) with this file.  How many bytes have been allocated?
Note: hsstat may need a minute before it shows the most recent changes.

In [None]:
# Dataset storage is broken up into "chunks".  With HSDS, each chunk is stored as a separate S3 object.
# Unlike with h5py, HSDS datasets are always chunked (even if it is just one chunk!).
# If no chunk layout is given in the dataset create call, one is determined automatically.
dset.chunks

In [None]:
# specify a chunk layout
f.create_dataset("chunked_data", (1024,1024,1024), dtype='f4',chunks=(1,1024,1024))
dset = f["chunked_data"]
dset.chunks
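
To get a feel for what a chunk layout implies, the chunk size and chunk count can be computed by hand. This sketch assumes the layout is used exactly as requested; `chunk_index` is a hypothetical helper written for illustration and is not part of h5py or h5pyd.

```python
import math

shape = (1024, 1024, 1024)
chunks = (1, 1024, 1024)
itemsize = 4  # bytes per 'f4' element

# bytes in one chunk
chunk_bytes = math.prod(chunks) * itemsize

# number of chunks along each dimension (rounding up), and in total
chunks_per_dim = [-(-s // c) for s, c in zip(shape, chunks)]
total_chunks = math.prod(chunks_per_dim)

def chunk_index(coord, chunks):
    """Return the index of the chunk that contains the element at coord."""
    return tuple(i // c for i, c in zip(coord, chunks))

print(chunk_bytes)                             # 4194304 (4 MiB)
print(total_chunks)                            # 1024
print(chunk_index((512, 512, 512), chunks))    # (512, 0, 0)
```

With this layout, a read such as `dset[512, :, :]` falls entirely within one chunk, while `dset[:, 512, 512]` touches all 1024 chunks.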

Problem: The server will "correct" chunk layouts that result in chunks that are too small or too large.  Try creating datasets with very small and very large chunks.  What chunk layout do you get?

In [None]:
# Delete a dataset by using the del operator.  With the HDF5 library, this can leave
# "holes" in the file (you can use the h5repack tool to defragment).
# With HDF Server this is not an issue (since each chunk is stored
# as an object).
del f['test1']
list(f)

In [None]:
# If you would like a default value other than 0, specify a 
#   fill value when creating the dataset
f.create_dataset("fill_value", (1024,1024,1024), dtype='i4', fillvalue=42)
dset = f['fill_value']

In [None]:
dset[1,2,3:6]  # get 3 elements from the array
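
The same behavior can be seen on a small scale. The sketch below assumes h5py with the in-memory "core" driver and uses a deliberately tiny dataset; the file name `fill_demo.h5`, the dataset name `small_fill`, and the chunk shape are illustrative choices only.

```python
import h5py  # assumption: local HDF5 library; the "core" driver is h5py-specific

with h5py.File("fill_demo.h5", "w", driver="core", backing_store=False) as f:
    small = f.create_dataset("small_fill", (4, 4), dtype="i4",
                             chunks=(2, 2), fillvalue=42)
    small[0, 0] = 7                # write a single element
    stored_fill = small.fillvalue  # the fill value is kept with the dataset
    values = small[...]            # unwritten elements read back as the fill value

print(stored_fill)
print(values)
```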

Problem: Run h5stat/hsstat with this file.  How many chunks in the dataset have been allocated?

In [None]:
# release the file handle
f.close()
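
As an alternative to calling close() explicitly, a file can be opened in a with block, which closes it when the block exits (even if an exception is raised). A minimal sketch, assuming h5py and an in-memory file with the placeholder name `ctx_demo.h5`:

```python
import h5py  # assumption: local HDF5 library; the "core" driver is h5py-specific

with h5py.File("ctx_demo.h5", "w", driver="core", backing_store=False) as f2:
    f2.create_dataset("test1", (3, 4), dtype="i8")
    open_inside = bool(f2)   # a file object is truthy while it is open

open_after = bool(f2)        # and falsy once the with block has closed it

print(open_inside, open_after)
```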