# 3 Datasets

Objectives
 * Use the h5pyd package to connect with the HDF Kita Server
 * Explore characteristics of Datasets
 * Look at different ways of reading/writing to datasets
 * Examine how chunking works with HDF Server
 * Tricks for best performance

In [5]:
USE_H5PY=1  # set to 0 to use HDF Server instead

In [6]:
if USE_H5PY:
    import h5py
else:
    import h5pyd as h5py
import numpy as np
import os

In [7]:
#
# Get folder/directory for HDF files we create  
#
def getMyFolder():
    DIR_NAME = "HDFLabTutorial/"
    if USE_H5PY:
        myfolder = os.getenv("HOME") + "/" + DIR_NAME
        if not os.path.isdir(myfolder):
            # create a directory on the local disk if needed
            os.mkdir(myfolder)
            print("created folder:", myfolder)
    else:
        dir = h5py.Folder('/home/')  # get folder object for the top-level /home/ folder
        username = os.getenv("JUPYTERHUB_USER")
        myfolder = None
        for name in dir:
            # we should come across the given domain
            if username.startswith(name):
                # check any folders where the name matches at least part of the username
                # e.g. folder: "/home/bob/" for username "bob@acme.com"
                path = '/home/' + name + '/'
                f = h5py.Folder(path)
                if f.owner == username:
                    myfolder = path
                f.close()
                if myfolder:
                    break

        dir.close()

        if myfolder is None:
            return None  # no home folder owned by this user was found

        # create a workshop subfolder if not already present
        myfolder += DIR_NAME
        try:
            h5py.Folder(myfolder)
        except IOError as ioe:
            if ioe.errno != 404:
                return None  # unexpected error
            # not present - create it now
            h5py.Folder(myfolder, mode='x')
            print("created folder:", myfolder)
       
    return myfolder

In [9]:
# Get folder/directory for this tutorial
home = getMyFolder()
home  # this is the folder where you have write permission

'/home/jovyan/HDFLabTutorial/'

In [10]:
# create a file/domain on the server
domain_name = home + "03.h5"
f = h5py.File(domain_name, 'w')
list(f)

[]

Note: The 'w' mode removes an existing file (if any) and creates a new empty file.
Other modes supported are:
 * 'r': Open as read only, file must exist
 * 'r+': Read/write, file must exist
 * 'x': Create file, fail if it already exists
 * 'a': Read/write if exists, otherwise create
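
As a rough illustration of how these modes behave, here is a minimal sketch that uses plain h5py against a temporary local file. The file name `modes_demo.h5` and the dataset names are made up for this example, and the sketch does not exercise h5pyd or the server.

```python
import os
import tempfile

import h5py

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "modes_demo.h5")  # hypothetical file name

# 'w' creates the file, replacing any existing one
with h5py.File(path, "w") as f:
    f.create_dataset("d", (2,), dtype="i4")

# 'x' refuses to overwrite a file that already exists
try:
    h5py.File(path, "x")
    x_failed = False
except OSError:
    x_failed = True

# 'r' opens the existing file read-only
with h5py.File(path, "r") as f:
    names_r = list(f)

# 'a' opens the existing file for read/write, keeping its contents
with h5py.File(path, "a") as f:
    f.create_dataset("e", (2,), dtype="i4")
    names_a = sorted(f)

# 'w' again starts over with an empty file
with h5py.File(path, "w") as f:
    names_w = list(f)

print(x_failed, names_r, names_a, names_w)
```

Using `with` blocks closes each file before the next open, which avoids conflicts between handles on the same file.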

In [11]:
# The only object currently in the new file is the root group; we can get its id like this
root = f['/']
root.id.id

144115188075855872

In [12]:
f.create_dataset("test1", (3,4), dtype='i8')  # we've created a dataset!

<HDF5 dataset "test1": shape (3, 4), type "<i8">

In [13]:
# now something shows up if we list the contents of the file
list(f)

['test1']

In [14]:
# The dataset type is fixed at creation time
dset = f['test1']
dset.dtype

dtype('int64')

In [15]:
# in this case the shape is fixed at creation time, though we'll see later that it is
# possible to create extensible datasets
dset.shape

(3, 4)
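
For orientation only, here is a minimal sketch of what an extensible dataset can look like in plain h5py, using the `maxshape` argument and `resize()`. It runs against an in-memory file (the `core` driver with `backing_store=False`) so nothing is written to disk. The names `resize_demo.h5` and `growable` are made up for this example, and the sketch does not exercise h5pyd.

```python
import h5py

# in-memory file, nothing is written to disk
with h5py.File("resize_demo.h5", "w", driver="core", backing_store=False) as f:
    # None in maxshape means that dimension may grow without limit
    dset = f.create_dataset("growable", (3, 4), maxshape=(None, 4), dtype="i8")
    dset[0, :] = [1, 2, 3, 4]
    shape_before = dset.shape

    dset.resize((6, 4))  # grow the first dimension
    shape_after = dset.shape

    first_row = dset[0, :].tolist()  # existing data is preserved
    new_row = dset[5, :].tolist()    # new rows hold the fill value (0 by default)

print(shape_before, shape_after, first_row, new_row)
```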

In [16]:
# you can read all the elements of a dataset using the ellipsis operator
out = dset[...]
out

array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])

In [17]:
# you can update portions of the dataset using numpy-like syntax
dset[0,0:4] = [1,2,3,4]
dset[...]

array([[1, 2, 3, 4],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])

In [18]:
# only portions of the dataset that actually get written are stored
# create a really big dataset
f.create_dataset("big_data", (1024,1024,1024), dtype='f4')  # 4 GB dataset!
dset = f['big_data']
dset[512,512,512] = 3.12  # write one element

In [19]:
# read back a small region
dset[510:514,512,512]

array([0.  , 0.  , 3.12, 0.  ], dtype=float32)

Problem: Use hsls -H -v with this file (or h5ls if USE_H5PY=1). With hsls, you may at first see no storage allocated for the file, but this will update within a few seconds.
h5ls will show the allocation immediately since everything is synchronous.
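
With plain h5py, the allocation can also be checked from Python. The sketch below is an h5py-only illustration and is not known to work with h5pyd. It assumes an explicitly chunked dataset, uses a smaller shape than the cell above to keep it light, and relies on the low-level `dset.id.get_storage_size()` call.

```python
import h5py

# in-memory file, nothing is written to disk
with h5py.File("sparse_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("big_data", (256, 256, 256), dtype="f4",
                            chunks=(1, 256, 256))
    logical_bytes = dset.size * dset.dtype.itemsize  # size if fully written

    allocated_before = dset.id.get_storage_size()    # nothing written yet

    dset[128, 128, 128] = 3.12                       # write one element
    f.flush()
    allocated_after = dset.id.get_storage_size()     # only the touched chunk

print(logical_bytes, allocated_before, allocated_after)
```

Only the chunk that contains the written element should be counted, so the allocated size stays far below the logical size of the dataset.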

In [20]:
# Dataset storage is broken up into "chunks".  Each chunk is stored as a separate S3 object.
# Unlike with h5py, datasets on HDF Server are always chunked (even if it is just one chunk!).
# Chunks are determined automatically if not provided in the dataset create call
dset.chunks  
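
For comparison, here is a small sketch of how plain h5py behaves. Without a `chunks` argument the dataset is stored contiguously and `dset.chunks` is `None`, while `chunks=True` asks h5py to pick a layout itself. The file and dataset names are made up for this example, and the sketch does not exercise h5pyd.

```python
import h5py

# in-memory file, nothing is written to disk
with h5py.File("chunks_demo.h5", "w", driver="core", backing_store=False) as f:
    contiguous = f.create_dataset("contiguous", (100, 100), dtype="f4")
    auto = f.create_dataset("auto", (100, 100), dtype="f4", chunks=True)

    contiguous_chunks = contiguous.chunks  # None: no chunking in plain h5py
    auto_chunks = auto.chunks              # a tuple chosen by h5py

print(contiguous_chunks, auto_chunks)
```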

In [21]:
# specify a chunk layout
f.create_dataset("chunked_data", (1024,1024,1024), dtype='f4',chunks=(1,1024,1024))
dset = f["chunked_data"]
dset.chunks

(1, 1024, 1024)

Problem: The server will "correct" chunk layouts that result in chunks that are too small or too large.  Try creating datasets with very small and very large chunks.  What chunk layout do you get?
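
When experimenting with layouts, it can help to work out how many chunks a layout implies and how large each chunk is. The helper below, `chunk_stats`, is a hypothetical convenience function written for this example; it is not part of h5py or h5pyd, and it only does the arithmetic for the layout you request, not for whatever the server ends up choosing.

```python
import math

def chunk_stats(shape, chunks, itemsize):
    """Return (number of chunks, bytes per chunk) for a requested layout."""
    # chunks needed along each dimension, rounding up for partial chunks
    per_dim = [math.ceil(s / c) for s, c in zip(shape, chunks)]
    num_chunks = math.prod(per_dim)
    chunk_bytes = math.prod(chunks) * itemsize
    return num_chunks, chunk_bytes

# the layout used above: 1024 chunks of 4 MiB each
layout_stats = chunk_stats((1024, 1024, 1024), (1, 1024, 1024), 4)

# a very small layout: about a million chunks of 4 KiB each
tiny_stats = chunk_stats((1024, 1024, 1024), (1, 1, 1024), 4)

print(layout_stats, tiny_stats)
```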

In [22]:
# Delete a dataset by using the del operator.  With HDF Server, unlike with the HDF5 library,
# this doesn't leave "holes" in the file.  Only storage that is actually allocated is used.
del f['test1']
list(f)

['big_data', 'chunked_data']

In [23]:
# If you would like a default value other than 0, specify a 
#   fill value when creating the dataset
f.create_dataset("fill_value", (1024,1024,1024), dtype='i4', fillvalue=42)
dset = f['fill_value']

In [24]:
dset[1,2,3:6]  # get 3 elements from the array

array([42, 42, 42], dtype=int32)
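
As a further sketch of fill-value behavior in plain h5py, the example below writes a single element and then reads both that neighborhood and an untouched region. The shape, chunk layout, and names are chosen only for this illustration, and the sketch does not exercise h5pyd.

```python
import h5py

# in-memory file, nothing is written to disk
with h5py.File("fill_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("fill_value", (64, 64), dtype="i4",
                            chunks=(8, 64), fillvalue=42)
    dset[0, 0] = 7                      # write a single element

    stored_fill = int(dset.fillvalue)   # the fill value is kept with the dataset
    row0 = dset[0, 0:3].tolist()        # written element, then fill values
    far_row = dset[63, 0:3].tolist()    # region that was never written

print(stored_fill, row0, far_row)
```

Elements that were never written read back as the fill value, including those that share a chunk with the written element.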

Problem: Run h5ls/hsls -H -v with this file.  How many chunks in the dataset have been allocated?