# 3 Datasets

Objectives
 * Use the h5pyd package to connect with the HDF Kita Server
 * Explore characteristics of Datasets
 * Look at different ways of reading/writing to datasets
 * Examine how chunking works with HDF Server
 * Tricks for best performance

In [1]:
import h5pyd
import numpy as np
import os

In [2]:
#
# Get folder in /home/ that is owned by given user
#
def getMyFolder(username):
    if not username:
        return None
    home_dir = h5pyd.Folder('/home/')  # get folder object for /home/
    myfolder = None
    for name in home_dir:
        # check any folders where the name matches at least part of the username
        # e.g. folder: "/home/bob/" for username "bob@acme.com"
        if username.startswith(name):
            path = '/home/' + name + '/'
            f = h5pyd.Folder(path)
            if f.owner == username:
                myfolder = path
            f.close()
            if myfolder:
                break

    home_dir.close()

    if myfolder is None:
        return None  # no folder owned by this user was found

    # create a workshop subfolder if not already present
    myfolder += "HDFLabTutorial/"
    try:
        h5pyd.Folder(myfolder)
    except IOError as ioe:
        if ioe.errno != 404:
            return None  # unexpected error
        h5pyd.Folder(myfolder, mode='x')
        print("created folder:", myfolder)
       
    return myfolder

In [3]:
# Get your home folder
home = getMyFolder(os.getenv("JUPYTERHUB_USER"))
home  # this is the folder where you have permission to write to

'/home/jreadey/HDFLabTutorial/'

In [4]:
# create a domain on the server
domain_name = home + "03.h5"
f = h5pyd.File(domain_name, 'w')
list(f)

[]

Note: The 'w' mode removes an existing file (if any) and creates a new empty file.
Other supported modes are:
 * 'r': Open as read-only, file must exist
 * 'r+': Read/write, file must exist
 * 'x': Create file, fail if it exists
 * 'a': Read/write if exists, otherwise create

In [5]:
# The only object currently in the new file is the root group.  We can get its id like this
root = f['/']
root.id.id

'g-f81c3e56-8663-11e8-a8b5-0242ac120022'

In [6]:
f.create_dataset("test1", (3,4), dtype='i8')  # we've created a dataset!

<HDF5 dataset "test1": shape (3, 4), type "<i8">

In [7]:
# now something shows up if we list the contents of the file
list(f)

['test1']

In [8]:
# The dataset type is fixed at creation time
dset = f['test1']
dset.dtype

dtype('int64')

In [9]:
# in this case the shape is fixed at creation time, though we'll see later that it is
# possible to create extensible datasets
dset.shape

(3, 4)

In [10]:
# you can read all the elements of a dataset using the ellipsis operator
out = dset[...]
out

array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])

In [11]:
# you can update portions of the dataset using numpy-like syntax
dset[0,0:4] = [1,2,3,4]
dset[...]

array([[1, 2, 3, 4],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])

In [12]:
# only portions of the dataset that actually get written are stored
# create a really big dataset
f.create_dataset("big_data", (1024,1024,1024), dtype='f4')  # 4 GB dataset!
dset = f['big_data']
dset[512,512,512] = 3.12  # write one element

In [13]:
# read back a small region
dset[510:514,512,512]

array([0.  , 0.  , 3.12, 0.  ], dtype=float32)

Problem: Use hsls -H -v with this file.  At first you may see no storage allocated for the file, but this will update in a minute or two.

In [14]:
# Dataset storage is broken up into "chunks".  Each chunk is stored as a separate S3 object.
# Unlike with h5py, datasets are always chunked (even if it is just one chunk!).
# Chunks are determined automatically if not provided in the dataset create call
dset.chunks  

(128, 128, 128)
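
The chunk layout determines which chunk holds a given element. The following is a minimal sketch of that arithmetic in plain Python, using the (128, 128, 128) layout reported above and the 4-byte 'f4' type; `chunk_index` and `chunk_nbytes` are hypothetical helpers written for this illustration, not part of h5pyd.

```python
import math

def chunk_index(coord, chunks):
    """Return the index of the chunk that contains the element at coord."""
    return tuple(c // n for c, n in zip(coord, chunks))

def chunk_nbytes(chunks, itemsize):
    """Return the uncompressed size in bytes of one full chunk."""
    return math.prod(chunks) * itemsize

chunks = (128, 128, 128)   # layout reported by dset.chunks above
itemsize = 4               # 'f4' is a 4-byte float

idx = chunk_index((512, 512, 512), chunks)
size = chunk_nbytes(chunks, itemsize)
print(idx)                              # (4, 4, 4)
print(size // 2**20, "MiB per chunk")   # 8 MiB per chunk
```

Under this layout, the single element written to big_data at (512, 512, 512) falls in one chunk of 8 MiB (uncompressed), which is far less than the 4 GB extent of the dataset.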

In [15]:
# specify a chunk layout
f.create_dataset("chunked_data", (1024,1024,1024), dtype='f4',chunks=(1,1024,1024))
dset = f["chunked_data"]
dset.chunks

(1, 1024, 1024)
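
Which layout works best depends on how the data will be read. As a rough illustration, the sketch below counts how many chunks a rectangular selection intersects under the two layouts used so far. `chunks_touched` is a hypothetical helper for this example, and the counts say nothing about how the server schedules the reads.

```python
def chunks_touched(selection, chunks):
    """Count chunks intersected by a selection given as (start, stop) pairs, one per dimension."""
    count = 1
    for (start, stop), n in zip(selection, chunks):
        first = start // n         # index of the first chunk along this dimension
        last = (stop - 1) // n     # index of the last chunk along this dimension
        count *= last - first + 1
    return count

plane = [(0, 1), (0, 1024), (0, 1024)]    # corresponds to dset[0, :, :]
column = [(0, 1024), (0, 1), (0, 1)]      # corresponds to dset[:, 0, 0]

for layout in [(128, 128, 128), (1, 1024, 1024)]:
    print(layout, chunks_touched(plane, layout), chunks_touched(column, layout))
# (128, 128, 128) 64 8
# (1, 1024, 1024) 1 1024
```

Reading one full plane touches a single chunk with the (1, 1024, 1024) layout, but reading along the first dimension touches every one of its 1024 chunks.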

Problem: The server will "correct" chunk layouts that result in chunks that are too small or too large.  Try creating datasets with very small and very large chunks.  What chunk layout do you get?

In [16]:
# Delete a dataset by using the del operator.  Unlike with the HDF5 library, this doesn't leave
# "holes" in the file.  Only storage that is actually allocated is used
del f['test1']
list(f)

['big_data', 'chunked_data']

In [17]:
# If you would like a default value other than 0, specify a 
#   fill value when creating the dataset
f.create_dataset("fill_value", (1024,1024,1024), dtype='i4', fillvalue=42)
dset = f['fill_value']

In [18]:
dset[1,2,3:6]  # get 3 elements from the array

array([42, 42, 42], dtype=int32)
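
One way to picture why reading an unwritten region returns the fill value is a toy model in which chunks exist only once something has been written to them. The class below is purely conceptual and assumes nothing about how HDF Server is implemented; `SparseChunks` and its methods are hypothetical names used only for this sketch.

```python
import numpy as np

class SparseChunks:
    """Toy model: chunks are kept in a dict and created only on first write."""

    def __init__(self, chunks, dtype, fillvalue=0):
        self.chunks = chunks
        self.dtype = np.dtype(dtype)
        self.fillvalue = fillvalue
        self.allocated = {}   # chunk index -> numpy array holding that chunk

    def _locate(self, coord):
        idx = tuple(c // n for c, n in zip(coord, self.chunks))
        offset = tuple(c % n for c, n in zip(coord, self.chunks))
        return idx, offset

    def write(self, coord, value):
        idx, offset = self._locate(coord)
        if idx not in self.allocated:
            # first write to this chunk: allocate it, initialized to the fill value
            self.allocated[idx] = np.full(self.chunks, self.fillvalue, dtype=self.dtype)
        self.allocated[idx][offset] = value

    def read(self, coord):
        idx, offset = self._locate(coord)
        if idx not in self.allocated:
            return self.dtype.type(self.fillvalue)   # nothing stored: answer with the fill value
        return self.allocated[idx][offset]

store = SparseChunks(chunks=(4, 4, 4), dtype='i4', fillvalue=42)
print(store.read((1, 2, 3)), len(store.allocated))    # 42 0
store.write((5, 5, 5), 7)
print(store.read((5, 5, 5)), store.read((5, 5, 6)), len(store.allocated))   # 7 42 1
```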

Problem: Run hsls -H -v with this file.  How many chunks in the dataset have been allocated?

In [19]:
# open a data file
f = h5pyd.File("/shared/NASA/NCEP3/ncep3.he5", 'r')

Problem: What happens if you open this with the 'a' flag?

In [20]:
# get a dataset
dset = f["/HDFEOS/GRIDS/NCEP/Data Fields/Tair_2m"]

In [21]:
# Get the dimensions
dset.shape

(7850, 720, 1440)

In [22]:
# Read section of the dataset
# For really large datasets it may be impossible to read all the data into memory,
# so often we will need to work with smaller pieces (hyperslabs) of the data

# Time how long it takes to read 5 slices
start_index = 123
end_index = start_index + 5
%time arr = dset[start_index:end_index,::,::]
arr.mean()

CPU times: user 36 ms, sys: 44 ms, total: 80 ms
Wall time: 287 ms


-504.48132
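
Before reading a hyperslab it can help to estimate how much memory the result will need. The sketch below does this from the shape alone, without contacting the server. It assumes 4-byte elements, which is a guess here (check `dset.dtype.itemsize` for the real value), and `selection_shape` and `selection_nbytes` are hypothetical helpers.

```python
import math

def selection_shape(shape, selection):
    """Shape of the array returned for a tuple of slices, one slice per dimension."""
    return tuple(len(range(*s.indices(n))) for s, n in zip(selection, shape))

def selection_nbytes(shape, selection, itemsize):
    """Number of bytes needed to hold the selected elements."""
    return math.prod(selection_shape(shape, selection)) * itemsize

shape = (7850, 720, 1440)   # dset.shape from above
itemsize = 4                # assumption: 4-byte floats; check dset.dtype.itemsize

five = (slice(123, 128), slice(None), slice(None))       # same as [123:128, :, :]
everything = (slice(None), slice(None), slice(None))     # same as [...]

print(selection_shape(shape, five))
print(selection_nbytes(shape, five, itemsize) / 2**20, "MiB")
print(selection_nbytes(shape, everything, itemsize) / 2**30, "GiB")
```

Under that assumption the five slices need roughly 20 MiB, while the whole dataset would need around 30 GiB.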

Problem: How much longer does it take to read 10 slices?

In [None]:
# Sometimes it is useful to cull a (hopefully) representative sample of the data by
# using a "step" value with the selection.  This grabs every "step"-th element.
# No start and stop values are given for the last two dimensions, so start will be 0
# and stop will be the size of the dimension

arr = dset[start_index:end_index,::10,::10]
arr.mean()
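
The selection syntax follows NumPy slicing, so a small in-memory array is enough to try out step values without contacting the server. This sketch uses a made-up 3 x 4 x 5 array rather than the NCEP data.

```python
import numpy as np

a = np.arange(60).reshape(3, 4, 5)   # small stand-in array, values 0..59

# first two planes, every 2nd row, every 2nd column
sub = a[0:2, ::2, ::2]
print(sub.shape)   # (2, 2, 3)
print(sub[0])      # rows 0 and 2, columns 0, 2 and 4 of the first plane
```

Along each dimension, the selection keeps the indices start, start + step, start + 2*step, and so on, up to but not including stop.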

Problem: What are the dimensions of the returned array?

Problem: Does the value of mean change much with different step values?  
Would that be true for any dataset?

In [None]:
# Another way to select elements from a dataset is via "point selection".
# When you provide a set of coordinates, you will get back a 1D list of selected 
# elements.
coordinates = [(123,234,345),(246,46,69),(340,202,888)]
dset[coordinates]
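
NumPy itself does not treat a list of tuples as a list of points, so for comparison with in-memory arrays the coordinates have to be rearranged into one index sequence per dimension. The sketch below shows that equivalent on a small made-up array; it illustrates the semantics of point selection only, not how h5pyd carries it out.

```python
import numpy as np

a = np.arange(2 * 3 * 4).reshape(2, 3, 4)   # small stand-in array, values 0..23
coordinates = [(0, 1, 2), (1, 0, 3), (1, 2, 0)]

# Transpose the list of points into one index sequence per dimension
i, j, k = zip(*coordinates)
points = a[list(i), list(j), list(k)]

print(points)         # [ 6 15 20]
print(points.shape)   # (3,)
```

The result has one element per coordinate, in the order the coordinates were given.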

In [None]:
# If you potentially need to expand a dataset, use the maxshape parameter 
# at creation time.
f = h5pyd.File(domain_name, 'a')  # open original file in append mode
dset = f.create_dataset('resizable', (2,3), maxshape=(20,30))

In [None]:
dset.shape

In [None]:
dset.maxshape

In [None]:
dset.resize((12, 14))

In [None]:
dset.shape

In [None]:
# an error will be returned if you try to go beyond the maxshape bounds...
dset.resize((25,50))

In [None]:
# what if you don't know the maximum size?  None values are interpreted as "unlimited" extent
dset = f.create_dataset('unlimited', (2,3), maxshape=(2,None))

In [None]:
dset.shape

In [None]:
dset.maxshape

In [None]:
dset.resize((2,200))
dset.shape
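
The rule behind the resize calls above can be stated as a small check: each new extent must not exceed the matching maxshape entry, and a None entry imposes no limit. The function below is a sketch of that rule only. `check_resize` is a hypothetical helper, and the ValueError it raises is an assumption made for this example, not necessarily the exception h5pyd reports.

```python
def check_resize(new_shape, maxshape):
    """Raise ValueError if new_shape exceeds maxshape; None means unlimited."""
    if len(new_shape) != len(maxshape):
        raise ValueError("rank cannot change on resize")
    for axis, (new, limit) in enumerate(zip(new_shape, maxshape)):
        if limit is not None and new > limit:
            raise ValueError(f"axis {axis}: {new} exceeds maxshape {limit}")
    return True

print(check_resize((12, 14), (20, 30)))     # True
print(check_resize((2, 200), (2, None)))    # True: second dimension is unlimited
try:
    check_resize((25, 50), (20, 30))
except ValueError as err:
    print("rejected:", err)                 # rejected: axis 0: 25 exceeds maxshape 20
```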

Problem: What do you expect the chunk shape to be for this dataset?