# 3 Datasets

Objectives
 * Use the h5pyd package to connect with the HDF Server
 * Explore characteristics of Datasets
 * Look at different ways of reading/writing to datasets
 * Examine how chunking works with HDF Server
 * Tricks for best performance

In [None]:
import h5pyd
import numpy as np

In [None]:
USERNAME="myusername"  # change this to your username
# create a file on the server
filename = "/home/"+USERNAME+"/workshop/03.h5"
f = h5pyd.File(filename, 'w')
list(f)

Note: The 'w' mode removes any existing file and creates a new, empty file.
Other modes supported are:
 * 'r': Open as read only, file must exist
 * 'r+': Read/write, file must exist
 * 'x': Create file, fail if it exists
 * 'a': Read/write if exists, otherwise create
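
The mode rules above can be read as a small decision table. The sketch below is plain Python and does not call h5pyd; `describe_open` is a hypothetical helper introduced only for illustration, and its return strings paraphrase the list above.

```python
def describe_open(mode, exists):
    """Describe what opening a file with `mode` does (hypothetical helper, not an h5pyd call)."""
    if mode == 'r':
        return "open read-only" if exists else "error: file must exist"
    if mode == 'r+':
        return "open read/write" if exists else "error: file must exist"
    if mode == 'w':
        return "replace with new empty file" if exists else "create new empty file"
    if mode == 'x':
        return "error: file already exists" if exists else "create new empty file"
    if mode == 'a':
        return "open read/write" if exists else "create new empty file"
    raise ValueError("unknown mode: " + repr(mode))

# Print the outcome of each mode for an existing and a missing file
for mode in ('r', 'r+', 'w', 'x', 'a'):
    print(mode, "| exists:", describe_open(mode, True), "| missing:", describe_open(mode, False))
```

Only 'w' discards existing content, which is why it is the mode to be most careful with.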

In [None]:
# The only object currently in the new file is the root group; we can get its id like this
root = f['/']
root.id.id

In [None]:
f.create_dataset("test1", (3,4), dtype='i8')  # we've created a dataset!

In [None]:
# now something shows up if we list the contents of the file
list(f)

In [None]:
# The dataset type is fixed at creation time
dset = f['test1']
dset.dtype

In [None]:
# in this case the shape is fixed at create time, though we'll see later it is possible to
# create extensible datasets
dset.shape

In [None]:
# you can read all the elements of a dataset using the ellipsis operator
out = dset[...]
out

In [None]:
# you can update portions of the dataset using numpy-like syntax
dset[0,0:4] = [1,2,3,4]
dset[...]

In [None]:
# only portions of the dataset that actually get written are stored
# create a really big dataset
f.create_dataset("big_data", (1024,1024,1024), dtype='f4')  # 4 GB dataset!
dset = f['big_data']
dset[512,512,512] = 3.12  # write one element

In [None]:
# read back a small region
dset[510:514,512,512]

Problem: Use hsls -H -v with this file.  At first you may see no storage allocated for the file, but this will update in a minute or two.

In [None]:
# Dataset storage is broken up into "chunks".  Each chunk is stored as a separate S3 object.
# Unlike with h5py, datasets are always chunked (even if it is just one chunk!).
# Chunks are determined automatically if not provided in the dataset create call
dset.chunks  

In [None]:
# specify a chunk layout
f.create_dataset("chunked_data", (1024,1024,1024), dtype='f4',chunks=(1,1024,1024))
dset = f["chunked_data"]
dset.chunks
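
To get a feel for what a chunk layout implies, it can help to work out the arithmetic by hand. The sketch below is plain Python (no server needed); `chunk_stats` and `chunk_index` are hypothetical helpers, and the numbers assume the server keeps the requested layout unchanged.

```python
import math

def chunk_stats(shape, chunks, itemsize):
    """Return (number of chunks, bytes per full chunk) for a dataset layout."""
    # Chunks needed along each dimension, rounding up for partial chunks at the edge
    per_dim = [math.ceil(extent / c) for extent, c in zip(shape, chunks)]
    num_chunks = math.prod(per_dim)
    chunk_bytes = math.prod(chunks) * itemsize
    return num_chunks, chunk_bytes

def chunk_index(coord, chunks):
    """Return the index of the chunk that holds the element at `coord`."""
    return tuple(i // c for i, c in zip(coord, chunks))

# Layout requested above: 1024^3 elements of 'f4' (4 bytes each)
num_chunks, chunk_bytes = chunk_stats((1024, 1024, 1024), (1, 1024, 1024), 4)
print(num_chunks, chunk_bytes)
print(chunk_index((512, 512, 512), (1, 1024, 1024)))
```

With this layout, reading a full plane along the first axis touches a single chunk, while reading a column along the first axis touches one chunk per element.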

Problem: The server will "correct" chunk layouts that result in chunks that are too small or too large.  Try creating datasets with very small and very large chunks.  What chunk layout do you get?

In [None]:
# Delete a dataset by using the del operator.  Unlike with the HDF5 library, this doesn't leave
# "holes" in the file.  Only storage that is actually allocated is used
del f['test1']
list(f)

In [None]:
# If you would like a default value other than 0, specify a 
#   fill value when creating the dataset
f.create_dataset("fill_value", (1024,1024,1024), dtype='i4', fillvalue=42)
dset = f['fill_value']

In [None]:
dset[1,2,3:6]  # get 3 elements from the array
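
The combination of a fill value and write-on-demand storage can be pictured with a toy model. The class below is a one-dimensional sketch in plain Python; it is an assumption-laden illustration of the idea, not how HDF Server is implemented, and `SparseChunkedArray` is a hypothetical name.

```python
class SparseChunkedArray:
    """Toy 1-D model: chunks are allocated only when written (not HDF Server's implementation)."""

    def __init__(self, length, chunk_size, fillvalue=0):
        self.length = length
        self.chunk_size = chunk_size
        self.fillvalue = fillvalue
        self.chunks = {}  # chunk number -> list of values; empty until first write

    def __setitem__(self, index, value):
        if not 0 <= index < self.length:
            raise IndexError(index)
        chunk_no, offset = divmod(index, self.chunk_size)
        if chunk_no not in self.chunks:
            # Allocate on first write, initialized with the fill value
            self.chunks[chunk_no] = [self.fillvalue] * self.chunk_size
        self.chunks[chunk_no][offset] = value

    def __getitem__(self, index):
        if not 0 <= index < self.length:
            raise IndexError(index)
        chunk_no, offset = divmod(index, self.chunk_size)
        chunk = self.chunks.get(chunk_no)
        # Reads of never-written chunks return the fill value without allocating
        return self.fillvalue if chunk is None else chunk[offset]

arr = SparseChunkedArray(length=1_000_000, chunk_size=1000, fillvalue=42)
print(arr[12345], len(arr.chunks))   # read before any write
arr[12345] = 7
print(arr[12345], arr[12346], len(arr.chunks))
```

In this model a read never allocates storage; only the single write creates a chunk, and its untouched elements still read back as the fill value.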

Problem: Run hsls -H -v with this file.  How many chunks in the dataset have been allocated?

In [None]:
# open a data file
f = h5pyd.File("/home/hdf/LOCA/tasmax_day_ACCESS1-0_historical_r1i1p1_19500101-19501231.LOCA_2016-04-02.16th.nc", 'r')

Problem: What happens if you open this with the 'a' flag?

In [None]:
# get a dataset
dset = f['tasmax']

In [None]:
# Get the dimensions
dset.shape

In [None]:
# Read one slice of the dataset
# For really large datasets it may be impossible to read all the data into memory,
# so often we will need to work with smaller pieces of the data

# Time how long it takes to read 1 element
side = 1
%time arr = dset[123,0:side,0:side]

Problem: How much longer does it take to read 100 elements (side=10)?  Or 10,000?
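
`%time` is an IPython magic, so it is only available inside a notebook or IPython shell. For a plain script, a rough equivalent can be built with `time.perf_counter`. In the sketch below, `timed` and `read_square` are hypothetical names, and `read_square` is a local stand-in for `dset[123, 0:side, 0:side]` so the example runs without a server.

```python
import time

def timed(func, *args):
    """Call func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start

def read_square(side):
    # Stand-in for dset[123, 0:side, 0:side]; builds a side x side block locally
    return [[0.0] * side for _ in range(side)]

for side in (1, 10, 100):
    block, elapsed = timed(read_square, side)
    print(side * side, "elements in", round(elapsed, 6), "s")
```

Against a real server, swapping `read_square` for a function that slices the dataset would show how request latency compares with the time spent transferring data.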

In [None]:
# Sometimes it is useful to cull a (hopefully) representative sample of the data by
# using a "step" value with the selection.  This grabs every "step"-th element.
# We didn't give start and stop values in this case, so start will be 0 and stop will be
# the size of the dimension

arr = dset[123,::10,::10]
arr.mean()

Problem: What are the dimensions of the returned array?
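
One way to check an answer is to compute how many elements a stepped slice selects along a single dimension. The sketch below relies on Python's built-in `slice.indices`; `strided_length` is a hypothetical helper, and the extents used are made-up values rather than the real dimensions of `tasmax`.

```python
def strided_length(extent, start=None, stop=None, step=1):
    """Number of elements selected by start:stop:step along a dimension of size `extent`."""
    return len(range(*slice(start, stop, step).indices(extent)))

# Hypothetical extents, not the real dimensions of 'tasmax'
print(strided_length(100, step=10))
print(strided_length(105, step=10))
print(strided_length(100, start=5, stop=50, step=10))
```

Applying the helper to each sliced dimension of `dset.shape` gives the shape of the returned array; a dimension indexed with a single integer is dropped.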

Problem: Does the value of mean change much with different step values?  
Would that be true for any dataset?

In [None]:
# Another way to select elements from a dataset is via "point selection".
# When you provide a set of coordinates, you will get back a 1D list of selected 
# elements.
coordinates = [(123,234,345),(246,46,69),(340,202,888)]
dset[coordinates]
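
To see what "a 1D list of selected elements" means, here is a local sketch of the same idea using nested Python lists instead of a server dataset. `point_select` is a hypothetical helper; it only mimics the shape of the result and says nothing about how h5pyd performs the selection.

```python
# Small 3-D array as nested lists; the value at (i, j, k) encodes its own coordinates
data = [[[100 * i + 10 * j + k for k in range(4)] for j in range(3)] for i in range(2)]

def point_select(array3d, coordinates):
    """Return a flat list with one element per (i, j, k) coordinate."""
    return [array3d[i][j][k] for (i, j, k) in coordinates]

coordinates = [(0, 0, 0), (1, 2, 3), (0, 1, 2)]
print(point_select(data, coordinates))
```

The result has one entry per coordinate, in the order the coordinates were given, regardless of the rank of the source array.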

In [None]:
# If you potentially need to expand a dataset, use the maxshape parameter 
# at creation time.
f = h5pyd.File(filename, 'a')  # open original file in append mode
dset = f.create_dataset('resizable', (2,3), maxshape=(20,30))

In [None]:
dset.shape

In [None]:
dset.maxshape

In [None]:
dset.resize((12, 14))

In [None]:
dset.shape

In [None]:
# an error will be returned if you try to go beyond the maxshape bounds...
dset.resize((25,50))

In [None]:
# what if you don't know the maximum size?  None values are interpreted as "unlimited" extent
dset = f.create_dataset('unlimited', (2,3), maxshape=(2,None))

In [None]:
dset.shape

In [None]:
dset.maxshape

In [None]:
dset.resize((2,200))
dset.shape
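
The resize rules shown above (a fixed bound per dimension, with None meaning unlimited) can be sketched as a small check. `can_resize` is a hypothetical helper written in plain Python; it mirrors the behavior described in the comments and is not the library's own validation code.

```python
def can_resize(new_shape, maxshape):
    """True if new_shape fits within maxshape; None in maxshape means unlimited."""
    if len(new_shape) != len(maxshape):
        return False  # the rank of a dataset cannot change
    return all(limit is None or extent <= limit
               for extent, limit in zip(new_shape, maxshape))

print(can_resize((12, 14), (20, 30)))
print(can_resize((25, 50), (20, 30)))
print(can_resize((2, 200), (2, None)))
print(can_resize((3, 200), (2, None)))
```

Note that an unlimited extent applies per dimension: in the last call the second axis may grow freely, but the first is still capped at 2.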

Problem: What do you expect the chunk shape to be for this dataset?