# Table of Contents
* [Learning Objectives:](#Learning-Objectives:)
	* [HDF5 Summary](#HDF5-Summary)
		* [Composition](#Composition)
		* [Warning](#Warning)
		* [Questions](#Questions)
		* [Exploring an HDF5 file found "in the wild"](#Exploring-an-HDF5-file-found-"in-the-wild")


# Learning Objectives:

* Work with data stored in fast, hierarchical scientific data formats:
  * HDF5

## HDF5 Summary

More details at https://www.hdfgroup.org/HDF5/doc/H5.intro.html

1. HDF5 files that are accessed via h5py store and return NumPy arrays
2. HDF5 files are composed of groups and datasets
3. Storing numerical-ish data is strongly recommended
4. Groups can be accessed both like Python dicts and like Unix filesystem paths
```python
# Full path
hdf5_file['/group1/subgroup2/subsubgroup1']
# Equivalent to:
g = hdf5_file['group1']
g['subgroup2/subsubgroup1']
# Or to nested lookup:
hdf5_file['group1']['subgroup2']['subsubgroup1']
```

5. We won't be covering HDF (aka HDF4).
  * HDF5 and HDF4 are two different formats, even though both come from The HDF Group

### Composition

HDF5 files are composed of **groups** and **datasets**.
A group contains any number of groups and datasets plus supporting metadata.
A dataset is a multidimensional array of data elements plus supporting metadata.

The contents of an HDF5 file are organized like a UNIX filesystem, and objects are addressed by paths.
Every HDF5 file has a group (the root) at "/".

HDF5 groups are somewhat similar to Python dicts.
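
The dict-like and path-like behaviour described above can be sketched as follows. This is a minimal illustration, assuming h5py and NumPy are installed; the scratch file name and the group and dataset names (`measurements`, `temperature`) are made up for the example.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "composition_demo.hdf5")

with h5py.File(path, "w") as f:
    # Assigning to a path creates the intermediate group implicitly.
    f["measurements/temperature"] = np.arange(5)

    root_name = f["/"].name                    # every file has a root group
    top_level = list(f.keys())                 # like dict.keys()
    has_group = "measurements" in f            # like `key in some_dict`
    child_names = list(f["measurements"])      # iterating a group yields names
    dataset_path = f["measurements"]["temperature"].name  # absolute path

print(root_name, top_level, has_group, child_names, dataset_path)
```

Unlike a plain dict, a group also accepts slash-separated paths as keys, which is why the analogy is only "somewhat" accurate.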

### Warning

You may have problems if you try to use both PyTables and h5py in the same Python process.
This has been fixed in recent versions, but older installations are still in use.

* http://stackoverflow.com/questions/28333470/use-both-h5py-and-pytables-in-the-same-python-process
* https://github.com/h5py/h5py/issues/390
  
**ALWAYS** close the HDF5 file no matter what, after each small sequence of accesses.

Opening a file is cheap, so it is safest to enclose each operation you wish to perform in a `with h5py.File("myfile.hdf5", "r") as f: ...` block, which closes the file even if an error occurs.
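
A minimal sketch of that pattern, assuming h5py is installed. The scratch path and the dataset name `numbers` are hypothetical. Truth-testing an h5py file object reports whether it is still open, which makes the effect of the `with` block visible.

```python
import os
import tempfile

import h5py

# Hypothetical scratch file used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "always_close_demo.hdf5")

with h5py.File(path, "w") as f:
    f["numbers"] = [1, 2, 3]
    open_inside = bool(f)        # the file is open inside the block

closed_after = not bool(f)       # the with block closed it on exit

# Each later access gets its own short-lived with block.
with h5py.File(path, "r") as f:
    numbers = f["numbers"][:].tolist()

print(open_inside, closed_after, numbers)
```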

If you do not have h5py installed in your conda environment run
```
% conda install -y h5py
```

In [None]:
# Step 1: Let's make a file!
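import os

import h5py
import numpy as np
import pandas as pd

# Make sure the output directory exists before writing into it
os.makedirs("tmp", exist_ok=True)
filename = "tmp/my_first_hdf5.hdf5"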

# h5py.File can take driver="driver", libver="latest|earliest",
# and userblock_size=<size> arguments. In general, leave those options alone unless
#  - you are using parallel HDF5 (aka MPI). Then set driver="mpio"
#  - you have to squeeze every bit of performance from the application,
#    and don't care if no one else can use the file. Then set libver="latest"
#  - userblock_size is NOT chunking. The user block is some space at the beginning
#    of the file that HDF5 itself ignores.
my_first_hdf5 = h5py.File(filename, mode='w')
my_first_hdf5.close()

# Hurray! We made our first (rather boring) hdf5 file.

In [None]:
# Step 2: Put something in the file
with h5py.File(filename, mode='w') as my_first_hdf5:
    data = list(range(1000))
    my_first_hdf5['dataset1'] = data
    
# This example easily puts a Python list into an HDF5 dataset
# We can (sort of) put arbitrary Python things into HDF5, but we shouldn't. 
# What should we store? Numerical-ish things.
# What should we not store? Whatever we want.
#
# Whatever! I do what I want! 
#   - Eric Cartman (S6E3)

In [None]:
# Step 3: Read the data
with h5py.File(filename, mode='r') as my_first_hdf5:
    data2 = my_first_hdf5['dataset1']

print(data2)
# Hmmm. Instead of getting the data, we got a "closed HDF5 dataset".
# h5py loads data lazily: my_first_hdf5['dataset1'] is only a handle to the data on disk,
# and that handle stopped being usable when the with block closed the file.
#
# Lazy loading is really good!
# What would happen if our dataset was 200GB? Could we load all of that into memory at once?
# Probably not. (Unless you are lucky enough to have access to a server with that much RAM.)
# Even if we had the memory, it probably doesn't make sense to load the whole thing
# before processing it. It is usually smarter to load and process the data
# iteratively, in chunks.
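
The idea of loading and processing piece by piece can be sketched as below. This is an illustration under stated assumptions: h5py and NumPy are installed, and the file name, the dataset name `values`, and the block size are arbitrary choices for the example. "Block" here simply means a slice of rows; it is unrelated to HDF5's on-disk chunk layout.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file and dataset name.
path = os.path.join(tempfile.mkdtemp(), "lazy_demo.hdf5")

with h5py.File(path, "w") as f:
    f["values"] = np.arange(1_000_000, dtype="i8")

block_size = 100_000   # arbitrary; pick something that fits comfortably in RAM
total = 0

with h5py.File(path, "r") as f:
    dset = f["values"]                            # a handle; nothing is read yet
    for start in range(0, dset.shape[0], block_size):
        block = dset[start:start + block_size]    # reads only this slice from disk
        total += int(block.sum())

print(total)
```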

In [None]:
# Step 3a: Actually read the data
with h5py.File(filename, mode='r') as my_first_hdf5:
    data2 = my_first_hdf5['dataset1'][:]

print(type(data2))
print(data2[:10])
# We put a Python list into the dataset, but got a numpy array out.
# Why?

In [None]:
# Step 4: Let's play with groups
with h5py.File(filename, mode='w') as my_first_hdf5:
    g1 = my_first_hdf5.create_group("first")
    # We can create nested groups automatically
    # second, third, and fourth will each be different groups
    g2 = my_first_hdf5.create_group("second/third/fourth")
    # We can create groups under a previously created group
    # Note: g1.create_group instead of my_first_hdf5.create_group
    g3 = g1.create_group("nestedfirst")
    g4 = g1.create_group("nestedsecond")
    # Now the group "first" has two subgroups: "nestedfirst" and "nestedsecond"
    
    g5 = my_first_hdf5.create_group("first/nestedthird")
    
# Questions:
# Where is group "first"? group "second"?
# How many groups are nested under "first"?
# What is the absolute path to group "nestedsecond"?

In [None]:
# What is a group?
# What is a dataset?
# Can a group contain another group?
# Can a group contain a dataset?
    
with h5py.File(filename, mode='r') as my_first_hdf5:
    list_of_groups = []
    # visit() recursively visits every group and dataset in a file
    # It calls the function that is given as an argument, stopping
    #  if that function returns anything other than None
    my_first_hdf5.visit(list_of_groups.append)
    #my_first_hdf5.visit(print)

list_of_groups
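
`visit()` only passes names to the callback. To tell groups and datasets apart, h5py also provides `visititems()`, which passes both the name and the object. The sketch below assumes h5py and NumPy are installed; the file layout and the helper name `classify` are made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file with a small, made-up hierarchy.
path = os.path.join(tempfile.mkdtemp(), "visit_demo.hdf5")

with h5py.File(path, "w") as f:
    f.create_group("first/nestedfirst")
    f["second/values"] = np.arange(3)

groups, datasets = [], []

def classify(name, obj):
    """Sort every visited object into the groups or datasets list."""
    if isinstance(obj, h5py.Group):
        groups.append(name)
    elif isinstance(obj, h5py.Dataset):
        datasets.append(name)
    # Returning None (implicitly) tells visititems() to keep going.

with h5py.File(path, "r") as f:
    f.visititems(classify)

print(sorted(groups), datasets)
```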

In [None]:
# Step 4a: Let's play with groups
with h5py.File(filename, mode='w') as my_first_hdf5:
    g1 = my_first_hdf5.create_group("first")
    # We can create nested groups automatically
    # second, third, and fourth will each be different groups
    g2 = my_first_hdf5.create_group("second/third/fourth")
    # We can create groups under a previously created group
    # Note: g1.create_group instead of my_first_hdf5.create_group
    g3 = g1.create_group("nestedfirst")
    g4 = g1.create_group("nestedsecond")
    # Now the group "first" has two subgroups: "nestedfirst" and "nestedsecond"
    
    g5 = my_first_hdf5.create_group("first/nestedthird")

### Questions

1. Where is group "first"? group "second"?
2. How many groups are nested under "first"?
3. What is the absolute path to group "nestedsecond"?
4. What is a group?
5. What is a dataset?
6. Can a group contain another group?
7. Can a group contain a dataset?

In [None]:
# Step 5: Combining groups and datasets
filename = "tmp/my_second_hdf5.hdf5"
data = [[i+j*10 for i in range(10)] for j in range(100)]
data2 = np.arange(1000).reshape((10,20,5))

with h5py.File(filename, mode='w') as f:
    g = f.create_group("data")
    dset1 = g.create_dataset("dataset1", (100,10), np.dtype('i8'), data=data)
    # We could also have done it like so:
    # f['data/dataset1'] = data
    # What is the difference? create_dataset() is more flexible. It allows us to
    #  - specify size and shape
    #  - specify datatype
    #  - specify chunking
    #  - specify transparent compression
    #  - specify resizability
    dset2 = g.create_dataset("dataset2", data2.shape)
    dset2 = data2

In [None]:
with h5py.File(filename, mode='r') as f:
    dset1 = f['data/dataset1'][:]
    dset2 = f['data/dataset2'][:]
    
print(dset1.shape, "\n", dset1[:1])
print(dset2.shape, "\n", dset2[:1])
# Why is dset2 full of zeros?

In [None]:
with h5py.File(filename, mode='w') as f:
    g = f.create_group("data")
    # Option 1:
    dset2 = g.create_dataset("dataset2", shape=data2.shape, dtype=data2.dtype)
    dset2[:] = data2
    # The [:] is important!
    
    # Option 2:
    # f['data/dataset2'] = data2

with h5py.File(filename, mode='r') as f:
    dset2 = f['data/dataset2'][:]
    
print(dset1.shape, "\n", dset1[:1])
print(dset2.shape, "\n", dset2[:1])

In [None]:
# Iterating over datasets is also easy.
# Remember, each dataset is basically a numpy array that is read from disk on demand
with h5py.File(filename, mode='r') as f:
    for item in f['data/dataset2']:
        print(item)

In [None]:
# Step 6: Deleting datasets from a file
filename = "tmp/my_third_hdf5.hdf5"

with h5py.File(filename, "w") as f:
    f['data/dataset1'] = np.arange(100000).reshape(10,10000)

%ls -l $filename
with h5py.File(filename, "r+") as f:
    del f['data/dataset1']
    %ls -l $filename

# The dataset isn't actually deleted until the file is closed
%ls -l $filename

with h5py.File(filename, "r+") as f:
    try:
        del f['data/dataset1']
    except KeyError:
        print("Trying to delete dataset that doesn't exist")

In [None]:
# Step 6a: Deleting entire groups
with h5py.File(filename, "w") as f:
    f['data/dataset1'] = np.arange(100000).reshape(10,10000)
    f['data/dataset2'] = np.arange(100000,200000).reshape(10,10000)
    f['data/dataset3'] = np.arange(200000,300000).reshape(10,10000)
    
%ls -l $filename
with h5py.File(filename, "r+") as f:
    del f['data']
    %ls -l $filename

# The dataset isn't actually deleted until the file is closed
%ls -l $filename

with h5py.File(filename, "r+") as f:
    l = []
    f.visit(l.append)

# Notice that the file didn't shrink to a small number of bytes.
# The datasets and group have been unlinked, but the space hasn't been reclaimed.
# To shrink the file, we need to run an "h5repack" on it.
l
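
`h5repack` is a command-line tool that ships with HDF5. As a rough alternative from inside Python, the surviving objects can be copied into a fresh file, which leaves the unlinked space behind. The sketch below is one possible approach, not a replacement for `h5repack`; it assumes h5py and NumPy are installed, and the file and dataset names are hypothetical.

```python
import os
import tempfile

import h5py
import numpy as np

workdir = tempfile.mkdtemp()
original = os.path.join(workdir, "with_hole.hdf5")     # hypothetical names
compacted = os.path.join(workdir, "compacted.hdf5")

with h5py.File(original, "w") as f:
    f["data/big"] = np.arange(100000).reshape(10, 10000)
    f["data/keep"] = np.arange(1000)

# Unlink the large dataset; its space stays allocated inside the file.
with h5py.File(original, "r+") as f:
    del f["data/big"]

size_after_delete = os.path.getsize(original)

# Copy every remaining top-level object into a brand-new file.
with h5py.File(original, "r") as src, h5py.File(compacted, "w") as dst:
    for name in src:
        src.copy(name, dst)

size_compacted = os.path.getsize(compacted)

with h5py.File(compacted, "r") as f:
    kept = f["data/keep"][:]

print(size_after_delete, size_compacted)
```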

In [None]:
# Step 7: Updating an existing dataset
filename = "tmp/my_fourth_hdf5.hdf5"

with h5py.File(filename, "w") as f:
    f['data/dataset1'] = np.arange(100000).reshape(10000,10)
    f['data/dataset2'] = np.arange(100000,200000).reshape(10000,10)
    f['data/dataset3'] = np.arange(200000,300000).reshape(10000,10)

In [None]:
# Step 7a: Updating datasets in place
with h5py.File(filename, "r+") as f:
    print(f['data/dataset1'][:10])
    f['data/dataset1'][:5] = -1
    
with h5py.File(filename, "r+") as f:
    print(f['data/dataset1'][:10])

In [None]:
# Step 8: resizing existing datasets
d1 = np.arange(100000).reshape(10000,10)
with h5py.File(filename, "w") as f:
    # make a new dataset that can grow to 10x the initial size
    dset1 = f.create_dataset("resizable/dataset1", d1.shape, 
                             maxshape=(d1.shape[0]*10, d1.shape[1]))
    dset1[:] = d1
    
    # Here is an alternate way to create the dataset
    # f.create_dataset("resizable/dataset1", d1.shape, 
    #                  maxshape=(d1.shape[0]*10, d1.shape[1]), data=d1)
%ls -l $filename    

with h5py.File(filename, "r+") as f:
    # double the size of the dataset
    dset1 = f["resizable/dataset1"]
    print(dset1.shape)
    print(dset1.maxshape)
    dset1.resize(dset1.shape[0]*2, axis=0)
    print(dset1.shape)
    
    dset1[dset1.shape[0]//2:] = d1

%ls -l $filename
with h5py.File(filename, "r+") as f:
    # Check that the dataset is actually the size we want
    dset1 = f["resizable/dataset1"]
    d1 = dset1[:]
    print(d1.shape)
    print(d1[-1])

In [None]:
with h5py.File(filename, "r+") as f:
    # Try to resize again, past the maxshape limit set at creation.
    # This is expected to fail with an error.
    dset1 = f["resizable/dataset1"]
    print(dset1.shape)
    print(dset1.maxshape)
    dset1.resize(dset1.shape[0]*6, axis=0)
    print(dset1.shape)

In order for datasets to be resized, they *must* be chunked.

This chunking happens automatically in some cases, but can be specified. Chunking happens automatically when:

- compression is turned on
- maxshape is specified for the dataset

Intuition about chunking

- Specifying the chunk size is easy to get wrong! Especially when multiple subtle factors are in play:
  - Chunk size
  - Compression
  - Chunk cache size
  - Underlying disk subsystem (especially for parallel filesystems)

http://www.hdfgroup.org/HDF5/doc/Advanced/Chunking/
http://www.hdfgroup.org/HDF5/doc/Advanced/Chunking/Chunking_Tutorial_EOS13_2009.pdf

**If the chunk size is wrong, accessing the data can be 10-100 times slower than normal.**

Moral of the story: Don't set chunking yourself unless you can conclusively demonstrate that it is needed.
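
To see when h5py picks a chunk shape on its own, the `chunks` property of a dataset can be inspected. The sketch below assumes h5py is installed; the file and dataset names are made up. It deliberately does not show the chunk shapes h5py guesses, since those depend on the library's heuristics.

```python
import os
import tempfile

import h5py

# Hypothetical scratch file.
path = os.path.join(tempfile.mkdtemp(), "chunking_demo.hdf5")

with h5py.File(path, "w") as f:
    # No maxshape, no compression: stored contiguously, so no chunks.
    plain = f.create_dataset("plain", shape=(1000, 10), dtype="i8")
    # Resizable along axis 0 (None means unlimited): chunked automatically.
    growable = f.create_dataset("growable", shape=(1000, 10), dtype="i8",
                                maxshape=(None, 10))
    # Compression turned on: chunked automatically.
    zipped = f.create_dataset("zipped", shape=(1000, 10), dtype="i8",
                              compression="gzip")

    plain_chunks = plain.chunks
    growable_chunks = growable.chunks
    zipped_chunks = zipped.chunks

print(plain_chunks, growable_chunks, zipped_chunks)
```

Letting h5py choose the chunk shape, as done here, is in line with the advice above not to set chunking by hand without evidence that it helps.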


In [None]:
# Step 9: HDF5 attributes on groups and datasets
# Step 10: Transparent compression
# - Why transparent compression?
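
Steps 9 and 10 are only named in the cell above. A minimal sketch of both ideas follows, assuming h5py and NumPy are installed. The attribute names (`units`, `scale`, `title`), the file names, and the all-zeros test array are invented for the example; the array is deliberately easy to compress.

```python
import os
import tempfile

import h5py
import numpy as np

workdir = tempfile.mkdtemp()
raw_path = os.path.join(workdir, "raw.hdf5")        # hypothetical names
gz_path = os.path.join(workdir, "compressed.hdf5")

data = np.zeros((500, 500), dtype="f8")   # highly compressible on purpose

with h5py.File(raw_path, "w") as f:
    f.create_dataset("grid", data=data)

with h5py.File(gz_path, "w") as f:
    dset = f.create_dataset("grid", data=data,
                            compression="gzip", compression_opts=4)
    # Attributes are small pieces of metadata attached to an object.
    dset.attrs["units"] = "mm/hr"
    dset.attrs["scale"] = 0.5
    f.attrs["title"] = "demo"             # attributes on the root group

with h5py.File(gz_path, "r") as f:
    dset = f["grid"]
    units = dset.attrs["units"]
    scale = float(dset.attrs["scale"])
    title = f.attrs["title"]
    compression = dset.compression
    roundtrip = dset[:]                   # read exactly as if uncompressed

raw_size = os.path.getsize(raw_path)
gz_size = os.path.getsize(gz_path)
print(units, scale, title, compression, raw_size, gz_size)
```

The compression is "transparent" in the sense that the reading code is identical for both files: decompression happens inside the library when a slice is requested.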

### Exploring an HDF5 file found "in the wild"

In [None]:
import numpy as np
import h5py
metadata = "data/Granule_Metadata.xml"
collection = "data/GES_DISC_GPM_3GPROFF16SSMIS_DAY_V03_dif.xml"
hdf5_precip = "data/3A-DAY.F16.SSMIS.GRID2014R2.20150101-S000000-E235959.001.V03C.HDF5"

In [None]:
import webbrowser, os
try:
    from urllib.parse import quote
except ImportError:
    from urllib import quote # Python 2.7
webbrowser.open("file:///%s/%s" % (os.getcwd(), quote(metadata)))
webbrowser.open("file:///%s/%s" % (os.getcwd(), quote(collection)))

In [None]:
f = h5py.File(hdf5_precip, "r")
list(f.items())

In [None]:
f['InputFileNames']

In [None]:
f['InputFileNames'][0]

In [None]:
inputFileNames = list(f['InputFileNames'])[0].decode().split(',')
inputFileNames

In [None]:
grid_datasets = list(f['Grid'])
grid_datasets

In [None]:
rain = f['Grid']['liquidPrecipFraction']
print(rain)

In [None]:
rain[:5,:10]

We notice a lot of apparent sentinel values in the datasets. The value -9999.90039062 seems to be used as a fill value in a presumably sparse array (the file is not large enough to hold all the data if it were stored densely, as we will see).
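
One common way to work with such fill values is to mask them out before computing statistics. The sketch below uses a small synthetic array rather than the real file. The constant `SENTINEL` is a stand-in for the fill value seen above, and the two "real" readings are invented.

```python
import numpy as np

# Stand-in for the fill value observed in the file (an assumption for this demo).
SENTINEL = np.float32(-9999.9)

# Synthetic grid: mostly fill values with two made-up real readings.
grid = np.full((4, 5), SENTINEL, dtype=np.float32)
grid[1, 2] = 0.25
grid[3, 0] = 0.75

# Option 1: a masked array hides everything below the threshold.
masked = np.ma.masked_less(grid, -9999)
valid_count = int(masked.count())     # number of unmasked cells
valid_mean = float(masked.mean())     # mean over unmasked cells only

# Option 2: replace fill values with NaN and use the nan-aware functions.
as_nan = np.where(grid < -9999, np.nan, grid)
nan_mean = float(np.nanmean(as_nan))

print(valid_count, valid_mean, nan_mean)
```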

In [None]:
# Let us see which datasets have meaningful values, and how commonly
for dataset in grid_datasets:
    data = f['Grid'][dataset]
    # Anything at or above -9999 is real data; the sentinel is about -9999.9
    non_sentinel = data[:] >= -9999
    print(dataset, "has real data in %d of %d positions" % (
                    non_sentinel.sum(), data.size))
    print("-", data)

In [None]:
pd.DataFrame(rain[:10,:10])

In [None]:
pd.DataFrame(rain[705:716,400:411])

In [None]:
rain_values = rain[:]  # read the dataset from disk once
drizzle = (.1 < rain_values) & (rain_values < .9)
drizzle.sum()

In [None]:
times = f['InputGenerationDateTimes']
times[0].decode('utf-8').split(',')

In [None]:
list(f.attrs.keys())

In [None]:
f.attrs['FileInfo'].decode('utf-8').split('\n')

In [None]:
f.attrs['FileHeader'].decode('utf-8').split('\n')

In [None]:
# We've already seen that mixedWater holds only those sentinel values,
# but it is a handy example of working with N-dimensional data.
# pd.Panel was removed in pandas 0.25, so we keep the data as a
# NumPy array and wrap 2-D slices in a DataFrame for display.
mixedWater = f['Grid']['mixedWater'][:]
mixedWater.shape

In [None]:
pd.DataFrame(mixedWater[10:15,700,400:411])

Creating a new HDF5 file with an empty dataset looks like this:

```python
>>> import h5py
>>> import numpy as np
>>> f = h5py.File("mytestfile.hdf5", "w")
>>> dset = f.create_dataset("mydataset", (100,), dtype='i')
```