# HDF5 files

Author: Julian Lißner<br>
For questions and feedback please write a mail to: [lissner@mib.uni-stuttgart.de](mailto:lissner@mib.uni-stuttgart.de)

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import h5py
import sys 
import os
import read_h5 as read
sys.path.extend( ['provided_functions', 'incomplete_functions'] )
os.environ["HDF5_USE_FILE_LOCKING"] = "FALSE"
import result_check as check 
import subfunctions as funcs

## Writing hdf5 files

- throughout this lab, if you cannot open the file, restart the kernel of the notebook
- hdf5 files have to be 'open' for read/write access
- `h5py` interfaces python and hdf5 files
- install `h5py` by typing `conda install h5py` into the terminal
- files can be opened with different permissions to prevent accidental data deletion
- write ( `'w'`) and append ( `'a'`) automatically create a file if it does not exist
- care: write ( `'w'`) always creates a new file and overwrites an existing file of the same name
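
The difference between write and append can be tried out safely on a scratch file. The following is a minimal sketch; the temporary directory, the file name `modes_demo.h5` and the dataset names are assumptions made up for this illustration.

```python
import os
import tempfile

import h5py

# hypothetical scratch file in a temporary directory, so no real data is touched
path = os.path.join(tempfile.mkdtemp(), 'modes_demo.h5')

# 'w' creates the file (and would truncate an existing one)
with h5py.File(path, 'w') as f:
    f.create_dataset('first', data=[1, 2, 3])

# 'a' opens the existing file and keeps its content
with h5py.File(path, 'a') as f:
    f.create_dataset('second', data=[4, 5, 6])
    names_after_append = sorted(f.keys())

# 'w' again: the previous content is overwritten
with h5py.File(path, 'w') as f:
    names_after_write = sorted(f.keys())

print('after append:', names_after_append)
print('after write: ', names_after_write)
```

After the second `'w'` the file is empty again, which is why `'w'` should be used with care on files that already contain results.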

------
__Task:__ Create a hdf5 file and allocate storage for an array of shape (128,128, 3). Recall that `help( h5py.command)` will also show the arguments of the functions.

In [2]:
h5file.close()

NameError: name 'h5file' is not defined

In [None]:
filename = 'temp_data.h5' #create the file 'temp_data.h5' in the current working directory
h5file = h5py.File( filename, 'w' )
my_data = h5file.create_dataset( 'dataset1', shape=(128,128, 3) ) 
#h5file.close()
#read.display_all_data(h5file)

In [3]:
read.display_all_data(h5file)

NameError: name 'h5file' is not defined

- `my_data` is a pointer (handle) to the dataset stored on your hard drive
- the array can be accessed directly through the pointer; it does not have to be loaded into memory first
- datasets can be filled step by step through pointers or written in one go on creation (see below)
- when writing entries into the dataset care has to be taken<br>
    $\quad$ `my_data[ :,0,1] = np.arange(128)` $\quad$ writes the values into the first column of the second channel of `my_data` <br>
    $\quad$ `my_data = np.ones( (128, 128, 3))`  $\quad$ sets a local variable called `my_data`
- in the second statement, a new variable is created and the pointer to the dataset is lost
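
The two statements above behave very differently, which can be checked with a small experiment. This is a sketch on a hypothetical scratch file (`handle_demo.h5` is an assumed name); it only illustrates slice assignment versus rebinding a name.

```python
import os
import tempfile

import h5py
import numpy as np

# hypothetical scratch file, not the file used in the lab
path = os.path.join(tempfile.mkdtemp(), 'handle_demo.h5')

with h5py.File(path, 'w') as f:
    my_data = f.create_dataset('dataset1', shape=(128, 128, 3))

    # slice assignment writes through the pointer into the file
    my_data[:, 0, 1] = np.arange(128)
    is_dataset_before = isinstance(my_data, h5py.Dataset)

    # plain assignment rebinds the name to a numpy array, the file is untouched
    my_data = np.ones((128, 128, 3))
    is_dataset_after = isinstance(my_data, h5py.Dataset)

    # the pointer can be recovered by name
    my_data = f['dataset1']
    stored_column = my_data[:5, 0, 1]
    untouched_entry = float(my_data[0, 0, 0])

print(is_dataset_before, is_dataset_after)
print(stored_column, untouched_entry)
```

The entry that was never written keeps its fill value, so the `np.ones` array never reached the file.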
    
----
__Task:__ Fill the dataset `my_data` with all ones.

In [5]:
my_data = h5file['dataset1'] #fix if pointer was lost
my_data [ :,:,:] = 1
check.allocation( my_data)

pointer to the dataset set


### Metadata
- metadata is data describing data
- metadata should also contain information to reproduce the data
- in this example, it should at least contain author, creation date and h5py version
- in h5py, metadata is handled similarly to python dictionaries

- data can be neatly organized within the hdf5 file
- folders and metadata enable good management and description of data
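
As a sketch of the dictionary-like handling, metadata can be attached through the `.attrs` attribute of a dataset or group. The file name, the keys and the author name `'Jane Doe'` below are placeholders chosen for illustration, not prescribed by the lab.

```python
import datetime
import os
import tempfile

import h5py
import numpy as np

# hypothetical scratch file
path = os.path.join(tempfile.mkdtemp(), 'attrs_demo.h5')

with h5py.File(path, 'w') as f:
    dset = f.create_dataset('dataset1', data=np.zeros(4))

    # single entries are set like dictionary items
    dset.attrs['author'] = 'Jane Doe'  # placeholder name
    # several entries at once via update()
    dset.attrs.update({'creation_date': str(datetime.date.today()),
                       'h5py_version': h5py.__version__})

    # read everything back into a plain python dictionary
    stored = dict(dset.attrs)

for key in sorted(stored):
    print(key, ':', stored[key])
```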

-------------------------------
__Task:__ Create a subfolder in your hdf5 file and attach metadata to the folder.

In [6]:
group = h5file.create_group( 'subfolder')
group.attrs.update( dict( created_by='group', creation_date='today' ) )

- `group` is now a pointer to the group/folder in the hdf5 file
- a dataset could be created e.g. by <br>
 $\quad$ `group.create_dataset(..)` $\qquad$ or<br>
 $\quad$ `h5file[ 'subfolder'].create_dataset(..)` <br>
- elements in the hdf5 file are accessed by name, or by reference (pointer variable)
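
The access variants can be compared side by side. A minimal sketch, assuming a scratch file and made-up dataset names (`via_handle`, `via_name`, `via_path`):

```python
import os
import tempfile

import h5py
import numpy as np

# hypothetical scratch file
path = os.path.join(tempfile.mkdtemp(), 'group_demo.h5')

with h5py.File(path, 'w') as f:
    group = f.create_group('subfolder')

    # by reference (pointer variable)
    group.create_dataset('via_handle', data=np.arange(3))
    # by name
    f['subfolder'].create_dataset('via_name', data=np.arange(3))
    # by a path relative to the file root
    f.create_dataset('subfolder/via_path', data=np.arange(3))

    names = sorted(group.keys())
    same_object = group == f['subfolder']

print(names, same_object)
```

All three datasets end up in the same group, so the choice is mostly a matter of readability.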

---
__Task:__ Create a dataset containing the `random_array` in the 'subfolder' and attach metadata to it. 

In [7]:
random_array = np.random.rand( 219)
print( 'pointer is the same as reference?', group == h5file[ 'subfolder'] )
dset = group.create_dataset('dataset3' , data = random_array , compression='gzip')#TODO #create a dataset in the subfolder, use the 'data=' argument
metadata = { 'key1':h5py.__version__, 'date' : '07.11.2021' }
dset.attrs.update( metadata ) #add the metadata
check.metadata( dset.attrs)

pointer is the same as reference? True
metadata nicely set


- after you are done, you should always close the file
- thereafter, all pointers into the file are invalid and can no longer be used to access data
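
One way to make sure the file is always closed is a `with` block, which closes the file when the block is left. The sketch below uses a hypothetical scratch file and relies on h5py objects evaluating to `False` once their file is closed.

```python
import os
import tempfile

import h5py
import numpy as np

# hypothetical scratch file
path = os.path.join(tempfile.mkdtemp(), 'close_demo.h5')

with h5py.File(path, 'w') as f:
    dset = f.create_dataset('dataset1', data=np.arange(10))
    in_memory = dset[:]             # copy into a numpy array
    valid_inside = bool(dset)       # pointer is usable while the file is open

# the file is closed here, the pointer is invalid
valid_outside = bool(dset)

print('pointer valid inside/outside:', valid_inside, valid_outside)
print('in-memory copy still usable, sum =', in_memory.sum())
```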

---
__Task:__ Close the open hdf5 file.

In [8]:
h5file.close()
print( my_data[10:15, 3, 1] )
#print statement should raise an error <not a dataset>

ValueError: Not a dataset (not a dataset)

In [9]:
h5file.close()

-----------------
## Accessing hdf5 files

- reading hdf5 files also works with pointers
- when assigning a variable to a dataset, the dataset is not loaded into memory (unless specified, e.g. with `[:]`)
- read/write operations are generally slower when accessing data directly from the hard drive
- the loss of computational speed is, in most cases, negligible
- references to datasets are lost after the file is closed
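
The difference between a pointer and an in-memory copy shows up in the object types. A small sketch on a hypothetical scratch file with an assumed dataset name `image`:

```python
import os
import tempfile

import h5py
import numpy as np

# hypothetical scratch file with one small 'image'
path = os.path.join(tempfile.mkdtemp(), 'read_demo.h5')
with h5py.File(path, 'w') as f:
    f.create_dataset('image', data=np.arange(16).reshape(4, 4))

with h5py.File(path, 'r') as f:
    handle = f['image']         # pointer only, nothing is read yet
    full_copy = f['image'][:]   # the whole dataset as a numpy array
    first_row = handle[0, :]    # reads only this slice from the file

    handle_type = type(handle).__name__
    copy_type = type(full_copy).__name__

print(handle_type, copy_type)
print(first_row)
# the copy survives closing the file
print(full_copy.mean())
```

Slicing through the pointer is useful for large datasets, since only the requested part has to fit into memory.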

---
__Task:__ Read two datasets from the provided hdf5 file. Load the first one into memory, set a pointer to the second dataset.

In [2]:
h5file = h5py.File( 'data/images.h5', 'r' ) #set read permissions
check.permissions( h5file)
dset_0 = h5file[ 'image_data/dset_0'][:]
dset_1 = h5file[ 'image_data/dset_1'][:] #dset_1 in the same folder
#del h5file[result]
print(dset_0)
print(np.shape(dset_0))
print(dset_1)
print(np.shape(dset_1))
print( 'mean value of dset_0:', dset_0.mean() )
print( 'mean value of dset_1:', dset_1.mean() )
h5file.visit( print) 
h5file.close()
print( '################ file closed ################' )

print( 'mean value of dset_0:', dset_0.mean() )
try:
    print( 'mean value of dset_1:', dset_1.mean() )
except:
    print( 'could not read dset_1!' ) 


[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
(400, 400)
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
(400, 400)
mean value of dset_0: 0.32746875
mean value of dset_1: 0.32349375
dset_0
image_data
image_data/dset_0
image_data/dset_1
image_data/dset_10
image_data/dset_11
image_data/dset_2
image_data/dset_3
image_data/dset_4
image_data/dset_5
image_data/dset_6
image_data/dset_7
image_data/dset_8
image_data/dset_9
results12
results12/dset_1
results12/dset_2
results12/dset_3
results12/dset_4
results12/dset_5
results12/dset_6
results12/dset_7
results12/dset_8
results12/dset_9
################ file closed ################
mean value of dset_0: 0.32746875
mean value of dset_1: 0.32349375


--- 
__Task:__ Open the file 'data/images.h5' and write the result of `do_image_stuff` in the results subfolder.

In [3]:
result = funcs.do_image_stuff( dset_0 )
h5file = h5py.File( 'data/images.h5', 'r+' ) 
#TODO Create the subfolder and write the result in there, give it the name 'dset_0'
#TODO use 'compression='gzip' on the create_dataset function
#h5file.create_dataset('dset_0' , data= result, compression='gzip')#TODO ...
print(result)
print(h5file['dset_0'][:])
print(h5file['image_data/dset_0'][:])
print(h5file['results12/dset_1'][:])
h5file.close()

[[0.32746875 0.32459375 0.321725   ... 0.3188625  0.321725   0.32459375]
 [0.324475   0.3231125  0.32075625 ... 0.3185375  0.32115625 0.32345625]
 [0.32148125 0.3205625  0.31875625 ... 0.31733125 0.31946875 0.3210375 ]
 ...
 [0.31849375 0.31829375 0.3172125  ... 0.31446875 0.31633125 0.31775625]
 [0.32148125 0.3210375  0.31946875 ... 0.31655625 0.31875625 0.3205625 ]
 [0.324475   0.32345625 0.32115625 ... 0.31813125 0.32075625 0.3231125 ]]
[[0.32746875 0.32459375 0.321725   ... 0.3188625  0.321725   0.32459375]
 [0.324475   0.3231125  0.32075625 ... 0.3185375  0.32115625 0.32345625]
 [0.32148125 0.3205625  0.31875625 ... 0.31733125 0.31946875 0.3210375 ]
 ...
 [0.31849375 0.31829375 0.3172125  ... 0.31446875 0.31633125 0.31775625]
 [0.32148125 0.3210375  0.31946875 ... 0.31655625 0.31875625 0.3205625 ]
 [0.324475   0.32345625 0.32115625 ... 0.31813125 0.32075625 0.3231125 ]]
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 

- opening an hdf5 file with `a` permissions succeeds whether or not the file exists
- a file is created if it doesn't exist, and appended to if it does exist
- `r+` permissions only work if the file exists. This is considerably safer for debugging, because a typo in the filename raises an error instead of silently creating a new, empty file.
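
The behaviour for a missing file can be checked directly. A minimal sketch, where `does_not_exist.h5` is a made-up filename in a temporary directory:

```python
import os
import tempfile

import h5py

# hypothetical path to a file that does not exist yet
missing = os.path.join(tempfile.mkdtemp(), 'does_not_exist.h5')

# 'r+' refuses to open a missing file
try:
    h5py.File(missing, 'r+')
    rplus_failed = False
except OSError:
    rplus_failed = True
exists_after_rplus = os.path.exists(missing)

# 'a' silently creates it
with h5py.File(missing, 'a'):
    pass
exists_after_append = os.path.exists(missing)

print(rplus_failed, exists_after_rplus, exists_after_append)
```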

--- 
__Task:__ Open the file 'data/images.h5' with `r+` permissions. Compute the results of further datasets without storing them into memory. 

In [4]:
h5file = h5py.File( 'data/images.h5', 'r+' ) 
result_group = h5file.create_group('results12')
for i in range( 1, 10):
    dset = 'dset_{}'.format( i )
    image_path = 'image_data/' + dset
    image = h5file[image_path][:]
    result = funcs.do_image_stuff(image)
    result_group.create_dataset( dset, data=result, compression='gzip' )
h5file.close()

ValueError: Unable to create group (name already exists)
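
The error above appears when the cell is run a second time, because the group already exists in the file. One possible way to make such a cell re-runnable, sketched here on a hypothetical scratch file with placeholder data, is `require_group`, which returns the existing group or creates it, combined with deleting a dataset before writing it again.

```python
import os
import tempfile

import h5py
import numpy as np

# hypothetical scratch file that already contains the group
path = os.path.join(tempfile.mkdtemp(), 'rerun_demo.h5')
with h5py.File(path, 'w') as f:
    f.create_group('results')

with h5py.File(path, 'r+') as f:
    # create_group fails on an existing name
    try:
        f.create_group('results')
        second_create_failed = False
    except ValueError:
        second_create_failed = True

    # require_group opens the group if present, creates it otherwise
    result_group = f.require_group('results')
    for run in range(2):  # simulate running the cell twice
        for i in range(1, 4):
            name = 'dset_{}'.format(i)
            if name in result_group:
                del result_group[name]  # overwrite results of an earlier run
            result_group.create_dataset(name, data=np.full(2, i), compression='gzip')

    names = sorted(result_group.keys())

print(second_create_failed, names)
```

Whether old results should be overwritten or kept is a design decision; deleting them silently is only sensible if they can be recomputed.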

----- 
__Task:__ Open the hdf5 file once again and plot 9 'results' without storing the datasets in memory.

In [None]:
h5file = h5py.File( 'data/images.h5', 'r' ) 
fig, axes = plt.subplots( 3, 3, figsize=(16,16))
axes = axes.flatten()
for i in range( 1, 10): #'results12' contains dset_1 ... dset_9
    result = h5file['results12/dset_{}'.format( i )]
    axes[i-1].imshow( result)
    check.memory( result)

for ax in axes.flatten():
    ax.axis( 'off')
plt.show()

h5file.close()
plt.close('all')

In [3]:
plt.close('all')

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Come up with x and y
x = np.arange(0, 5, 0.1)
y = np.sin(x)

# Just print x and y for fun
print (x)
print (y)

# Plot the x and y and you are supposed to see a sine curve
plt.plot(x, y)

# Without the line below, the figure won't show
#plt.show()