# Datasets, Groups & Attributes

**Source:** *Python and HDF5* by Andrew Collette, O'Reilly 2013.



In [1]:
import numpy as np
import h5py
path_file='' # insert here the path of this notebook

## 1. Datasets

In this section, we take a closer look at HDF5 datasets. We'll learn the basic **C**reate, **R**ead, **U**pdate, and **D**elete (CRUD) operations.

### A) Dataset = (Element)Type + (Logical)Shape + Value 

In [2]:

f = h5py.File(path_file+"toy_experiment.hdf5", "r")

In [3]:
dataset = f["/series_1/energy"]
dataset.dtype

dtype('<f8')

In [4]:
dataset.shape

(100,)

In [5]:
o=dataset[...]
o

array([136.83069729, 159.11721028, 154.91997737, 145.16674403,
       151.11199007, 148.65033647, 156.64818339, 146.26886367,
       144.81526596, 151.6653548 , 151.35309667, 152.25451221,
       157.15515241, 140.8178072 , 154.40684039, 157.96453193,
       154.27456452, 149.49763498, 147.45165671, 152.13355776,
       148.8129444 , 154.51765225, 152.47338361, 142.34103515,
       146.85172096, 143.20337399, 141.13620853, 146.51441332,
       146.90505853, 153.0742921 , 149.98800347, 162.91503184,
       137.53193938, 142.70420409, 157.42495374, 156.51219119,
       143.80343862, 154.8627704 , 151.35667145, 141.06108154,
       139.25576543, 141.35410052, 145.50258803, 147.89772072,
       148.60758223, 141.34530272, 154.07850245, 140.56066701,
       139.22917227, 146.63095395, 155.10671188, 136.28012674,
       153.6927526 , 155.58903649, 139.43253001, 152.13614536,
       150.44139213, 150.27663615, 143.51059344, 154.90196918,
       160.80845276, 148.53863951, 155.17049036, 148.43

Sometimes people refer to datasets as *array variables*. Every variable has a type and value.
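The three ingredients can be inspected on any dataset. The following is a minimal sketch on a throwaway in-memory file; the file name, the dataset name `energy` and the generated values are assumptions made only for illustration, not the contents of `toy_experiment.hdf5`.

```python
import numpy as np
import h5py

# In-memory file: nothing is written to disk; the file name is only a placeholder.
with h5py.File("sketch_dataset.hdf5", "w", driver="core", backing_store=False) as demo:
    ds = demo.create_dataset("energy", data=np.linspace(130.0, 160.0, 100))

    elem_type = ds.dtype       # (element) type
    logical_shape = ds.shape   # (logical) shape
    values = ds[...]           # value, read into a NumPy array

print(elem_type, logical_shape, type(values).__name__)
```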

### B) Create

#### I. From a NumPy Array
We have already seen that in Python we may create HDF5 datasets directly from NumPy arrays or from the ground up (*greater control!*). We did so already for the weather.hdf5 file.

In [6]:
f = h5py.File(path_file+"testfile.hdf5", "w", libver="latest", driver="core")

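(With the `driver="core"` keyword argument we instruct `h5py` to keep the HDF5 file in a memory buffer while it is open. By default the buffer is still saved to a file on disk when the file is closed; pass `backing_store=False` to avoid that.)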
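As a hedged illustration of the `core` driver, the sketch below opens a purely in-memory file with `backing_store=False` and checks that no file appears on disk. The temporary directory and the file name `scratch.hdf5` are hypothetical.

```python
import os
import tempfile

import numpy as np
import h5py

tmp_dir = tempfile.mkdtemp()
scratch_path = os.path.join(tmp_dir, "scratch.hdf5")  # hypothetical file name

# backing_store=False keeps everything in memory and discards it on close.
with h5py.File(scratch_path, "w", driver="core", backing_store=False) as mem:
    mem["x"] = np.arange(10)
    total = int(mem["x"][...].sum())

on_disk = os.path.exists(scratch_path)
print(total, on_disk)
```

This variant is convenient for experiments and tests where the data does not need to outlive the session.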

In [7]:
arr = np.ones((5,2))
f["my_first_dataset"] = arr

In [8]:
dset = f["my_first_dataset"]
dset

<HDF5 dataset "my_first_dataset": shape (5, 2), type "<f8">

#### II. From Scratch

In [9]:
dset = f.create_dataset("my_scrath_dset1", (10, 10))
dset

<HDF5 dataset "my_scrath_dset1": shape (10, 10), type "<f4">

In [10]:
dset2 = f.create_dataset("my_scrath_dset2", (10, 10), dtype=np.float64)
dset2

<HDF5 dataset "my_scrath_dset2": shape (10, 10), type "<f8">

There are several other keyword arguments, for example, to control the dataset layout in the file, compression, etc. Check out the `h5py` documentation or Andrew's book for details. 
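A few of these keyword arguments are sketched below: `chunks`, `compression`, `compression_opts` and `fillvalue`. The dataset name and the particular chunk size and compression level are arbitrary choices for illustration, not recommendations.

```python
import numpy as np
import h5py

with h5py.File("sketch_layout.hdf5", "w", driver="core", backing_store=False) as demo:
    ds = demo.create_dataset(
        "compressed",
        shape=(1000, 1000),
        dtype=np.float32,
        chunks=(100, 100),      # store the data in 100x100 tiles
        compression="gzip",     # compress each chunk with gzip
        compression_opts=4,     # gzip level (0-9)
        fillvalue=-1.0,         # value reported for elements never written
    )
    layout = (ds.chunks, ds.compression, ds.compression_opts)
    untouched = float(ds[0, 0])  # nothing written yet, so the fill value is returned

print(layout, untouched)
```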

### C) Create large/empty dataset

In [11]:
dset_large =f.create_dataset("my_big_dset", (1024**3,), dtype=np.float64)
dset_large[0:8192]= np.arange(8192)

In [12]:
list(f.keys())

['my_big_dset', 'my_first_dataset', 'my_scrath_dset1', 'my_scrath_dset2']

In [13]:
f.flush()

### D) Working with resizable datasets 

When you create a dataset, in addition to setting its shape, you can make it resizable up to a certain maximum set of dimensions. In `h5py` this is called `maxshape`.

Like `shape`, `maxshape` is specified when the dataset is created, but it cannot be changed afterwards. If you don’t explicitly choose a `maxshape`, HDF5 will create a non-resizable dataset and set `maxshape = shape`.
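A common use of `maxshape` is a dataset that grows along one axis. The sketch below assumes a hypothetical dataset `log` with an unlimited first axis (`None`) and a fixed second axis, and appends one row at a time with `resize`.

```python
import numpy as np
import h5py

with h5py.File("sketch_resize.hdf5", "w", driver="core", backing_store=False) as demo:
    # First axis unlimited (None), second axis fixed at 3 columns.
    log = demo.create_dataset("log", shape=(0, 3), maxshape=(None, 3), dtype=np.float64)

    for step in range(4):
        new_len = log.shape[0] + 1
        log.resize(new_len, axis=0)                 # grow by one row
        log[new_len - 1, :] = np.full(3, float(step))

    final_shape = log.shape
    max_shape = log.maxshape
    last_row = log[final_shape[0] - 1, :]

print(final_shape, max_shape, last_row)
```

Growing the second axis beyond 3 would fail in the same way as the `RuntimeError` shown in this section, because only the first axis was declared unlimited.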

In [14]:
dset2 = f.create_dataset('my_finite_resizable_set', (2,2), maxshape=(2,2))

In [15]:
dset2.shape

(2, 2)

Let us try to resize...

In [16]:
dset2.resize((1,1))
dset2.shape

(1, 1)

Nice! Let's try again

In [17]:
dset2.resize((3,3))

RuntimeError: Unable to set dataset extent (dimension cannot exceed the existing maximal size (new: 3 max: 2))

Because we fixed `maxshape` at (2, 2), we cannot resize the dataset to a larger shape. To lift this limit, we can create the dataset with unlimited dimensions by passing `None` in `maxshape`:

In [19]:
dset = f.create_dataset('my_not_finite_resizable_set', (2,2), dtype=np.int32, maxshape=(None,None))

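ValueError: Unable to create dataset (name already exists)

(This error only appears because the cell was executed a second time; the first run had already created the dataset, so `dset` still refers to it.)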

In [20]:
dset.resize((10,10))
dset.shape

(10, 10)

### E) Read

NumPy-style slicing and dicing

In [21]:
out = dset2[...]

In [23]:
out

array([[0.]], dtype=float32)

In [22]:
type(out)

numpy.ndarray

In [23]:
dset[1:3,2:4]

array([[0, 0],
       [0, 0]], dtype=int32)

### F) Update

NumPy-style slicing and dicing

In [24]:
dset[1:4,1] = 2.0

In [25]:
dset[:,1]

array([0, 2, 2, 2, 0, 0, 0, 0, 0, 0], dtype=int32)
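One detail behind the Read and Update steps is worth spelling out: slicing a dataset returns an in-memory NumPy copy, so changes reach the file only through an explicit assignment to the dataset. A minimal sketch, with a hypothetical dataset named `grid`:

```python
import numpy as np
import h5py

with h5py.File("sketch_slicing.hdf5", "w", driver="core", backing_store=False) as demo:
    grid = demo.create_dataset("grid", shape=(10, 10), dtype=np.int32)

    grid[1:4, 1] = 2.0             # the float is converted to the dataset's int32 type
    column = grid[:, 1]            # reads one column into a new NumPy array

    block = grid[0:5, 0:3]         # an in-memory copy, not a view of the file
    block[:] = 99                  # changes only the copy
    on_file_before = grid[0:5, 0:3]

    grid[0:5, 0:3] = block         # writing back is an explicit assignment
    on_file_after = grid[0:5, 0:3]

print(column)
print(on_file_before.max(), on_file_after.min())
```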

### G) Delete

The objects in an HDF5 file (groups, datasets, datatype objects) are interlinked. Deleting an object means first and foremost to *unlink* the object. The storage of the underlying object *may or may not* be freed as a result. We'll return to this point later.
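In practice, unlinking is done with Python's `del` statement on the group. The sketch below uses two hypothetical datasets, `a` and `b`, and shows that a removed name can no longer be looked up.

```python
import numpy as np
import h5py

with h5py.File("sketch_delete.hdf5", "w", driver="core", backing_store=False) as demo:
    demo["a"] = np.arange(5)
    demo["b"] = np.arange(5)

    del demo["a"]                  # removes the link named "a" from the root group
    remaining = list(demo.keys())

    try:
        demo["a"]
        lookup_failed = False
    except KeyError:               # the name no longer resolves to an object
        lookup_failed = True

print(remaining, lookup_failed)
```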

In [26]:
list(f.keys())

['my_big_dset',
 'my_finite_resizable_set',
 'my_first_dataset',
 'my_not_finite_resizable_set',
 'my_scrath_dset1',
 'my_scrath_dset2']

In [27]:
del f['my_first_dataset']
list(f.keys())

['my_big_dset',
 'my_finite_resizable_set',
 'my_not_finite_resizable_set',
 'my_scrath_dset1',
 'my_scrath_dset2']

In [27]:
f

<HDF5 file "testfile.hdf5" (mode r+)>

In [28]:
!ls -lrt *hdf5

-rw-r--r--  1 federica  staff       14336 15 Ott 11:01 toy_experiment.hdf5
-rw-r--r--  1 federica  staff  8590000128 15 Ott 11:15 testfile.hdf5


### H) Little endian vs. big endian (optional)

HDF5 is designed to preserve data in any format you want. Let us play with endianness, which relates to how multibyte numbers are represented. A floating-point number can be stored in memory either with the least significant byte first (little-endian) or with the most significant byte first (big-endian).

Modern Intel-style x86 chips use the little-endian format, but data can be stored in HDF5 in either fashion.
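To make the last point concrete, the sketch below stores a big-endian array in an in-memory file and checks that the dataset keeps that byte order. The dataset name `big` is hypothetical, and the conversion to native order at the end is just one possible way to do it.

```python
import numpy as np
import h5py

big = np.arange(4, dtype=">f4")    # big-endian 4-byte floats

with h5py.File("sketch_endian.hdf5", "w", driver="core", backing_store=False) as demo:
    demo["big"] = big
    stored_dtype = demo["big"].dtype   # the byte order is preserved in the file
    values = demo["big"][...]

# Convert to the machine's native byte order for faster arithmetic.
native = values.astype(values.dtype.newbyteorder("="))

print(stored_dtype, native.dtype.isnative)
```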

In [29]:
a=np.ones((1000,1000),dtype='<f4') #Little endian 4-byte float

In [30]:
b=np.ones((1000,1000),dtype='>f4') #Big endian 4-byte float

In [31]:
from timeit import timeit
ta=timeit(a.mean, number=1000)
tb=timeit(b.mean, number=1000)
print('Little endian time: ',ta)
print('Big endian time: ',tb)      

Little endian time:  0.1508109999995213
Big endian time:  0.18424649999360554


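In [32]:
c=b.view("float32") # reinterpret b's buffer as native-endian floats (shares memory with b)

In [33]:
c[:]=b # copy the values of b into c, converting them to native byte order

In [34]:
b=c # from now on, b refers to the native-endian array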

In [35]:
timeit(b.mean, number=1000)

0.15782312501687557

In [36]:
d=np.ones((1000,1000),dtype='f4') #standard approach

In [37]:
timeit(d.mean, number=1000)

0.15687920799246058

In [38]:
f.close()

## 2. Groups: organizing the objects

HDF5 groups (and links) are the main tool to organize the objects in an HDF5 file. For beginners, it's OK to think about groups as nested "folders" or "drawers" in an "HDF5 cabinet." To use them effectively you'll have to understand the limitations of that model.

An *HDF5 link* is an explicit representation of an association between a single source (the group) and a single destination. There are different "flavors" of links, which differ in how the destination is specified.

An HDF5 group is a collection of links, **not** objects.

### A) Create

#### I) By-hand

In [39]:
f = h5py.File(path_file+"toy_experiment.hdf5", "r")
for key in f:
    print(key, f[key])
    

series_1 <HDF5 group "/series_1" (2 members)>
series_2 <HDF5 group "/series_2" (2 members)>


In [41]:
f['series_1'].file

<HDF5 file "toy_experiment.hdf5" (mode r)>

In [42]:
f['series_1/energy'].parent

<HDF5 group "/series_1" (2 members)>

In [43]:
f.close()

#### II) From scratch
We can also work from scratch using the `create_group` method of the file object `f` (`h5py.File` is a subclass of the more generic `h5py.Group` class).

This means that the file itself is a **group** (the root group "/").
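The subclass relationship can be checked directly. The sketch below also shows that a nested path creates the intermediate groups, and that `require_group` returns an existing group instead of failing; the group names are placeholders.

```python
import h5py

with h5py.File("sketch_groups.hdf5", "w", driver="core", backing_store=False) as demo:
    is_group = isinstance(demo, h5py.Group)   # a File object is also a Group
    root_name = demo.name                     # name of the root group

    deep = demo.create_group("a/b/c")         # intermediate groups a and a/b are created too
    again = demo.require_group("a/b")         # returns the existing group, creates it if missing
    names = (deep.name, again.name)

print(is_group, root_name, names)
```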

In [44]:

f = h5py.File(path_file+"group_ex.hdf5", "w")
f.name

'/'

In [45]:
g1=f.create_group('my_first_group')
sg1=g1.create_group('my_first_subgroup')

In [46]:
sg1.name

'/my_first_group/my_first_subgroup'

In [47]:
all_g=f.create_group('my_first/long/path')

### B) Read
The link collections stored in HDF5 groups can be accessed and traversed like Python dictionaries. 

In [48]:
len(f)

2

In [49]:
list(f.keys())

['my_first', 'my_first_group']

In [50]:
[(x,y) for x, y in f.items()]

[('my_first', <HDF5 group "/my_first" (1 members)>),
 ('my_first_group', <HDF5 group "/my_first_group" (1 members)>)]

In [51]:
def printname(name):
    print(name)
f.visit(printname)    

my_first
my_first/long
my_first/long/path
my_first_group
my_first_group/my_first_subgroup
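Besides `visit`, which passes only names, `visititems` passes both the name and the object to the callback. The sketch below uses hypothetical groups `run1` and `run2` and collects the shape of every dataset in the file.

```python
import numpy as np
import h5py

with h5py.File("sketch_traverse.hdf5", "w", driver="core", backing_store=False) as demo:
    demo["run1/energy"] = np.arange(3)    # the group run1 is created automatically
    demo["run2/energy"] = np.arange(4)

    has_run1 = "run1" in demo             # dictionary-style membership test
    has_nested = "run1/energy" in demo    # paths work as well

    found = {}

    def collect(name, obj):
        # Called once per object; returning None continues the traversal.
        if isinstance(obj, h5py.Dataset):
            found[name] = obj.shape

    demo.visititems(collect)

print(has_run1, has_nested, found)
```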


### C) Working with links

What does it mean to give an object a name in the file? You might think that the name is part of the object, in the same way that the dtype or shape are part of a dataset. But this isn’t the case. There’s a layer between the group object and the objects that are its members. The two are related by the concept of links.

Links in HDF5 are handled in much the same way as in modern filesystems. **Objects like datasets and groups don’t have an intrinsic name**; rather, they have an **address (byte offset)** in the file that HDF5 has to look up. When you assign an object to a name in a group, that address is recorded in the group and associated with the name you provided to form a link.

This means that objects in an **HDF5 file can have more than one name**; in fact, they have as many names as there exist links pointing to them. The number of links that point to an object is recorded, and when no more links exist, the space used for the object is freed.

This kind of link, the default in HDF5, is called a hard link.
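The link-counting behaviour can be sketched as follows: an object with two names survives the removal of one of them. The names `original` and `alias` are placeholders.

```python
import numpy as np
import h5py

with h5py.File("sketch_hardlink.hdf5", "w", driver="core", backing_store=False) as demo:
    demo["original"] = np.arange(5)
    demo["alias"] = demo["original"]       # second hard link to the same object

    same = demo["original"] == demo["alias"]

    del demo["original"]                   # removes one link; one link remains
    names = list(demo.keys())
    still_there = demo["alias"][...]       # the data is still reachable

print(same, names, still_there)
```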

<img src="./img/OIP.jpg" />


#### I. Hard Link

Let us create an example of multiple links

In [53]:
g1.name

'/my_first_group'

Let us create a new link pointing to the group

In [54]:
f['/my_second_group']=g1
g2=f['/my_second_group']

In [55]:
g1==g2

True

In [56]:
f['/my_third_group']=f['/my_second_group']
list(f.keys())

['my_first', 'my_first_group', 'my_second_group', 'my_third_group']

In [51]:
g1.name

'/my_first_group'

In [52]:
g2.name

'/my_second_group'

In [57]:
f.visit(printname)

my_first
my_first/long
my_first/long/path
my_first_group
my_first_group/my_first_subgroup


In [58]:
f['/my_second_group']

<HDF5 group "/my_second_group" (1 members)>

In [55]:
f.keys()

<KeysViewHDF5 ['my_first', 'my_first_group', 'my_second_group', 'my_third_group']>

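Notice the "degeneracy": the extra names show up in the keys, but `printname` does not see them, because `visit` reports each object only once, under one of its names.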

In [56]:
g1==g2

True

However, the Python `id`s of these objects differ, because every lookup returns a new Python wrapper object:

In [57]:
print("id_g1:",id(g1))
print("id_obj_1:",id(f['/my_first_group']))
print("id_g2:",id(g2))
print("id_obj_2:", id(f['/my_second_group']))

id_g1: 140273325879760
id_obj_1: 140273503473936
id_g2: 140273325649936
id_obj_2: 140273503478608


The ids are all different, but the hashes are identical:

In [58]:
print("hash_g1:",hash(g1))
print("hash_obj_1:",hash(f['/my_first_group']))
print("hash_g2:",hash(g2))
print("hash_obj_2:", hash(f['/my_second_group']))

hash_g1: -7902240669163372283
hash_obj_1: -7902240669163372283
hash_g2: -7902240669163372283
hash_obj_2: -7902240669163372283


Let's put a dataset in the group


In [59]:
x = np.random.rand(5,5)


In [60]:
f['/my_first_group/my_dataset']=x
dset=f['/my_first_group/my_dataset']
dset[0:2]

array([[0.35283029, 0.02041644, 0.18592154, 0.12311386, 0.29050763],
       [0.83803894, 0.35512436, 0.15304231, 0.00574351, 0.17193565]])

We may wonder if the dataset is seen through the other links

In [61]:
f['/my_second_group/my_dataset']

<HDF5 dataset "my_dataset": shape (5, 5), type "<f8">

The name points to the address of the group, and thus to everything inside the group.

<img src="./img/hard_link.png" />

#### II. Soft Links

Unlike “hard” links, which associate a link name with a particular object in the file, soft links instead store **the path to an object**.

This means we can create soft links whose destinations do not exist yet, or never will.
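A minimal sketch of such a dangling link: the soft link is created first, fails to resolve, and starts working once an object appears at the stored path. The names `pointer` and `/results/final` are hypothetical.

```python
import numpy as np
import h5py

with h5py.File("sketch_softlink.hdf5", "w", driver="core", backing_store=False) as demo:
    demo["pointer"] = h5py.SoftLink("/results/final")   # the target does not exist yet
    link_names = list(demo.keys())                      # the link itself is listed

    try:
        demo["pointer"]
        dangling = False
    except KeyError:                                    # nothing to resolve so far
        dangling = True

    demo["results/final"] = np.arange(3)                # now create the target
    resolved = demo["pointer"][...]                     # the same link resolves

print(link_names, dangling, resolved)
```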


In [62]:
y = np.random.rand(5,5)
f['/my_fourth_group/my_dataset']=y
dset_s=f['/my_fourth_group/my_dataset']
if 'soft'  in f.keys():
    del f['soft']
f['soft'] = h5py.SoftLink('/my_fourth_group/my_dataset')

In [63]:
f['soft'] == dset_s

True

In [64]:
f['soft']

<HDF5 dataset "soft": shape (5, 5), type "<f8">

Now let's move the dataset

In [65]:
f['/my_fourth_group'].move('my_dataset','your_dataset')
dset_s

<HDF5 dataset "your_dataset": shape (5, 5), type "<f8">

In [66]:
f['/my_fourth_group/your_dataset']

<HDF5 dataset "your_dataset": shape (5, 5), type "<f8">

In [67]:
'soft' in f.keys()

True

In [38]:
f.keys()

<KeysViewHDF5 ['my_first', 'my_first_group', 'my_fourth_group', 'my_second_group', 'my_third_group', 'soft']>

So 'soft' is still a key, but does it still point to something?

In [68]:
f['soft']

KeyError: 'Unable to synchronously open object (component not found)'
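The link is broken because it still stores the old path. One way to diagnose and repair this, sketched below with hypothetical names, is to retrieve the link object itself with `get(..., getlink=True)`, inspect its `path`, and replace the link.

```python
import numpy as np
import h5py

with h5py.File("sketch_broken_link.hdf5", "w", driver="core", backing_store=False) as demo:
    demo["data/old_name"] = np.arange(3)
    demo["soft"] = h5py.SoftLink("/data/old_name")
    demo["data"].move("old_name", "new_name")           # the soft link now dangles

    link = demo.get("soft", getlink=True)               # the link object, not its target
    stored_path = link.path                             # still the old path
    link_class = demo.get("soft", getclass=True, getlink=True)

    del demo["soft"]                                    # repair: re-create the link
    demo["soft"] = h5py.SoftLink("/data/new_name")
    repaired = demo["soft"][...]

print(stored_path, link_class, repaired)
```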

#### III. External Links

A link's destination can be an object in another HDF5 file.

In [69]:
f['external_alias'] = h5py.ExternalLink("./file_hdf5/weather.hdf5", "/1/temp")

In [70]:
for name in f:
    print(name, f.get(name, getclass=True, getlink=True))

external_alias <class 'h5py._hl.group.ExternalLink'>
my_first <class 'h5py._hl.group.HardLink'>
my_first_group <class 'h5py._hl.group.HardLink'>
my_fourth_group <class 'h5py._hl.group.HardLink'>
my_second_group <class 'h5py._hl.group.HardLink'>
my_third_group <class 'h5py._hl.group.HardLink'>
soft <class 'h5py._hl.group.SoftLink'>
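The listing above shows only the link classes. The sketch below builds a self-contained pair of files in a temporary directory and reads a dataset through an external link. The file names `target.hdf5` and `linker.hdf5` and the temperature values are made up for illustration; they are not the contents of `weather.hdf5`.

```python
import os
import tempfile

import numpy as np
import h5py

with tempfile.TemporaryDirectory() as tmp_dir:
    target_path = os.path.join(tmp_dir, "target.hdf5")   # hypothetical file names
    linker_path = os.path.join(tmp_dir, "linker.hdf5")

    # The file that holds the actual data; closed before it is accessed via the link.
    with h5py.File(target_path, "w") as target:
        target["/1/temp"] = np.array([20.5, 21.0, 19.8])

    # The file that only holds a link to it.
    with h5py.File(linker_path, "w") as linker:
        linker["external_alias"] = h5py.ExternalLink(target_path, "/1/temp")

        link = linker.get("external_alias", getlink=True)
        link_info = (os.path.basename(link.filename), link.path)

        temps = linker["external_alias"][...]            # resolved in the other file

print(link_info, temps)
```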


## 3) Attributes: storing metadata

HDF5 attributes are the main facility to store *metadata*. In the simplest case, they are mere key/value pairs. However, in HDF5, they can be full-blown array variables, albeit without some of the conveniences of their dataset cousins (no partial I/O, no chunking or compression, etc.)

In [71]:
f.keys()

<KeysViewHDF5 ['external_alias', 'my_first', 'my_first_group', 'my_fourth_group', 'my_second_group', 'my_third_group', 'soft']>

In [42]:
 f

<HDF5 file "group_ex.hdf5" (mode r+)>

The `attrs` property of an HDF5 object is the gateway to its collection of HDF5 attributes.

In [72]:
f['my_first_group/my_dataset'].attrs

<Attributes of HDF5 object at 140273325478816>

### A) Create

#### I. From Python objects

**String.**

In [73]:
f['my_first_group/my_dataset'].attrs['title'] = "Dataset from first round of experiments"
f['my_first_group/my_dataset'].attrs.keys()

<KeysViewHDF5 ['title']>

**Attributes can be full-blown array variables.**

In [74]:
arr=np.ones((3,3))

In [75]:
f['my_first_group'].attrs['setup']=arr

#### II. From scratch

In [80]:
f['my_fourth_group'].attrs.create('two_byte_int', 190, dtype='i2')
f['my_fourth_group'].attrs.create('title', 'fourth run')

### B) Read

In [81]:
[(name, val) for name, val in f['my_fourth_group'].attrs.items()]

[('title', 'fourth run'), ('two_byte_int', 190)]

In [82]:
f['my_fourth_group'].attrs.get('title')

'fourth run'

### C) Update

In [83]:
f['my_fourth_group'].attrs['title']='my fourth run'
[(name, val) for name, val in f['my_fourth_group'].attrs.items()]

[('title', 'my fourth run'), ('two_byte_int', 190)]

### D) Delete

In [84]:
del f['my_fourth_group'].attrs['title']
[(name, val) for name, val in f['my_fourth_group'].attrs.items()]

[('two_byte_int', 190)]
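The four attribute operations above can be gathered into one self-contained sketch. The dataset `measurement` and the attribute names and values are placeholders chosen for illustration.

```python
import numpy as np
import h5py

with h5py.File("sketch_attrs.hdf5", "w", driver="core", backing_store=False) as demo:
    ds = demo.create_dataset("measurement", data=np.zeros(4))

    # Create: from Python objects, from an array, and with an explicit type.
    ds.attrs["title"] = "first round"
    ds.attrs["calibration"] = np.eye(2)
    ds.attrs.create("two_byte_int", 190, dtype="i2")

    # Read
    calibration = ds.attrs["calibration"]
    small_int = ds.attrs["two_byte_int"]

    # Update
    ds.attrs["title"] = "first round, revised"
    title = ds.attrs["title"]

    # Delete
    del ds.attrs["calibration"]
    names = sorted(ds.attrs.keys())

print(title, small_int, names)
```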