## Using the Hierarchy

In HDF5, all nodes stem from a root ("/").  Nodes can be either `Groups` or `Datasets` (also known as `Leaves` in PyTables).  `Groups` are the equivalent of directories on a filesystem and can contain `Datasets` or other `Groups`.  A `Dataset` is a container for data.

In PyTables, you can access nodes as attributes of a Python object, for example `f.root.a_group.some_data`.  This is known as natural naming.  New nodes, however, must be created through the file handle:

In [9]:
import numpy as np
import tables

In [40]:
f = tables.open_file("layout.h5", "w")
group = f.create_group('/', 'a_group')
group

/a_group (Group) ''
  children := []

Inside this group we can create many datasets:

In [41]:
f.create_array(group, "my_array1", np.arange(10))
f.create_array(group, "my_array2", np.ones(100).reshape(10, 10));

In [42]:
print(f)

layout.h5 (File) ''
Last modif.: 'Mon May  8 17:08:42 2017'
Object Tree: 
/ (RootGroup) ''
/a_group (Group) ''
/a_group/my_array1 (Array(10,)) ''
/a_group/my_array2 (Array(10, 10)) ''



In this way, you can give your datasets whatever hierarchy best fits your needs.

In [43]:
f.close()
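The natural-naming access described earlier can be sketched as follows. This is a minimal example, assuming a scratch file (the name `natural_naming.h5` is made up here so that `layout.h5` is left untouched) that rebuilds the same small hierarchy and then reads a node back both by attribute and by path string:

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical scratch file, so the example does not touch layout.h5.
path = os.path.join(tempfile.mkdtemp(), "natural_naming.h5")

with tables.open_file(path, "w") as f:
    # Nodes are created through the file handle.
    group = f.create_group("/", "a_group")
    f.create_array(group, "my_array1", np.arange(10))
    f.create_array(group, "my_array2", np.ones((10, 10)))

    # Natural naming: follow the hierarchy with attribute access.
    by_attribute = f.root.a_group.my_array1.read()

    # The same node, looked up by its path string instead.
    by_path = f.get_node("/a_group/my_array1").read()

    # Collect the paths of all Array nodes below the root.
    array_paths = sorted(
        node._v_pathname for node in f.walk_nodes("/", "Array")
    )

print(by_attribute)
print(array_paths)
```

Attribute access is convenient for interactive work, while path strings tend to be easier when node names are only known at run time.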

## Normalizing and denormalizing tables

Many data sources are expressed in terms of related tables.  For example, part of the [MovieLens dataset](https://grouplens.org/datasets/movielens/) is structured in tables with the following columns:

In [44]:
ratings = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
movies = ['movie_id', 'title', 'genres']

The `movie_id` field is what links the two tables above.  It lets you run queries that involve both tables, for example: which users ('user_id') gave a rating of 5 to a given movie ('title')?  This layout is called the `normalized` version.
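As a rough illustration of such a query, here is a sketch in plain Python. The rows, and the names `ratings_rows` and `movies_rows`, are invented for this example and are not actual MovieLens data. The point is that the query needs two steps: one lookup in the movies table, then a filter on the ratings table.

```python
# Toy rows invented for illustration; not actual MovieLens data.
ratings_rows = [
    # (user_id, movie_id, rating, unix_timestamp)
    (1, 10, 5, 978300760),
    (2, 10, 3, 978302109),
    (2, 20, 5, 978301968),
    (3, 20, 5, 978300275),
]
movies_rows = [
    # (movie_id, title, genres)
    (10, "Movie A", "Comedy"),
    (20, "Movie B", "Drama|Romance"),
]

# Step 1: find the movie_id(s) for the title in the movies table.
wanted_title = "Movie B"
wanted_ids = {
    movie_id for movie_id, title, _genres in movies_rows
    if title == wanted_title
}

# Step 2: use those ids to filter the ratings table.
fans = sorted(
    user_id
    for user_id, movie_id, rating, _timestamp in ratings_rows
    if movie_id in wanted_ids and rating == 5
)
print(fans)  # [2, 3]
```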

On the other hand, one can fuse the two tables above into a single one:

In [45]:
ratings_movies = ['title', 'genres', 'user_id', 'rating', 'unix_timestamp']

As you can see, we still keep all the data fields except 'movie_id', which is no longer needed.  This is called the `denormalized` version.
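The fusing step can be sketched in plain Python as below. The rows and the variable names are made up for illustration; the only thing taken from the text is the column order of `ratings_movies`.

```python
# Toy rows invented for illustration; not actual MovieLens data.
ratings_rows = [
    # (user_id, movie_id, rating, unix_timestamp)
    (1, 10, 5, 978300760),
    (2, 10, 3, 978302109),
    (2, 20, 5, 978301968),
]
movies_rows = [
    # (movie_id, title, genres)
    (10, "Movie A", "Comedy"),
    (20, "Movie B", "Drama|Romance"),
]

# Index the movies table by movie_id for quick lookups.
title_genres = {
    movie_id: (title, genres) for movie_id, title, genres in movies_rows
}

# Replace each movie_id with its title and genres, giving rows of
# (title, genres, user_id, rating, unix_timestamp).
ratings_movies_rows = [
    title_genres[movie_id] + (user_id, rating, timestamp)
    for user_id, movie_id, rating, timestamp in ratings_rows
]

for row in ratings_movies_rows:
    print(row)
```

Note how "Movie A" and "Comedy" now appear once per rating instead of once per movie.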

The advantage of the denormalized version is that all the fields are readily available in a single table, so querying it and getting information about any field is straightforward.  The disadvantage is that the table contains a lot of duplicated information: the 'title' and 'genres' fields are repeated for every rating, which can be seen as a waste of space.

However, compression can often remove much of the duplicated information in denormalized tables.  More on compression in a later section.
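To get a feel for this effect, here is a small sketch that uses Python's standard `zlib` module as a stand-in; it is not how PyTables applies compression (that is configured through `tables.Filters`). The data is synthetic and the text serialization is an assumption made only for this example, so the exact byte counts mean little; what matters is how much the duplicated fields shrink.

```python
import zlib

# Synthetic toy data invented for illustration: 2 movies, 10,000 ratings.
movies_rows = [(10, "Movie A", "Comedy"), (20, "Movie B", "Drama|Romance")]
title_genres = {m_id: (title, genres) for m_id, title, genres in movies_rows}
ratings_rows = [
    (user_id, 10 if user_id % 2 else 20, 1 + user_id % 5)
    for user_id in range(10_000)
]


def serialize(rows):
    """Turn rows into comma-separated text, one row per line, as bytes."""
    return "\n".join(",".join(map(str, row)) for row in rows).encode("utf-8")


# Normalized: both tables stored separately.
normalized = serialize(ratings_rows) + b"\n" + serialize(movies_rows)

# Denormalized: title and genres repeated on every rating row.
denormalized = serialize(
    title_genres[m_id] + (user_id, rating)
    for user_id, m_id, rating in ratings_rows
)

sizes = {}
for label, payload in (("normalized", normalized),
                       ("denormalized", denormalized)):
    sizes[label] = (len(payload), len(zlib.compress(payload)))
    raw, packed = sizes[label]
    print(f"{label:>12}: {raw:>7} bytes raw, {packed:>6} bytes compressed")
```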