# Structuring Datasets

> Objectives:
> * Use a hierarchy to structure datasets inside the same file
> * Learn how to store tabular data in normalized and denormalized forms
> * Use compression to get rid of duplicates (especially in denormalized tables)

## Using the Hierarchy

In HDF5, all nodes stem from a root ("/").  The nodes can be either `Groups` or `Datasets` (also known as `Leaves` in PyTables).  `Groups` are the equivalent of directories on a filesystem and can contain `Datasets` or other `Groups`.  A `Dataset` is a container for data.

In PyTables, you may access nodes as attributes on a Python object, for example `f.root.a_group.some_data`.  This is known as natural naming.  New nodes, however, must be created through the file handle:

In [None]:
import numpy as np
import tables

In [None]:
import os
import shutil
data_dir = "structuring"
if os.path.exists(data_dir):
    shutil.rmtree(data_dir)
os.mkdir(data_dir)

In [None]:
f = tables.open_file("%s/layout.h5" % data_dir, "w")
group = f.create_group('/', 'a_group')
group

Inside this group we can create many datasets:

In [None]:
f.create_array(group, "my_array1", np.arange(10))
f.create_array(group, "my_array2", np.ones(100).reshape(10, 10));

In [None]:
print(f)

With that, you can give your datasets whatever hierarchy best fits your needs.

In [None]:
f.close()
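
The natural naming described above can be sketched as follows.  This is a minimal illustration, not part of the original tutorial: it writes to a hypothetical scratch file (`natural_naming.h5` in a temporary directory) so that it does not interfere with `layout.h5`, and it reuses the group and array names from the cells above.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical scratch file, kept separate from the tutorial's layout.h5
path = os.path.join(tempfile.mkdtemp(), "natural_naming.h5")

with tables.open_file(path, "w") as h5:
    grp = h5.create_group("/", "a_group")
    h5.create_array(grp, "my_array1", np.arange(10))

    # Natural naming: walk the hierarchy with attribute access
    node = h5.root.a_group.my_array1
    # The same node, looked up by its path string instead
    same_node = h5.get_node("/a_group/my_array1")

    first_three = node[:3]              # slicing reads the data from disk
    node_path = same_node._v_pathname   # full path of the node in the file
    children = [n._v_name for n in h5.iter_nodes(grp)]

print(first_three, node_path, children)
```

Attribute access is convenient for interactive exploration, while `get_node()` with a path string is handy when node names are only known at run time.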

## Normalizing and denormalizing tables

Many data sources are expressed in terms of related tables.  For example, part of the [MovieLens dataset](https://grouplens.org/datasets/movielens/) is structured in tables with the following columns:

In [None]:
ratings = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
movies = ['movie_id', 'title', 'genres']

The `movie_id` field is what links the two tables above.  This way, one can run queries that involve both tables, for example, which users ('user_id') gave a rating of 5 to some movie ('title').  This is called the `normalized` version, and we have already dealt with it in a previous section.
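
As a rough sketch of such a query over normalized tables, the snippet below uses pandas on a handful of toy rows.  The data and the names `toy_ratings` and `toy_movies` are invented for illustration and are not taken from MovieLens.

```python
import pandas as pd

# Toy rows invented for illustration (not real MovieLens data)
toy_ratings = pd.DataFrame({
    "user_id": [1, 2, 3, 1],
    "movie_id": [10, 10, 20, 20],
    "rating": [5, 3, 5, 4],
    "unix_timestamp": [978300760, 978302109, 978301968, 978300275],
})
toy_movies = pd.DataFrame({
    "movie_id": [10, 20],
    "title": ["Movie A", "Movie B"],
    "genres": ["Comedy", "Drama"],
})

# Step 1: look up the movie_id for the title in the movies table
wanted_ids = toy_movies.loc[toy_movies["title"] == "Movie A", "movie_id"]

# Step 2: use that id to filter the ratings table
mask = toy_ratings["movie_id"].isin(wanted_ids) & (toy_ratings["rating"] == 5)
fans = toy_ratings.loc[mask, "user_id"].tolist()
print(fans)
```

Note that the query needs two steps because the title lives in one table and the rating in the other.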

On the other hand, one can fuse the above 2 tables into a single one:

In [None]:
ratings_movies = ['title', 'genres', 'user_id', 'rating', 'unix_timestamp']

As you can see, we still keep all the data fields, except for 'movie_id', which is not needed anymore.  This is called the `denormalized` version.

The advantage of this version is that all the fields are readily available in a single table, so querying it and getting info about all the fields is straightforward.  The disadvantage is that this table holds a lot of duplicated information: the 'title' and 'genres' fields are repeated for every rating, which can be seen as a waste of space.
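
The trade-off can be illustrated with a small pandas sketch.  The rows and the `toy_*` names below are made up for the example; only the column layout follows the tables described above.

```python
import pandas as pd

# Toy rows invented for illustration (not real MovieLens data)
toy_ratings = pd.DataFrame({
    "user_id": [1, 2, 3, 1],
    "movie_id": [10, 10, 20, 20],
    "rating": [5, 3, 5, 4],
    "unix_timestamp": [978300760, 978302109, 978301968, 978300275],
})
toy_movies = pd.DataFrame({
    "movie_id": [10, 20],
    "title": ["Movie A", "Movie B"],
    "genres": ["Comedy", "Drama"],
})

# Fuse both tables on the shared key, then drop the key itself
toy_denorm = pd.merge(toy_movies, toy_ratings, on="movie_id").drop(columns="movie_id")

# Advantage: a single filter answers the question, no second table needed
is_fan = (toy_denorm["title"] == "Movie A") & (toy_denorm["rating"] == 5)
fans = toy_denorm.loc[is_fan, "user_id"].tolist()

# Disadvantage: every rating row carries its own copy of the title
n_rows = len(toy_denorm)
n_titles = toy_denorm["title"].nunique()
print(fans, n_rows, n_titles)
```

Here 4 rows store only 2 distinct titles; in a real ratings table the ratio of rows to distinct titles is far larger.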

However, compression can often remove much of the duplicated info in denormalized tables.  Let's see how to produce a denormalized table and how it fares compared with the normalized version.
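
To get a feeling for why compression copes well with this kind of duplication, here is a small sketch that uses only the standard-library `zlib` module rather than PyTables.  The record layout (a 100-byte title, a 50-byte genres field and a 4-byte user id) and the values are assumptions made up for the illustration.

```python
import zlib

# Made-up fixed-width fields, padded with null bytes (an assumed layout)
title = b"Movie A".ljust(100, b"\x00")
genres = b"Comedy".ljust(50, b"\x00")

# Denormalized layout: title and genres are repeated for every rating
n_ratings = 1000
denorm = b"".join(
    title + genres + user_id.to_bytes(4, "little")
    for user_id in range(n_ratings)
)

compressed = zlib.compress(denorm, 5)
print(len(denorm), "bytes raw,", len(compressed), "bytes compressed")

# Compression is lossless: the original bytes come back intact
restored = zlib.decompress(compressed)
```

The repeated fields compress to a small fraction of their raw size because the codec can encode each repetition as a short back-reference to an earlier copy.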

## Denormalizing tables using pandas

In [None]:
import os
import numpy as np
import pandas as pd
import tables

In [None]:
# Import CSV files via pandas
dset = 'movielens-1m'
fdata = os.path.join(dset, 'ratings.dat.gz')
fitem = os.path.join(dset, 'movies.dat.gz')

# pass in column names for each CSV
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv(fdata, sep=';', names=r_cols)

m_cols = ['movie_id', 'title', 'genres']
movies = pd.read_csv(fitem, sep=';', names=m_cols,
                     dtype={'title': object, 'genres': object})

In [None]:
# create one merged DataFrame
lens = pd.merge(movies, ratings)

In [None]:
lens.dtypes

In [None]:
def to_hdf5_denorm(lens, filters):

    class Lens(tables.IsDescription):
        user_id = tables.Int32Col(pos=0)
        rating = tables.Int8Col(pos=1)
        unix_timestamp = tables.Int64Col(pos=2)
        title = tables.StringCol(100, pos=3)
        genres = tables.StringCol(50, pos=4)
        
    def get_filename(filters):
        if filters.complevel != 0:
            # Replace ":" (as in "blosc:lz4") so the name is filesystem-friendly
            complib = filters.complib.replace(":", "-")
            shuffle = "shuffle" if filters.shuffle else "noshuffle"
            filename = "%s/%s-%d-%s.h5" % (data_dir, complib, filters.complevel, shuffle)
        else:
            filename = "%s/no-compressed.h5" % (data_dir,)
        return filename

    filename = get_filename(filters)
    print("Creating file:", filename)
    with tables.open_file(filename, "w", filters=filters) as f:
        table_lens = f.create_table(f.root, "lens", Lens)
        table_lens.append([lens[col].values for col in table_lens.dtype.names])
    return filename

In [None]:
%%time
filters = tables.Filters(complevel=0, shuffle=True)
h5file = to_hdf5_denorm(lens, filters)

In [None]:
!ptdump -v -R0,10 {h5file}

In [None]:
%ls -lh structuring compression

As can be seen, the size of the denormalized table is much larger than the normalized one (156 MB vs 17 MB).  But that is without using compression.

### Exercise 1

Create a compressed version of the denormalized table and compare it with the same table in the normalized state.
What's the difference in size now?  Why do you think the compression process works much better in this case?

### Exercise 2

Create different files containing the denormalized table using different codecs.  Which one reduces the size the most?  How does it compare with the files for the normalized version?

In the next section we will see the effect of querying normalized and denormalized tables.