# Structuring Datasets

> Objectives:
> * Use a hierarchy to structure datasets inside the same file
> * Store tabular data in normalized and denormalized forms
> * Use compression to offset duplicated data (especially in denormalized tables)

## Using the Hierarchy

In HDF5, all nodes stem from a root ("/").  A node can be either a `Group` or a `Dataset` (also known as a `Leaf` in PyTables).  `Groups` are the equivalent of directories on a filesystem and can contain `Datasets` or other `Groups`.  A `Dataset` is a container for data.

In PyTables, you can access nodes as attributes of a Python object, for example `f.root.a_group.some_data`.  This is known as natural naming.  New nodes, however, must be created through the file handle:

In [1]:
import numpy as np
import tables

In [2]:
import os
import shutil
data_dir = "structuring"
if os.path.exists(data_dir):
    shutil.rmtree(data_dir)
os.mkdir(data_dir)

In [3]:
f = tables.open_file("%s/layout.h5" % data_dir, "w")
group = f.create_group('/', 'a_group')
group

/a_group (Group) ''
  children := []

Inside this group we can create many datasets:

In [4]:
f.create_array(group, "my_array1", np.arange(10))
f.create_array(group, "my_array2", np.ones(100).reshape(10, 10));

In [5]:
print(f)

structuring/layout.h5 (File) ''
Last modif.: 'Thu May 18 11:01:15 2017'
Object Tree: 
/ (RootGroup) ''
/a_group (Group) ''
/a_group/my_array1 (Array(10,)) ''
/a_group/my_array2 (Array(10, 10)) ''



With that, you can give your datasets whatever hierarchy best fits your needs.

In [6]:
f.close()
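
As a minimal sketch of natural naming, the snippet below builds a small hierarchy in a temporary file and reads it back in three ways: attribute access, an explicit path passed to `get_node`, and a walk over all array leaves.  The file name `natural_naming.h5` and the node names `nested`, `data` and `ones` are illustrative only and are not part of the tutorial data.

```python
import os
import tempfile

import numpy as np
import tables

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "natural_naming.h5")

with tables.open_file(path, "w") as f:
    # createparents=True creates any missing intermediate groups.
    f.create_array("/a_group/nested", "data", np.arange(5), createparents=True)
    f.create_array("/a_group", "ones", np.ones((2, 3)))

with tables.open_file(path, "r") as f:
    # Natural naming: nodes are reachable as attributes of f.root.
    by_attribute = f.root.a_group.nested.data[:]
    # The same node, addressed by its full path.
    by_path = f.get_node("/a_group/nested/data").read()
    # Walk every Array leaf below the root and collect its path.
    array_paths = sorted(
        node._v_pathname for node in f.walk_nodes("/", classname="Array")
    )

print(by_attribute)
print(array_paths)
```

Attribute access is convenient for interactive work, while path strings are easier to build programmatically, for instance when node names come from a configuration file.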

## Normalizing and denormalizing tables

Many data sources are expressed in terms of related tables.  For example, part of the [MovieLens dataset](https://grouplens.org/datasets/movielens/) is structured in tables with the following columns:

In [7]:
ratings = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
movies = ['movie_id', 'title', 'genres']

The relation that links the two tables above is the `movie_id` field.  This way, one can run queries that involve both tables, for example, which users ('user_id') gave a rating of 5 to a given movie ('title').  This is called the `normalized` version, and we have already dealt with it in a previous section.
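
To make the two-step nature of such a query concrete, here is a small sketch using pandas on toy stand-ins for the two tables.  The rows, titles and ids below are made up for illustration; only the column names come from the dataset description above.

```python
import pandas as pd

# Toy stand-ins for the two MovieLens tables (all values are made up).
ratings = pd.DataFrame({
    "user_id": [1, 2, 3, 1],
    "movie_id": [10, 10, 20, 20],
    "rating": [5, 4, 5, 3],
    "unix_timestamp": [978300000, 978300100, 978300200, 978300300],
})
movies = pd.DataFrame({
    "movie_id": [10, 20],
    "title": ["Movie A", "Movie B"],
    "genres": ["Comedy", "Drama"],
})

# Step 1: resolve the title to its movie_id in the movies table.
wanted_ids = movies.loc[movies["title"] == "Movie A", "movie_id"]

# Step 2: use that id to filter the ratings table.
hits = ratings[ratings["movie_id"].isin(wanted_ids) & (ratings["rating"] == 5)]
users = hits["user_id"].tolist()
print(users)
```

The query has to touch both tables: the title lives in one and the rating in the other, with `movie_id` as the only bridge between them.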

On the other hand, one can fuse the two tables above into a single one:

In [8]:
ratings_movies = ['title', 'genres', 'user_id', 'rating', 'unix_timestamp']

As you can see, we still keep all the data fields, except for 'movie_id', which is no longer needed.  This is called the `denormalized` version.

The advantage of this form is that all the fields are readily available in a single table, so querying it and getting info about all the fields is straightforward.  The disadvantage is that the table holds a lot of duplicated information: the 'title' and 'genres' values are repeated for every rating, which can be seen as a waste of space.

However, compression can often remove most of the redundancy in denormalized tables.  Let's see how to produce a denormalized table and how it fares compared with the normalized version.
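
A rough way to see why repeated fields compress so well is to compress a buffer made of the same fixed-width strings over and over.  The sketch below uses the standard library `zlib` module rather than the PyTables filters, and the title, genres and row count are made up.  A real table also interleaves fields that change from row to row (user ids, timestamps), so actual ratios will be lower than what this idealized buffer achieves.

```python
import zlib

# Made-up title and genres, padded to the fixed widths a table column would use.
title = b"Some Movie (1999)".ljust(100, b"\x00")
genres = b"Comedy|Drama".ljust(50, b"\x00")

# 10,000 ratings of the same movie repeat the same 150 bytes per row.
n_rows = 10_000
raw = (title + genres) * n_rows

compressed = zlib.compress(raw, 5)
ratio = len(raw) / len(compressed)
print(len(raw), len(compressed), round(ratio))
```

Both the padding bytes inside each string and the repetition across rows are highly redundant, which is exactly the pattern a denormalized table produces.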

## Denormalizing tables using pandas

In [9]:
import os
import numpy as np
import pandas as pd
import tables

In [10]:
# Import CSV files via pandas
dset = 'movielens-1m'
fdata = os.path.join(dset, 'ratings.dat.gz')
fitem = os.path.join(dset, 'movies.dat.gz')

# pass in column names for each CSV
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv(fdata, sep=';', names=r_cols)

m_cols = ['movie_id', 'title', 'genres']
movies = pd.read_csv(fitem, sep=';', names=m_cols,
                     dtype={'title': object, 'genres': object})

In [11]:
# create one merged DataFrame
lens = pd.merge(movies, ratings)

In [12]:
lens.dtypes

movie_id           int64
title             object
genres            object
user_id            int64
rating             int64
unix_timestamp     int64
dtype: object

In [13]:
def to_hdf5_denorm(lens, filters):

    class Lens(tables.IsDescription):
        user_id = tables.Int32Col(pos=0)
        rating = tables.Int8Col(pos=1)
        unix_timestamp = tables.Int64Col(pos=2)
        title = tables.StringCol(100, pos=3)
        genres = tables.StringCol(50, pos=4)
        
    def get_filename(filters):
        if filters.complevel != 0:
            # ":" in names such as "blosc:lz4" is replaced to keep filenames portable
            complib = filters.complib.replace(":", "-")
            shuffle = "shuffle" if filters.shuffle else "noshuffle"
            filename = "%s/%s-%d-%s.h5" % (data_dir, complib, filters.complevel, shuffle)
        else:
            filename = "%s/no-compressed.h5" % (data_dir,)
        return filename

    filename = get_filename(filters)
    print("Creating file:", filename)
    with tables.open_file(filename, "w", filters=filters) as f:
        table_lens = f.create_table(f.root, "lens", Lens)
        table_lens.append([lens[col].values for col in table_lens.dtype.names])
    return filename

In [14]:
%%time
filters = tables.Filters(complevel=0, shuffle=True)
h5file = to_hdf5_denorm(lens, filters)

Creating file: structuring/no-compressed.h5
CPU times: user 204 ms, sys: 172 ms, total: 376 ms
Wall time: 450 ms


In [15]:
!ptdump -v -R0,10 {h5file}

/ (RootGroup) ''
/lens (Table(1000209,)) ''
  description := {
  "user_id": Int32Col(shape=(), dflt=0, pos=0),
  "rating": Int8Col(shape=(), dflt=0, pos=1),
  "unix_timestamp": Int64Col(shape=(), dflt=0, pos=2),
  "title": StringCol(itemsize=100, shape=(), dflt=b'', pos=3),
  "genres": StringCol(itemsize=50, shape=(), dflt=b'', pos=4)}
  byteorder := 'little'
  chunkshape := (402,)
  Data dump:
[0] (1, 5, 978824268, b'Toy Story (1995)', b"Animation|Children's|Comedy")
[1] (6, 4, 978237008, b'Toy Story (1995)', b"Animation|Children's|Comedy")
[2] (8, 4, 978233496, b'Toy Story (1995)', b"Animation|Children's|Comedy")
[3] (9, 5, 978225952, b'Toy Story (1995)', b"Animation|Children's|Comedy")
[4] (10, 5, 978226474, b'Toy Story (1995)', b"Animation|Children's|Comedy")
[5] (18, 4, 978154768, b'Toy Story (1995)', b"Animation|Children's|Comedy")
[6] (19, 5, 978555994, b'Toy Story (1995)', b"Animation|Children's|Comedy")
[7] (21, 3, 978139347, b'Toy Story (1995)', b"Animation|Children's|Comedy"

In [16]:
%ls -lh structuring compression

compression:
total 121712
-rw-r--r--  1 faltet  staff   5.0M May 18 10:37 blosc-5-shuffle.h5
-rw-r--r--  1 faltet  staff   5.0M May 18 10:38 blosc-blosclz-5-shuffle.h5
-rw-r--r--  1 faltet  staff   5.5M May 18 10:38 blosc-lz4-5-shuffle.h5
-rw-r--r--  1 faltet  staff   4.8M May 18 10:38 blosc-lz4hc-5-shuffle.h5
-rw-r--r--  1 faltet  staff   5.5M May 18 10:38 blosc-snappy-5-shuffle.h5
-rw-r--r--  1 faltet  staff   4.3M May 18 10:38 blosc-zlib-5-shuffle.h5
-rw-r--r--  1 faltet  staff   4.3M May 18 10:38 blosc-zstd-5-shuffle.h5
-rw-r--r--  1 faltet  staff   4.1M May 18 10:37 bzip2-5-shuffle.h5
-rw-r--r--  1 faltet  staff    17M May 18 10:34 no-compressed.h5
-rw-r--r--  1 faltet  staff   4.2M May 18 10:37 zlib-5-shuffle.h5

structuring:
total 318752
-rw-r--r--  1 faltet  staff   5.2K May 18 11:01 layout.h5
-rw-r--r--  1 faltet  staff   156M May 18 11:01 no-compressed.h5


As the listing shows, without compression the denormalized table is roughly nine times larger than the normalized one (156 MB vs 17 MB).
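
The 156 MB figure can be cross-checked with a little arithmetic on the row layout.  The sketch below rebuilds the field layout of the `Lens` description as a NumPy dtype and multiplies its size by the row count reported by `ptdump`.  It assumes rows are stored packed, with no padding between fields, and it ignores HDF5 metadata overhead.

```python
import numpy as np

# Same field layout as the Lens description used for the denormalized table.
row_dtype = np.dtype([
    ("user_id", np.int32),
    ("rating", np.int8),
    ("unix_timestamp", np.int64),
    ("title", "S100"),
    ("genres", "S50"),
])

n_rows = 1_000_209  # row count reported by ptdump for /lens
bytes_per_row = row_dtype.itemsize
strings_per_row = row_dtype["title"].itemsize + row_dtype["genres"].itemsize
total_mib = n_rows * bytes_per_row / 2**20

print(bytes_per_row, strings_per_row, round(total_mib, 1))
```

Of the 163 bytes in each row, 150 belong to the two fixed-width string columns, so the duplicated 'title' and 'genres' values account for most of the uncompressed size.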

### Exercise 1

Create a compressed version of the denormalized table and compare it with the same table in the normalized state.
What's the difference in size now?  Why do you think the compression process works much better in this case?

In [17]:
filters = tables.Filters(complevel=5, complib="blosc:blosclz", shuffle=True)
%time to_hdf5_denorm(lens, filters)

Creating file: structuring/blosc-blosclz-5-shuffle.h5
CPU times: user 324 ms, sys: 102 ms, total: 426 ms
Wall time: 428 ms


'structuring/blosc-blosclz-5-shuffle.h5'

In [18]:
%ls -lh structuring compression

compression:
total 121712
-rw-r--r--  1 faltet  staff   5.0M May 18 10:37 blosc-5-shuffle.h5
-rw-r--r--  1 faltet  staff   5.0M May 18 10:38 blosc-blosclz-5-shuffle.h5
-rw-r--r--  1 faltet  staff   5.5M May 18 10:38 blosc-lz4-5-shuffle.h5
-rw-r--r--  1 faltet  staff   4.8M May 18 10:38 blosc-lz4hc-5-shuffle.h5
-rw-r--r--  1 faltet  staff   5.5M May 18 10:38 blosc-snappy-5-shuffle.h5
-rw-r--r--  1 faltet  staff   4.3M May 18 10:38 blosc-zlib-5-shuffle.h5
-rw-r--r--  1 faltet  staff   4.3M May 18 10:38 blosc-zstd-5-shuffle.h5
-rw-r--r--  1 faltet  staff   4.1M May 18 10:37 bzip2-5-shuffle.h5
-rw-r--r--  1 faltet  staff    17M May 18 10:34 no-compressed.h5
-rw-r--r--  1 faltet  staff   4.2M May 18 10:37 zlib-5-shuffle.h5

structuring:
total 333600
-rw-r--r--  1 faltet  staff   7.2M May 18 11:01 blosc-blosclz-5-shuffle.h5
-rw-r--r--  1 faltet  staff   5.2K May 18 11:01 layout.h5
-rw-r--r--  1 faltet  staff   156M May 18 11:01 no-compressed.h5


### Exercise 2

Create different files containing the denormalized table using different codecs.  Which one reduces the size the most?  How does it compare with the files for the normalized version?

In [21]:
for complib in ("zlib", "bzip2", "blosc:blosclz", "blosc:lz4", "blosc:lz4hc", "blosc:snappy", "blosc:zlib", "blosc:zstd"):
    filters = tables.Filters(complevel=5, complib=complib, shuffle=True)
    %time to_hdf5_denorm(lens, filters)

Creating file: structuring/zlib-5-shuffle.h5
CPU times: user 1.31 s, sys: 122 ms, total: 1.43 s
Wall time: 1.46 s
Creating file: structuring/bzip2-5-shuffle.h5
CPU times: user 8.96 s, sys: 138 ms, total: 9.09 s
Wall time: 9.11 s
Creating file: structuring/blosc-blosclz-5-shuffle.h5
CPU times: user 313 ms, sys: 83.8 ms, total: 397 ms
Wall time: 402 ms
Creating file: structuring/blosc-lz4-5-shuffle.h5
CPU times: user 286 ms, sys: 86.1 ms, total: 372 ms
Wall time: 375 ms
Creating file: structuring/blosc-lz4hc-5-shuffle.h5
CPU times: user 1.93 s, sys: 86.6 ms, total: 2.02 s
Wall time: 2.03 s
Creating file: structuring/blosc-snappy-5-shuffle.h5
CPU times: user 301 ms, sys: 87 ms, total: 388 ms
Wall time: 390 ms
Creating file: structuring/blosc-zlib-5-shuffle.h5
CPU times: user 1.2 s, sys: 83.6 ms, total: 1.28 s
Wall time: 1.28 s
Creating file: structuring/blosc-zstd-5-shuffle.h5
CPU times: user 654 ms, sys: 82.1 ms, total: 736 ms
Wall time: 740 ms
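
For comparing many files at once, a small helper that sorts them by size can be handier than reading a directory listing.  The function name `sizes_by_file` is hypothetical, and the demo runs on throwaway files so that it does not depend on the tutorial data.

```python
import os
import tempfile


def sizes_by_file(directory, suffix=".h5"):
    """Return (filename, size in bytes) pairs, smallest file first."""
    pairs = [
        (name, os.path.getsize(os.path.join(directory, name)))
        for name in os.listdir(directory)
        if name.endswith(suffix)
    ]
    return sorted(pairs, key=lambda pair: pair[1])


# Demo on throwaway files so the snippet runs anywhere;
# in the notebook one would call sizes_by_file("structuring") instead.
demo_dir = tempfile.mkdtemp()
for name, n_bytes in [("a.h5", 300), ("b.h5", 100), ("notes.txt", 50)]:
    with open(os.path.join(demo_dir, name), "wb") as fh:
        fh.write(b"\x00" * n_bytes)

demo_sizes = sizes_by_file(demo_dir)
for name, size in demo_sizes:
    print(f"{size:>10d}  {name}")
```

Sorting by size puts the best-compressing codec first, which answers the first question of the exercise directly.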


In [22]:
%ls -lh structuring compression

compression:
total 121712
-rw-r--r--  1 faltet  staff   5.0M May 18 10:37 blosc-5-shuffle.h5
-rw-r--r--  1 faltet  staff   5.0M May 18 10:38 blosc-blosclz-5-shuffle.h5
-rw-r--r--  1 faltet  staff   5.5M May 18 10:38 blosc-lz4-5-shuffle.h5
-rw-r--r--  1 faltet  staff   4.8M May 18 10:38 blosc-lz4hc-5-shuffle.h5
-rw-r--r--  1 faltet  staff   5.5M May 18 10:38 blosc-snappy-5-shuffle.h5
-rw-r--r--  1 faltet  staff   4.3M May 18 10:38 blosc-zlib-5-shuffle.h5
-rw-r--r--  1 faltet  staff   4.3M May 18 10:38 blosc-zstd-5-shuffle.h5
-rw-r--r--  1 faltet  staff   4.1M May 18 10:37 bzip2-5-shuffle.h5
-rw-r--r--  1 faltet  staff    17M May 18 10:34 no-compressed.h5
-rw-r--r--  1 faltet  staff   4.2M May 18 10:37 zlib-5-shuffle.h5

structuring:
total 438952
-rw-r--r--  1 faltet  staff   7.2M May 18 11:02 blosc-blosclz-5-shuffle.h5
-rw-r--r--  1 faltet  staff   7.8M May 18 11:02 blosc-lz4-5-shuffle.h5
-rw-r--r--  1 faltet  staff   6.6M May 18 11:02 blosc-lz4hc-5-shuffle.h5
-rw-r--r

In the next section we will see the effect of querying normalized and denormalized tables.