# HDF5 with PyTables

## About HDF5

PyTables supports working with hierarchical data stored in the Hierarchical Data Format version 5 (HDF5). HDF5 is a binary format designed for very large amounts of data and maintained by the <a href="http://www.hdfgroup.org/">HDF Group</a>, a non-profit organization.

HDF5 is widely used in various scientific domains, especially in supercomputing applications, due to its portability across platforms, its flexibility in supporting various data types and formats, and its support for efficient reading of data.

In general, binary formats perform better than text formats: they are more space efficient and avoid the cost of converting text to other types such as numbers.
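
As a rough illustration of the space argument, the sketch below compares the size of the same 1000 floating-point numbers stored as raw 64-bit binary values and as decimal text. The sample values and the newline-separated text layout are assumptions chosen for this example; HDF5 does not prescribe them.

```python
import numpy as np

# 1000 float64 sample values (made up for the comparison).
values = np.linspace(0.0, 1.0, 1000)

# Binary: exactly 8 bytes per float64 value.
binary_size = len(values.tobytes())

# Text: one decimal representation per line, as in a simple text file.
text_size = len("\n".join(repr(float(v)) for v in values))

print("binary bytes:", binary_size)
print("text bytes:  ", text_size)
```

Reading the text version back would also require parsing every number, which the binary version avoids.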

## About PyTables

<a href="http://www.pytables.org/moin">PyTables</a> is a library for handling HDF5 data in Python, built on top of the HDF5 C library and NumPy. It also provides performance improvements of its own for expression evaluation and compression.

Basic dataset classes include:
* Array. Fixed number of elements.
* CArray (chunked array). Stored in chunks, which allows compression.
* EArray (extendible array). Can be extended with new elements.
* VLArray (variable-length array). The elements can be heterogeneous.
* Table (structured array with named fields).
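
A minimal sketch of how the array classes listed above can be created, assuming a recent PyTables 3.x release. The file name, node names, shapes and compression settings are hypothetical choices for illustration.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Hypothetical scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "dataset_kinds.h5")

with tb.open_file(path, "w") as h5:
    # Array: fixed number of elements, written in one go.
    fixed = h5.create_array("/", "fixed", np.arange(4))

    # CArray: chunked storage, here with zlib compression enabled.
    chunked = h5.create_carray(
        "/", "chunked", atom=tb.Float64Atom(), shape=(100, 100),
        filters=tb.Filters(complevel=5, complib="zlib"))
    chunked[0, :] = np.ones(100)

    # EArray: the dimension declared with length 0 can grow.
    growing = h5.create_earray("/", "growing", atom=tb.Int32Atom(), shape=(0,))
    growing.append([1, 2, 3])
    growing.append([4, 5])

    # VLArray: each row may hold a different number of elements.
    ragged = h5.create_vlarray("/", "ragged", atom=tb.Int32Atom())
    ragged.append([1, 2])
    ragged.append([3, 4, 5])

    # Collect results while the file is still open.
    fixed_shape = fixed.shape
    growing_rows = growing.nrows
    ragged_lengths = [len(row) for row in ragged]

print(fixed_shape, growing_rows, ragged_lengths)
```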

Basic data types:
* bool, 8 bits.
* int: 8, 16, 32 (default), 64.
* uint: unsigned integers, 8, 16, 32 (default), 64.
* float: 16, 32, 64 (default).
* complex: 64 and 128 (default).
* string: 8-bits.
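
In PyTables these types are described by "atoms" (for arrays) and "columns" (for tables). The sketch below, which assumes PyTables 3.x, inspects the size of a few atoms; the variable names are only illustrative.

```python
import numpy as np
import tables as tb

default_int = tb.IntAtom()            # signed integer with the default size
default_float = tb.FloatAtom()        # floating point with the default size
small_uint = tb.UInt8Atom()           # explicit 8-bit unsigned integer
complex_atom = tb.ComplexAtom(itemsize=16)  # the size must be given in bytes
text_atom = tb.StringAtom(itemsize=16)      # fixed-width string of 16 bytes

# Atoms can also be derived from a NumPy dtype.
from_numpy = tb.Atom.from_dtype(np.dtype("float32"))

for label, atom in [("int", default_int), ("float", default_float),
                    ("uint8", small_uint), ("complex", complex_atom),
                    ("float32 from NumPy", from_numpy)]:
    print(label, atom.itemsize * 8, "bits")
```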


Documentation is here: http://pytables.github.io/

## Creating an HDF5 file

In [2]:
%ls
%rm temp.h5

A quick scraping example.ipynb
Alchemy with SQL.ipynb
Getting data from Web APIs.ipynb
HDF5 with PyTables.ipynb
Parsing XML with lxml.ipynb
Reading tabular data with Pandas.ipynb
Starting data science!.ipynb
The World’s Biggest Public Companies List - Forbes.html
churn.txt
churn2.txt
ejercicios_propuestos.docx
fa.xml
forbes-fragment.html
hdf5.PPT
programs.jpg
sqlalchemy.PPT
rm: temp.h5: No such file or directory


In [3]:
import numpy as np
import tables as tb
from tables import IsDescription, StringCol, Int64Col, UInt16Col

# Open for writing (from scratch).
# Other modes: "r" (read only), "a" (append; creates the file if missing),
# "r+" (like "a", but the file must already exist).
f = tb.open_file("temp.h5", "w", title="My title of the file.")

# Describe the columns of each table row.
class User(IsDescription):
    name      = StringCol(16)   # 16-character string
    interest  = Int64Col()      # Signed 64-bit integer
    visits    = UInt16Col()     # Unsigned 16-bit integer


# Create a table in the root of the file.
f.create_table("/", "my_first_table", User)
print(f)
f.close()

temp.h5 (File) 'My title of the file.'
Last modif.: 'Fri May  8 14:53:35 2015'
Object Tree: 
/ (RootGroup) 'My title of the file.'
/my_first_table (Table(0,)) ''
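
The table created above is still empty. One way to fill such a table is through its `row` accessor, as in the following sketch. It assumes PyTables 3.x; the file name and the sample records are made up for illustration.

```python
import os
import tempfile

import tables as tb


class User(tb.IsDescription):
    name = tb.StringCol(16)    # 16-character string
    interest = tb.Int64Col()   # signed 64-bit integer
    visits = tb.UInt16Col()    # unsigned 16-bit integer


# Hypothetical scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "users.h5")

with tb.open_file(path, "w") as h5:
    table = h5.create_table("/", "my_first_table", User)

    row = table.row  # buffer that holds one record at a time
    for name, interest, n_visits in [("alice", 7, 3), ("bob", 42, 10)]:
        row["name"] = name
        row["interest"] = interest
        row["visits"] = n_visits
        row.append()   # queue the record for writing
    table.flush()      # push buffered records to disk

    n_rows = table.nrows
    visits_read = table.col("visits").tolist()

print(n_rows, visits_read)
```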



## HDF5 is a file system in a file

Groups are like folders (directories). As in operating system file systems, there are also soft and hard links.

In [4]:
f = tb.open_file("temp2.h5", "w", title="Another HDF5 file.")

print(f.root)  # The root of the hierarchy


/ (RootGroup) 'Another HDF5 file.'


New nodes must be created through the file handle:

In [5]:
f.create_group("/", "new_group", "A new group!")

# Pythonic navigation in the hierarchy ("natural naming" like in lxml):
print(f.root.new_group)

/new_group (Group) 'A new group!'


Datasets (typically tables and arrays) are also created through the file handle:

In [6]:
f.create_array("/new_group", "some_data", [7, 24, 56, 78])
print(f.root.new_group.some_data)

/new_group/some_data (Array(4,)) ''
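
Besides natural naming, nodes can be looked up by path or discovered by walking the hierarchy. The sketch below assumes PyTables 3.x and rebuilds the same small hierarchy in a hypothetical scratch file so that it runs on its own.

```python
import os
import tempfile

import tables as tb

# Hypothetical scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "navigation.h5")

with tb.open_file(path, "w") as h5:
    h5.create_group("/", "new_group", "A new group!")
    h5.create_array("/new_group", "some_data", [7, 24, 56, 78])

    # Look a node up by its full path, or by parent path plus name.
    by_path = h5.get_node("/new_group/some_data")
    by_parts = h5.get_node("/new_group", "some_data")
    same_path = by_path._v_pathname == by_parts._v_pathname

    # Walk the whole tree, keeping only array nodes.
    array_paths = [node._v_pathname
                   for node in h5.walk_nodes("/", classname="Array")]

    values = list(by_path.read())

print(same_path, array_paths, values)
```

Path-based lookup is convenient when node names are only known at run time, for example when they come from a configuration file.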


We can obtain metadata and information on particular nodes of the hierarchy via Python attributes:

In [7]:
print(f.root.new_group.some_data.size_on_disk)
print(f.root.new_group.some_data.byteorder)
print(f.root.new_group.some_data.flavor)
# etc...

32
little
python
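
The soft and hard links mentioned at the start of this section can be sketched as follows, assuming PyTables 3.x. The file name and the link names `data_hard` and `data_soft` are hypothetical.

```python
import os
import tempfile

import tables as tb

# Hypothetical scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "links.h5")

with tb.open_file(path, "w") as h5:
    group = h5.create_group("/", "new_group")
    data = h5.create_array(group, "some_data", [7, 24, 56, 78])

    # Hard link: a second name for the same dataset.
    hard = h5.create_hard_link("/", "data_hard", data)

    # Soft link: stores the target path and is resolved on access.
    soft = h5.create_soft_link("/", "data_soft", "/new_group/some_data")

    via_hard = list(hard.read())
    via_soft = list(soft().read())  # calling the link dereferences it
    soft_target = soft.target

print(via_hard, via_soft, soft_target)
```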


## Tables need descriptions

Tables are extendable arrays that have a structured or **record** data type.

In [8]:
dt = np.dtype([('country', 'S2'), ('sales', int)])
sales = np.array([("es", 200), ("fr", 300), ("uk", 500)], dtype=dt)
f.create_table("/new_group", "sales", dt)


/new_group/sales (Table(0,)) ''
  description := {
  "country": StringCol(itemsize=2, shape=(), dflt='', pos=0),
  "sales": Int64Col(shape=(), dflt=0, pos=1)}
  byteorder := 'little'
  chunkshape := (6553,)

In [11]:
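# Each run of this cell appends the three rows again; the 6 rows
# reported below suggest it was run twice.
f.root.new_group.sales.append(sales)
print(f.root.new_group.sales)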

/new_group/sales (Table(6,)) ''


In [10]:
print(f.root.new_group.sales.coldescrs)

{'country': StringCol(itemsize=2, shape=(), dflt='', pos=0), 'sales': Int64Col(shape=(), dflt=0, pos=1)}
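
When the data already lives in a NumPy structured array, the description and the rows can be supplied in one step. The sketch below assumes the `obj` argument of `create_table` available in PyTables 3.x; the file name is hypothetical.

```python
import os
import tempfile

import numpy as np
import tables as tb

dt = np.dtype([("country", "S2"), ("sales", np.int64)])
sales = np.array([(b"es", 200), (b"fr", 300), (b"uk", 500)], dtype=dt)

# Hypothetical scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sales.h5")

with tb.open_file(path, "w") as h5:
    # The description is inferred from the array's dtype,
    # and the rows are written immediately.
    table = h5.create_table("/", "sales", obj=sales)

    n_rows = table.nrows
    col_names = list(table.colnames)
    sales_type = table.coltypes["sales"]

print(n_rows, col_names, sales_type)
```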


## Reading datasets

PyTables tries to return data in its original Python format. In particular, it returns NumPy arrays, and datasets can be manipulated as in NumPy (indexing, slicing, masking).

In [12]:
a = f.root.new_group.sales[:]
print(type(a))
print(f.root.new_group.sales[1])

<type 'numpy.ndarray'>
('fr', 300)
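
Slicing and masking can be sketched as below, together with `read_where`, which evaluates a condition inside PyTables instead of loading the whole table first. The example assumes PyTables 3.x and rebuilds a small sales table in a hypothetical scratch file.

```python
import os
import tempfile

import numpy as np
import tables as tb

dt = np.dtype([("country", "S2"), ("sales", np.int64)])
sales = np.array([(b"es", 200), (b"fr", 300), (b"uk", 500)], dtype=dt)

# Hypothetical scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "reading.h5")

with tb.open_file(path, "w") as h5:
    table = h5.create_table("/", "sales", obj=sales)

    a = table[:]            # whole table as a NumPy structured array
    sliced = table[1:3]     # only rows 1 and 2 are read from disk

    # Masking in NumPy, after the data has been loaded.
    masked = a[a["sales"] > 250]

    # The same filter evaluated by PyTables while reading.
    queried = table.read_where("sales > 250")

print(masked["country"].tolist(), queried["sales"].tolist())
```

For tables that do not fit in memory, the `read_where` form is the one that scales, since the mask version loads every row before filtering.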


In [13]:
# Remember to close the file when you are done with it.
f.close()
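
Forgetting to close a file is easy. A possible alternative, assuming PyTables 3.x, is to open the file in a `with` block so that it is closed automatically; the file name here is hypothetical.

```python
import os
import tempfile

import tables as tb

# Hypothetical scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "autoclose.h5")

with tb.open_file(path, "w", title="Closed automatically") as h5:
    h5.create_array("/", "some_data", [7, 24, 56, 78])
    open_inside = bool(h5.isopen)

# The file is closed on leaving the block, even if an error occurred.
open_after = bool(h5.isopen)

print(open_inside, open_after)
```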

## Datasets and pandas

pandas provides the HDFStore class to read and write DataFrames from and to HDF5 files.

In [14]:
import pandas as pd

In [15]:
store = pd.HDFStore('churn.h5', mode='w')
for chunk in pd.read_csv('churn.txt', chunksize=50):
    store.append('churn', chunk)

In [16]:
print(store)
print(store.root.churn.table[:5])

<class 'pandas.io.pytables.HDFStore'>
File path: churn.h5
/churn            frame_table  (typ->appendable,nrows->3333,ncols->21,indexers->[index])
[ (0, [265.1, 45.07, 197.4, 16.78, 244.7, 11.01, 10.0, 2.7], [128, 415, 25, 110, 99, 91, 3, 1], ['KS', '382-4657', 'no', 'yes', 'False.'])
 (1, [161.6, 27.47, 195.5, 16.62, 254.4, 11.45, 13.7, 3.7], [107, 415, 26, 123, 103, 103, 3, 1], ['OH', '371-7191', 'no', 'yes', 'False.'])
 (2, [243.4, 41.38, 121.2, 10.3, 162.6, 7.32, 12.2, 3.29], [137, 415, 0, 114, 110, 104, 5, 0], ['NJ', '358-1921', 'no', 'no', 'False.'])
 (3, [299.4, 50.9, 61.9, 5.26, 196.9, 8.86, 6.6, 1.78], [84, 408, 0, 71, 88, 89, 7, 2], ['OH', '375-9999', 'yes', 'no', 'False.'])
 (4, [166.7, 28.34, 148.3, 12.61, 186.9, 8.41, 10.1, 2.73], [75, 415, 0, 113, 122, 121, 3, 3], ['OK', '330-6626', 'yes', 'no', 'False.'])]


In [17]:
store.close()
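
Reading data back, optionally with a filter, can be sketched as follows. Since `churn.txt` is not reproduced here, the example uses a small made-up DataFrame and a hypothetical file name. It assumes that pandas is installed together with PyTables, which pandas needs for HDF5 support.

```python
import os
import tempfile

import pandas as pd

# Made-up sample data standing in for a real dataset.
df = pd.DataFrame({"country": ["es", "fr", "uk"],
                   "sales": [200, 300, 500]})

# Hypothetical scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sales_pandas.h5")

# format="table" stores an appendable, queryable table;
# data_columns lists the columns that may appear in a `where` filter.
df.to_hdf(path, key="sales", mode="w", format="table",
          data_columns=["sales"])

everything = pd.read_hdf(path, key="sales")
big = pd.read_hdf(path, key="sales", where="sales > 250")

print(len(everything), big["country"].tolist())
```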