# Using HDF5 Format

Several tools make it possible to read and write large amounts of scientific data efficiently in binary format on disk. A popular industry-grade library for this is HDF5, a C library with interfaces in many other languages, including Java, Python, and MATLAB. The “HDF” in HDF5 stands for hierarchical data format. Each HDF5 file contains an internal, file system-like node structure that lets you store multiple datasets along with supporting metadata. Compared with simpler formats, HDF5 supports on-the-fly compression with a variety of compressors, so data with repeated patterns can be stored more efficiently. For very large datasets that don’t fit into memory, HDF5 is a good choice because you can efficiently read and write small sections of much larger arrays.

Python has two interfaces to the HDF5 library, PyTables and h5py, and each takes a different approach. h5py provides a direct but high-level interface to the HDF5 API, while PyTables abstracts many of the details of HDF5 to provide multiple flexible data containers, table indexing, querying capability, and some support for out-of-core computations.
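
As a rough illustration of the features described above (hierarchical nodes, metadata, compression, and partial reads), here is a minimal h5py sketch. It assumes h5py and NumPy are installed. The array contents, the group name `experiment`, the dataset name `measurements`, and the temporary file path are all invented for this example.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical data: every row repeats 0..99, so it compresses well.
data = np.tile(np.arange(100, dtype="int64"), (1000, 1))

# Throwaway location for the example file.
path = os.path.join(tempfile.mkdtemp(), "example.h5")

with h5py.File(path, "w") as f:
    # Groups play the role of directories inside the file.
    grp = f.create_group("experiment")
    # Compression needs chunked storage; chunks=True lets h5py pick a chunk shape.
    dset = grp.create_dataset(
        "measurements", data=data, chunks=True, compression="gzip"
    )
    # Attributes store metadata next to the dataset.
    dset.attrs["description"] = "illustrative data"

with h5py.File(path, "r") as f:
    dset = f["experiment/measurements"]
    stored_shape = dset.shape
    description = dset.attrs["description"]
    # Slicing reads only the chunks that cover the requested region,
    # not the whole array.
    block = dset[10:13, :5]

print(stored_shape, block.shape)
```

The slice in the last step is what makes HDF5 practical for arrays larger than memory: the full dataset stays on disk and only the selected region is loaded.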

pandas has a minimal dict-like HDFStore class, which uses PyTables to store pandas objects:

In [16]:
import pandas as pd

In [17]:
store = pd.HDFStore('../../CSV Files/O_Reilly/ch06/mydata.h5')

In [19]:
frame = pd.read_csv('../../CSV Files/O_Reilly/ch06/ex1.csv')

In [21]:
store['obj1'] = frame

In [23]:
store['obj1_col'] = frame['a']

Objects contained in the HDF5 file can be retrieved in a dict-like fashion:

In [26]:
store['obj1_col']

0    1
1    5
2    9
Name: a, dtype: int64
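
The cells above depend on local files, so the following self-contained sketch repeats the same round trip with a temporary file. It assumes PyTables is installed, since `HDFStore` relies on it. The DataFrame is a stand-in for `ex1.csv`: column `a` matches the output shown above, while column `b` is invented for illustration.

```python
import os
import tempfile

import pandas as pd

# Stand-in for ex1.csv; column "b" is hypothetical.
frame = pd.DataFrame({"a": [1, 5, 9], "b": [2, 6, 10]})

# Throwaway location for the example file.
path = os.path.join(tempfile.mkdtemp(), "mydata.h5")

# Using the store as a context manager closes the file on exit.
with pd.HDFStore(path) as store:
    store["obj1"] = frame          # store a DataFrame
    store["obj1_col"] = frame["a"] # store a single column as a Series
    keys = store.keys()            # node paths currently in the file
    col = store["obj1_col"]        # dict-like retrieval

print(sorted(keys))
print(col)
```

Opening the store in a `with` block is one way to make sure the file is closed; calling `store.close()` explicitly works as well.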

HDF5 is not a database. It is best suited for write-once, read-many datasets. While data can be added to a file at any time, if multiple writers do so simultaneously, the file can become corrupted.
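
One way to follow the write-once, read-many pattern with pandas is to write the file in a single step and open it read-only afterward. The sketch below assumes PyTables is installed; the data, the key `obj1`, and the temporary path are made up for the example. Read-only mode does not coordinate concurrent writers, it only keeps the readers in this sketch from modifying the file.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data and throwaway file location.
frame = pd.DataFrame({"a": [1, 5, 9]})
path = os.path.join(tempfile.mkdtemp(), "readonly_example.h5")

# Write once: mode="w" creates the file, replacing any existing one.
frame.to_hdf(path, key="obj1", mode="w")

# Read many: mode="r" opens the file read-only.
first = pd.read_hdf(path, key="obj1", mode="r")

with pd.HDFStore(path, mode="r") as store:
    second = store["obj1"]

print(first.equals(second))
```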