# Getting started with HDF5 and PyTables

*11/03/2018 - Giacomo Debidda @PyCon Slovakia*

In [2]:
import os
import numpy as np
import pandas as pd
import tables as tb

In [3]:
np.set_printoptions(precision=2, suppress=True)

In [20]:
tb.print_versions()

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
PyTables version:    3.4.2
HDF5 version:        1.8.18
NumPy version:       1.14.1
Numexpr version:     2.6.4 (not using Intel's VML/MKL)
Zlib version:        1.2.11 (in Python interpreter)
LZO version:         2.09 (Feb 04 2015)
BZIP2 version:       1.0.6 (6-Sept-2010)
Blosc version:       1.11.3 (2017-03-09)
Blosc compressors:   blosclz (1.0.5), lz4 (1.7.5), lz4hc (1.7.5), snappy (1.1.1), zlib (1.2.8), zstd (1.1.3)
Blosc filters:       shuffle, bitshuffle
Python version:      3.6.3 |Anaconda, Inc.| (default, Nov 20 2017, 20:41:42) 
[GCC 7.2.0]
Platform:            Linux-4.4.0-116-generic-x86_64-with-debian-stretch-sid
Byte-ordering:       little
Detected cores:      4
Default encoding:    utf-8
Default FS encoding: utf-8
Default locale:      (en_US, UTF-8)
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=


In [21]:
data_dir = os.path.join(os.getcwd(), 'data')
print(data_dir)

/home/jack/Repos/hdf5-pycon-slovakia/data


### HDF5: a filesystem in a file

HDF5 is a data model, library, and file format for storing and managing big and complex data.

An HDF5 file can be thought of as a container (or group) that holds a variety of heterogeneous data objects (or datasets). The datasets can be almost anything: images, tables, graphs, or even documents, such as PDF or Excel.

- Datasets (i.e. files in a filesystem)
- Groups (i.e. directories in a filesystem)
- Attributes (i.e. metadata of file/directory)

![HDF5 structure](img/hdf5_structure.jpg)

Working with groups and group members is similar to working with directories and files in UNIX.

**/** root group (every HDF5 file has a root group)

**/foo** member of the root group called foo

**/foo/bar** member of the group foo called bar
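As a minimal sketch of the path layout above, the following builds `/foo` and `/foo/bar` with PyTables and lists the resulting group paths. The file name `hierarchy.h5` and the temporary directory are assumptions made only for this illustration.

```python
import os
import tempfile

import tables as tb

# Hypothetical scratch file; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'hierarchy.h5')

with tb.open_file(path, mode='w') as f:
    foo = f.create_group(where='/', name='foo')   # /foo, a member of the root group
    f.create_group(where=foo, name='bar')         # /foo/bar, a member of the group foo
    # walk_groups() visits the root group first, then its descendants
    group_paths = [group._v_pathname for group in f.walk_groups('/')]

print(group_paths)
```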

### HDF5 in the Python data stack

![h5py - PyTables refactor](img/h5py-pytables-refactor.png)

![PyTables logo](img/pytables-logo.png)

- Does not want to be a complete wrapper for the entire HDF5 C API
- High level abstraction over HDF5 (it's more "battery included" than h5py)
- Does not depend on h5py (at the moment)
- Natural naming
- Fast searches (indexing, out-of-core querying)
- Built-in compression
- Undo mode

### Groups, Attributes, ~~Datasets~~

Let's have a brief overview of the PyTables API. And let's forget HDF5 Datasets for a moment...

- how to create groups, attributes and arrays
- how to access nodes in an HDF5 file
- how to access data in a node

### Natural naming

PyTables nodes (i.e. datasets and groups in the HDF5 file) can be accessed with the dot notation if they follow a *natural naming* schema (like in pandas).

In [8]:
with tb.open_file(filename='data/my_pytables_file.h5', mode='w') as f:
    f.create_group(where='/', name='my_group')    

In [9]:
with tb.open_file(filename='data/my_pytables_file.h5', mode='w') as f:
    f.create_group(where='/', name='my group')



When a node does not follow the natural naming schema, you can still use `get_node` to access it.

In [10]:
with tb.open_file(filename='data/my_pytables_file.h5', mode='r') as f:
    my_group = f.get_node(where='/my group')
    print(my_group)
    my_group = f.get_node(where='/', name='my group')
    print(my_group)

/my group (Group) ''
/my group (Group) ''
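For comparison, here is a small sketch of the two access styles on a node whose name *does* follow natural naming. The file path and the node name `numbers` are made up for the example.

```python
import os
import tempfile

import tables as tb

# Hypothetical scratch file used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'natural_naming.h5')

with tb.open_file(path, mode='w') as f:
    f.create_group(where='/', name='my_group')
    f.create_array(where='/my_group', name='numbers', obj=[1, 2, 3])

    # Dot notation works because 'my_group' and 'numbers' are valid Python identifiers.
    via_dot = f.root.my_group.numbers.read()
    # get_node works for any node, whatever its name looks like.
    via_get_node = f.get_node('/my_group/numbers').read()

print(via_dot, via_get_node)
```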


### Let's create some data

In [11]:
arr1d = np.linspace(start=0, stop=100, num=10, dtype=np.int8)
arr1d

array([  0,  11,  22,  33,  44,  55,  66,  77,  88, 100], dtype=int8)

In [72]:
arr2d = np.arange(30, dtype=np.float32).reshape(10, 3)
arr2d

array([[ 0.,  1.,  2.],
       [ 3.,  4.,  5.],
       [ 6.,  7.,  8.],
       [ 9., 10., 11.],
       [12., 13., 14.],
       [15., 16., 17.],
       [18., 19., 20.],
       [21., 22., 23.],
       [24., 25., 26.],
       [27., 28., 29.]], dtype=float32)

In [13]:
arr3d = np.random.random((10, 3, 5)).astype('float32')
arr3d

array([[[0.49, 0.66, 0.37, 0.81, 0.17],
        [0.5 , 0.83, 0.25, 0.42, 0.09],
        [0.67, 0.95, 0.93, 0.86, 0.83]],

       [[0.47, 0.31, 0.99, 0.87, 0.91],
        [0.69, 0.67, 0.63, 0.17, 0.25],
        [0.47, 0.67, 0.57, 0.17, 0.45]],

       [[0.24, 0.84, 0.15, 0.44, 0.84],
        [0.36, 0.07, 0.43, 0.15, 0.65],
        [0.2 , 0.65, 0.08, 0.89, 0.66]],

       [[0.81, 0.41, 0.16, 0.81, 0.01],
        [0.15, 0.44, 0.13, 0.51, 0.07],
        [0.32, 0.32, 0.51, 0.6 , 0.34]],

       [[0.68, 0.98, 0.42, 0.98, 0.09],
        [0.61, 0.68, 0.36, 0.81, 0.49],
        [0.72, 0.97, 0.94, 0.39, 0.94]],

       [[0.35, 0.16, 0.43, 0.51, 0.36],
        [0.71, 0.68, 0.85, 0.37, 0.97],
        [0.22, 0.33, 0.52, 0.12, 0.88]],

       [[0.85, 0.25, 0.07, 0.79, 0.27],
        [0.92, 0.07, 0.31, 0.19, 0.96],
        [0.82, 0.27, 0.35, 0.71, 0.28]],

       [[0.63, 0.35, 0.03, 0.88, 0.4 ],
        [0.43, 0.15, 0.25, 1.  , 0.06],
        [0.27, 0.09, 0.11, 0.73, 0.3 ]],

       [[0.13, 0.92, 0.5

### Create groups, attributes, arrays

In [14]:
with tb.open_file(filename='data/my_pytables_file.h5', mode='w') as f:
    f.create_array(where='/', 
                   name='array_1d',
                   title='A one-dimensional Array',
                   obj=arr1d)
    f.set_node_attr(where='/array_1d', attrname='SomeAttribute', attrvalue='SomeValue')
    
    f.create_group(where='/', name='multidimensional_data', title='Multi dimensional data')
    f.set_node_attr(where='/multidimensional_data', attrname='SomeOtherAttribute', attrvalue=123)
    
    f.create_array(where='/multidimensional_data', 
                   name='array_2d',
                   title='A two-dimensional Array',
                   obj=arr2d)
    
    f.create_array(where=f.root.multidimensional_data, 
                   name='array_3d',
                   title='A three-dimensional Array',
                   obj=arr3d)
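Attributes written with `set_node_attr` can be read back in a couple of ways. The sketch below assumes a throwaway file and reuses the attribute name from the cell above; it is one possible way to inspect them, not the only one.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Hypothetical scratch file used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'attributes.h5')

with tb.open_file(path, mode='w') as f:
    f.create_array(where='/', name='array_1d', obj=np.arange(10, dtype=np.int8))
    f.set_node_attr(where='/array_1d', attrname='SomeAttribute', attrvalue='SomeValue')

with tb.open_file(path, mode='r') as f:
    # Read one attribute through the File object...
    via_file = f.get_node_attr(where='/array_1d', attrname='SomeAttribute')
    # ...or through the node's attribute set.
    via_attrs = f.root.array_1d.attrs.SomeAttribute
    # List only the user-defined attributes (CLASS, TITLE, etc. are system attributes).
    user_attrs = f.root.array_1d.attrs._f_list('user')

print(via_file, via_attrs, user_attrs)
```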

### Traverse an HDF5 file

In [15]:
with tb.open_file('data/my_pytables_file.h5', 'r') as f:
    for node in f:
        print(node)
        
#     for node in f.walk_groups():
#         print(node)

/ (RootGroup) ''
/array_1d (Array(10,)) 'A one-dimensional Array'
/multidimensional_data (Group) 'Multi dimensional data'
/multidimensional_data/array_2d (Array(10, 3)) 'A two-dimensional Array'
/multidimensional_data/array_3d (Array(10, 3, 5)) 'A three-dimensional Array'
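Iterating over the file visits every node. If you only want some of them, `walk_nodes` takes a `classname` filter and `list_nodes` returns the direct children of a group. A small sketch on a made-up file:

```python
import os
import tempfile

import numpy as np
import tables as tb

# Hypothetical scratch file used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'traverse.h5')

with tb.open_file(path, mode='w') as f:
    f.create_array(where='/', name='array_1d', obj=np.arange(10))
    f.create_group(where='/', name='multidimensional_data')
    f.create_array(where='/multidimensional_data', name='array_2d', obj=np.zeros((10, 3)))

with tb.open_file(path, mode='r') as f:
    # Only Array nodes, anywhere below the root group
    array_paths = sorted(
        node._v_pathname for node in f.walk_nodes(where='/', classname='Array'))
    # Only the direct children of one group
    child_names = [node._v_name for node in f.list_nodes('/multidimensional_data')]

print(array_paths)
print(child_names)
```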


### Select a hyperslab of data

Hyperslabs are portions of datasets. A hyperslab selection can be a logically contiguous collection of points in a dataspace, or it can be a regular pattern of points or blocks in a dataspace.

In [100]:
with tb.open_file(filename='data/my_pytables_file.h5', mode='r') as f:
    print(repr(f.root.multidimensional_data.array_3d[2:7, 1, :]))

array([[0.35, 0.56, 0.18, 0.89, 0.73],
       [0.77, 0.29, 0.74, 0.28, 0.59],
       [0.53, 0.12, 0.61, 0.23, 0.21],
       [0.7 , 0.39, 0.87, 0.38, 0.89],
       [0.8 , 0.05, 0.3 , 0.41, 0.17]], dtype=float32)
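To make the "contiguous" versus "regular pattern" distinction concrete, here is a sketch that reads both kinds of selection from disk and compares them with the same NumPy slices. The array contents and the file path are assumptions chosen so the result is easy to check.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Hypothetical scratch file and deterministic data used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'hyperslab.h5')
data = np.arange(150, dtype=np.float32).reshape(10, 3, 5)

with tb.open_file(path, mode='w') as f:
    f.create_array(where='/', name='array_3d', obj=data)

with tb.open_file(path, mode='r') as f:
    node = f.root.array_3d
    contiguous = node[2:7, 1, :]                     # a contiguous block
    strided = node[::2, :, 0]                        # a regular pattern (every other row)
    main_dim = node.read(start=0, stop=4, step=2)    # rows 0 and 2 along the first dimension

print(contiguous.shape, strided.shape, main_dim.shape)
```

Only the selected elements are read from the file; the rest of the dataset stays on disk.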


### HDF5 CLI utils

[Here](https://support.hdfgroup.org/products/hdf5_tools/#h5dist) you can find the command line tools developed by the HDF Group. You don't need h5py or PyTables to use them.

If you are on Ubuntu, you can install them with `sudo apt install hdf5-tools`

In [16]:
# -r stands for 'recursive'
!h5ls -r 'data/my_pytables_file.h5'

/                        Group
/array_1d                Dataset {10}
/multidimensional_data   Group
/multidimensional_data/array_2d Dataset {10, 3}
/multidimensional_data/array_3d Dataset {10, 3, 5}


In [17]:
!h5dump 'data/my_pytables_file.h5'

HDF5 "data/my_pytables_file.h5" {
GROUP "/" {
   ATTRIBUTE "CLASS" {
      DATATYPE  H5T_STRING {
         STRSIZE 5;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "GROUP"
      }
   }
   ATTRIBUTE "PYTABLES_FORMAT_VERSION" {
      DATATYPE  H5T_STRING {
         STRSIZE 3;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "2.1"
      }
   }
   ATTRIBUTE "TITLE" {
      DATATYPE  H5T_STRING {
         STRSIZE 1;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  NULL
      DATA {
      }
   }
   ATTRIBUTE "VERSION" {
      DATATYPE  H5T_STRING {
         STRSIZE 3;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DAT

### PyTables CLI utils

Some very useful tools are shipped with PyTables. These are just Python scripts that can be used from the command line.

In [18]:
# -a show attributes, -v verbose
!ptdump -av 'data/my_pytables_file.h5'

/ (RootGroup) ''
  /._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    PYTABLES_FORMAT_VERSION := '2.1',
    TITLE := '',
    VERSION := '1.0']
/array_1d (Array(10,)) 'A one-dimensional Array'
  atom := Int8Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
  /array_1d._v_attrs (AttributeSet), 5 attributes:
   [CLASS := 'ARRAY',
    FLAVOR := 'numpy',
    SomeAttribute := 'SomeValue',
    TITLE := 'A one-dimensional Array',
    VERSION := '2.4']
/multidimensional_data (Group) 'Multi dimensional data'
  /multidimensional_data._v_attrs (AttributeSet), 4 attributes:
   [CLASS := 'GROUP',
    SomeOtherAttribute := 123,
    TITLE := 'Multi dimensional data',
    VERSION := '1.0']
/multidimensional_data/array_2d (Array(10, 3)) 'A two-dimensional Array'
  atom := Float32Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
  /multidimensional_

In [19]:
!pttree -L 2 --use-si-units --sort-by 'size' 'data/my_pytables_file.h5'


------------------------------------------------------------

/ (RootGroup)
+--multidimensional_data (Group)
|  +--array_3d (Array)
|  |     mem=600.0B, disk=600.0B [82.2%]
|  `--array_2d (Array)
|        mem=120.0B, disk=120.0B [16.4%]
`--array_1d (Array)
      mem=10.0B, disk=10.0B [ 1.4%]

------------------------------------------------------------
Total branch leaves:    3
Total branch size:      730.0B in memory, 730.0B on disk
Mean compression ratio: 1.00
HDF5 file size:         6.0kB
------------------------------------------------------------



### HDF5 Viewers

- [HDFView](https://support.hdfgroup.org/products/java/hdfview/): I like it!
- [HDF Compass](https://support.hdfgroup.org/projects/compass/): seems to be the go-to choice for Mac users
- [ViTables](http://vitables.org/): weird... indexing starts at 1

> Remember when I said "let's forget HDF5 datasets for a moment"?

> Here is why...

### PyTables provides high-level abstractions over the HDF5 Dataset

Homogeneous dataset:

- **Array**
- **CArray**
- **EArray**
- **VLArray**

Heterogeneous dataset:

- **Table**

*Note*: some HDF5 libraries/tools may create an HDF5 dataset that PyTables does not currently support. In this case the dataset will be mapped to an [UnImplemented](http://www.pytables.org/usersguide/libref/helper_classes.html#tables.UnImplemented) class instance.

### Wait! What is a *Homogeneous* dataset?

A homogeneous dataset is a dataset where all its elements have the same [Atom](http://www.pytables.org/usersguide/libref/declarative_classes.html#atomclassdescr).

An Atom represents the **type and shape** of the atomic objects to be saved.

It's the most basic, indivisible element that can be stored in a given dataset. Atoms have the property that their length is always the same.

In [40]:
atom = tb.Int64Atom(shape=(2,))
atom

Int64Atom(shape=(2,), dflt=0)
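A quick way to see what "type and shape" means is to inspect the atom itself. This sketch only prints a few attributes of the atom defined above and builds another atom from a NumPy dtype.

```python
import numpy as np
import tables as tb

atom = tb.Int64Atom(shape=(2,))

# type/kind describe the base element, shape the atom's own dimensions
print(atom.type, atom.kind, atom.shape)
# itemsize is the size of one base element, size the size of the whole atom (in bytes)
print(atom.itemsize, atom.size)

# An atom can also be derived from a NumPy dtype
float_atom = tb.Atom.from_dtype(np.dtype('float32'))
print(float_atom)
```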

### Array

[Docs](http://www.pytables.org/usersguide/libref/homogenous_storage.html#the-array-class)

An Array contains homogeneous data. Every atomic object (i.e. every single element) has the same type and shape.

- Fastest I/O speed
- Must fit in memory
- Not compressible
- Not enlargeable

### CArray

[Docs](http://www.pytables.org/usersguide/libref/homogenous_storage.html#carrayclassdescr)

- Chunked layout, compressible
- Not enlargeable

In [61]:
filters = tb.Filters(complevel=5, complib='zlib')

Tips on how to use compression (from the PyTables docs)

- A mid-level (5) compression is sufficient. No need to go all the way up (9)
- Use zlib if you must guarantee complete portability
- Use blosc all other times (it is optimized for HDF5)

In [74]:
with tb.open_file(filename='data/my_pytables_file.h5', mode='w') as f:
    f.create_carray(
        where='/',
        name='my_carray',
        title='My PyTables CArray',
        obj=arr2d,
        filters=filters)
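To get a feeling for what the filters buy you, here is a rough sketch that writes the same CArray with and without zlib and compares the file sizes. The all-zeros array is an assumption picked because it compresses extremely well; real data will compress less.

```python
import os
import tempfile

import numpy as np
import tables as tb

tmp_dir = tempfile.mkdtemp()                        # hypothetical scratch directory
data = np.zeros((1000, 1000), dtype=np.float32)     # ~4 MB of highly compressible data


def write_carray(filename, filters):
    """Write `data` as a CArray and return the resulting file size in bytes."""
    path = os.path.join(tmp_dir, filename)
    with tb.open_file(path, mode='w') as f:
        f.create_carray(where='/', name='my_carray', obj=data, filters=filters)
    return os.path.getsize(path)


plain_size = write_carray('plain.h5', filters=None)
zlib_size = write_carray('zlib.h5', filters=tb.Filters(complevel=5, complib='zlib'))

print(plain_size, zlib_size)
```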

### EArray

[Docs](http://www.pytables.org/usersguide/libref/homogenous_storage.html#earrayclassdescr)

- Enlargeable on **one** dimension (append)
- Compressible

In [75]:
# One (and only one) of the shape dimensions *must* be 0.
# The dimension being 0 means that the resulting EArray object can be extended along it.
# Multiple enlargeable dimensions are not supported (at the moment).
num_columns = 5
shape = (0, num_columns)

with tb.open_file(filename='data/my_pytables_file.h5', mode='w') as f:
    # you can create an EArray and fill it later, but you need to specify atom and shape
    f.create_earray(
        where='/',
        name='my_earray',
        title='My PyTables EArray',
        atom=tb.Float32Atom(),
        shape=shape,
        filters=filters)

In [76]:
num_rows = 1000000  # 1 million
matrix = np.random.random((num_rows, num_columns)).astype('float32')

In [77]:
with tb.open_file(filename='data/my_pytables_file.h5', mode='a') as f:
    earray = f.root.my_earray
    earray.append(sequence=matrix[0:10, :])

In [78]:
with tb.open_file(filename='data/my_pytables_file.h5', mode='a') as f:
    earray = f.root.my_earray
    earray.append(sequence=matrix[10:50, :])
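When appending in several steps it is easy to skip or duplicate a row, so a loop over consecutive slices is a safer pattern. A minimal sketch, assuming a small 50 x 5 matrix and a batch size of 10 (both made up for the example):

```python
import os
import tempfile

import numpy as np
import tables as tb

# Hypothetical scratch file and small matrix used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'earray.h5')
num_rows, num_columns, batch_size = 50, 5, 10
matrix = np.random.random((num_rows, num_columns)).astype('float32')

with tb.open_file(path, mode='w') as f:
    earray = f.create_earray(
        where='/', name='my_earray', atom=tb.Float32Atom(), shape=(0, num_columns))
    # Consecutive, non-overlapping slices: [0:10], [10:20], ...
    for start in range(0, num_rows, batch_size):
        earray.append(matrix[start:start + batch_size, :])
    nrows = earray.nrows
    stored = earray.read()

print(nrows, stored.shape)
```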

### VLArray

[Docs](http://www.pytables.org/usersguide/libref/homogenous_storage.html#the-vlarray-class)

- Each row has a variable number of homogeneous elements (atoms)

See https://github.com/PyTables/PyTables/blob/develop/examples/vlarray1.py
    
See https://github.com/PyTables/PyTables/blob/develop/examples/vlarray2.py

In [89]:
with tb.open_file(filename='data/my_pytables_file.h5', mode='w') as f:
    vlarray = f.create_vlarray(where='/',
                               name='vlarray2',
                               atom=atom,
                               title='Ragged array of vectors')
    vlarray.append([[0, 1]])
    vlarray.append([[2, 3], [4, 5], [6, 7]])
    vlarray.append([[8, 9]])
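Reading the VLArray back shows the "ragged" structure: each row is a NumPy array with its own length, and every element has the shape of the atom. A sketch on a throwaway file (the path is an assumption):

```python
import os
import tempfile

import tables as tb

# Hypothetical scratch file used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'vlarray.h5')

with tb.open_file(path, mode='w') as f:
    vlarray = f.create_vlarray(
        where='/', name='vlarray2', atom=tb.Int64Atom(shape=(2,)),
        title='Ragged array of vectors')
    vlarray.append([[0, 1]])
    vlarray.append([[2, 3], [4, 5], [6, 7]])
    vlarray.append([[8, 9]])

with tb.open_file(path, mode='r') as f:
    rows = f.root.vlarray2.read()            # a list with one NumPy array per row
    row_shapes = [row.shape for row in rows]

print(row_shapes)
```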

### Table

[Docs](http://www.pytables.org/usersguide/libref/structured_storage.html?highlight=table#tableclassdescr)

See http://www.pytables.org/usersguide/libref/declarative_classes.html#tables.IsDescription

- Data can be heterogeneous (i.e. different shapes and different dtypes)
- The structure of a table is declared by its description
- multi-column searches

To map Python records onto HDF5 C structs, PyTables provides a special class, `IsDescription`, that makes it easy to define the fields of a record and their properties.

A *description* defines the table structure (basically, the *schema* of your table).
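Before moving to a real dataset, here is a tiny sketch of a description in action. The `Reading` class, its columns and the file path are all hypothetical; `pos` fixes the column order.

```python
import os
import tempfile

import tables as tb

# Hypothetical scratch file used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'table.h5')


class Reading(tb.IsDescription):
    """Hypothetical schema: one string column and two numeric columns."""
    sensor = tb.StringCol(16, pos=0)   # fixed-size string, 16 bytes
    value = tb.Float32Col(pos=1)
    flag = tb.UInt8Col(pos=2)


with tb.open_file(path, mode='w') as f:
    table = f.create_table(where='/', name='readings', description=Reading)
    row = table.row
    for i in range(3):
        row['sensor'] = 'sensor-%d' % i
        row['value'] = i * 1.5
        row['flag'] = i % 2
        row.append()
    table.flush()
    colnames = table.colnames
    nrows = table.nrows

print(colnames, nrows)
```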

### Introducing a real dataset: NYC yellow taxi dataset

I find it hard to reason about tables without a real-world example, so let's use the NYC yellow taxi trip records:
http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml

In [109]:
!less 'data/taxi+_zone_lookup.csv'

"LocationID","Borough","Zone","service_zone"
1,"EWR","Newark Airport","EWR"
2,"Queens","Jamaica Bay","Boro Zone"
3,"Bronx","Allerton/Pelham Gardens","Boro Zone"
4,"Manhattan","Alphabet City","Yellow Zone"
5,"Staten Island","Arden Heights","Boro Zone"
6,"Staten Island","Arrochar/Fort Wadsworth","Boro Zone"
7,"Queens","Astoria","Boro Zone"
8,"Queens","Astoria Park","Boro Zone"
9,"Queens","Auburndale","Boro Zone"
10,"Queens","Baisley Park","Boro Zone"
11,"Brooklyn","Bath Beach","Boro Zone"
12,"Manhattan","Battery Park","Yellow Zone"
13,"Manhattan","Battery Park City","Yellow Zone"
14,"Brooklyn","Bay Ridge","Boro Zone"
15,"Queens","Bay Terrace/Fort Totten","Boro Zone"
16,"Queens","Bayside","Boro Zone"
17,"Brooklyn","Bedford","Boro Zone"
18,"Bronx","Bedford Park","Boro Zone"
19,"Queens","Bellerose","Boro Zone"
20,"Bronx","Belmont","Boro Zone"
21,"Brooklyn","Bensonhurst East","Boro Zone"
22,"Brooklyn","Bensonhurst West","Boro Zone"

In [110]:
!less 'data/yellow_tripdata_2017-12.csv'

VendorID,tpep_pickup_datetime,tpep_dropoff_datetime,passenger_count,trip_distance,RatecodeID,store_and_fwd_flag,PULocationID,DOLocationID,payment_type,fare_amount,extra,mta_tax,tip_amount,tolls_amount,improvement_surcharge,total_amount

1,2017-12-01 00:12:00,2017-12-01 00:12:51,1,.00,1,N,226,226,3,2.5,0.5,0.5,0,0,0.3,3.8
1,2017-12-01 00:13:37,2017-12-01 00:13:47,1,.00,1,N,226,226,3,2.5,0.5,0.5,0,0,0.3,3.8
1,2017-12-01 00:14:15,2017-12-01 00:15:05,1,.00,1,N,226,226,3,2.5,0.5,0.5,0,0,0.3,3.8
1,2017-12-01 00:15:33,2017-12-01 00:15:37,1,.00,1,N,226,226,3,2.5,0.5,0.5,0,0,0.3,3.8
1,2017-12-01 00:50:03,2017-12-01 00:53:35,1,.00,1,N,145,145,2,4,0.5,0.5,0,0,0.3,5.3
1,2017-12-01 00:14:20,2017-12-01 00:28:35,1,4.20,1,N,82,258,2,15,0.5,0.5,0,0,0.3,16.3
1,2017-12-01 00:20:32,2017-12-01 00:31:24,1,5.40,1,N,50,116,2,17,0.5,0.5,0,0,0.3,18.3
1,2017-12-01 00:01:46,2017-12-01 00:12:19,1,1.90,1,N,161,107,1,9,0.5,0.5,2.05,0,0.3,12.35
1,2017-12-01 00:17:52,2017-12-01 00:32:35,1,3.30,1,N,107,263,1,12.5,0.5,0

In [111]:
# data dictionary for NY yellow taxi CSV files
# http://www.nyc.gov/html/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf
data_dictionary = {
    'VendorID': 'A code indicating the TPEP provider that provided the record',
    'tpep_pickup_datetime': 'The date and time when the meter was engaged',
    'tpep_dropoff_datetime': 'The date and time when the meter was disengaged',
    'passenger_count': 'The number of passengers in the vehicle',
    'trip_distance': 'The elapsed trip distance in miles reported by the taximeter',
    'PULocationID': 'A code indicating the zone + borough of the pickup location',
    'DOLocationID': 'A code indicating the zone + borough of the dropoff location',
    'payment_type': 'A numeric code signifying how the passenger paid for the trip',
    'fare_amount': 'The time-and-distance fare calculated by the meter',
    'tip_amount': 'Tip amount – This field is automatically populated for credit card tips. Cash tips are not included',
    'total_amount': 'The total amount charged to passengers. Does not include cash tips',
}

Table schema

In [112]:
class TaxiTableDescription(tb.IsDescription):
    vendor_id = tb.UInt8Col(pos=0)
    pickup_timestamp_ms = tb.Int64Col()
    dropoff_timestamp_ms = tb.Int64Col()
    passenger_count = tb.UInt8Col()
    trip_distance = tb.Float32Col()
    pickup_location_id = tb.UInt16Col()
    dropoff_location_id = tb.UInt16Col()
    payment_type = tb.UInt8Col()
    fare_amount = tb.Float32Col()
    tip_amount = tb.Float32Col()
    total_amount = tb.Float32Col()

In [113]:
h5_file_path = os.path.join(data_dir, 'NYC-yellow-taxis-100k.h5')
print(h5_file_path)

/home/jack/Repos/hdf5-pycon-slovakia/data/NYC-yellow-taxis-100k.h5


In [114]:
filters = tb.Filters(complevel=5, complib='zlib')

with tb.open_file(filename=h5_file_path, mode='w') as f:
    # create table with pre-defined schema
    f.create_table(
        where='/',
        name='yellow_taxis_2017_12',
        description=TaxiTableDescription,
        title='NYC Yellow Taxi data December 2017',
        filters=filters)
    # add metadata
    table_where = '/yellow_taxis_2017_12'
    for key, val in data_dictionary.items():
        f.set_node_attr(where=table_where, attrname=key, attrvalue=val)

In [115]:
!ptdump 'data/NYC-yellow-taxis-100k.h5'  # try also h5dump

/ (RootGroup) ''
/yellow_taxis_2017_12 (Table(0,), shuffle, zlib(5)) 'NYC Yellow Taxi data December 2017'


In [116]:
def date_to_timestamp_ms(date_obj):
    timestamp_in_nanoseconds = date_obj.astype('int64')
    timestamp_in_ms = (timestamp_in_nanoseconds / 1000000).astype('int64')
    return timestamp_in_ms

def fill_table(table, mapping, df):
    num_records = df.shape[0]  # it's equal to the chunksize used in read_csv
    row = table.row
    for i in range(num_records):
        row['vendor_id'] = df[mapping['vendor_id']].values[i]

        pickup_ms = date_to_timestamp_ms(df[mapping['pickup_datetime']].values[i])
        row['pickup_timestamp_ms'] = pickup_ms
        dropoff_ms = date_to_timestamp_ms(df[mapping['dropoff_datetime']].values[i])
        row['dropoff_timestamp_ms'] = dropoff_ms

        row['passenger_count'] = df['passenger_count'].values[i]
        row['trip_distance'] = df['trip_distance'].values[i]

        row['pickup_location_id'] = df['PULocationID'].values[i]
        row['dropoff_location_id'] = df['DOLocationID'].values[i]

        row['fare_amount'] = df['fare_amount'].values[i]
        row['tip_amount'] = df['tip_amount'].values[i]
        row['total_amount'] = df['total_amount'].values[i]

        row['payment_type'] = df['payment_type'].values[i]
        row.append()
    table.flush()

*Remember to flush:* flushing a table is a very important step. It not only helps maintain the integrity of your file, it also frees valuable memory resources (i.e. internal buffers) that your program may need for other things.
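Filling a table row by row is easy to follow but slow. A possible alternative, sketched below under the assumption of a cut-down three-column schema and a two-row DataFrame, is to build a NumPy structured array with the table's dtype and append it in one call. `TripDescription` and the sample values are made up for the illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd
import tables as tb

# Hypothetical scratch file used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'trips.h5')


class TripDescription(tb.IsDescription):
    """Hypothetical, cut-down version of the taxi schema."""
    pickup_timestamp_ms = tb.Int64Col(pos=0)
    passenger_count = tb.UInt8Col(pos=1)
    trip_distance = tb.Float32Col(pos=2)


df = pd.DataFrame({
    'tpep_pickup_datetime': pd.to_datetime(['2017-12-01 00:12:00', '2017-12-01 00:13:37']),
    'passenger_count': [1, 2],
    'trip_distance': [0.0, 4.2],
})

with tb.open_file(path, mode='w') as f:
    table = f.create_table(where='/', name='trips', description=TripDescription)
    # One structured array for the whole chunk, using the table's own record dtype
    records = np.zeros(len(df), dtype=table.dtype)
    pickup = df['tpep_pickup_datetime'].values
    records['pickup_timestamp_ms'] = pickup.astype('datetime64[ms]').astype('int64')
    records['passenger_count'] = df['passenger_count'].values
    records['trip_distance'] = df['trip_distance'].values
    table.append(records)
    table.flush()
    nrows = table.nrows
    # Convert the stored milliseconds back to a timestamp to check the round trip
    first_pickup = pd.to_datetime(int(table[0]['pickup_timestamp_ms']), unit='ms')

print(nrows, first_pickup)
```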

In [117]:
%%time
# Open the HDF5 file in 'a'ppend mode and populate the table with CSV data
with tb.open_file(filename=h5_file_path, mode='a') as f:
    # Left, the key we want to use. Right, the key in the CSV file
    mapping = {
        'vendor_id': 'VendorID',
        'pickup_datetime': 'tpep_pickup_datetime',
        'dropoff_datetime': 'tpep_dropoff_datetime',
        'pickup_location_id': 'PULocationID',
        'dropoff_location_id': 'DOLocationID'
    }

    # define the dtype to use when reading the CSV with pandas (this has nothing to do with the HDF5 table)
    dtype = {'VendorID': 'category', 'payment_type': 'category'}
    parse_dates = ['tpep_pickup_datetime', 'tpep_dropoff_datetime']

    table = f.get_node(where='/yellow_taxis_2017_12')
 
    csv_file_path = os.path.join(data_dir, 'yellow_tripdata_2017-12.csv')

    # read in chunks because these CSV files are too big
    chunksize = 100000
    for chunk in pd.read_csv(
        csv_file_path, chunksize=chunksize, dtype=dtype,
        skipinitialspace=True, parse_dates=parse_dates):
        df = chunk.reset_index(drop=True)
        fill_table(table, mapping, df)
        # remove the break statement to process all chunks (it will take ~20 minutes)
        break

CPU times: user 11.2 s, sys: 24 ms, total: 11.2 s
Wall time: 11.6 s


In [46]:
tb.is_pytables_file('data/NYC-yellow-taxis-100k.h5')

'2.1'

In [146]:
# help(table.remove_rows)

In [118]:
!pttree --use-si-units --sort-by 'size' 'data/NYC-yellow-taxis-100k.h5'


------------------------------------------------------------

/ (RootGroup)
`--yellow_taxis_2017_12 (Table)
      mem=3.9MB, disk=1.8MB [100.0%]

------------------------------------------------------------
Total branch leaves:    1
Total branch size:      3.9MB in memory, 1.8MB on disk
Mean compression ratio: 0.45
HDF5 file size:         1.8MB
------------------------------------------------------------



### Expressions

See https://github.com/tomkooij/scipy2017/blob/master/notebooks/07-Expressions.ipynb

In [83]:
x = np.random.uniform(low=1, high=5, size=num_rows).astype('float32')

In [84]:
%%time
with tb.open_file(filename='data/my_pytables_file.h5', mode='w') as f:
    carray = f.create_carray(where='/', name='carray_without_numexpr', atom=tb.Float32Atom(), shape=x.shape)
    carray[:] = x**3 + 0.5*x**2 - x

CPU times: user 140 ms, sys: 32 ms, total: 172 ms
Wall time: 232 ms


In [86]:
%%time
with tb.open_file(filename='data/my_pytables_file.h5', mode='w') as f:
    carray = f.create_carray(where='/', name='carray_with_numexpr', atom=tb.Float32Atom(), shape=x.shape)
    ex = tb.Expr('x**3 + 0.5*x**2 - x')
    ex.set_output(carray)  # output will go to the CArray on disk
    ex.eval()

CPU times: user 4 ms, sys: 56 ms, total: 60 ms
Wall time: 183 ms
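In the cell above `tb.Expr` finds `x` in the calling namespace. If you prefer to be explicit, you can pass the operands through `uservars`. The sketch below does that on a small in-memory array and checks the result against plain NumPy; the array size is an arbitrary choice.

```python
import numpy as np
import tables as tb

x = np.random.uniform(low=1, high=5, size=1000).astype('float32')

# Pass the operand explicitly instead of relying on the surrounding namespace
ex = tb.Expr('x**3 + 0.5*x**2 - x', uservars={'x': x})
result = ex.eval()   # no set_output() call, so the result comes back as a NumPy array

expected = x**3 + 0.5*x**2 - x
print(np.allclose(result, expected, rtol=1e-4))
```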


### Searches
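
A minimal sketch of an in-kernel search: the condition is passed as a string and evaluated by PyTables while it scans the table, so only the matching rows reach Python. The `Trip` schema, the four sample rows and the file path are assumptions made for this example.

```python
import os
import tempfile

import tables as tb

# Hypothetical scratch file used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'searches.h5')


class Trip(tb.IsDescription):
    """Hypothetical two-column schema."""
    passenger_count = tb.UInt8Col(pos=0)
    trip_distance = tb.Float32Col(pos=1)


with tb.open_file(path, mode='w') as f:
    table = f.create_table(where='/', name='trips', description=Trip)
    table.append([(1, 0.5), (2, 4.2), (4, 5.4), (1, 9.9)])
    table.flush()

    condition = '(passenger_count >= 2) & (trip_distance > 4.0)'
    # where() returns an iterator over the matching rows
    distances = [row['trip_distance'] for row in table.where(condition)]
    # read_where() returns the matching rows as a NumPy structured array
    matches = table.read_where(condition)

print(distances)
print(matches)
```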

### Indexes
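
A short sketch of how a column index is created. Queries on an indexed column are written exactly as before; PyTables decides whether to use the index. The schema, the 10,000 synthetic rows and the file path are assumptions, and on a table this small you should not expect a visible speed-up.

```python
import os
import tempfile

import numpy as np
import tables as tb

# Hypothetical scratch file used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'indexes.h5')


class Trip(tb.IsDescription):
    """Hypothetical two-column schema."""
    passenger_count = tb.UInt8Col(pos=0)
    trip_distance = tb.Float32Col(pos=1)


distances = np.linspace(0, 10, 10000).astype('float32')

with tb.open_file(path, mode='w') as f:
    table = f.create_table(where='/', name='trips', description=Trip)
    records = np.zeros(len(distances), dtype=table.dtype)
    records['passenger_count'] = 1
    records['trip_distance'] = distances
    table.append(records)
    table.flush()

    # Build an index on a single column
    table.cols.trip_distance.create_index()
    is_indexed = table.cols.trip_distance.is_indexed

    # Same query syntax as for a non-indexed column
    long_trips = table.read_where('trip_distance > 9.0')

print(is_indexed, len(long_trips))
```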

### Create custom datatype

See https://github.com/tomkooij/scipy2017/blob/master/notebooks/02-Datatypes-in-HDF5.ipynb

### Links

External links, soft links, hard links.

See https://github.com/PyTables/PyTables/blob/develop/examples/links.py

See http://www.pytables.org/usersguide/libref/file_class.html#tables.File.create_external_link
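
Here is a small sketch of soft and hard links inside a single file (external links work similarly but point to a node in another file). The group and node names and the file path are made up for the example.

```python
import os
import tempfile

import tables as tb

# Hypothetical scratch file used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'links.h5')

with tb.open_file(path, mode='w') as f:
    f.create_group(where='/', name='raw')
    f.create_array(where='/raw', name='numbers', obj=[1, 2, 3])
    f.create_group(where='/', name='views')

    # A soft link stores the *path* of its target
    soft = f.create_soft_link(where='/views', name='numbers_soft', target='/raw/numbers')
    # A hard link is a second name for the same dataset
    f.create_hard_link(where='/views', name='numbers_hard', target='/raw/numbers')

    soft_values = soft().read()                                  # call the link to dereference it
    hard_values = f.get_node('/views/numbers_hard').read()

print(soft_values, hard_values)
```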

### Add column to table (like a database schema migration)

### Filenode

See https://github.com/PyTables/PyTables/blob/develop/examples/filenodes1.py
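
A minimal sketch of the `filenode` module, which lets you treat a node in the HDF5 file as a file-like object. The node name `notes_txt`, its content and the file path are assumptions made for this example.

```python
import os
import tempfile

import tables as tb
from tables.nodes import filenode

# Hypothetical scratch file used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'filenode.h5')

with tb.open_file(path, mode='w') as f:
    # Create a new file node and write bytes to it, like a regular binary file
    fnode = filenode.new_node(f, where='/', name='notes_txt')
    fnode.write(b'HDF5 can store plain files too.\n')
    fnode.close()

with tb.open_file(path, mode='r') as f:
    # Reopen the existing node for reading
    fnode = filenode.open_node(f.root.notes_txt, 'r')
    content = fnode.read()
    fnode.close()

print(content)
```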

### todo

### Transactions: mark, undo, redo, goto

See https://github.com/PyTables/PyTables/blob/develop/examples/undo-redo.py

See https://github.com/PyTables/PyTables/blob/develop/examples/tutorial3-2.py

`File.enable_undo()` prepares the database for undoing and redoing modifications in the node hierarchy.

Once the Undo/Redo mechanism is enabled, explicit marks (with an optional unique name) can be set on the state of the database using the `File.mark()` method. There are two implicit marks which are always available: the initial mark (0) and the final mark (-1).

The `File.undo()` method returns the database to the state of a past mark.

I personally find the method name `undo` a bit misleading and inconsistent: called without a mark, it rolls back the last transaction; called with a mark such as `'after b'`, it returns to the state marked as `'after b'`.

In [199]:
with tb.open_file(filename='data/my_pytables_file.h5', mode='w') as f:
    f.enable_undo()
    
    f.create_array(where='/', name='array_a', title='Array A', obj=[1, 2, 3])
    f.mark(name='after a')
    
    f.create_array(where='/', name='array_b', title='Array B', obj=[4, 5, 6])
    f.mark(name='after b')
    
    f.create_array(where='/', name='array_c', title='Array C', obj=[7, 8, 9])
    f.mark(name='after c')
    
#     f.undo(mark='after b')
#     f.redo(mark='after c')
    f.goto(mark='after a')
#     f.disable_undo()
    current_mark = f.get_current_mark()
    print(current_mark)

1
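
The cell above uses `goto`. As a complement, here is a sketch with explicit `undo` and `redo` calls that checks whether a node exists after each step. The file path is an assumption; the mark names follow the cell above.

```python
import os
import tempfile

import tables as tb

# Hypothetical scratch file used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'undo_redo.h5')

with tb.open_file(path, mode='w') as f:
    f.enable_undo()

    f.create_array(where='/', name='array_a', obj=[1, 2, 3])
    f.mark(name='after a')

    f.create_array(where='/', name='array_b', obj=[4, 5, 6])
    f.mark(name='after b')

    f.undo(mark='after a')                  # array_b did not exist yet at this mark
    exists_after_undo = '/array_b' in f

    f.redo(mark='after b')                  # array_b comes back
    exists_after_redo = '/array_b' in f

print(exists_after_undo, exists_after_redo)
```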


You cannot undo/redo everything! EXAMPLES?

Hierarchy manipulation operations on nodes that do not support the Undo/Redo mechanism issue an UndoRedoWarning before changing the database.

In [196]:
print(current_mark)

1


### HDF5 users, use cases and job offers

Add special thanks to all those who replied.

### Where to go from here

### Special thanks

[Julien Guillaumin](https://www.linkedin.com/in/julienguillaumin): HDF5 image augmentation for deep learning applications