### HDF5 (PyTables)
HDFStore is a dict-like object that reads and writes pandas objects in the high-performance HDF5 format using the PyTables library. See the cookbook for some advanced strategies.

#### Warning

Pandas uses PyTables for reading and writing HDF5 files, which allows serializing object-dtype data with pickle. Loading pickled data received from untrusted sources can be unsafe.

See: https://docs.python.org/3/library/pickle.html for more.

In [1]:
import pandas as pd
import numpy as np

In [2]:
store = pd.HDFStore('store.h5')

In [3]:
print(store)

<class 'pandas.io.pytables.HDFStore'>
File path: store.h5



Objects can be written to the file just like adding key-value pairs to a dict:

In [31]:
index = pd.date_range('1/1/2020', periods=10)
df = pd.DataFrame(np.random.randn(10,5), index=index, columns=list('ABCDE'))
df

,A,B,C,D,E
2020-01-01,-0.195121,-0.909313,-0.170658,2.717058,-0.252887
2020-01-02,-1.995391,-1.379531,-1.689572,-0.654362,-0.808935
2020-01-03,0.809837,0.976512,-1.925431,-0.121478,0.423694
2020-01-04,1.174184,0.433564,-0.296804,0.265765,-0.868157
2020-01-05,-0.435192,0.087607,1.083748,0.981835,2.45425
2020-01-06,-0.362767,2.540173,1.630619,1.310268,-0.649174
2020-01-07,-0.233982,1.039995,-0.172484,-1.290952,-1.06652
2020-01-08,1.185293,-0.502811,-1.348001,-0.112295,0.630861
2020-01-09,0.113215,1.041406,-0.283289,0.674396,0.813218
2020-01-10,0.373229,0.155789,0.870063,0.092487,-0.766356


In [32]:
s = pd.Series(np.random.randn(5), index=list('abcde'))
s

a    0.947652
b    0.136976
c   -2.050591
d   -0.174019
e    1.573777
dtype: float64

In [33]:
store.is_open

True

In [7]:
# store.put('s', s) is an equivalent method
store['s'] = s

In [34]:
store['df'] = df

In a current or later Python session, you can retrieve stored objects:



In [11]:
# store.get('df') is an equivalent method
store['df']

,A,B,C,D,E
2020-01-01,-0.635313,2.214264,0.986995,-0.993651,0.706387
2020-01-02,0.078921,0.226716,0.523962,1.243584,1.410206
2020-01-03,-0.63822,0.783486,-1.892696,-0.954387,1.006346
2020-01-04,0.217764,-0.864551,-1.029765,-0.722419,0.93187
2020-01-05,-1.126184,0.396616,-0.212969,3.244286,1.351915
2020-01-06,1.843747,-0.164955,-0.377947,0.040777,-0.090312
2020-01-07,1.537576,0.15633,-0.670925,-2.391738,-0.529565
2020-01-08,1.340863,0.811314,-0.762913,-0.147732,-0.326164
2020-01-09,0.013538,-1.091931,-3.112693,0.646806,-1.174021
2020-01-10,0.697641,0.219207,-2.062731,-1.733491,1.093881


In [12]:
store.get('s')

a    0.173907
b   -0.428035
c    0.289877
d   -0.312388
e   -0.124705
dtype: float64

In [13]:
# dotted (attribute) access provides get as well
store.df

,A,B,C,D,E
2020-01-01,-0.635313,2.214264,0.986995,-0.993651,0.706387
2020-01-02,0.078921,0.226716,0.523962,1.243584,1.410206
2020-01-03,-0.63822,0.783486,-1.892696,-0.954387,1.006346
2020-01-04,0.217764,-0.864551,-1.029765,-0.722419,0.93187
2020-01-05,-1.126184,0.396616,-0.212969,3.244286,1.351915
2020-01-06,1.843747,-0.164955,-0.377947,0.040777,-0.090312
2020-01-07,1.537576,0.15633,-0.670925,-2.391738,-0.529565
2020-01-08,1.340863,0.811314,-0.762913,-0.147732,-0.326164
2020-01-09,0.013538,-1.091931,-3.112693,0.646806,-1.174021
2020-01-10,0.697641,0.219207,-2.062731,-1.733491,1.093881


Deletion of the object specified by the key:

In [14]:
# store.remove('df') is an equivalent method
del store['df']

In [15]:
store.df

AttributeError: 'HDFStore' object has no attribute 'df'

In [28]:
# Re-add the frame. The store must be open: on a closed store this call raises
# ClosedFileError: store.h5 file is not open!
store.put('df', df)

In [17]:
store['df']

,A,B,C,D,E
2020-01-01,-0.635313,2.214264,0.986995,-0.993651,0.706387
2020-01-02,0.078921,0.226716,0.523962,1.243584,1.410206
2020-01-03,-0.63822,0.783486,-1.892696,-0.954387,1.006346
2020-01-04,0.217764,-0.864551,-1.029765,-0.722419,0.93187
2020-01-05,-1.126184,0.396616,-0.212969,3.244286,1.351915
2020-01-06,1.843747,-0.164955,-0.377947,0.040777,-0.090312
2020-01-07,1.537576,0.15633,-0.670925,-2.391738,-0.529565
2020-01-08,1.340863,0.811314,-0.762913,-0.147732,-0.326164
2020-01-09,0.013538,-1.091931,-3.112693,0.646806,-1.174021
2020-01-10,0.697641,0.219207,-2.062731,-1.733491,1.093881


In [27]:
store.df

AttributeError: 'HDFStore' object has no attribute 'df'

In [20]:
store.remove('df')

Closing a Store and using a context manager:



In [21]:
store.close()

In [22]:
store

<class 'pandas.io.pytables.HDFStore'>
File path: store.h5

In [24]:
store.is_open

False

In [35]:
# Working with, and automatically closing the store using a context manager
with pd.HDFStore('store.h5') as store:
    print(store.keys())



['/df', '/s']


### Read/write API
HDFStore supports a top-level API using read_hdf for reading and to_hdf for writing, similar to how read_csv and to_csv work.

In [39]:
df_tl = pd.DataFrame({'A': list(range(5)), 'B': list(range(5))})
df_tl

,A,B
0,0,0
1,1,1
2,2,2
3,3,3
4,4,4


In [40]:
# pass key as a keyword; newer pandas versions deprecate passing it positionally
df_tl.to_hdf('store_tl.h5', key='table', append=True)

In [41]:
pd.read_hdf('store_tl.h5', 'table', where=['index>2'])

,A,B
3,3,3
4,4,4
3,3,3
4,4,4
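The repeated rows in the output above are consistent with the append=True cell having been run more than once against the same file, since each such call adds rows to the existing table. The following is a minimal sketch of that behaviour, assuming PyTables is installed; the scratch path and the names `appended` and `fresh` are hypothetical and not part of the original example.

```python
import os
import tempfile

import pandas as pd

df_tl = pd.DataFrame({'A': list(range(5)), 'B': list(range(5))})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'store_tl.h5')  # hypothetical scratch file

    # Each call with append=True adds the rows to the existing table.
    df_tl.to_hdf(path, key='table', append=True)
    df_tl.to_hdf(path, key='table', append=True)
    appended = pd.read_hdf(path, 'table', where=['index>2'])

    # mode='w' truncates the file first, so only one copy is stored.
    df_tl.to_hdf(path, key='table', format='table', mode='w')
    fresh = pd.read_hdf(path, 'table', where=['index>2'])

print(len(appended), len(fresh))
```

Writing with mode='w' (or removing the key first) is one way to keep a re-run notebook cell from accumulating duplicates.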


HDFStore will by default not drop rows that are all missing. This behavior can be changed by setting dropna=True.



In [42]:
df_with_missing = pd.DataFrame({'col1': [0, np.nan, 2],'col2': [1, np.nan, np.nan]})

In [43]:
df_with_missing

,col1,col2
0,0.0,1.0
1,,
2,2.0,


In [44]:
df_with_missing.to_hdf('file.h5', key='df_with_missing', format='table', mode='w')

In [45]:
pd.read_hdf('file.h5', 'df_with_missing')

,col1,col2
0,0.0,1.0
1,,
2,2.0,


In [46]:
df_with_missing.to_hdf('file.h5', key='df_with_missing', format='table', mode='w', dropna=True)

In [47]:
pd.read_hdf('file.h5', 'df_with_missing')

,col1,col2
0,0.0,1.0
1,,
2,2.0,


### Fixed format
The examples above store objects using put, which writes the data to HDF5 through PyTables in a fixed array format, called the fixed format. These stores are not appendable once written (though you can simply remove them and rewrite). Nor are they queryable; they must be retrieved in their entirety. They also do not support dataframes with non-unique column names. Fixed format stores offer very fast writing and slightly faster reading than table stores. This format is used by default with put or to_hdf, or can be requested explicitly with format='fixed' or format='f'.

In [48]:
# A fixed format store raises a TypeError if you try to retrieve it with a where clause:

pd.DataFrame(np.random.randn(10, 2)).to_hdf('test_fixed.h5', key='df')
pd.read_hdf('test_fixed.h5', 'df', where='index>5')

TypeError: cannot pass a where specification when reading from a Fixed format store. this store must be selected in its entirety
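The "not appendable" restriction can be illustrated in the same way. This is a hedged sketch, assuming PyTables is installed: the file name, the key 'frame', and the variable names are hypothetical, and the exception type is caught broadly because the exact class raised may differ between pandas versions.

```python
import os
import tempfile

import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=['x', 'y'])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'fixed_demo.h5')  # hypothetical scratch file
    with pd.HDFStore(path) as store:
        store.put('frame', frame)  # put uses the fixed format by default

        try:
            store.append('frame', frame)  # appending to a fixed store is rejected
            append_failed = False
        except (ValueError, TypeError) as exc:
            append_failed = True
            print('append rejected:', exc)

        # The workaround described above: remove the node and rewrite it.
        store.remove('frame')
        store.put('frame', pd.concat([frame, frame], ignore_index=True))
        n_rows = len(store['frame'])

print(append_failed, n_rows)
```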

### Table format
HDFStore supports another PyTables format on disk, the table format. Conceptually a table is shaped very much like a DataFrame, with rows and columns. A table may be appended to in the same or other sessions. In addition, delete and query type operations are supported. This format is selected by passing format='table' or format='t' to append, put, or to_hdf.

The table format can also be made the default for put, append, and to_hdf by setting pd.set_option('io.hdf.default_format', 'table').
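As a rough illustration of the two ways to select the table format, the sketch below writes the same frame three times and checks which nodes ended up as tables. It assumes PyTables is installed and that the io.hdf.default_format option has not been changed elsewhere; the keys and the scratch file name are hypothetical. pd.option_context is used here instead of pd.set_option so that the setting is restored afterwards.

```python
import os
import tempfile

import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(6.0).reshape(3, 2), columns=['x', 'y'])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'table_demo.h5')  # hypothetical scratch file
    with pd.HDFStore(path) as store:
        # Explicit request for the table format.
        store.put('explicit', frame, format='table')

        # Temporarily make 'table' the default format for put.
        with pd.option_context('io.hdf.default_format', 'table'):
            store.put('by_option', frame)

        # Outside the context, put falls back to the fixed format.
        store.put('default', frame)

        kinds = {key: store.get_storer(key).is_table
                 for key in ('explicit', 'by_option', 'default')}

print(kinds)
```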

In [49]:
df

,A,B,C,D,E
2020-01-01,-0.195121,-0.909313,-0.170658,2.717058,-0.252887
2020-01-02,-1.995391,-1.379531,-1.689572,-0.654362,-0.808935
2020-01-03,0.809837,0.976512,-1.925431,-0.121478,0.423694
2020-01-04,1.174184,0.433564,-0.296804,0.265765,-0.868157
2020-01-05,-0.435192,0.087607,1.083748,0.981835,2.45425
2020-01-06,-0.362767,2.540173,1.630619,1.310268,-0.649174
2020-01-07,-0.233982,1.039995,-0.172484,-1.290952,-1.06652
2020-01-08,1.185293,-0.502811,-1.348001,-0.112295,0.630861
2020-01-09,0.113215,1.041406,-0.283289,0.674396,0.813218
2020-01-10,0.373229,0.155789,0.870063,0.092487,-0.766356


In [54]:
store = pd.HDFStore('store1.h5')

In [55]:
df1 = df[0:4]

In [56]:
df2 = df[4:]

In [57]:
store.append('df', df1)

In [58]:
store.append('df', df2)

In [59]:
store.select('df')

,A,B,C,D,E
2020-01-01,-0.195121,-0.909313,-0.170658,2.717058,-0.252887
2020-01-02,-1.995391,-1.379531,-1.689572,-0.654362,-0.808935
2020-01-03,0.809837,0.976512,-1.925431,-0.121478,0.423694
2020-01-04,1.174184,0.433564,-0.296804,0.265765,-0.868157
2020-01-05,-0.435192,0.087607,1.083748,0.981835,2.45425
2020-01-06,-0.362767,2.540173,1.630619,1.310268,-0.649174
2020-01-07,-0.233982,1.039995,-0.172484,-1.290952,-1.06652
2020-01-08,1.185293,-0.502811,-1.348001,-0.112295,0.630861
2020-01-09,0.113215,1.041406,-0.283289,0.674396,0.813218
2020-01-10,0.373229,0.155789,0.870063,0.092487,-0.766356


In [60]:
# the type of stored data
store.root.df._v_attrs.pandas_type

'frame_table'

### Hierarchical keys
Keys to a store can be specified as a string. These can be in a hierarchical path-name like format (e.g. foo/bar/bah), which will generate a hierarchy of sub-stores (or Groups in PyTables parlance). Keys can be specified without the leading ‘/’ and are always absolute (e.g. ‘foo’ refers to ‘/foo’). Removal operations can remove everything in the sub-store and below, so be careful.

In [61]:
store.put('foo/bar/bah', df)

In [62]:
store.append('food/orange', df)

In [63]:
store.append('food/apple', df)

In [64]:
# a list of keys is returned
store.keys()

['/df', '/food/apple', '/food/orange', '/foo/bar/bah']

In [65]:
# remove all nodes under this level
store.remove('food')

In [66]:
store.keys()

['/df', '/foo/bar/bah']

You can walk through the group hierarchy using the walk method which will yield a tuple for each group key along with the relative keys of its contents.

In [71]:
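for (path, subgroups, subkeys) in store.walk():
    # list the sub-groups found directly under this path
    for subgroup in subgroups:
        print('GROUP: {}/{}'.format(path, subgroup))
    # list and load the pandas objects stored directly under this path
    for subkey in subkeys:
        key = '/'.join([path, subkey])
        print('KEY: {}'.format(key))
        print(store.get(key))

GROUP: /foo
KEY: /df
                   A         B         C         D         E
2020-01-01 -0.195121 -0.909313 -0.170658  2.717058 -0.252887
2020-01-02 -1.995391 -1.379531 -1.689572 -0.654362 -0.808935
2020-01-03  0.809837  0.976512 -1.925431 -0.121478  0.423694
2020-01-04  1.174184  0.433564 -0.296804  0.265765 -0.868157
2020-01-05 -0.435192  0.087607  1.083748  0.981835  2.454250
2020-01-06 -0.362767  2.540173  1.630619  1.310268 -0.649174
2020-01-07 -0.233982  1.039995 -0.172484 -1.290952 -1.066520
2020-01-08  1.185293 -0.502811 -1.348001 -0.112295  0.630861
2020-01-09  0.113215  1.041406 -0.283289  0.674396  0.813218
2020-01-10  0.373229  0.155789  0.870063  0.092487 -0.766356
GROUP: /foo/bar
KEY: /foo/bar/bah
                   A         B         C         D         E
2020-01-01 -0.195121 -0.909313 -0.170658  2.717058 -0.252887
2020-01-02 -1.995391 -1.379531 -1.689572 -0.654362 -0.808935
2020-01-03  0.809837  0.976512 -1.925431 -0.121478  0.423694
2020-01-04  1.174184  0.433564 -0.296804  0.265765 -0.868157
2020-01-05 -0.435192  0.087607  1.083748  0.981835  2.454250
2020-01-06 -0.362767  2.540173  1.630619  1.310268 -0.649174
2020-01-07 -0.233982  1.039995 -0.172484 -1.290952 -1.066520
2020-01-08  1.185293 -0.502811 -1.348001 -0.112295  0.630861
2020-01-09  0.113215  1.041406 -0.283289  0.674396  0.813218
2020-01-10  0.373229  0.155789  0.870063  0.092487 -0.766356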


### Note:
Hierarchical keys cannot be retrieved via dotted (attribute) access, as described above for items stored under the root node.

In [72]:
store.foo.bar.bah

TypeError: cannot create a storer if the object is not existing nor a value are passed

Instead, use explicit string based keys:

In [73]:
store['foo/bar/bah']

,A,B,C,D,E
2020-01-01,-0.195121,-0.909313,-0.170658,2.717058,-0.252887
2020-01-02,-1.995391,-1.379531,-1.689572,-0.654362,-0.808935
2020-01-03,0.809837,0.976512,-1.925431,-0.121478,0.423694
2020-01-04,1.174184,0.433564,-0.296804,0.265765,-0.868157
2020-01-05,-0.435192,0.087607,1.083748,0.981835,2.45425
2020-01-06,-0.362767,2.540173,1.630619,1.310268,-0.649174
2020-01-07,-0.233982,1.039995,-0.172484,-1.290952,-1.06652
2020-01-08,1.185293,-0.502811,-1.348001,-0.112295,0.630861
2020-01-09,0.113215,1.041406,-0.283289,0.674396,0.813218
2020-01-10,0.373229,0.155789,0.870063,0.092487,-0.766356


### Storing types
#### Storing mixed types in a table
Storing mixed-dtype data is supported. Strings are stored as fixed-width values using the maximum size of the appended column. Subsequent attempts at appending longer strings will raise a ValueError.

Passing min_itemsize={'values': size} as a parameter to append will set a larger minimum for the string columns. Storing floats, strings, ints, bools, and datetime64 is currently supported. For string columns, passing nan_rep='nan' to append changes the nan representation on disk (which converts to/from np.nan); the default is nan.
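The fixed-width behaviour can be sketched as follows, assuming PyTables is installed. The keys 'narrow' and 'wide', the column name, and the scratch file are hypothetical; object dtype is forced so that the strings are written the same way regardless of the default string dtype of the installed pandas version.

```python
import os
import tempfile

import pandas as pd

short = pd.DataFrame({'name': pd.Series(['ab', 'cd'], dtype=object)})
longer = pd.DataFrame({'name': pd.Series(['a much longer string'], dtype=object)})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'strings_demo.h5')  # hypothetical scratch file
    with pd.HDFStore(path) as store:
        # The first append fixes the string width at the longest value seen (2).
        store.append('narrow', short)
        try:
            store.append('narrow', longer)
            too_long_rejected = False
        except ValueError:
            too_long_rejected = True

        # Reserving room up front lets longer strings be appended later.
        store.append('wide', short, min_itemsize={'values': 30})
        store.append('wide', longer)
        wide_names = store.select('wide')['name'].tolist()

print(too_long_rejected, wide_names)
```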

In [74]:
df_mixed = pd.DataFrame({'A': np.random.randn(8),
                         'B': np.random.randn(8),
                         'C': np.array(np.random.randn(8), dtype='float32'),
                         'string': 'string',
                         'int': 1,
                         'bool': True,
                         'datetime64': pd.Timestamp('20010102')},
                        index=list(range(8)))

In [75]:
df_mixed.loc[df_mixed.index[3:5], ['A', 'B', 'string', 'datetime64']] = np.nan

In [76]:
store.append('df_mixed', df_mixed, min_itemsize={'values': 50})

In [77]:
df_mixed1 = store.select('df_mixed')

In [78]:
df_mixed1

,A,B,C,string,int,bool,datetime64
0,0.734376,-0.735236,1.113173,string,1,True,2001-01-02
1,-0.517398,0.171343,0.961899,string,1,True,2001-01-02
2,0.705349,-0.522678,0.230414,string,1,True,2001-01-02
3,,,0.387898,,1,True,NaT
4,,,-0.746608,,1,True,NaT
5,0.833325,-0.444531,-0.647699,string,1,True,2001-01-02
6,0.142085,-0.947318,0.505955,string,1,True,2001-01-02
7,0.793704,0.5873,1.111425,string,1,True,2001-01-02


In [79]:
df_mixed1.dtypes.value_counts()

float64           2
object            1
bool              1
float32           1
int64             1
datetime64[ns]    1
dtype: int64

In [80]:
store.root.df_mixed.table

/df_mixed/table (Table(8,)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "values_block_0": Float64Col(shape=(2,), dflt=0.0, pos=1),
  "values_block_1": Float32Col(shape=(1,), dflt=0.0, pos=2),
  "values_block_2": Int64Col(shape=(1,), dflt=0, pos=3),
  "values_block_3": Int64Col(shape=(1,), dflt=0, pos=4),
  "values_block_4": BoolCol(shape=(1,), dflt=False, pos=5),
  "values_block_5": StringCol(itemsize=50, shape=(1,), dflt=b'', pos=6)}
  byteorder := 'little'
  chunkshape := (689,)
  autoindex := True
  colindexes := {
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}

### Storing MultiIndex DataFrames
Storing MultiIndex DataFrames as tables is very similar to storing/selecting from homogeneous index DataFrames.

In [81]:
index = pd.MultiIndex(levels=[['foo', 'bar', 'baz', 'qux'],
                              ['one', 'two', 'three']],
                      codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3],
                             [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
                      names=['foo', 'bar'])

In [82]:
df_mi = pd.DataFrame(np.random.randn(10, 3), index=index, columns=['A', 'B', 'C'])

In [84]:
df_mi

foo,bar,A,B,C
foo,one,0.781119,0.110039,-0.518755
foo,two,0.289236,1.271185,-1.439223
foo,three,-1.300231,-0.208563,0.090684
bar,one,-0.433235,0.333393,1.153843
bar,two,0.77115,1.872144,0.69324
baz,two,-0.355579,-1.16518,-0.700331
baz,three,1.664754,1.010506,0.159067
qux,one,-1.09577,-0.362842,0.912264
qux,two,0.069426,0.501442,-0.618269
qux,three,2.341431,-1.111958,-0.896008


In [85]:
store.append('df_mi', df_mi)

In [86]:
store.select('df_mi')

foo,bar,A,B,C
foo,one,0.781119,0.110039,-0.518755
foo,two,0.289236,1.271185,-1.439223
foo,three,-1.300231,-0.208563,0.090684
bar,one,-0.433235,0.333393,1.153843
bar,two,0.77115,1.872144,0.69324
baz,two,-0.355579,-1.16518,-0.700331
baz,three,1.664754,1.010506,0.159067
qux,one,-1.09577,-0.362842,0.912264
qux,two,0.069426,0.501442,-0.618269
qux,three,2.341431,-1.111958,-0.896008


In [87]:
store.select('df_mi', 'foo=bar')

foo,bar,A,B,C
bar,one,-0.433235,0.333393,1.153843
bar,two,0.77115,1.872144,0.69324


In [4]:
dfq = pd.DataFrame(np.random.randn(10, 4), columns=list('ABCD'),
                   index=pd.date_range('20130101', periods=10))

In [5]:
dfq

,A,B,C,D
2013-01-01,-0.954797,1.256995,0.305014,0.393192
2013-01-02,-0.247001,-0.940053,-2.214759,-0.628328
2013-01-03,0.787976,-0.612468,-0.388306,0.149152
2013-01-04,-1.290555,-0.190472,0.855961,0.45424
2013-01-05,1.264106,-2.168926,-1.590191,0.278252
2013-01-06,0.923539,0.233648,-0.790956,-0.127818
2013-01-07,0.394805,-2.097285,-0.00612,-0.143465
2013-01-08,0.146363,-0.352636,0.171754,1.062521
2013-01-09,-1.865539,-0.94187,1.614018,1.271311
2013-01-10,1.643639,-0.835253,-0.045282,-0.61528


In [6]:
store.append('dfq', dfq, format='table', data_columns=True)

Use boolean expressions, with in-line function evaluation.

In [8]:
store.select('dfq', "index>pd.Timestamp('20130104') & columns=['A', 'B']")

,A,B
2013-01-05,1.264106,-2.168926
2013-01-06,0.923539,0.233648
2013-01-07,0.394805,-2.097285
2013-01-08,0.146363,-0.352636
2013-01-09,-1.865539,-0.94187
2013-01-10,1.643639,-0.835253


Use inline column reference.

In [9]:
store.select('dfq', where="A>0 or C>0")

,A,B,C,D
2013-01-01,-0.954797,1.256995,0.305014,0.393192
2013-01-03,0.787976,-0.612468,-0.388306,0.149152
2013-01-04,-1.290555,-0.190472,0.855961,0.45424
2013-01-05,1.264106,-2.168926,-1.590191,0.278252
2013-01-06,0.923539,0.233648,-0.790956,-0.127818
2013-01-07,0.394805,-2.097285,-0.00612,-0.143465
2013-01-08,0.146363,-0.352636,0.171754,1.062521
2013-01-09,-1.865539,-0.94187,1.614018,1.271311
2013-01-10,1.643639,-0.835253,-0.045282,-0.61528
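Column references like the ones above rely on the referenced columns having been written as data columns (data_columns=True in the append call earlier). The sketch below, which assumes PyTables is installed, shows both data_columns=True and a restricted list; the keys, the scratch file name, and the seeded random frame are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.standard_normal((8, 3)), columns=list('ABC'))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'query_demo.h5')  # hypothetical scratch file
    with pd.HDFStore(path) as store:
        # Every column becomes individually queryable.
        store.append('all_cols', frame, data_columns=True)
        # Only column A becomes queryable.
        store.append('only_a', frame, data_columns=['A'])

        either = store.select('all_cols', where='A > 0 | C > 0')
        a_only = store.select('only_a', where='A > 0')

# The same filters applied in memory, for comparison.
expected_either = frame[(frame['A'] > 0) | (frame['C'] > 0)]
expected_a = frame[frame['A'] > 0]
print(len(either), len(a_only))
```

Limiting data_columns to the columns that are actually queried keeps the on-disk table narrower, at the cost of not being able to filter on the remaining columns.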


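The columns keyword can be supplied to select a list of columns to be returned; this is equivalent to passing 'columns=list_of_columns_to_filter' in the where expression. The target must be a table-format store: the call below is made against 'df', which was written in the fixed format, so it raises a TypeError.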

In [10]:
store.select('df', "columns=['A', 'B']")

TypeError: cannot pass a where specification when reading from a Fixed format store. this store must be selected in its entirety
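For comparison, here is a minimal sketch of column selection against a table-format store, assuming PyTables is installed. The key 'frame' and the scratch file are hypothetical; append is used because it writes the table format.

```python
import os
import tempfile

import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.0).reshape(4, 3), columns=list('ABC'))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'columns_demo.h5')  # hypothetical scratch file
    with pd.HDFStore(path) as store:
        store.append('frame', frame)  # append writes the table format

        # Column selection through the keyword argument ...
        by_keyword = store.select('frame', columns=['A', 'B'])
        # ... and through the where expression.
        by_where = store.select('frame', "columns=['A', 'B']")

print(by_keyword.columns.tolist())
```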

### Query timedelta64[ns]
You can store and query using the timedelta64[ns] type. Terms can be specified in the format <float>(<unit>), where float may be signed (and fractional), and unit can be D, s, ms, us, or ns for the timedelta. Here’s an example:

In [12]:
from datetime import timedelta

dftd = pd.DataFrame({'A': pd.Timestamp('20130101'),
                     'B': [pd.Timestamp('20130101') + timedelta(days=i, seconds=10)
                           for i in range(10)]})
dftd

,A,B
0,2013-01-01,2013-01-01 00:00:10
1,2013-01-01,2013-01-02 00:00:10
2,2013-01-01,2013-01-03 00:00:10
3,2013-01-01,2013-01-04 00:00:10
4,2013-01-01,2013-01-05 00:00:10
5,2013-01-01,2013-01-06 00:00:10
6,2013-01-01,2013-01-07 00:00:10
7,2013-01-01,2013-01-08 00:00:10
8,2013-01-01,2013-01-09 00:00:10
9,2013-01-01,2013-01-10 00:00:10


In [17]:
dftd['C'] = dftd['A'] - dftd['B']

dftd

,A,B,C
0,2013-01-01,2013-01-01 00:00:10,-1 days +23:59:50
1,2013-01-01,2013-01-02 00:00:10,-2 days +23:59:50
2,2013-01-01,2013-01-03 00:00:10,-3 days +23:59:50
3,2013-01-01,2013-01-04 00:00:10,-4 days +23:59:50
4,2013-01-01,2013-01-05 00:00:10,-5 days +23:59:50
5,2013-01-01,2013-01-06 00:00:10,-6 days +23:59:50
6,2013-01-01,2013-01-07 00:00:10,-7 days +23:59:50
7,2013-01-01,2013-01-08 00:00:10,-8 days +23:59:50
8,2013-01-01,2013-01-09 00:00:10,-9 days +23:59:50
9,2013-01-01,2013-01-10 00:00:10,-10 days +23:59:50


In [19]:
store.append('dftd1', dftd, data_columns=True)

In [20]:
store.select('dftd1', "C< '-3.5D'")

,A,B,C
4,2013-01-01,2013-01-05 00:00:10,-5 days +23:59:50
5,2013-01-01,2013-01-06 00:00:10,-6 days +23:59:50
6,2013-01-01,2013-01-07 00:00:10,-7 days +23:59:50
7,2013-01-01,2013-01-08 00:00:10,-8 days +23:59:50
8,2013-01-01,2013-01-09 00:00:10,-9 days +23:59:50
9,2013-01-01,2013-01-10 00:00:10,-10 days +23:59:50
