Alignment issues with python3 #734

Closed
kjmeagher opened this issue May 9, 2019 · 11 comments · Fixed by #765

@kjmeagher

There seems to be some sort of alignment issue with python3 on macOS. With this test file, in python2 I get the correct results:

python -c 'import tables;f=tables.open_file("test.hdf5");print(f.root.Test[:])'
[(0, 100) (1, 101) (2, 102) (3, 103) (4, 104) (5, 105) (6, 106) (7, 107)
(8, 108) (9, 109)]
Closing remaining open files:test.hdf5...done

However, with python3 I get:

python -c 'import tables;f=tables.open_file("test.hdf5");print(f.root.Test[:])'
[(0, 100) (0, 0) (1, 101) (0, 0) (2, 102) (0, 0) (3, 103) (0, 0)
(4, 104) (0, 0)]
Closing remaining open files:test.hdf5...done

It appears to be some sort of alignment issue: if each row is 128 bytes it works fine, and with other row sizes I get segfaults.

This only happens with files written directly with libhdf5, not files written by PyTables. I put the test file on my website, since GitHub doesn't let me upload HDF5 files (even though it is only 8K):
https://icecube.wisc.edu/~kmeagher/test.hdf5
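
In case it helps with reproducing, here is a minimal sketch that writes a similarly padded compound dataset with h5py. This is a hypothetical reproduction, not the code that produced test.hdf5, and it assumes h5py maps a padded structured dtype to an H5T_COMPOUND of the same size:

    # Hypothetical reproduction sketch, not the original writer code.
    import numpy as np
    import h5py

    # Two uint32 fields (8 bytes of data) padded out to a 16-byte compound,
    # similar to the layout in test.hdf5.
    padded = np.dtype({'names': ['A', 'B'],
                       'formats': ['<u4', '<u4'],
                       'offsets': [0, 4],
                       'itemsize': 16})

    data = np.zeros(10, dtype=padded)
    data['A'] = np.arange(10)
    data['B'] = np.arange(100, 110)

    with h5py.File('test.hdf5', 'w') as f:
        f.create_dataset('Test', data=data)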

@FrancescAlted
Member

Yes, this should not be happening. We need to find the root of the problem, but in the meanwhile, if you can contribute a solution, we will be really grateful!

@tomkooij tomkooij self-assigned this Aug 7, 2019
@tomkooij tomkooij added the defect label Aug 7, 2019
@tomkooij
Contributor

tomkooij commented Aug 7, 2019

I can reproduce this on my Windows / python 3.7 env.

The HDF5 file contains:

h5ls -d test.hdf5
Test                     Dataset {10/Inf}
    Data:
        (0) {0, 100}, {1, 101}, {2, 102}, {3, 103}, {4, 104}, {5, 105}, {6, 106}, {7, 107}, {8, 108}, {9, 109}
__I3Index__              Group

and

File(filename=test.hdf5, title='', mode='r', root_uep='/', filters=Filters(complevel=0, shuffle=False, bitshuffle=False, fletcher32=False, least_significant_digit=None))
/ (RootGroup) ''
/Test (Table(10,), shuffle, zlib(6)) ''
  description := {
  "A": UInt32Col(shape=(), dflt=0, pos=0),
  "B": UInt32Col(shape=(), dflt=0, pos=1)}
  byteorder := 'little'
  chunkshape := (8191,)
[...]

I'll investigate. This may be due to the table information not being correct, as the test file was not created using PyTables.

@tkintscher

I just ran into the same issue. If it helps, this also occurs in Python2.
Everything works in version 3.4.4, the broken behavior starts with version 3.5.1.

@tomkooij
Contributor

tomkooij commented Sep 23, 2019

Thanks for putting the finger on the sore spot, @tkintscher: I can confirm this broke between 3.4.4 and 3.5.1.

Comparing v3.4.4...v3.5.1 points to #720 as the probable culprit, but that's a large PR.
The culprit is 4a84d81; its parent, cadb09b, is OK.

@tomkooij
Contributor

I checked / reviewed the offsets PR, but could not find anything wrong. Table.description._v_offsets seems to work fine.

I do not have time to investigate further right now, so I'm leaving some verbose info about this:

The test.hdf5 file has some strange padding:

   h5ls -vv test.hdf5
   [...]
   Type:      struct {
                   "A"                +0    native unsigned int
                   "B"                +4    native unsigned int
               } 16 bytes

This struct should be 8 bytes. In fact, it is 8 bytes when I recreate the dataset.
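
A quick way to confirm the declared compound size is to ask HDF5 directly; a sketch using h5py's low-level type API (assuming h5py is installed):

    # Sketch: inspect the on-disk compound size of the Test dataset with h5py.
    import h5py

    with h5py.File('test.hdf5', 'r') as f:
        dset = f['Test']
        # HDF5 reports the full compound size, padding included
        print(dset.id.get_type().get_size())  # 16 for the padded file, 8 when recreated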

@jvansanten

One thing to note is that the library that wrote the test file above creates compound types whose size is guaranteed to be a multiple of the size of the widest supported type (currently long double). For example, in the test file h5ls shows

    Type:      struct {
                   "A"                +0    native unsigned int
                   "B"                +4    native unsigned int
               } 16 bytes

This means that you can potentially run into trouble if you assume the size of an item is the offset+size of the last field. It looks like this is what's happening here: even in v3.5.2, hdf.root.Test.description._v_itemsize is 8, but the 16 bytes of the in-memory representation still show up in the numpy array returned by read().
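
The mismatch is easy to see from Python; a sketch using the internal attributes mentioned in this thread (_v_itemsize, _v_dtype):

    # Sketch: compare the itemsize PyTables derives with the padded on-disk rows.
    import tables

    with tables.open_file('test.hdf5') as f:
        t = f.root.Test
        print(t.description._v_itemsize)  # 8: offset + size of the last field
        print(t._v_dtype.itemsize)        # also 8, so reads stride 8 bytes through 16-byte rows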

@jvansanten

I suspect that the underlying issue here is that the only way to control the padding in a numpy.dtype is via the align keyword, whereas the padding in an HDF5 compound datatype is completely arbitrary. If that is the case, then there's no way to expose the in-memory representation of the example table as a numpy array, and the incompatibility never showed up before because PyTables removed whatever padding was there.
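
For reference, align=True only inserts the padding a C compiler would, so it cannot reproduce the 16-byte layout seen here; the explicit offsets/itemsize dict form (used later in this thread) can. A short sketch:

    # Sketch contrasting numpy's align keyword with explicit offsets/itemsize.
    import numpy as np

    # align=True adds only C-struct alignment padding; two uint32 fields need none
    aligned = np.dtype([('A', '<u4'), ('B', '<u4')], align=True)
    print(aligned.itemsize)  # 8

    # arbitrary HDF5-style padding has to be spelled out explicitly
    padded = np.dtype({'names': ['A', 'B'],
                       'formats': ['<u4', '<u4'],
                       'offsets': [0, 4],
                       'itemsize': 16})
    print(padded.itemsize)  # 16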

@tomkooij tomkooij added this to the python 3.8 milestone Sep 24, 2019
@tomkooij
Contributor

@jvansanten: It should be possible to extract the correct numpy.dtype from the HDF5 file. I'll try to find some time for this.
Thanks for the help!

@jvansanten

Right, it looks like you should be able to reach slightly deeper into PyArray_Descr and set the itemsize explicitly rather than relying on https://github.com/numpy/numpy/blob/7214ca4688545b432c45287195e2f46c5e418ce8/numpy/core/src/multiarray/descriptor.c#L1012-L1338 to do it for you. Whether you rely on HDF5's guarantee that fields in a compound datatype do not overlap or reimplement the dtype sanity checks is another question.

@tomkooij
Contributor

tomkooij commented Sep 25, 2019

Setting the correct dtype in Table._v_dtype solves this:

In the interactive session below, I set Table._v_dtype to the correct dtype matching the HDF5 file. I then successfully loaded the data into an array:

In [1]: import tables

In [2]: f = tables.open_file('test.hdf5'); t = f.root.Test

In [4]: t[:]
Out[4]:
array([(0, 100), (0,   0), (1, 101), (0,   0), (2, 102), (0,   0),
       (3, 103), (0,   0), (4, 104), (0,   0)],
      dtype=[('A', '<u4'), ('B', '<u4')])

In [5]: t._v_dtype = {'names':['A','B'], 'formats':['<u4','<u4'], 'offsets':[0,4], 'itemsize':16}

In [6]: t[:]
Out[6]:
array([(0, 100), (1, 101), (2, 102), (3, 103), (4, 104), (5, 105),
       (6, 106), (7, 107), (8, 108), (9, 109)],
      dtype={'names':['A','B'], 'formats':['<u4','<u4'], 'offsets':[0,4], 'itemsize':16})

Unfortunately, setting the correct dtype for the Table upon initialisation does not seem to be a one-liner fix... but I'll work on it some more when time permits.
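
For what it's worth, a rough sketch of the direction such a fix could take: build the dtype from the description's names, formats and offsets, but force the itemsize to the on-disk compound size. This is a hypothetical helper, not the actual change that ended up in #765:

    # Hypothetical helper; attribute names follow the Description API
    # referenced in this thread (_v_names, _v_dtypes, _v_offsets).
    import numpy as np

    def padded_dtype(description, disk_itemsize):
        """Build a dtype with the HDF5 field offsets and the on-disk row size."""
        names = description._v_names
        return np.dtype({'names': names,
                         'formats': [description._v_dtypes[n] for n in names],
                         'offsets': description._v_offsets,
                         'itemsize': disk_itemsize})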

tomkooij added a commit to tomkooij/PyTables that referenced this issue Sep 28, 2019
Regression test for PyTables gh-734

Test if a H5T_COMPOUND (Table) with padding can be read properly.
(test if dtype.itemsize set correctly)
tomkooij added a commit that referenced this issue Oct 13, 2019
@kjmeagher
Author

@tomkooij, thanks for looking into this and fixing it!
