Alignment issues with python3 #734

Closed
kjmeagher opened this issue May 9, 2019 · 11 comments · Fixed by #765

@kjmeagher

There seems to be some sort of alignment issue with python3 on macOS. With this test file, in python2 I get the correct results:

python -c 'import tables;f=tables.open_file("test.hdf5");print(f.root.Test[:])'
[(0, 100) (1, 101) (2, 102) (3, 103) (4, 104) (5, 105) (6, 106) (7, 107)
(8, 108) (9, 109)]
Closing remaining open files:test.hdf5...done

However, with python3 I get:

python -c 'import tables;f=tables.open_file("test.hdf5");print(f.root.Test[:])'
[(0, 100) (0, 0) (1, 101) (0, 0) (2, 102) (0, 0) (3, 103) (0, 0)
(4, 104) (0, 0)]
Closing remaining open files:test.hdf5...done

It appears to be some sort of alignment issue: if each row is 128 bytes it works fine, and with other row sizes I get segfaults.

This only happens with files written directly with libhdf5, not files written by PyTables. I put the test file on my website, since GitHub doesn't let me upload HDF5 files (even though it is only 8K):
https://icecube.wisc.edu/~kmeagher/test.hdf5
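
In case it helps with reproducing, here is a minimal sketch that writes a similarly padded compound dataset with h5py. This is a hypothetical reproduction, not the code that produced test.hdf5, and it assumes h5py maps a padded structured dtype to an H5T_COMPOUND of the same size:

    # Hypothetical reproduction sketch, not the original writer code.
    import numpy as np
    import h5py

    # Two uint32 fields (8 bytes of data) padded out to a 16-byte compound,
    # similar to the layout in test.hdf5.
    padded = np.dtype({'names': ['A', 'B'],
                       'formats': ['<u4', '<u4'],
                       'offsets': [0, 4],
                       'itemsize': 16})

    data = np.zeros(10, dtype=padded)
    data['A'] = np.arange(10)
    data['B'] = np.arange(100, 110)

    with h5py.File('test.hdf5', 'w') as f:
        f.create_dataset('Test', data=data)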

@FrancescAlted
Member

Yes, this should not be happening. We need to find the root of the problem, but in the meanwhile, if you can contribute a solution, we will be really grateful!

@tomkooij tomkooij self-assigned this Aug 7, 2019
@tomkooij tomkooij added the defect label Aug 7, 2019
@tomkooij
Contributor

tomkooij commented Aug 7, 2019

I can reproduce this on my Windows / python 3.7 env.

The HDF5 file contains:

h5ls -d test.hdf5
Test                     Dataset {10/Inf}
    Data:
        (0) {0, 100}, {1, 101}, {2, 102}, {3, 103}, {4, 104}, {5, 105}, {6, 106}, {7, 107}, {8, 108}, {9, 109}
__I3Index__              Group

and

File(filename=test.hdf5, title='', mode='r', root_uep='/', filters=Filters(complevel=0, shuffle=False, bitshuffle=False, fletcher32=False, least_significant_digit=None))
/ (RootGroup) ''
/Test (Table(10,), shuffle, zlib(6)) ''
  description := {
  "A": UInt32Col(shape=(), dflt=0, pos=0),
  "B": UInt32Col(shape=(), dflt=0, pos=1)}
  byteorder := 'little'
  chunkshape := (8191,)
[...]

I'll investigate. This may be due to the table information not being correct, as the test file was not created using PyTables.

@tkintscher

I just ran into the same issue. If it helps, this also occurs in Python2.
Everything works in version 3.4.4, the broken behavior starts with version 3.5.1.

@tomkooij
Contributor

tomkooij commented Sep 23, 2019

Thanks for putting the finger on the sore spot, @tkintscher: I can confirm this broke between 3.4.4 and 3.5.1.

Comparing v3.4.4...v3.5.1 points to #720 as the probable culprit, but that's a large PR.
The culprit is 4a84d81; its parent, cadb09b, is OK.

@tomkooij
Contributor

I checked / reviewed the offsets PR, but could not find anything wrong. Table.description._v_offsets seems to work fine.

I do not have time to investigate further right now, so I'm leaving some verbose info about this:

The test.hdf5 file has some strange padding:

   h5ls -vv test.hdf5
   [...]
   Type:      struct {
                   "A"                +0    native unsigned int
                   "B"                +4    native unsigned int
               } 16 bytes

This struct should be 8 bytes. In fact, it is 8 bytes when I recreate the dataset.
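
A quick way to confirm the declared compound size is to ask HDF5 directly; a sketch using h5py's low-level type API (assuming h5py is installed):

    # Sketch: inspect the on-disk compound size of the Test dataset with h5py.
    import h5py

    with h5py.File('test.hdf5', 'r') as f:
        dset = f['Test']
        # HDF5 reports the full compound size, padding included
        print(dset.id.get_type().get_size())  # 16 for the padded file, 8 when recreated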

@jvansanten

One thing to note is that the library that wrote the test file above creates compound types whose size is guaranteed to be a multiple of the size of the widest supported type (currently long double). For example, in the test file h5ls shows

    Type:      struct {
                   "A"                +0    native unsigned int
                   "B"                +4    native unsigned int
               } 16 bytes

This means that you can potentially run into trouble if you assume the size of an item is the offset+size of the last field. It looks like this is what's happening here: even in v3.5.2, hdf.root.Test.description._v_itemsize is 8, but the 16 bytes of the in-memory representation still show up in the numpy array returned by read().
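
The mismatch is easy to see from Python; a sketch using the internal attributes mentioned in this thread (_v_itemsize, _v_dtype):

    # Sketch: compare the itemsize PyTables derives with the padded on-disk rows.
    import tables

    with tables.open_file('test.hdf5') as f:
        t = f.root.Test
        print(t.description._v_itemsize)  # 8: offset + size of the last field
        print(t._v_dtype.itemsize)        # also 8, so reads stride 8 bytes through 16-byte rows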

@jvansanten

I suspect that the underlying issue here is that the only way to control the padding in a numpy.dtype is via the align keyword, whereas the padding in an HDF5 compound datatype is completely arbitrary. If that is the case, then there's no way to expose the in-memory representation of the example table as a numpy array, and the incompatibility never showed up before because PyTables removed whatever padding was there.
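
For reference, align=True only inserts the padding a C compiler would, so it cannot reproduce the 16-byte layout seen here; the explicit offsets/itemsize dict form (used later in this thread) can. A short sketch:

    # Sketch contrasting numpy's align keyword with explicit offsets/itemsize.
    import numpy as np

    # align=True adds only C-struct alignment padding; two uint32 fields need none
    aligned = np.dtype([('A', '<u4'), ('B', '<u4')], align=True)
    print(aligned.itemsize)  # 8

    # arbitrary HDF5-style padding has to be spelled out explicitly
    padded = np.dtype({'names': ['A', 'B'],
                       'formats': ['<u4', '<u4'],
                       'offsets': [0, 4],
                       'itemsize': 16})
    print(padded.itemsize)  # 16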

@tomkooij tomkooij added this to the python 3.8 milestone Sep 24, 2019
@tomkooij
Contributor

@jvansanten: It should be possible to extract the correct numpy.dtype from the HDF5 file. I'll try to find some time for this.
Thanks for the help!

@jvansanten

Right, it looks like you should be able to reach slightly deeper into PyArray_Descr and set the itemsize explicitly rather than relying on https://github.com/numpy/numpy/blob/7214ca4688545b432c45287195e2f46c5e418ce8/numpy/core/src/multiarray/descriptor.c#L1012-L1338 to do it for you. Whether you rely on HDF5's guarantee that fields in a compound datatype do not overlap or reimplement the dtype sanity checks is another question.

@tomkooij
Contributor

tomkooij commented Sep 25, 2019

Setting the correct dtype in Table._v_dtype solves this:

In the interactive session below, I set Table._v_dtype to the correct dtype matching the HDF5 file. I then successfully loaded the data into an array:

In [1]: import tables

In [2]: f = tables.open_file('test.hdf5'); t = f.root.Test

In [4]: t[:]
Out[4]:
array([(0, 100), (0,   0), (1, 101), (0,   0), (2, 102), (0,   0),
       (3, 103), (0,   0), (4, 104), (0,   0)],
      dtype=[('A', '<u4'), ('B', '<u4')])

In [5]: t._v_dtype = {'names':['A','B'], 'formats':['<u4','<u4'], 'offsets':[0,4], 'itemsize':16}

In [6]: t[:]
Out[6]:
array([(0, 100), (1, 101), (2, 102), (3, 103), (4, 104), (5, 105),
       (6, 106), (7, 107), (8, 108), (9, 109)],
      dtype={'names':['A','B'], 'formats':['<u4','<u4'], 'offsets':[0,4], 'itemsize':16})

Unfortunately, setting the correct dtype for the Table upon initialisation does not seem to be a one-liner fix... but I'll work on it some more when time permits.
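
For what it's worth, a rough sketch of the direction such a fix could take: build the dtype from the description's names, formats and offsets, but force the itemsize to the on-disk compound size. This is a hypothetical helper, not the actual change that ended up in #765:

    # Hypothetical helper; attribute names follow the Description API
    # referenced in this thread (_v_names, _v_dtypes, _v_offsets).
    import numpy as np

    def padded_dtype(description, disk_itemsize):
        """Build a dtype with the HDF5 field offsets and the on-disk row size."""
        names = description._v_names
        return np.dtype({'names': names,
                         'formats': [description._v_dtypes[n] for n in names],
                         'offsets': description._v_offsets,
                         'itemsize': disk_itemsize})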

tomkooij added a commit to tomkooij/PyTables that referenced this issue Sep 28, 2019
Regression test for PyTables gh-734

Test if a H5T_COMPOUND (Table) with padding can be read properly.
(test if dtype.itemsize set correctly)
tomkooij added a commit that referenced this issue Oct 13, 2019
@kjmeagher
Author

@tomkooij, thanks for looking into this and fixing it!
