Honor padding in compound types in native HDF5 files #720

Merged
merged 13 commits into from Mar 8, 2019

Conversation

FrancescAlted
Member

@FrancescAlted FrancescAlted commented Feb 19, 2019

This PR adds support for general compound types with padding. Until now, PyTables always stripped any padding (i.e. 'holes') from compound datatypes, which caused problems when PyTables was used to manipulate or copy HDF5 files created with other tools, because the padding was lost.

With this PR, the HDF5 types are used internally with their padding, so it is preserved during output operations, most notably copies (e.g. via ptrepack). This should allow smoother interaction with the wider HDF5 ecosystem (see the messages sent by Ken Walker to the mailing list on 2018-10-30).

@FrancescAlted
Member Author

For reference, here is the message from Ken Walker to pytables-users on 2018-10-30 about the issue addressed in this PR:

This is related to my previous question about H5Tpack().
I am working thru a problem reading/writing data with Pytables. I read some data rows from one HDF5 file/dataset into a numpy record array, then write that array to a dataset in a different HDF5 file (no change to the data). The data in the new file looks fine when interrogated with Pytables or viewed with HDFView. However, a downstream C++ app can't read the Pytables data.
I am told (by the developers) that the compiler for the upstream program is set to pad the data when it writes the original file (that I am reading), and the pad is expected by the downstream reader (that reads the file I created). Padding adds 4 pad characters to a 4-byte S4 field so the next field starts at an 8-byte memory boundary. Based on observed behavior, they have inferred that Pytables removes the pad characters when reading the dataset, and does not add a pad when writing the new dataset (all perfectly legal in HDF5, and it does not affect data integrity). However, the missing pad is expected by the downstream reader, and causes an error (I know, bad code design).

So....I'm wondering...is there something in Pytables that controls padding when reading/writing datasets like this?

FYI, I recreated this read/write process with h5py, and the output file is compatible with my downstream app. Apparently h5py retains the padded characters. This is confirmed when I write the dataset.dtype: h5py reports itemsize:384, vs itemsize:380 when Pytables reads the dataset.
I could rewrite my utility with h5py...but hope to avoid (if possible) because I leverage a lot of pytables unique functionality.
Thanks in advance for any insights into this quirky padding behavior.
-Ken
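The padding Ken describes can be illustrated without HDF5 at all. The following is only a sketch using Python's standard `struct` module; the two-field record (a 4-byte string followed by a 64-bit float) is a hypothetical stand-in for his dataset, not its real layout.

```python
import struct

# Hypothetical record: a 4-byte string ("S4") followed by a 64-bit float.
PACKED = "=4sd"  # standard sizes, no alignment padding between fields
PADDED = "@4sd"  # native alignment, as a C compiler lays out a struct

packed_size = struct.calcsize(PACKED)
padded_size = struct.calcsize(PADDED)

# A zero-count "0d" only aligns to the boundary of a double, so this is
# the offset at which the float field starts in the padded layout.
value_offset = struct.calcsize("@4s0d")

record = struct.pack(PADDED, b"ABCD", 1.5)

print("packed size:", packed_size)
print("padded size:", padded_size)
print("float field offset in padded layout:", value_offset)
print("pad bytes after the string:", value_offset - 4)
```

On most 64-bit platforms the float field should start at offset 8, so the padded record carries 4 extra bytes, which matches the 4-byte difference between the 384 and 380 item sizes reported above. A reader that assumes one layout will misread data written with the other.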

@FrancescAlted FrancescAlted changed the title Support for padding in native HDF5 files Respect padding in compound types in native HDF5 files Feb 20, 2019
@FrancescAlted
Member Author

FrancescAlted commented Feb 20, 2019

With this, PyTables can create tables with paddings as long as they come from NumPy arrays with paddings (i.e. paddings in NumPy structured arrays are respected), and the original paddings are respected during copies too.
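As a rough illustration of what "NumPy structured arrays with paddings" means here, the sketch below builds the same record type with and without alignment padding. It uses NumPy only; the field names `code` and `value` are made up for the example, and the helper `offsets` is hypothetical.

```python
import numpy as np

# Hypothetical two-field record; the field names are illustrative only.
fields = [("code", "S4"), ("value", "f8")]

packed = np.dtype(fields)              # fields laid out back to back
padded = np.dtype(fields, align=True)  # C-struct-like alignment padding

def offsets(dtype):
    """Map each field name to its byte offset inside one record."""
    return {name: dtype.fields[name][1] for name in dtype.names}

# An array created with the padded dtype keeps the holes in memory.
rows = np.zeros(3, dtype=padded)
rows["code"] = b"ABCD"
rows["value"] = [1.0, 2.0, 3.0]

print("packed:", packed.itemsize, offsets(packed))
print("padded:", padded.itemsize, offsets(padded))
```

According to this PR, a table created from an array like `rows` should keep these offsets instead of being repacked, and the offsets are exposed through the new `_v_offsets` attribute of the Description class.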

To do:

  • Add tests for copies and check that padding is respected (fixed in 51a221a)
  • Document the new _v_offsets attribute in the Description class (fixed in 7fa825e)

@FrancescAlted
Member Author

The heavy test suite passes on Linux:

Ran 67234 tests in 6323.946s

OK (skipped=304)

@tomkooij
Contributor

The tests LGTM, so if they pass, go ahead and merge. If you'd like an actual review, I can maybe do that tomorrow. Let me know.

Nice work!

@FrancescAlted
Member Author

Hi @tomkooij. Yes, since this change affects the format of dataset copies (again, only when padding is present), a review would be greatly appreciated. Thanks!

@FrancescAlted FrancescAlted changed the title Respect padding in compound types in native HDF5 files Honor padding in compound types in native HDF5 files Feb 21, 2019
@FrancescAlted
Member Author

Let's merge.
