Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix reading of byte-swapped input files (#95) #101

Merged
merged 1 commit into from Feb 17, 2023

Conversation

helmutg
Copy link
Contributor

@helmutg helmutg commented Jan 31, 2023

When reading a byte-swapped file, the input is grouped to 4-byte words and each of them is swapped individually. When we try to read such a file, we first validate its header using zfp_read_header with the ZFP_HEADER_MAGIC flag. This flag causes it to only validate the first word to be "zfp\x05". If it is not exactly that, it gives up. Unfortunately, this magic word can already be swapped. The actual byte swapping code would only be tried once the full header would fail to read, so automatic byte swapping never worked.

Instead, when encountering a header with bad magic, try swapping it already and only try reading the full header once the magic (normal or swapped) has been read successfully.

Thanks to Mark C. Miller, Peter Lindstrom and Enrico Zini for doing most of the debugging to get here.

When reading a byte-swapped file, the input is grouped to 4-byte words
and each of them is swapped individually. When we try to read such a
file, we first validate its header using zfp_read_header with the
ZFP_HEADER_MAGIC flag. This flag causes it to only validate the first
word to be "zfp\x05". If it is not exactly that, it gives up.
Unfortunately, this magic word can already be swapped. The actual byte
swapping code would only be tried once the full header would fail to
read, so automatic byte swapping never worked.

Instead, when encountering a header with bad magic, try swapping it
already and only try reading the full header once the magic (normal or
swapped) has been read successfully.

Thanks to Mark C. Miller, Peter Lindstrom and Enrico Zini for doing most
of the debugging to get here.
Copy link
Collaborator

@brtnfld brtnfld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. The test is in #102

@markcmiller86
Copy link
Member

Ok, sorry for long delay in assessing. I needed to fully understand why orig. implementation was failing and that new approach is best way to go.

What the filter is basically doing is using HDF5's cd_values stuff (an array of user-defined length of unsigned ints) to store a faux (or dummy) ZFP stream header. That is because that header captures everything the ZFP library needs to know about the compressed data. We use ZFP library to write its header into a buffer that is later treated by HDF5 as the dataset's cd_values array and we do the reverse when reading.

But, ZFP is expecting to be able to do that in an endian-agnostic way. In otherwords, you get the same sequence of bytes in the ZFP stream regardless of whether its done on a big-endian or little-endian machine. Thats fine but the cd_values array of unsigned ints is NOT endian-agnostic. It is designed to store little endian to the file but nonetheless handle byte swapping of the cd_values data (on both write and read) when interacting with a big endian caller.

So, ZFP's header will wind up experiencing byte swapping from HDF5 on big endian systems. The implementation was to trigger off a failure of ZFP to read its header as a hint that maybe the data got byte-swapped and then UNswap the data and try reading the header a second time. That worked fine until separate logic to read just ZFP's MAGIC was introduced (cd1de1c) but did not also handle the possible byte-swapping scenario.

Thanks so much to @helmutg for finding and fixing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants