
Added new endianness test #102

Merged (1 commit) · Feb 17, 2023

Conversation

@spanezz (Contributor) commented Jan 31, 2023

I added to tests/ a bigendian.h5 file generated on an s390x big-endian machine using:

    yes 12345 | head -n 5000 | h5import /dev/stdin -dims 5000 -type TEXTIN -size 32 -o unpacked.h5
    h5repack -f UD=32013,0,4,1,0,0,1074921472 unpacked.h5 bigendian.h5

I then added a test-endian1 test that checks whether the h5dump output for that file contains the 12345 values.

This test fails on little-endian machines without #101 and succeeds with the patch from #101 applied.

@brtnfld self-requested a review on February 16, 2023
@brtnfld (Collaborator) left a comment

Looks good to me. @markcmiller86, I wonder if we should add a testfiles directory to house all the HDF5 files used in the tests?

@markcmiller86 (Member) commented

FYI, I am working on this. I don't have easy access to a big-endian system, so I've found an s390x Docker container that works. I am trying to reproduce the bad behavior that triggered this code change and so far haven't been able to, in spite of now having a big-endian system.

For a minute, I was worried that the two test files test_zfp_be.h5 and test_zfp_le.h5 were mistakenly both little endian, but they do not appear to be. I verified this with od -j <skip> -e, printing the original dataset values from both files and confirming that the two come back byte-swapped relative to each other. Finding the correct skip value, the file offset of the beginning of the original dataset's first chunk, was a tad difficult, though: h5ls emits a dataset location as a pair of ints (e.g. 1:800), which is certainly not an offset into the file where the dataset starts.

@markcmiller86 (Member) commented

Ok, well, I guess I should have read the initial comment here.

First, thanks for adding this test!

That is exactly what I was looking for.

Ok, let me just confirm it's behaving as expected for me too.

@markcmiller86 (Member) commented

@brtnfld I just want to check... how does the HDF5 library handle the endianness of the cd_values array stored in the dataset header? These values are consumed (read and written) by the HDF5 caller but are also stored in the file. That is, does the library pick an endianness for the cd_values data stored in the file and automatically byte-swap when cd_values are read or written in a cross-endian context?

@brtnfld (Collaborator) commented Feb 16, 2023

From Jordan:

Looks like the library does a UINT32ENCODE on the cd_values when writing out the filter pipeline message

        /* Filter parameters */
        for (j = 0; j < filter->cd_nelmts; j++)
            UINT32ENCODE(p, filter->cd_values[j]);

So I think that means the values will get stored in the endianness of whatever machine the file is written from, which may not be ideal

@markcmiller86 (Member) commented

> From Jordan:
>
> Looks like the library does a UINT32ENCODE on the cd_values when writing out the filter pipeline message
>
>         /* Filter parameters */
>         for (j = 0; j < filter->cd_nelmts; j++)
>             UINT32ENCODE(p, filter->cd_values[j]);
>
> So I think that means the values will get stored in the endianness of whatever machine the file is written from, which may not be ideal

Well, in looking at the definition of that macro in H5Fprivate.h...

#  define UINT32ENCODE(p, i) {                              \
   *(p) = (uint8_t)( (i)        & 0xff); (p)++;                      \
   *(p) = (uint8_t)(((i) >>  8) & 0xff); (p)++;                      \
   *(p) = (uint8_t)(((i) >> 16) & 0xff); (p)++;                      \
   *(p) = (uint8_t)(((i) >> 24) & 0xff); (p)++;                      \
}

That writes the least significant byte to the lowest memory location and the most significant byte to the highest memory location... little endian. So, it's storing little endian to the file.

But, it basically reverses this process when returning cd_values to a reader here...

https://github.com/HDFGroup/hdf5/blob/966454aac1231da7209ef81c11055d3312181f99/src/H5Opline.c#L230-L239

using this logic...

#  define UINT32DECODE(p, i) {                              \
   (i)  =  (uint32_t)(*(p) & 0xff);        (p)++;                  \
   (i) |= ((uint32_t)(*(p) & 0xff) <<  8); (p)++;                  \
   (i) |= ((uint32_t)(*(p) & 0xff) << 16); (p)++;                  \
   (i) |= ((uint32_t)(*(p) & 0xff) << 24); (p)++;                  \
}

So, my read is that it's always storing little endian to the file but also correctly byte-swapping the result when the consumer or producer is NOT little endian.
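
To check this concretely, here is a small standalone program (an illustrative sketch that copies the two macros above; it is not HDF5 source) showing that the encoded buffer is byte-for-byte little endian on any host and that the round trip recovers the original value:

    /* Illustrative sketch (not HDF5 source): copies of the two macros above,
     * showing the encoded bytes are little endian regardless of host order
     * and that decode exactly reverses encode. */
    #include <stdint.h>
    #include <stdio.h>

    #define UINT32ENCODE(p, i) {                                       \
       *(p) = (uint8_t)( (i)        & 0xff); (p)++;                    \
       *(p) = (uint8_t)(((i) >>  8) & 0xff); (p)++;                    \
       *(p) = (uint8_t)(((i) >> 16) & 0xff); (p)++;                    \
       *(p) = (uint8_t)(((i) >> 24) & 0xff); (p)++;                    \
    }

    #define UINT32DECODE(p, i) {                                       \
       (i)  =  (uint32_t)(*(p) & 0xff);        (p)++;                  \
       (i) |= ((uint32_t)(*(p) & 0xff) <<  8); (p)++;                  \
       (i) |= ((uint32_t)(*(p) & 0xff) << 16); (p)++;                  \
       (i) |= ((uint32_t)(*(p) & 0xff) << 24); (p)++;                  \
    }

    int main(void)
    {
        uint32_t in = 0x11223344, out = 0;
        uint8_t  buf[4], *p;

        p = buf;
        UINT32ENCODE(p, in);
        /* prints "44 33 22 11" on big AND little endian hosts */
        printf("%02x %02x %02x %02x\n", buf[0], buf[1], buf[2], buf[3]);

        p = buf;
        UINT32DECODE(p, out);
        printf("round trip %s\n", in == out ? "ok" : "FAILED");
        return 0;
    }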

@jhendersonHDF (Contributor) commented

@markcmiller86 Ah yes, thanks for the correction. Doing too many things at once...

I agree, it appears we're writing the filter's cd_values to the file as little endian and then byte-swapping on read.

@spanezz (Contributor, author) commented Feb 17, 2023

This is probably of interest, although a bit late: I did try to verify that cd_values are preserved correctly across machines of different endianness, by creating a mock filter plugin that sets cd_values and leaves data unchanged.

Here it is: h5zcat.zip, together with an HDF5 file repacked with h5repack on a little-endian machine and one repacked on a big-endian machine.
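
For reference, a pass-through filter plugin of this sort can be quite small. The sketch below is hypothetical (the actual code is in h5zcat.zip; the filter ID 256, taken from HDF5's test range, and all names are made up for illustration): it registers a filter whose callback prints the cd_values the library hands it and returns the data unchanged, so repacking a file through it exercises only the cd_values storage path.

    /* Hypothetical sketch of a pass-through HDF5 filter plugin; the real
     * implementation is in h5zcat.zip. Filter ID 256 (HDF5's test range)
     * and the names here are illustrative. */
    #include <stdio.h>
    #include "hdf5.h"
    #include "H5PLextern.h"

    #define H5Z_FILTER_PASSTHRU 256  /* test-range filter ID: an assumption */

    static size_t
    passthru_filter(unsigned flags, size_t cd_nelmts, const unsigned cd_values[],
                    size_t nbytes, size_t *buf_size, void **buf)
    {
        size_t i;

        (void)flags; (void)buf_size; (void)buf;

        /* Print the cd_values as the library delivers them, so files written
         * on machines of different endianness can be compared. */
        for (i = 0; i < cd_nelmts; i++)
            fprintf(stderr, "cd_values[%zu] = %u\n", i, cd_values[i]);

        return nbytes;  /* leave the data untouched; 0 would signal failure */
    }

    static const H5Z_class2_t PASSTHRU_CLASS[1] = {{
        H5Z_CLASS_T_VERS,      /* H5Z_class_t version */
        H5Z_FILTER_PASSTHRU,   /* filter ID */
        1, 1,                  /* encoder and decoder present */
        "pass-through (sketch)",
        NULL,                  /* can_apply */
        NULL,                  /* set_local */
        passthru_filter,       /* the filter callback */
    }};

    /* entry points the HDF5 dynamic plugin loader looks for */
    H5PL_type_t H5PLget_plugin_type(void) { return H5PL_TYPE_FILTER; }
    const void *H5PLget_plugin_info(void) { return PASSTHRU_CLASS; }

Repacking a file with such a plugin on machines of both endiannesses and comparing the printed cd_values on read-back is then enough to confirm the library normalizes their byte order.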

@markcmiller86 (Member) commented

Thanks so much @spanezz for the added test; just what we needed. Cool way to generate the big-endian data too, much easier than what I was doing: I actually installed development tools and HDF5 on an s390x Docker image and then built and ran the filter and its tests there.
