
Multi-dimensional arrays have reversed order if loaded by h5py in Python3 #785

Closed
qin-yu opened this issue Dec 20, 2020 · 16 comments · Fixed by #805

Comments

@qin-yu
Contributor

qin-yu commented Dec 20, 2020

Hi, I noticed that if I save a 3×473×600 Array{UInt8,3} array using h5write() in Julia and then load this dataset into Python3 with h5py.File(), the loaded numpy array has shape (600, 473, 3). Is it a known issue? Also, is it a design choice?
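For anyone seeing this for the first time, the effect can be reproduced with numpy alone, no HDF5 involved: the same flat buffer interpreted column-major with one shape equals the row-major interpretation with the reversed shape, up to reversing the axes. A minimal sketch with made-up toy dimensions (not HDF5.jl code):

```python
import numpy as np

# The same 24-element buffer, interpreted two ways:
col_major = np.arange(24).reshape((2, 3, 4), order="F")  # Julia-style layout
row_major = np.arange(24).reshape((4, 3, 2), order="C")  # shape as h5py reports it

# Reversing the axes of one gives the other: no data moves, only the
# labelling of the dimensions differs.
print(np.array_equal(col_major.transpose(2, 1, 0), row_major))  # -> True
```

So nothing is corrupted on disk; the dimensions are simply listed in the opposite order by the two readers.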

@musm
Member

musm commented Dec 22, 2020

Hmm, I think this is intentional since Julia uses column-major ordering vs. C/Python row-major ordering. Although we should probably fix this so it's consistent.

@qin-yu
Contributor Author

qin-yu commented Dec 22, 2020

Hi @musm, since you are considering a fix, I propose that we swap the dimensions (of any array with dimension > 1) with permutedims() before writing into and reading from .h5 files.

Although C/Python are more popular these days, do you want to make this an optional argument, say rowmajor=false, so the code remains correct for existing users?

May I do the fix?

@musm
Member

musm commented Dec 22, 2020

Sure, I think you would basically need permutedims with reverse(1:ndims(buf)) or something like that (I haven't checked).

@jmert what do you think about this?

@qin-yu
Contributor Author

qin-yu commented Dec 22, 2020

Sure, I think you would basically need permutedims with reverse(1:ndims(buf)) or something like that (I haven't checked).

Exactly, just checked. Let Julia construct a column-major 3D array with collect(1:4*3*2) like this one:

julia> reshape(collect(1:4*3*2), (2, 3, 4))
2×3×4 Array{Int64,3}:
[:, :, 1] =
 1  3  5
 2  4  6

[:, :, 2] =
 7   9  11
 8  10  12

[:, :, 3] =
 13  15  17
 14  16  18

[:, :, 4] =
 19  21  23
 20  22  24

Save it to a .h5 file:

using HDF5
test3d = reshape(collect(1:4*3*2), (2, 3, 4))  # by default Julia reshapes in column-major order
h5write("test3d.h5", "col", test3d)
h5write("test3d.h5", "row", permutedims(test3d, reverse(1:ndims(test3d))))

Read it in Python3:

import h5py
with h5py.File("./test3d.h5", 'r') as f3d:
    print(f3d['col'][:], f3d['col'][:].shape)  # should be row-major now in Python
    print(f3d['row'][:], f3d['row'][:].shape)  # should be column-major now in Python

we have the expected output:

[[[ 1  2]
  [ 3  4]
  [ 5  6]]

 [[ 7  8]
  [ 9 10]
  [11 12]]

 [[13 14]
  [15 16]
  [17 18]]

 [[19 20]
  [21 22]
  [23 24]]] (4, 3, 2)

[[[ 1  7 13 19]
  [ 3  9 15 21]
  [ 5 11 17 23]]

 [[ 2  8 14 20]
  [ 4 10 16 22]
  [ 6 12 18 24]]] (2, 3, 4)

@jmert
Contributor

jmert commented Dec 22, 2020

@jmert what do you think about this?

It's something I had in the back of my mind, but I haven't yet taken the time to dig through the development history to see why this particular choice was made. In my own use case, there's no ambiguity (I've been most interested in reading serialized CSC sparse matrices which get stored as multiple vectors and some metadata) so I haven't worried about it too much.

The immediate complication/downside I see to performing any implicit permuting of dimensions is that memory mapping an array won't equal reading the array.
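The memory-mapping concern can be made concrete with numpy alone: a permuted copy has a different byte layout than the original buffer, so a memory map of the file's raw bytes could no longer be handed back as the "read" array. A hedged sketch, not HDF5.jl code:

```python
import numpy as np

a = np.arange(6, dtype=np.int64).reshape(2, 3)

# np.transpose returns a zero-copy view; materializing that view in C order
# rearranges the underlying bytes.
permuted_copy = np.ascontiguousarray(a.transpose())

# The raw bytes differ, so an mmap of the on-disk buffer cannot also serve
# as the dimension-permuted array.
print(a.tobytes() == permuted_copy.tobytes())  # -> False
```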

@qin-yu
Contributor Author

qin-yu commented Dec 22, 2020

In my own use case, there's no ambiguity (I've been most interested in reading serialized CSC sparse matrices which get stored as multiple vectors and some metadata) so I haven't worried about it too much.

Ah, I see. For me, people in my lab work on 3D images packed into 4D arrays, so I never thought of compressed sparse storage. I think TensorFlow/PyTorch users (including me) are starting to try Flux now, so it might be good if HDF5.jl had an out-of-the-box optional arg on read/write.

The immediate complication/downside I see to performing any implicit permuting of dimensions is that memory mapping an array won't equal reading the array.

Maybe we can have a slow version with permutedims() now and make it fast later at a lower level?

@eschnett
Contributor

Isn't permutedims an expensive operation that copies a lot of memory? I wouldn't do that by default.

numpy supports Fortran-style arrays, so reading Julia output from Python is possible without transposing arrays. Similarly, Julia supports transposed arrays (the type LinearAlgebra.Adjoint), which means that actually transposing arrays is never necessary.

It is unfortunate that HDF5 does not store a flag that indicates whether an array is stored in a row-major or column-major way. Given this, what I describe above is the standard approach.

@qin-yu
Contributor Author

qin-yu commented Jan 11, 2021

Isn't permutedims an expensive operation that copies a lot of memory? I wouldn't do that by default.

numpy supports Fortran-style arrays, so reading Julia output from Python is possible without transposing arrays. Similarly, Julia supports transposed arrays (the type LinearAlgebra.Adjoint), which means that actually transposing arrays is never necessary.

It is unfortunate that HDF5 does not store a flag that indicates whether an array is stored in a row-major or column-major way. Given this, what I describe above is the standard approach.

Thank you for sharing some ideas! Indeed permutedims() sounds expensive (I haven't measured it), and changing the default breaks code. It also makes sense to let Python users deal with the reading-into-Python problem.

But changing the order is not only about writing but also about reading, and I wouldn't expect future Julia users who open an HDF5 file to have to know Python. Also, the (lazy and probably better-in-some-way) LinearAlgebra.Adjoint is "intended for linear algebra usage - for general data manipulation see permutedims" (also [*]).

I guess some users have to face this problem in some ways, with or without this repo. Shall we close the issue or do you think we should get a built-in optional argument? @musm @jmert

@musm
Member

musm commented Jan 11, 2021

So if I understand correctly, the issue is specifically with HDF5 arrays created in numpy, and there's no issue with HDF5.jl reading HDF5 arrays created in, say, Fortran w.r.t. their ordering.

Does numpy store any optional metadata that we could use to help decide whether we need to wrap it in a transpose op or not?

@qin-yu
Contributor Author

qin-yu commented Jan 11, 2021

So if I understand correctly, the issue is specifically with HDF5 arrays created in numpy, and there's no issue with HDF5.jl reading HDF5 arrays created in, say, Fortran w.r.t. their ordering.

Yes, there is a problem reading HDF5 arrays created by h5py. I have not tried HDF5.jl on Fortran-written HDF5 files.

Does numpy store any optional metadata that we could use to help make a decision on whether we need to wrap it in a tranpose op or not?

Good point! There is a shape parameter specified (optional but could be inferred anyway) when creating a dataset with h5py. Let me check if it's written to the file and if I can read it with HDF5.jl.

@qin-yu
Contributor Author

qin-yu commented Jan 11, 2021

Hi @musm, I tried to check with a file created by Python's h5py, but it seems that there is no information about the shape unless you look at the object itself:

julia> h5open("data/plant1_0hrs.h5", "r") do file
           println(keys(file))
           println(size(file["raw"]))
           println(HDF5.attributes(file["raw"]))
       end

["label", "raw"]
(512, 512, 134)
Attributes of HDF5.Dataset: /raw (file: data/plant1_0hrs.h5 xfer_mode: 0)

I believe the shape information is not explicitly written into the file.

@jmert
Contributor

jmert commented Jan 13, 2021

I first found this discourse comment on HDF5's intentions for storing arrays — in my reading, HDF5.jl is doing exactly what was intended and is reversing the order of dimensions to have the "fastest moving" (unit stride) dimension be the last listed dimension in HDF5's dataspace. (This is backed up by how libhdf5's own Fortran interface is documented as behaving: see section 7.3.2.5 of the HDF5 user guide.)

That being said, I too find the situation frustrating. On the one hand, they're right that the dataspace dimension reversal preserves the [relative] structuring of the data (i.e. adjacent elements are still adjacent, strides between elements are preserved, etc.), but it leaves an ambiguity in the "linear algebra"/"image" ordering of the dimensions. They've gone through careful effort in things like providing separate data types for different endiannesses, so it would have been nice to have some sort of "dimensions reversed" flag on the dataspace to indicate whether the dimensions in matrix/math notation and the dataspace agree (C ordering) or are reversed (Fortran ordering).

I guess that means I'm arguing to err on the side of leaving things as they are, since it agrees with how the official Fortran interface behaves (and Julia's column-major ordering matches Fortran's). I don't think I'm in support of adding flags, since the official libhdf5 docs clearly state what is intended, and that kind of knowledge probably should be required of the user [of HDF5.jl]. permutedims works easily enough at the package/application level if it's strictly required, and that kind of knowledge usually needs to either be documented in your data format (as docs or a reference implementation) or can be added as data-specific attributes (much like what Python's h5sparse does to describe sparse matrices).
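As an illustration of that last point, an application could adopt its own convention and record the intended ordering next to the dataset (the attribute name `dim_order` below is made up for this sketch, not an HDF5 or HDF5.jl standard). A hedged h5py example using an in-memory file:

```python
import numpy as np
import h5py

# In-memory file: driver="core" with backing_store=False never touches disk.
with h5py.File("order_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("x", data=np.arange(24).reshape(2, 3, 4))
    dset.attrs["dim_order"] = "C"  # hypothetical application-level flag

    arr = f["x"][:]
    # A reader that knows the convention can decide whether to reverse axes.
    if f["x"].attrs["dim_order"] != "C":
        arr = arr.transpose()
    print(arr.shape)  # -> (2, 3, 4)
```

This keeps the on-disk layout untouched and pushes the interpretation decision to the data format's documentation, which matches the conclusion of the thread.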

@qin-yu
Contributor Author

qin-yu commented Jan 13, 2021

Indeed there is no elegant solution. I agree that making the users aware of this is one of the best things we can do. The comments here are perhaps very informative and helpful already! Shall we close the issue?

@musm
Member

musm commented Jan 13, 2021

Agreed. I think the best we could do is document this behavior here. @qin-yu any chance you want to take a stab at it?

@qin-yu
Contributor Author

qin-yu commented Jan 13, 2021

Sure, though I'm not an expert on this. Maybe we can have a short notice in README + a paragraph in documentation + a sentence in docstring for reading functions?

@musm
Member

musm commented Jan 13, 2021

The main doc file is simply a markdown file located at https://github.com/JuliaIO/HDF5.jl/blob/master/docs/src/index.md
