
Multi-dimensional arrays have reversed order if loaded by h5py in Python3 #785

Closed
qin-yu opened this issue Dec 20, 2020 · 16 comments · Fixed by #805

Comments

@qin-yu
Contributor

qin-yu commented Dec 20, 2020

Hi, I noticed that if I save a 3×473×600 Array{UInt8,3} array using h5write() in Julia and then load this dataset into Python3 with h5py.File(), the loaded numpy array has shape (600, 473, 3). Is it a known issue? Also, is it a design choice?
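For anyone seeing this for the first time, the effect can be reproduced with numpy alone, no HDF5 involved: the same flat buffer interpreted column-major with one shape equals the row-major interpretation with the reversed shape, up to reversing the axes. A minimal sketch with made-up toy dimensions (not HDF5.jl code):

```python
import numpy as np

# The same 24-element buffer, interpreted two ways:
col_major = np.arange(24).reshape((2, 3, 4), order="F")  # Julia-style layout
row_major = np.arange(24).reshape((4, 3, 2), order="C")  # shape as h5py reports it

# Reversing the axes of one gives the other: no data moves, only the
# labelling of the dimensions differs.
print(np.array_equal(col_major.transpose(2, 1, 0), row_major))  # -> True
```

So nothing is corrupted on disk; the dimensions are simply listed in the opposite order by the two readers.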

@musm
Member

musm commented Dec 22, 2020

Hmm, I think this is intentional since Julia uses column-major ordering vs. C/Python row-major ordering. Although we should probably fix this so it's consistent.

@qin-yu
Contributor Author

qin-yu commented Dec 22, 2020

Hi @musm, since you are considering a fix, I propose that we swap the dimensions (of any array with dimension > 1) with permutedims() before writing into and reading from .h5 files.

Although C/Python are more popular these days, do you want to make this an optional argument, say rowmajor=false, so the code remains correct for existing users?

May I do the fix?

@musm
Member

musm commented Dec 22, 2020

Sure, I think you would basically need permutedims with reverse(1:ndims(buf)) or something like that (I haven't checked).

@jmert what do you think about this?

@qin-yu
Contributor Author

qin-yu commented Dec 22, 2020

Sure, I think you would basically need permutedims with reverse(1:ndims(buf)) or something like that (I haven't checked).

Exactly, just checked. Let Julia construct a column-major 3D array with collect(1:4*3*2) like this one:

julia> reshape(collect(1:4*3*2), (2, 3, 4))
2×3×4 Array{Int64,3}:
[:, :, 1] =
 1  3  5
 2  4  6

[:, :, 2] =
 7   9  11
 8  10  12

[:, :, 3] =
 13  15  17
 14  16  18

[:, :, 4] =
 19  21  23
 20  22  24

Save it to a .h5 file:

using HDF5
test3d = reshape(collect(1:4*3*2), (2, 3, 4))  # by default Julia reshapes in column-major order
h5write("test3d.h5", "col", test3d)
h5write("test3d.h5", "row", permutedims(test3d, reverse(1:ndims(test3d))))

Read it in Python3:

import h5py
with h5py.File("./test3d.h5", 'r') as f3d:
    print(f3d['col'][:], f3d['col'][:].shape)  # should be row-major now in Python
    print(f3d['row'][:], f3d['row'][:].shape)  # should be column-major now in Python

we have the expected output:

[[[ 1  2]
  [ 3  4]
  [ 5  6]]

 [[ 7  8]
  [ 9 10]
  [11 12]]

 [[13 14]
  [15 16]
  [17 18]]

 [[19 20]
  [21 22]
  [23 24]]] (4, 3, 2)

[[[ 1  7 13 19]
  [ 3  9 15 21]
  [ 5 11 17 23]]

 [[ 2  8 14 20]
  [ 4 10 16 22]
  [ 6 12 18 24]]] (2, 3, 4)

@jmert
Contributor

jmert commented Dec 22, 2020

@jmert what do you think about this?

It's something I had in the back of my mind, but I haven't yet taken the time to dig through the development history to see why this particular choice was made. In my own use case, there's no ambiguity (I've been most interested in reading serialized CSC sparse matrices which get stored as multiple vectors and some metadata) so I haven't worried about it too much.

The immediate complication/downside I see to performing any implicit permuting of dimensions is that memory mapping an array won't equal reading the array.
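The memory-mapping concern can be made concrete with numpy alone: a permuted copy has a different byte layout than the original buffer, so a memory map of the file's raw bytes could no longer be handed back as the "read" array. A hedged sketch, not HDF5.jl code:

```python
import numpy as np

a = np.arange(6, dtype=np.int64).reshape(2, 3)

# np.transpose returns a zero-copy view; materializing that view in C order
# rearranges the underlying bytes.
permuted_copy = np.ascontiguousarray(a.transpose())

# The raw bytes differ, so an mmap of the on-disk buffer cannot also serve
# as the dimension-permuted array.
print(a.tobytes() == permuted_copy.tobytes())  # -> False
```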

@qin-yu
Contributor Author

qin-yu commented Dec 22, 2020

In my own use case, there's no ambiguity (I've been most interested in reading serialized CSC sparse matrices which get stored as multiple vectors and some metadata) so I haven't worried about it too much.

Ah, I see. For me, people in my lab work on 3D images packed into 4D arrays, so I never thought of compressed sparse storage. I think TensorFlow/PyTorch users (including me) are starting to try Flux now, so it might be good if HDF5.jl had an out-of-the-box optional arg on read/write.

The immediate complication/downside I see to performing any implicit permuting of dimensions is that memory mapping an array won't equal reading the array.

Maybe we can have a slow version with permutedims() now and make it fast later at a lower level?

@eschnett
Contributor

Isn't permutedims an expensive operation that copies a lot of memory? I wouldn't do that by default.

numpy supports Fortran-style arrays, so reading Julia output from Python is possible without transposing arrays. Similarly, Julia supports transposed arrays (the type LinearAlgebra.Adjoint), which means that actually transposing arrays is never necessary.

It is unfortunate that HDF5 does not store a flag that indicates whether an array is stored in a row-major or column-major way. Given this, what I describe above is the standard approach.

@qin-yu
Contributor Author

qin-yu commented Jan 11, 2021

Isn't permutedims an expensive operation that copies a lot of memory? I wouldn't do that by default.

numpy supports Fortran-style arrays, so reading Julia output from Python is possible without transposing arrays. Similarly, Julia supports transposed arrays (the type LinearAlgebra.Adjoint), which means that actually transposing arrays is never necessary.

It is unfortunate that HDF5 does not store a flag that indicates whether an array is stored in a row-major or column-major way. Given this, what I describe above is the standard approach.

Thank you for sharing some ideas! Indeed permutedims() sounds expensive (I haven't measured it), and changing the default breaks code. It also makes sense to let Python users deal with the reading-into-Python problem.

But changing the order is not only about writing but also about reading, and I wouldn't expect future Julia users who open an HDF5 file to have to know Python. Also, the (lazy and probably better-in-some-way) LinearAlgebra.Adjoint is "intended for linear algebra usage - for general data manipulation see permutedims" (also [*]).

I guess some users have to face this problem in some ways, with or without this repo. Shall we close the issue or do you think we should get a built-in optional argument? @musm @jmert

@musm
Member

musm commented Jan 11, 2021

So if I understand correctly, the issue is specifically with HDF5 arrays created in numpy, and there's no issue with HDF5.jl reading HDF5 arrays created in, say, Fortran w.r.t. their ordering.

Does numpy store any optional metadata that we could use to help decide whether we need to wrap it in a transpose op or not?

@qin-yu
Contributor Author

qin-yu commented Jan 11, 2021

So if I understand correctly, the issue is specifically with HDF5 arrays created in numpy, and there's no issue with HDF5.jl reading HDF5 arrays created in, say, Fortran w.r.t. their ordering.

Yes, there is a problem reading HDF5 arrays created by h5py. I have not tried HDF5.jl on Fortran-written HDF5 files.

Does numpy store any optional metadata that we could use to help make a decision on whether we need to wrap it in a tranpose op or not?

Good point! There is a shape parameter specified (optional but could be inferred anyway) when creating a dataset with h5py. Let me check if it's written to the file and if I can read it with HDF5.jl.

@qin-yu
Contributor Author

qin-yu commented Jan 11, 2021

Hi @musm, I tried to check with a file created by Python's h5py, but it seems that there is no information about the shape unless you look at the object itself:

julia> h5open("data/plant1_0hrs.h5", "r") do file
           println(keys(file))
           println(size(file["raw"]))
           println(HDF5.attributes(file["raw"]))
       end

["label", "raw"]
(512, 512, 134)
Attributes of HDF5.Dataset: /raw (file: data/plant1_0hrs.h5 xfer_mode: 0)

I believe the shape information is not explicitly written into the file.

@jmert
Contributor

jmert commented Jan 13, 2021

I first found this discourse comment on HDF5's intentions for storing arrays — in my reading, HDF5.jl is doing exactly what was intended and is reversing the order of dimensions to have the "fastest moving" (unit stride) dimension be the last listed dimension in HDF5's dataspace. (This is backed up by how libhdf5's own Fortran interface is documented as behaving: see section 7.3.2.5 of the HDF5 user guide.)

That being said, I too find the situation frustrating. On the one hand, they're right that the dataspace dimension reversal preserves the [relative] structuring of the data (i.e. adjacent elements are still adjacent, strides between elements are preserved, etc.), but it leaves an ambiguity in the "linear algebra"/"image" ordering of the dimensions. They've gone through careful effort in things like providing separate data types for different endiannesses, so it would have been nice to have some sort of "dimensions reversed" flag on the dataspace to indicate whether the dimensions in matrix/math notation and the dataspace agree (C ordering) or are reversed (Fortran ordering).

I guess that means I'm arguing to err on the side of leaving things as they are, since it agrees with how the official Fortran interface behaves (and Julia's column-major ordering matches Fortran's). I don't think I'm in support of adding flags, since the official libhdf5 docs clearly state what is intended, and that kind of knowledge probably should be required of the user [of HDF5.jl]. permutedims works easily enough at the package/application level if it's strictly required, and that kind of knowledge usually needs to either be documented in your data format (as docs or a reference implementation) or can be added as data-specific attributes (much like what Python's h5sparse does to describe sparse matrices).
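As an illustration of that last point, an application could adopt its own convention and record the intended ordering next to the dataset (the attribute name `dim_order` below is made up for this sketch, not an HDF5 or HDF5.jl standard). A hedged h5py example using an in-memory file:

```python
import numpy as np
import h5py

# In-memory file: driver="core" with backing_store=False never touches disk.
with h5py.File("order_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("x", data=np.arange(24).reshape(2, 3, 4))
    dset.attrs["dim_order"] = "C"  # hypothetical application-level flag

    arr = f["x"][:]
    # A reader that knows the convention can decide whether to reverse axes.
    if f["x"].attrs["dim_order"] != "C":
        arr = arr.transpose()
    print(arr.shape)  # -> (2, 3, 4)
```

This keeps the on-disk layout untouched and pushes the interpretation decision to the data format's documentation, which matches the conclusion of the thread.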

@qin-yu
Contributor Author

qin-yu commented Jan 13, 2021

Indeed there is no elegant solution. I agree that making the users aware of this is one of the best things we can do. The comments here are perhaps very informative and helpful already! Shall we close the issue?

@musm
Member

musm commented Jan 13, 2021

Agreed. I think the best we could do is document this behavior here. @qin-yu any chance you want to take a stab at it?

@qin-yu
Contributor Author

qin-yu commented Jan 13, 2021

Sure, though I'm not an expert on this. Maybe we can have a short notice in README + a paragraph in documentation + a sentence in docstring for reading functions?

@musm
Member

musm commented Jan 13, 2021

The main doc file is simply a markdown file located at https://github.com/JuliaIO/HDF5.jl/blob/master/docs/src/index.md
