Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Npy file format support #2437

Merged
merged 23 commits into from
Aug 8, 2023
Merged

Npy file format support #2437

merged 23 commits into from
Aug 8, 2023

Conversation

Lestropie
Copy link
Member

@Lestropie Lestropie commented Feb 25, 2022

Superset of #2436. I'll leave that open for now, but showing this PR might help to more clearly provide a justification for those changes.

Listed as draft PR for now as I would like to add some unit tests, which will require additions to the test data repo.

Discussion points:

  1. The functions for loading & saving matrix data from / to the filesystem are here proposed to be moved from the ::MR namespace in file core/math/math.h to the ::MR::File::Matrix namespace in file core/file/matrix.h. I always found it unusual to have those functions in our base namespace, as well as having those functions defined in a header file where the content of the first half is so drastically different and in no way related to filesystem interfacing. While this change would break backward compatibility with any external C++ projects, it to me feels like a much more appropriate location, and external projects could get away with testing the content of MRTRIX_MINOR_VERSION at compile time if preserving 3.0.x compatibility is desired.

  2. Support for 16-bit floating-point data is via a single header include file, which is integrated into the repository, just as was done previously with JSON for Modern C++. It however occurs to me that the utilisation of such libraries is not broadcast to the same extent as are external dependencies that are listed as prerequisites for compilation. I wonder if we should have somewhere listed all packages on which the software depends, regardless of whether they are compile-time dependencies or imported directly into the repository.

  3. 16-bit floating-point support is currently implemented in a manner specific to the NPY file format for matrix or vector data. It does however occur to me that Float16 could theoretically be just another type in MR::DataType, with get/set functions that use the new 16-bit floating-point library, and the MRtrix image format would inherit support for such. That would potentially simplify the NPY code slightly, but would open up a whole bunch of requisite testing to make sure it doesn't break something else (e.g. no idea how easy or difficult this data type might be to support in mrview without trying it).

  4. There is scope to speed up the NPY load / save operations in cases where data type / endianness / row vs. column ordering is identical between disk and RAM, by doing a direct memcpy() with the memory-mapped region.

Lestropie and others added 5 commits February 21, 2022 22:01
- Move files fetch_store.* from core/image_io/ to core/.
- Rename some functions to reflect the fact that they perform a linear transform of intensities in addition to fetching or storing raw data.
- Add new functions that provide access to such template functions but without the use of that linear transformation, which tends to be specific to image data.
New file core/file/matrix.h now contains functions relating to the loading and saving of numerical matrix data from / to text file. In addition, these functions are now in namespace File::Matrix::*, rather than MR::*. The exception is function parse_matrix(), which has instead been moved to file core/mrtrix.h.
Any MRtrix3 command that loads vector / matrix data from text files, or saves such data to text files, should now transparently handle binary NumPy .npy files.
This includes support for float16 data, and floating-point numerical data can be saved in float16 with setting of the corresponding MRtrix config file entry.
@Lestropie Lestropie added this to the 3.1.0 updates milestone Feb 25, 2022
@Lestropie Lestropie self-assigned this Feb 25, 2022
@jdtournier
Copy link
Member

The functions for loading & saving matrix data from / to the filesystem are here proposed to be moved from the ::MR namespace in file core/math/math.h to the ::MR::File::Matrix namespace in file core/file/matrix.h.

👍 Happy with that.

I wonder if we should have somewhere listed all packages on which the software depends, regardless of whether they are compile-time dependencies or imported directly into the repository.

Fully agree. I've had similar thoughts in the past, but not done anything about them... External code that we import wholesale includes the original NIfTI headers, glLoadGen to generate the src/gui/opengl/gl_core_3_3.* files (though the project seems defunct now - this is the only reference I can find to it on the web...), along with JSON for Modern C++. There may be others I've forgotten about...

It does however occur to me that Float16 could theoretically be just another type in MR::DataType, with get/set functions that use the new 16-bit floating-point library, and the MRtrix image format would inherit support for such.

Yes, that sounds like the most sensible way to do it!

but would open up a whole bunch of requisite testing to make sure it doesn't break something else (e.g. no idea how easy or difficult this data type might be to support in mrview without trying it).

I wouldn't worry about MRView - as long as the internal in-RAM representation is one it can deal with, it doesn't matter how it was stored on file. I'd be surprised if that broke anything...

The MRtrix image format now supports half-precision floating-point data. Some handling of the NumPy .npy array file format has been simplified to accommodate these deeper changes.
@Lestropie
Copy link
Member Author

16-bit floating-point images working, and appear properly in mrview:

$ mrconvert dti_FA.nii.gz -datatype float16le - | mrinfo -
mrconvert: [100%] uncompressing image "dti_FA.nii.gz"
mrconvert: [100%] copying from "dti_FA.nii.gz" to "./mrtrix-tmp-Oi6GFd.mif"
************************************************
Image name:          "./mrtrix-tmp-Oi6GFd.mif"
************************************************
  Dimensions:        104 x 104 x 72
  Voxel size:        2.01923 x 2.01923 x 2
  Data strides:      [ -1 2 3 ]
  Format:            Internal pipe
  Data type:         16 bit float (little endian)
  Intensity scaling: offset = 0, multiplier = 1
  Transform:               0.9953    -0.08673     0.04235      -90.83
                          0.09543      0.9501     -0.2969      -106.5
                         -0.01449      0.2996       0.954        -127
  command_history:   C:\msys64\home\rob\src\mrtrix3\bin\mrconvert.exe dti_FA.nii.gz -datatype float16le -  (version=3.0.3-346-g0dc2a396-dirty)
  comments:          FSL5.0
  mrtrix_version:    3.0.3-346-g0dc2a396-dirty

Conflicts:
	cmd/connectomestats.cpp
	core/image_io/fetch_store.cpp
	core/math/stats/shuffle.cpp
	src/dwi/tractography/SIFT2/tckfactor.cpp
- Prevent endianness warning on single-byte data types.
- Explicitly disable complex datatype support.
- Comment out residual debugging print statements.
- Move large functions to .cpp file by isolating those sections that were dependent on templates and only having those within the .h file.
- Explicitly handle bitwise data in the same way in which the NPY format supports it: as one byte per bit of data. This requires explicit special handling at both read and write time as it is incompatible with packed bit storage.
- Provide a ReadInfo intermediate struct during file read that may potentially be used to construct an Eigen::Map if the data type / dimension / order are known a priori.
@Lestropie Lestropie marked this pull request as ready for review July 15, 2023 14:03
@Lestropie
Copy link
Member Author

Worth noting on this PR, given its import of the half header-only library, that C++23 integrates a 16-bit floating-point format into the standard.

In #1509, a CI failure was encountered where on Mac ARM64 the shared library would not link, complaining about the absence of __set_fetch_function() for unsigned long datatype. It is believed that this is being caused by the presence of MR::Math::load_matrix() with size_t template type being used for loading pre-defined permutation matrices. To address this, the new typedef MR::Math::Stats::index_type is being introduced to refer to any item being indexed into an array or matrix. A 32-bit unsigned integer should be sufficient for all cases of such (the only possible exception is the calculation of the maximal number of shuffles, for which 64-bit unsigned integers are used, just for the sake of potentially calculating and reporting that number, even if it turns out to be much larger than the number of shuffles actually utilised).
Changing indexing variable from signed to unsigned integer in 9a6f0bb introduced problem because of decrementing loop attempting to detect a transition past zero.
Further continuation of typedef adoption in 9a6f0bb.
@Lestropie
Copy link
Member Author

Still having some outstanding issues with CI on this PR.

I had a couple of issues compiling on Mac that have been resolved. One of them I'm still a little confused by: it didn't like compiling File::NPY::load_matrix<size_t>() because it claimed __set_store_functions<unsigned long>() didn't exist, even though the functions for uint64_t should all have been explicitly instantiated for the sake of image IO. But I got around it by just not calling load_matrix<size_t>(), and instead typedef-ing an index_type as uint32_t for use all throughout the relevant code.

Where it's still stuck now is my new unit tests for NPY support, which require importing numpy in CI (I both generate NPYs in C++ and read them in Python, and generate NPYs in Python and read them in C++).

  • On Mac, it's added to the list of packages to install, and looks to be installed just fine, but then when the corresponding script is executed Python claims there's no such module. Maybe a Mac person can help me with that one.
  • On MSYS2, numpy refuses to install. By default, it's using the msys version of Python. There's no pacman package for an msys version of numpy. So I'm trying to install numpy through pip. But it fails, regardless of whether or not I upgrade pip beforehand. I have alternatively tried instead installing both Python and numpy through the mingw64 channel; but in that case Python scripts just don't execute at all.

@fionaEyoung
Copy link
Contributor

On Mac, it's added to the list of packages to install, and looks to be installed just fine, but then when the corresponding script is executed Python claims there's no such module. Maybe a Mac person can help me with that one.

the shebang in testing/bin/testing_check_npy:

#!/usr/bin/python3

⬆️ will be the macos-latest xcode python version (3.8 or 9 I think?) and will look for packages in somewhere like
/Library/Frameworks/Python*.framework/Versions/3.9/lib/python3.9/site-packages

but it looks like macos-latest also has homebrew python (3.11) already installed to /usr/local/bin/python3 and the associated pip3 will install to /usr/local/lib/python3.11/site-packages ( or possibly /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages I'm pretty lost about where / why homebrew puts things sometimes.)

this will be the python3 invoked by python3 -m ensurepip && pip3 install numpy

So either

/usr/bin/pip3 install numpy and #!/usr/bin/python3

or

pip3 install numpy and #!/usr/bin/env python3

should sort it out

@Lestropie
Copy link
Member Author

Thanks @fionaEyoung and @daljit46 for tips on the Python multiple version issue; Mac checks are now working. There, I've chosen to stick to the same shebang for these unit test files as the Python executable scripts (#!/usr/bin/python3).

Unfortunately MSYS2 has a similar issue, and that one may be harder to fix. /usr/bin/python3 corresponds to the MSYS2 version of Python, which unlike the MinGW64 version, not only does not have a numpy package that can be installed through pacman, but pip fails to install the module even with python-dev installed. Possible strategies:

  1. On the CI machine, uninstall the MSYS2 version of Python and softlink /usr/bin/python3 to /mingw64/bin/python3.
  2. Permit the NPY unit tests to fail if numpy is not installed.
  3. Find some other solution to get numpy working on the MSYS2 version of Python.

Regardless, we will need to reconsider the hardcoded shebang on dev (#2261, #2047).

@daljit46
Copy link
Member

daljit46 commented Aug 7, 2023

Note, that we are no longer relying on MINGW64, instead the CI checks now use the newer UCRT64 environment (which uses the ucrt C library instead of the old msvcrt).

I think symlinking could be the least disrupting option here. Also, even if we manage to successfully install the numpy package for MSYS2, it would noticeably increase the CI checks time as the package needs to be built from source (there is no pre-packaged numpy for MSYS).

I personally think we should change the shebang for the scripts as mentioned in #2261. Relying on hardcoded paths is not appropriate in my opinion, and I would rather use #!/usr/bin/env python3 to give the user the possibility to choose what python executable they wish to use. I see very little benefit in using #!/usr/bin/python.

@Lestropie
Copy link
Member Author

even if we manage to successfully install the numpy package for MSYS2, it would noticeably increase the CI checks time

If that's the case, that would IMO be an argument for bypassing those checks in the case where numpy is not installed. As long as the relevant unit tests get executed in at least one environment it should be fine; it's not something for which I expect the behaviour to differ between OS's.

I personally think we should change the shebang for the scripts as mentioned in #2261.

I think I'm leaning in the same direction, though I haven't re-familiarised myself with the debate that led to the current decision. Importantly "#!/usr/bin/python3" is only on dev and is not included in any tag. But we can flesh that out in #2261, and figure out there what the consequences might be in the context of these changes.

@Lestropie Lestropie merged commit 598b076 into dev Aug 8, 2023
5 checks passed
@Lestropie Lestropie deleted the npy branch August 8, 2023 05:01
@daljit46 daljit46 mentioned this pull request May 11, 2024
9 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

Successfully merging this pull request may close these issues.

5 participants