Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Add Zarr driver #3411

Closed
wants to merge 2 commits into from
Closed

Conversation

davidbrochart
Copy link

What does this PR do?

This is an effort to add a Zarr driver, based on xtensor-zarr.

What are related issues/pull requests?

Tasklist

  • ADD YOUR TASKS HERE
  • Add test case(s)
  • Add documentation
  • Review
  • Adjust for comments
  • All CI builds and checks have passed

Environment

Provide environment details, if relevant:

  • OS:
  • Compiler:

@@ -1630,7 +1664,7 @@ OGRFORMATS_ENABLED=
OGRFORMATS_ENABLED_CFLAGS=
OGRFORMATS_DISABLED=

AC_DEFUN([INTERNAL_FORMATS],[aaigrid adrg aigrid airsar arg blx bmp bsb cals ceos ceos2 coasp cosar ctg dimap dted e00grid elas envisat ers esric fit gff gsg gxf hf2 idrisi ignfheightasciigrid ilwis ingr iris iso8211 jaxapalsar jdem kmlsuperoverlay l1b leveller map mrf msgn ngsgeoid nitf northwood pds prf r raw rmf rs2 safe saga sdts sentinel2 sgi sigdem srtmhgt stacta terragen tga til tsx usgsdem xpm xyz zmap])
AC_DEFUN([INTERNAL_FORMATS],[aaigrid adrg aigrid airsar arg blx bmp bsb cals ceos ceos2 coasp cosar ctg dimap dted e00grid elas envisat ers esric fit gff gsg gxf hf2 idrisi ignfheightasciigrid ilwis ingr iris iso8211 jaxapalsar jdem kmlsuperoverlay l1b leveller map mrf msgn ngsgeoid nitf northwood pds prf r raw rmf rs2 safe saga sdts sentinel2 sgi sigdem srtmhgt stacta terragen tga til tsx usgsdem xpm xyz zarr zmap])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

zarr is not an internal driver as it requires a direct dependency. you should follow the same logic as in e.g. hdf5, kakadu, openjpeg...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tbonfort, this should be fixed now.


.. shortname:: Zarr

.. versionadded:: 3.2.2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be 3.3.0

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -51,7 +51,9 @@ CXXFLAGS="-std=c++17 -O1 $ARCH_FLAGS" CFLAGS="-O1 $ARCH_FLAGS" ./configure --pre
--with-kea=/usr/bin/kea-config \
--with-tiledb \
--with-crypto \
--with-ecw=/opt/libecwj2-3.3
--with-ecw=/opt/libecwj2-3.3 \
--with-zarr \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would be nice if --with-zarr could take a non default install location

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean, we can pass any directory to --with-zarr already? Or do you mean in the CI specifically?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I missed that part. I did mean that the non-standard path can be selected by the user outside of the CI.
However for checking for the lib/headers in default location you should use something similar to https://github.com/OSGeo/gdal/blob/64db832b3e475bb70d5acae99820c5ee405c9838/gdal/configure.ac#L2397 instead of looking up a hard-coded path. Namely -I/usr/include should never be added manually to the CFLAGS

--with-ecw=/opt/libecwj2-3.3
--with-ecw=/opt/libecwj2-3.3 \
--with-zarr \
--enable-driver-zarr
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should not be necessary with a correct configure script, --with-zarr should be sufficient to activate the driver

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍


#include <algorithm>

CPL_CVSID("$Id$")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way, I'm not sure why we keep this. Recently added STACTA driver also contains such line, likely, by copy & paste carry-over. Do we still need such cruft @rouault ?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we still need such cruft @rouault ?

They are substitued by the mkgdaldist.sh script. This is probably only useful to spy and know which version of GDAL some binary build embeds.

Comment on lines 49 to 51
template <typename T>
void assign_chunk(void* pImage, xt::zarray& z, int nBlockYSize, int nBlockXSize,
int nBlockYOff, int nBlockXOff,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For general hygiene, such utilities in .cpp file should live in unnamed namespace:

namespace
{
template <typename T>
void assign_chunk(....
}


public:
ZarrRasterBand( ZarrDataset *, int, xt::xzarr_hierarchy<T>& );
virtual ~ZarrRasterBand();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

virtual is redundant, if anything it should have ~ZarrRasterBand() override


public:
ZarrDataset();
~ZarrDataset();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could read ~ZarrDataset() override

ZarrRasterBand( ZarrDataset *, int, xt::xzarr_hierarchy<T>& );
virtual ~ZarrRasterBand();

virtual CPLErr IReadBlock( int, int, void * ) override;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

virtual is redundant

ZarrDataset::ZarrDataset() :
fp(nullptr)
{
std::fill_n(abyHeader, CPL_ARRAYSIZE(abyHeader), static_cast<GByte>(0));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::memset may be better optimised

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@davidbrochart
Copy link
Author

Thanks a lot @mloskot and @tbonfort for the review!
There are a lot of things I was not sure about, this is just how I managed to get it working locally. I'll take your comments into account.

//xt::xzarr_register_compressor<T, xt::xio_blosc_config>(); // TODO

// Create a corresponding GDALDataset.
ZarrDataset *poDS = new ZarrDataset();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can any of the following zarr_* calls throw? Then this will leak.
Wrapping it with std::unique_ptr and .release() ing it at the return may be a bit safer way.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they can throw. I wrapped the pointer with std::unique_ptr, thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they can throw.

If xtensor-zarr can throw exceptions, they must be caught by any of the methods of the GDAL driver API (virtual methods, and methods like Open()), since GDAL core doesn't expect exceptions to propagate to it.

@davidbrochart davidbrochart force-pushed the zarr_driver branch 4 times, most recently from c6b72c6 to 0774b5b Compare January 26, 2021 16:36
@davidbrochart
Copy link
Author

Not sure why the gdrivers/zarr.py test is skipped.

@mloskot
Copy link
Member

mloskot commented Jan 26, 2021

@davidbrochart

This check says:

gdrivers/zarr.py::test_zarr_open SKIPPED (Zarr driver not found!)        [ 86%]

@davidbrochart
Copy link
Author

@mloskot
Copy link
Member

mloskot commented Jan 26, 2021

That log says

xtensor-zarr found - Zarr support enabled

but there is no trace of building anything in gdal/frmts/zarr/, no compiler command line with -DFRMT_zarr, etc.

I suspect configure.ac is giving false positive (due to a bug) and the driver is not enabled, hence the test is skipped.

@nyalldawson
Copy link
Collaborator

Honest question: if the current trend in gdal is toward minimising the code base and removing drivers, doesn't that include limiting inclusion of totally new drivers like this?

@davidbrochart davidbrochart force-pushed the zarr_driver branch 2 times, most recently from 908226b to 6dbf40f Compare January 27, 2021 10:35
@mloskot
Copy link
Member

mloskot commented Jan 27, 2021

@nyalldawson Feeling it's gonna happen, that is precisely why I long-delayed merge of my JPEG XR (#203) ;-) Seriously though, I think such question belongs to gdal-dev ML

@davidbrochart
Copy link
Author

@mloskot Do you think it's still worth working on this PR then?

@mloskot
Copy link
Member

mloskot commented Jan 27, 2021

@davidbrochart Honest answer: ¯\_(ツ)_/¯. Helpful answer: ask on the mailing list, that's the right place for questions, doubts and discussions.

@davidbrochart davidbrochart force-pushed the zarr_driver branch 4 times, most recently from 33971fe to e9cdde2 Compare January 27, 2021 17:26
@davidbrochart
Copy link
Author

In the GDAL mailing-list Tamas mentioned the possibility of drivers being implemented as plugins. I was not aware of that, could you point me to examples of such drivers?

@szekerest
Copy link
Contributor

In the GDAL mailing-list Tamas mentioned the possibility of drivers being implemented as plugins. I was not aware of that, could you point me to examples of such drivers?

The pdf driver would be a good example I think: https://github.com/OSGeo/gdal/tree/master/gdal/frmts/pdf

@sgillies
Copy link
Contributor

sgillies commented Feb 1, 2021

@davidbrochart @hobu I'd like to see discussion on gdal-dev before this is merged. In my experience zarr is not so much a format as a library for storing and accessing arrays. What's the scope of this work? How many of zarr's compressors and storage schemes will be supported here? What about xtensor-zarr's warning that it is not ready for general use? Questions like these need to be explored yet.

@davidbrochart
Copy link
Author

davidbrochart commented Feb 1, 2021

@davidbrochart @hobu I'd like to see discussion on gdal-dev before this is merged.

@sgillies I agree that more discussion is needed, in any case this PR is not ready to be merged yet. There's still work to be done in the xtensor stack for xtensor-zarr to be fully usable. I started this work as a potential application of xtensor-zarr based on a discussion in the Zarr community.

In my experience zarr is not so much a format as a library for storing and accessing arrays.

While I agree that Zarr is mostly known for the Python library, there exists implementations in other languages (Zarr.js in JavaScript and now xtensor-zarr in C++) and they all implement the Zarr protocol. Version 3 of the specification is being defined here.

How many of zarr's compressors and storage schemes will be supported here?

At the moment, xtensor-zarr only supports (uncompressed) raw binary, GZip and Blosc, but we plan to support more compressors. As for the storage schemes, xtensor-zarr supports the local file system, AWS S3, Google Cloud Storage, plus all the ones supported by GDAL's Virtual File System.

I understand that GDAL will require RFCs for new format drivers, and I would be happy to submit one for Zarr.
@rabernat and @jhamman will surely have valuable inputs here too.

@sgillies
Copy link
Contributor

sgillies commented Feb 1, 2021

Email sent to gdal-dev: https://lists.osgeo.org/pipermail/gdal-dev/2021-February/053384.html.

@rabernat
Copy link

rabernat commented Feb 1, 2021

Just dropping in on this conversation from perspective of a zarr developer.

In my experience zarr is not so much a format as a library for storing and accessing arrays.

Zarr is a specification, with many implementations in different languages. The python implementation is the most complete. The V2 spec is currently under review with OGC on the "community standard" track. The V3 spec is under active development and is open for public comment.

My understanding is that the xtensor-zarr implementation will support both versions of the spec.

@davidbrochart
Copy link
Author

My understanding is that the xtensor-zarr implementation will support both versions of the spec.

That is correct, in addition to supporting Zarr v3, xtensor-zarr supports a part of Zarr v2. Actually this PR specifically targets version 2.

@davidbrochart davidbrochart force-pushed the zarr_driver branch 8 times, most recently from 4c5d573 to 2f397b4 Compare February 2, 2021 08:11
gdal/frmts/zarr/zarrdataset.cpp Outdated Show resolved Hide resolved
gdal/frmts/zarr/zarrdataset.cpp Outdated Show resolved Hide resolved
gdal/frmts/zarr/zarrdataset.cpp Outdated Show resolved Hide resolved
@rouault
Copy link
Member

rouault commented Feb 26, 2021

I've given a try at building the driver. My initial attempt on Linux lead to my computer freezing to death due to lack of available RAM.

After rebooting with more RAM available, here are my observations:

compiler flags max RAM usage zarrdataset.o size stripped size compilation time
gcc 8.2 -std=c++14 -O0 -g 4.2 GB 81 438 200 14 623 880 1m35s
gcc 8.2 -std=c++14 -O2 -g 5.0 GB 95 375 528 3 415 376 2m29s
gcc 9.3 -std=c++17 -O0 -g 3.8 GB 104 340 912 14 996 936 1m20s
gcc 9.3 -std=c++17 -O1 3.3 GB 7 626 920 3 666 360 1m21s
gcc 9.3 -std=c++17 -O2 3.6 GB 7 562 584 3 691 480 1m40s
gcc 9.3 -std=c++17 -O2 -g 4.7 GB 186 499 472 3 691 480 3m19s

So there's a very significant RAM usage required (around 5 GB for a -O2 -g build which is typical of how a package is built on a Linux distro). I'm afraid this could be a blocker on restricted build environments (let's say I want to build GDAL on some small cloud VM with just 2 GB RAM). Could that be significantly reduced ? Given that in a GDAL driver context, we just need to access the values in Zarr chunks without any fancy processing, I'm wondering if drawing xtensor and other dependencies isn't overkill. My initial gut feeling for a Zarr driver was to do it "at hand" without any extra dependencies (apart maybe from a few compression methods) than the one already used by GDAL
The disk usage is less a concern, although a 3.7 MB stripped binary size is quite significant for a single driver, compared to the 22 MB of the whole libgdal.so AMD64 Debian build (https://packages.debian.org/sid/libgdal27) with its > 200 drivers.

@SylvainCorlay
Copy link

Hey @rouault, thanks for checking this out! Yeah it is a bit silly that we are not providing zarray in a precompiled form, which should resolve the RAM consumption issue, and mitigate the final binary size.

I will let @JohanMabille comment on this in greater detail.

@JohanMabille
Copy link

The different features provided by zarray (and more generally by the whole xtensor stack) are well separated, making it easy to precompile only a subset of them. Therefore, we can definitely build a minimalist precompiled version of zarray with the operations you need only. This will avoid instantiating all the templates when building gdal and will limit the amount of RAM required. Especially if you don't need all the arithmetic and math operations (which are the most intensive to compile).

@martindurant
Copy link

The discussion corteva/rioxarray#246 seems pertinent here: if GDAL can handle fsspec (python) files, then zarr-in-gdal would be more useful - when called from python.

@rouault
Copy link
Member

rouault commented May 3, 2021

Hi, I just wanted to mention that I'll have soon to work on a GDAL Zarr driver. However, my current position would be to completely go from scratch, and not rely on xtensor-zarr.

There are several reasons for that:

  • I'm not sure how much the RAM usage compilation issue has been solved by the move to a shared library. Hasn't it moved the issue at the time that shared library is built ?
  • I'm not sure about the binary footprint, but the >3.7 MB I found felt too large. I'd expect a driver written at hand to just take at most a few hundred of kilobytes of binary footprint.
  • If the driver now depends on an external shared library, that means it must be packaged in the distribution systems used by GDAL. The only new external dependency from a driver written from scrach would be the compressors, and among them c-blosc, which I see it is available in Debian
  • As far as I can see, the dtype of a zarr array must be registered at build time in https://github.com/xtensor-stack/xtensor-zarr/blob/master/include/xtensor-zarr/xzarr_chunked_array.hpp#L76 . I don't see how this could handle custom compound types that must be parsed at runtime. The GDAL multidimensional API that I plan to implement for this driver can use compound types (similarly to netCDF 4 / HDF5)
  • the value of using the xtensor stack would probably be better as a frontend to GDAL, that is exposing any GDAL multidimensional array or any 2D traditionnal raster dataset as a zarray, not just Zarr. The value of zarray burried in the implementation of a driver appears to be very low, as a GDAL driver only exposes the raster/array values, and doesn't do any computation by itself.

@davidbrochart
Copy link
Author

Hi Even, all the concerns you raised are fair. We need to make a better job at making xtensor-zarr more lightweight when no computational capability is needed, which is the use-case of this driver. We have some ideas about reducing compilation time and resources. About custom data types, it is true that it is not possible today, although definitely feasible.
Thanks for letting me know about your choice of going from scratch, I'll close this PR.

@rouault rouault mentioned this pull request May 28, 2021
27 tasks
@rouault
Copy link
Member

rouault commented May 28, 2021

Another approach at creating a Zarr driver in #3896

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet