Zarr driver #3896

Merged
merged 67 commits into from
Jul 19, 2021


rouault
Member

@rouault rouault commented May 28, 2021

An alternative approach to creating a Zarr driver, different from the one in PR #3411 (see the analysis in #3411 (comment))

Scope is:

  • read/creation/update of Zarr datasets
  • using nominally the multidimensional API
  • and also exposed with the classic 2D API (slicing needed for > 2D datasets)
  • support hierarchical organization in groups
  • support attributes, including a dedicated attribute to encode the CRS, and the XArray _ARRAY_DIMENSIONS attribute to encode the netCDF-like dimension concept
  • support consolidated metadata
  • support reading/writing datasets on cloud storage, .zip (but that comes for free using the VSI I/O layer)
  • support Zarr V2 and in-progress V3 specifications
  • support compression methods: zlib, gzip, zstd, lzma, lz4 (new optional build dependency), blosc (new optional build dependency), with an API for registering additional compressors
  • support most data types: bool, [u]int[8/16/32/64], float[32/64], complexfloat[32/64], strings, compound data types (limited to what GDALExtendedDataType supports)
  • limited support for filters: only the delta filter is built in, with an API for registering additional filters
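To make the on-disk layout behind these bullets concrete, here is a minimal sketch, using only the Python standard library, of a Zarr V2 group holding one array with zlib-compressed chunks and an _ARRAY_DIMENSIONS attribute. It is not how the GDAL driver is implemented. The helper names (write_zarr_v2_array, read_chunk, demo_layout) and the array name "temperature" are hypothetical, and the sketch assumes a 2D little-endian int32 array whose shape is an exact multiple of the chunk shape.

```python
import json
import struct
import tempfile
import zlib
from pathlib import Path


def write_zarr_v2_array(root, name, rows, chunk_shape, dim_names):
    """Write a 2D int32 array as a Zarr V2 array (hypothetical helper).

    Assumes the array shape is an exact multiple of the chunk shape.
    """
    n_rows, n_cols = len(rows), len(rows[0])
    c_rows, c_cols = chunk_shape
    root = Path(root)
    array_dir = root / name
    array_dir.mkdir(parents=True, exist_ok=True)

    # A directory containing a .zgroup file is a Zarr V2 group.
    (root / ".zgroup").write_text(json.dumps({"zarr_format": 2}))

    # .zarray holds the array metadata defined by the Zarr V2 specification.
    metadata = {
        "zarr_format": 2,
        "shape": [n_rows, n_cols],
        "chunks": [c_rows, c_cols],
        "dtype": "<i4",
        "compressor": {"id": "zlib", "level": 1},
        "fill_value": 0,
        "filters": None,
        "order": "C",
    }
    (array_dir / ".zarray").write_text(json.dumps(metadata))

    # .zattrs holds attributes; XArray stores the dimension names here.
    attrs = {"_ARRAY_DIMENSIONS": list(dim_names)}
    (array_dir / ".zattrs").write_text(json.dumps(attrs))

    # One file per chunk, named by the chunk indices joined with ".".
    for ci in range(n_rows // c_rows):
        for cj in range(n_cols // c_cols):
            values = [
                rows[ci * c_rows + i][cj * c_cols + j]
                for i in range(c_rows)
                for j in range(c_cols)
            ]
            raw = struct.pack("<%di" % len(values), *values)
            (array_dir / f"{ci}.{cj}").write_bytes(zlib.compress(raw, 1))


def read_chunk(root, name, ci, cj):
    """Read one chunk back as a list of rows (hypothetical helper)."""
    array_dir = Path(root) / name
    metadata = json.loads((array_dir / ".zarray").read_text())
    c_rows, c_cols = metadata["chunks"]
    raw = zlib.decompress((array_dir / f"{ci}.{cj}").read_bytes())
    flat = struct.unpack("<%di" % (c_rows * c_cols), raw)
    return [list(flat[i * c_cols:(i + 1) * c_cols]) for i in range(c_rows)]


def demo_layout():
    """Write a 2x4 array in 2x2 chunks and read back the second chunk."""
    with tempfile.TemporaryDirectory() as tmp:
        rows = [[1, 2, 3, 4], [5, 6, 7, 8]]
        write_zarr_v2_array(tmp, "temperature", rows, (2, 2), ["y", "x"])
        return read_chunk(tmp, "temperature", 0, 1)
```

Because everything is plain files and JSON, the same layout works unchanged on any storage that can serve files by name, which is why cloud storage and .zip access come essentially for free through an I/O abstraction layer.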

Tasklist

  • Read support of arrays with multidimensional API

  • Read support for compression methods, using an extendable API

  • Read support for Zarr V2 arrays

  • Read support for compound data types

  • Read support for non-GDAL-native types (such as int1, int8, uint8, half-float)

  • Read support for Fortran block ordering

  • Read support for CRS

  • Read support for groups

  • Read support for consolidated metadata

  • Read support for XArray _ARRAY_DIMENSIONS

  • Read support for Zarr V3

  • Read support for filters (limited to delta as builtin)

  • Read support for group attributes

  • Read support for array attributes

  • Read support with classic 2D API

  • Creation support of ZarrV2 arrays with multidimensional API

  • Write support for compression methods

  • Write support for filters (limited to delta as builtin)

  • Write support for group creation

  • Write support for group attributes

  • Write support for array attributes

  • Write support for CRS

  • Write support for consolidated metadata

  • Write support for XArray _ARRAY_DIMENSIONS

  • Write support for Zarr V3

  • Write support with classic 2D API

  • Documentation page
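As an illustration of the "Fortran block ordering" item in the task list, the following sketch shows the index arithmetic needed to interpret a flat 2D chunk stored in row-major ("C") or column-major ("F") order, which is what the "order" key of the Zarr V2 array metadata selects. The function names are hypothetical and the sketch is limited to 2D chunks; it is not taken from the driver.

```python
def c_order_chunk_to_rows(flat, chunk_shape):
    """Interpret a flat chunk stored in C (row-major) order as rows."""
    n_rows, n_cols = chunk_shape
    # In row-major order, element (i, j) is stored at index i * n_cols + j.
    return [[flat[i * n_cols + j] for j in range(n_cols)] for i in range(n_rows)]


def f_order_chunk_to_rows(flat, chunk_shape):
    """Interpret a flat chunk stored in Fortran (column-major) order as rows."""
    n_rows, n_cols = chunk_shape
    # In column-major order, element (i, j) is stored at index i + j * n_rows.
    return [[flat[i + j * n_rows] for j in range(n_cols)] for i in range(n_rows)]
```

For the logical chunk [[1, 2, 3], [4, 5, 6]], the C-order buffer is 1, 2, 3, 4, 5, 6 and the F-order buffer is 1, 4, 2, 5, 3, 6; both functions recover the same rows from their respective buffers.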

@rouault rouault added this to the 3.4.0 milestone May 28, 2021
@rouault rouault marked this pull request as draft May 28, 2021 13:57
@rouault rouault mentioned this pull request May 28, 2021
6 tasks
@rouault rouault force-pushed the zarr branch 2 times, most recently from 62a8e59 to 9f325dd Compare June 18, 2021 12:55
@rouault rouault changed the title [WIP] Zarr driver Zarr driver Jul 1, 2021
@rouault rouault marked this pull request as ready for review July 1, 2021 16:22
@rouault
Member Author

rouault commented Jul 1, 2021

This is now feature ready.

Zarr driver RFC

Who asked for the driver and why?

Zarr is an emerging, cloud-friendly format with native support for multidimensional rasters. A number of parties have expressed a desire to see it handled in GDAL.
The Zarr V2 format has just been submitted as a potential OGC community standard: https://www.ogc.org/pressroom/pressreleases/3275

This work takes a different approach than the one proposed in #3411 for the reasons mentioned in #3411 (comment)

It should also be noted that libnetcdf 4.8.0 has initial support for ZarrV2 but, based on our testing, it lacks maturity. Having a dedicated driver in GDAL gives more control, particularly over the I/O layer, where any VSI virtual file system can be used, and makes it easy to add features such as CRS support, as we have done.

Read / Write / ReadWrite ?

Read, creation and update

Raster / Vector / Raster&Vector?

Raster, implementing classic 2D and multidimensional API

Driver name info

  • Short name : Zarr

  • Long name : Zarr

  • Description: Zarr

Should this be compiled in by default, optional build, or plugin?

Built-in by default; can be disabled in autoconf builds using the standard mechanism (--disable-driver-zarr)

Does this driver require other drivers be built?

No, but building GDAL against liblz4, liblzma, libzstd and libblosc is strongly recommended so that all compression methods are available

Is there a standard(s) that the driver is working to implement?

There's a specification for ZarrV2 at https://zarr.readthedocs.io/en/stable/spec/v2.html, which is the one mentioned above as being submitted to OGC.
The driver also implements the in-progress Zarr V3 spec: https://zarr-specs.readthedocs.io/en/core-protocol-v3.0-dev/protocol/core/v3.0.html

How feature complete will the driver be?

It implements most of the Zarr V2 specification, except:

  • datetime64 (“M”) and timedelta64 (“m”) data types aren't supported (we could likely implement them as double if needed)
  • compound data types that use members of type array, since our multidimensional array abstraction doesn't support that
  • a few compression methods of NumCodecs aren't supported currently: ZFPY, BZ2. It should also be noted that this PR offers a C API to register compression methods that aren't implemented by GDAL, which provides extensibility for other methods if needed.
  • only the Delta filter is supported. As with compressors, this PR offers a C API to register filter methods that aren't implemented by GDAL, which provides extensibility for other methods if needed.
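For readers unfamiliar with the Delta filter mentioned above, here is a minimal sketch of the idea, assuming it behaves as in NumCodecs: the first value is stored as-is and every following value is stored as the difference from its predecessor. The function names are hypothetical, and the sketch works on Python integers, so it ignores the wrap-around that fixed-width data types would introduce.

```python
from itertools import accumulate


def delta_encode(values):
    """Keep the first value, then store successive differences."""
    values = list(values)
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(encoded):
    """Undo delta_encode with a running (cumulative) sum."""
    return list(accumulate(encoded))
```

For slowly varying data, the encoded values are small and repetitive, which usually helps the compressor applied afterwards. On reading, the chunk is decompressed first and the filter is undone last.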

Driver capabilities.

Supports: Raster
Supports: Multidimensional raster
Supports: Subdatasets
Supports: Open() - Open existing dataset.
Supports: Create() - Create writable dataset.
Supports: CreateMultiDimensional() - Create multidimensional dataset.
Supports: Virtual IO - eg. /vsimem/
Creation Datatypes: Byte Int16 UInt16 Int32 UInt32 Float32 Float64 CFloat32 CFloat64

Support for the following features :

  • DMD_MIMETYPE:

None

  • DMD_EXTENSIONS:

None

  • DMD_CONNECTION_PREFIX

None

  • GDAL_DMD_OPENOPTIONLIST
<OpenOptionList>
  <Option name="USE_ZMETADATA" type="boolean" description="Whether to use consolidated metadata from .zmetadata" default="YES" />
</OpenOptionList>
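The USE_ZMETADATA open option controls whether the consolidated metadata file is used. The sketch below illustrates the underlying idea in plain Python, not the driver's code: when a .zmetadata file is present, all array metadata can be read from that single JSON document; otherwise each array's .zarray file has to be discovered and read individually. The function names and the array name "band1" are hypothetical, and the sketch only considers arrays stored below the root group.

```python
import json
import tempfile
from pathlib import Path


def list_array_metadata(root, use_zmetadata=True):
    """Return {array path: .zarray content} for arrays below a Zarr V2 root."""
    root = Path(root)
    consolidated = root / ".zmetadata"
    if use_zmetadata and consolidated.is_file():
        # One read: keys of the "metadata" object look like "band1/.zarray".
        entries = json.loads(consolidated.read_text())["metadata"]
        suffix = "/.zarray"
        return {
            key[: -len(suffix)]: value
            for key, value in entries.items()
            if key.endswith(suffix)
        }
    # Fallback: one directory walk plus one read per array.
    result = {}
    for path in sorted(root.rglob(".zarray")):
        if path.parent != root:
            name = path.parent.relative_to(root).as_posix()
            result[name] = json.loads(path.read_text())
    return result


def demo_consolidated():
    """Build a tiny dataset and list its arrays with and without .zmetadata."""
    array_meta = {
        "zarr_format": 2, "shape": [4], "chunks": [2], "dtype": "<i4",
        "compressor": None, "fill_value": 0, "filters": None, "order": "C",
    }
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "band1").mkdir()
        (root / ".zgroup").write_text(json.dumps({"zarr_format": 2}))
        (root / "band1" / ".zarray").write_text(json.dumps(array_meta))
        (root / ".zmetadata").write_text(json.dumps({
            "zarr_consolidated_format": 1,
            "metadata": {
                ".zgroup": {"zarr_format": 2},
                "band1/.zarray": array_meta,
            },
        }))
        with_consolidated = list_array_metadata(root, use_zmetadata=True)
        without_consolidated = list_array_metadata(root, use_zmetadata=False)
        return with_consolidated, without_consolidated
```

On object storage, where every file access is a network request, reading one consolidated file instead of many small ones is the main benefit. Disabling the option is mostly useful when the consolidated file may be out of date with respect to the individual metadata files.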

  • GDAL_DMD_SUBDATASETS: Yes

  • GDAL_DCAP_OPEN : Yes

  • DCAP_CREATE : Yes

  • GDAL_DCAP_VIRTUALIO: Yes

What external libraries does this driver depend on?

  • Zlib (build requirement of GDAL)
  • libdeflate, optionally (for faster Zlib compression/decompression)
  • liblzma, optionally
  • liblz4, optionally
  • libblosc-c, optionally. Strongly recommended, however, as this is the default compression method used by the Python Zarr implementation when creating a Zarr array
  • libzstd, optionally

Are there version requirements for the libraries?

The versions provided by Ubuntu 20.04 are sufficient to get the driver building and working.

What licenses do the libraries have?

Does the driver require a binary SDK?

No

Any external services required for the driver?

No

Any external service providers required for the driver?

No

Testing plan

This driver has an autotest script in the autotest/gdrivers folder

Maintenance plan?

Mostly keeping an eye on how Zarr V3 evolves. Its current status is indicated as experimental in the documentation

Who / what org is responsible for upkeep on this new driver?

Even Rouault

Where to report bugs?

GDAL GitHub issue tracker

Draft help text

see gdal/doc/source/drivers/raster/zarr.rst in PR for details

What are the conditions for removing the driver?

If nobody is interested any longer in keeping it in a buildable and working state

Does the driver already have a draft implementation?

Yes

Who implemented it? Or who will implement this driver? Does an implementer(s) need to be found?

Even Rouault

Is there funding needed?

No, funding of the work has already been provided

Who will review the code / PR?

TBD

@rouault
Member Author

rouault commented Jul 5, 2021

Is anyone interested in reviewing this? I know it is a bit overwhelming. I could offer some funding if that would help and is practical to set up.

@tbonfort
Member

tbonfort commented Jul 5, 2021

> Is anyone interested in reviewing this? I know it is a bit overwhelming. I could offer some funding if that would help and is practical to set up.

cc @thomascoquet ?

@sgillies
Contributor

sgillies commented Jul 5, 2021

@rouault I've been using zarr (the python package) at work and have some data for testing. I'll try it out later today. I won't be able to provide a serious PR review after all.

@rouault
Member Author

rouault commented Jul 13, 2021

I'll merge this PR next week if there is no objection or further comments

@rouault rouault merged commit e85de9a into OSGeo:master Jul 19, 2021
Comment on lines +1697 to +1726
AC_ARG_WITH(blosc,[ --with-blosc[=ARG] Include blosc support (ARG=yes/no/installation_prefix)],,)

if test "$with_blosc" = "" -o "$with_blosc" = "yes" ; then
  AC_CHECK_LIB(blosc,blosc_cbuffer_validate,HAVE_BLOSC=yes,HAVE_BLOSC=no,)

  if test "$HAVE_BLOSC" = "yes" ; then
    LIBS="-lblosc $LIBS"
  else
    if test "$with_blosc" = "yes" ; then
      AC_MSG_ERROR([libblosc not found])
    else
      echo "libblosc not found - BLOSC support disabled"
    fi
  fi
elif test "$with_blosc" != "" -a "$with_blosc" != "no"; then

  AC_CHECK_LIB(blosc,blosc_cbuffer_validate,HAVE_BLOSC=yes,HAVE_BLOSC=no,-L$with_blosc/lib)

  if test "$HAVE_BLOSC" = "yes" -a -f "$with_blosc/include/blosc.h" ; then
    LIBS="-L$with_blosc/lib -lblosc $LIBS"
    EXTRA_INCLUDES="-I$with_blosc/include $EXTRA_INCLUDES"
  else
    AC_MSG_ERROR([libblosc not found])
  fi

else
  HAVE_BLOSC=no
fi

AC_SUBST(HAVE_BLOSC,$HAVE_BLOSC)
Contributor

Too fragile, c-blosc may depend on lz4, snappy, zlib and zstd. A naive AC_CHECK_LIB may break due to missing dependencies in LIBS. c-blosc provides a pkg-config file: blosc.pc

Member Author

> Too fragile, c-blosc may depend on lz4, snappy, zlib and zstd. A naive AC_CHECK_LIB may break due to missing dependencies in LIBS. c-blosc provides a pkg-config file: blosc.pc

Too fragile for static builds of c-blosc? It should hopefully work fine for dynamic builds of c-blosc. Anyway, autoconf builds will soon be deprecated, so any improvements, if needed, should go into the CMake builds.

Contributor

> too fragile for static builds of c-blosc ?

Yes (in conan-center, all libs are built against static dependencies by default, so we quickly see these kinds of issues).

> Anyway autoconf builds will be soon deprecated, so any improvements if any needed should go to the CMake builds

Great! I'll check that while packaging the first release using CMakeLists.

Member Author

> Great ! I'll check that while packaging the first release using CMakeLists.

Early testing of CMake builds of master would be appreciated.


4 participants