
@d-v-b
Contributor

@d-v-b d-v-b commented Jul 29, 2025

initial working pydantic models for geozarr

@d-v-b
Contributor Author

d-v-b commented Aug 18, 2025

@emmanuelmathot this is ready for review. The models right now are pretty narrow -- I define a DataArray class for Zarr v2 and Zarr v3. The DataArray class models a Zarr array with explicit dimension names, which for zarr v2 are expressed via the _ARRAY_DIMENSIONS key in attributes, or via the dimension_names array attribute in zarr v3.
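
As a rough illustration of the two conventions mentioned above, the sketch below reads dimension names from plain metadata dictionaries. The helper name `get_dimension_names` and the dictionary layout (user attributes merged under an `"attributes"` key) are assumptions for illustration, not code from this PR.

```python
from typing import Any, Mapping


def get_dimension_names(metadata: Mapping[str, Any]) -> tuple[str, ...]:
    """Return the dimension names declared by a Zarr array.

    Hypothetical helper: `metadata` is assumed to be the parsed array
    metadata with user attributes stored under the "attributes" key.
    """
    if metadata.get("zarr_format") == 3:
        # Zarr v3: dimension names are a field of the array metadata itself.
        names = metadata.get("dimension_names")
    else:
        # Zarr v2: the names live in the user attributes under _ARRAY_DIMENSIONS.
        names = metadata.get("attributes", {}).get("_ARRAY_DIMENSIONS")
    if names is None:
        raise ValueError("array does not declare dimension names")
    return tuple(names)


v2_array = {"zarr_format": 2, "shape": [4, 8], "attributes": {"_ARRAY_DIMENSIONS": ["y", "x"]}}
v3_array = {"zarr_format": 3, "shape": [4, 8], "dimension_names": ["y", "x"], "attributes": {}}
```

Both example dictionaries describe the same logical array; only the location of the dimension names differs.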

A Dataset class models a Zarr group that contains DataArrays. The Dataset ensures that all the DataArrays it contains have coordinates that are valid (meaning, for every named dimension defined for an array, there is a separate Zarr array with that name, and with the correct shape).
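
The coordinate check described above could look roughly like the following. This is a simplified sketch over plain dataclasses, not the pydantic-zarr models in this PR; `ArrayInfo` and `check_dimension_coordinates` are hypothetical names, and "correct shape" is assumed here to mean a 1-D coordinate array whose length equals the size of the matching dimension.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrayInfo:
    """Minimal stand-in for an array with named dimensions (hypothetical)."""

    shape: tuple[int, ...]
    dims: tuple[str, ...]


def check_dimension_coordinates(arrays: dict[str, ArrayInfo]) -> list[str]:
    """Return a list of problems; an empty list means the group is consistent."""
    problems = []
    for name, array in arrays.items():
        for dim, size in zip(array.dims, array.shape):
            coord = arrays.get(dim)
            if coord is None:
                problems.append(f"{name}: no coordinate array named {dim!r}")
            elif coord.shape != (size,):
                problems.append(
                    f"{name}: coordinate {dim!r} has shape {coord.shape}, expected {(size,)}"
                )
    return problems


good = {
    "x": ArrayInfo(shape=(8,), dims=("x",)),
    "y": ArrayInfo(shape=(4,), dims=("y",)),
    "b02": ArrayInfo(shape=(4, 8), dims=("y", "x")),
}
# Adding an array with an undeclared "band" dimension makes the group invalid.
bad = {**good, "tci": ArrayInfo(shape=(3, 4, 8), dims=("band", "y", "x"))}
```

Collecting problems in a list instead of raising on the first one is a design choice for the sketch; it makes it easier to report every inconsistent array at once.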

The Dataset class is also where the multiscales metadata is defined, although getting this to match the geozarr spec is kind of tricky, since that part of the spec isn't fine-grained enough for completely automated validation. I do have classes for one specific type of multiscale declaration that we are currently using in this repo.

If this looks good, maybe we merge this and then I can start working on integrating these classes with the rest of the codebase?

@d-v-b d-v-b requested a review from emmanuelmathot August 18, 2025 11:23
@d-v-b d-v-b marked this pull request as ready for review August 18, 2025 11:37
Contributor

@emmanuelmathot emmanuelmathot left a comment

Thank you for the initial pydantic model @d-v-b! I will probably have more questions, as it is not fully clear to me how this works. Let's have a review meeting.

)


CFStandardName = Annotated[str, AfterValidator(check_standard_name)]
Contributor

I think we are missing verification of the grid_mapping attribute, which defaults to a spatial_ref scalar carrying the EPSG code. https://zarr.dev/geozarr-spec/documents/standard/template/geozarr-spec.html#_e15d59bd-f2ec-28e8-8016-4e541c95e10f

Contributor Author

I can add these, and if they are required we should make that clearer in the spec. Right now the spec says:

CF Conventions – Including attributes such as standard_name, units, axis, and grid_mapping to express spatiotemporal semantics and coordinate system properties.

But it isn't clear which CF attributes are required, optional, etc.

CF_STANDARD_NAMES = get_cf_standard_names(url=CF_STANDARD_NAME_URL)


def check_standard_name(name: str) -> str:
Contributor

Please bear with my limited knowledge of pydantic, but how is the link made with the actual standard_name field name?

Contributor Author

Pydantic does most of its validation routines based on type annotations. When we annotate an attribute on a pydantic model with this type: https://github.com/d-v-b/data-model/blob/3d11af412e460993f8e603dcff0555c5342c4e8f/src/eopf_geozarr/data_api/geozarr/common.py#L70, then pydantic will run the check_standard_name function after checking that the input is a string.
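
To make that mechanism concrete, here is a minimal sketch assuming pydantic v2. The set `KNOWN_STANDARD_NAMES` and the model `ExampleAttrs` are illustrative stand-ins, not the code in this PR (which builds its name list from a URL); the point is only that the validator travels with the type annotation, whatever the field is called.

```python
from typing import Annotated

from pydantic import AfterValidator, BaseModel, ValidationError

# Stand-in for the table of CF standard names (illustrative values only).
KNOWN_STANDARD_NAMES = {"air_temperature", "sea_surface_temperature"}


def check_standard_name(name: str) -> str:
    """Reject strings that are not in the illustrative set of known names."""
    if name not in KNOWN_STANDARD_NAMES:
        raise ValueError(f"{name!r} is not a known CF standard name")
    return name


# Any field annotated with this type gets the extra check after `str` validation.
CFStandardName = Annotated[str, AfterValidator(check_standard_name)]


class ExampleAttrs(BaseModel):
    """Hypothetical attributes model; the validator is attached via the annotation."""

    standard_name: CFStandardName


ok = ExampleAttrs(standard_name="air_temperature")

try:
    ExampleAttrs(standard_name="not_a_real_name")
    rejected = False
except ValidationError:
    # pydantic wraps the ValueError raised by the validator.
    rejected = True
```

The link to the `standard_name` field is therefore made in the model definition: any attribute annotated as `CFStandardName` is checked, and renaming the field would not change the validation.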

return model


class Dataset(GroupSpec[DatasetAttrs, GroupSpec[Any, Any] | DataArray]):
Contributor

Shouldn't we call this GeoZarrDataset?
Then, how does pydantic-zarr discriminate the "normal" groups from a geoZarr Dataset group?

Contributor Author

@d-v-b d-v-b Aug 19, 2025

  • we could definitely call it a GeoZarrDataset

Then, how does pydantic-zarr discriminate the "normal" groups from a geoZarr Dataset group?

See https://docs.pydantic.dev/latest/concepts/unions/. Basically there are three options for resolving a union: left-to-right (order-based), smart (best-match-based), and discriminated (discriminator-based).
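
As an illustration of the discriminator-based option, here is a small sketch assuming pydantic v2. The `kind` field and both model names are invented for the example; the GeoZarr spec does not define such a field, which is exactly the gap discussed in this thread.

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter


class PlainGroupAttrs(BaseModel):
    """Hypothetical attributes for an ordinary group."""

    kind: Literal["group"]


class DatasetLikeAttrs(BaseModel):
    """Hypothetical attributes for a dataset-like group."""

    kind: Literal["dataset"]
    title: str


# Discriminated union: pydantic reads `kind` and validates against exactly one model.
AnyAttrs = Annotated[
    Union[PlainGroupAttrs, DatasetLikeAttrs], Field(discriminator="kind")
]
adapter = TypeAdapter(AnyAttrs)

parsed_dataset = adapter.validate_python({"kind": "dataset", "title": "demo"})
parsed_group = adapter.validate_python({"kind": "group"})
```

Without an explicit tag like this, pydantic has to fall back on trying the union members, so two models that accept the same input cannot be told apart reliably.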

Ideally a geozarr dataset group would have some specific requirements or traits that distinguish it from a regular zarr group, or a zarr group that happens to have cf metadata, but I don't think the spec as written expresses these requirements, so this might require some work

@emmanuelmathot
Contributor

@d-v-b What is the status of this PR?

@d-v-b
Contributor Author

d-v-b commented Sep 2, 2025

still in progress. I was previously blocked on the semantics of the grid_mapping, but that's been sorted out in the short term. This should be finished by the end of the week.

@d-v-b
Contributor Author

d-v-b commented Sep 5, 2025

@emmanuelmathot I am getting an interesting test failure that would be helpful for you to check. Within a dataset, for each data variable that declares coordinates (a, b, c) via {"attributes": {"coordinates": "a b c"}}, a, b, and c must all be the names of coordinate variable arrays in the dataset. But in my test data (which I generated via the conversion script and is included as a test fixture in this PR), I'm finding some data variables (such as "quality/l1c_quicklook/r10m/5/tci") that declare a coordinate "band" that's not present as a coordinate variable. Is this consistent with the data model?
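
The rule behind this failing test can be sketched as follows. This is a simplified stand-in that works on a plain mapping of array names to attributes, under the assumption that coordinates are looked up among the arrays of the same group; `missing_coordinates` is a hypothetical name, not the validator in this PR.

```python
from typing import Any, Mapping


def missing_coordinates(
    variables: Mapping[str, Mapping[str, Any]],
) -> dict[str, list[str]]:
    """Map each variable name to the declared coordinates that are absent.

    `variables` maps array names to their attributes (an assumed, simplified layout).
    """
    missing = {}
    for name, attrs in variables.items():
        # The "coordinates" attribute is a space-separated list of names.
        declared = str(attrs.get("coordinates", "")).split()
        absent = [coord for coord in declared if coord not in variables]
        if absent:
            missing[name] = absent
    return missing


# Mirrors the reported case: "band" is declared but no such array exists.
group = {
    "x": {},
    "y": {},
    "tci": {"coordinates": "band x y"},
}
```

For the example group, the function reports `band` as the only missing coordinate of `tci`.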

@emmanuelmathot
Contributor

Probably bugs from the EOPF CPM, as we simply copy this dataset. I will take a look and report any that I find. I'll track that here.

@emmanuelmathot
Contributor

@d-v-b As a more general question: is this {"attributes": {"coordinates": "a b c"}} declaration a convention or a specification of any sort? If not, why is it there? Shouldn't this be declared in a proper way, in addition to fixing the coordinate existence check?

@emmanuelmathot
Contributor

@d-v-b
Contributor Author

d-v-b commented Sep 17, 2025

I added a new types module that just contains types to replace the uses of Dict[str, Any] that occur throughout geozarr.py, and updated the type annotations accordingly. This might be out of scope, in which case I'm happy to spin this out into a new PR.
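
For readers unfamiliar with this kind of change, the sketch below shows one common way to replace a loose `Dict[str, Any]` with a narrower type, here using `TypedDict`. Whether the new module uses `TypedDict` is an assumption, and `ArrayEncoding` with its keys is a made-up example rather than a type from this PR.

```python
from typing import Any, Dict, TypedDict


class ArrayEncoding(TypedDict):
    """Hypothetical typed replacement for a loose dictionary."""

    chunks: tuple[int, ...]
    dtype: str


def describe_loose(encoding: Dict[str, Any]) -> str:
    # Nothing tells a type checker which keys exist or what they hold.
    return f"{encoding['dtype']} in chunks of {encoding['chunks']}"


def describe_typed(encoding: ArrayEncoding) -> str:
    # A type checker can now flag misspelled keys and wrong value types.
    return f"{encoding['dtype']} in chunks of {encoding['chunks']}"


encoding: ArrayEncoding = {"chunks": (256, 256), "dtype": "uint16"}
```

The runtime behaviour of the two functions is identical; the benefit is entirely in static checking and in documenting the expected keys.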

@emmanuelmathot emmanuelmathot self-requested a review September 25, 2025 12:55
@emmanuelmathot emmanuelmathot merged commit d752349 into EOPF-Explorer:main Sep 25, 2025
5 checks passed