Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Index.validate_dataarray_coord #10137

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

benbovy
Copy link
Member

@benbovy benbovy commented Mar 17, 2025

  • Tests added
  • User visible changes (including notable bug fixes) are documented in whats-new.rst
  • New functions/methods are listed in api.rst

Same goal than #10116 using the alternative approach suggested in #10116 (comment) where the propagation (validation) of coordinates in a DataArray is delegated to their index (if any).

I find this approach cleaner than #10116. Here is the same notebook example adapted for this PR.

Index API alternatives

  1. Index.validate_dataarray_coord(self, name: Hashable, var: Variable, dims: set[Hashable]) -> None

    • check one index coordinate at a time, thus called multiple times for a multi-coordinate index
  2. Index.validate_dataarray_coords(self, variables: dict[Hashable, Variable], dims: set[Hashable]) -> None

    • check all coordinates of an index at once
    • perhaps could make things easier for developers of custom Xarray indexes (perhaps not?), but it would make Xarray internals more complicated with possible significant impact on performance (we'd have to rely on Indexes.group_by_index while in option 1 we simply need to iterate over _variables, _coords and/or _indexes).
  3. Index.validate_dataarray_coords(self, variables: dict[Hashable, Variable], dims: set[Hashable]) -> Coordinates

    • Same than option 2 but allow returning a new set of coordinate variables and their indexes.
    • This has much more flexibility (e.g., convert a custom 2D spatial index to a 1D default PandasIndex when the result has only one of the two spatial dimensions), but I find it also complicated and prone to confusion.

Option 1, the simplest and working well with IntervalIndex, is currently implemented in this PR.

cc @shoyer @dcherian

@benbovy benbovy marked this pull request as ready for review March 17, 2025 12:19
Copy link
Member

@shoyer shoyer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some quick thoughts

k, self._variables[k], needed_dims
)
coords[k] = self._variables[k]
except ValueError:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ValueError can be raised by many different things going wrong in Python. If we're going to check specifically for an error, we should only catch a specific Exception subclass.

Copy link
Member Author

@benbovy benbovy Mar 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added CoordinateValidationError in 5da014e.

For now it is defined in xarray/core/indexes.py to avoid a cyclic import. That's a bit messy but we could maybe move the definition of all Xarray custom error types in a dedicated xarray.exception module (in a later PR?). EDIT: changed my mind in 426ddce.

benbovy added 4 commits March 18, 2025 12:09
Functions with a leading underscore are marked by pyright as unused if
they are not used from within the module in which they are defined.

Also remove unneeded nested import.
Move check_dataarray_coords in xarray.core.coordinates module and rename
it to validate_dataarray_coords (name consistent with
Index.validate_dataarray_coord).

Move CoordinateValidationError from xarray.core.indexes to
xarray.core.coordinates module.
@benbovy
Copy link
Member Author

benbovy commented Mar 18, 2025

Thanks @shoyer for taking a look!

This is now ready for (another round of) review.

Compared to #10116 this PR still represents a major data model change for DataArray (in the sense that it allows overriding strict enforcement of the DataArray model in certain cases), but I think that the risk is mitigated since it is here done explicitly via Index.validate_dataarray_coord and the default behavior is to keep conforming to the DataArray model. Limiting an index to (in)validate its coordinates may also be good to keep things simple and predictable for end-users.

@dcherian regarding your #10116 (comment), in the example notebook the DataArray reprs shown without any list of dimensions for the Coordinates section do not strike me much. That said, the Coordinates repr is still basic and might benefit from such addition (it doesn't have any html repr BTW).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants