weights are always realized by the iris.analysis module #5338

Closed
bouweandela opened this issue Jun 6, 2023 · 2 comments · Fixed by #5341

Comments

@bouweandela
Member

🐛 Bug Report

How To Reproduce

Steps to reproduce the behaviour:

Recent versions of iris realize the weights arrays. It looks like the issue was introduced in #5084, so iris versions since 3.5 are affected.

Example:

Use cube.collapsed(aggregator=iris.analysis.MEAN, weights=weights, coords=['latitude', 'longitude']) where cube is an iris.cube.Cube and weights is a dask.array.Array. This will issue a warning like

/home/bandela/mambaforge/envs/test-iris-3.6/lib/python3.11/site-packages/distributed/client.py:3109: UserWarning: Sending large graph of size 386.72 MiB.
This may cause some slowdown.
Consider scattering data ahead of time and using futures.
  warnings.warn(

because the weights are realized in this code:

class _Weights(np.ndarray):
    """Class for handling weights for weighted aggregation.

    This subclasses :class:`numpy.ndarray`; thus, all methods and properties of
    :class:`numpy.ndarray` (e.g., `shape`, `ndim`, `view()`, etc.) are
    available.

    Details on subclassing :class:`numpy.ndarray` are given here:
    https://numpy.org/doc/stable/user/basics.subclassing.html

    """

    def __new__(cls, weights, cube, units=None):
        """Create class instance.

        Args:

        * weights (Cube, string, _DimensionalMetadata, array-like):
            If given as a :class:`iris.cube.Cube`, use its data and units. If
            given as a :obj:`str` or :class:`iris.coords._DimensionalMetadata`,
            assume this is (the name of) a
            :class:`iris.coords._DimensionalMetadata` object of the cube (i.e.,
            one of :meth:`iris.cube.Cube.coords`,
            :meth:`iris.cube.Cube.cell_measures`, or
            :meth:`iris.cube.Cube.ancillary_variables`). If given as an
            array-like object, use this directly and assume units of `1`. If
            `units` is given, ignore all units derived above and use the ones
            given by `units`.
        * cube (Cube):
            Input cube for aggregation. If weights is given as :obj:`str` or
            :class:`iris.coords._DimensionalMetadata`, try to extract the
            :class:`iris.coords._DimensionalMetadata` object and corresponding
            dimensional mappings from this cube. Otherwise, this argument is
            ignored.
        * units (string, Unit):
            If ``None``, use units derived from `weights`. Otherwise, overwrite
            the units derived from `weights` and use `units`.

        """
        # `weights` is a cube
        # Note: to avoid circular imports of Cube we use duck typing using the
        # "hasattr" syntax here
        # --> Extract data and units from cube
        if hasattr(weights, "add_aux_coord"):
            obj = np.asarray(weights.data).view(cls)
            obj.units = weights.units

        # `weights`` is a string or _DimensionalMetadata object
        # --> Extract _DimensionalMetadata object from cube, broadcast it to
        # correct shape using the corresponding dimensional mapping, and use
        # its data and units
        elif isinstance(weights, (str, _DimensionalMetadata)):
            dim_metadata = cube._dimensional_metadata(weights)
            arr = dim_metadata._values
            if dim_metadata.shape != cube.shape:
                arr = iris.util.broadcast_to_shape(
                    arr,
                    cube.shape,
                    dim_metadata.cube_dims(cube),
                )
            obj = np.asarray(arr).view(cls)
            obj.units = dim_metadata.units

        # Remaining types (e.g., np.ndarray): try to convert to ndarray.
        else:
            obj = np.asarray(weights).view(cls)
            obj.units = Unit("1")

        # Overwrite units from units argument if necessary
        if units is not None:
            obj.units = units

        return obj

    def __array_finalize__(self, obj):
        """See https://numpy.org/doc/stable/user/basics.subclassing.html.

        Note
        ----
        `obj` cannot be `None` here since ``_Weights.__new__`` does not call
        ``super().__new__`` explicitly.

        """
        self.units = getattr(obj, "units", Unit("1"))

    @classmethod
    def update_kwargs(cls, kwargs, cube):
        """Update ``weights`` keyword argument in-place.

        Args:

        * kwargs (dict):
            Keyword arguments that will be updated in-place if a `weights`
            keyword is present which is not ``None``.
        * cube (Cube):
            Input cube for aggregation. If weights is given as :obj:`str`, try
            to extract a cell measure with the corresponding name from this
            cube. Otherwise, this argument is ignored.

        """
        if kwargs.get("weights") is not None:
            kwargs["weights"] = cls(kwargs["weights"], cube)
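
To illustrate why the np.asarray(...) calls in the code above realize lazy weights: np.asarray asks its argument for a concrete NumPy array through the `__array__` protocol, and a lazy array can only answer by computing itself. The following is a minimal sketch of that mechanism, not iris or dask code. `FakeLazyArray` is a hypothetical stand-in for a dask array, used only so the example runs with NumPy alone.

```python
import numpy as np


class FakeLazyArray:
    """Hypothetical stand-in for a lazy array (not part of dask or iris)."""

    def __init__(self, shape):
        self.shape = shape
        self.compute_calls = 0  # counts how often the data was realized

    def compute(self):
        # Pretend this is an expensive computation producing the real data.
        self.compute_calls += 1
        return np.ones(self.shape)

    def __array__(self, dtype=None, copy=None):
        # np.asarray() ends up here, which forces the computation.
        result = self.compute()
        return result if dtype is None else result.astype(dtype)


lazy_weights = FakeLazyArray((2, 3))

# Same pattern as `np.asarray(weights).view(cls)` in `_Weights.__new__`:
realized = np.asarray(lazy_weights)

print(type(realized).__name__, lazy_weights.compute_calls > 0)
```

After the np.asarray call the result is an ordinary in-memory np.ndarray, so any laziness of the original weights is gone before the aggregation even starts.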

Expected behaviour

The laziness of the weights array should be preserved. Because the weights array must be the same size as the data (side note: why does it need to be the same size? Is this a limitation of numpy?), realizing it makes this feature unusable on large datasets.

Environment

  • OS & Version: Ubuntu 23.04
  • Iris Version: 3.5, 3.6
@schlunma
Contributor

schlunma commented Jun 7, 2023

I thought about this for a while now. I don't think it's feasible to subclass dask.array.core.Array consistently similar to what is done with np.ndarray here (it would be possible to simply set the units attribute for Dask arrays; however, this is lost after operations like indexing or calling compute).

The easiest solution is probably to stop subclassing np.ndarray and instead give _Weights two attributes, array and units. _Weights.array would contain either the numpy or the dask array. This requires some changes in the code (we need to replace weights with weights.array everywhere), but it ensures that the weights are not realized unless necessary.
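
A rough sketch of what such a container could look like, under stated assumptions: `WeightsSketch` and `FakeLazyArray` are hypothetical names made up for this illustration, units are plain strings rather than cf_units.Unit objects, and none of this is taken from the actual fix.

```python
import numpy as np


class FakeLazyArray:
    """Hypothetical stand-in for a dask array; counts how often it is computed."""

    def __init__(self, shape):
        self.shape = shape
        self.compute_calls = 0

    def compute(self):
        self.compute_calls += 1
        return np.ones(self.shape)


class WeightsSketch:
    """Hypothetical container holding weights and units side by side."""

    def __init__(self, array, units="1"):
        # Keep the array exactly as given: no np.asarray(), so nothing is computed.
        self.array = array
        # Plain string used here for illustration only (assumption).
        self.units = units


lazy = FakeLazyArray((2, 3))
weights = WeightsSketch(lazy, units="m2")

# The lazy array is stored untouched; callers would use `weights.array`.
print(weights.array is lazy, lazy.compute_calls)
```

Because the units live on the wrapper rather than on the array, they are not lost when the array is indexed or computed; the cost is that every use site has to pass weights.array instead of weights.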

I'll try to come up with a PR for this.

@ESadek-MO
Contributor

This was fixed in the v3.6.1 release.
