
Merge slow for multiple Cubes with large AncillaryVariables #7063

@ukmo-ccbunney

Description


📰 Custom Issue

TLDR: AncillaryVariables have their data payload read multiple times during the merge process. Cubes often have AncillaryVariables the same shape as their data (which can be BIG), and this causes a lot of I/O and very slow merge times.

Details

When merging multiple cubes that contain an AncillaryVariable, both the ancillary metadata and the ancillary variable data are compared for equality:

iris/lib/iris/_merge.py

Lines 448 to 449 in 94b80d0

if self.ancillary_variables_and_dims != other.ancillary_variables_and_dims:
msgs.append("cube.ancillary_variables differ")

via the method inherited from the _DimensionalMetadata base class:

iris/lib/iris/coords.py

Lines 665 to 669 in 94b80d0

# data values comparison
if eq and eq is not NotImplemented:
eq = iris.util.array_equal(
self._core_values(), other._core_values(), withnans=True
)
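The data comparison above goes through iris.util.array_equal(..., withnans=True), which has to realise both arrays before comparing them. A minimal sketch of those NaN-tolerant semantics in plain numpy (not iris's actual implementation) shows why this is a full-payload operation:

```python
import numpy as np

def array_equal_withnans(a, b):
    """NaN-tolerant array equality: NaNs in matching positions compare equal.

    A sketch of the semantics of iris.util.array_equal(..., withnans=True);
    not the iris implementation itself.
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    if a.shape != b.shape:
        return False
    # Positions where both values are NaN count as equal.
    both_nan = np.isnan(a) & np.isnan(b)
    # Every element of both arrays must be touched to answer this.
    return bool(np.all(both_nan | (a == b)))

print(array_equal_withnans([1.0, np.nan], [1.0, np.nan]))  # True
print(array_equal_withnans([1.0, np.nan], [1.0, 2.0]))     # False
```

The point is that, unlike a metadata comparison, this cannot short-circuit without reading every element of both data payloads, so lazy ancillary data gets pulled from disk.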

This makes sense in the context of merging, as merging by design only promotes scalar coordinates to dimensions and expects all the other dimensional-metadata-like objects on the cube to be identical.

However, AncillaryVariables often hold some sort of status-flag data for the cube data, in which case the user likely wants to concatenate them into a single cube (for instance, when there is a separate file per timestep for a variable). This can be achieved by adding a new axis to the cube and the ancillary variable prior to concatenation, as detailed in #6790.
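The new-axis-then-concatenate step can be pictured in plain numpy (a hypothetical stand-in for promoting a scalar coordinate with iris.util.new_axis and then concatenating the cubes):

```python
import numpy as np

# Hypothetical per-timestep payloads: one 2-D field per input file.
fields = [np.full((3, 4), t, dtype=float) for t in range(5)]

# Add a new leading "time" axis to each field, then join along it --
# the numpy analogue of new_axis + CubeList.concatenate.
stacked = np.concatenate([f[np.newaxis] for f in fields], axis=0)
print(stacked.shape)  # (5, 3, 4)
```

The same reshaping would apply to a same-shaped ancillary payload, which is exactly why the ancillary data ends up as large as the cube data.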

However, in this case the merge process still checks every Cube's ancillary variables against those of every other candidate Cube. As the AncillaryVariables are often the same size as the cube data (which can be very large), the ancillary data is repeatedly read from disk during the merge. This is a potentially big I/O hit and can result in very slow merge times for large datasets.

A workaround for a specific case was to patch the __eq__ operator on AncillaryVariable so that it only checks the metadata:

import iris
import iris.coords

# Custom equality operator for the AncillaryVariable class:
def ancil_eq(self, other):
    if other is self:
        return True

    if hasattr(other, "metadata"):
        # metadata comparison
        return self.metadata == other.metadata

    return NotImplemented

# Patch the AncillaryVariable __eq__ operator:
orig_eq_method = iris.coords.AncillaryVariable.__eq__
iris.coords.AncillaryVariable.__eq__ = ancil_eq
iris.load(...)

# Revert the patch:
iris.coords.AncillaryVariable.__eq__ = orig_eq_method
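A safer way to apply and revert such a patch is a context manager that restores the original __eq__ even if loading raises. A sketch using a stand-in class (in real use the target would be iris.coords.AncillaryVariable and the body would call iris.load):

```python
from contextlib import contextmanager

@contextmanager
def patched_eq(cls, eq_func):
    """Temporarily replace cls.__eq__, restoring it even on error."""
    original = cls.__eq__
    cls.__eq__ = eq_func
    try:
        yield
    finally:
        cls.__eq__ = original

# Demonstration with a hypothetical stand-in class:
class Thing:
    def __init__(self, metadata, data):
        self.metadata = metadata
        self.data = data

    def __eq__(self, other):
        return self.metadata == other.metadata and self.data == other.data

a, b = Thing("flags", [1, 2]), Thing("flags", [3, 4])
assert a != b  # full comparison: data payloads differ

with patched_eq(Thing, lambda s, o: s.metadata == o.metadata):
    assert a == b  # metadata-only comparison while patched

assert a != b  # original behaviour restored on exit
```

The try/finally guarantees the revert happens even when the load inside the block fails, which the inline patch-and-revert above does not.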

Obviously, this only works in specific cases where the user knows it is safe to skip the value comparison of the AncillaryVariable data.

Potential solution

It might be possible to pass a check_ancils flag to iris.cube.CubeList.merge() (and .merge_cube()) in the same way as iris.cube.CubeList.concatenate(). This would allow the user to optionally turn off the comparison of ancillary data (just compare the metadata) if they are confident it is safe to do so with their data files and they intend to concatenate multiple cubes with an AncillaryVariable into a single Cube.
