
Merge slow for multiple Cubes with large AncillaryVariables #7063

@ukmo-ccbunney

Description


📰 Custom Issue

TLDR: AncillaryVariables have their data payload read multiple times during the merge process. Cubes often have AncillaryVariables the same shape as their data (which can be BIG), and this causes a lot of I/O and very slow merge times.

Details

When merging multiple cubes that contain an AncillaryVariable, both the ancillary metadata and the ancillary variable data are compared for equality:

iris/lib/iris/_merge.py

Lines 448 to 449 in 94b80d0

if self.ancillary_variables_and_dims != other.ancillary_variables_and_dims:
msgs.append("cube.ancillary_variables differ")

via the method inherited from the _DimensionalMetadata base class:

iris/lib/iris/coords.py

Lines 665 to 669 in 94b80d0

# data values comparison
if eq and eq is not NotImplemented:
eq = iris.util.array_equal(
self._core_values(), other._core_values(), withnans=True
)
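The data comparison above goes through iris.util.array_equal(..., withnans=True), which has to realise both arrays before comparing them. A minimal sketch of those NaN-tolerant semantics in plain numpy (not iris's actual implementation) shows why this is a full-payload operation:

```python
import numpy as np

def array_equal_withnans(a, b):
    """NaN-tolerant array equality: NaNs in matching positions compare equal.

    A sketch of the semantics of iris.util.array_equal(..., withnans=True);
    not the iris implementation itself.
    """
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    if a.shape != b.shape:
        return False
    # Positions where both values are NaN count as equal.
    both_nan = np.isnan(a) & np.isnan(b)
    # Every element of both arrays must be touched to answer this.
    return bool(np.all(both_nan | (a == b)))

print(array_equal_withnans([1.0, np.nan], [1.0, np.nan]))  # True
print(array_equal_withnans([1.0, np.nan], [1.0, 2.0]))     # False
```

The point is that, unlike a metadata comparison, this cannot short-circuit without reading every element of both data payloads, so lazy ancillary data gets pulled from disk.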

This makes sense in the context of merging, as merging by design only promotes scalar coordinates to dimensions and expects all the other dimensional-metadata-like objects on the cube to be identical.

However, AncillaryVariables often hold some sort of status-flag data for the cube data, in which case the user likely wants to concatenate them into a single cube (for instance, when there is a separate file per timestep for a variable). This can be achieved by adding a new axis to the cube and the ancillary variable prior to concatenation, as detailed in #6790.
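The new-axis-then-concatenate step can be pictured in plain numpy (a hypothetical stand-in for promoting a scalar coordinate with iris.util.new_axis and then concatenating the cubes):

```python
import numpy as np

# Hypothetical per-timestep payloads: one 2-D field per input file.
fields = [np.full((3, 4), t, dtype=float) for t in range(5)]

# Add a new leading "time" axis to each field, then join along it --
# the numpy analogue of new_axis + CubeList.concatenate.
stacked = np.concatenate([f[np.newaxis] for f in fields], axis=0)
print(stacked.shape)  # (5, 3, 4)
```

The same reshaping would apply to a same-shaped ancillary payload, which is exactly why the ancillary data ends up as large as the cube data.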

However, in this case the merge process still checks every Cube's ancillary variables against those of every other candidate Cube. As the AncillaryVariables are often the same size as the cube data (which can be very large), the ancillary data is repeatedly read from disk during the merge. This is a potentially big I/O hit and can result in very slow merge times for large datasets.

A workaround for a specific case was to patch the __eq__ operator on AncillaryVariable so that it only checks the metadata:

import iris
import iris.coords

# Custom equality operator for the AncillaryVariable class:
def ancil_eq(self, other):
    if other is self:
        return True

    if hasattr(other, "metadata"):
        # metadata comparison
        return self.metadata == other.metadata

    return NotImplemented

# Patch the AncillaryVariable __eq__ operator:
orig_eq_method = iris.coords.AncillaryVariable.__eq__
iris.coords.AncillaryVariable.__eq__ = ancil_eq
iris.load(...)

# Revert the patch:
iris.coords.AncillaryVariable.__eq__ = orig_eq_method
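A safer way to apply and revert such a patch is a context manager that restores the original __eq__ even if loading raises. A sketch using a stand-in class (in real use the target would be iris.coords.AncillaryVariable and the body would call iris.load):

```python
from contextlib import contextmanager

@contextmanager
def patched_eq(cls, eq_func):
    """Temporarily replace cls.__eq__, restoring it even on error."""
    original = cls.__eq__
    cls.__eq__ = eq_func
    try:
        yield
    finally:
        cls.__eq__ = original

# Demonstration with a hypothetical stand-in class:
class Thing:
    def __init__(self, metadata, data):
        self.metadata = metadata
        self.data = data

    def __eq__(self, other):
        return self.metadata == other.metadata and self.data == other.data

a, b = Thing("flags", [1, 2]), Thing("flags", [3, 4])
assert a != b  # full comparison: data payloads differ

with patched_eq(Thing, lambda s, o: s.metadata == o.metadata):
    assert a == b  # metadata-only comparison while patched

assert a != b  # original behaviour restored on exit
```

The try/finally guarantees the revert happens even when the load inside the block fails, which the inline patch-and-revert above does not.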

Obviously, this only works in specific cases where the user knows it is safe to skip the value comparison of the AncillaryVariable data.

Potential solution

It might be possible to pass a check_ancils flag to iris.cube.CubeList.merge() (and .merge_cube()) in the same way as iris.cube.CubeList.concatenate(). This would allow the user to optionally turn off the comparison of ancillary data (just compare the metadata) if they are confident it is safe to do so with their data files and they intend to concatenate multiple cubes with an AncillaryVariable into a single Cube.
