Closed
1 change: 1 addition & 0 deletions docs/changelog_fragments/195.dev.rst
@@ -0,0 +1 @@
Fixed xarray load tests for new behaviour of xarray.Dataset.identical.
40 changes: 39 additions & 1 deletion tests/integration/test_xarray_load_and_save_equivalence.py
@@ -6,6 +6,7 @@
(2) check equivalence of files : xarray -> file VS xarray->ncdata->file
"""

import numpy as np
import pytest
import xarray
from ncdata.netcdf4 import from_nc4, to_nc4
@@ -37,6 +38,35 @@ def use_xarraylock():
    yield


def equivalence_fix_datasets(
    ds_from: xarray.Dataset, ds_to: xarray.Dataset
) -> tuple[xarray.Dataset, xarray.Dataset]:
    """
    Modify datasets in legitimate ways to make "ds_from.identical(ds_to)".

    The key differences are due to coordinates remaining lazy when loaded via ncdata,
    but having their data fetched in the "normal" load.
    The coordinates apparently remain 'identical', but it affects the dataset indexes.

    Minimum found necessary : where in 'ds_from' we find a lazy coordinate, which is a
    real one in 'ds_to', remove the associated index from 'ds_to'.
    """
    drop_indices = []
    for varname, var in ds_from.variables.items():
        if hasattr(var.data, "compute"):
            var_other = ds_to.variables.get(varname, None)
            if var_other is not None and isinstance(var_other.data, np.ndarray):
                # This is lazy, but the reference var is real : drop its index.
                if varname in ds_to.indexes:
                    drop_indices.append(varname)

    # NB drop_indexes is *not* an inplace operation!
    # So replace 'ds_to' with the returned new dataset.
    ds_to = ds_to.drop_indexes(drop_indices)
    # NB: as it currently is, we do *not* ever have to modify/replace 'ds_from'.
    return ds_from, ds_to


def test_load_direct_vs_viancdata(standard_testcase, use_xarraylock, tmp_path):
    source_filepath = standard_testcase.filepath
    ncdata = from_nc4(source_filepath)
@@ -51,7 +81,15 @@ def test_load_direct_vs_viancdata(standard_testcase, use_xarraylock, tmp_path):
    # Load same, via ncdata
    xr_ncdata_ds = to_xarray(ncdata)

    # Treat as OK if it passes xarray comparison
    # Check that datasets are "equal" : but NB this only compares values
    assert xr_ds.equals(xr_ncdata_ds)
Member Author

This just proves that A and B were "equal" (whatever that means) before I mangled them.
Do we really need this as well?


    # 'Fix' equivalence, by dropping indexes of coords which are lazy via ncdata.
    # These are the expected differences due to ncdata passing lazy arrays.
    # This should then make "Dataset.identical" true.
    xr_ncdata_ds, xr_ds = equivalence_fix_datasets(
        ds_from=xr_ncdata_ds, ds_to=xr_ds
    )
    assert xr_ds.identical(xr_ncdata_ds)
Member Author

@pp-mo pp-mo Feb 12, 2026

So, Dataset.identical is what just changed, and xarray don't consider that a breaking change, because it's classed as a "FIX" : pydata/xarray#11035

They don't state what "identical" actually means, but it is now comparing indexes.
However it still considers lazy data as "identical" to real -- and I'm still relying on that.
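
For reference, a minimal sketch of the index behaviour discussed here, assuming an xarray version that provides `Dataset.drop_indexes`. The toy dataset and its variable names are made up for illustration. It only shows that dropping an index returns a new dataset and leaves the coordinate itself in place; it makes no claim about what `identical` returns in any particular xarray release.

```python
import numpy as np
import xarray as xr

# Hypothetical toy dataset: one data variable on an indexed "x" coordinate.
ds = xr.Dataset(
    {"temp": ("x", np.array([1.0, 2.0, 3.0]))},
    coords={"x": [10, 20, 30]},
)

# drop_indexes returns a new dataset; the original keeps its index.
ds_noindex = ds.drop_indexes("x")

has_index_before = "x" in ds.indexes
has_index_after = "x" in ds_noindex.indexes
still_a_coord = "x" in ds_noindex.coords
print(has_index_before, has_index_after, still_a_coord)
```

This is why the helper in the diff has to rebind `ds_to` to the returned dataset rather than expecting an in-place change.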

TBH I'd be much happier if "identical" meant "in all respects" : I could then adapt/equalise the datasets in specific ways before testing with "identical".
Unfortunately, Xarray are a bit vague about equality testing.
They provide "identical", "equals" and "broadcast_equals" :
https://docs.xarray.dev/en/latest/api/dataarray.html#comparisons
But as noted, Dataset.equals only compares data, not metadata such as attributes (an odd choice), so that really doesn't cover what I want either.
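
To make the "equals" versus "identical" distinction concrete, here is a small sketch with two made-up datasets that differ only in a global attribute. It covers the attribute case only, which is documented behaviour, and says nothing about indexes or lazy data.

```python
import numpy as np
import xarray as xr

# Two hypothetical datasets with the same values but different global attributes.
ds_a = xr.Dataset({"temp": ("x", np.array([1.0, 2.0]))}, attrs={"title": "run A"})
ds_b = xr.Dataset({"temp": ("x", np.array([1.0, 2.0]))}, attrs={"title": "run B"})

values_match = ds_a.equals(ds_b)        # ignores attributes
fully_match = ds_a.identical(ds_b)      # also compares attributes
print(values_match, fully_match)
```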

So, perhaps I should just write a custom comparison routine here, engineered with exactly the tolerance needed? The problem with that is, I need to be confident that I have understood all the possible content components of xarray Datasets -- and that, again, isn't made totally clear (it's obviously based on netcdf, but what makes a variable a coordinate is never clearly stated, indexes are an additional thing, etc etc).
I don't think I can reasonably use the ncdata dataset comparison, since the point here is to compare xarray datasets.
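
As a rough idea of what such a custom routine could look like, here is a sketch only. The function name `dataset_differences` is hypothetical, the list of components checked is an assumption and probably incomplete (encodings, for instance, are not compared), and the plain `!=` on attribute dicts would need replacing if attributes can hold numpy arrays.

```python
import numpy as np
import xarray as xr


def dataset_differences(ds_a: xr.Dataset, ds_b: xr.Dataset) -> list[str]:
    """List differences between two datasets, one message per component."""
    diffs = []
    if dict(ds_a.attrs) != dict(ds_b.attrs):
        diffs.append("dataset attrs differ")
    if set(ds_a.coords) != set(ds_b.coords):
        diffs.append("coordinate names differ")
    if set(ds_a.indexes) != set(ds_b.indexes):
        diffs.append("indexed coordinate names differ")

    names_a, names_b = set(ds_a.variables), set(ds_b.variables)
    for name in sorted(names_a ^ names_b, key=str):
        diffs.append(f"variable {name!r} is only in one dataset")
    for name in sorted(names_a & names_b, key=str):
        var_a, var_b = ds_a.variables[name], ds_b.variables[name]
        # Variable.equals compares dims and values (computing lazy data if needed).
        if not var_a.equals(var_b):
            diffs.append(f"variable {name!r} dims or values differ")
        if dict(var_a.attrs) != dict(var_b.attrs):
            diffs.append(f"variable {name!r} attrs differ")
    return diffs


# Usage on a made-up dataset, with and without its "x" index.
ds = xr.Dataset(
    {"temp": ("x", np.array([1.0, 2.0]))},
    coords={"x": [10, 20]},
    attrs={"title": "demo"},
)
print(dataset_differences(ds, ds.drop_indexes("x")))
```

Returning a list of messages rather than a bool means a test failure can say which component differed, and an expected difference (such as a missing index) can be filtered out explicitly.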

Member Author

@pp-mo pp-mo Feb 12, 2026

> what are all the possible content components of xarray Datasets -- and that, again, isn't made totally clear

OK we do have this section : https://docs.xarray.dev/en/latest/user-guide/data-structures.html#dataset
So perhaps I was being a bit unfair. But it doesn't mention indexes.


