Fixes to dataset equivalence testing on xarray loads. #195
# Treat as OK if it passes xarray comparison
# Check that datasets are "equal" : but NB this only compares values
assert xr_ds.equals(xr_ncdata_ds)
This just proves that A and B were "equal" (whatever that means) before I mangled them.
Do we really need this as well ?
xr_ncdata_ds, xr_ds = equivalence_fix_datasets(
    ds_from=xr_ncdata_ds, ds_to=xr_ds
)
assert xr_ds.identical(xr_ncdata_ds)
So, Dataset.identical is what just changed, and xarray doesn't consider that a breaking change, because it's classed as a "FIX" : pydata/xarray#11035
They don't state what "identical" actually means, but it is now comparing indexes.
However, it still considers lazy data as "identical" to real data -- and I'm still relying on that.
TBH I'd be much happier if "identical" meant "in all respects" : I could then adapt/equalise the datasets in specific ways before testing with "identical".
Unfortunately, xarray is a bit vague about equality testing.
It provides "identical", "equals" and "broadcast_equals" :
https://docs.xarray.dev/en/latest/api/dataarray.html#comparisons
But as noted, Dataset.equals only compares data, not metadata (an odd choice), so that doesn't really cover what I want either.
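To illustrate the distinction, here is a minimal sketch. It assumes only the documented behaviour that "equals" ignores attributes while "identical" also checks them; the variable name and attribute values are made up for the example.

```python
import xarray as xr

# Two datasets with the same variable name, dimension and values,
# differing only in a dataset-level attribute (illustrative values).
ds_a = xr.Dataset({"temp": ("x", [1.0, 2.0, 3.0])}, attrs={"title": "A"})
ds_b = xr.Dataset({"temp": ("x", [1.0, 2.0, 3.0])}, attrs={"title": "B"})

values_match = ds_a.equals(ds_b)     # attributes are not compared
fully_match = ds_a.identical(ds_b)   # attributes are compared as well

print("equals:", values_match)
print("identical:", fully_match)
```

Neither call says anything explicit here about indexes or lazy content, which is the part that is under-specified.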
So, perhaps I should just write a custom comparison routine here, engineered with exactly the tolerance that is needed ? The problem with that is, I need to be confident that I have understood what are all the possible content components of xarray Datasets -- and that, again, isn't made totally clear (it's obviously based on netcdf, but what makes a variable a coordinate is never clearly stated, indexes are an additional thing, etc.).
I don't think I can reasonably use the ncdata dataset comparison, since the point here is to compare xarray datasets.
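For illustration, a custom comparison along those lines might look like the sketch below. The helper name dataset_differences is hypothetical (it is not part of xarray or ncdata). It assumes the components worth checking are the dataset attributes, the set of coordinate names, the set of indexed names, and each variable's dimensions, values and attributes, and that attribute values are plain scalars or strings.

```python
import xarray as xr


def dataset_differences(ds_a, ds_b):
    """Hypothetical helper: list human-readable differences between two datasets."""
    diffs = []

    # Dataset-level attributes (assumes simple values, not arrays).
    if ds_a.attrs != ds_b.attrs:
        diffs.append("dataset attributes differ")

    # Which names are coordinates, and which names carry an index.
    if set(ds_a.coords) != set(ds_b.coords):
        diffs.append("coordinate names differ")
    if set(ds_a.indexes) != set(ds_b.indexes):
        diffs.append("index names differ")

    # All variables, i.e. data variables and coordinates together.
    names_a, names_b = set(ds_a.variables), set(ds_b.variables)
    for name in sorted(names_a ^ names_b, key=str):
        diffs.append(f"variable {name!r}: only in one dataset")
    for name in sorted(names_a & names_b, key=str):
        var_a, var_b = ds_a.variables[name], ds_b.variables[name]
        if var_a.dims != var_b.dims:
            diffs.append(f"variable {name!r}: dimensions differ")
        elif not var_a.equals(var_b):
            diffs.append(f"variable {name!r}: values differ")
        if var_a.attrs != var_b.attrs:
            diffs.append(f"variable {name!r}: attributes differ")

    return diffs


# Example: same data, but only one dataset has an indexed "x" coordinate.
ds_indexed = xr.Dataset({"v": ("x", [1, 2, 3])}, coords={"x": [10, 20, 30]})
ds_plain = xr.Dataset({"v": ("x", [1, 2, 3])})
for line in dataset_differences(ds_indexed, ds_plain):
    print(line)
```

Returning a list of differences, rather than a single boolean, makes it easier to decide which ones a test should tolerate (for example, lazy versus real data, which this sketch does not distinguish).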
> what are all the possible content components of xarray Datasets -- and that, again, isn't made totally clear
OK we do have this section : https://docs.xarray.dev/en/latest/user-guide/data-structures.html#dataset
So perhaps I was being a bit unfair. But it doesn't mention indexes.
@chrisbunney I'd value your opinion on this before I merge it ! Since I'm about to migrate the repo to SciTools, it might make a useful testcase for permissions there.
Found that a later xarray release (v2026.01.0) was breaking the xarray load "direct vs via ncdata" tests.
Apparently, the exact meaning of Dataset.identical has changed.
From experiment, the problem seems to be that Dataset.identical now checks indexes.
Presumably this? : pydata/xarray#11035
This doesn't really affect ncdata behaviour : it was always the case that e.g. printing the with/without-ncdata loaded datasets showed differences ("via-ncdata" version has extra lazy coords and missing indexes).
Just the means of checking the "equivalence" needed fixing.