fix(core): make dataset hash deterministic across pandas versions#741

Merged: lewisjared merged 2 commits into main from fix/deterministic-dataset-hash on Jun 18, 2026
@lewisjared (Contributor) commented Jun 18, 2026

Description

DatasetCollection / ExecutionDatasetCollection hashed dataset identifiers via pandas.util.hash_pandas_object, whose output varies across pandas releases and platforms. The same committed catalog.yaml therefore hashed to different values in different environments, which let a regression baseline minted on one machine fail the CI coupling gate on another, and could make the solver re-run executions unnecessarily.

This replaces the pandas-based hash with a hashlib SHA1 digest over the sorted slug values, and combines the per-source digests in a fixed, source-type-keyed order, so the result is stable across pandas versions and platforms and independent of row/insertion order.
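For readers who want to see the shape of the approach, here is a minimal sketch, not the code in datasets.py. The function names `hash_slugs` and `hash_collection`, the NUL separator, and the way the source type is mixed into the combined digest are assumptions made for illustration; only the use of a hashlib SHA1 digest over sorted slug values and the fixed, source-type-keyed combination order come from the description above.

```python
import hashlib
from collections.abc import Iterable, Mapping


def hash_slugs(slugs: Iterable[str]) -> str:
    """Return a SHA1 hex digest over the sorted slug values (hypothetical helper)."""
    digest = hashlib.sha1()
    for slug in sorted(slugs):
        digest.update(slug.encode("utf-8"))
        # Assumed separator so that ("ab", "c") and ("a", "bc") hash differently.
        digest.update(b"\x00")
    return digest.hexdigest()


def hash_collection(slugs_by_source_type: Mapping[str, Iterable[str]]) -> str:
    """Combine per-source digests in sorted source-type order (hypothetical helper)."""
    combined = hashlib.sha1()
    for source_type in sorted(slugs_by_source_type):
        # Keying on the source type keeps identical slugs under different
        # source types from producing the same combined digest.
        combined.update(source_type.encode("utf-8"))
        combined.update(b"\x00")
        combined.update(hash_slugs(slugs_by_source_type[source_type]).encode("ascii"))
    return combined.hexdigest()


# Row order and insertion order do not affect the result.
a = hash_collection({"cmip6": ["slug-b", "slug-a"], "obs4mips": ["slug-c"]})
b = hash_collection({"obs4mips": ["slug-c"], "cmip6": ["slug-a", "slug-b"]})
same = a == b  # True
```

Because the digest depends only on UTF-8 bytes and Python's string sort order, it should not vary with the installed pandas version or the platform.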

Because the hash is also the solver's dataset_hash execution-dedup key (solver.py), existing databases will re-run each execution once after upgrading; results are unaffected. The committed example catalog_hash values and the dataset-hash regression snapshots have been regenerated to match.
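The one-off re-run follows from how a dedup key behaves, which the toy model below illustrates. This is not the logic in solver.py: `needs_execution`, the in-memory set, and the placeholder hash strings are all hypothetical and stand in for whatever lookup the solver performs against the database.

```python
def needs_execution(stored_hashes: set[str], dataset_hash: str) -> bool:
    """Hypothetical dedup check: run only if this dataset_hash has not been seen."""
    return dataset_hash not in stored_hashes


# Placeholder values: a hash stored before the upgrade and the hash the new
# algorithm produces for the same, unchanged datasets.
stored = {"hash-from-old-algorithm"}
new_hash = "hash-from-new-algorithm"

first_solve = needs_execution(stored, new_hash)   # True: the key changed, so re-run
stored.add(new_hash)                              # the re-run records the new key
second_solve = needs_execution(stored, new_hash)  # False: deduplicated again
```

The datasets themselves are unchanged, so the re-run produces the same results; only the key used to recognise prior work differs.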

Checklist

Please confirm that this pull request has done the following:

  • Tests added
  • Documentation added (where applicable)
  • Changelog item added to changelog/

DatasetCollection and ExecutionDatasetCollection hashed dataset identifiers
via pandas.util.hash_pandas_object, whose output varies across pandas releases
and platforms. A regression baseline minted in one environment then failed the
CI coupling gate in another, and the solver could re-run executions needlessly.

Hash the sorted slug values with hashlib instead, combining the per-source
digests in a fixed order (and keying on source type to avoid cross-type
collisions). Regenerate the committed example catalog_hash values and the
dataset-hash regression snapshots to match.

Existing databases will re-run each execution once after upgrading because the
dataset_hash changes; results are unaffected.

codecov Bot commented Jun 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Flag Coverage Δ
core 92.56% <100.00%> (+<0.01%) ⬆️
providers 91.82% <ø> (+0.02%) ⬆️

Files with missing lines Coverage Δ
.../climate-ref-core/src/climate_ref_core/datasets.py 90.12% <100.00%> (+0.37%) ⬆️

... and 3 files with indirect coverage changes

@lewisjared lewisjared merged commit a272d4a into main Jun 18, 2026
25 checks passed
@lewisjared lewisjared deleted the fix/deterministic-dataset-hash branch June 18, 2026 11:25
lewisjared added a commit that referenced this pull request Jun 18, 2026
… merge

PR #741 (merged into main) changed datasets.hash to a pandas-version-independent
algorithm. Recompute the catalog_hash for the gpp-fluxnet2015, lai-avh15c1 and
mrsos-wangmao cmip6 baselines with the new algorithm so each catalog.yaml
_metadata.hash matches its manifest.json catalog_hash and the recomputed value,
keeping the coupling gate green now that the algorithm has changed.
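
The condition described in this commit message can be sketched as a three-way equality. The function name `coupling_gate_ok` and the use of already-parsed dictionaries are assumptions for illustration; the actual CI gate may be implemented differently. Only the field names `_metadata.hash` and `catalog_hash` are taken from the message above.

```python
def coupling_gate_ok(catalog: dict, manifest: dict, recomputed_hash: str) -> bool:
    """Hypothetical check: catalog.yaml, manifest.json and the recomputed hash agree.

    `catalog` and `manifest` are assumed to be the parsed contents of
    catalog.yaml and manifest.json respectively.
    """
    catalog_hash = catalog.get("_metadata", {}).get("hash")
    return catalog_hash == manifest.get("catalog_hash") == recomputed_hash


# Placeholder hash strings, not real baseline values.
catalog = {"_metadata": {"hash": "abc123"}}
manifest = {"catalog_hash": "abc123"}

green = coupling_gate_ok(catalog, manifest, "abc123")  # True
stale = coupling_gate_ok(catalog, manifest, "def456")  # False: algorithm changed
```

A baseline whose stored values still come from the old algorithm fails the third comparison, which is why the three baselines had to be regenerated.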