fix(core): make dataset hash deterministic across pandas versions #741
Merged
DatasetCollection and ExecutionDatasetCollection hashed dataset identifiers via pandas.util.hash_pandas_object, whose output varies across pandas releases and platforms. A regression baseline minted in one environment then failed the CI coupling gate in another, and the solver could re-run executions needlessly. Hash the sorted slug values with hashlib instead, combining the per-source digests in a fixed order (and keying on source type to avoid cross-type collisions). Regenerate the committed example catalog_hash values and the dataset-hash regression snapshots to match. Existing databases will re-run each execution once after upgrading because the dataset_hash changes; results are unaffected.
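The hashing scheme described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the function names hash_slugs and hash_collection, the use of plain strings as source-type keys, and the NUL separator byte are all assumptions made for the example.

```python
import hashlib
from typing import Iterable, Mapping


def hash_slugs(slugs: Iterable[str]) -> str:
    """Hypothetical helper: SHA1 hex digest over the sorted slug values."""
    digest = hashlib.sha1()
    for slug in sorted(slugs):
        digest.update(slug.encode("utf-8"))
        # Assumed separator so that ["ab", "c"] and ["a", "bc"] hash differently.
        digest.update(b"\0")
    return digest.hexdigest()


def hash_collection(slugs_by_source_type: Mapping[str, Iterable[str]]) -> str:
    """Hypothetical helper: combine per-source digests in a fixed order.

    Sources are visited in sorted key order, and each source type name is
    mixed into the combined digest so that identical slugs under different
    source types do not collide.
    """
    combined = hashlib.sha1()
    for source_type in sorted(slugs_by_source_type):
        combined.update(source_type.encode("utf-8"))
        combined.update(b"\0")
        per_source = hash_slugs(slugs_by_source_type[source_type])
        combined.update(per_source.encode("ascii"))
    return combined.hexdigest()


# Illustrative slugs only; row order and dict insertion order do not matter.
first = hash_collection({"cmip6": ["x.b", "x.a"], "obs4mips": ["y"]})
second = hash_collection({"obs4mips": ["y"], "cmip6": ["x.a", "x.b"]})
print(first == second)
```

Because only hashlib and string sorting are involved, the result depends on the slug values alone and not on the installed pandas version or the platform.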
Codecov Report: ✅ All modified and coverable lines are covered by tests. 3 files have indirect coverage changes.
lewisjared added a commit that referenced this pull request Jun 18, 2026
… merge PR #741 (merged into main) changed datasets.hash to a pandas-version-independent algorithm. Recompute the catalog_hash for the gpp-fluxnet2015, lai-avh15c1 and mrsos-wangmao cmip6 baselines with the new algorithm so that each catalog.yaml _metadata.hash matches both its manifest.json catalog_hash and the recomputed value. This keeps the coupling gate green now that the algorithm has changed.
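The three-way agreement this commit restores can be illustrated with a small check. This is only a sketch of the idea: the function name coupling_mismatches is hypothetical, the hashes are passed in as already-loaded strings, and the real coupling gate may read and compare the files quite differently.

```python
from typing import List


def coupling_mismatches(
    catalog_metadata_hash: str,
    manifest_catalog_hash: str,
    recomputed_hash: str,
) -> List[str]:
    """Hypothetical check: report which stored hash disagrees with the recomputed one."""
    problems = []
    if catalog_metadata_hash != recomputed_hash:
        problems.append("catalog.yaml _metadata.hash differs from the recomputed hash")
    if manifest_catalog_hash != recomputed_hash:
        problems.append("manifest.json catalog_hash differs from the recomputed hash")
    return problems


# Placeholder values, not real hashes from the repository.
print(coupling_mismatches("abc", "abc", "abc"))
print(coupling_mismatches("old", "abc", "abc"))
```

An empty list means the baseline is consistent. A stale value in either file shows up as a named mismatch.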
Description
DatasetCollection/ExecutionDatasetCollection hashed dataset identifiers via pandas.util.hash_pandas_object, whose output varies across pandas releases and platforms. The same committed catalog.yaml therefore hashed to different values in different environments, which let a regression baseline minted on one machine fail the CI coupling gate on another, and could make the solver re-run executions unnecessarily.
This replaces the pandas-based hash with a hashlib SHA1 digest over the sorted slug values, and combines the per-source digests in a fixed, source-type-keyed order, so the result is stable across pandas versions and platforms and independent of row/insertion order.
Because the hash is also the solver's dataset_hash execution-dedup key (solver.py), existing databases will re-run each execution once after upgrading; results are unaffected. The committed example catalog_hash values and the dataset-hash regression snapshots have been regenerated to match.
Checklist
Please confirm that this pull request has done the following:
changelog/