fix: prevent DRS re-ingestion from regressing finalised datasets #567
Merged
lewisjared merged 3 commits into main on Feb 24, 2026
Conversation
Two bugs caused DRS re-ingestion to corrupt previously-finalised datasets:

1. `update_or_create` crashed with `TypeError: boolean value of NA is ambiguous` when comparing existing DB values against `pd.NA` from the DRS parser. Fixed by introducing `_values_differ()`, which safely handles `pd.NA`, `np.nan` and `None` comparisons.
2. Re-ingesting with the DRS parser would overwrite finalised metadata with `pd.NA` and set `finalised=False`, losing all work from finalisation. Fixed by skipping already-finalised datasets during unfinalised (DRS) ingestion, while still adding any new files.
- Call `db.session.expire_all()` after each dataset commit in `ingest_datasets` to release ORM objects from the session identity map. Without this, all `Dataset` and `DatasetFile` objects accumulate across the entire ingestion loop, causing unbounded memory growth.
- Add an info-level log in `ExecutionSolver.solve()` to show which diagnostic is being solved, making it easier to track progress.
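A minimal, self-contained sketch of the `expire_all()` pattern, using a toy model rather than the project's real `Dataset`/`DatasetFile` schema. `expire_on_commit=False` here mirrors a session whose instances stay fully loaded after commit, which is the situation the fix addresses.

```python
# Toy model and in-memory SQLite engine; names are illustrative only.
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Dataset(Base):
    __tablename__ = "dataset"
    id = Column(Integer, primary_key=True)
    name = Column(String)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine, expire_on_commit=False) as session:
    for i in range(3):
        session.add(Dataset(name=f"ds-{i}"))
        session.commit()
        # Discard loaded attribute state; combined with SQLAlchemy's
        # weakly-referencing identity map, this lets objects from earlier
        # iterations be garbage-collected instead of accumulating.
        session.expire_all()
    row_count = session.query(Dataset).count()
```

The data itself is unaffected: only the in-memory state cached on the ORM instances is released.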
lewisjared added a commit that referenced this pull request on Feb 25, 2026
* origin/main: (86 commits)
  Bump version: 0.11.0 → 0.11.1
  docs: add changelog for #567
  fix: reduce memory during ingestion and add solve logging
  fix: prevent DRS re-ingestion from regressing finalised datasets
  Bump version: 0.10.0 → 0.11.0
  chore: Update comment
  chore: upgrade pins for ilamb
  fix: revert compat=override on open_mfdataset
  docs: add changelog for #565
  chore: Upgrade lockfile and fix some errors
  chore: add coverage
  chore: add default separator in alembic
  fix: time_coder warning
  chore: Pin to use tas
  fix(solver): preserve DataCatalog wrapper in apply_dataset_filters
  fix(tests): use to_frame() when accessing DataCatalog in solver tests
  docs: Changelog
  chore: run the finalise in threads
  chore: clean up
  chore: add fix changelog entry for PR #561
  ...
Description
Fixes three issues discovered when reviewing the DRS ingest -> finalise -> re-ingest flow:
1. **pd.NA comparison crash in `update_or_create`**: The `!=` comparison with `pd.NA` raises `TypeError: boolean value of NA is ambiguous` when comparing existing DB values against incoming `pd.NA` from the DRS parser. Fixed by introducing `_values_differ()`, which safely handles `pd.NA`, `np.nan`, and `None`.
2. **DRS re-ingestion regresses finalised datasets**: Re-ingesting with the DRS parser would overwrite finalised metadata with `pd.NA` and set `finalised=False`, losing all work from finalisation. Fixed by skipping already-finalised datasets during unfinalised (DRS) ingestion, while still adding any new files.
3. **Unbounded memory growth during ingestion**: The SQLAlchemy session identity map accumulates all `Dataset` and `DatasetFile` ORM objects across the entire ingestion loop without releasing them. For large archives this causes memory to grow to 100 GB+. Fixed by calling `db.session.expire_all()` after each dataset commit.

Also adds an info-level log in `ExecutionSolver.solve()` to show which diagnostic is being solved.

Checklist
Please confirm that this pull request has done the following:
changelog/
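A toy sketch of the finalised-dataset guard described in the PR: a plain dict stands in for the database and the names (`store`, `slug`, `metadata`) are illustrative, not the project's ORM. The key behaviour is that a finalised dataset's metadata and `finalised` flag are never touched by unfinalised ingestion, while new files are still registered.

```python
# Illustrative in-memory version of the skip-finalised guard; not the
# project's real update_or_create / ingest_datasets code.
class Dataset:
    def __init__(self, slug, metadata, finalised=False):
        self.slug = slug
        self.metadata = metadata
        self.finalised = finalised
        self.files = []


def ingest_dataset(store, slug, metadata, files):
    existing = store.get(slug)
    if existing is not None and existing.finalised:
        # Skip metadata updates entirely: DRS-parsed (unfinalised) values
        # must not overwrite finalised metadata or flip finalised=False.
        # New files are still added.
        existing.files.extend(f for f in files if f not in existing.files)
        return existing
    if existing is None:
        existing = store[slug] = Dataset(slug, metadata)
    else:
        existing.metadata = metadata
    existing.files.extend(f for f in files if f not in existing.files)
    return existing
```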