Skip to content

Comments

fix: prevent DRS re-ingestion from regressing finalised datasets#567

Merged
lewisjared merged 3 commits intomainfrom
fix/reingest-finalised-regression
Feb 24, 2026
Merged

fix: prevent DRS re-ingestion from regressing finalised datasets#567
lewisjared merged 3 commits intomainfrom
fix/reingest-finalised-regression

Conversation

@lewisjared
Copy link
Contributor

@lewisjared lewisjared commented Feb 24, 2026

Description

Fixes three issues discovered when reviewing the DRS ingest -> finalise -> re-ingest flow:

  1. pd.NA comparison crash in update_or_create: The != comparison with pd.NA raises TypeError: boolean value of NA is ambiguous when comparing existing DB values against incoming pd.NA from the DRS parser. Fixed by introducing _values_differ() which safely handles pd.NA, np.nan, and None.

  2. DRS re-ingestion regresses finalised datasets: Re-ingesting with the DRS parser would overwrite finalised metadata with pd.NA and set finalised=False, losing all work from finalisation. Fixed by skipping already-finalised datasets during unfinalised (DRS) ingestion, while still adding any new files.

  3. Unbounded memory growth during ingestion: The SQLAlchemy session identity map accumulates all Dataset and DatasetFile ORM objects across the entire ingestion loop without releasing them. For large archives this causes memory to grow to 100GB+. Fixed by calling db.session.expire_all() after each dataset commit.

Also adds an info-level log in ExecutionSolver.solve() to show which diagnostic is being solved.

Checklist

Please confirm that this pull request has done the following:

  • Tests added
  • Documentation added (where applicable)
  • Changelog item added to changelog/

Two bugs caused DRS re-ingestion to corrupt previously-finalised datasets:

1. `update_or_create` crashed with `TypeError: boolean value of NA is
   ambiguous` when comparing existing DB values against `pd.NA` from the
   DRS parser. Fixed by introducing `_values_differ()` which safely
   handles pd.NA, np.nan and None comparisons.

2. Re-ingesting with the DRS parser would overwrite finalised metadata
   with pd.NA and set `finalised=False`, losing all work from
   finalisation. Fixed by skipping already-finalised datasets during
   unfinalised (DRS) ingestion, while still adding any new files.
- Call db.session.expire_all() after each dataset commit in
  ingest_datasets to release ORM objects from the session identity map.
  Without this, all Dataset and DatasetFile objects accumulate across
  the entire ingestion loop, causing unbounded memory growth.

- Add info-level log in ExecutionSolver.solve() to show which
  diagnostic is being solved, making it easier to track progress.
@codecov
Copy link

codecov bot commented Feb 24, 2026

Codecov Report

❌ Patch coverage is 84.78261% with 7 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
packages/climate-ref/src/climate_ref/database.py 70.58% 5 Missing ⚠️
...kages/climate-ref/src/climate_ref/datasets/base.py 92.59% 2 Missing ⚠️
Flag Coverage Δ
core 93.15% <84.78%> (-0.07%) ⬇️
providers 89.56% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...s/climate-ref/src/climate_ref/datasets/__init__.py 94.20% <100.00%> (+0.08%) ⬆️
packages/climate-ref/src/climate_ref/solver.py 98.32% <100.00%> (+<0.01%) ⬆️
...kages/climate-ref/src/climate_ref/datasets/base.py 97.40% <92.59%> (-1.04%) ⬇️
packages/climate-ref/src/climate_ref/database.py 95.52% <70.58%> (-3.64%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@lewisjared lewisjared merged commit 49f549e into main Feb 24, 2026
26 checks passed
@lewisjared lewisjared deleted the fix/reingest-finalised-regression branch February 24, 2026 08:36
lewisjared added a commit that referenced this pull request Feb 25, 2026
* origin/main: (86 commits)
  Bump version: 0.11.0 → 0.11.1
  docs: add changelog for #567
  fix: reduce memory during ingestion and add solve logging
  fix: prevent DRS re-ingestion from regressing finalised datasets
  Bump version: 0.10.0 → 0.11.0
  chore: Update comment
  chore: upgrade pins for ilamb
  fix: revert compat=override on open_mfdataset
  docs: add changelog for #565
  chore: Upgrade lockfile and fix some errors
  chore: add coverage
  chore: add default separator in alembic
  fix: time_coder warning
  chore: Pin to use tas
  fix(solver): preserve DataCatalog wrapper in apply_dataset_filters
  fix(tests): use to_frame() when accessing DataCatalog in solver tests
  docs: Changelog
  chore: run the finalise in threads
  chore: clean up
  chore: add fix changelog entry for PR #561
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant