fix: prevent DRS re-ingestion from regressing finalised datasets #567
Merged
lewisjared merged 3 commits into main on Feb 24, 2026
Conversation
Two bugs caused DRS re-ingestion to corrupt previously-finalised datasets:

1. `update_or_create` crashed with `TypeError: boolean value of NA is ambiguous` when comparing existing DB values against `pd.NA` from the DRS parser. Fixed by introducing `_values_differ()`, which safely handles `pd.NA`, `np.nan` and `None` comparisons.
2. Re-ingesting with the DRS parser would overwrite finalised metadata with `pd.NA` and set `finalised=False`, losing all work from finalisation. Fixed by skipping already-finalised datasets during unfinalised (DRS) ingestion, while still adding any new files.
- Call `db.session.expire_all()` after each dataset commit in `ingest_datasets` to release ORM objects from the session identity map. Without this, all `Dataset` and `DatasetFile` objects accumulate across the entire ingestion loop, causing unbounded memory growth.
- Add an info-level log in `ExecutionSolver.solve()` to show which diagnostic is being solved, making it easier to track progress.
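A minimal, self-contained sketch of the `expire_all()` pattern, using a toy model rather than the project's real `Dataset`/`DatasetFile` schema. `expire_on_commit=False` here mirrors a session whose instances stay fully loaded after commit, which is the situation the fix addresses.

```python
# Toy model and in-memory SQLite engine; names are illustrative only.
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Dataset(Base):
    __tablename__ = "dataset"
    id = Column(Integer, primary_key=True)
    name = Column(String)


engine = create_engine("sqlite://")
Base.metadata.create_all(engine)

with Session(engine, expire_on_commit=False) as session:
    for i in range(3):
        session.add(Dataset(name=f"ds-{i}"))
        session.commit()
        # Discard loaded attribute state; combined with SQLAlchemy's
        # weakly-referencing identity map, this lets objects from earlier
        # iterations be garbage-collected instead of accumulating.
        session.expire_all()
    row_count = session.query(Dataset).count()
```

The data itself is unaffected: only the in-memory state cached on the ORM instances is released.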
lewisjared added a commit that referenced this pull request on Feb 25, 2026
* origin/main: (86 commits)
  Bump version: 0.11.0 → 0.11.1
  docs: add changelog for #567
  fix: reduce memory during ingestion and add solve logging
  fix: prevent DRS re-ingestion from regressing finalised datasets
  Bump version: 0.10.0 → 0.11.0
  chore: Update comment
  chore: upgrade pins for ilamb
  fix: revert compat=override on open_mfdataset
  docs: add changelog for #565
  chore: Upgrade lockfile and fix some errors
  chore: add coverage
  chore: add default separator in alembic
  fix: time_coder warning
  chore: Pin to use tas
  fix(solver): preserve DataCatalog wrapper in apply_dataset_filters
  fix(tests): use to_frame() when accessing DataCatalog in solver tests
  docs: Changelog
  chore: run the finalise in threads
  chore: clean up
  chore: add fix changelog entry for PR #561
  ...
Description
Fixes three issues discovered when reviewing the DRS ingest -> finalise -> re-ingest flow:
1. **pd.NA comparison crash in `update_or_create`**: The `!=` comparison with `pd.NA` raises `TypeError: boolean value of NA is ambiguous` when comparing existing DB values against incoming `pd.NA` from the DRS parser. Fixed by introducing `_values_differ()`, which safely handles `pd.NA`, `np.nan`, and `None`.
2. **DRS re-ingestion regresses finalised datasets**: Re-ingesting with the DRS parser would overwrite finalised metadata with `pd.NA` and set `finalised=False`, losing all work from finalisation. Fixed by skipping already-finalised datasets during unfinalised (DRS) ingestion, while still adding any new files.
3. **Unbounded memory growth during ingestion**: The SQLAlchemy session identity map accumulates all `Dataset` and `DatasetFile` ORM objects across the entire ingestion loop without releasing them. For large archives this causes memory to grow to 100 GB+. Fixed by calling `db.session.expire_all()` after each dataset commit.

Also adds an info-level log in `ExecutionSolver.solve()` to show which diagnostic is being solved.

Checklist
Please confirm that this pull request has done the following:
changelog/
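A toy sketch of the finalised-dataset guard described in the PR: a plain dict stands in for the database and the names (`store`, `slug`, `metadata`) are illustrative, not the project's ORM. The key behaviour is that a finalised dataset's metadata and `finalised` flag are never touched by unfinalised ingestion, while new files are still registered.

```python
# Illustrative in-memory version of the skip-finalised guard; not the
# project's real update_or_create / ingest_datasets code.
class Dataset:
    def __init__(self, slug, metadata, finalised=False):
        self.slug = slug
        self.metadata = metadata
        self.finalised = finalised
        self.files = []


def ingest_dataset(store, slug, metadata, files):
    existing = store.get(slug)
    if existing is not None and existing.finalised:
        # Skip metadata updates entirely: DRS-parsed (unfinalised) values
        # must not overwrite finalised metadata or flip finalised=False.
        # New files are still added.
        existing.files.extend(f for f in files if f not in existing.files)
        return existing
    if existing is None:
        existing = store[slug] = Dataset(slug, metadata)
    else:
        existing.metadata = metadata
    existing.files.extend(f for f in files if f not in existing.files)
    return existing
```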