Skip to content

fix(ingest): keep files absent from current ingest, warn instead of abort#677

Merged
lewisjared merged 2 commits into
mainfrom
fix/streaming-ingest-dataset-split
May 15, 2026
Merged

fix(ingest): keep files absent from current ingest, warn instead of abort#677
lewisjared merged 2 commits into
mainfrom
fix/streaming-ingest-dataset-split

Conversation

@lewisjared
Copy link
Copy Markdown
Contributor

@lewisjared lewisjared commented May 15, 2026

Summary

ref datasets ingest aborts with NotImplementedError("Removing files is not yet supported") whenever the same instance_id is registered twice in one run with a different set of file paths each time.

Honestly this should never happen under the DRS... but it does...

The complete parser reads attributes from file content, not from the path. The following instance_ids have identical files (which conflict with their metadata on ESGF).

  • CMIP6.DAMIP.IPSL.IPSL-CM6A-LR.hist-GHG.r1i1p1f1.Amon.psl.gr.v20180914
  • CMIP6.DAMIP.NASA-GISS.GISS-E2-1-G.hist-GHG.r1i1p1f1.Amon.ts.gn.v20180907

This PR makes register_dataset treat files absent from the current ingest slice as kept-in-place with a warning, instead of raising. This will likely break an execution which will now see both datasets, but I'd rather that than work around this because its just wrong.

Test plan

  • Replaced test_register_dataset_raises_on_removal with test_register_dataset_keeps_absent_files_with_warning asserting the new union-then-warn behaviour
  • pytest packages/climate-ref/tests/unit/datasets/ (203 passed, 1 xfailed)
  • ruff check, ruff format, mypy (pre-commit)

Closes #676

lewisjared added a commit that referenced this pull request May 15, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 15, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

Flag Coverage Δ
core 93.10% <100.00%> (+0.01%) ⬆️
providers 91.85% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...kages/climate-ref/src/climate_ref/datasets/base.py 98.75% <100.00%> (+0.59%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@lewisjared lewisjared changed the title fix(ingest): merge same-dataset files across streaming chunks fix(ingest): keep files absent from current ingest, warn instead of abort May 15, 2026
…bort

`ref datasets ingest` aborted with
`NotImplementedError("Removing files is not yet supported")` whenever
the same `instance_id` was registered twice in one run with a different
set of file paths each time. The streaming `--chunk-size` path makes
this dramatically more reachable, but it is a property of any catalog
where two distinct paths produce the same `instance_id` -- which
happens routinely with the content-based CMIP6 `complete` parser when
a mirror stores the same netCDF file under two different parent
activity directories (e.g. `/CMIP6/CMIP/IPSL/...` and
`/CMIP6/DAMIP/IPSL/...` for the same DAMIP-published file).

Treat files absent from the current ingest slice as kept-in-place with
a warning instead of raising. This matches the long-standing `TODO`
about preserving datasets that diagnostics may have already used, and
handles the cross-chunk-distance case uniformly -- the failing chunks
can be arbitrarily far apart in `os.walk()` order.

Closes #676
@lewisjared lewisjared force-pushed the fix/streaming-ingest-dataset-split branch from 668c898 to 82fbd00 Compare May 15, 2026 13:06
@lewisjared lewisjared merged commit 2da73bf into main May 15, 2026
26 checks passed
@lewisjared lewisjared deleted the fix/streaming-ingest-dataset-split branch May 15, 2026 13:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ingest: NotImplementedError when files are removed from a dataset directory

1 participant