feat(ingest): stream catalog in chunks to bound ingest memory by lewisjared · Pull Request #674 · Climate-REF/climate-ref

lewisjared · 2026-05-14T06:27:13Z

Summary

New --chunk-size option on ref datasets ingest (and matching ingest_datasets(chunk_size=...) API) that walks the input tree once and parses+validates+registers chunk_size files at a time, flushing the in-memory DataFrame between chunks. Chunks split only at directory boundaries so CMIP6 DRS datasets are never partial when handed to register_dataset.
Streaming is currently CMIP6-only (other adapters fall back to the existing whole-tree path with a warning).
Removes one of the two known 2× peaks in the pipeline by letting build_instance_id skip its defensive .copy() when the caller opts in via copy=False (default stays True so existing no-mutation tests pass).
IngestionStats.__iadd__ lets the streaming path roll per-chunk stats into a single final summary.

Memory impact

Run via the new scripts/benchmark_ingest_memory.py on a synthetic CMIP6 DRS tree of empty .nc files:

files	datasets	chunk_size	baseline peak	streaming peak	ratio
4,000	40	500	11.4 MiB	6.4 MiB	1.8×
20,000	200	2,000	59.3 MiB	12.5 MiB	4.8×
30,000	300	5,000	87.3 MiB	24.6 MiB	3.5×

Wall time is unchanged (within noise). The benchmark forces cmip6_parser = "drs" because synthetic .touch() files cannot be opened by netCDF4.

Test plan

uv run pytest packages/climate-ref/tests/unit — 820 passed, 2 skipped, 2 xfailed
uv run pre-commit run --files <changed> — ruff + mypy clean
New unit tests:
- TestIterDiscoveredChunks — directory-boundary respect, empty roots, single-file inputs
- TestIterBuiltCatalogs — invalid filtering per chunk, empty inputs
- TestCMIP6IterLocalDatasets — asserts streaming yields the same rows as find_local_datasets on the sample CMIP6 archive so the two paths cannot drift
Manual benchmark with scripts/benchmark_ingest_memory.py at 4k / 20k / 30k file counts (table above)

Out of scope (follow-ups)

Extending streaming to CMIP7 / obs4MIPs / PMP adapters (one helper method each; mirrors CMIP6's iter_local_datasets + _enrich_parsed_catalog split).

The dataset ingest pipeline previously held the entire parsed catalog in memory as a single pandas DataFrame, so peak RSS scaled linearly with the number of files in the input tree. A 30k-file archive needed ~87 MiB peak; projecting to 1M+ files would OOM on modest hosts. This change adds a new chunked streaming path that walks the tree once and parses + validates + registers ``chunk_size`` files at a time, flushing the in-memory DataFrame between chunks. Chunks only split at directory boundaries so files belonging to the same CMIP6 dataset (which live in a single DRS version directory) stay together and ``register_dataset`` never sees a partial dataset. User-facing surface: - ``ref datasets ingest --chunk-size N`` enables streaming on the CLI (currently CMIP6 only; non-CMIP6 adapters fall back to the legacy path with a warning). - ``ingest_datasets(..., chunk_size=N)`` is the programmatic equivalent. Internals: - New ``iter_discovered_chunks`` and ``iter_built_catalogs`` generators in ``catalog_builder`` so chunked walking is shared infrastructure. - ``CMIP6DatasetAdapter`` exposes ``iter_local_datasets`` and shares post-parse enrichment with ``find_local_datasets`` via the new ``_enrich_parsed_catalog`` helper. - ``build_instance_id`` now accepts ``copy=False`` so the streaming enrichment path avoids an extra full-DataFrame copy (kills one of the two known 2× peaks). Default stays ``copy=True`` so the no-mutation contract is preserved for existing callers. - ``IngestionStats`` learned ``__iadd__`` so chunked runs can roll their per-chunk stats into a single final summary. Diagnostics: - ``scripts/benchmark_ingest_memory.py`` synthesises a CMIP6 DRS tree of empty .nc files and reports baseline vs streaming peak memory via ``tracemalloc``. On 30k files / 300 datasets / chunk_size=5000 the peak drops from ~87 MiB to ~25 MiB (3.5× lower) with no wall-time regression. Tests: - New coverage for ``iter_discovered_chunks`` (directory-boundary respect, empty roots, single-file inputs) and ``iter_built_catalogs`` (invalid filtering per chunk, empty inputs). - New ``TestCMIP6IterLocalDatasets`` asserts streaming yields exactly the same rows as ``find_local_datasets`` on the sample data so the two paths cannot drift apart.

codecov · 2026-05-14T06:37:24Z

Codecov Report

❌ Patch coverage is 91.78082% with 18 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...ckages/climate-ref/src/climate_ref/cli/datasets.py	75.00%	7 Missing and 3 partials ⚠️
...te-ref/src/climate_ref/datasets/catalog_builder.py	92.59%	1 Missing and 3 partials ⚠️
...ages/climate-ref/src/climate_ref/datasets/cmip6.py	87.50%	0 Missing and 3 partials ⚠️
...ages/climate-ref/src/climate_ref/datasets/utils.py	95.45%	0 Missing and 1 partial ⚠️

Flag	Coverage Δ
core	`93.09% <91.78%> (-0.08%)`	⬇️
providers	`91.85% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
...s/climate-ref/src/climate_ref/datasets/__init__.py	`98.07% <100.00%> (+0.97%)`	⬆️
...ages/climate-ref/src/climate_ref/datasets/cmip7.py	`91.17% <100.00%> (+1.34%)`	⬆️
...ages/climate-ref/src/climate_ref/datasets/utils.py	`91.80% <95.45%> (-0.44%)`	⬇️
...ages/climate-ref/src/climate_ref/datasets/cmip6.py	`91.66% <87.50%> (-0.40%)`	⬇️
...te-ref/src/climate_ref/datasets/catalog_builder.py	`96.03% <92.59%> (-2.11%)`	⬇️
...ckages/climate-ref/src/climate_ref/cli/datasets.py	`87.62% <75.00%> (-3.34%)`	⬇️

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copilot

Pull request overview

Adds a chunked/streaming ingest path to bound peak memory during dataset ingestion (currently CMIP6-only), with supporting catalog-builder iterators, CLI/API surface, and unit tests.

Changes:

Add chunked discovery/parsing (iter_discovered_chunks, iter_built_catalogs) and a CMIP6 iter_local_datasets(..., chunk_size=...) streaming path.
Extend ingestion pipeline with --chunk-size (CLI) and ingest_datasets(..., chunk_size=...) (API), plus stats accumulation via IngestionStats.__iadd__.
Optimize build_instance_id to avoid iterrows() and allow callers to opt out of a defensive DataFrame copy.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file

File	Description
scripts/benchmark_ingest_memory.py	New standalone benchmark to compare memory/time for whole-tree vs chunked ingest.
ruff.toml	Add per-file ignores for `scripts/*.py` (docstrings, lazy imports, etc.).
packages/climate-ref/tests/unit/datasets/test_cmip6.py	Add tests ensuring CMIP6 streaming yields the same catalog as whole-tree parsing.
packages/climate-ref/tests/unit/datasets/test_catalog_builder.py	Add tests for directory-boundary chunking and per-chunk invalid-row filtering.
packages/climate-ref/src/climate_ref/datasets/utils.py	Add `copy=` flag and optimize `build_instance_id` construction.
packages/climate-ref/src/climate_ref/datasets/cmip6.py	Refactor enrichment into a shared helper and add `iter_local_datasets` streaming API.
packages/climate-ref/src/climate_ref/datasets/catalog_builder.py	Add chunked discovery/parsing iterators and centralize invalid-row filtering.
packages/climate-ref/src/climate_ref/datasets/init.py	Add streaming ingest support (`chunk_size`), refactor ingestion loop, and accumulate stats.
packages/climate-ref/src/climate_ref/cli/datasets.py	Add `--chunk-size` option and implement streaming ingest path in CLI.
changelog/674.feature.md	Document the new `--chunk-size` feature and its memory behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

* origin/main: docs: reformat docs: add changelog for PR #673 fix(ingest): stop re-ingesting unchanged CMIP6 files every run

- Clarify benchmark reports tracemalloc Python-allocator peak, not RSS. - Fix benchmark files_per_dataset docstring/impl mismatch (use ceil division). - Drop redundant skip_invalid kwarg on the chunked ingest_datasets call; the catalog is already validated, and the early return ignores it.

Mirror the CMIP6 split: factor the post-parse enrichment out of `find_local_datasets` into `_enrich_parsed_catalog` and add `iter_local_datasets` to `CMIP7DatasetAdapter`. Discovery walks the tree once and chunks flush at directory boundaries so a CMIP7 DRS dataset (which lives in one version directory) is never split across chunks. The CLI `--chunk-size` option now applies to CMIP7 ingest as well as CMIP6. Adapter help text updated accordingly. Adds `TestCMIP7IterLocalDatasets` (matches/whole-tree, non-empty chunks, empty-enriched chunk filtered) to mirror the CMIP6 coverage.

lewisjared added 2 commits May 14, 2026 16:26

chore(changelog): rename news fragment to PR number

baded9b

lewisjared requested a review from Copilot May 14, 2026 08:13

Copilot started reviewing on behalf of lewisjared May 14, 2026 08:13 View session

Copilot AI reviewed May 14, 2026

View reviewed changes

Comment thread scripts/benchmark_ingest_memory.py Outdated

Comment thread scripts/benchmark_ingest_memory.py Outdated

Comment thread packages/climate-ref/src/climate_ref/cli/datasets.py Outdated

lewisjared added 5 commits May 14, 2026 20:13

Merge remote-tracking branch 'origin/main' into feat/ingest-streaming

6bdbb81

* origin/main: docs: reformat docs: add changelog for PR #673 fix(ingest): stop re-ingesting unchanged CMIP6 files every run

test: add test coverage

e78a3ab

chore: cleanup

190dd2b

lewisjared merged commit 074d56c into main May 14, 2026
26 checks passed

lewisjared deleted the feat/ingest-streaming branch May 14, 2026 12:24

lewisjared mentioned this pull request May 14, 2026

fix(climate-ref): bound ingest memory + bump to REF v0.14.1 lewisjared/home-infra#145

Merged

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(ingest): stream catalog in chunks to bound ingest memory#674

feat(ingest): stream catalog in chunks to bound ingest memory#674
lewisjared merged 7 commits into
mainfrom
feat/ingest-streaming

lewisjared commented May 14, 2026 •

edited

Loading

Uh oh!

codecov Bot commented May 14, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

lewisjared commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Memory impact

Test plan

Out of scope (follow-ups)

Uh oh!

codecov Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

lewisjared commented May 14, 2026 •

edited

Loading

codecov Bot commented May 14, 2026 •

edited

Loading