Skip to content

feat(ingest): stream catalog in chunks to bound ingest memory#674

Merged
lewisjared merged 7 commits into
mainfrom
feat/ingest-streaming
May 14, 2026
Merged

feat(ingest): stream catalog in chunks to bound ingest memory#674
lewisjared merged 7 commits into
mainfrom
feat/ingest-streaming

Conversation

@lewisjared
Copy link
Copy Markdown
Contributor

@lewisjared lewisjared commented May 14, 2026

Summary

  • New --chunk-size option on ref datasets ingest (and matching ingest_datasets(chunk_size=...) API) that walks the input tree once and parses+validates+registers chunk_size files at a time, flushing the in-memory DataFrame between chunks. Chunks split only at directory boundaries so CMIP6 DRS datasets are never partial when handed to register_dataset.
  • Streaming is currently CMIP6-only (other adapters fall back to the existing whole-tree path with a warning).
  • Removes one of the two known 2× peaks in the pipeline by letting build_instance_id skip its defensive .copy() when the caller opts in via copy=False (default stays True so existing no-mutation tests pass).
  • IngestionStats.__iadd__ lets the streaming path roll per-chunk stats into a single final summary.

Memory impact

Run via the new scripts/benchmark_ingest_memory.py on a synthetic CMIP6 DRS tree of empty .nc files:

files datasets chunk_size baseline peak streaming peak ratio
4,000 40 500 11.4 MiB 6.4 MiB 1.8×
20,000 200 2,000 59.3 MiB 12.5 MiB 4.8×
30,000 300 5,000 87.3 MiB 24.6 MiB 3.5×

Wall time is unchanged (within noise). The benchmark forces cmip6_parser = "drs" because synthetic .touch() files cannot be opened by netCDF4.

Test plan

  • uv run pytest packages/climate-ref/tests/unit — 820 passed, 2 skipped, 2 xfailed
  • uv run pre-commit run --files <changed> — ruff + mypy clean
  • New unit tests:
    • TestIterDiscoveredChunks — directory-boundary respect, empty roots, single-file inputs
    • TestIterBuiltCatalogs — invalid filtering per chunk, empty inputs
    • TestCMIP6IterLocalDatasets — asserts streaming yields the same rows as find_local_datasets on the sample CMIP6 archive so the two paths cannot drift
  • Manual benchmark with scripts/benchmark_ingest_memory.py at 4k / 20k / 30k file counts (table above)

Out of scope (follow-ups)

  • Extending streaming to CMIP7 / obs4MIPs / PMP adapters (one helper method each; mirrors CMIP6's iter_local_datasets + _enrich_parsed_catalog split).

The dataset ingest pipeline previously held the entire parsed catalog in
memory as a single pandas DataFrame, so peak RSS scaled linearly with the
number of files in the input tree. A 30k-file archive needed ~87 MiB peak;
projecting to 1M+ files would OOM on modest hosts.

This change adds a new chunked streaming path that walks the tree once and
parses + validates + registers ``chunk_size`` files at a time, flushing the
in-memory DataFrame between chunks. Chunks only split at directory
boundaries so files belonging to the same CMIP6 dataset (which live in a
single DRS version directory) stay together and ``register_dataset`` never
sees a partial dataset.

User-facing surface:
- ``ref datasets ingest --chunk-size N`` enables streaming on the CLI
  (currently CMIP6 only; non-CMIP6 adapters fall back to the legacy path
  with a warning).
- ``ingest_datasets(..., chunk_size=N)`` is the programmatic equivalent.

Internals:
- New ``iter_discovered_chunks`` and ``iter_built_catalogs`` generators in
  ``catalog_builder`` so chunked walking is shared infrastructure.
- ``CMIP6DatasetAdapter`` exposes ``iter_local_datasets`` and shares
  post-parse enrichment with ``find_local_datasets`` via the new
  ``_enrich_parsed_catalog`` helper.
- ``build_instance_id`` now accepts ``copy=False`` so the streaming
  enrichment path avoids an extra full-DataFrame copy (kills one of the
  two known 2× peaks). Default stays ``copy=True`` so the no-mutation
  contract is preserved for existing callers.
- ``IngestionStats`` learned ``__iadd__`` so chunked runs can roll their
  per-chunk stats into a single final summary.

Diagnostics:
- ``scripts/benchmark_ingest_memory.py`` synthesises a CMIP6 DRS tree of
  empty .nc files and reports baseline vs streaming peak memory via
  ``tracemalloc``. On 30k files / 300 datasets / chunk_size=5000 the peak
  drops from ~87 MiB to ~25 MiB (3.5× lower) with no wall-time regression.

Tests:
- New coverage for ``iter_discovered_chunks`` (directory-boundary respect,
  empty roots, single-file inputs) and ``iter_built_catalogs`` (invalid
  filtering per chunk, empty inputs).
- New ``TestCMIP6IterLocalDatasets`` asserts streaming yields exactly the
  same rows as ``find_local_datasets`` on the sample data so the two paths
  cannot drift apart.
@codecov
Copy link
Copy Markdown

codecov Bot commented May 14, 2026

Codecov Report

❌ Patch coverage is 91.78082% with 18 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
...ckages/climate-ref/src/climate_ref/cli/datasets.py 75.00% 7 Missing and 3 partials ⚠️
...te-ref/src/climate_ref/datasets/catalog_builder.py 92.59% 1 Missing and 3 partials ⚠️
...ages/climate-ref/src/climate_ref/datasets/cmip6.py 87.50% 0 Missing and 3 partials ⚠️
...ages/climate-ref/src/climate_ref/datasets/utils.py 95.45% 0 Missing and 1 partial ⚠️
Flag Coverage Δ
core 93.09% <91.78%> (-0.08%) ⬇️
providers 91.85% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...s/climate-ref/src/climate_ref/datasets/__init__.py 98.07% <100.00%> (+0.97%) ⬆️
...ages/climate-ref/src/climate_ref/datasets/cmip7.py 91.17% <100.00%> (+1.34%) ⬆️
...ages/climate-ref/src/climate_ref/datasets/utils.py 91.80% <95.45%> (-0.44%) ⬇️
...ages/climate-ref/src/climate_ref/datasets/cmip6.py 91.66% <87.50%> (-0.40%) ⬇️
...te-ref/src/climate_ref/datasets/catalog_builder.py 96.03% <92.59%> (-2.11%) ⬇️
...ckages/climate-ref/src/climate_ref/cli/datasets.py 87.62% <75.00%> (-3.34%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a chunked/streaming ingest path to bound peak memory during dataset ingestion (currently CMIP6-only), with supporting catalog-builder iterators, CLI/API surface, and unit tests.

Changes:

  • Add chunked discovery/parsing (iter_discovered_chunks, iter_built_catalogs) and a CMIP6 iter_local_datasets(..., chunk_size=...) streaming path.
  • Extend ingestion pipeline with --chunk-size (CLI) and ingest_datasets(..., chunk_size=...) (API), plus stats accumulation via IngestionStats.__iadd__.
  • Optimize build_instance_id to avoid iterrows() and allow callers to opt out of a defensive DataFrame copy.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
scripts/benchmark_ingest_memory.py New standalone benchmark to compare memory/time for whole-tree vs chunked ingest.
ruff.toml Add per-file ignores for scripts/*.py (docstrings, lazy imports, etc.).
packages/climate-ref/tests/unit/datasets/test_cmip6.py Add tests ensuring CMIP6 streaming yields the same catalog as whole-tree parsing.
packages/climate-ref/tests/unit/datasets/test_catalog_builder.py Add tests for directory-boundary chunking and per-chunk invalid-row filtering.
packages/climate-ref/src/climate_ref/datasets/utils.py Add copy= flag and optimize build_instance_id construction.
packages/climate-ref/src/climate_ref/datasets/cmip6.py Refactor enrichment into a shared helper and add iter_local_datasets streaming API.
packages/climate-ref/src/climate_ref/datasets/catalog_builder.py Add chunked discovery/parsing iterators and centralize invalid-row filtering.
packages/climate-ref/src/climate_ref/datasets/init.py Add streaming ingest support (chunk_size), refactor ingestion loop, and accumulate stats.
packages/climate-ref/src/climate_ref/cli/datasets.py Add --chunk-size option and implement streaming ingest path in CLI.
changelog/674.feature.md Document the new --chunk-size feature and its memory behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread scripts/benchmark_ingest_memory.py Outdated
Comment thread scripts/benchmark_ingest_memory.py Outdated
Comment thread packages/climate-ref/src/climate_ref/cli/datasets.py Outdated
* origin/main:
  docs: reformat
  docs: add changelog for PR #673
  fix(ingest): stop re-ingesting unchanged CMIP6 files every run
- Clarify benchmark reports tracemalloc Python-allocator peak, not RSS.
- Fix benchmark files_per_dataset docstring/impl mismatch (use ceil division).
- Drop redundant skip_invalid kwarg on the chunked ingest_datasets call;
  the catalog is already validated, and the early return ignores it.
Mirror the CMIP6 split: factor the post-parse enrichment out of
`find_local_datasets` into `_enrich_parsed_catalog` and add
`iter_local_datasets` to `CMIP7DatasetAdapter`. Discovery walks the
tree once and chunks flush at directory boundaries so a CMIP7 DRS
dataset (which lives in one version directory) is never split across
chunks.

The CLI `--chunk-size` option now applies to CMIP7 ingest as well as
CMIP6. Adapter help text updated accordingly.

Adds `TestCMIP7IterLocalDatasets` (matches/whole-tree, non-empty
chunks, empty-enriched chunk filtered) to mirror the CMIP6 coverage.
@lewisjared lewisjared merged commit 074d56c into main May 14, 2026
26 checks passed
@lewisjared lewisjared deleted the feat/ingest-streaming branch May 14, 2026 12:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants