Feature/base harvester #1

Merged
sray014 merged 87 commits into main from feature/base-harvester
Apr 9, 2026
@slawler (Member) commented Mar 21, 2026

v0.1.0 Release

Base Harvester class and implementations added.

Overview

This PR brings cosecha to v0.1.0 with support for gridded and time-series data harvesting, including NetCDF output, Apache Iceberg catalogs, and USGS data retrieval.

Key Features

Data Input Support

  • NWIS Reaper: Hydrological observations from USGS NWIS API
  • NWP Reaper: Numerical weather prediction data (HRRR, GFS, etc.) via herbie-data
  • MRMS Reaper: Multi-Radar/Multi-Sensor System (MRMS) analysis data.
  • Extensible base classes for custom reapers

Data Output Support

  • sow_to_netcdf: Write gridded data to NetCDF format using h5netcdf (pure Python, no system dependencies)
  • sow_to_iceberg: Write time-series data to Apache Iceberg tables with local file-based catalogs and ACID support
  • sow_to_zarr: Gridded data in Zarr format with chunking and compression
  • sow_to_parquet: Time-series data in Parquet format
  • sow_to_icechunk: Gridded data in IceChunk format

Dependencies

  • Using h5py>=3.0 for the NetCDF feature (required HDF5 backend for h5netcdf)
  • Using herbie-data>=2026.3.0 for NWP data access (fixed package naming issue)
  • Using sqlalchemy>=2.0.47,<3 for Iceberg local catalog support

Testing

  • Added 61 tests
  • All pre-commit checks passing

Notes

This is the first major release. No migration needed for new users.

Closes

  • Fixes #N/A (initial release)

Taher Chegini added 15 commits April 8, 2026 16:40
Add a multi-source flood analysis notebook using Hurricane Beryl
(July 2024, Houston TX) that demonstrates all three reapers together:
USGS streamflow, MRMS radar precipitation, and HRRR forecast data.
Remove the old standalone NWIS and HRRR example notebooks. Update
mkdocs nav and examples index with the new notebook.

Remove SpatialBoundsError (never raised). Stop re-exporting individual
exceptions from reaping and top-level packages since the exceptions
module is already exported directly.

Rename utils.py to _utils.py and logging.py to _logging.py since
they are internal implementation details. Public access to
configure_logger remains via cosecha.__init__.

@cheginit (Collaborator) commented Apr 8, 2026

Here's a brief summary of changes that I made:

Design & API consistency

  • Standardized time parameters: public APIs accept str only; values are parsed via pd.to_datetime() in __init__ and held as pd.Timestamp internally.
  • Added _ensure_data() to ReaperBase with @overload return types, eliminating duplicated guard clauses across 5 sow_to_* methods.
  • Unified exception hierarchy: no raw ValueError escapes reaper code — all mapped to ReaperError, DateRangeError, or APIError.
  • Shared to_180() utility replaces herbie accessor and MRMS private method; handles 1D and 2D grids.
  • Used load_catalog as context manager in sow_to_iceberg to prevent SQLite connection leaks.
  • Aligned __init__.py exports; deleted dead data_models.py module.
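
For readers unfamiliar with the pattern, the unified exception hierarchy can be sketched roughly as follows. This is an illustration, not the code in this PR: the subclass relationship between DateRangeError and ReaperError is an assumption, parse_range is a hypothetical helper, and datetime.fromisoformat stands in for pd.to_datetime() only to keep the sketch dependency-free.

```python
from datetime import datetime


class ReaperError(Exception):
    """Base class for errors raised by reaper code (hierarchy assumed)."""


class DateRangeError(ReaperError):
    """Raised when a time range cannot be parsed or is out of order."""


def parse_range(start: str, end: str) -> tuple[datetime, datetime]:
    """Hypothetical helper: parse two time strings, never leaking ValueError."""
    try:
        # Stand-in for pd.to_datetime(); raises ValueError on bad input.
        t0 = datetime.fromisoformat(start)
        t1 = datetime.fromisoformat(end)
    except ValueError as exc:
        # Re-raise as a domain error, keeping the original as the cause.
        raise DateRangeError(f"Cannot parse time range {start!r} to {end!r}") from exc
    if t1 < t0:
        raise DateRangeError(f"End {end!r} is earlier than start {start!r}")
    return t0, t1
```

Callers can then catch ReaperError alone and still see the original ValueError through the exception's __cause__.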

Logging

  • Replaced Rich-based logging with plain logging.StreamHandler; dropped rich dependency.
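
A plain StreamHandler setup along these lines needs only the standard library. The sketch below is a guess at the general shape; the signature, default logger name, and format string are assumptions, not the actual configure_logger in cosecha.

```python
import logging
import sys


def configure_logger(level: int = logging.INFO, name: str = "cosecha") -> logging.Logger:
    """Attach a single plain StreamHandler to the named logger (sketch)."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    # Only add a handler once, so repeated calls do not duplicate output.
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stderr)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    # Keep messages from also being emitted by the root logger's handlers.
    logger.propagate = False
    return logger
```

Calling it twice returns the same logger with one handler, which keeps notebook re-runs from printing each message multiple times.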

Type checking

  • 0 pyright strict-mode errors (down from 7). _ensure_data() and to_180() use @overload for type narrowing without assert or cast.
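
One way @overload can narrow a guard method's return type without assert or cast is shown below. This is a minimal sketch under stated assumptions: Grid and Table are hypothetical stand-ins for the real gridded and time-series containers, the kind argument is invented for illustration, and the actual _ensure_data() signature may differ.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Literal, overload


class ReaperError(Exception):
    """Domain error raised when a reaper is used incorrectly."""


@dataclass
class Grid:  # hypothetical stand-in for gridded data
    values: list[list[float]]


@dataclass
class Table:  # hypothetical stand-in for time-series data
    rows: list[dict[str, float]]


class ReaperBase:
    def __init__(self) -> None:
        self._data: Grid | Table | None = None

    @overload
    def _ensure_data(self, kind: Literal["grid"]) -> Grid: ...
    @overload
    def _ensure_data(self, kind: Literal["table"]) -> Table: ...
    def _ensure_data(self, kind: Literal["grid", "table"]) -> Grid | Table:
        """Return harvested data of the requested kind, or raise ReaperError."""
        data = self._data
        if data is None:
            raise ReaperError("No data available; harvest data before writing.")
        expected = Grid if kind == "grid" else Table
        if not isinstance(data, expected):
            raise ReaperError(f"Expected {kind} data, got {type(data).__name__}.")
        return data
```

A writer method can then call self._ensure_data("grid") and a type checker sees a Grid, so the guard clause lives in one place instead of in every sow_to_* method.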

Testing

  • Coverage: 47% → 96% (128 tests: 125 non-network + 3 network).
  • New test files for utils, logging, and base reaper classes. Extended all three reaper test files.
  • Added @pytest.mark.network integration tests for NWIS, NWP (HRRR), and MRMS (S3).
  • Fixed stale coverage data caused by --cov-append; cleaned up warning filters.

CI

  • Full matrix (3 OS × 2 Python) runs non-network tests; network tests run once on ubuntu-latest / py314.

Documentation

  • Added a new example for Hurricane Beryl, TX.
  • Fixed missing/incorrect docstring parameters and types. Removed 22 trivial comments; verified all markdown and docs are up to date.

Taher Chegini added 4 commits April 8, 2026 17:35
ecCodes C library was missing on Windows CI because it wasn't in the
pixi conda dependencies. Also broadened except clauses in conftest.py
to catch RuntimeError from cfgrib when ecCodes is unavailable.
Updated README with installation guidance for ecCodes.

PyArrowFileIO misparses Windows drive letters (D:) as URI schemes.
FsspecFileIO delegates to fsspec which handles native OS paths correctly
on all platforms.

Use fsspec for parquet writes, pass URI strings through to xarray for
zarr, and parse S3 URIs into bucket/prefix for icechunk storage.
Add moto-based S3 tests for parquet and zarr.
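
Splitting an S3 URI into bucket and prefix can be done with the standard library alone. The helper below is a hypothetical sketch (split_s3_uri is not a name from this PR) of what that parsing step might look like:

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Hypothetical helper: split 's3://bucket/some/prefix' into its parts."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    # netloc is the bucket; the path carries a leading slash we do not want.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_uri("s3://example-bucket/stores/hrrr")
print(bucket, prefix)
```

The bucket name and prefix here are placeholders; a bare bucket URI such as s3://example-bucket yields an empty prefix.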
@cheginit cheginit force-pushed the feature/base-harvester branch from 5a25b0a to 7e35d20 Compare April 8, 2026 22:07
FsspecFileIO also parses drive letters as URI schemes. Use
Path.as_uri() to produce file:/// URIs that resolve correctly
on all platforms.
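
The drive-letter problem and the Path.as_uri() fix can be reproduced on any OS with the pure-path classes. The path below is an arbitrary example, not one taken from the test suite:

```python
from pathlib import PureWindowsPath
from urllib.parse import urlparse

raw = r"D:\data\warehouse"  # arbitrary example of a Windows path

# A generic URI parser treats the drive letter as a scheme.
print(urlparse(raw).scheme)  # 'd'

# as_uri() yields an unambiguous file URI for the same location.
uri = PureWindowsPath(raw).as_uri()
print(uri)  # 'file:///D:/data/warehouse'
print(urlparse(uri).scheme)  # 'file'
```

On a real Windows machine, Path(...).resolve().as_uri() gives the same form; as_uri() requires an absolute path and raises ValueError otherwise.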
@sray014 sray014 merged commit e69354b into main Apr 9, 2026
14 checks passed
@sray014 sray014 deleted the feature/base-harvester branch April 9, 2026 20:02

3 participants