
WIP: perf(waterdata): Optional parallel chunk processing in multi_value_chunked (draft) #278

Draft
thodson-usgs wants to merge 21 commits into DOI-USGS:main from thodson-usgs:chunker-async-experiment

Conversation

@thodson-usgs
Collaborator

Draft on top of #276. Adds an opt-in max_workers kwarg to @multi_value_chunked; when set > 1, sub-requests run concurrently in a ThreadPoolExecutor. Default (None) keeps the existing sequential behavior, including the mid-call quota guard and QuotaExhausted.partial_frame resumability.
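
A minimal sketch of the dispatch switch, assuming fetch_once takes one chunk's kwargs and returns its result (helper name and shapes are illustrative, not the PR's exact code):

from concurrent.futures import ThreadPoolExecutor

def _dispatch_chunks(fetch_once, sub_args_list, max_workers=None):
    # max_workers=None (default): sequential loop. This is where the
    # mid-call quota guard and partial_frame resumability live.
    if max_workers is None or max_workers <= 1:
        return [fetch_once(args) for args in sub_args_list]
    # max_workers > 1: fan out. executor.map yields results in input
    # order, but if any chunk raises, completed-but-uncollected results
    # are discarded along with it, hence the forfeited guarantees.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_once, sub_args_list))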

Opening draft to capture the benchmark and the truncation issue surfaced under load; not ready to merge.

Benchmark — 671 CAMELS sites × get_field_measurements, 337,808 rows, ~14 paginated chunks

| config    | mean ± stdev  | notes                                    |
|-----------|---------------|------------------------------------------|
| sync      | 82.1s ± 12.1s | reference                                |
| workers=2 | 100.2s        | unstable; one run truncated by 429       |
| workers=4 | 51.5s ± 2.8s  | ~1.6×; one run lightly truncated         |
| workers=8 | crashed       | 429 mid-flight, unusable                 |

The ceiling is the per-IP rate budget USGS enforces, not local CPU. Parallelism helps modestly on pagination-bound workloads and not at all on the common metadata case (already <1.5s).

Tradeoffs documented in the docstring

Parallel mode forfeits the mid-call quota guard (workers race past the floor before any one observes the crossing) and QuotaExhausted.partial_frame resumability (executor.map discards completed-but-uncollected results when any chunk raises). Sequential remains the default.

Open issue blocking merge: silent truncation under parallel load

Two of the parallel runs above returned fewer rows than the sync baseline (257,634 and 330,174 vs 337,808) with no exception raised. _walk_pages returns whatever rows it accumulated when a paginated request hits 429 mid-stream; the chunker then combines those partial frames into a full-looking result. This pre-existed serially but is much easier to trigger under parallelism.

Before this can merge we need:

- _walk_pages to surface mid-pagination 429s instead of returning partial rows silently, or
- the parallel branch to detect partial-pagination responses and raise.

I lean toward fixing _walk_pages — it's the same class of silent-truncation bug PR #273 fixed for the non-429 case.

Refs #276

thodson-usgs and others added 20 commits May 14, 2026 08:24
The OGC API now supports comma-separated values for fields like
monitoring_location_id, parameter_code, and statistic_id, making POST+CQL2
unnecessary for most services. Update _construct_api_requests to join list
params with commas and use GET for daily, continuous, latest-daily,
latest-continuous, field-measurements, time-series-metadata, and
channel-measurements.

The monitoring-locations endpoint does not yet support comma-separated GET
parameters (returns 400); it retains the POST+CQL2 path. Closes DOI-USGS#210.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
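
A sketch of the comma-join described above (dict-comprehension shape illustrative); urlencode's default quoting then carries each comma as %2C:

from urllib.parse import urlencode

def _join_list_params(params):
    # Lists collapse to one comma-separated value per parameter; scalars
    # pass through untouched. str(x) handles non-string members like ints.
    return {
        k: ",".join(str(x) for x in v) if isinstance(v, list) else v
        for k, v in params.items()
    }

query = urlencode(_join_list_params(
    {"parameter_code": ["00060", "00065"], "limit": 10}
))
# -> "parameter_code=00060%2C00065&limit=10"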
…nit tests

Style alignment with the rest of `waterdata/utils.py`:

- Hoist `_cql2_required_services` from a function-local lowercase `set`
  to a module-level `_CQL2_REQUIRED_SERVICES = frozenset(...)` to match
  the convention of `_DATE_RANGE_PARAMS`, `_NO_NORMALIZE_PARAMS`,
  `_MONITORING_LOCATION_ID_RE`, etc.
- Drop the "Legacy path:" prefix in the inline comment. POST/CQL2 is
  still the current and required path for monitoring-locations — the
  API team hasn't promised to add comma-GET there. Rephrased the two
  branches symmetrically ("POST with CQL2 JSON" / "GET with comma-
  separated values") so neither reads as deprecated.

New unit tests:

- `test_construct_api_requests_single_value_stays_get` — confirms a
  scalar `monitoring_location_id="USGS-..."` still produces a clean GET
  with no `%2C`, i.e. existing single-site callers see no change.
- `test_construct_api_requests_numeric_list_joins_with_str` — pins down
  that `water_year=[2020, 2021]` reaches the URL as `water_year=2020%2C2021`,
  exercising the `str(x) for x in v` generator that exists specifically
  to handle non-string list params (without it, `",".join` on a list of
  ints would TypeError).
- `test_construct_api_requests_two_element_date_list_becomes_interval` —
  pins down the contract that a two-element date list (`time=["2024-01-01",
  "2024-01-31"]`) is interpreted as start/end of an OGC datetime interval
  (joined with `/`), NOT as two discrete dates. The OGC `datetime`
  parameter doesn't support "these N specific dates" — that would
  require a CQL filter. Test exists so this semantic choice can't be
  silently changed (see the sketch below).
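
A sketch of the interval semantics that last test pins down (function name hypothetical):

def _time_param(time):
    # A two-element list is start/end of ONE OGC datetime interval,
    # joined with "/", NOT two discrete dates.
    if isinstance(time, list) and len(time) == 2:
        return "/".join(time)
    return time

assert _time_param(["2024-01-01", "2024-01-31"]) == "2024-01-01/2024-01-31"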
Wraps _fetch_once with a cartesian-product chunker that sits OUTSIDE
@filters.chunked. Splits multi-value list params (monitoring_location_id,
parameter_code, statistic_id, etc.) across sub-requests so each URL fits
the server's ~8 KB byte limit.

Coordination with @filters.chunked: the planner's URL probe substitutes
the filter with its longest top-level OR-clause via _filter_aware_probe_args,
modeling the per-sub-request URL the inner filter chunker will actually
emit. Without this coordination, a long OR-filter plus multi-value lists
triggered premature RequestTooLarge even when the combined chunkers would
have made things fit.

Two safety guards:
- max_chunks=1000 cap on cartesian-product size (matches USGS API hourly
  quota; raises RequestTooLarge with the actual count when exceeded).
- QuotaExhausted abort: between sub-requests, reads x-ratelimit-remaining;
  if below quota_safety_floor (default 50), raises with the partial frame
  and chunk offset so callers can resume instead of crashing into a
  mid-call HTTP 429.

30 unit tests cover the planner, filter-aware coordination, the cap, and
the quota-aware abort. Live tests in /tmp verify a 3-dim equivalence case
(chunked == unchunked, 16 sub-requests, all axes split), 6 edge-case
stress scenarios, and 3 mv/filter composition regimes.

Depends on DOI-USGS#273 (paginated silent-truncation fix) — this PR multiplies
the frequency at which the silent-truncation bug class would have
surfaced. Merge order: DOI-USGS#273 -> DOI-USGS#233 -> this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
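
A sketch of the quota-aware abort between sub-requests, assuming fetch_once returns (frame, response). QuotaExhausted is stubbed and simplified here; the PR's class also carries the last response:

import pandas as pd

_QUOTA_UNKNOWN = 10**9  # sentinel for a missing/malformed header

class QuotaExhausted(RuntimeError):
    def __init__(self, msg, partial_frame=None, completed_chunks=0):
        super().__init__(msg)
        self.partial_frame = partial_frame
        self.completed_chunks = completed_chunks

def _read_remaining(resp):
    try:
        return int(resp.headers["x-ratelimit-remaining"])
    except (KeyError, TypeError, ValueError):
        return _QUOTA_UNKNOWN

def _run_sequential(fetch_once, sub_args_list, quota_safety_floor=50):
    frames = []
    for i, args in enumerate(sub_args_list):
        frame, resp = fetch_once(args)
        frames.append(frame)
        # Abort BEFORE the next sub-request would crash into an HTTP 429,
        # handing back what completed so the caller can resume.
        if (i + 1 < len(sub_args_list)
                and _read_remaining(resp) < quota_safety_floor):
            raise QuotaExhausted(
                f"quota floor hit after chunk {i + 1}/{len(sub_args_list)}",
                partial_frame=pd.concat(frames, ignore_index=True),
                completed_chunks=i + 1,
            )
    return pd.concat(frames, ignore_index=True)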
The filter chunker (`filters.chunked`) splits a filter into chunks each
≤ the per-sub-request budget, but bails (returns the full filter
unchanged) when ANY single OR-clause exceeds the budget. So the smallest
filter size the inner chunker can guarantee to emit per sub-request is
bounded below by the LARGEST single clause, not the smallest.

The original implementation used `min(parts)` to model the chunker's
output floor. For filters with uniform clause sizes (all my prior
tests), min == max and the bug was hidden. For filters with lopsided
clauses — e.g. `id='1' OR id='abcdef…long-string'` — using `min` would
let the planner falsely declare a plan feasible. The inner chunker
would then bail on the large clause, the real per-sub-request URL
would carry the full filter, and the request would 414 server-side.

Switch to `max(parts, key=len)`. If singleton+max-clause fits the URL
limit, the inner chunker's budget is ≥ max(parts), so all clauses fit
individually and chunking succeeds. If singleton+max-clause doesn't
fit, the planner correctly raises `RequestTooLarge` instead of
producing an unservable plan.

Regression test: `test_plan_chunks_probes_with_max_clause_not_min`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
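
The one-line change, with a lopsided example (values illustrative):

parts = ["id='1'", "id='abcdefghij-very-long-identifier'"]

# The inner chunker bails if ANY single clause exceeds its budget, so
# the guaranteed per-sub-request floor is set by the largest clause:
probe_clause = max(parts, key=len)   # was: min(parts), the wrong direction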
The motivating user story for this PR is the same one R `dataRetrieval`
covers in #870: pull a long monitoring-location list from one getter,
feed it to another. Before chunking this fails with HTTP 414 once the
URL grows past the server's ~8 KB limit; after it transparently fans out.

- chunking.py: prepend a docstring example showing the Ohio-stream-sites
  → daily-discharge chained call, so readers landing on the module file
  see the motivating scenario immediately.
- api.py get_daily: add the same chained example to the Examples block
  (where similar single-site and multi-site examples already live), so
  the most-used getter's docstring shows what just became possible.
- NEWS.md: user-visible entry framing the change in terms of "this now
  works" — chained queries, transparent chunking, max_chunks cap, and
  QuotaExhausted resume. References R PR #870 as the analogous change.

No code changes; pure docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
….prod

Three small simplifications, no behavior change:

- Extract _chunk_bytes(chunk) helper for len(",".join(map(str, chunk))).
  Used in both _worst_case_args and _plan_chunks; the helper documents
  the cost model the planner compares chunks under.
- Name the magic sentinel 10**9 as _QUOTA_UNKNOWN. _read_remaining
  returns it on missing/malformed x-ratelimit-remaining headers; having
  one definition prevents the value from drifting between branches.
- Use math.prod for the cartesian-product cardinality calculation in
  _plan_chunks (max_chunks check) and the wrapper (quota-floor loop
  bound). Replaces an open-coded multiply-loop in two places.

All 25 chunker tests and 88 filter tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	NEWS.md
#	dataretrieval/waterdata/utils.py
#	tests/waterdata_test.py
PR 233 routes monitoring-locations through POST with the multi-value
list embedded in a CQL2 JSON body — the URL stays ~200 bytes
regardless of how many sites are passed. The chunker was probing
url length only, so it concluded "no need to split" for any number
of monitoring_location_id values and let the request go out
unchunked. The server then rejected it with HTTP 403 ("Query request
denied. Possible reasons include query exceeding server limits")
once the CQL2 body grew past its own server-side limit.

Empirical: get_monitoring_locations(monitoring_location_id=[671
CAMELS gauges], properties=[...]) failed; bisection over 100 / 250 /
500 / 671 sites placed the failure boundary between 100 (PASS) and
250 (FAIL).

Add a small _request_bytes() helper that sums URL and body lengths,
and route both planner probes (the initial "fits?" check and the
greedy-halving loop) through it. For GET routes (body is None) this
reduces to the previous URL-only sizing — no behavior change. For
POST routes, the body bytes now drive the chunking decision.

The test _FakeReq fixture grows a body slot defaulting to None to
keep its GET-shape contract while satisfying the new probe.

Verified against live API: the same four monitoring-locations calls
now succeed (100 / 250 / 500 / 671 sites).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
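
A sketch of the helper, assuming the prepared-request shape from build_request(...).prepare() (a requests.PreparedRequest, whose .body is None on GET routes):

def _request_bytes(req):
    # GET routes: body is None, so this reduces to URL-only sizing.
    # POST routes: the CQL2 JSON body drives the chunking decision.
    body = req.body or ""
    if isinstance(body, str):
        body = body.encode("utf-8")
    return len(req.url) + len(body)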
- Annotate ``req`` as ``requests.PreparedRequest`` (the only caller flow
  is ``build_request(...).prepare()``; ``requests`` is already imported).
- ``_cql2_param`` returns ``str``, which ``requests.Request(data=...)``
  carries through to ``req.body``. The hot path on POST routes was
  ``str(body).encode("utf-8")``; ``str(<str>)`` is a no-op, so drop it
  and let ``body.encode("utf-8")`` allocate once.
- Trim docstring: replaces the rotting "PR 233" / "currently only
  monitoring-locations" anchors with a behavioral description that
  doesn't rely on which routes happen to be POST today.

No behavior change. 30 chunker unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… current quota

The ``multi_value_chunked`` decorator reads ``x-ratelimit-remaining``
from the response returned by ``fetch_once(sub_args)`` to honor its
documented ``QuotaExhausted`` safety floor. That response was two
layers stale:

1. ``_walk_pages`` captured ``initial_response = resp`` before
   pagination and returned it, so any sub-request with N > 1 pages
   bubbled up only the first page's headers — the loop already kept
   overwriting ``resp`` each iteration; we just weren't returning the
   latest.
2. ``_combine_chunk_responses`` returned ``responses[0]`` with summed
   ``elapsed``, so when ``filters.chunked`` fanned out a long
   OR-filter into N sub-chunks the outer wrapper only saw the first
   sub-chunk's headers.

Composed, the staleness gap per outer chunk was
``inner_chunks × pages_per_inner_chunk − 1`` HTTP requests of quota
consumption the chunker was blind to. For the canonical workload
(chained query, long site list, paginated filter) that gap easily
exceeds the default floor of 50, so the guard never tripped — users
hit ``RuntimeError("429: Too many requests...")`` from
``_raise_for_non_200`` instead of the structured ``QuotaExhausted``
with ``partial_frame``/``completed_chunks`` they were promised.

Fix both layers: ``_walk_pages`` returns the latest ``resp`` (which
the loop was already maintaining), and ``_combine_chunk_responses``
returns ``responses[-1]`` (with ``elapsed`` summed onto it instead
of onto ``responses[0]``). Both changes match
``QuotaExhausted.partial_response``'s docstring ("metadata for the
last successful sub-request"). Same fix applied to the parallel
pagination loop in the stats helper for consistency.

No behavior change for single-page mocked tests (initial == latest).
209 waterdata unit tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``_filter_aware_probe_args`` substituted the LONGEST raw OR-clause
into the URL probe, but the inner ``filters._effective_filter_budget``
computes its bail floor as ``len(longest) * max(per_clause_encoding_ratio)``
— the worst encoding ratio across all clauses, not the ratio of the
longest one. Under lopsided encoding (e.g. a long alphanumeric
clause alongside short clauses heavy in ``%27`` / ``%2C`` / non-ASCII),
``encoding_ratio_max`` exceeds ``ratio_of_longest`` and the planner
could approve a plan the inner chunker then refuses to emit, leaving
the actual URL over the limit.

Mirror the inner chunker's model: synthesize an ASCII probe clause of
length ``ceil(len(longest) * encoding_ratio_max)``. ASCII has 1:1
URL encoding, so the URL builder sees exactly the bail-floor byte
count and the planner's check coincides with the inner chunker's
bail condition.

Dormant in practice for typical USGS CQL filters (``field='value'``
encoding ratios all cluster between 1.16 and 1.67), but the
docstring claimed a categorical guarantee that was technically false.
This restores that guarantee.

Test ``test_filter_aware_probe_args_substitutes_longest_or_clause``
was renamed and rewritten to verify the new contract: the probe
filter is a synthetic ASCII string whose length matches the bail
floor model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
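
A sketch of the synthesized probe clause, assuming each clause's encoding ratio is its URL-encoded length over its raw length (the actual ratio computation lives in filters._effective_filter_budget):

import math
from urllib.parse import quote_plus

def _synthetic_probe_clause(parts):
    longest = max(len(p) for p in parts)
    ratio_max = max(len(quote_plus(p)) / len(p) for p in parts)
    # ASCII letters URL-encode 1:1, so the URL builder sees exactly the
    # bail-floor byte count: ceil(len(longest) * encoding_ratio_max).
    return "x" * math.ceil(longest * ratio_max)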
… fixes

Five fixes from the PR review:

- ``_plan_chunks`` checks ``total > max_chunks`` inside the halving
  loop now: each split only grows the cartesian product, so once the
  cap is crossed it can never come back under. Continuing to halve
  the URL just wastes work.
- ``_plan_chunks``'s ``max_chunks`` default becomes ``int | None =
  None`` and resolves to ``_DEFAULT_MAX_CHUNKS`` at call time. The
  previous ``max_chunks: int = _DEFAULT_MAX_CHUNKS`` bound the constant
  at module-import time, defeating the documented monkeypatch path
  for direct callers (the wrapper already resolved lazily, but
  ``_plan_chunks`` direct calls saw the import-time value).
- ``_chunk_bytes`` docstring no longer claims the URL-encoded comma
  overhead is "constant per chunk" — it scales with ``2 * (len -
  1)``. The function still uses raw ``,`` length because the planner
  only needs a monotone comparator across dims, but the wording was
  wrong.
- ``QuotaExhausted.partial_response`` docstring now says "last
  completed sub-request" to match the bug_001 fix in
  ``_combine_chunk_responses``.
- Module-level docstring drops the chained-query example (duplicated
  from ``get_daily``'s docstring) and points readers there.

No behavior change for existing callers. 209 waterdata tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
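
The lazy-resolution pattern from the second item above, as a sketch (surrounding signature elided):

_DEFAULT_MAX_CHUNKS = 1000

def _plan_chunks(args, max_chunks=None):
    # Resolve at call time: a `max_chunks=_DEFAULT_MAX_CHUNKS` default
    # would freeze the constant at import, so monkeypatching
    # _DEFAULT_MAX_CHUNKS would never reach direct callers.
    if max_chunks is None:
        max_chunks = _DEFAULT_MAX_CHUNKS
    ...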
- ``_chunk_bytes`` now returns URL-encoded length via ``quote_plus``
  instead of raw ``,``-join length. The function is the planner's
  biggest-chunk comparator and indirect URL contribution estimate;
  values containing URL-special chars (``%``, ``+``, ``/``, ``&``,
  etc.) expand under encoding and could mis-rank chunks under the
  raw-length form. For typical USGS multi-value workloads
  (alphanumeric IDs and codes) the two are equal, but the encoded
  form is always correct and matches what ``_request_bytes`` sees.
- ``filters.chunked``'s docstring now says "last chunk's URL/headers"
  to match what ``_combine_chunk_responses`` returns after the
  bug_001 fix, with a note about why (rate-limit state).
- Module docstring rewrapped so identifiers (``filters._effective_
  filter_budget``, ``per-clause encoding ratio``) don't break across
  line endings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
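
A sketch of the comparator in its encoded form:

from urllib.parse import quote_plus

def _chunk_bytes(chunk):
    # URL-encoded length of the comma-joined chunk. Equal to the raw
    # join length for alphanumeric IDs and codes, but still correct when
    # values contain %, +, /, &, etc., which expand under encoding.
    return len(quote_plus(",".join(str(x) for x in chunk)))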
…ive elapsed

The bug_001 fix made paginated and filter-chunked calls return the
LAST response, which gave the chunker's QuotaExhausted guard access
to current x-ratelimit-remaining — but it also clobbered
BaseMetadata.url with a pagination cursor or final sub-chunk URL,
harming reproducibility for users who inspect md.url to re-issue or
debug a query.

Split the fields:

- ``md.url`` — first (original-request) response's URL, unchanged
  from pre-bug_001 behavior; matches the user's submitted query.
- ``md.header`` — last completed sub-request/page's headers, so
  ``x-ratelimit-remaining`` reflects current quota state and the
  chunker's QuotaExhausted guard works correctly. This is a
  behavior change from pre-bug_001 (used to be first-page headers).
- ``md.query_time`` (response.elapsed) — cumulative across all
  pages/sub-requests, not just one. Also a small behavior change.

Implemented in three sites:

- ``_walk_pages`` (utils.py): track ``initial_response`` and
  ``total_elapsed``; on exit, copy ``resp.headers`` and
  ``total_elapsed`` onto ``initial_response`` and return it.
- The parallel pagination loop in the stats helper: same pattern.
- ``_combine_chunk_responses`` (filters.py): return ``responses[0]``
  but with ``headers`` from ``responses[-1]`` and summed ``elapsed``.

NEWS entry updated to call out the metadata-behavior change.
Docstrings on ``QuotaExhausted.partial_response`` and
``filters.chunked`` refreshed to match.

All 209 waterdata tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
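
The per-site pattern, as a sketch (a later commit extracts it into _finalize_paginated_response; the signature here is illustrative):

from datetime import timedelta

def _finalize_paginated_response(initial_response, last_response,
                                 total_elapsed: timedelta):
    # md.url: first response's URL, matching the user's submitted query.
    # md.header: last page's headers, i.e. current x-ratelimit-remaining.
    # md.query_time: cumulative elapsed across all pages/sub-requests.
    initial_response.headers = last_response.headers
    initial_response.elapsed = total_elapsed
    return initial_response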
`filters._WATERDATA_URL_BYTE_LIMIT` was split across a line break
(`filters._WATERDATA_` / `URL_BYTE_LIMIT`), which renders awkwardly
in Sphinx and copies poorly. Rewrap the line so the identifier sits
intact on a single line.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five targeted cleanups from review, no behavior change:

- Drop the duplicate ``_FetchOnce`` TypeVar in chunking.py; import the
  one already defined in filters.py. The two had identical bodies.
- Extract ``_max_per_clause_encoding_ratio(parts)`` in filters.py.
  Both ``_effective_filter_budget`` and the outer
  ``_filter_aware_probe_args`` need the same worst-case ratio
  formula; pinning it in one place keeps them from drifting.
- Replace the manual ``best: tuple | None`` sentinel + nested-loop
  scan in ``_plan_chunks`` with a generator + ``max(..., key=...,
  default=None)``. Removes the sentinel, the conditional-update
  branch, and the post-loop ``if best is None`` check.
- Extract ``_finalize_paginated_response`` in utils.py so the
  4-line "carry last page's headers + cumulative elapsed onto the
  initial response" pattern lives in one spot instead of duplicated
  across ``_walk_pages`` and the stats helper.
- Tighten parametrized type hints from ``dict[str, list]`` to
  ``dict[str, list[Any]]`` (and the planner's return type) per
  PEP 585.

Also trimmed the 17-line ``_filter_aware_probe_args`` docstring to
9 lines; the substance is preserved, the prose is leaner.

All 209 waterdata tests pass; ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
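
The sentinel-free scan from the third item above, in miniature (generic names):

def pick_best(candidates, feasible, score):
    # generator + max(..., default=None) replaces the manual
    # `best = None` sentinel, the conditional-update branch, and the
    # post-loop None check.
    best = max((c for c in candidates if feasible(c)), key=score, default=None)
    if best is None:
        raise ValueError("no feasible candidate")
    return best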
Two strays left after the previous /simplify pass:

- ``_chunk_bytes(chunk: list)`` → ``chunk: list[Any]``. The function
  calls ``map(str, chunk)``, so elements just need ``__str__``;
  ``list[Any]`` matches what the planner actually passes (mixed
  ``list[str]`` for IDs/codes and ``list[int]`` for ``water_year``).
- ``_finalize_paginated_response(..., total_elapsed)`` had no
  annotation. The caller accumulates ``resp.elapsed`` (a
  ``datetime.timedelta``); add ``total_elapsed: timedelta`` and
  import ``timedelta`` alongside ``datetime`` in utils.py.

No behavior change; 209 tests pass and ruff is clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… build

Three corrections from PR review:

- ``RequestTooLarge``'s docstring said the second irreducible case is
  "any chunkable filter reduced to its smallest top-level OR-clause."
  The planner actually probes at the inner chunker's bail-floor size,
  which is bounded below by the LONGEST clause (after URL-encoding),
  not the shortest. Rewrite the case to describe what the planner
  actually does.
- ``test_plan_chunks_coordinates_with_filter_chunker``'s docstring
  said the planner models per-sub-request URL as
  ``worst-dim-chunk + shortest-clause``. Same direction error;
  corrected to ``longest-clause-after-encoding`` with the rationale
  (inner chunker's bail floor, not its happy-path output).
- ``_fake_build`` test fixture used raw ``len(",".join(...))`` for
  list params, but the real ``_construct_api_requests`` URL goes
  through ``quote_plus``. For the all-alphanumeric values these
  tests use, the gap is 2 bytes per comma — small but enough to let
  a test pass against the fake while production would have a
  larger URL. Pull ``quote_plus`` into the fake so its byte count
  matches what the chunker's ``_request_bytes`` actually measures.

No behavior change to the production chunker; 209 waterdata tests
pass with no other tunings.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five docstring fixes, no behavior change:

1. ``_plan_chunks`` docstring described the first ``RequestTooLarge``
   case as "filter (if any) is already at its smallest OR-clause" —
   same wrong-direction error already fixed on ``RequestTooLarge``.
   The planner probes at the inner chunker's BAIL floor (longest
   clause's URL contribution), not the smallest. Rewrite to "the
   smallest reducible plan" with the bail-floor clarification.
2. The ``RequestTooLarge`` raised inside the halving loop carried
   a matching wrong-direction phrase ("any chunkable filter reduced
   to one OR-clause"). Rewrite to "any chunkable filter at the
   inner chunker's bail-floor size" and broaden the user advice to
   include "shorten the filter".
3. ``_chunk_bytes`` docstring claimed it's "indirectly used as the
   URL contribution estimate" — that's not what the code does; the
   function is a comparator only. Trim the misleading sentence and
   keep the rationale for ``quote_plus`` over raw join length.
4. ``_worst_case_args`` docstring's "with the filter already reduced
   to its filter-chunker floor" was oblique. Rewrite to make the
   chain explicit: caller passes ``probe_args`` (already through
   ``_filter_aware_probe_args``), and this function uses each dim's
   largest chunk against that.
5. The ``_DEFAULT_MAX_CHUNKS`` module comment said it's "read lazily
   in the wrapper" — stale; ``_plan_chunks`` now also resolves
   lazily. Update to "both the decorator wrapper and ``_plan_chunks``".

209 waterdata tests pass; ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Draft. Adds `max_workers` to `@multi_value_chunked`; when set > 1
sub-requests run concurrently in a ``ThreadPoolExecutor``. Default
(``None``) keeps the existing sequential behavior, including the
mid-call quota guard and ``QuotaExhausted.partial_frame``
resumability. Parallel mode forfeits both: workers can race past the
floor before any one observes the crossing, and any chunk failure
discards completed-but-uncollected results.

Benchmarked on 671 CAMELS sites x ``get_field_measurements``
(337,808 rows, ~14 paginated chunks):
- sync:        82.1s +/- 12.1s
- workers=2:  100.2s (unstable; one run truncated by 429)
- workers=4:   51.5s +/- 2.8s (~1.6x; one run lightly truncated)
- workers=8:   crashed on 429 mid-flight

Open question for follow-up: under parallelism we observe partial
data being returned silently when one chunk hits a paginated 429 --
``_walk_pages`` returns the rows it already has rather than
surfacing the truncation. This pre-existed but parallelism makes
it easier to trigger. Investigate as part of any future merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@thodson-usgs
Collaborator Author

Design note: partial-frame propagation from _walk_pages

#279 makes _walk_pages and get_stats_data raise on any mid-pagination failure instead of silently returning truncated data. That closes the silent-data-loss bug at the source, but it also means any mid-pagination 429 (or 5xx, or network error) hit by a chunk in this PR's parallel branch will lose all pages collected within that chunk, and the chunker's existing QuotaExhausted.partial_frame will only carry data from chunks that completed cleanly before the failure.

For most workloads that's fine — a chunk is at most a few thousand rows × a handful of pages, and the user reruns the failed chunk on retry. But for the heavy pagination cases this PR's benchmark targets (671 sites × get_field_measurements, ~24k rows per chunk paginating across 5-10 pages), losing a chunk's near-complete paginated walk on a final-page 429 is a non-trivial regression in recoverable work.

Suggested follow-up before this PR lands

Promote _walk_pages's plain RuntimeError to a richer exception that carries the partial state, then have the chunker translate it:

# utils.py
class PaginatedRequestError(RuntimeError):
    def __init__(self, msg, *, partial_frame, last_response):
        super().__init__(msg)
        self.partial_frame = partial_frame
        self.last_response = last_response

# in _walk_pages' except block:
raise PaginatedRequestError(
    _paginated_failure_message(len(dfs), e),
    partial_frame=pd.concat(dfs, ignore_index=True),
    last_response=initial_response,
) from e

In this PR's parallel wrapper, catch PaginatedRequestError per chunk and feed its partial_frame into the QuotaExhausted(partial_frame=...) payload — same shape the existing quota guard produces between chunks, so callers downstream have one place to look for partial results.
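
A sketch of that per-chunk translation, using submit/collect instead of executor.map so a failed chunk's salvaged pages aren't discarded along with uncollected siblings (collection shape illustrative; assumes the PaginatedRequestError above and a QuotaExhausted that accepts partial_frame):

from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def _run_parallel(fetch_once, sub_args_list, max_workers):
    frames, failure = [], None
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch_once, a) for a in sub_args_list]
        for fut in futures:
            try:
                frames.append(fut.result())
            except PaginatedRequestError as e:
                frames.append(e.partial_frame)  # salvage completed pages
                failure = failure or e          # keep the first failure
    combined = pd.concat(frames, ignore_index=True)
    if failure is not None:
        raise QuotaExhausted(str(failure), partial_frame=combined) from failure
    return combined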

This is purely additive: existing except RuntimeError handlers still catch it (it inherits RuntimeError), and the message + __cause__ chain are unchanged. The .partial_frame attribute is just available for callers who want to use it.

This is the kind of "design when there's a concrete consumer" extension that #279 deliberately left out. This PR is that consumer.

@thodson-usgs thodson-usgs changed the title perf(waterdata): Optional parallel chunk processing in multi_value_chunked (draft) WIP: perf(waterdata): Optional parallel chunk processing in multi_value_chunked (draft) May 17, 2026
