Skip to content

waterdata: tighten stats + OGC pagination (geometry, KeyError, non-200)#255

Draft
thodson-usgs wants to merge 4 commits intoDOI-USGS:mainfrom
thodson-usgs:fix/stats-pagination
Draft

waterdata: tighten stats + OGC pagination (geometry, KeyError, non-200)#255
thodson-usgs wants to merge 4 commits intoDOI-USGS:mainfrom
thodson-usgs:fix/stats-pagination

Conversation

@thodson-usgs
Copy link
Copy Markdown
Collaborator

@thodson-usgs thodson-usgs commented May 3, 2026

Summary

Three correctness fixes to the two pagination loops in `dataretrieval.waterdata.utils`.

  • `get_stats_data` dropped geometry on continuation pages. Page 1 honored `GEOPANDAS` but pages 2..N hard-coded `geopd=False`. With geopandas installed, a multi-page stats response started as a `GeoDataFrame` and pages 2..N came back as plain `DataFrame`s; `pd.concat` then silently downgraded the result and the caller lost geometry / CRS. Use `geopd=GEOPANDAS` on every page.
  • `get_stats_data` raised `KeyError` on responses without a `next` key. Some terminal responses simply omit it. Switched to `body.get("next")`, which produces `None` and exits the loop cleanly.
  • Both pagination loops swallowed non-200 continuation pages silently. A 4xx/5xx page whose body happened to JSON-decode could be appended as data and pagination would quietly stop — the caller got a partial result with no warning. Added an explicit `if status_code != 200` raise inside each loop so the existing log-and-truncate handler fires deliberately rather than incidentally.

Test plan

  • Three new regression tests in `tests/waterdata_utils_test.py`: `_walk_pages` raises and stops on a 500-status continuation; `get_stats_data` accepts a response without a `next` key; `get_stats_data` truncates cleanly on a 503 continuation.
  • Existing tests in the file still pass.
  • Full local test suite passes (229 tests, with `API_USGS_PAT` set).

Related PRs

Other open PRs in this bug-review series that touch dataretrieval/waterdata/utils.py (different functions, no functional conflicts):

thodson-usgs and others added 2 commits May 3, 2026 15:23
…-200

Three correctness fixes to the two pagination loops in
dataretrieval.waterdata.utils.

* `get_stats_data` honored `GEOPANDAS` for the first page but
  hard-coded `geopd=False` on every continuation page. With geopandas
  installed, a multi-page stats response started as a `GeoDataFrame`
  and pages 2..N came back as plain `DataFrame`s; `pd.concat` then
  silently downgraded the result and the caller lost geometry / CRS.
  Use `geopd=GEOPANDAS` on every page.

* `get_stats_data` indexed `body["next"]` directly, raising `KeyError`
  on responses without that key (some terminal responses simply omit
  it). Switch to `body.get("next")`, which produces `None` and exits
  the loop cleanly.

* Both `get_stats_data`'s in-loop request and `_walk_pages`'s in-loop
  request returned the response without checking `status_code`. A 4xx
  or 5xx page whose body happened to JSON-decode could be appended as
  data, then pagination quietly stopped — the caller got a partial
  result with no warning. Add an explicit `if status_code != 200`
  raise inside each loop so the existing log-and-truncate handler
  fires deliberately rather than incidentally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Both pagination loops now had four call sites repeating
``if resp.status_code != 200: raise RuntimeError(_error_body(resp))``.
Move that into a one-line helper alongside ``_error_body`` and call
it from every site.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR tightens pagination behavior in dataretrieval.waterdata.utils, specifically for WaterData stats and OGC fetches, so paginated responses preserve expected structure and stop more intentionally on continuation-page failures.

Changes:

  • Refactors non-200 response checks into _raise_if_not_ok() and applies it to initial and continuation-page requests.
  • Updates get_stats_data() to tolerate missing "next" keys and to preserve GEOPANDAS handling across all pages.
  • Adds regression tests around missing "next" handling and non-200 continuation pages.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
dataretrieval/waterdata/utils.py Adjusts pagination/error handling in _walk_pages() and get_stats_data(), including stats continuation parsing.
tests/waterdata_utils_test.py Adds regression tests for pagination edge cases in _walk_pages() and get_stats_data().

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +342 to +345
def _raise_if_not_ok(resp: requests.Response) -> None:
"""Raise ``RuntimeError(_error_body(resp))`` for any non-200 response."""
if resp.status_code != 200:
raise RuntimeError(_error_body(resp))
body = resp.json()
all_dfs.append(_handle_stats_nesting(body, geopd=False))
next_token = body["next"]
all_dfs.append(_handle_stats_nesting(body, geopd=GEOPANDAS))
Comment thread tests/waterdata_utils_test.py Outdated
Comment on lines +87 to +93
def test_walk_pages_raises_on_non_200_in_loop():
"""`_walk_pages` must surface a non-200 mid-loop, not silently truncate.

Regression: previously any non-200 page was appended (with whatever
body it had) and pagination quietly stopped because `_get_resp_data`
or `_next_req_url` raised inside the bare except. The user got a
partial result with no warning.
thodson-usgs and others added 2 commits May 4, 2026 10:12
Per copilot review on PR DOI-USGS#255:

- _error_body: catch JSONDecodeError when a 4xx/5xx returns plain text
  or HTML. Previously, _raise_if_not_ok -> _error_body -> resp.json()
  raised JSONDecodeError on non-JSON bodies, defeating the in-loop
  status check.
- Tests:
  - Rename test_walk_pages_raises_on_non_200_in_loop to
    test_walk_pages_truncates_on_non_200_continuation; the assertion
    verifies log-and-truncate, not raise.
  - New test_get_stats_data_preserves_geometry_across_pages exercises
    the GEOPANDAS=True continuation path so a regression to geopd=False
    on page 2..N is caught.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	tests/waterdata_utils_test.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants