Merged
257 changes: 257 additions & 0 deletions demos/USGS_WaterData_ContinuousData_Examples.ipynb
@@ -0,0 +1,257 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "d664492b",
"metadata": {},
"source": [
"# Continuous Data\n",
"\n",
"Continuous data are collected by automated sensors, typically at a fixed\n",
"15-minute interval (you may also hear them called \"instantaneous values\" or\n",
"\"IV\"). They are described by parameter name and parameter code, and retrieved\n",
"with `get_continuous`.\n",
"\n",
"This notebook covers the two things that matter when a continuous pull gets\n",
"large: `dataretrieval` **chunks big requests for you** and can **resume** a pull\n",
"that was interrupted partway through, and the one case you still handle yourself\n",
"— the service's 3-year-per-request time limit."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e7e06e81",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"from dataretrieval import waterdata\n",
"\n",
"site = \"USGS-0208458892\""
]
},
{
"cell_type": "markdown",
"id": "b0136bd1",
"metadata": {},
"source": [
"## What continuous data are available?\n",
"\n",
"Filter the combined metadata to `data_type=\"Continuous values\"` to see which\n",
"time series a site offers and how far back each goes:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6f8a9d87",
"metadata": {},
"outputs": [],
"source": [
"continuous_available, _ = waterdata.get_combined_metadata(\n",
" monitoring_location_id=site,\n",
" data_type=\"Continuous values\",\n",
")\n",
"avail = continuous_available[[\"parameter_code\", \"parameter_name\", \"begin\", \"end\"]]\n",
"avail.sort_values(\"parameter_code\").reset_index(drop=True)"
]
},
{
"cell_type": "markdown",
"id": "fdaa8150",
"metadata": {},
"source": [
"## Large requests are chunked for you\n",
"\n",
"Any list-valued argument — a long list of monitoring locations, several parameter\n",
"codes, a complex CQL filter — can push a single request URL past the server's\n",
"~8 KB limit. `dataretrieval` handles this automatically: it splits the query into\n",
"URL-sized sub-requests, issues them, and recombines (and de-duplicates) the\n",
"results into one frame. **You never need to loop over sites yourself** — request\n",
"everything in one call.\n",
"\n",
"For example, asking for several parameter codes at once just returns one combined\n",
"long-format frame:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6bc05102",
"metadata": {},
"outputs": [],
"source": [
"multi, _ = waterdata.get_continuous(\n",
" monitoring_location_id=site,\n",
" parameter_code=[\"00095\", \"00010\"], # specific conductance + water temperature\n",
" time=\"2024-07-01/2024-07-02\",\n",
")\n",
"multi.groupby(\"parameter_code\")[\"value\"].agg([\"count\", \"min\", \"max\"])"
]
},
{
"cell_type": "markdown",
"id": "353ad4ec",
"metadata": {},
"source": [
"## Resilient pulls: resume after an interruption\n",
"\n",
"A large request becomes many sub-requests under the hood, so a long pull can be\n",
"interrupted partway through by a rate limit (HTTP 429) or a transient server\n",
"error (HTTP 5xx). Rather than discard the work already done, `dataretrieval`\n",
"raises a `ChunkInterrupted` that **preserves the completed sub-requests** and\n",
"lets you continue:\n",
"\n",
"- `QuotaExhausted` (429) and `ServiceInterrupted` (5xx) both subclass\n",
" `ChunkInterrupted`.\n",
"- `exc.partial_frame` holds whatever completed before the failure.\n",
"- `exc.retry_after` is the server's suggested wait (when provided).\n",
"- `exc.call.resume()` re-issues **only the still-pending** sub-requests and\n",
" returns the full `(data, metadata)`.\n",
"\n",
"The pattern below waits out the interruption and resumes until the pull\n",
"finishes. (In normal conditions the request completes on the first try and the\n",
"`except` block never runs.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e2e9ddff",
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"from dataretrieval.waterdata.chunking import ChunkInterrupted\n",
"\n",
"try:\n",
" sensor_data, _ = waterdata.get_continuous(\n",
" monitoring_location_id=site,\n",
" parameter_code=\"00095\",\n",
" time=\"2024-07-01/2024-07-08\",\n",
" )\n",
"except ChunkInterrupted as exc:\n",
" print(\n",
" f\"interrupted after {exc.completed_chunks}/{exc.total_chunks} chunks; resuming\"\n",
" )\n",
" while True:\n",
" time.sleep(exc.retry_after or 5 * 60) # honor Retry-After, else back off\n",
" try:\n",
" sensor_data, _ = exc.call.resume()\n",
" break\n",
" except ChunkInterrupted as again:\n",
" exc = again\n",
"\n",
"print(f\"{len(sensor_data):,} rows\")\n",
"sensor_data[[\"time\", \"parameter_code\", \"value\", \"approval_status\"]].head()"
]
},
{
"cell_type": "markdown",
"id": "397e87b5",
"metadata": {},
"source": [
"## The 3-year window: the one axis you split yourself\n",
"\n",
"There is one limit the library does **not** chunk for you: the continuous service\n",
"returns at most **3 years of data per request**, and a time window is not a\n",
"list-shaped axis it can fan out. (With no `time` argument the service returns the\n",
"latest year; continuous data also has no geometry column and ignores bounding-box\n",
"queries.)\n",
"\n",
"So a multi-year, single-site pull is the one place you still split by time. The\n",
"service is most efficient one calendar year at a time, so build a list of yearly\n",
"windows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bd26d199",
"metadata": {},
"outputs": [],
"source": [
"# Split [start, end] into per-calendar-year (start, end) date strings.\n",
"def year_chunks(start, end):\n",
" start, end = pd.Timestamp(start), pd.Timestamp(end)\n",
" edges = pd.to_datetime([f\"{y}-01-01\" for y in range(start.year + 1, end.year + 1)])\n",
" starts = [start, *edges]\n",
" ends = [*(edges - pd.Timedelta(days=1)), end]\n",
" return [\n",
" (s.strftime(\"%Y-%m-%d\"), e.strftime(\"%Y-%m-%d\")) for s, e in zip(starts, ends)\n",
" ]\n",
"\n",
"\n",
"# Covering a full multi-year record (no data downloaded here):\n",
"pd.DataFrame(year_chunks(\"2012-10-01\", \"2025-09-30\"), columns=[\"start\", \"end\"])"
]
},
{
"cell_type": "markdown",
"id": "3bc4f40f",
"metadata": {},
"source": [
"Then request each window and concatenate. (We use a short two-window span here so\n",
"the notebook runs quickly; widen the dates for a full period of record.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "01ebb4a0",
"metadata": {},
"outputs": [],
"source": [
"chunks = year_chunks(\"2023-10-01\", \"2024-03-31\")\n",
"\n",
"frames = []\n",
"for start, end in chunks:\n",
" part, _ = waterdata.get_continuous(\n",
" monitoring_location_id=site,\n",
" parameter_code=\"00095\",\n",
" time=f\"{start}/{end}\",\n",
" )\n",
" frames.append(part)\n",
"\n",
"por = pd.concat(frames, ignore_index=True)\n",
"print(\n",
" f\"{len(por):,} rows from {len(chunks)} windows, \"\n",
" f\"{por['time'].min()} -> {por['time'].max()}\"\n",
")"
]
},
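{
"cell_type": "markdown",
"id": "c4a7e1f0",
"metadata": {},
"source": [
"The windowed loop and the resume pattern can be combined. The cell below is a\n",
"minimal sketch, not part of the library: `get_with_resume` is a hypothetical\n",
"helper name, and it assumes `ChunkInterrupted`, `exc.retry_after`, and\n",
"`exc.call.resume()` behave as described in the resume section above. It also\n",
"relies on `site` and `year_chunks` from the earlier cells, and it retries\n",
"without an upper bound, which you may want to cap in a real job."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9b2d5c3e",
"metadata": {},
"outputs": [],
"source": [
"import time\n",
"\n",
"import pandas as pd\n",
"\n",
"from dataretrieval import waterdata\n",
"from dataretrieval.waterdata.chunking import ChunkInterrupted\n",
"\n",
"\n",
"# Hypothetical helper (not a dataretrieval function): run get_continuous and\n",
"# keep resuming after each interruption until the pull completes.\n",
"def get_with_resume(fallback_wait=5 * 60, **kwargs):\n",
"    try:\n",
"        return waterdata.get_continuous(**kwargs)\n",
"    except ChunkInterrupted as exc:\n",
"        pending = exc  # keep a reference; `exc` is unbound after this block\n",
"    while True:\n",
"        # Honor the server's Retry-After when given, else use the fallback wait.\n",
"        time.sleep(pending.retry_after or fallback_wait)\n",
"        try:\n",
"            return pending.call.resume()  # re-issues only pending sub-requests\n",
"        except ChunkInterrupted as again:\n",
"            pending = again\n",
"\n",
"\n",
"frames = []\n",
"for start, end in year_chunks(\"2023-10-01\", \"2024-03-31\"):\n",
"    part, _ = get_with_resume(\n",
"        monitoring_location_id=site,\n",
"        parameter_code=\"00095\",\n",
"        time=f\"{start}/{end}\",\n",
"    )\n",
"    frames.append(part)\n",
"\n",
"por = pd.concat(frames, ignore_index=True)\n",
"print(f\"{len(por):,} rows\")"
]
},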
{
"cell_type": "markdown",
"id": "e2487bf4",
"metadata": {},
"source": [
"Wrap each window's call in the resume pattern above for an unattended,\n",
"restart-safe pull. USGS also expects to offer a direct full-period-of-record\n",
"download before the legacy NWIS services are decommissioned, which may make\n",
"time-window splitting unnecessary — check the documentation for updates.\n",
"\n",
"## More help\n",
"\n",
"- Documentation: <https://doi-usgs.github.io/dataretrieval-python/>\n",
"- Chunking and resume internals: `dataretrieval.waterdata.chunking`\n",
"- Issues / questions: <https://github.com/DOI-USGS/dataretrieval-python/issues>\n",
"- Equivalent R article: [Continuous Data](https://doi-usgs.github.io/dataRetrieval/articles/continuous_pr.html)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}