Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion DESCRIPTION
Original file line number Diff line number Diff line change
Expand Up @@ -21,13 +21,14 @@ Depends:
Encoding: UTF-8
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.3
Suggests:
Suggests:
aws.s3,
bookdown,
ecmwfr,
ggplot2,
ggrepel,
gt,
kableExtra,
Kendall,
knitr,
rmarkdown,
Expand Down
10 changes: 4 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,12 +17,10 @@ pak::pak("NewGraphEnvironment/cd")
```r
library(cd)

# Load a STAC catalog and an area of interest
catalog <- cd_catalog(
system.file("extdata", "example_catalog.json", package = "cd")
)
# Load the live STAC catalog and an example area of interest
catalog <- cd_catalog()
aoi <- sf::st_read(
system.file("extdata", "example_aoi.gpkg", package = "cd"),
system.file("extdata", "example_aoi_kotl.gpkg", package = "cd"),
quiet = TRUE
)

Expand All @@ -46,4 +44,4 @@ cd_compare(ts, window_a = 1956:1960, window_b = 1951:1955)
## Links

- [Function reference](https://newgraphenvironment.github.io/cd/reference/)
- [Vignette](https://newgraphenvironment.github.io/cd/articles/) (coming soon)
- [Vignette: Climate Departure Analysis for a Watershed](https://newgraphenvironment.github.io/cd/articles/climate-departure.html)
109 changes: 10 additions & 99 deletions planning/active/findings.md
Original file line number Diff line number Diff line change
@@ -1,108 +1,19 @@
# Findings: EDH migration
# Findings: Vignette tmax/tmin + map clipping

## EDH API basics (from #35 research and tonight's benchmark)
_Populated during Phase 1 inspection and as work progresses._

**Zarr URL:** `https://data.earthdatahub.destine.eu/era5/reanalysis-era5-land-no-antartica-v0.zarr`
## Vignette chunk inventory

**Auth:** Personal access token, passed inline as URL password:
```
https://edh:<EDH_TOKEN>@data.earthdatahub.destine.eu/...
```
Alternative: `storage_options={"client_kwargs":{"trust_env":True}}` to read from `.netrc`.
_TBD — list each figure/table chunk, what it shows, what context layers it uses._

**Quota:** 500,000 requests/month per user (confirmed in EDH getting-started docs).
## Watershed group lookup

**Coord conventions (from live Zarr inspection):**
- Dimensions: `valid_time`, `latitude`, `longitude`
- `latitude`: descending from ~90 to -60 (confirmed by working slice `slice(60, 48)`)
- `longitude`: **0 to 359.9** (0-360 convention, NOT -180 to 180)
- For BC bbox `c(60, -140, 48, -114)` → lon slice must be `220` to `246`
- `valid_time`: hourly, from 1950-01-01T00:00 to current month last day
- Variable names: CF-style short names (`t2m`, `tp`, `d2m`, `swvl1-4`, `u10`, `v10`, etc.)
_TBD — fwapg query or helper used to resolve the WSG polygon for the AOI._

**50 variables available in single Zarr store:**
`asn, d2m, e, es, evabs, evaow, evatc, evavt, fal, lai_hv, lai_lv, lblt, licd, lict, lmld, lmlt, lshf, ltlt, pev, ro, rsn, sd, sde, sf, skt, slhf, smlt, snowc, sp, src, sro, sshf, ssr, ssrd, ssro, stl1, stl2, stl3, stl4, str, strd, swvl1, swvl2, swvl3, swvl4, t2m, tp, tsn, u10, v10`
## Ecological angles — which 2-3 to ship

**Relevant variable mapping for cd package:**
| cd variable | EDH variable(s) | Units | Notes |
|---|---|---|---|
| tmean | `t2m` | K | Hourly; monthly mean computed client-side |
| tmax | `t2m` | K | Hourly; daily max → monthly mean of daily max |
| tmin | `t2m` | K | Hourly; daily min → monthly mean of daily min |
| prcp | `tp` | m | Hourly total precipitation; sum over month, × 1000 for mm |
| dewpoint | `d2m` | K | Hourly 2m dewpoint |
| soil_moisture | `swvl1`, `swvl2`, `swvl3`, `swvl4` | m³/m³ | 4 vertical layers |
| vpd | derived | hPa | From t2m + d2m (Tetens) |
| rh | derived | % | From t2m + d2m |
_TBD — decide after inspecting what the existing narrative already covers._

## Performance (single month BC bbox)
## Map cleanup notes

- Open Zarr store: 2.6-3.0s (metadata only)
- Subset + materialize 1 month × BC bbox × hourly t2m: 14.9-15.9s
- In-memory size: 92.9 MB for 744 hours × 120 lat × 260 lon float64

## Implications for pipeline

- No rate limits means we can pull **all months in parallel** via asyncio/dask if wanted
- Zarr chunking is independent of month boundaries, so small time-slice pulls are efficient
- Can pull all 7 cd variables from the **same dataset open** — no need for separate fetches per variable

## R/Zarr tooling options

- **Reticulate + Python xarray** — proven; what `test_edh_era5_land.py` uses. Simple, works.
- **stars::read_mdim()** — GDAL zarr driver. Requires GDAL built with zarr, auth via URL password likely works.
- **pizzarr** — young R-native zarr package, may not support remote HTTPS stores.

Decision deferred to Phase 3 after Phase 2 unblocks the data.

## Known limitation: UTC-day aggregation for tmax/tmin

Both the existing R pipeline (`pipeline_tmax_tmin_hourly.R`) and the new EDH
Python script compute daily max/min over UTC-day windows, not local-time days.
For BC (UTC-7/-8) the local-afternoon tmax peak straddles UTC day boundaries,
introducing a small systematic bias vs. a local-time daily aggregation.

- CDS's `derived-era5-land-daily-statistics` product accepts a `time_zone`
parameter and would fix this — but we abandoned it in #33 due to its
~2-hour-per-job queue time.
- Fixing properly requires shifting hourly timestamps by the local UTC offset
before `.resample("1D")`, or doing per-pixel local-time aggregation
(computationally expensive and overkill for a regional dataset).
- For now: the EDH script replicates the R pipeline's existing behavior so
outputs are directly comparable. Addressing this bias is a follow-up issue
(cd package methodology).

## Precipitation accumulation (ERA5-Land tp)

The hourly `tp` variable has `GRIB_stepType=accum` — it is a **running
accumulation** from 01:00 UTC reset to 00:00 UTC next day. Naive
`.sum()` over hourly values gives a monthly precip ~8× too high.

**Solution:** use EDH's daily product
`https://data.earthdatahub.destine.eu/era5/era5-land-daily-utc-v1.zarr`
which pre-computes daily totals correctly. Monthly precip = sum of
daily totals.

**Why not fix the hourly sum:** writing accumulation logic ourselves
is bug-prone. The daily product is what ECMWF designed for this
exact case.

**Caveat:** daily product only has 14 variables (t2m, d2m, swvl1-2, tp,
wind, radiation, etc.) — no swvl3 or swvl4. So soil moisture stays on
the hourly product. tmean and dewpoint could use either; we use hourly
for consistency and because the 0.99 correlation (vs CDS) was already
validated.

## EDH product catalogue (what's available for cd package)

| Product | Zarr URL | Variables | Use |
|---|---|---|---|
| Hourly ERA5-Land | `reanalysis-era5-land-no-antartica-v0.zarr` | 50 | tmax/tmin, tmean, dewpoint, soil_moisture |
| Daily ERA5-Land (UTC) | `era5-land-daily-utc-v1.zarr` | 14 | prcp only |
| Monthly ERA5 (not Land) | — | - | NOT useful (ERA5 parent, 31 km) |

## Out-of-scope confirmations

- GEE has ERA5-Land but commercial license blocks us
- AWS / ARCO / WeatherBench all serve ERA5 (31 km), not ERA5-Land
- CDS-beta is same backend, same rate limits
_TBD — per-figure cleanup decisions (clip-to-WSG, mapgl fitBounds, etc.)_
103 changes: 9 additions & 94 deletions planning/active/progress.md
Original file line number Diff line number Diff line change
@@ -1,100 +1,15 @@
# Progress: EDH migration
# Progress: Vignette tmax/tmin + map clipping

## Session 2026-04-12 (evening)

**Context from prior day:** Got rate-limited twice by CDS. Researched alternatives (#35), benchmarked EDH Zarr at 5× faster with no rate limits and same data. Filed #36 for migration. Pivoting now.
## Session 2026-04-14 (branch + PWF baseline)

**Completed:**
- Branched off main: `36-edh-migration`
- Stashed and restored `scripts/test_edh_era5_land.py` on new branch
- Set up PWF files (this document, task_plan.md, findings.md)
- Branched `39-vignette-tmax-tmin-maps` off main
- Archived #36 PWF to `planning/archive/2026-04-issue-36-edh-migration/` with README
- Set up #39 PWF (this file, task_plan.md, findings.md)

**Next:**
- Update climate-update.yml GHA to use EDH (replace CDS fetcher with
scripts/backfill_edh_all.py + stage 3)
- Write pipeline_update_edh (EDH incremental update, replaces
CDS-based pipeline_update.R)
- Land the PR

## Session 2026-04-12 (Stage 3 to S3)

**Completed:**
- Wrote `scripts/pipeline_stage3_edh.R` — slim Stage-3-only against
`data/backfill/monthly/` (avoids CDS fetch in existing pipeline_backfill.R)
- Built 35 COGs locally: 7 variables × (annual + 4 seasons) × 76 years
- STAC catalog built, 35 items
- S3 sync pushed all 35 COGs to s3://stac-era5-land (178 MB total,
replacing the April 6-7 CDS-era COGs)
- Catalog.json unchanged (byte-identical) because filenames + extents
+ metadata didn't change — legitimate sync behavior, not a bug
- Verified end-to-end via /vsicurl: tmean_annual reads back with correct
76 years, CRS, extent. BC annual mean 1950 = -1.42 C, 2024 = +1.85 C
(3.3 C warming in 74 years — the climate departure signal the package
is built to show, now live on S3 from EDH data)

## Session 2026-04-12 (unified backfill run)

**Completed:**
- Full 1950-2025 unified backfill of 5 non-tmax/tmin variables via EDH in ~4 hours
(10:19 - 14:20 wall time)
- 532 monthly TIFs total (7 vars × 76 years), all on matching EPSG:4326 120×260 grid
- CDS-era files safely preserved in `data/backfill/monthly/_cds_backup/` (375 files)
- Two transient errors handled well:
- 1989: TimeoutError → retry-with-backoff caught it, completed on attempt 2
- 2008: ClientPayloadError → retry catch didn't include aiohttp class; outer handler
skipped the year and continued. Filled later via `--year 2008` rerun.
- QA confirmed: all grids aligned, tmin<=tmean<=tmax with zero violations,
physically plausible values, soil_moisture composite correct.

## Session 2026-04-12 (overnight run)

**Completed:**
- Full 1950-2025 backfill via EDH, ~1h 53min unattended (02:16 - 04:09)
- 75 year-files written (1951-2025) plus 1950 from earlier test = 76 years
- Avg ~90s per year, no failures
- Regenerated 2024 from EDH (was a CDS leftover from April 7 with different methodology)
- Filed #37 for the UTC-day aggregation bias (not blocking #36)
- All 76 years × 2 variables × 12 months × BC bbox GeoTIFFs in data/backfill/monthly/

**Validation (R/terra spot check):**
- 1950 Jul tmax up to 29.7 °C — historically realistic
- 2000 Jul tmax up to 28.9 °C
- 2024 Jul tmax up to 34.6 °C — matches heat dome era
- 2025 Jul tmax up to 31.9 °C
- All years: 12 layers, Jan..Dec names, EPSG:4326, BC bbox extent

## Session 2026-04-12 (morning QA)

**Completed:**
- Wrote `scripts/qa_monthly.R` — checks grid alignment, time coverage,
physical sanity across all variables
- Wrote `scripts/probe_edh_vars.py` — pulls one month of each EDH
variable, compares to CDS, validates aggregation semantics
- **Discovered grid mismatch:** CDS-era vars on 121×261 grid with no CRS
tag, vs EDH-era tmax/tmin on 120×260 EPSG:4326. Blocks release.
- **Validated EDH semantics:**
- tmean (hourly t2m mean): 0.99 correlation with CDS, 0.43 °C mean abs diff
- dewpoint: plausible BC winter values
- soil_moisture: all 4 depths match CDS composite closely
- prcp: hourly `tp` is accumulation (`GRIB_stepType=accum`) — need
EDH daily product for correct monthly precip
- **Decision:** two-Zarr approach, extend backfill to all variables so
the whole dataset is internally grid-aligned

**Next:**
- Commit QA + probe scripts on the branch (durable tooling)
- Extend backfill script to all 7 cd variables
- Regenerate `data/backfill/monthly/` entirely
- Re-run QA → R Stage 3 → PR

**Commits this session:**
- `0014bf2` — Add planning docs for EDH migration
- `7ef03cb` — Add EDH Zarr benchmark test script
- _pending_ — Add EDH-based tmax/tmin backfill script
- Phase 1: inspect vignette + list every figure chunk
- Decide the 2-3 tmax/tmin angles
- Pull watershed group polygon

**Verified 2026-04-12 02:11:**
- EDH Zarr benchmark: 15.9s / month BC t2m (5× faster than CDS)
- EDH full-year backfill: 114.3s / year BC tmax/tmin as GeoTIFF
- Output drop-in replacement for existing R Stage 2: 12 layers Jan..Dec,
EPSG:4326, °C, BC bbox — `terra::rast()` reads cleanly with expected names
- Realistic BC values for 1950: Jan tmax -30.5 to 0.7°C, Jul tmax -1.2 to 29.7°C
**Commits this session:** _to be appended_
Loading
Loading