Wire cd_cache into the COG read path (kill recurring S3 egress)#77
Open
NewGraphEnvironment wants to merge 6 commits into
Phase 1 of #76. New cd_cache_fetch() downloads a remote http(s) COG once to the cd cache dir (filename = hash(url) + ext, sidecar .meta JSON with url/etag/size/downloaded_at) and returns a local path; subsequent calls read locally. Freshness is checked via the HTTP HEAD ETag, falling back to the Content-Length size when the host returns no ETag; when the host returns no validator at all, the file is re-downloaded (conservative, documented). Local and non-http (s3://, /vsi) hrefs pass through unchanged for GDAL to read directly. Robustness: download to a temp file in the cache dir, validate the byte size against Content-Length, then do an atomic intra-filesystem rename (guarded against file.rename failure), so a truncated download is never served. options(cd.cache_revalidate = FALSE) skips the HEAD for a fully offline fast path; a failed HEAD with a cached copy present serves the cache. curl added to Imports (HEAD + download); withr to Suggests (test opts). 20 CI-safe tests via local_mocked_bindings, with no real network. Full suite FAIL 0 / 219 PASS. /code-check 3 rounds clean (fixed: etag->size fallback for header-poor hosts, file.rename failure guard). Read path wiring (cd_crop/cd_extract) follows in Phase 2. Relates to #76
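The download, size-validate and rename sequence described in this commit can be sketched roughly as follows. This is a minimal illustration in base R, not the package's actual code: atomic_fetch() and its download_fn argument are hypothetical names, and the real cd_cache_fetch() may differ in its details.

```r
# Hypothetical helper illustrating the pattern; not the package's implementation.
# download_fn(url, dest) is assumed to write the response body to `dest`.
atomic_fetch <- function(url, dest, expected_size, download_fn) {
  # Create the temp file next to `dest` so the rename stays on one filesystem.
  tmp <- tempfile(pattern = "partial-", tmpdir = dirname(dest))
  on.exit(unlink(tmp), add = TRUE)  # clean up the partial file on any exit

  download_fn(url, tmp)

  # Reject a missing or truncated download before it can reach the cache.
  actual <- file.size(tmp)
  if (is.na(actual) || (!is.na(expected_size) && actual != expected_size)) {
    stop("size mismatch: expected ", expected_size, " bytes, got ", actual)
  }

  # file.rename() returns FALSE on failure, so check it explicitly.
  if (!file.rename(tmp, dest)) {
    stop("could not move download into cache: ", dest)
  }
  dest
}

# Usage with a fake downloader that writes 10 bytes (no network involved).
cache_dir <- tempfile("cd-cache-")
dir.create(cache_dir)
fake_download <- function(url, dest) writeBin(as.raw(1:10), dest)

good <- file.path(cache_dir, "good.tif")
atomic_fetch("https://example.com/a.tif", good, expected_size = 10,
             download_fn = fake_download)

# A download that is shorter than the advertised size is rejected.
bad <- file.path(cache_dir, "bad.tif")
res <- tryCatch(
  atomic_fetch("https://example.com/a.tif", bad, expected_size = 20,
               download_fn = fake_download),
  error = function(e) "rejected"
)
```

Because the temp file is only renamed after the size check passes, a reader of the cache directory sees either the complete file or nothing at the final path.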
Phase 2 of #76. cd_crop() gains cache = TRUE (default): remote http(s) hrefs are resolved through cd_cache_fetch() before terra::rast(), so a report or vignette rebuild reads each COG from the local cache instead of re-pulling from S3. Local paths pass through unchanged. cd_extract() threads the same cache argument down to cd_crop(). Backward-compatible — default TRUE, and cache = TRUE / FALSE produce identical output for local COGs (asserted in both test files). The isTRUE(cache) guard also no-ops cleanly on NA/NULL. Full suite FAIL 0 / 206 PASS. Relates to #76
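The routing rule in this commit (only http(s) hrefs go through the cache, and only when cache is exactly TRUE) might look something like the sketch below. resolve_href() and its fetch argument are hypothetical stand-ins for illustration; the actual wiring inside cd_crop() is not shown in this PR text.

```r
# Hypothetical sketch of the href routing; `fetch` stands in for cd_cache_fetch().
resolve_href <- function(href, cache = TRUE, fetch) {
  # Only plain http(s) URLs count as remote here; s3:// and /vsi paths do not.
  is_remote <- grepl("^https?://", href, ignore.case = TRUE)

  # isTRUE() is FALSE for NA and NULL, so those values skip the cache cleanly.
  if (isTRUE(cache) && is_remote) {
    return(fetch(href))
  }
  href
}

# Fake fetcher that maps a URL to a pretend cache path (no network involved).
fake_fetch <- function(url) file.path("/cache", basename(url))

remote_hit  <- resolve_href("https://example.com/prcp.tif", TRUE, fake_fetch)
remote_skip <- resolve_href("https://example.com/prcp.tif", FALSE, fake_fetch)
local_path  <- resolve_href("/data/prcp.tif", TRUE, fake_fetch)
s3_path     <- resolve_href("s3://bucket/prcp.tif", TRUE, fake_fetch)
na_cache    <- resolve_href("https://example.com/prcp.tif", NA, fake_fetch)
```

Only remote_hit is rewritten to a cache path; every other call returns its href unchanged.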
Phase 3 of #76. Adds a README "Caching" section explaining the default cache = TRUE behavior, cd_cache_info()/cd_cache_clear(), refresh and the cd.cache_revalidate offline switch, plus the GDAL /vsicurl/ env-var stopgap for raw terra reads. Confirmed the egress kill against the live S3 catalog: first read of prcp_annual.tif pulls 5.26 MB; the second read is a ~1 KB HEAD with no re-download (0.04 s), and with options(cd.cache_revalidate = FALSE) the read is fully offline (0 s). Recorded in planning findings. devtools::document() clean; codetools reports no issues for the new functions. The remaining --as-cran NOTEs/WARNING are pre-existing (planning/ not in .Rbuildignore, vignette checks) and unrelated to this work. NEWS + version bump 0.3.2 -> 0.4.0 (minor: new exported cd_cache_fetch(), new cache args on cd_crop/cd_extract, new curl dependency). Fixes #76
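The revalidation behavior measured here (a small HEAD on warm reads, none at all with the offline switch) follows from a freshness decision along these lines. The sketch assumes a cached copy already exists; is_cache_fresh() and the shapes of meta and head are hypothetical, chosen only to mirror the rules stated in the commits.

```r
# Hypothetical sketch of the freshness rules; assumes a cached copy exists.
# `meta` is the sidecar contents (etag, size); `head` is a parsed HEAD
# response as a list, or NULL when the HEAD request failed.
is_cache_fresh <- function(meta, head,
                           revalidate = getOption("cd.cache_revalidate", TRUE)) {
  if (isFALSE(revalidate)) return(TRUE)  # offline fast path: no HEAD at all
  if (is.null(head)) return(TRUE)        # HEAD failed: serve the cached copy
  if (!is.null(head$etag)) {
    return(identical(head$etag, meta$etag))  # preferred validator
  }
  if (!is.null(head$size)) {
    # Fallback for hosts that return Content-Length but no ETag.
    return(identical(as.numeric(head$size), as.numeric(meta$size)))
  }
  FALSE  # no validator at all: re-download (conservative)
}

meta <- list(etag = "\"abc\"", size = 1024)

etag_match   <- is_cache_fresh(meta, list(etag = "\"abc\""), revalidate = TRUE)
etag_changed <- is_cache_fresh(meta, list(etag = "\"new\""), revalidate = TRUE)
size_match   <- is_cache_fresh(meta, list(size = 1024), revalidate = TRUE)
no_validator <- is_cache_fresh(meta, list(), revalidate = TRUE)
head_failed  <- is_cache_fresh(meta, NULL, revalidate = TRUE)
offline      <- is_cache_fresh(meta, list(etag = "\"new\""), revalidate = FALSE)
```

In this sketch the offline switch wins over everything else, which is why a changed ETag is still reported as fresh when revalidate is FALSE.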
The unit tests in test-cd_cache_fetch.R mock cd_remote_head / cd_remote_download, so the real curl HEAD + ETag parsing + download path had no committed regression test (only a one-off manual smoke test). This adds an integration test that pulls a real annual COG from the live catalog, asserting: a real S3 ETag is captured, the advertised Content-Length equals the bytes written, a warm read revalidates via HEAD without rewriting the file, the cached file is a valid COG structurally identical to a direct /vsicurl read (caching serves real, not stale or partial data), and the cd.cache_revalidate=FALSE offline path serves the cache with no network. Guarded by skip_on_cran() + skip_if_offline() against the S3 host and isolated in withr::local_tempdir(), so CI never pulls from S3 and a user cache is never touched. Runs only on a local online devtools::test(). Full suite FAIL 0 / 214 PASS. Relates to #76
Summary
Wires the orphaned cd_cache module into the consumer COG read path so repeated extractions, report renders, and vignette rebuilds read each COG from a local on-disk cache instead of re-pulling from S3 on every call, killing the dominant recurring S3 egress driver.

- cd_cache_fetch(): downloads a remote http(s) COG once to the cd cache (filename = hash(url) + ext, sidecar .meta JSON with S3 ETag/size/timestamp), revalidates freshness via a cheap HTTP HEAD (ETag, falling back to Content-Length when a host returns no ETag), and serves the local copy on a hit. Download → temp file → size-validate against Content-Length → atomic rename (guarded against file.rename failure), so a truncated file is never served. A failed HEAD with a cached copy serves the cache; options(cd.cache_revalidate = FALSE) skips the HEAD for offline work.
- cd_crop() / cd_extract() gain cache = TRUE (default): remote reads route through the cache; local paths pass through unchanged. Backward-compatible (output identical to cache = FALSE for local COGs).
- /vsicurl/ stopgap note; curl added to Imports; version bump 0.3.2 → 0.4.0.

Egress confirmation (live S3)

- First read (prcp_annual.tif): 5.26 MB pulled
- Second read: ~1 KB HEAD, no re-download (0.04 s)
- Offline read (cd.cache_revalidate = FALSE): no network (0 s)

Related Issues

Fixes #76

Test plan

- FAIL 0 / 206 PASS (20 new CI-safe cache tests via local_mocked_bindings, no real network)
- /code-check 3 rounds on the core (2 fixes: ETag→size fallback, file.rename guard); round 3 clean
- lintr clean; codetools clean on all new functions
- cache = TRUE / FALSE produce identical output for local COGs

Notes

- /gh-pr-merge should detect it and skip the double-bump (tag v0.4.0 on the existing commit).
- planning/ is not in .Rbuildignore, which is the source of several pre-existing --as-cran NOTEs.

🤖 Generated with Claude Code