
Wire cd_cache into the COG read path (kill recurring S3 egress)#77

Open
NewGraphEnvironment wants to merge 6 commits into main from 76-wire-cd-cache-read-path


@NewGraphEnvironment

Owner

Summary

Wires the orphaned cd_cache module into the consumer COG read path so repeated extractions, report renders, and vignette rebuilds read each COG from a local on-disk cache instead of re-pulling from S3 on every call — killing the dominant recurring S3 egress driver.

  • New cd_cache_fetch() — downloads a remote http(s) COG once to the cd cache (filename = hash(url) + ext, sidecar .meta JSON with S3 ETag/size/timestamp), revalidates freshness via a cheap HTTP HEAD (ETag, falling back to Content-Length when a host returns no ETag), and serves the local copy on a hit. Download → temp file → size-validate against Content-Length → atomic rename (guarded against file.rename failure), so a truncated file is never served. Failed HEAD with a cached copy serves the cache; options(cd.cache_revalidate = FALSE) skips the HEAD for offline work.
  • cd_crop() / cd_extract() gain cache = TRUE (default) — remote reads route through the cache; local paths pass through unchanged. Backward-compatible (output identical to cache = FALSE for local COGs).
  • Docs + release — README "Caching" section + GDAL /vsicurl/ stopgap note; curl added to Imports; version bump 0.3.2 → 0.4.0.
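As a rough illustration of the "filename = hash(url) + ext" naming described above, the following base R sketch maps a URL to a stable cache path. The helper name `cache_path_for`, the choice of MD5, and the example URL are assumptions made for illustration; the PR does not state which hash cd_cache_fetch() uses.

```r
# Hypothetical helper (not the package's code): derive a stable cache path
# from a URL as hash(url) + original extension.
cache_path_for <- function(url, cache_dir) {
  # tools::md5sum() hashes files, so write the URL to a temp file first.
  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)
  writeLines(url, tmp, sep = "")
  key <- unname(tools::md5sum(tmp))
  # file_ext() only looks at the trailing ".xyz"; URLs with query strings
  # would need extra handling.
  ext <- tools::file_ext(url)
  name <- if (nzchar(ext)) paste0(key, ".", ext) else key
  file.path(cache_dir, name)
}

url <- "https://example.com/cogs/prcp_annual.tif"  # placeholder URL
cog_path <- cache_path_for(url, tempdir())
meta_path <- paste0(cog_path, ".meta")  # sidecar lives next to the COG
```

Because the name depends only on the URL, repeated calls for the same COG resolve to the same file, which is what lets a later call find the earlier download.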

Egress confirmation (live S3)

| Read | Time | Network |
| --- | --- | --- |
| 1st (prcp_annual.tif) | 0.8 s | full 5.26 MB download |
| 2nd (HEAD revalidate) | 0.04 s | ~1 KB HEAD, no re-download |
| 3rd (cd.cache_revalidate = FALSE) | 0.000 s | zero network |

Related Issues

Test plan

  • Full suite FAIL 0 / 206 PASS (20 new CI-safe cache tests via local_mocked_bindings — no real network)
  • /code-check 3 rounds on the core (2 fixes: ETag→size fallback, file.rename guard); round 3 clean
  • lintr clean; codetools clean on all new functions
  • Live S3 smoke test confirms the egress kill (table above)
  • cache = TRUE / FALSE produce identical output for local COGs

Notes

  • Version bump 0.3.2 → 0.4.0 is already on the branch, so /gh-pr-merge should detect it and skip the double-bump (tag v0.4.0 on the existing commit).
  • Out of scope, surfaced for a follow-up: planning/ is not in .Rbuildignore, which is the source of several pre-existing --as-cran NOTEs.

🤖 Generated with Claude Code

Phase 1 of #76. New cd_cache_fetch() downloads a remote http(s) COG once
to the cd cache dir (filename = hash(url) + ext, sidecar .meta JSON with
url/etag/size/downloaded_at) and returns a local path; subsequent calls
read locally. Freshness is checked via the HTTP HEAD ETag, falling back to
Content-Length size when the host returns no ETag; when the host returns
no validator at all, the file is re-downloaded (conservative, documented).
Local and non-http (s3://, /vsi) hrefs pass through unchanged for GDAL to
read directly.
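
The freshness rule in this paragraph can be sketched as a small decision function. This is a minimal sketch, not the package's implementation: `is_fresh` and the `etag`/`size` field names on the two lists are assumed for illustration (the sidecar is described as holding url/etag/size/downloaded_at).

```r
# Hypothetical sketch: `meta` is the stored sidecar, `head` the result of a
# fresh HTTP HEAD. Both are lists that may carry `etag` and/or `size`.
is_fresh <- function(meta, head) {
  has <- function(x) {
    !is.null(x) && length(x) == 1L && !is.na(x) && nzchar(as.character(x))
  }
  # Prefer the ETag when both sides have one.
  if (has(meta$etag) && has(head$etag)) {
    return(identical(meta$etag, head$etag))
  }
  # Otherwise fall back to comparing Content-Length with the stored size.
  if (has(meta$size) && has(head$size)) {
    return(as.numeric(meta$size) == as.numeric(head$size))
  }
  # No validator at all: treat as stale so the caller re-downloads.
  FALSE
}
```

Returning FALSE when nothing can be compared matches the conservative choice described above: a host that offers no validator costs a re-download rather than risking a stale file.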

Robustness: download to a temp file in the cache dir, validate byte size
against Content-Length, atomic intra-filesystem rename (guarded against
file.rename failure), so a truncated download is never served.
options(cd.cache_revalidate = FALSE) skips the HEAD for a fully-offline
fast path; a failed HEAD with a cached copy present serves the cache.
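
The download sequence above (temp file in the cache dir, size check, guarded rename) might look roughly like this. It is a sketch under stated assumptions: `fetch_atomic` and `stub_download` are hypothetical names, and the downloader is passed in as an argument so the example runs without network access.

```r
# Hypothetical sketch of "temp file -> size-validate -> atomic rename".
fetch_atomic <- function(url, dest, expected_size,
                         download = utils::download.file) {
  # Keep the temp file in the same directory as `dest` so the rename stays
  # on one filesystem.
  tmp <- tempfile(pattern = "partial-", tmpdir = dirname(dest))
  on.exit(unlink(tmp), add = TRUE)  # clean up if anything below fails
  download(url, tmp)
  got <- file.size(tmp)
  if (is.na(got) || (!is.na(expected_size) && got != expected_size)) {
    stop("download incomplete: expected ", expected_size,
         " bytes, got ", got)
  }
  # file.rename() signals failure by returning FALSE, so check it.
  if (!file.rename(tmp, dest)) {
    stop("could not move download into cache: ", dest)
  }
  dest
}

# Stand-in downloader for illustration: writes 5 bytes, no network involved.
stub_download <- function(url, destfile) writeBin(as.raw(1:5), destfile)
```

With this shape, a size mismatch stops before the rename, so whatever was previously at `dest` is left untouched and a truncated file never becomes visible under the cache name.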

curl added to Imports (HEAD + download); withr to Suggests (test opts).
20 CI-safe tests via local_mocked_bindings — no real network. Full suite
FAIL 0 / 219 PASS. /code-check 3 rounds clean (fixed: etag->size
fallback for header-poor hosts, file.rename failure guard).

Read path wiring (cd_crop/cd_extract) follows in Phase 2.

Relates to #76

Phase 2 of #76. cd_crop() gains cache = TRUE (default): remote http(s)
hrefs are resolved through cd_cache_fetch() before terra::rast(), so a
report or vignette rebuild reads each COG from the local cache instead
of re-pulling from S3. Local paths pass through unchanged. cd_extract()
threads the same cache argument down to cd_crop().

Backward-compatible — default TRUE, and cache = TRUE / FALSE produce
identical output for local COGs (asserted in both test files). The
isTRUE(cache) guard also no-ops cleanly on NA/NULL.
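
A minimal sketch of the guard described here, assuming a hypothetical `resolve_href()` helper with the fetcher injected as an argument (in the package, that role is played by cd_cache_fetch()):

```r
# Hypothetical sketch: only remote http(s) hrefs go through the cache, and
# only when `cache` is exactly TRUE. `fetch` stands in for cd_cache_fetch().
resolve_href <- function(href, cache = TRUE, fetch = identity) {
  is_remote <- grepl("^https?://", href, ignore.case = TRUE)
  # isTRUE() is FALSE for NA, NULL, and anything else that is not a single
  # TRUE, so those inputs pass the href through untouched.
  if (isTRUE(cache) && is_remote) fetch(href) else href
}
```

Using isTRUE() rather than a bare `if (cache)` avoids an error on NA or NULL; such values simply behave like cache = FALSE.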

Full suite FAIL 0 / 206 PASS.

Relates to #76

Phase 3 of #76. Adds a README "Caching" section explaining the default
cache = TRUE behavior, cd_cache_info()/cd_cache_clear(), refresh and
the cd.cache_revalidate offline switch, plus the GDAL /vsicurl/ env-var
stopgap for raw terra reads.
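
For readers who have not seen the README, a /vsicurl/ stopgap of this kind could be set from R as below. The PR does not list which variables the README recommends, so the selection and the sizes here are assumptions; the names themselves are standard GDAL configuration options.

```r
# Assumed example values, not copied from the README. These tune GDAL's
# in-memory caching for /vsicurl/ reads within the current R session only.
Sys.setenv(
  VSI_CACHE = "TRUE",                          # enable per-file block cache
  VSI_CACHE_SIZE = "50000000",                 # bytes per file handle
  CPL_VSIL_CURL_CACHE_SIZE = "200000000",      # bytes, global curl cache
  GDAL_DISABLE_READDIR_ON_OPEN = "EMPTY_DIR"   # skip sibling-file listing
)
```

These settings only reduce repeated range requests while the process is alive; nothing persists on disk between sessions, which is why this is a stopgap and not a substitute for the on-disk cache.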

Confirmed the egress kill against the live S3 catalog: first read of
prcp_annual.tif pulls 5.26 MB; the second read is a ~1 KB HEAD with no
re-download (0.04 s), and with options(cd.cache_revalidate = FALSE) the
read is fully offline (0 s). Recorded in planning findings.

devtools::document() clean; codetools reports no issues for the new
functions. The remaining --as-cran NOTEs/WARNING are pre-existing
(planning/ not in .Rbuildignore, vignette checks) and unrelated to this
work.

NEWS + version bump 0.3.2 -> 0.4.0 (minor: new exported cd_cache_fetch(),
new cache args on cd_crop/cd_extract, new curl dependency).

Fixes #76

The unit tests in test-cd_cache_fetch.R mock cd_remote_head /
cd_remote_download, so the real curl HEAD + ETag parsing + download path
had no committed regression test (only a one-off manual smoke test).

This adds an integration test that pulls a real annual COG from the live
catalog, asserting: a real S3 ETag is captured, the advertised
Content-Length equals the bytes written, a warm read revalidates via HEAD
without rewriting the file, the cached file is a valid COG structurally
identical to a direct /vsicurl read (caching serves real, not stale or
partial data), and the cd.cache_revalidate=FALSE offline path serves the
cache with no network.

Guarded by skip_on_cran() + skip_if_offline() against the S3 host and
isolated in withr::local_tempdir(), so CI never pulls from S3 and a
user cache is never touched. Runs only on a local online devtools::test().
Full suite FAIL 0 / 214 PASS.

Relates to #76

Development

Successfully merging this pull request may close these issues.

Wire cd_cache into the COG read path so repeated builds read locally (kill recurring S3 egress)

1 participant