Refactor AmeriFlux workflow to delegate CF fallback handling to metgapfill_with_fallback (refs #3605)#3793
Conversation
|
Just to clarify the scope up front: this PR only covers the CF-level milestone of #3605. The idea here is to keep Anything dataset-specific (like deciding between ERA5 vs ERA5-Land, downloading data, converting to CF, or running gap-filling) is intentionally left out for now and deferred to follow-up work (e.g. wiring this into I’d really appreciate feedback on:
Very happy to revise...Thanks! |
| #' Merge CF-compliant meteorological NetCDF files | ||
| #' | ||
| #' Creates a new CF-compliant NetCDF file by merging selected variables | ||
| #' from a secondary CF NetCDF file into a primary CF NetCDF file. |
There was a problem hiding this comment.
It's not clear to me from this description what "primary" and "secondary" mean -- can you describe it more precisely in terms of what happens to the values from each one? From the code below it seems like this function fills NAs in primary_cf with matching values from secondary_cf, where "matching" is determined by variable name and timestamp -- is that correct?
Might also be worth calling the function something different -- "merge" often implies a join-like operation where whole columns get taken from one dataset or the other, while this only fills where explicitly missing. By analogy to dplyr::coalesce() and the "null coalescing operator" %||%, maybe this could be something like coalesce_na_cf_met? Better ideas welcome.
There was a problem hiding this comment.
Thanks for pointing this out , you’re right that “primary” and “secondary” were ambiguous in the earlier description.
I’ve clarified the documentation to explicitly state that:
- values from primary_cf are preserved wherever they are non-missing,
- missing values in primary_cf are filled using matching values from secondary_cf,
- matching is determined by variable name and timestamp intersection.
I’ve also renamed the helper from merge_cf_met_files() to coalesce_na_cf_met() to better reflect the intended semantics (i.e., null-coalescing behaviour rather than a join-like merge)
Let me know if the current wording better captures the behavior
| # TODO(#3605): align CF time axes using PEcAn.utils::cf2datetime() | ||
| # TODO(#3605): error on non-overlapping time axes | ||
| # TODO(#3605): consider aggregation/repeat logic in future PR |
There was a problem hiding this comment.
As you know if you got this far 😁, netCDF formatting has dragons in it. Further complications to consider:
- Do these TODOs include handling cases where the time intervals aren't identical? e.g. half-hourly file A filled from hourly file B?
- Is this function intended to support files with multiple dimensions, e.g. spatial coordinates or ensemble dimensions in addition to time? If so my (untested) hunch is that the existing way of indexing on time might not be enough to uniquely identify a missing value.
There was a problem hiding this comment.
Agreed-
To clarify current scope:
- align_time = TRUE assumes identical temporal resolution in both files.
- It converts CF time to POSIXct, takes the intersection of timestamps, and subsets both files accordingly.
- It does not perform resampling, aggregation, interpolation, or time-range extension.
Handling differing temporal resolutions (e.g., half-hourly vs hourly) or temporal extension is intentionally out of scope for this helper and would require a more explicit design decision.
Similarly, this implementation currently assumes variables are one-dimensional along time only. Files with additional spatial, ensemble, or depth dimensions are not supported and are explicitly documented as out of scope.
I’ve updated the documentation to make these assumptions explicit rather than implied..
| #' Orchestrates coverage checking and conditional merging of a fallback | ||
| #' CF NetCDF file into a primary CF NetCDF file. No files are modified | ||
| #' in place. |
There was a problem hiding this comment.
Same comment as above about "primary" not being clear in context
| # ---- fallback required → ensure clean output path | ||
| if (file.exists(out_file)) { | ||
| file.remove(out_file) | ||
| } |
There was a problem hiding this comment.
Duplicate of lines 36-39 above?
| #' @param fallback_cf character. Path to fallback CF NetCDF file | ||
| #' @param out_file character. Path to output CF NetCDF file | ||
| #' @param coverage_threshold numeric. Minimum acceptable coverage (0–1) | ||
| #' @param align_time logical. Whether to align CF time axes before merging |
There was a problem hiding this comment.
Say more about what aligning means here.
(A good rule for documenting function arguments is to avoid using words that are part of the argument name when you write the argument description. This isn't an absolute but practicing it helps prevent circular definitions)
|
|
||
| # ---- enforce test contract: out_file must NOT exist | ||
| if (file.exists(out_file)) { | ||
| file.remove(out_file) |
There was a problem hiding this comment.
| file.remove(out_file) | |
| file.remove(out_file) |
| # ---- NO fallback required → return primary, DO NOTHING ELSE | ||
| if (length(fill_vars) == 0) { | ||
| return(primary_cf) | ||
| } |
There was a problem hiding this comment.
Would it make sense to copy primary_cf to out_file here? I think many users would be surprised if they provide a destination path and find it unused after the function reports success.
There was a problem hiding this comment.
I agree returning success without creating the requested output file would be surprising from an API perspective.
I’ve updated the behaviour so that when no fallback is required, the function copies primary_cf to out_file and returns the output path. This keeps the API consistent regardless of whether filling occurs.
| @@ -0,0 +1,52 @@ | |||
| context("metgapfill_with_fallback") | |||
There was a problem hiding this comment.
Nit: context is deprecated on the grounds it's better to encode that information in informative filenames -- as you've already done.
| context("metgapfill_with_fallback") |
| ) | ||
|
|
||
| # ---- verify behavior | ||
| expect_identical(result, primary) |
There was a problem hiding this comment.
| expect_identical(result, primary) | |
| # Note: this checks for identical _paths_ and doesn't compare file contents | |
| expect_identical(result, primary) |
| expect_identical(result, primary) | ||
| expect_false(file.exists(out)) | ||
| } | ||
| ) |
There was a problem hiding this comment.
The above seems like a fair test of the case where no filling is needed -- though as per my above comments, I'd favor an API change that would make out need to exist and have identical contents to primary.
I do recommend adding at least some test cases where values do get filled, since that's the bulk of the code paths and the part most subject to corner cases in need of testing.
|
Thanks for working on this! I left some comments above that include nitpicks, but high-level my suggestions relate to your middle question ("Whether the metgapfill_with_fallback() interface feels right and flexible enough for future ERA5 / ERA5-Land integration"): I think my answer is broadly yes but I suggest framing it not as specific to ERA5 but about care to define what files are supported now, which are planned, and which are out of scope:
|
Thanks, I’ll push a follow up update to this PR shortly with the suggested tweaks and clearer docs. Yes, the intent here is to keep this CF-level helper conservative and explicit about its scope...rather than tying it to ERA5 specifically. I agree it’s important to be clear out what’s supported today versus what’s intentionally out of scope for now. Based on how things are implemented at the moment, this is the scope I had in mind:
|
d47561c to
25ce40b
Compare
|
Hi @infotroph thanks again for the feedback. I’ve pushed a round of updates addressing the points raised, and wanted to briefly summarise the current state What changed
At this stage the helper is intentionally conservative and scoped to:
|
|
One question for the next stage of #3605: |
|
hi mentors, Thanks again for the feedback here!!! |
Refactor AmeriFlux workflow to use metgapfill_with_fallback for CF-level fallback (refs #3605)
Description
This PR implements the core CF-level refactor described in #3605 by introducing
metgapfill_with_fallback()as the single reusable entry point for coverage-basedfallback handling and CF NetCDF merging.
Dataset-specific fallback preparation (e.g. ERA5 vs ERA5-Land download, conversion,
and gap-filling orchestration) is intentionally deferred to higher-level workflows
(e.g.
met.process()), keeping this helper CF-pure, reusable, and testable.Specifically, this PR:
merge_cf_met_files()) with optional time alignment.metgapfill_with_fallback()that:AmeriFlux_met_ensemble()into a thin workflow that:metgapfill_with_fallback()once,All ERA5-specific logic (download, ERA5 vs ERA5-Land selection, conversion, and
gap-filling) is explicitly out of scope for this PR and will be handled in
subsequent stages of the #3605 refactor.
Files changed
modules/data.atmosphere/R/metgapfill_with_fallback.Rmodules/data.atmosphere/R/Ameriflux_met_ensemble.RTests added / updated
tests/testthat/test-metgapfill_with_fallback.Runchanged when coverage is sufficient.
Motivation and Context
Issue #3605 identified that the AmeriFlux meteorological workflow relied on
inline NetCDF manipulation and duplicated fallback logic, making it difficult
to maintain and reuse across workflows.
This PR delivers the first milestone of #3605 by:
Refs: #3605
Related preparatory work: #3789
Review Time Estimate
Types of changes
Checklist:
external services and may be skipped or fail in minimal environments.