Refactor AmeriFlux workflow to delegate CF fallback handling to metgapfill_with_fallback (refs #3605) #3793

Closed
Akash-paluvai wants to merge 7 commits into PecanProject:develop from Akash-paluvai:GH-3605-metgapfill-with-fallback

Conversation

@Akash-paluvai
Contributor

@Akash-paluvai Akash-paluvai commented Jan 23, 2026

Refactor AmeriFlux workflow to use metgapfill_with_fallback for CF-level fallback (refs #3605)

Description

This PR implements the core CF-level refactor described in #3605 by introducing
metgapfill_with_fallback() as the single reusable entry point for coverage-based
fallback handling and CF NetCDF merging.

Dataset-specific fallback preparation (e.g. ERA5 vs ERA5-Land download, conversion,
and gap-filling orchestration) is intentionally deferred to higher-level workflows
(e.g. met.process()), keeping this helper CF-pure, reusable, and testable.

Specifically, this PR:

  • Adds a CF-safe merge helper (merge_cf_met_files()) with optional time alignment.
  • Implements metgapfill_with_fallback() that:
    • checks CF variable coverage,
    • determines which variables require fallback filling,
    • merges fallback CF data when required,
    • never edits input files in place.
  • Refactors AmeriFlux_met_ensemble() into a thin workflow that:
    • converts AmeriFlux CSV → CF,
    • calls metgapfill_with_fallback() once,
    • generates ensembles from the returned CF file.
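
A minimal sketch of the coverage check listed above, assuming each CF variable has already been read into a numeric vector. `variable_coverage()` and `vars_needing_fallback()` are hypothetical names used for illustration, not functions from this PR, and the 0.9 default is an assumption. Only the `coverage_threshold` and `fill_vars` names come from the PR itself.

```r
# Hypothetical helper: fraction of non-missing values per variable.
# `met` is a named list of numeric vectors standing in for CF variables.
variable_coverage <- function(met) {
  vapply(met, function(x) mean(!is.na(x)), numeric(1))
}

# Hypothetical helper: names of variables whose coverage is below the threshold.
vars_needing_fallback <- function(met, coverage_threshold = 0.9) {
  cov <- variable_coverage(met)
  names(cov)[cov < coverage_threshold]
}

primary <- list(
  air_temperature = c(280.1, NA, 281.3, 282.0),  # 75% coverage
  wind_speed      = c(2.1, 2.4, 1.9, 2.2)        # 100% coverage
)

fill_vars <- vars_needing_fallback(primary, coverage_threshold = 0.9)
print(fill_vars)  # "air_temperature"
```

An empty `fill_vars` is the signal that no fallback merge is needed.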

All ERA5-specific logic (download, ERA5 vs ERA5-Land selection, conversion, and
gap-filling) is explicitly out of scope for this PR and will be handled in
subsequent stages of the #3605 refactor.

Files changed

  • modules/data.atmosphere/R/metgapfill_with_fallback.R
  • modules/data.atmosphere/R/Ameriflux_met_ensemble.R

Tests added / updated

  • tests/testthat/test-metgapfill_with_fallback.R
    • Verifies that fallback logic is triggered only when CF coverage is insufficient.
    • Verifies that no output file is created and the primary CF file is returned
      unchanged when coverage is sufficient.

Motivation and Context

Issue #3605 identified that the AmeriFlux meteorological workflow relied on
inline NetCDF manipulation and duplicated fallback logic, making it difficult
to maintain and reuse across workflows.

This PR delivers the first milestone of #3605 by:

  • extracting CF-level fallback logic into a reusable helper,
  • eliminating ad-hoc NetCDF surgery in AmeriFlux workflows,
  • preventing premature or duplicated gap-filling,
  • establishing a clean architectural boundary between:
    • CF-level operations (this PR), and
    • dataset-specific fallback preparation (future work).

Refs: #3605
Related preparatory work: #3789

Review Time Estimate

  • Immediately
  • Within one week
  • When possible

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change adding reusable CF-level fallback handling)
  • Refactor (architecture-aligned restructuring, no API breakage)
  • Breaking change

Checklist:

  • My change requires a change to the documentation.
  • My name is in the list of CITATION.cff
  • I agree that PEcAn Project may distribute my contribution under any or all of:
    • the same license as the existing code,
    • and/or the BSD 3-clause license.
  • I have updated the CHANGELOG.md.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
    • Note: some existing tests depend on optional packages, credentials, or live
      external services and may be skipped or fail in minimal environments.

@Akash-paluvai
Contributor Author

Hi @infotroph @mdietze

Just to clarify the scope up front: this PR only covers the CF-level milestone of #3605. The idea here is to keep metgapfill_with_fallback() focused on coverage checking and CF NetCDF merging only.

Anything dataset-specific (like deciding between ERA5 vs ERA5-Land, downloading data, converting to CF, or running gap-filling) is intentionally left out for now and deferred to follow-up work (e.g. wiring this into met.process()), in line with the staged refactor discussed in #3605.

I’d really appreciate feedback on:

Very happy to revise. Thanks!

#' Merge CF-compliant meteorological NetCDF files
#'
#' Creates a new CF-compliant NetCDF file by merging selected variables
#' from a secondary CF NetCDF file into a primary CF NetCDF file.
Member

It's not clear to me from this description what "primary" and "secondary" mean -- can you describe it more precisely in terms of what happens to the values from each one? From the code below it seems like this function fills NAs in primary_cf with matching values from secondary_cf, where "matching" is determined by variable name and timestamp -- is that correct?

Might also be worth calling the function something different -- "merge" often implies a join-like operation where whole columns get taken from one dataset or the other, while this only fills where explicitly missing. By analogy to dplyr::coalesce() and the "null coalescing operator" %||%, maybe this could be something like coalesce_na_cf_met? Better ideas welcome.

Contributor Author

Thanks for pointing this out; you’re right that “primary” and “secondary” were ambiguous in the earlier description.
I’ve clarified the documentation to explicitly state that:

  • values from primary_cf are preserved wherever they are non-missing,
  • missing values in primary_cf are filled using matching values from secondary_cf,
  • matching is determined by variable name and timestamp intersection.

I’ve also renamed the helper from merge_cf_met_files() to coalesce_na_cf_met() to better reflect the intended semantics (i.e., null-coalescing behaviour rather than a join-like merge).

Let me know if the current wording better captures the behavior.
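
To illustrate the semantics described above, here is a minimal sketch that works on in-memory vectors rather than NetCDF files. `coalesce_by_time()` is a hypothetical stand-in for illustration and is not the implementation of coalesce_na_cf_met().

```r
# Hypothetical stand-in for the coalescing step on plain vectors.
coalesce_by_time <- function(primary_time, primary_values,
                             secondary_time, secondary_values) {
  # Position of each primary timestamp in the secondary axis (NA if absent)
  idx <- match(primary_time, secondary_time)
  # Fill only where primary is missing AND secondary has a matching timestamp
  fillable <- is.na(primary_values) & !is.na(idx)
  out <- primary_values
  out[fillable] <- secondary_values[idx[fillable]]
  out
}

t_primary   <- as.POSIXct("2015-01-01 00:00", tz = "UTC") + 3600 * 0:3
t_secondary <- as.POSIXct("2015-01-01 00:00", tz = "UTC") + 3600 * 1:4

filled <- coalesce_by_time(
  t_primary,   c(NA, NA, 3, NA),
  t_secondary, c(20, 30, 40, 50)
)
print(filled)  # NA 20 3 40
```

The first value stays missing because the secondary axis has no matching timestamp, and the third value is kept because the primary value was already present.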

Comment on lines +25 to +27
# TODO(#3605): align CF time axes using PEcAn.utils::cf2datetime()
# TODO(#3605): error on non-overlapping time axes
# TODO(#3605): consider aggregation/repeat logic in future PR
Member

As you know if you got this far 😁, netCDF formatting has dragons in it. Further complications to consider:

  • Do these TODOs include handling cases where the time intervals aren't identical? e.g. half-hourly file A filled from hourly file B?
  • Is this function intended to support files with multiple dimensions, e.g. spatial coordinates or ensemble dimensions in addition to time? If so my (untested) hunch is that the existing way of indexing on time might not be enough to uniquely identify a missing value.

Contributor Author

Agreed. To clarify the current scope:

  • align_time = TRUE assumes identical temporal resolution in both files.
  • It converts CF time to POSIXct, takes the intersection of timestamps, and subsets both files accordingly.
  • It does not perform resampling, aggregation, interpolation, or time-range extension.

Handling differing temporal resolutions (e.g., half-hourly vs hourly) or temporal extension is intentionally out of scope for this helper and would require a more explicit design decision.

Similarly, this implementation currently assumes variables are one-dimensional along time only. Files with additional spatial, ensemble, or depth dimensions are not supported and are explicitly documented as out of scope.
I’ve updated the documentation to make these assumptions explicit rather than implied.
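
A minimal sketch of the alignment steps described above (convert CF time to POSIXct, intersect, subset), using base R only. `cf_time_to_posixct()` is a hypothetical parser written for this example. It handles only simple "<unit> since <origin>" strings with a UTC origin, and it is not the PEcAn.utils::cf2datetime() helper named in the TODOs.

```r
# Hypothetical parser for CF time units of the form "<unit> since <origin>".
# Supports seconds/minutes/hours/days only and assumes a UTC origin.
cf_time_to_posixct <- function(values, units) {
  parts <- strsplit(units, " since ", fixed = TRUE)[[1]]
  secs_per_unit <- c(seconds = 1, minutes = 60, hours = 3600, days = 86400)
  mult <- secs_per_unit[[parts[1]]]
  origin <- as.POSIXct(parts[2], tz = "UTC")
  origin + values * mult
}

# Two time axes with different units but the same hourly resolution
t_a <- cf_time_to_posixct(0:3, "hours since 2015-01-01 00:00:00")
t_b <- cf_time_to_posixct(c(120, 180, 240), "minutes since 2015-01-01 00:00:00")

# Indices of the shared timestamps in each file's time axis
keep_a <- which(t_a %in% t_b)
keep_b <- which(t_b %in% t_a)
print(keep_a)  # 3 4
print(keep_b)  # 1 2
```

Subsetting each file's variables by `keep_a` and `keep_b` leaves only the overlapping timestamps. Nothing here resamples or interpolates.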

Comment on lines +3 to +5
#' Orchestrates coverage checking and conditional merging of a fallback
#' CF NetCDF file into a primary CF NetCDF file. No files are modified
#' in place.
Member

Same comment as above about "primary" not being clear in context

Comment on lines +54 to +57
# ---- fallback required → ensure clean output path
if (file.exists(out_file)) {
  file.remove(out_file)
}
Member

Duplicate of lines 36-39 above?

#' @param fallback_cf character. Path to fallback CF NetCDF file
#' @param out_file character. Path to output CF NetCDF file
#' @param coverage_threshold numeric. Minimum acceptable coverage (0–1)
#' @param align_time logical. Whether to align CF time axes before merging
Member

Say more about what aligning means here.

(A good rule for documenting function arguments is to avoid using words that are part of the argument name when you write the argument description. This isn't an absolute but practicing it helps prevent circular definitions)


# ---- enforce test contract: out_file must NOT exist
if (file.exists(out_file)) {
file.remove(out_file)
Member

Suggested change
file.remove(out_file)
file.remove(out_file)

Comment on lines +49 to +52
# ---- NO fallback required → return primary, DO NOTHING ELSE
if (length(fill_vars) == 0) {
  return(primary_cf)
}
Member

Would it make sense to copy primary_cf to out_file here? I think many users would be surprised if they provide a destination path and find it unused after the function reports success.

Contributor Author

I agree returning success without creating the requested output file would be surprising from an API perspective.
I’ve updated the behaviour so that when no fallback is required, the function copies primary_cf to out_file and returns the output path. This keeps the API consistent regardless of whether filling occurs.
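
A minimal sketch of that contract, assuming the no-fill branch simply copies the input. `finalize_output()` is a hypothetical wrapper for illustration and the placeholder text file stands in for a real NetCDF file. Only `primary_cf`, `out_file`, and `fill_vars` are names from the PR.

```r
# Hypothetical wrapper showing only the "out_file always exists" contract.
finalize_output <- function(primary_cf, out_file, fill_vars) {
  if (length(fill_vars) == 0) {
    # Nothing to fill: copy the input so the requested path exists
    ok <- file.copy(primary_cf, out_file, overwrite = TRUE)
    if (!ok) stop("Could not copy ", primary_cf, " to ", out_file)
    return(out_file)
  }
  stop("Filling is not part of this sketch")
}

primary <- tempfile(fileext = ".nc")
out     <- tempfile(fileext = ".nc")
writeLines("placeholder contents", primary)  # stand-in for a real NetCDF file

result <- finalize_output(primary, out, fill_vars = character(0))
```

With this shape, callers can always read from the returned path without checking whether filling happened.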

@@ -0,0 +1,52 @@
context("metgapfill_with_fallback")
Member

Nit: context is deprecated on the grounds it's better to encode that information in informative filenames -- as you've already done.

Suggested change
context("metgapfill_with_fallback")

)

# ---- verify behavior
expect_identical(result, primary)
Member

@infotroph infotroph Jan 29, 2026

Suggested change
expect_identical(result, primary)
# Note: this checks for identical _paths_ and doesn't compare file contents
expect_identical(result, primary)

expect_identical(result, primary)
expect_false(file.exists(out))
}
)
Member

The above seems like a fair test of the case where no filling is needed, though as per my comments above, I'd favor an API change that would require out to exist and have contents identical to primary.

I do recommend adding at least some test cases where values do get filled, since that's the bulk of the code paths and the part most subject to corner cases in need of testing.

@infotroph
Member

Thanks for working on this! I left some comments above that include nitpicks, but high-level my suggestions relate to your middle question ("Whether the metgapfill_with_fallback() interface feels right and flexible enough for future ERA5 / ERA5-Land integration"): I think my answer is broadly yes but I suggest framing it not as specific to ERA5 but about care to define what files are supported now, which are planned, and which are out of scope:

  • Which parts of the CF conventions need to be followed for this merge to work right? Do any of these need to be enforced internally or is it sufficient to tell users the function is designed for CF met and untested beyond that?
  • Are the TODOs about "non-overlapping" timestamps recording intent to support files with differing timesteps in the same time interval (e.g. A half-hourly, B hourly, both covering the year 2015), extending time ranges (e.g. A ends in 2015, 2016 data gets pulled in from B), or something else? Will there be limits on how much they can differ?
  • How about files with non-time dimensions? If supported at all, do the two files have to have identical dimensions or can they differ?
  • Will var continue to be required to contain only variables that exist in both files with identical names and units? That seems right, just worth documenting it explicitly.

@Akash-paluvai
Contributor Author

Thanks, I’ll push a follow-up update to this PR shortly with the suggested tweaks and clearer docs.

Yes, the intent here is to keep this CF-level helper conservative and explicit about its scope, rather than tying it to ERA5 specifically. I agree it’s important to be clear about what’s supported today versus what’s intentionally out of scope for now.

Based on how things are implemented at the moment, this is the scope I had in mind:

  • CF conventions: the inputs are expected to be CF-compliant meteorological NetCDF files as produced by existing PEcAn workflows, which basically means a valid time variable and consistent variable naming. Beyond what’s needed for time handling and accessing variables, behavior is untested right now, so I think documenting that expectation is enough for the time being rather than enforcing broader CF checks in the code.

  • Time handling: align_time currently assumes the two files have the same temporal resolution and some overlapping timestamps. What it does today is convert CF time to POSIXct, take the intersection of timestamps, and subset both files. It doesn’t do any resampling, aggregation, interpolation, or time-range extension, and it doesn’t try to handle differing timesteps. The TODOs are meant as notes for possible future work, not to suggest those cases are supported now. I’ll make that clearer.

  • Non-time dimensions: for now this helper is really intended for time-only variables. Files with extra dimensions (like spatial grids, ensembles, or depth) are out of scope and untested, and I’ll document that explicitly rather than leaving it ambiguous.

  • Variables (vars): yes, the intent is that vars only includes variables that exist in both files with the same names (and assumed compatible units), and that only missing values in the primary file get filled. I agree that’s the right constraint and I’ll call it out clearly in the docs.
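
A minimal sketch of how the vars constraint could be checked up front, assuming the units of each variable have already been read into named character vectors. `check_fill_vars()` and its arguments are hypothetical and not part of this PR. The strict string comparison of units is a simplification of "compatible units".

```r
# Hypothetical check. `primary_units` and `secondary_units` are named character
# vectors (variable name -> units string) standing in for NetCDF metadata.
check_fill_vars <- function(vars, primary_units, secondary_units) {
  shared <- intersect(names(primary_units), names(secondary_units))
  missing_vars <- setdiff(vars, shared)
  if (length(missing_vars) > 0) {
    stop("Not present in both files: ", paste(missing_vars, collapse = ", "))
  }
  # Strict equality of units strings; no unit conversion is attempted
  mismatched <- vars[primary_units[vars] != secondary_units[vars]]
  if (length(mismatched) > 0) {
    stop("Units differ for: ", paste(mismatched, collapse = ", "))
  }
  invisible(vars)
}

p_units <- c(air_temperature = "K", wind_speed = "m s-1")
s_units <- c(air_temperature = "degC", wind_speed = "m s-1")

ok  <- check_fill_vars("wind_speed", p_units, s_units)
err <- tryCatch(
  check_fill_vars("air_temperature", p_units, s_units),
  error = function(e) conditionMessage(e)
)
print(err)  # "Units differ for: air_temperature"
```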

@Akash-paluvai Akash-paluvai force-pushed the GH-3605-metgapfill-with-fallback branch from d47561c to 25ce40b on February 24, 2026 21:07
@Akash-paluvai
Contributor Author

Hi @infotroph, thanks again for the feedback. I’ve pushed a round of updates addressing the points raised, and wanted to briefly summarise the current state.

What changed

  • Renamed merge_cf_met_files() → coalesce_na_cf_met() to better reflect the intended null-coalescing behaviour rather than a join-style merge.
  • Clarified the documentation to more explicitly describe how values from primary_cf and secondary_cf are handled.
  • Made the API consistent: when no fallback is needed, the function now copies primary_cf to out_file and returns the output path.
  • Removed the duplicated output-file handling and simplified the surrounding logic.
  • Clarified what align_time currently does and documented its present scope/limitations (identical temporal resolution, time-only variables, no interpolation or resampling).
  • Expanded tests to cover both the no-fill case and the fill-required case to ensure missing values are correctly coalesced.

At this stage the helper is intentionally conservative and scoped to:

  • Filling only missing values in primary_cf
  • Matching by variable name and overlapping timestamps
  • Time-only variables with identical temporal resolution

@Akash-paluvai
Contributor Author

One question for the next stage of #3605:
Should this CF-level helper stay strictly minimal (pure coalescing only), or do you see it eventually handling controlled time-resolution differences (e.g., hourly fallback filling half-hourly gaps)?
I want to be sure the current interface aligns with the intended long-term direction before building the next layer.

@Akash-paluvai
Contributor Author

Hi mentors,
All changes from this PR are now included in #3844, so I’m closing this one to avoid duplicate review and keep the discussion in a single place.

Thanks again for the feedback here!
