Refactor AmeriFlux workflow to delegate CF fallback handling to metgapfill_with_fallback (refs #3605) by Akash-paluvai · Pull Request #3793 · PecanProject/pecan

Akash-paluvai · 2026-01-23T19:34:44Z

Refactor AmeriFlux workflow to use metgapfill_with_fallback for CF-level fallback (refs #3605)

Description

This PR implements the core CF-level refactor described in #3605 by introducing
metgapfill_with_fallback() as the single reusable entry point for coverage-based
fallback handling and CF NetCDF merging.

Dataset-specific fallback preparation (e.g. ERA5 vs ERA5-Land download, conversion,
and gap-filling orchestration) is intentionally deferred to higher-level workflows
(e.g. met.process()), keeping this helper CF-pure, reusable, and testable.

Specifically, this PR:

- Adds a CF-safe merge helper (merge_cf_met_files()) with optional time alignment.
- Implements metgapfill_with_fallback() that:
  - checks CF variable coverage,
  - determines which variables require fallback filling,
  - merges fallback CF data when required,
  - never edits input files in place.
- Refactors AmeriFlux_met_ensemble() into a thin workflow that:
  - converts AmeriFlux CSV → CF,
  - calls metgapfill_with_fallback() once,
  - generates ensembles from the returned CF file.
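
As an illustration of the coverage step listed above, here is a minimal sketch in base R. It is not the PR's implementation: the helper name `vars_needing_fallback()`, the `min_coverage` argument, and its default value are assumptions made for this example, and the data are plain numeric vectors rather than values read from a CF NetCDF file.

```r
# Hypothetical helper, for illustration only (not the PR's code).
# `cf_values` is a named list of numeric vectors, one per CF variable.
# `min_coverage` is an assumed threshold; the PR does not state one.
vars_needing_fallback <- function(cf_values, min_coverage = 0.9) {
  # Fraction of non-missing values for each variable.
  coverage <- vapply(cf_values, function(x) mean(!is.na(x)), numeric(1))
  # Names of the variables whose coverage falls below the threshold.
  names(coverage)[coverage < min_coverage]
}

primary <- list(
  air_temperature    = c(280.1, 281.3, NA, 282.0),  # 75% coverage
  precipitation_flux = c(NA, NA, NA, 0.002)         # 25% coverage
)

needs_fill <- vars_needing_fallback(primary, min_coverage = 0.7)
print(needs_fill)
```

With these made-up inputs only `precipitation_flux` falls below the assumed threshold, so only that variable would be passed on to the fallback merge.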

All ERA5-specific logic (download, ERA5 vs ERA5-Land selection, conversion, and
gap-filling) is explicitly out of scope for this PR and will be handled in
subsequent stages of the #3605 refactor.

Files changed

modules/data.atmosphere/R/metgapfill_with_fallback.R
modules/data.atmosphere/R/Ameriflux_met_ensemble.R

Tests added / updated

tests/testthat/test-metgapfill_with_fallback.R
- Verifies that fallback logic is triggered only when CF coverage is insufficient.
- Verifies that no output file is created and the primary CF file is returned
  unchanged when coverage is sufficient.

Motivation and Context

Issue #3605 identified that the AmeriFlux meteorological workflow relied on
inline NetCDF manipulation and duplicated fallback logic, making it difficult
to maintain and reuse across workflows.

This PR delivers the first milestone of #3605 by:

- extracting CF-level fallback logic into a reusable helper,
- eliminating ad-hoc NetCDF surgery in AmeriFlux workflows,
- preventing premature or duplicated gap-filling,
- establishing a clean architectural boundary between:
  - CF-level operations (this PR), and
  - dataset-specific fallback preparation (future work).

Refs: #3605
Related preparatory work: #3789

Review Time Estimate

Immediately
Within one week
When possible

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change adding reusable CF-level fallback handling)
Refactor (architecture-aligned restructuring, no API breakage)
Breaking change

Checklist:

My change requires a change to the documentation.
My name is in the list of CITATION.cff
I agree that PEcAn Project may distribute my contribution under any or all of:
- the same license as the existing code,
- and/or the BSD 3-clause license.
I have updated the CHANGELOG.md.
I have updated the documentation accordingly.
I have read the CONTRIBUTING document.
I have added tests to cover my changes.
All new and existing tests passed.
- Note: some existing tests depend on optional packages, credentials, or live
  external services and may be skipped or fail in minimal environments.

Akash-paluvai · 2026-01-23T19:39:29Z

Hi @infotroph @mdietze

Just to clarify the scope up front: this PR only covers the CF-level milestone of #3605. The idea here is to keep metgapfill_with_fallback() focused on coverage checking and CF NetCDF merging only.

Anything dataset-specific (like deciding between ERA5 vs ERA5-Land, downloading data, converting to CF, or running gap-filling) is intentionally left out for now and deferred to follow-up work (e.g. wiring this into met.process()), in line with the staged refactor discussed in #3605.

I’d really appreciate feedback on:

Whether this separation of responsibilities matches the intended architecture for #3605 (Compose AmeriFlux met ensemble from modular PEcAn steps with ERA5/ERA5-Land fallback)
Whether the metgapfill_with_fallback() interface feels right and flexible enough for future ERA5 / ERA5-Land integration
Anything I should simplify or adjust to keep the AmeriFlux workflow as thin and reusable as possible

Very happy to revise. Thanks!

infotroph · 2026-01-29T18:03:19Z

modules/data.atmosphere/R/merge_cf_met_files.R

+#' Merge CF-compliant meteorological NetCDF files
+#'
+#' Creates a new CF-compliant NetCDF file by merging selected variables
+#' from a secondary CF NetCDF file into a primary CF NetCDF file.


It's not clear to me from this description what "primary" and "secondary" mean -- can you describe it more precisely in terms of what happens to the values from each one? From the code below it seems like this function fills NAs in primary_cf with matching values from secondary_cf, where "matching" is determined by variable name and timestamp -- is that correct?

Might also be worth calling the function something different -- "merge" often implies a join-like operation where whole columns get taken from one dataset or the other, while this only fills where explicitly missing. By analogy to dplyr::coalesce() and the "null coalescing operator" %||%, maybe this could be something like coalesce_na_cf_met? Better ideas welcome.

Thanks for pointing this out. You’re right that “primary” and “secondary” were ambiguous in the earlier description.
I’ve clarified the documentation to state explicitly that:

values from primary_cf are preserved wherever they are non-missing,

missing values in primary_cf are filled using matching values from secondary_cf,

matching is determined by variable name and timestamp intersection.

I’ve also renamed the helper from merge_cf_met_files() to coalesce_na_cf_met() to better reflect the intended semantics (i.e., null-coalescing behaviour rather than a join-like merge).
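
To make the fill rule concrete, here is a minimal sketch in base R using data frames in place of NetCDF files. It is an illustration under stated assumptions, not the code in this PR: `coalesce_by_time()` is a hypothetical name, and the values are made up.

```r
# Hypothetical illustration of the fill rule; not the PR's code.
# `primary` and `secondary` are data frames with a POSIXct `time` column.
coalesce_by_time <- function(primary, secondary, vars) {
  # Row of `secondary` matching each `primary` timestamp (NA if none).
  idx <- match(as.numeric(primary$time), as.numeric(secondary$time))
  # Only variables present in both data frames can be filled.
  for (v in intersect(vars, intersect(names(primary), names(secondary)))) {
    # Fill only where primary is missing AND secondary has that timestamp.
    to_fill <- is.na(primary[[v]]) & !is.na(idx)
    primary[[v]][to_fill] <- secondary[[v]][idx[to_fill]]
  }
  primary  # modified copy; the caller's data frame is left untouched
}

time <- as.POSIXct("2020-01-01 00:00:00", tz = "UTC") + 3600 * 0:3
primary   <- data.frame(time = time, air_temperature = c(280, NA, 282, NA))
secondary <- data.frame(time = time[2:3], air_temperature = c(999, 998))

filled <- coalesce_by_time(primary, secondary, "air_temperature")
print(filled$air_temperature)
```

In this example the second value is filled from `secondary`, the third keeps its `primary` value even though `secondary` also has one, and the fourth stays `NA` because `secondary` has no matching timestamp.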

Let me know if the current wording better captures the behavior.

infotroph · 2026-01-29T18:25:34Z

modules/data.atmosphere/R/merge_cf_met_files.R

+  # TODO(#3605): align CF time axes using PEcAn.utils::cf2datetime()
+  # TODO(#3605): error on non-overlapping time axes
+  # TODO(#3605): consider aggregation/repeat logic in future PR


As you know if you got this far 😁, netCDF formatting has dragons in it. Further complications to consider:

Do these TODOs include handling cases where the time intervals aren't identical? e.g. half-hourly file A filled from hourly file B?

Is this function intended to support files with multiple dimensions, e.g. spatial coordinates or ensemble dimensions in addition to time? If so my (untested) hunch is that the existing way of indexing on time might not be enough to uniquely identify a missing value.

Agreed.
To clarify current scope:

align_time = TRUE assumes identical temporal resolution in both files.

It converts CF time to POSIXct, takes the intersection of timestamps, and subsets both files accordingly.

It does not perform resampling, aggregation, interpolation, or time-range extension.

Handling differing temporal resolutions (e.g., half-hourly vs hourly) or temporal extension is intentionally out of scope for this helper and would require a more explicit design decision.

Similarly, this implementation currently assumes variables are one-dimensional along time only. Files with additional spatial, ensemble, or depth dimensions are not supported and are explicitly documented as out of scope.
I’ve updated the documentation to make these assumptions explicit rather than implied.
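
For reference, the alignment described above (decode CF time, intersect timestamps, subset both axes) could be sketched roughly as follows. This is a base R illustration only: `cf_time_to_posixct()` is a hypothetical, simplified stand-in for real CF time decoding, and it handles only fixed-length units in UTC.

```r
# Simplified, hypothetical stand-in for CF time decoding. Handles only
# "<unit> since YYYY-MM-DD HH:MM:SS" with fixed-length units, in UTC.
cf_time_to_posixct <- function(values, units) {
  parts <- strsplit(units, " since ", fixed = TRUE)[[1]]
  seconds_per <- c(seconds = 1, minutes = 60, hours = 3600, days = 86400)
  stopifnot(length(parts) == 2, parts[1] %in% names(seconds_per))
  origin <- as.POSIXct(parts[2], tz = "UTC")
  origin + values * seconds_per[[parts[1]]]
}

# Two hourly time axes with different units and different ranges.
t_primary   <- cf_time_to_posixct(0:5, "hours since 2020-01-01 00:00:00")
t_secondary <- cf_time_to_posixct(c(180, 240, 300, 360, 420),
                                  "minutes since 2020-01-01 00:00:00")

# Intersect on the numeric representation, then subset both axes.
common <- intersect(as.numeric(t_primary), as.numeric(t_secondary))
if (length(common) == 0) stop("time axes do not overlap")
keep_primary   <- which(as.numeric(t_primary) %in% common)
keep_secondary <- which(as.numeric(t_secondary) %in% common)

print(t_primary[keep_primary])
```

Only timestamps present in both axes survive; nothing is resampled, interpolated, or extended, which matches the scope described above.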

infotroph · 2026-01-29T19:00:23Z