lnk_config: config bundle loader for pipeline variants #37

@NewGraphEnvironment

Problem

The habitat classification pipeline has dozens of knobs — rules YAML, parameters_fresh.csv, wsg_species_presence.csv, override CSV filenames, break source order, cluster params, connected-waterbody rules per species. These are currently scattered: some in inst/extdata/, some in data-raw/compare_bcfishpass.R, some hardcoded in script logic.

We're about to build pipeline variants (bcfishpass-matching validation config, newgraph defaults, min-spawn, channel-type-first breaking). Each variant needs its own bundle of these knobs. Without a config abstraction, every variant becomes a new script with copy-paste drift.

This is the foundation for a proper _targets.R pipeline — targets can't parallelize across variants cleanly if the config is scattered across the filesystem.

Proposed function

lnk_config(name_or_path)
  • name_or_path: either a bundled config name ("bcfishpass", "default") or an absolute path to a custom config directory
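
A minimal sketch of how the name-or-path argument could be resolved, using base R only. The helper name `lnk_config_dir()`, the `package = "link"` default, and the regular expression used to detect an absolute path are all assumptions for illustration, not the final API.

```r
# Hypothetical helper: resolve a bundled name or an absolute path to a config dir.
# The package name "link" and this function name are assumptions.
lnk_config_dir <- function(name_or_path, package = "link") {
  stopifnot(is.character(name_or_path), length(name_or_path) == 1L)

  # Treat "/...", "~..." and "C:/..." style inputs as paths; anything else is a name.
  is_path <- grepl("^(/|~|[A-Za-z]:[/\\\\])", name_or_path)

  if (is_path) {
    dir <- path.expand(name_or_path)
  } else {
    # system.file() returns "" when the package or directory is not found.
    dir <- system.file("extdata", "configs", name_or_path, package = package)
  }

  if (!nzchar(dir) || !dir.exists(dir)) {
    stop("Config directory not found for '", name_or_path, "'", call. = FALSE)
  }
  dir
}
```

With this split, `lnk_config_dir("bcfishpass")` would look inside the installed package, while `lnk_config_dir("/some/abs/path")` would use the directory as given.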

Returns: an lnk_config S3 list with named slots:

list(
  name              = "bcfishpass",
  dir               = "<path to config dir>",
  rules_yaml        = "<path to rules.yaml>",           # built artifact, consumed by frs_habitat_classify
  dimensions_csv    = "<path to dimensions.csv>",       # source of rules.yaml, traceability
  parameters_fresh  = tibble(...),                      # loaded
  wsg_species       = tibble(...),                      # loaded
  observation_excl  = tibble(...),                      # loaded
  overrides         = list(                             # each CSV loaded as a tibble
    modelled_fixes       = tibble(...),
    pscis_barrier_status = tibble(...),
    pscis_xref           = tibble(...),
    barriers_definite    = tibble(...)
  ),
  break_order       = c("observations", "gradient_minimal", "habitat_endpoints", "crossings"),
  cluster_params    = list(three_phase = TRUE, distance_cap = ...),
  spawn_connected   = list(SK = list(gradient_max = 0.05, ...))
)
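
One way the return shape above could be enforced is a small S3 constructor that checks for the documented slots before tagging the list. This is a sketch only: `new_lnk_config()`, `lnk_config_slots`, and the print method are hypothetical names, and the slot list is copied from the proposal above.

```r
# Slot names taken from the proposed return shape in this issue.
lnk_config_slots <- c(
  "name", "dir", "rules_yaml", "dimensions_csv", "parameters_fresh",
  "wsg_species", "observation_excl", "overrides", "break_order",
  "cluster_params", "spawn_connected"
)

# Hypothetical constructor: checks slot names, then tags the list with the S3 class.
new_lnk_config <- function(x) {
  stopifnot(is.list(x))
  missing_slots <- setdiff(lnk_config_slots, names(x))
  if (length(missing_slots) > 0L) {
    stop("lnk_config is missing slot(s): ",
         paste(missing_slots, collapse = ", "), call. = FALSE)
  }
  structure(x, class = c("lnk_config", "list"))
}

# Compact print method so eagerly loaded tables do not flood the console.
print.lnk_config <- function(x, ...) {
  cat("<lnk_config> ", x$name, "\n", sep = "")
  cat("  dir:       ", x$dir, "\n", sep = "")
  cat("  overrides: ", paste(names(x$overrides), collapse = ", "), "\n", sep = "")
  invisible(x)
}
```

Because the object is still a plain list underneath, `cfg$rules_yaml` and `str(cfg)` keep working as usual.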

Directory layout (convention)

inst/extdata/configs/<name>/
├── config.yaml                              # top-level manifest, points to all below
├── rules.yaml                               # built from dimensions.csv
├── dimensions.csv                           # source of rules.yaml
├── parameters_fresh.csv
├── wsg_species_presence.csv
├── observation_exclusions.csv
├── overrides/
│   ├── user_modelled_crossing_fixes.csv
│   ├── user_pscis_barrier_status.csv
│   ├── pscis_modelledcrossings_streams_xref.csv
│   └── user_barriers_definite.csv
└── README.md                                # describes the variant, what it's for

config.yaml is the manifest; every other file path in it is relative to the config dir. Custom configs are portable: drop a directory anywhere and pass its absolute path.
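
The relative-path rule could be implemented roughly as follows. This sketch assumes the manifest has already been parsed into a named list of relative paths (for example with the yaml package); the helper name `lnk_manifest_paths()` and the manifest key names are hypothetical, since the manifest schema is not fixed in this issue.

```r
# Hypothetical helper: turn manifest entries (relative paths) into paths under
# `dir`, and fail with the offending slot name when a referenced file is missing.
# `manifest` is a named list of relative paths, e.g. as parsed from config.yaml.
lnk_manifest_paths <- function(dir, manifest) {
  manifest_file <- file.path(dir, "config.yaml")
  if (!file.exists(manifest_file)) {
    stop("Manifest not found: ", manifest_file, call. = FALSE)
  }

  paths <- vapply(manifest, function(rel) file.path(dir, rel), character(1))

  for (slot in names(paths)) {
    if (!file.exists(paths[[slot]])) {
      stop("Config slot '", slot, "' points to a missing file: ",
           paths[[slot]], call. = FALSE)
    }
  }
  paths
}
```

A call might look like `lnk_manifest_paths(dir, list(rules_yaml = "rules.yaml", parameters_fresh = "parameters_fresh.csv"))`, which returns a named character vector or stops at the first broken slot.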

Abstraction notes

Alternatives considered:

  • Hardcode config in data-raw/compare_bcfishpass.R — current state, doesn't scale to variants
  • Single monolithic YAML (one file, no external CSVs). Rejected: overrides are naturally tabular and large (21k rows of modelled crossing fixes), and YAML is the wrong format for that
  • Separate loaders per config piece (lnk_config_rules(), lnk_config_overrides(), ...) — rejected: users shouldn't have to wire five loaders together
  • Include schema naming (working_<aoi>) in the config object — rejected: schema names are pipeline concerns, not config concerns. Pipeline decides naming; config stays reusable across AOIs

Key design decisions:

  • Return is a list, not an environment or R6 — simple, inspectable in RStudio
  • Bundles live in inst/extdata/configs/ — available via system.file() after install
  • CSVs are loaded eagerly into tibbles (not paths) — pipeline steps don't need to know file layout
  • Custom configs: pass an absolute path, same return shape
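
The eager-loading decision could be sketched as below for the overrides slot. This uses base `read.csv()`, which returns data frames rather than tibbles; converting with `tibble::as_tibble()` would be one option if that package is a dependency. The helper name `lnk_load_overrides()` is hypothetical, and the slot-to-file mapping follows the directory layout described in this issue.

```r
# Hypothetical helper: read every override CSV up front into a named list.
# Base read.csv() yields data frames; the proposal calls for tibbles.
lnk_load_overrides <- function(dir, files) {
  lapply(files, function(f) {
    utils::read.csv(file.path(dir, "overrides", f), stringsAsFactors = FALSE)
  })
}

# Slot name -> file name, following the directory layout in this issue.
override_files <- c(
  modelled_fixes       = "user_modelled_crossing_fixes.csv",
  pscis_barrier_status = "user_pscis_barrier_status.csv",
  pscis_xref           = "pscis_modelledcrossings_streams_xref.csv",
  barriers_definite    = "user_barriers_definite.csv"
)
```

Since `lapply()` keeps the names of its input, `lnk_load_overrides(dir, override_files)` returns a list whose names match the `overrides` slots, so downstream steps can use `cfg$overrides$modelled_fixes` without knowing the file layout.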

Execution checklist

Tests required

  • lnk_config("bcfishpass") returns expected list shape with all documented slots
  • lnk_config("default") returns expected list shape
  • Missing manifest → clear error pointing at the missing file
  • Missing referenced file → clear error identifying which config slot is broken
  • Custom path (absolute) works end-to-end
  • All tibbles have expected columns (parameters_fresh, wsg_species, each override CSV)
  • Invalid config.yaml (malformed, missing required keys) → fails validation with useful message

Example must show

  • Why: one object representing a complete pipeline configuration, swappable for variants
  • How: cfg <- lnk_config("bcfishpass"), then inspect cfg$rules_yaml and cfg$overrides$modelled_fixes
  • Wires into: show the rules_yaml passed to frs_habitat_classify(), and overrides passed to lnk_load()

Not in scope

Cross-refs

Versions

  • fresh: 0.13.8
  • link: main (0.1.0, target 0.2.0)
  • bcfishpass: ea3c5d8
  • fwapg: Docker (FWA 20240830)
