Add lnk_load_overrides(config) source-agnostic API consuming crate::crt_ingest() #65

@NewGraphEnvironment

Resolution / approach update — 2026-04-29

The original proposal (preserved below) added a parallel lnk_load_overrides() alongside the existing lnk_config(). Inventorying the current code surfaced significant overlap — lnk_config() already reads every override CSV via read.csv() and exposes them as cfg$overrides$X / cfg$habitat_classification / cfg$observation_exclusions / cfg$wsg_species. Two functions reading the same files would create parallel APIs with the same job.

Updated approach: decompose into manifest + materialized data, in a single PR with a single version bump (v0.18.0). link has zero external R-code consumers (verified: fresh contains only @seealso doc references, and the rtj references are archived planning), so no backwards-compat shim is needed.

What lnk_config() becomes

Manifest-only. Returns paths, provenance, cfg$files entries with {source, path, canonical_schema}. No data frames in the result. Cheap to call. lnk_config_verify() and lnk_stamp() run without parsing CSVs.
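
A minimal sketch of what a manifest-only result could look like, assuming it is a plain nested list. Only source, path, and canonical_schema come from the description above; the `dir` field and the `manifest_has_data()` helper are hypothetical and are not part of link.

```r
# Hypothetical manifest: paths and declarations only, no parsed data.
cfg <- list(
  name = "default",
  dir = "inst/extdata/configs/default",  # assumed field: the config directory
  files = list(
    user_habitat_classification = list(
      source = "bcfp",
      path = "overrides/user_habitat_classification.csv",
      canonical_schema = "bcfp/user_habitat_classification"
    ),
    user_barriers_definite = list(
      source = "bcfp",
      path = "overrides/user_barriers_definite.csv"
    )
  )
)

# Illustrative check: TRUE if a data frame appears anywhere in the object.
manifest_has_data <- function(x) {
  if (is.data.frame(x)) return(TRUE)
  if (!is.list(x)) return(FALSE)
  any(vapply(x, manifest_has_data, logical(1)))
}

manifest_has_data(cfg)
```

A recursive check of this kind is one possible shape for the manifest tests listed in the acceptance criteria.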

What lnk_load_overrides() does

Exported. Takes a config object (or name/path), returns named list of canonical-shape tibbles. Routes registered entries through crate::crt_ingest(source, file_name, path) (today: bcfp/user_habitat_classification). Unregistered entries fall through to plain CSV read until crate adds their schemas (one issue per file, follow-up).
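
The routing rule above can be sketched as follows. This is an illustration under stated assumptions, not the planned implementation: `load_overrides_sketch()` is a hypothetical name, `ingest` is a stand-in for crate::crt_ingest(source, file_name, path) so the sketch runs without crate installed, `cfg$dir` is an assumed field, `cfg$files` is assumed to hold only override CSV entries, and the fallback returns a plain data frame rather than a tibble.

```r
# Sketch only: `ingest` stands in for crate::crt_ingest(source, file_name, path).
load_overrides_sketch <- function(cfg, ingest) {
  out <- lapply(names(cfg$files), function(file_name) {
    entry <- cfg$files[[file_name]]
    path <- file.path(cfg$dir, entry[["path"]])
    if (!file.exists(path)) {
      stop("Override file not found: ", path, call. = FALSE)
    }
    if (!is.null(entry[["canonical_schema"]])) {
      # Registered entry: the source name comes from config, never from R code.
      ingest(entry[["source"]], file_name, path)
    } else {
      # Unregistered entry: plain CSV read until a schema exists for it.
      utils::read.csv(path, stringsAsFactors = FALSE)
    }
  })
  stats::setNames(out, names(cfg$files))
}

# Usage with a temporary config directory and a stand-in ingest function.
cfg_dir <- tempfile("cfg")
dir.create(cfg_dir)
write.csv(data.frame(id = 1:2), file.path(cfg_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(id = 3L), file.path(cfg_dir, "b.csv"), row.names = FALSE)
cfg <- list(dir = cfg_dir, files = list(
  a = list(source = "bcfp", path = "a.csv", canonical_schema = "bcfp/a"),
  b = list(source = "bcfp", path = "b.csv")
))
fake_ingest <- function(source, file_name, path) utils::read.csv(path)
loaded <- load_overrides_sketch(cfg, ingest = fake_ingest)
```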

Config schema (per-entry source/schema declarations)

# inst/extdata/configs/default/config.yaml
name: default
files:
  rules_yaml:
    path: rules.yaml
  dimensions_csv:
    path: dimensions.csv
  parameters_fresh:
    path: parameters_fresh.csv
  user_habitat_classification:
    source: bcfp                                  # crate-registered
    path: overrides/user_habitat_classification.csv
    canonical_schema: bcfp/user_habitat_classification
  user_barriers_definite:
    source: bcfp                                  # not yet in crate registry — falls through to plain CSV
    path: overrides/user_barriers_definite.csv
  # ... rest of overrides
extends: null   # supported in resolver; bundled configs don't use it
provenance:
  # unchanged from current schema — already byte/shape checksums per file

Pipeline knobs (break_order, cluster, spawn_connected, apply_habitat_overlay) stay where they are.

Pipeline phase migration

Each lnk_pipeline_* phase that reads cfg$overrides$X or cfg$habitat_classification becomes a phase that takes a loaded object alongside cfg. ~25 reference points across 8 files (lnk_pipeline_{load,break,classify,connect,prepare,species}, lnk_stamp, lnk_config_verify) + tests + targets + vignette.
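
To illustrate the signature change, here is a toy before/after pair. The function names, the column name, and the values are made up for illustration; the real lnk_pipeline_* phases do more than count rows.

```r
# Before (illustrative): the phase pulls a parsed data frame off cfg.
phase_before <- function(cfg) {
  nrow(cfg$overrides$user_barriers_definite)
}

# After (illustrative): cfg is manifest-only; data arrives through `loaded`.
phase_after <- function(cfg, loaded) {
  nrow(loaded$user_barriers_definite)
}

barriers <- data.frame(blue_line_key = c(1L, 2L))  # made-up column and values
old_cfg <- list(overrides = list(user_barriers_definite = barriers))
new_cfg <- list(files = list(
  user_barriers_definite = list(path = "overrides/user_barriers_definite.csv")
))
loaded <- list(user_barriers_definite = barriers)

phase_before(old_cfg)
phase_after(new_cfg, loaded)
```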

Why not staged across two PRs

link is its own only consumer. A backwards-compat shim during transition would be dead code on arrival — written and removed in the same week. CLAUDE.md guidance: don't write shims when you can just change the code.

Safety bar

  • Pre-flight on a single WSG (~100s) before full tar_make
  • Full tar_make() on 5 WSGs × 2 configs (~20 min) — bit-identical rollup vs pre-refactor baseline is the merge gate
  • lnk_config_verify() + manifest-shape tests catch config-load failures
  • Pipeline phase signature change is mechanical — same data, different access point
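
One way the bit-identical merge gate could be checked, assuming the rollup is written to a file on disk (the issue does not say how the rollup is stored). `rollup_identical()` and `read_bytes()` are hypothetical helpers; for in-memory objects, base R's identical() would be the analogous check.

```r
# Read a file as a raw byte vector.
read_bytes <- function(path) {
  readBin(path, what = "raw", n = file.size(path))
}

# Hypothetical gate: TRUE only if both files exist and match byte for byte.
rollup_identical <- function(baseline_path, candidate_path) {
  file.exists(baseline_path) &&
    file.exists(candidate_path) &&
    identical(read_bytes(baseline_path), read_bytes(candidate_path))
}

# Usage with throwaway files.
baseline <- tempfile(fileext = ".csv")
candidate <- tempfile(fileext = ".csv")
writeLines(c("wsg,km", "A,1.5"), baseline)
writeLines(c("wsg,km", "A,1.5"), candidate)
rollup_identical(baseline, candidate)
```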

Acceptance criteria (revised)

  • DESCRIPTION declares crate (>= 0.0.1) in Imports (or Remotes if not yet on registry)
  • lnk_config() returns manifest-only object — no data frames in cfg$overrides$X, cfg$habitat_classification, etc.
  • lnk_load_overrides(cfg) exported, returns named list of canonical tibbles
  • user_habitat_classification routes through crt_ingest("bcfp", "user_habitat_classification", path) and shields callers from upstream long↔wide pivots
  • Other registered files (when crate adds them) plug in by config edit alone — no link R code change
  • Config schema supports extends: (project configs inherit defaults) and per-entry overrides
  • All lnk_pipeline_* phases take loaded alongside cfg and use it
  • lnk_config_verify() + lnk_stamp() work without parsing data CSVs
  • Tests cover: manifest loads correctly without data; load_overrides dispatches via crate; project config with extends + overrides; missing files and mis-shaped inputs fail loudly
  • No source name (bcfp, nge, local) appears in link's R code — only in YAML configs
  • tar_make() rollup bit-identical vs pre-refactor baseline on all 5 WSGs

Out of scope (defer to follow-up issues)

  • Crate schemas for the other ~9 bcfp-sourced CSVs (one issue per file as canonical-shape decisions concretize). They fall through to plain CSV read until then.
  • nge and local source families (when project-experimental configs surface real need)
  • Type-aware variant matching (crate v0.1.x roadmap)
  • Lazy / per-WSG loading (crossings.csv is parsed twice today across 2 configs — possible follow-up since the manifest/data split makes it trivial)

Original proposal (preserved for audit trail)

Problem

When smnorris reshapes a bcfishpass override CSV (long→wide, type changes, column renames), link's processing code is shape-fragile — every consumer of the file knows the upstream shape directly. Yesterday's user_habitat_classification.csv reshape rippled into fresh's API (fresh#176, #177) and threatens cached report output. Beyond that, link's API today couples to bcfp by hardcoding paths/shapes — but link is meant to be source-agnostic. We want to add new data types (e.g. NGE-curated user_habitat_known) and swap experimental files (e.g. project-local user_barriers_definite) without changing link's code.

Proposed Solution

Adopt a config-driven, source-agnostic API in link that delegates per-file ingest to crate's source-explicit dispatcher (crate::crt_ingest(source, file_name, path)). link's code never names a source; the source comes from config metadata.

Public API

lnk_load_overrides(config = "default")
# config: name of bundled config OR path to a config YAML file
# Returns: named list of canonical-shape tibbles, one per file in the resolved config

Return contract: named list of canonical-shape tibbles (one per file in resolved config). Caller decides whether to write to a DB. lnk_load_overrides() is pure-R-side, testable without a DB connection. No DB writes happen inside this function.

Bundled config schema

# inst/extdata/configs/default/config.yaml
name: default
files:
  user_barriers_definite:
    source: bcfp
    path_relative: overrides/user_barriers_definite.csv     # bundled, relative to config dir
    canonical_schema: bcfp/user_barriers_definite           # crate schema this file conforms to
  user_habitat_classification:
    source: bcfp
    path_relative: overrides/user_habitat_classification.csv
    canonical_schema: bcfp/user_habitat_classification
  # ... other bcfp-sourced files

Project / experimental config example

A project repo (e.g. restoration_wedzin_kwa_2024) supplies its own config. Supports extends: to inherit the default, and overrides: to swap or add entries:

# wedzin_kwa_2024/configs/experimental.yaml
name: experimental_wedzin_kwa
extends: default                                            # inherit all bcfp entries from link's default
overrides:
  user_barriers_definite:
    source: local                                           # OUR experimental table
    path: data/wedzin_kwa/barriers_definite_v3.csv          # absolute or relative to config file
    canonical_schema: bcfp/user_barriers_definite           # validate against same canonical shape
  user_habitat_known:                                       # NEW logical entry, doesn't exist in default
    source: nge                                             # NGE-curated source family
    path: data/wedzin_kwa/habitat_field_2024.csv
    canonical_schema: nge/user_habitat_known                # crate schema (lands when this domain is added to crate)

lnk_load_overrides("wedzin_kwa_2024/configs/experimental.yaml") resolves extends, applies overrides, dispatches each entry via crate::crt_ingest(source, file_name, path), returns named list of canonical tibbles.
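
The extends/overrides resolution described here could work roughly as in the sketch below. It assumes configs have already been parsed into R lists and that parent configs can be looked up by name in a `registry` list; `resolve_config()`, `merge_entries()`, and `registry` are hypothetical names. Whole entries are replaced rather than merged field by field, which is one reading of "swap or add entries".

```r
# Replace or add whole entries by name (no field-level merge).
merge_entries <- function(base, new) {
  for (nm in names(new)) base[[nm]] <- new[[nm]]
  base
}

# Hypothetical resolver: `registry` maps config names to parsed config lists.
resolve_config <- function(cfg, registry) {
  base_files <- list()
  if (!is.null(cfg[["extends"]])) {
    parent <- registry[[cfg[["extends"]]]]
    if (is.null(parent)) {
      stop("Unknown parent config: ", cfg[["extends"]], call. = FALSE)
    }
    base_files <- resolve_config(parent, registry)$files
  }
  files <- merge_entries(base_files, cfg[["files"]])
  files <- merge_entries(files, cfg[["overrides"]])
  list(name = cfg[["name"]], files = files)
}

# Usage mirroring the two YAML examples, reduced to two fields per entry.
registry <- list(default = list(name = "default", files = list(
  user_barriers_definite = list(
    source = "bcfp", path_relative = "overrides/user_barriers_definite.csv"),
  user_habitat_classification = list(
    source = "bcfp", path_relative = "overrides/user_habitat_classification.csv")
)))
experimental <- list(
  name = "experimental_wedzin_kwa",
  extends = "default",
  overrides = list(
    user_barriers_definite = list(
      source = "local", path = "data/wedzin_kwa/barriers_definite_v3.csv"),
    user_habitat_known = list(
      source = "nge", path = "data/wedzin_kwa/habitat_field_2024.csv")
  )
)
resolved <- resolve_config(experimental, registry)
names(resolved$files)
```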

Wrangling stays in project / consumer code

Once lnk_load_overrides() returns canonical-shape tibbles, project scripts wrangle them with plain dplyr/tidyr — combining experimental + bcfp tables, dedup, semi-joins, AOI filters. No special "merge" framework in link; pure data composition on canonical shapes.

Implementation outline

  • R/lnk_load_overrides.R — exported, resolves config (incl. extends/overrides), iterates files, calls crate::crt_ingest()
  • R/lnk_config.R — internal helpers for config resolution (parse YAML, resolve extends, apply overrides, validate paths exist)
  • R/lnk_source.R — internal .lnk_source_resolve(entry) — given a config entry, returns the source location handle (today: a file path resolved from path or path_relative; forward-compatible with S3 URLs, postgres connections, anything crate::crt_ingest() accepts in future). Not exported. Reserves the lnk_source_* namespace for future siblings (lnk_source_list(config), lnk_source_validate(entry), lnk_source_check(config)) that ship as project-experimental work surfaces real need.
  • inst/extdata/configs/default/config.yaml — bundled default config listing bcfp entries
  • inst/extdata/configs/bcfishpass/config.yaml — existing bundle; updated to new schema (still source: bcfp for everything; mirrors upstream verbatim for regression testing)
  • DESCRIPTION — adds crate (>= 0.0.1) to Imports (or Remotes until crate is on registry)
  • Tests: load default config → returns expected files in canonical shape; load synthetic experimental config (extends + overrides + new file type) → returns merged result
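
A possible shape for the internal source resolver from the outline, covering only today's file-path case. The name `source_resolve_sketch()` and the `config_dir` argument are assumptions; the outline specifies only that the helper takes a config entry. Fields are read with `[[` because `$` on a list partially matches names, so `entry$path` would silently pick up `path_relative`.

```r
# Sketch: resolve a config entry to an existing file path.
source_resolve_sketch <- function(entry, config_dir) {
  # Exact-name lookup; prefer `path`, fall back to `path_relative`.
  rel <- entry[["path"]]
  if (is.null(rel)) rel <- entry[["path_relative"]]
  if (is.null(rel)) {
    stop("Config entry has neither `path` nor `path_relative`", call. = FALSE)
  }
  # Use the path as given if it already points at a file (e.g. absolute),
  # otherwise treat it as relative to the config directory.
  resolved <- if (file.exists(rel)) rel else file.path(config_dir, rel)
  if (!file.exists(resolved)) {
    stop("Source file not found: ", resolved, call. = FALSE)
  }
  resolved
}

# Usage with a temporary config directory.
config_dir <- tempfile("cfg")
dir.create(file.path(config_dir, "overrides"), recursive = TRUE)
csv_path <- file.path(config_dir, "overrides", "user_barriers_definite.csv")
write.csv(data.frame(id = 1L), csv_path, row.names = FALSE)

bundled <- list(source = "bcfp", path_relative = "overrides/user_barriers_definite.csv")
source_resolve_sketch(bundled, config_dir)
```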

Why no source name appears in link's R code

That's the load-bearing property. Adding a new source (NGE, lab, provincial) means:

  • crate adds the schema YAML + adapter for the new source
  • Project config references source: <new-source> in its YAML
  • link's R code does NOT change

Link is fully source-agnostic at the API level. Source knowledge lives in:

  1. Crate (the adapter code)
  2. Configs (data, in link's bundle for default + project repos for experimental)
  3. NEVER in link's R code

Acceptance criteria

  • DESCRIPTION declares crate (>= 0.0.1) in Imports (or Remotes if not yet on registry)
  • lnk_load_overrides() resolves bundled configs by name (config = "default") and arbitrary paths (config = "/path/to/config.yaml")
  • Config resolver supports extends: (inherit entries from another config) and overrides: (replace or add entries)
  • lnk_load_overrides() dispatches each entry via crate::crt_ingest(source, file_name, path) and returns named list of canonical tibbles
  • Bundled inst/extdata/configs/default/config.yaml lists all bcfp-sourced files with source: bcfp + canonical_schema: bcfp/<file>
  • Bundled inst/extdata/configs/bcfishpass/config.yaml updated to new schema (retains regression-test purpose)
  • Existing direct read.csv() calls of user_habitat_classification.csv in link's R/ are replaced by access through lnk_load_overrides()$user_habitat_classification
  • Tests cover: default config loads correctly; synthetic project config with extends + overrides + new file type loads correctly; missing files and mis-shaped inputs fail loudly
  • No format / species_col shape-aware parameters needed in link code that consumed bcfp-sourced files (those moved to crate's adapter; link sees only canonical)
  • No source name (bcfp, nge, local) appears in link's R code — only in YAML configs

Scope

First-instance integration: user_habitat_classification.csv. Other ~9 bcfp-sourced CSVs in default config conform to the same lnk_load_overrides pattern automatically (they're config entries, not code changes), but their crate schema YAMLs land as separate follow-up issues (one per file as canonical-shape decisions concretize).

Dependencies / coordination

Context / related

  • Comms thread (architectural design): comms/crate/20260427_fresh_bcfishpass_csv_consumers.md
  • Implementation plan thread (forthcoming): comms/crate/20260427_bcfp_ingest_impl_plan.md
  • Crate boundary doc: crate/CLAUDE.md (Boundary with rfp section — link consumes canonical, crate owns canonicalization framework)
  • Path E architectural choice (config-driven source-agnostic API in link, source-explicit dispatcher in crate) — settled in comms thread after considering Path C (link calls per-source) and Path D (crate owns sync)
