Resolution / approach update — 2026-04-29
The original proposal (preserved below) added a parallel lnk_load_overrides() alongside the existing lnk_config(). Inventorying the current code surfaced significant overlap — lnk_config() already reads every override CSV via read.csv() and exposes them as cfg$overrides$X / cfg$habitat_classification / cfg$observation_exclusions / cfg$wsg_species. Two functions reading the same files would create parallel APIs with the same job.
Updated approach: decompose into manifest + materialized data, single PR, single bump (v0.18.0). link has zero external R-code consumers (verified — only @seealso doc references in fresh; rtj refs are archived planning), so no backwards-compat shim is needed.
What lnk_config() becomes
Manifest-only. Returns paths, provenance, cfg$files entries with {source, path, canonical_schema}. No data frames in the result. Cheap to call. lnk_config_verify() and lnk_stamp() run without parsing CSVs.
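As a rough illustration of the manifest-only shape described above, the result might look like the list below. Only the `{source, path, canonical_schema}` entry shape comes from the proposal; the concrete values and the `manifest_has_no_data()` helper are assumptions for illustration, not link's actual code.

```r
# Hypothetical manifest, roughly what lnk_config() might return after the
# refactor. Values are illustrative.
cfg <- list(
  name = "default",
  files = list(
    user_habitat_classification = list(
      source = "bcfp",
      path = "overrides/user_habitat_classification.csv",
      canonical_schema = "bcfp/user_habitat_classification"
    ),
    user_barriers_definite = list(
      source = "bcfp",
      path = "overrides/user_barriers_definite.csv",
      canonical_schema = NULL  # not yet registered in crate
    )
  ),
  provenance = list()  # per-file checksums, unchanged from current schema
)

# Hypothetical check: nothing anywhere in the manifest is a data frame.
manifest_has_no_data <- function(x) {
  if (is.data.frame(x)) return(FALSE)
  if (is.list(x)) return(all(vapply(x, manifest_has_no_data, logical(1))))
  TRUE
}

manifest_has_no_data(cfg)
```

A recursive check along these lines could back the manifest-shape tests without touching any CSV.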
What lnk_load_overrides() does
Exported. Takes a config object (or name/path), returns named list of canonical-shape tibbles. Routes registered entries through crate::crt_ingest(source, file_name, path) (today: bcfp/user_habitat_classification). Unregistered entries fall through to plain CSV read until crate adds their schemas (one issue per file, follow-up).
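The routing rule above (registered entries go through the dispatcher, everything else falls through to a CSV read) can be sketched as follows. `registered` and `ingest` are stand-ins for crate's schema registry and `crate::crt_ingest()`; how link learns which schemas are registered is an assumption here, and the sketch returns plain data frames rather than tibbles.

```r
# Sketch only: route one manifest entry to an injected ingest function when
# its schema is registered, otherwise fall back to a plain CSV read.
load_entry <- function(name, entry, registered, ingest) {
  schema <- entry$canonical_schema
  if (!is.null(schema) && schema %in% registered) {
    # Registered: delegate to the source-explicit dispatcher.
    ingest(entry$source, name, entry$path)
  } else {
    # Unregistered: plain CSV read until crate adds a schema.
    utils::read.csv(entry$path, stringsAsFactors = FALSE)
  }
}

load_overrides_sketch <- function(cfg, registered, ingest) {
  entries <- cfg$files
  out <- lapply(names(entries), function(nm) {
    load_entry(nm, entries[[nm]], registered, ingest)
  })
  stats::setNames(out, names(entries))
}

# Usage with a temporary CSV and a stub in place of crate::crt_ingest().
csv_path <- tempfile(fileext = ".csv")
utils::write.csv(data.frame(id = 1:2), csv_path, row.names = FALSE)

cfg <- list(files = list(
  user_habitat_classification = list(
    source = "bcfp", path = csv_path,
    canonical_schema = "bcfp/user_habitat_classification"
  ),
  user_barriers_definite = list(source = "bcfp", path = csv_path)
))

stub_ingest <- function(source, file_name, path) {
  data.frame(source = source, file_name = file_name)
}

loaded <- load_overrides_sketch(
  cfg,
  registered = "bcfp/user_habitat_classification",
  ingest = stub_ingest
)
names(loaded)
```

Passing the ingest function in as an argument keeps the sketch testable without crate installed; the real function would presumably call crate directly.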
Config schema (per-entry source/schema declarations)
    # inst/extdata/configs/default/config.yaml
    name: default
    files:
      rules_yaml:
        path: rules.yaml
      dimensions_csv:
        path: dimensions.csv
      parameters_fresh:
        path: parameters_fresh.csv
      user_habitat_classification:
        source: bcfp  # crate-registered
        path: overrides/user_habitat_classification.csv
        canonical_schema: bcfp/user_habitat_classification
      user_barriers_definite:
        source: bcfp  # not yet in crate registry; falls through to plain CSV
        path: overrides/user_barriers_definite.csv
      # ... rest of overrides
    extends: null  # supported in resolver; bundled configs don't use it
    provenance:
      # unchanged from current schema; already byte/shape checksums per file
Pipeline knobs (break_order, cluster, spawn_connected, apply_habitat_overlay) stay where they are.
Pipeline phase migration
Each lnk_pipeline_* phase that reads cfg$overrides$X or cfg$habitat_classification becomes a phase that takes a loaded object alongside cfg. ~25 reference points across 8 files (lnk_pipeline_{load,break,classify,connect,prepare,species}, lnk_stamp, lnk_config_verify) + tests + targets + vignette.
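The signature change described above can be sketched as a before/after pair. The phase names, the toy table, and the assumption that `loaded` is keyed by the same names as `cfg$overrides` are all illustrative.

```r
# Toy stand-in for one override table; in the real pipeline `loaded` would
# come from lnk_load_overrides(cfg).
barriers <- data.frame(blue_line_key = c(1L, 2L, 3L))

# Before: cfg carries parsed data frames.
cfg_old <- list(overrides = list(user_barriers_definite = barriers))
phase_before <- function(cfg) {
  nrow(cfg$overrides$user_barriers_definite)
}

# After: cfg is manifest-only and the data arrives through `loaded`.
cfg_new <- list(files = list(
  user_barriers_definite = list(path = "overrides/user_barriers_definite.csv")
))
loaded <- list(user_barriers_definite = barriers)
phase_after <- function(cfg, loaded) {
  nrow(loaded$user_barriers_definite)
}

# Same data, different access point.
phase_before(cfg_old) == phase_after(cfg_new, loaded)
```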
Why not staged across two PRs
link is its own only consumer. A backwards-compat shim during transition would be dead code on arrival — written and removed in the same week. CLAUDE.md guidance: don't write shims when you can just change the code.
Safety bar
- Pre-flight on a single WSG (~100s) before full tar_make
- Full tar_make() on 5 WSGs × 2 configs (~20 min); bit-identical rollup vs pre-refactor baseline is the merge gate
- lnk_config_verify() + manifest-shape tests catch config-load failures
- Pipeline phase signature change is mechanical: same data, different access point
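One way the bit-identical merge gate could be checked is with `identical()`, which compares exactly, unlike `all.equal()`, which applies a numeric tolerance. The file name, rollup columns, and values below are hypothetical.

```r
# Sketch of the merge gate: compare a post-refactor rollup to a baseline
# saved before the refactor. Columns and values are illustrative.
baseline <- data.frame(
  wsg = c("ADMS", "BULK"),
  spawning_km = c(12.5, 3.25)
)
baseline_path <- tempfile(fileext = ".rds")
saveRDS(baseline, baseline_path)

rollup_matches_baseline <- function(rollup, baseline_path) {
  # identical() applies no tolerance, so any drift in values, types, or
  # attributes fails the gate.
  identical(rollup, readRDS(baseline_path))
}

rollup_matches_baseline(baseline, baseline_path)
```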
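Acceptance criteria (revised)
- DESCRIPTION declares crate (>= 0.0.1) in Imports (or Remotes if not yet on registry)
- lnk_config() returns manifest-only object: no data frames in cfg$overrides$X, cfg$habitat_classification, etc.
- lnk_load_overrides(cfg) exported, returns named list of canonical tibbles
- user_habitat_classification routes through crt_ingest("bcfp", "user_habitat_classification", path) and shields callers from upstream long↔wide pivots
- Supports extends: (project configs inherit defaults) and per-entry overrides
- lnk_pipeline_* phases take loaded alongside cfg and use it
- lnk_config_verify() + lnk_stamp() work without parsing data CSVs
- No source name (bcfp, nge, local) appears in link's R code, only in YAML configs
- tar_make() rollup bit-identical vs pre-refactor baseline on all 5 WSGs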
Out of scope (defer to follow-up issues)
- Crate schemas for the other ~9 bcfp-sourced CSVs (one issue per file as canonical-shape decisions concretize). They fall through to plain CSV read until then.
- nge and local source families (when project-experimental configs surface real need)
- Type-aware variant matching (crate v0.1.x roadmap)
- Lazy / per-WSG loading (crossings.csv is parsed twice today across 2 configs; possible follow-up since the manifest/data split makes it trivial)
Original proposal (preserved for audit trail)
Problem
When smnorris reshapes a bcfishpass override CSV (long→wide, type changes, column renames), link's processing code is shape-fragile — every consumer of the file knows the upstream shape directly. Yesterday's user_habitat_classification.csv reshape rippled into fresh's API (fresh#176, #177) and threatens cached report output. Beyond that, link's API today couples to bcfp by hardcoding paths/shapes — but link is meant to be source-agnostic. We want to add new data types (e.g. NGE-curated user_habitat_known) and swap experimental files (e.g. project-local user_barriers_definite) without changing link's code.
Proposed Solution
Adopt a config-driven, source-agnostic API in link that delegates per-file ingest to crate's source-explicit dispatcher (crate::crt_ingest(source, file_name, path)). link's code never names a source; the source comes from config metadata.
Public API
    lnk_load_overrides(config = "default")
    # config: name of bundled config OR path to a config YAML file
    # Returns: named list of canonical-shape tibbles, one per file in the resolved config
Return contract: named list of canonical-shape tibbles (one per file in resolved config). Caller decides whether to write to a DB. lnk_load_overrides() is pure-R-side, testable without a DB connection. No DB writes happen inside this function.
Bundled config schema
    # inst/extdata/configs/default/config.yaml
    name: default
    files:
      user_barriers_definite:
        source: bcfp
        path_relative: overrides/user_barriers_definite.csv  # bundled, relative to config dir
        canonical_schema: bcfp/user_barriers_definite        # crate schema this file conforms to
      user_habitat_classification:
        source: bcfp
        path_relative: overrides/user_habitat_classification.csv
        canonical_schema: bcfp/user_habitat_classification
      # ... other bcfp-sourced files
Project / experimental config example
A project repo (e.g. restoration_wedzin_kwa_2024) supplies its own config, which can use extends: to inherit the default and overrides: to swap or add entries:
    # wedzin_kwa_2024/configs/experimental.yaml
    name: experimental_wedzin_kwa
    extends: default  # inherit all bcfp entries from link's default
    overrides:
      user_barriers_definite:
        source: local                                   # OUR experimental table
        path: data/wedzin_kwa/barriers_definite_v3.csv  # absolute or relative to config file
        canonical_schema: bcfp/user_barriers_definite   # validate against same canonical shape
      user_habitat_known:                               # NEW logical entry, doesn't exist in default
        source: nge                                     # NGE-curated source family
        path: data/wedzin_kwa/habitat_field_2024.csv
        canonical_schema: nge/user_habitat_known        # crate schema (lands when this domain is added to crate)
lnk_load_overrides("wedzin_kwa_2024/configs/experimental.yaml") resolves extends, applies overrides, dispatches each entry via crate::crt_ingest(source, file_name, path), returns named list of canonical tibbles.
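The extends/overrides resolution described above might be sketched like this, working on configs that are already parsed into lists. `resolve_config()` and `replace_entries()` are hypothetical helpers, not link's resolver. Entries are replaced wholesale rather than merged field by field, so a parent's `path_relative` cannot leak into an overriding entry that uses `path`. Cycles in `extends` are not handled.

```r
# Replace or add whole entries by name.
replace_entries <- function(base, new) {
  if (length(new) == 0) return(base)
  base[names(new)] <- new
  base
}

# Resolve `extends` recursively, then apply `overrides`.
# `configs` stands in for YAML files parsed into named lists.
resolve_config <- function(name, configs) {
  cfg <- configs[[name]]
  if (is.null(cfg)) stop("unknown config: ", name)
  files <- if (is.null(cfg$files)) list() else cfg$files
  if (!is.null(cfg$extends)) {
    parent <- resolve_config(cfg$extends, configs)
    files <- replace_entries(parent$files, files)
  }
  files <- replace_entries(files, cfg$overrides)
  list(name = cfg$name, files = files)
}

configs <- list(
  default = list(
    name = "default",
    files = list(
      user_barriers_definite = list(
        source = "bcfp",
        path_relative = "overrides/user_barriers_definite.csv",
        canonical_schema = "bcfp/user_barriers_definite"
      ),
      user_habitat_classification = list(
        source = "bcfp",
        path_relative = "overrides/user_habitat_classification.csv",
        canonical_schema = "bcfp/user_habitat_classification"
      )
    )
  ),
  experimental_wedzin_kwa = list(
    name = "experimental_wedzin_kwa",
    extends = "default",
    overrides = list(
      user_barriers_definite = list(
        source = "local",
        path = "data/wedzin_kwa/barriers_definite_v3.csv",
        canonical_schema = "bcfp/user_barriers_definite"
      ),
      user_habitat_known = list(
        source = "nge",
        path = "data/wedzin_kwa/habitat_field_2024.csv",
        canonical_schema = "nge/user_habitat_known"
      )
    )
  )
)

resolved <- resolve_config("experimental_wedzin_kwa", configs)
names(resolved$files)
```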
Wrangling stays in project / consumer code
Once lnk_load_overrides() returns canonical-shape tibbles, project scripts wrangle them with plain dplyr/tidyr — combining experimental + bcfp tables, dedup, semi-joins, AOI filters. No special "merge" framework in link; pure data composition on canonical shapes.
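A small sketch of that kind of composition, written in base R so it runs without extra packages; the comments name the dplyr verbs a project script would more likely use. Column names and values are hypothetical, not the actual canonical shape.

```r
# Toy canonical tables with hypothetical columns.
bcfp <- data.frame(
  blue_line_key = c(1L, 2L),
  downstream_route_measure = c(100, 200),
  watershed_group_code = c("BULK", "MORR")
)
experimental <- data.frame(
  blue_line_key = c(2L, 3L),
  downstream_route_measure = c(200, 300),
  watershed_group_code = c("MORR", "BULK")
)

# Combine, experimental rows first so they win on dedup.
combined <- rbind(experimental, bcfp)               # dplyr::bind_rows()

# Drop repeats of the same location.
key <- c("blue_line_key", "downstream_route_measure")
deduped <- combined[!duplicated(combined[key]), ]   # dplyr::distinct(.keep_all = TRUE)

# Keep only the area of interest.
aoi <- deduped[deduped$watershed_group_code == "BULK", ]  # dplyr::filter()
nrow(aoi)
```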
Implementation outline
- R/lnk_load_overrides.R: exported, resolves config (incl. extends/overrides), iterates files, calls crate::crt_ingest()
- R/lnk_config.R: internal helpers for config resolution (parse YAML, resolve extends, apply overrides, validate paths exist)
- R/lnk_source.R: internal .lnk_source_resolve(entry). Given a config entry, returns the source location handle (today: a file path resolved from path or path_relative; forward-compatible with S3 URLs, postgres connections, anything crate::crt_ingest() accepts in future). Not exported. Reserves the lnk_source_* namespace for future siblings (lnk_source_list(config), lnk_source_validate(entry), lnk_source_check(config)) that ship as project-experimental work surfaces real need.
- inst/extdata/configs/default/config.yaml: bundled default config listing bcfp entries
- inst/extdata/configs/bcfishpass/config.yaml: existing bundle; updated to new schema (still source: bcfp for everything; mirrors upstream verbatim for regression testing)
- DESCRIPTION: adds crate (>= 0.0.1) to Imports (or Remotes until crate is on registry)
- Tests: load default config → returns expected files in canonical shape; load synthetic experimental config (extends + overrides + new file type) → returns merged result
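The path-resolution part of the outline could look roughly like the sketch below. It is named `source_resolve_sketch()` to keep it distinct from the planned `.lnk_source_resolve(entry)`. The `config_dir` argument is an assumption (the proposal only names `entry`), and only POSIX-style absolute paths are recognised.

```r
# Sketch: turn a config entry into a file path.
source_resolve_sketch <- function(entry, config_dir) {
  rel <- if (!is.null(entry$path_relative)) entry$path_relative else entry$path
  if (is.null(rel)) {
    stop("entry needs `path` or `path_relative`")
  }
  # Absolute paths pass through; everything else is taken as relative to
  # the directory holding the config file.
  if (startsWith(rel, "/")) rel else file.path(config_dir, rel)
}

source_resolve_sketch(
  list(source = "bcfp", path_relative = "overrides/user_barriers_definite.csv"),
  config_dir = "/pkg/extdata/configs/default"
)
```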
Why no source name appears in link's R code
That's the load-bearing property. Adding a new source (NGE, lab, provincial) means:
- crate adds the schema YAML + adapter for the new source
- Project config references source: <new-source> in its YAML
- link's R code does NOT change
Link is fully source-agnostic at the API level. Source knowledge lives in:
- Crate (the adapter code)
- Configs (data, in link's bundle for default + project repos for experimental)
- NEVER in link's R code
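One way to keep that property from regressing is a guard test that scans the R sources for source names used as string literals. The sketch below is an assumption about how such a test might look; it matches quoted literals only, so the base R function `local()` does not trigger it.

```r
# Sketch of a guard: report any quoted source name found in *.R files.
find_source_names <- function(r_dir, sources = c("bcfp", "nge", "local")) {
  files <- list.files(r_dir, pattern = "\\.R$", full.names = TRUE)
  # Match a source name wrapped in single or double quotes.
  pattern <- paste0("[\"'](", paste(sources, collapse = "|"), ")[\"']")
  hits <- lapply(files, function(f) {
    lines <- readLines(f, warn = FALSE)
    idx <- grep(pattern, lines)
    if (length(idx) == 0) return(NULL)
    data.frame(file = basename(f), line = idx, text = lines[idx])
  })
  # NULL when nothing matched, otherwise one row per offending line.
  do.call(rbind, hits)
}

# Usage against a throwaway directory standing in for link's R/ folder.
r_dir <- tempfile()
dir.create(r_dir)
writeLines(
  c("x <- entry$source", "local({ y <- 1 })"),
  file.path(r_dir, "clean.R")
)
is.null(find_source_names(r_dir))
```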
Acceptance criteria
- DESCRIPTION declares crate (>= 0.0.1) in Imports (or Remotes if not yet on registry)
- lnk_load_overrides() resolves bundled configs by name (config = "default") and arbitrary paths (config = "/path/to/config.yaml")
- Supports extends: (inherit entries from another config) and overrides: (replace or add entries)
- lnk_load_overrides() dispatches each entry via crate::crt_ingest(source, file_name, path) and returns named list of canonical tibbles
- inst/extdata/configs/default/config.yaml lists all bcfp-sourced files with source: bcfp + canonical_schema: bcfp/<file>
- inst/extdata/configs/bcfishpass/config.yaml updated to new schema (retains regression-test purpose)
- read.csv() calls of user_habitat_classification.csv in link's R/ are replaced by access through lnk_load_overrides()$user_habitat_classification
- No format / species_col shape-aware parameters needed in link code that consumed bcfp-sourced files (those moved to crate's adapter; link sees only canonical)
- No source name (bcfp, nge, local) appears in link's R code, only in YAML configs
Scope
First-instance integration: user_habitat_classification.csv. The other ~9 bcfp-sourced CSVs in the default config conform to the same lnk_load_overrides pattern automatically (they're config entries, not code changes), but their crate schema YAMLs land as separate follow-up issues (one per file as canonical-shape decisions concretize).
Dependencies / coordination
- […] (crt_ingest(source, file_name, path) with source = "bcfp" registered + first-instance handler for user_habitat_classification)
- […] format parameter goes away once link normalizes via crate at ingest)
- […] restoration_wedzin_kwa_2024), crate releases v0.1.x adding source = "local" and source = "nge" registrations + corresponding schemas
Context / related
- Comms thread (architectural design): comms/crate/20260427_fresh_bcfishpass_csv_consumers.md
- Implementation plan thread (forthcoming): comms/crate/20260427_bcfp_ingest_impl_plan.md
- Crate boundary doc: crate/CLAUDE.md (Boundary with rfp section: link consumes canonical, crate owns canonicalization framework)
- Path E architectural choice (config-driven source-agnostic API in link, source-explicit dispatcher in crate), settled in comms thread after considering Path C (link calls per-source) and Path D (crate owns sync)