Post-merge correction: the schema YAML and acceptance criteria sketched below assume long-canonical shape for user_habitat_classification. Plan-mode exploration during implementation surfaced that the actual canonical is wide (one row per segment × species, separate spawning / rearing integer columns):
fresh 0.22.0 explicitly enforces wide-shape input (closes fresh#177; drops the format parameter)
link's SQL schema for working.user_habitat_classification is wide
current upstream (smnorris/bcfishpass post-2026-04-26 reshape) is wide
Other minor scope diffs from plan to ship (informational, not architectural):
Schemas live at inst/extdata/schemas/ (not root schemas/) — runtime accessible via system.file(). Decisions stay at root.
No separate R-CMD-check.yaml workflow (NGE convention: pkgdown.yaml's build_site_github_pages runs R CMD check implicitly; flooded/fresh/gq all match)
Hex sticker shipped in this PR (was originally listed as deferred follow-up)
Problem
Crate is scoped (#1) and has the boundary principle documented in CLAUDE.md, but no R-package scaffolding and no shipped crt_* functions. The bcfishpass user_habitat_classification.csv schema event on 2026-04-26 (long→wide reshape with type change) is the first concrete opportunity to validate crate's role as both the canonical-shape authority AND the executable canonicalization engine. Per the comms-thread Path E decision, crate ships a source-explicit dispatcher (crt_ingest(source, file_name, path)) that link's source-agnostic API calls based on config metadata.
Proposed Solution
Scaffold crate as an R package, ship crt_ingest() with source = "bcfp" and file_name = "user_habitat_classification" as the first-instance handler, plus repo directory structure for schemas and decisions.
dev/dev.R (reproducible setup recipe — see soul r-packages convention)
data-raw/ (registry generation if applicable)
Hex sticker (deferred — open follow-up issue)
Public functions
crt_ingest(source, file_name, path)
# Source-explicit dispatcher.
# Looks up (source, file_name) in registry, validates path-supplied file's shape
# against schema YAML, dispatches to internal normalize handler, returns canonical tibble.
# Throws: if (source, file_name) not in registry, or shape doesn't match any known variant.

crt_files(source = NULL)
# Returns: tibble (source, file_name, normalize_fn, schema_yaml, canonical_cols)
# Filterable by source. The registry of (source, file_name) pairs crate knows how to ingest.
Path is required (no NULL default) — caller resolves paths (e.g. link's lnk_load_overrides() reads paths from configs). Decouples crate from any consumer's bundle layout. Path may be a local file, S3 URL, or remote URL — crate is path-source agnostic.
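As a rough illustration of the dispatch flow described above, the following base-R sketch looks up a (source, file_name) pair, fails with a diagnostic when the pair is unknown, and hands the file to a handler resolved by name. The names `crt_ingest_sketch`, `handler_passthrough`, the in-memory `registry` and the fixture row are all assumptions made for this example, and the schema-validation step is left out. It is not the shipped implementation.

```r
# Hypothetical in-memory registry row; the planned registry is a CSV under inst/extdata/.
registry <- data.frame(
  source = "bcfp",
  file_name = "user_habitat_classification",
  normalize_fn = "handler_passthrough",
  stringsAsFactors = FALSE
)

# Placeholder handler (hypothetical name) that returns its input unchanged.
handler_passthrough <- function(x) x

crt_ingest_sketch <- function(source, file_name, path) {
  # `path` has no default, so forgetting it fails immediately.
  if (missing(path)) stop("`path` is required", call. = FALSE)

  hit <- registry[registry$source == source & registry$file_name == file_name, ]
  if (nrow(hit) == 0L) {
    stop(sprintf("no registry entry for source = '%s', file_name = '%s'",
                 source, file_name), call. = FALSE)
  }

  raw <- utils::read.csv(path, stringsAsFactors = FALSE)
  # Resolve the handler by the name stored in the registry, then dispatch.
  handler <- get(hit$normalize_fn[[1]], mode = "function")
  handler(raw)
}

# Made-up one-row fixture written to a temporary file.
fixture <- tempfile(fileext = ".csv")
writeLines(c(
  "watershed_group_code,blue_line_key,species_code,habitat_class,comment",
  "TEST,1,SK,spawning,example row"
), fixture)

out <- crt_ingest_sketch("bcfp", "user_habitat_classification", path = fixture)

# An unknown source surfaces as an error whose message names the offending pair.
unknown <- tryCatch(
  crt_ingest_sketch("bogus_source", "user_habitat_classification", path = fixture),
  error = function(e) conditionMessage(e)
)
```

Because the caller supplies `path`, the sketch never needs to know where a consumer keeps its files, which is the decoupling the paragraph above asks for.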
Internal handlers
R/internal_bcfp_user_habitat_classification.R — handles long format (identity passthrough) AND wide format (pivot to canonical long), returns canonical tibble. Not exported. Naming pattern: internal_<source>_<file_name>.R.
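The handler behaviour described above (long passthrough, wide pivoted to long) might look roughly like the base-R sketch below. It follows the plan as written in this section; the post-merge note at the top records that the shipped canonical shape is wide. The function name `normalize_uhc_sketch`, the species column set, and the fixture values are assumptions for illustration only.

```r
# Species columns assumed for the wide variant; upstream lists "SK, ST, BT, ...",
# so this set is illustrative rather than complete.
species_cols <- c("SK", "ST", "BT")
canonical_cols <- c("watershed_group_code", "blue_line_key",
                    "species_code", "habitat_class", "comment")

normalize_uhc_sketch <- function(x) {
  n <- nrow(x)
  if (all(canonical_cols[1:4] %in% names(x))) {
    # Long input: pass through, adding the optional comment column if absent.
    if (!"comment" %in% names(x)) x$comment <- rep(NA_character_, n)
    return(x[, canonical_cols])
  }
  present <- intersect(species_cols, names(x))
  keys_ok <- all(c("watershed_group_code", "blue_line_key") %in% names(x))
  if (!keys_ok || length(present) == 0L) {
    stop("shape not recognized", call. = FALSE)
  }
  # Wide input: build one block of rows per species column, then stack them.
  pieces <- lapply(present, function(sp) {
    data.frame(
      watershed_group_code = x$watershed_group_code,
      blue_line_key = x$blue_line_key,
      species_code = rep(sp, n),
      habitat_class = as.character(x[[sp]]),
      comment = if ("comment" %in% names(x)) x$comment else rep(NA_character_, n),
      stringsAsFactors = FALSE
    )
  })
  long <- do.call(rbind, pieces)
  # Keep only cells that actually carry a classification.
  long <- long[!is.na(long$habitat_class) & nzchar(long$habitat_class), ]
  long <- long[order(long$watershed_group_code, long$blue_line_key, long$species_code), ]
  rownames(long) <- NULL
  long
}

# Made-up fixtures: the same two classifications in wide and long form.
wide_in <- data.frame(
  watershed_group_code = c("TEST", "TEST"),
  blue_line_key = c(1L, 2L),
  SK = c("spawning", NA),
  ST = c(NA, "rearing"),
  stringsAsFactors = FALSE
)
long_in <- data.frame(
  watershed_group_code = c("TEST", "TEST"),
  blue_line_key = c(1L, 2L),
  species_code = c("SK", "ST"),
  habitat_class = c("spawning", "rearing"),
  stringsAsFactors = FALSE
)

from_wide <- normalize_uhc_sketch(wide_in)
from_long <- normalize_uhc_sketch(long_in)
```

The package imports tidyr, so the real handler may well use `tidyr::pivot_longer()` for the wide branch; base R is used here only to keep the sketch dependency-free.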
Registry
inst/extdata/crate_registry.csv — runtime-read CSV (per design call: schema-as-data should be inspectable as data, not compiled to sysdata.rda). Columns: source, file_name, normalize_fn, schema_yaml, canonical_cols. v0.0.1 ships with one row (source=bcfp, file_name=user_habitat_classification).
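A minimal sketch of reading such a registry at runtime and filtering it by source is shown below. A temporary file stands in for the installed CSV (inside the package the path would presumably come from `system.file()`), and the semicolon-separated encoding of `canonical_cols`, the `normalize_fn` value, and the name `crt_files_sketch` are assumptions for this example.

```r
# Temporary stand-in for the registry CSV; column names follow the list above.
registry_path <- tempfile(fileext = ".csv")
writeLines(c(
  "source,file_name,normalize_fn,schema_yaml,canonical_cols",
  paste(
    "bcfp",
    "user_habitat_classification",
    "internal_bcfp_user_habitat_classification",
    "schemas/bcfp/user_habitat_classification.yaml",
    # Assumed encoding: canonical column names joined with semicolons.
    "watershed_group_code;blue_line_key;species_code;habitat_class;comment",
    sep = ","
  )
), registry_path)

crt_files_sketch <- function(source = NULL, registry = registry_path) {
  # Read the registry fresh on every call so edits to the CSV are picked up.
  reg <- utils::read.csv(registry, stringsAsFactors = FALSE)
  if (!is.null(source)) {
    reg <- reg[reg$source %in% source, , drop = FALSE]
  }
  reg
}

all_files <- crt_files_sketch()
bcfp_files <- crt_files_sketch(source = "bcfp")
nge_files <- crt_files_sketch(source = "nge")
```

Keeping the registry as a plain CSV means it can be opened and diffed like any other data file, which is the "inspectable as data" property the design call asks for.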
Schema YAML format
file: user_habitat_classification
description: Per-watershed-group, per-species habitat classification overrides curated upstream by smnorris/bcfishpass.
canonical:
  shape: long
  cols:
    - { name: watershed_group_code, type: string, required: true }
    - { name: blue_line_key, type: integer, required: true }
    - { name: species_code, type: string, required: true }
    - { name: habitat_class, type: string, required: true }
    - { name: comment, type: string, required: false }
upstream_variants:
  - id: pre-2026-04-26
    description: Long format with one row per (watershed, blueline, species) tuple
    cols: [watershed_group_code, blue_line_key, species_code, habitat_class, comment]
    normalize_fn: identity
  - id: 2026-04-26-wide
    description: Wide format with one row per (watershed, blueline) and per-species columns
    first_seen_sha: 40c4a0a
    cols: [watershed_group_code, blue_line_key, SK, ST, BT, ...]
    normalize_fn: pivot_wide_to_long
decisions:
  - decisions/bcfp/20260427_user_habitat_classification_long_canonical.md
upstream_source:
  repo: smnorris/bcfishpass
  path: data/user_habitat_classification.csv
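One way the `upstream_variants` block above could drive shape detection is sketched here. The variants are written as an R list literal, roughly what a YAML parser would return, so the example needs no extra packages. The `required` / `any_of` split and the function name `detect_variant` are assumptions, introduced because the wide variant's species columns are open-ended upstream.

```r
# Column sets copied from the upstream_variants block. The wide variant lists
# "SK, ST, BT, ..." upstream, so only the key columns are treated as required
# and at least one known species column must be present (an assumption).
variants <- list(
  list(
    id = "pre-2026-04-26",
    required = c("watershed_group_code", "blue_line_key", "species_code", "habitat_class"),
    normalize_fn = "identity"
  ),
  list(
    id = "2026-04-26-wide",
    required = c("watershed_group_code", "blue_line_key"),
    any_of = c("SK", "ST", "BT"),
    normalize_fn = "pivot_wide_to_long"
  )
)

detect_variant <- function(col_names, variants) {
  # Variants are tried in order; the first one whose columns are satisfied wins.
  for (v in variants) {
    has_required <- all(v$required %in% col_names)
    has_any <- is.null(v$any_of) || any(v$any_of %in% col_names)
    if (has_required && has_any) return(v)
  }
  stop("shape not recognized: columns match no known variant", call. = FALSE)
}

long_hit <- detect_variant(
  c("watershed_group_code", "blue_line_key", "species_code", "habitat_class", "comment"),
  variants
)
wide_hit <- detect_variant(c("watershed_group_code", "blue_line_key", "SK", "BT"), variants)
garbage <- tryCatch(detect_variant(c("foo", "bar"), variants), error = function(e) e)
```

The matched variant's `normalize_fn` field then tells the dispatcher which normalization to apply, so supporting a new upstream shape becomes a schema edit plus one handler.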
Source families: v0.0.1 scope
v0.0.1 registers ONE source: bcfp. Source families are the upstream-fetch grouping (matches link's existing CSV file-prefix convention: cabd/dfo files all sync from bcfp upstream, so they're source = bcfp with file_name carrying the cabd/dfo prefix). Future source families arrive as project-experimental needs surface:
source = "local" — degenerate adapter (validates path-supplied file against named canonical schema, no normalization). Lands in v0.1.x when first project-experimental config needs to swap a bundled bcfp file with a local copy.
source = "nge" — NGE-curated data families (e.g. user_habitat_known from field work, user_barriers_extended from project surveys). Lands in v0.1.x.
source = "edna" — eDNA lab returns. Lands when first eDNA work begins (per SRED #28's first-domain plan).
Each new source = registry rows + schema YAMLs + adapter functions. Doesn't require API changes.
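To make the "degenerate adapter" idea for `source = "local"` concrete, here is a small sketch that only checks the required canonical columns and returns the data unchanged. The function name `adapter_local_sketch` and the fixture are hypothetical; the required column set is taken from the schema sketch in this issue.

```r
# Required canonical columns, taken from the schema sketch in this issue.
required_cols <- c("watershed_group_code", "blue_line_key", "species_code", "habitat_class")

# Hypothetical "local" adapter: validate column presence, apply no normalization.
adapter_local_sketch <- function(x, required = required_cols) {
  missing_cols <- setdiff(required, names(x))
  if (length(missing_cols) > 0L) {
    stop("missing required column(s): ", paste(missing_cols, collapse = ", "),
         call. = FALSE)
  }
  x
}

# Made-up local file contents that already match the canonical shape.
local_copy <- data.frame(
  watershed_group_code = "TEST",
  blue_line_key = 1L,
  species_code = "SK",
  habitat_class = "spawning",
  stringsAsFactors = FALSE
)

validated <- adapter_local_sketch(local_copy)
rejected <- tryCatch(
  adapter_local_sketch(local_copy[, c("watershed_group_code", "blue_line_key")]),
  error = function(e) conditionMessage(e)
)
```

Under this reading, a new source adds a registry row pointing at such an adapter and leaves the `crt_ingest()` signature untouched.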
Repo directory structure
Concern-first at root, source-second nested (per the structure proposal):
Coordination protocol: draft schema YAML for bcfp/user_habitat_classification to be included in the crate#2 PR description before tagging v0.0.1, so link-Claude can flag any link-side integration concerns before release. (Per impl-plan comms thread reply.)
Implementation plan thread (forthcoming): link/comms/crate/20260427_bcfp_ingest_impl_plan.md
Boundary principle: crate/CLAUDE.md (Boundary with rfp section)
This is crate's first executable code surface — Path E decision (source-explicit dispatcher in crate, source-agnostic config-driven API in link). Validates crate's bet that schema-as-data + canonicalization-engine roles are jointly viable, AND that the dispatcher framework can grow to multiple source families (bcfp, local, nge, edna) without API changes.
R package scaffolding
DESCRIPTION (Title, Description, Authors@R, MIT license, Imports: chk, cli, dplyr, readr, tibble, tidyr, yaml)
NAMESPACE (managed by roxygen2)
R/crate-package.R (package-level docs + imports)
tests/ with testthat 3e
_pkgdown.yml (bootstrap 5)
.github/workflows/R-CMD-check.yaml
.github/workflows/pkgdown.yaml
dev/dev.R (reproducible setup recipe; see soul r-packages convention)
data-raw/ (registry generation if applicable)
Acceptance criteria
R package machinery
R CMD check passes with 0 errors, 0 warnings, 0 notes
devtools::document() produces clean NAMESPACE
lintr::lint_package() passes
crt_ingest and crt_files in reference
Functions
crt_ingest("bcfp", "user_habitat_classification", path = <long-format-fixture>) returns canonical-long tibble
crt_ingest("bcfp", "user_habitat_classification", path = <wide-format-fixture>) returns canonical-long tibble (identical output to long-input case)
crt_ingest("bcfp", "nonexistent_file", path = ...) throws with diagnostic
crt_ingest("bogus_source", "user_habitat_classification", path = ...) throws with diagnostic
crt_ingest("bcfp", "user_habitat_classification", path = <garbage-shape>) throws shape-not-recognized
crt_files() returns tibble with at minimum: source, file_name, normalize_fn, schema_yaml, canonical_cols
crt_files(source = "bcfp") filters to bcfp-sourced entries
Schemas + decisions
schemas/bcfp/user_habitat_classification.yaml exists with the proposed format
decisions/bcfp/20260427_user_habitat_classification_long_canonical.md exists with reasoning (long row-normalized, scales to N species, matches relational SQL patterns, matches fresh overlay expectations)
schemas/README.md and decisions/README.md exist
Release
v0.0.1 so link can pin via Imports / Remotes
Dependencies / coordination
lnk_load_overrides() adoption: link can't ship that until crate v0.0.1 is out
Coordination protocol: draft schema YAML for bcfp/user_habitat_classification to be included in the crate#2 PR description before tagging v0.0.1, so link-Claude can flag any link-side integration concerns before release. (Per impl-plan comms thread reply.)
Context / related
link/comms/crate/20260427_fresh_bcfishpass_csv_consumers.md
Implementation plan thread (forthcoming): link/comms/crate/20260427_bcfp_ingest_impl_plan.md
Boundary principle: crate/CLAUDE.md (Boundary with rfp section)