Scaffold crate as R package + ship crt_ingest(source, file_name, path) source-explicit dispatcher (first instance: bcfp/user_habitat_classification) #2

@NewGraphEnvironment

Resolved by PR #3 (merged 2026-04-28). Tagged as v0.0.1. Site live at https://www.newgraphenvironment.com/crate/.

Post-merge correction: the schema YAML and acceptance criteria sketched below assume long-canonical shape for user_habitat_classification. Plan-mode exploration during implementation surfaced that the actual canonical is wide (one row per segment × species, separate spawning / rearing integer columns):

  • fresh 0.22.0 explicitly enforces wide-shape input (closes fresh#177; drops the format parameter)
  • link's SQL schema for working.user_habitat_classification is wide
  • current upstream (smnorris/bcfishpass post-2026-04-26 reshape) is wide

The shipped schema YAML at inst/extdata/schemas/bcfp/user_habitat_classification.yaml and decision log at decisions/bcfp/20260427_user_habitat_classification_wide_canonical.md reflect wide-canonical. The body below is preserved as the planning record; treat the decision log + shipped schema as the source of truth.
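As a rough illustration of the wide-canonical shape described in the correction above, the following base-R sketch builds a few made-up rows. The `segment_id` key and all values are hypothetical; only the one-row-per-segment-by-species grain and the integer `spawning` / `rearing` columns come from the note, and the shipped schema YAML remains the authority on actual column names.

```r
# Hypothetical wide-canonical rows: one row per segment x species,
# with spawning and rearing as separate integer columns.
# `segment_id` and every value here are illustrative assumptions.
wide_canonical <- data.frame(
  segment_id   = c(1L, 1L, 2L),
  species_code = c("SK", "ST", "SK"),
  spawning     = c(1L, 0L, 1L),
  rearing      = c(1L, 1L, 0L),
  stringsAsFactors = FALSE
)

# Shape checks a wide-canonical validator might apply.
stopifnot(
  is.integer(wide_canonical$spawning),
  is.integer(wide_canonical$rearing),
  # the (segment, species) pair must be unique
  !anyDuplicated(wide_canonical[c("segment_id", "species_code")])
)
```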

Other minor scope diffs from plan to ship (informational, not architectural):

  • Schemas live at inst/extdata/schemas/ (not root schemas/) — runtime accessible via system.file(). Decisions stay at root.
  • No separate R-CMD-check.yaml workflow (NGE convention: pkgdown.yaml's build_site_github_pages runs R CMD check implicitly; flooded/fresh/gq all match)
  • Hex sticker shipped in this PR (was originally listed as deferred follow-up)
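For the first point in the list above, runtime access to a schema under inst/extdata/schemas/ could look like the sketch below. It assumes the installed package is named `crate`; `system.file()` returns an empty string when the package or file is not found, so the lookup is guarded.

```r
# Assumes the package is installed under the name "crate".
schema_path <- system.file(
  "extdata", "schemas", "bcfp", "user_habitat_classification.yaml",
  package = "crate"
)

# system.file() returns "" when nothing matches, so guard before reading.
if (nzchar(schema_path)) {
  schema_lines <- readLines(schema_path)
} else {
  message("schema not found; is crate installed?")
}
```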

Problem

Crate is scoped (#1) and has the boundary principle documented in CLAUDE.md, but no R-package scaffolding and no shipped crt_* functions. The bcfishpass user_habitat_classification.csv schema event on 2026-04-26 (long→wide reshape with type change) is the first concrete opportunity to validate crate's role as both the canonical-shape authority AND the executable canonicalization engine. Per the comms-thread Path E decision, crate ships a source-explicit dispatcher (crt_ingest(source, file_name, path)) that link's source-agnostic API calls based on config metadata.

Proposed Solution

Scaffold crate as an R package, ship crt_ingest() with source = "bcfp" and file_name = "user_habitat_classification" as the first-instance handler, plus repo directory structure for schemas and decisions.

R package scaffolding

  • DESCRIPTION (Title, Description, Authors@R, MIT license, Imports: chk, cli, dplyr, readr, tibble, tidyr, yaml)
  • NAMESPACE (managed by roxygen2)
  • R/crate-package.R (package-level docs + imports)
  • tests/ with testthat 3e
  • _pkgdown.yml (bootstrap 5)
  • .github/workflows/R-CMD-check.yaml
  • .github/workflows/pkgdown.yaml
  • dev/dev.R (reproducible setup recipe — see soul r-packages convention)
  • data-raw/ (registry generation if applicable)
  • Hex sticker (deferred — open follow-up issue)

Public functions

crt_ingest(source, file_name, path)
# Source-explicit dispatcher.
# Looks up (source, file_name) in registry, validates path-supplied file's shape
# against schema YAML, dispatches to internal normalize handler, returns canonical tibble.
# Throws: if (source, file_name) not in registry, or shape doesn't match any known variant.

crt_files(source = NULL)
# Returns: tibble (source, file_name, normalize_fn, schema_yaml, canonical_cols)
# Filterable by source. The registry of (source, file_name) pairs crate knows how to ingest.

Path is required (no NULL default) — caller resolves paths (e.g. link's lnk_load_overrides() reads paths from configs). Decouples crate from any consumer's bundle layout. Path may be a local file, S3 URL, or remote URL — crate is path-source agnostic.
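A minimal sketch of the dispatch flow described above, in base R only. Everything suffixed `_sketch` is hypothetical and stands in for the package's own code: the registry is an in-memory data frame rather than the CSV, the only handler is a passthrough, and only local paths are read. Schema validation is left out.

```r
# Hypothetical in-memory registry; the planned one is a CSV (see Registry).
registry_sketch <- data.frame(
  source = "bcfp",
  file_name = "user_habitat_classification",
  normalize_fn = "normalize_identity",
  stringsAsFactors = FALSE
)

# Hypothetical handler table keyed by the registry's normalize_fn value.
handlers_sketch <- list(
  normalize_identity = function(df) df  # placeholder: returns data unchanged
)

crt_ingest_sketch <- function(source, file_name, path) {
  # `path` has no default: the caller resolves it.
  if (missing(path)) stop("`path` is required", call. = FALSE)

  hit <- registry_sketch[
    registry_sketch$source == source & registry_sketch$file_name == file_name,
    ,
    drop = FALSE
  ]
  if (nrow(hit) == 0L) {
    stop(
      sprintf("no registry entry for source '%s', file_name '%s'", source, file_name),
      call. = FALSE
    )
  }

  raw <- utils::read.csv(path, stringsAsFactors = FALSE)
  handler <- handlers_sketch[[hit$normalize_fn[[1]]]]
  handler(raw)
}

# Usage: a known pair dispatches; an unknown source fails with a diagnostic.
tmp <- tempfile(fileext = ".csv")
utils::write.csv(
  data.frame(blue_line_key = 1L, species_code = "SK"),
  tmp, row.names = FALSE
)
out <- crt_ingest_sketch("bcfp", "user_habitat_classification", path = tmp)
err <- tryCatch(
  crt_ingest_sketch("bogus_source", "user_habitat_classification", path = tmp),
  error = function(e) conditionMessage(e)
)
```

Keeping the lookup keyed on the (source, file_name) pair means a consumer such as link only has to pass through values it already holds in config.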

Internal handlers

  • R/internal_bcfp_user_habitat_classification.R — handles long format (identity passthrough) AND wide format (pivot to canonical long), returns canonical tibble. Not exported. Naming pattern: internal_<source>_<file_name>.R.
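The handler behaviour planned above (long passthrough, wide pivoted to long) can be sketched in base R as follows. This is an illustration of the plan as written, not the shipped handler, which per the post-merge note targets a wide canonical instead. `normalize_sketch` is a hypothetical name, and the sketch assumes that every non-key column in a wide file is a species column whose cells hold the habitat class.

```r
key_cols  <- c("watershed_group_code", "blue_line_key")
long_cols <- c(key_cols, "species_code", "habitat_class", "comment")

normalize_sketch <- function(df) {
  if (all(c("species_code", "habitat_class") %in% names(df))) {
    # Long input: passthrough, adding the optional comment column if absent.
    if (!"comment" %in% names(df)) df$comment <- NA_character_
    return(df[long_cols])
  }

  # Wide input: treat every non-key column as a species column (assumption).
  species_cols <- setdiff(names(df), c(key_cols, "comment"))
  if (!all(key_cols %in% names(df)) || length(species_cols) == 0L) {
    stop("shape not recognized", call. = FALSE)
  }

  pieces <- lapply(species_cols, function(sp) {
    data.frame(
      df[key_cols],
      species_code  = rep(sp, nrow(df)),
      habitat_class = as.character(df[[sp]]),
      comment       = rep(NA_character_, nrow(df)),
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, pieces)

  # Drop species with no classification on a given row, then sort.
  out <- out[!is.na(out$habitat_class), , drop = FALSE]
  out <- out[order(out$watershed_group_code, out$blue_line_key, out$species_code), ]
  rownames(out) <- NULL
  out
}

# Usage with made-up values.
wide <- data.frame(
  watershed_group_code = c("BULK", "BULK"),
  blue_line_key = c(1L, 2L),
  SK = c("spawning", NA),
  ST = c("rearing", "spawning"),
  stringsAsFactors = FALSE
)
long <- normalize_sketch(wide)
```

Passing `long` back through `normalize_sketch()` takes the passthrough branch, which is the property the acceptance criteria ask for: both input shapes converge on the same canonical output.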

Registry

  • inst/extdata/crate_registry.csv — runtime-read CSV (per design call: schema-as-data should be inspectable as data, not compiled to sysdata.rda). Columns: source, file_name, normalize_fn, schema_yaml, canonical_cols. v0.0.1 ships with one row (source=bcfp, file_name=user_habitat_classification).
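To make the runtime-read registry concrete, the sketch below writes a one-row CSV with the columns listed above to a temporary file and filters it by source. The cell values for `normalize_fn`, `schema_yaml`, and `canonical_cols` (including the semicolon-separated encoding) are assumptions for illustration, and `crt_files_sketch` is a hypothetical stand-in for `crt_files()`.

```r
# Write a stand-in registry to a temp file; cell values are illustrative.
registry_path <- tempfile(fileext = ".csv")
writeLines(c(
  "source,file_name,normalize_fn,schema_yaml,canonical_cols",
  paste(
    "bcfp",
    "user_habitat_classification",
    "internal_bcfp_user_habitat_classification",
    "schemas/bcfp/user_habitat_classification.yaml",
    "watershed_group_code;blue_line_key;species_code;habitat_class;comment",
    sep = ","
  )
), registry_path)

crt_files_sketch <- function(source = NULL, path = registry_path) {
  reg <- utils::read.csv(path, stringsAsFactors = FALSE)
  # NULL means "no filter"; otherwise keep only the requested source(s).
  if (!is.null(source)) {
    reg <- reg[reg$source %in% source, , drop = FALSE]
  }
  reg
}

all_files  <- crt_files_sketch()
bcfp_files <- crt_files_sketch(source = "bcfp")
no_files   <- crt_files_sketch(source = "local")
```

Because the registry is read as plain data at call time, it can be inspected or diffed like any other CSV, which is the stated reason for not compiling it into sysdata.rda.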

Schema YAML format

file: user_habitat_classification
description: Per-watershed-group, per-species habitat classification overrides curated upstream by smnorris/bcfishpass.
canonical:
  shape: long
  cols:
    - { name: watershed_group_code, type: string,  required: true }
    - { name: blue_line_key,        type: integer, required: true }
    - { name: species_code,         type: string,  required: true }
    - { name: habitat_class,        type: string,  required: true }
    - { name: comment,              type: string,  required: false }
upstream_variants:
  - id: pre-2026-04-26
    description: Long format with one row per (watershed, blueline, species) tuple
    cols: [watershed_group_code, blue_line_key, species_code, habitat_class, comment]
    normalize_fn: identity
  - id: 2026-04-26-wide
    description: Wide format with one row per (watershed, blueline) and per-species columns
    first_seen_sha: 40c4a0a
    cols: [watershed_group_code, blue_line_key, SK, ST, BT, ...]
    normalize_fn: pivot_wide_to_long
decisions:
  - decisions/bcfp/20260427_user_habitat_classification_long_canonical.md
upstream_source:
  repo: smnorris/bcfishpass
  path: data/user_habitat_classification.csv
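One way the `upstream_variants` block could drive shape detection is sketched below. To stay dependency-free, the variants are mirrored as an R list rather than parsed from YAML, and the `required` / `absent` fields are hypothetical helpers that do not appear in the proposed schema format. The wide variant's species columns are elided in the schema, so the sketch identifies it by its key columns plus the absence of the long-format columns.

```r
# Hand-written mirror of upstream_variants; `required` and `absent`
# are illustrative fields, not part of the proposed YAML.
variants_sketch <- list(
  list(
    id = "pre-2026-04-26",
    required = c("watershed_group_code", "blue_line_key",
                 "species_code", "habitat_class"),
    absent = character(0),
    normalize_fn = "identity"
  ),
  list(
    id = "2026-04-26-wide",
    required = c("watershed_group_code", "blue_line_key"),
    absent = c("species_code", "habitat_class"),
    normalize_fn = "pivot_wide_to_long"
  )
)

detect_variant_sketch <- function(col_names) {
  for (v in variants_sketch) {
    has_required <- all(v$required %in% col_names)
    has_absent   <- any(v$absent %in% col_names)
    if (has_required && !has_absent) return(v$id)
  }
  stop(
    "shape not recognized; columns: ", paste(col_names, collapse = ", "),
    call. = FALSE
  )
}

long_id <- detect_variant_sketch(
  c("watershed_group_code", "blue_line_key", "species_code",
    "habitat_class", "comment")
)
wide_id <- detect_variant_sketch(
  c("watershed_group_code", "blue_line_key", "SK", "ST")
)
```

A file with neither key column falls through both variants and raises the shape-not-recognized error. Note that this sketch would also accept a file containing only the two key columns as wide; a fuller check would require at least one species column.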

Source families: v0.0.1 scope

v0.0.1 registers one source: bcfp. A source family groups files by the upstream they are fetched from, matching link's existing CSV file-prefix convention: the cabd and dfo files all sync from the bcfp upstream, so they are registered as source = bcfp with file_name carrying the cabd/dfo prefix. Further source families will be added as project-experimental needs surface:

  • source = "local" — degenerate adapter (validates path-supplied file against named canonical schema, no normalization). Lands in v0.1.x when first project-experimental config needs to swap a bundled bcfp file with a local copy.
  • source = "nge" — NGE-curated data families (e.g. user_habitat_known from field work, user_barriers_extended from project surveys). Lands in v0.1.x.
  • source = "edna" — eDNA lab returns. Lands when first eDNA work begins (per SRED #28's first-domain plan).

Adding a source means adding registry rows, schema YAMLs, and adapter functions. No API changes are required.

Repo directory structure

Concern-first at root, source-second nested (per the structure proposal):

crate/
├── CLAUDE.md
├── README.md
├── DESCRIPTION                         # NEW (R pkg)
├── NAMESPACE                           # NEW (R pkg)
├── R/                                  # NEW
│   ├── crate-package.R
│   ├── crt_ingest.R
│   ├── crt_files.R
│   └── internal_bcfp_user_habitat_classification.R
├── tests/                              # NEW
│   └── testthat/
│       ├── test-crt_ingest.R
│       └── fixtures/bcfp/
│           ├── long_user_habitat_classification.csv
│           └── wide_user_habitat_classification.csv
├── inst/extdata/                       # NEW
│   └── crate_registry.csv
├── schemas/                            # NEW
│   ├── README.md
│   └── bcfp/
│       └── user_habitat_classification.yaml
├── decisions/                          # NEW
│   ├── README.md
│   └── bcfp/
│       └── 20260427_user_habitat_classification_long_canonical.md
├── _pkgdown.yml                        # NEW
├── .github/workflows/                  # NEW
│   ├── R-CMD-check.yaml
│   └── pkgdown.yaml
└── comms/                              # already exists

Acceptance criteria

R package machinery

  • R CMD check passes with 0 errors, 0 warnings, 0 notes
  • devtools::document() produces clean NAMESPACE
  • lintr::lint_package() passes
  • pkgdown site builds clean, lists crt_ingest and crt_files in reference

Functions

  • crt_ingest("bcfp", "user_habitat_classification", path = <long-format-fixture>) returns canonical-long tibble
  • crt_ingest("bcfp", "user_habitat_classification", path = <wide-format-fixture>) returns canonical-long tibble (identical output to long-input case)
  • crt_ingest("bcfp", "nonexistent_file", path = ...) throws with diagnostic
  • crt_ingest("bogus_source", "user_habitat_classification", path = ...) throws with diagnostic
  • crt_ingest("bcfp", "user_habitat_classification", path = <garbage-shape>) throws shape-not-recognized
  • crt_files() returns tibble with at minimum: source, file_name, normalize_fn, schema_yaml, canonical_cols
  • crt_files(source = "bcfp") filters to bcfp-sourced entries

Schemas + decisions

  • schemas/bcfp/user_habitat_classification.yaml exists with the proposed format
  • decisions/bcfp/20260427_user_habitat_classification_long_canonical.md exists with reasoning (long row-normalized, scales to N species, matches relational SQL patterns, matches fresh overlay expectations)
  • schemas/README.md and decisions/README.md exist

Release

  • Tagged release v0.0.1 so link can pin via Imports / Remotes

Dependencies / coordination

Context / related

  • Umbrella: #1 (8-year NGE data consolidation: canonical schemas, dictionary, QC, normalization)
  • Comms thread (architectural design): link/comms/crate/20260427_fresh_bcfishpass_csv_consumers.md
  • Implementation plan thread (forthcoming): link/comms/crate/20260427_bcfp_ingest_impl_plan.md
  • Boundary principle: crate/CLAUDE.md (Boundary with rfp section)
  • This is crate's first executable code surface — Path E decision (source-explicit dispatcher in crate, source-agnostic config-driven API in link). Validates crate's bet that schema-as-data + canonicalization-engine roles are jointly viable, AND that the dispatcher framework can grow to multiple source families (bcfp, local, nge, edna) without API changes.
