Config CSV provenance + pipeline run stamps (close the drift loop) #40

@NewGraphEnvironment

Problem

Pipeline outputs drift silently when underlying inputs change: CSV syncs, fwapg refreshes, bcfishobs updates. On 2026-04-22 a 0.4 percentage-point shift in the BT rearing diff vs bcfishpass looked like a refactor regression. It was not: the legacy script run against the same DB produced identical numbers, so the drift came entirely from environment state changes between the earlier comparison run (2026-04-15) and the 2026-04-22 run. Without a stamp of all inputs at run time, "what changed?" is unanswerable.

This issue is about closing that loop end-to-end: every config CSV carries provenance; every pipeline run emits a stamp; drift between any two runs is diffable from their stamps alone.
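To make "diffable from their stamps alone" concrete, here is a minimal sketch of comparing two stamps. It is written in Python purely for illustration (the package itself is R), and it assumes a stamp is a nested mapping. The names flatten and diff_stamps and the " > " path separator are hypothetical, not part of link.

```python
ABSENT = "<absent>"  # marker for a key that exists in only one stamp


def flatten(obj: dict, prefix: str = "") -> dict:
    """Flatten a nested mapping into {'outer > inner': value} pairs."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix} > {key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_stamps(old: dict, new: dict) -> dict:
    """Return {path: (old_value, new_value)} for every leaf that differs."""
    a, b = flatten(old), flatten(new)
    return {
        path: (a.get(path, ABSENT), b.get(path, ABSENT))
        for path in sorted(a.keys() | b.keys())
        if a.get(path, ABSENT) != b.get(path, ABSENT)
    }
```

With stamps from the 2026-04-15 and 2026-04-22 runs, a result like this would name the inputs that moved without needing access to either environment.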

Proposed Solution

Two layers:

1. Config-bundle provenance (at rest)

Extend inst/extdata/configs/<name>/config.yaml with a provenance section per synced file:

provenance:
  overrides/user_modelled_crossing_fixes.csv:
    source: https://github.com/smnorris/bcfishpass
    path: data/user_modelled_crossing_fixes.csv
    upstream_sha: ea3c5d8
    synced: 2026-04-13
    checksum: sha256:ab12cd34...
  rules.yaml:
    generated_from: dimensions.csv
    generated_by: lnk_rules_build
    generator_sha: 8266b52
  dimensions.csv:
    source: link (hand-authored)
    upstream_sha: 8266b52
    synced: 2026-04-13

lnk_config() reads this and exposes cfg$provenance. A lnk_config_verify() helper re-computes checksums on load and warns if any file drifted from its stored hash.
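The check that lnk_config_verify() would perform can be sketched as follows. This is an illustration in Python rather than the package's R, and it assumes that checksum values are stored as "sha256:" followed by the full hex digest and that provenance keys are paths relative to the config directory. The names sha256_of and verify_provenance are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the file's digest in the 'sha256:<hex>' form assumed above."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in chunks so large CSVs are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return "sha256:" + digest.hexdigest()


def verify_provenance(config_dir: Path, provenance: dict) -> dict:
    """Map each provenanced file to 'ok', 'drifted', 'missing' or 'no checksum'."""
    report = {}
    for rel_path, entry in provenance.items():
        stored = entry.get("checksum")
        target = config_dir / rel_path
        if stored is None:
            # Generated or hand-authored entries may carry no checksum.
            report[rel_path] = "no checksum"
        elif not target.is_file():
            report[rel_path] = "missing"
        elif sha256_of(target) == stored:
            report[rel_path] = "ok"
        else:
            report[rel_path] = "drifted"
    return report
```

Returning a per-file report rather than raising on the first mismatch matches the "warn, don't fail" behaviour described above: the caller decides whether drift is fatal.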

2. Run stamps (at run)

Every pipeline invocation emits a run-stamp object recording:

  • cfg$provenance (the "at rest" state of every input CSV)
  • Software versions + git SHAs: link, fresh, bcfishpass, fwapg
  • DB snapshot fingerprints: bcfishobs row count, fwa_stream_networks_sp last-vacuum time or relfilenode, bcfishpass reference row counts per species
  • AOI + schema + break_order + any species = override
  • Start/end timestamps, elapsed per phase
  • The resulting comparison tibble (if a reference was provided)
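One possible shape for the stamp, sketched in Python for illustration (the package itself is R). The field names and the build_run_stamp helper are assumptions, not an agreed schema. The version values in the usage lines are taken from the Versions section of this issue; the DB snapshot and parameters are left empty rather than invented.

```python
from datetime import datetime, timezone


def build_run_stamp(provenance, versions, db_snapshot, params,
                    started, finished, phase_seconds, comparison=None):
    """Assemble a plain nested dict covering the items listed above."""
    return {
        "provenance": provenance,      # the "at rest" state of every input CSV
        "versions": versions,          # software versions and git SHAs
        "db_snapshot": db_snapshot,    # row counts, relfilenode, etc.
        "params": params,              # AOI, schema, break_order, overrides
        "timing": {
            "started": started.isoformat(),
            "finished": finished.isoformat(),
            "elapsed_s": (finished - started).total_seconds(),
            "phases_s": dict(phase_seconds),
        },
        "comparison": comparison,      # None when no reference was provided
    }


# Usage: timestamps are captured around the (omitted) pipeline phases.
started = datetime.now(timezone.utc)
finished = datetime.now(timezone.utc)
stamp = build_run_stamp(
    provenance={},
    versions={"fresh": "0.14.0", "link": "0.3.0", "bcfishpass": "ea3c5d8"},
    db_snapshot={},
    params={},
    started=started,
    finished=finished,
    phase_seconds={},
)
```

Keeping the stamp a plain nested structure keeps it inspectable without a dedicated schema, in line with the non-goals below.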

Implementation options:

  • Expand lnk_stamp() (#24, "lnk_stamp: export model parameters for report appendix") from "report appendix markdown" into "runtime reproducibility record." The report-appendix flavour becomes one rendering of the same underlying stamp object (as.markdown(stamp)).
  • Emit the run stamp as the return value of a forthcoming lnk_pipeline_run() wrapper (not built yet; right now pipelines are composed explicitly via lnk_pipeline_* phase calls).
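The "one stamp, several renderings" idea from the first option can be sketched like this. It is Python for illustration only (the package itself is R), stamp_to_markdown is a hypothetical stand-in for whatever as.markdown(stamp) ends up being, and the heading layout is an assumption.

```python
def stamp_to_markdown(stamp: dict, title: str = "Run stamp") -> str:
    """Render a nested stamp as a markdown section with one list per top-level key."""
    lines = [f"## {title}", ""]
    for section, content in stamp.items():
        lines.append(f"### {section}")
        lines.append("")
        if isinstance(content, dict):
            # One bullet per entry; nested values are printed as-is.
            for key, value in content.items():
                lines.append(f"- {key}: {value}")
        else:
            lines.append(f"- {content}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

The renderer only reads the stamp, so the same object can back both the report appendix and the verification log.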

Scope for a first PR

  1. Add a provenance block to inst/extdata/configs/bcfishpass/config.yaml for every file currently tracked. Backfill with the smnorris SHA we know from the research doc (ea3c5d8, synced 2026-04-13).
  2. Add cfg$provenance to the lnk_config() return value.
  3. Add lnk_config_verify(cfg), which recomputes the sha256 of every provenanced file and reports drift.
  4. Expand lnk_stamp() (reusing the scope of #24, "lnk_stamp: export model parameters for report appendix") to produce a runtime-stamp list that merges cfg$provenance with runtime software and DB snapshot info.
  5. Wire the stamp into the top of the data-raw/compare_bcfishpass.R output so that every verification log starts with a stamp dump.

Non-goals

  • Full diff-viewer tool — just capturing the data; diffing two stamps is a later concern.
  • CSV auto-sync from upstream — that's a cron/maintenance issue, not a library one.
  • Machine-readable schema that all downstream packages must consume — keep the stamp inspectable as a plain list.

Cross-refs

Versions

  • fresh: 0.14.0
  • link: 0.3.0 (on branch 38-targets-pipeline)
  • bcfishpass: ea3c5d8
  • fwapg: Docker (FWA 20240830, channel_width synced from tunnel 2026-04-13)
