
Ground causal-graph predicates: 28-mapping cohort, 482 edges grounded #61

Merged
realmarcin merged 1 commit into main from ground-causal-predicates-v1 on May 23, 2026
Conversation

@realmarcin
Contributor

Summary

Adds the predicate-grounding pipeline that fills causal_graphs[].edges[].predicate_id from a curated label→CURIE TSV. This pass grounds 482 edges across 212 trait YAMLs using 28 high-confidence mappings spanning METPO, RO, biolink, and rdfs.

Mapping cohort (mappings/predicate_grounding.tsv, 28 rows)

  • 6 METPO ObjectProperty matches (produces, oxidizes, uses carbon/electron-donor/electron-acceptor/energy-source)
  • 4 RO matches (enables RO:0002327, contributes to RO:0002326, regulates RO:0002211, depends on RO:0002502)
  • 17 biolink slot matches (causes, catalyzes, associated_with, located_in + 3 aliases, participates_in, part_of, occurs_in, interacts_with, develops_into, consumes, produces, encodes, etc.)
  • 1 rdfs:subClassOf covering is a, specializes, example of

New machinery

  • scripts/ground_causal_predicates.py — idempotent: never overwrites existing groundings, validates closed-mode before write, appends one CurationEvent per modified file.
  • scripts/check_biolink_coverage.py — cross-checks applied mappings + residual labels against data/raw/biolink-model.yaml (vendored, 499 KB).
  • just ground-predicates and just check-biolink-coverage recipes.
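
The fill-only-if-empty rule behind the idempotency claim can be sketched roughly as follows. This is a minimal illustration, not the script itself: it assumes each edge stores its free-text label under a `predicate` key and that the mapping is a plain dict, and the function name `ground_edges` and its return shape are hypothetical. Closed-mode validation and the CurationEvent append are left out.

```python
from collections import Counter


def ground_edges(doc, mapping):
    """Fill empty predicate_id values in place (hypothetical helper).

    Returns (number_of_edges_grounded, Counter_of_unmapped_labels).
    """
    changes = 0
    residual = Counter()
    for graph in doc.get("causal_graphs") or []:
        for edge in graph.get("edges") or []:
            label = edge.get("predicate")  # assumed field name
            if not label:
                continue  # nothing to look up
            if edge.get("predicate_id"):
                continue  # an existing grounding is never overwritten
            curie = mapping.get(label)
            if curie is None:
                residual[label] += 1  # stays ungrounded, counted for the report
                continue
            edge["predicate_id"] = curie
            changes += 1
    return changes, residual


example_doc = {
    "causal_graphs": [
        {
            "edges": [
                {"predicate": "enables"},
                {"predicate": "regulates", "predicate_id": "EX:keep-me"},
                {"predicate": "drives"},
            ]
        }
    ]
}
example_mapping = {"enables": "RO:0002327", "regulates": "RO:0002211"}

first_pass, leftover = ground_edges(example_doc, example_mapping)
second_pass, _ = ground_edges(example_doc, example_mapping)
```

Because an edge with a `predicate_id` is skipped, a second pass over the same document makes no changes, which is the property the dry-run-after-apply check in the test plan relies on.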

Residual

537 edges across 191 distinct labels remain ungrounded. See reports/predicate_grounding_residual.tsv for the ranked tail; top residuals (manifests as 52, supports 26, selects for 20, drives 19) are curator-paraphrased predicates without a clean RO/Biolink home — candidates for an upstream METPO predicate proposal.
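
A ranked residual report of this kind could be produced along these lines. The sketch assumes the unmapped labels have already been tallied in a `Counter`; the column names `label` and `edge_count` and the helper name `write_residual` are illustrative guesses, not the actual layout of `reports/predicate_grounding_residual.tsv`.

```python
import csv
import io
from collections import Counter


def write_residual(residual, handle):
    """Write labels ranked by descending edge count, ties broken by label."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["label", "edge_count"])  # hypothetical header
    ranked = sorted(residual.items(), key=lambda item: (-item[1], item[0]))
    for label, count in ranked:
        writer.writerow([label, count])


# Counts for the four top residuals quoted in this PR description.
top_residuals = Counter(
    {"supports": 26, "drives": 19, "manifests as": 52, "selects for": 20}
)
buffer = io.StringIO()
write_residual(top_residuals, buffer)
report_lines = buffer.getvalue().splitlines()
```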

Audit snapshot

Includes audit-pass output from the audit-schema-gaps skill (reports/{gap_fix_backlog,schema_gap_audit,instance_validation_*,pipeline_*}). Corpus passes just validate-strict clean (0 ERROR rows / 357 files). The CI gate that locks this in is tracked as G01 in reports/gap_fix_backlog.md and lands in a follow-up PR.

Test plan

  • just validate-strict — 0 ERROR rows / 357 files
  • just ground-predicates (dry-run after --apply) — reports 0 additional groundings (idempotent)
  • just check-biolink-coverage — 28 applied mappings indexed, residual cross-checked
  • CI re-runs validate-strict on the diff (gate ships in PR2)

🤖 Generated with Claude Code

Adds the predicate-grounding pipeline used to populate
`causal_graphs[].edges[].predicate_id` from a curated label→CURIE TSV.
This pass grounds 482 edges across 212 trait YAMLs using 28
high-confidence mappings (METPO, RO, biolink, rdfs).

New machinery:
- `scripts/ground_causal_predicates.py` — walks `data/traits/**/*.yaml`,
  fills empty `predicate_id` from `mappings/predicate_grounding.tsv`,
  validates closed-mode before write, appends one CurationEvent per
  modified file, never overwrites existing groundings.
- `scripts/check_biolink_coverage.py` — cross-checks applied mappings
  and residual labels against the Biolink model
  (`data/raw/biolink-model.yaml`, vendored to keep CI self-contained).
- `just ground-predicates` and `just check-biolink-coverage` recipes.

Initial mapping cohort (`mappings/predicate_grounding.tsv`, 28 rows):
- 6 METPO ObjectProperty matches (produces, oxidizes, uses
  carbon/electron-donor/electron-acceptor/energy-source).
- 4 RO matches (enables RO:0002327, contributes to RO:0002326,
  regulates RO:0002211, depends on RO:0002502).
- 17 biolink slot matches (causes, catalyzes, associated_with,
  located_in, participates_in, part_of, occurs_in, interacts_with,
  develops_into, consumes, produces, encodes, plus three
  located_in aliases — localized in/to, localizes to).
- 1 rdfs:subClassOf for `is a`, `specializes`, `example of`.

Residual: 537 edges across 191 distinct labels remain ungrounded.
See `reports/predicate_grounding_residual.tsv` for the ranked tail;
top residuals (`manifests as`, `supports`, `selects for`, `drives`)
are curator-paraphrased predicates without a clean RO/Biolink home
and are candidates for an upstream METPO predicate proposal.

Includes audit-pass output from the audit-schema-gaps skill
(`reports/{gap_fix_backlog,schema_gap_audit,instance_validation_*,
pipeline_*}`). Corpus passes `just validate-strict` clean: 0 ERROR
rows across 357 files. The CI gate that locks this in is tracked
as G01 in `reports/gap_fix_backlog.md` and lands in a follow-up PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 23, 2026 08:25
@realmarcin realmarcin mentioned this pull request May 23, 2026
4 tasks

Copilot AI left a comment


Pull request overview

This PR adds a predicate-grounding workflow for TraitMech causal graphs, populating causal_graphs[].edges[].predicate_id from a curated label→CURIE TSV and wiring the workflow into just targets, alongside audit/coverage reports.

Changes:

  • Add predicate grounding/coverage tooling (new scripts + just recipes) driven by mappings/predicate_grounding.tsv.
  • Apply predicate groundings across many trait YAMLs by adding predicate_id plus new curation_history events.
  • Add audit/summary/report artifacts under reports/ to capture writer-audit and validation snapshots.

Reviewed changes

Copilot reviewed 228 out of 229 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| `mappings/predicate_grounding.tsv` | Adds the curated label→CURIE mapping cohort used for grounding. |
| `justfile` | Adds ground-predicates and check-biolink-coverage recipes. |
| `reports/pipeline_writers_audit.tsv` | Snapshot TSV of YAML-writer audit (currently appears incomplete vs new scripts). |
| `reports/pipeline_gap_audit.md` | Narrative audit of YAML-writing scripts and pipeline gaps (needs updates for new writer). |
| `reports/instance_validation_summary.md` | Summary of strict instance validation run. |
| `reports/instance_validation_failures.tsv` | Empty/placeholder failures TSV (header only). |
| `reports/gap_fix_backlog.tsv` | Backlog of pipeline/schema follow-ups. |
| `data/traits/**.yaml` | Adds predicate_id groundings and GROUND_CAUSAL_PREDICATES curation events across many traits. |


Comment on lines +1 to +5
path writes_yaml appends_curation_history has_write_safeguard validates_before_write wired_into_just
scripts/audit_writers.py yes yes yes yes yes
scripts/build_embedding_index.py yes no no no yes
scripts/render_trait_pages.py yes no yes no yes
scripts/seed_from_metpo.py yes yes yes no yes
Comment on lines +18 to +23
### `scripts/seed_from_metpo.py` — the only real trait-YAML writer

This is the entry point for new trait records. It uses the safer **opt-in** convention (`--apply` defaults off; bare invocation is dry-run) and appends `CurationEvent` entries when it writes — both correct. It does **not** validate output against the schema before writing.

**Gap (P1):** add an in-process strict validation pass before each write, using the same `linkml.validator.Validator(closed=True)` configured in `scripts/validate_strict.py`. If a record fails, log + skip rather than abort the whole run, so one bad record doesn't poison a 357-file seed. Effort: M (refactor the writer loop to construct a per-process Validator and call it before `path.write_text`). This is the highest-leverage fix on the pipeline axis because the seeder is the *only* path producing new trait records.

Comment on lines 135 to 137
- timestamp: '2026-05-09T00:00:00-07:00'
curator: Codex
action: CURATED_CAUSAL_GRAPH
Comment on lines 118 to 121
llm_assisted: false
- timestamp: '2026-05-09T00:00:00-07:00'
curator: Codex
action: CURATED_CAUSAL_GRAPH
Comment on lines +9 to +14
| path | writes_yaml | curation_history | safeguard | validates_first | wired_into_just |
|---|---|---|---|---|---|
| `scripts/audit_writers.py` | yes | yes | yes | yes | yes |
| `scripts/build_embedding_index.py` | yes | no | no | no | yes |
| `scripts/render_trait_pages.py` | yes | no | yes | no | yes |
| `scripts/seed_from_metpo.py` | yes | yes | yes | no | yes |
@realmarcin realmarcin merged commit 36b9fdd into main May 23, 2026
4 checks passed
@realmarcin realmarcin deleted the ground-causal-predicates-v1 branch May 23, 2026 20:46
realmarcin added a commit that referenced this pull request May 23, 2026
…#63)

Lifts the TraitMech causal-graph subsystem into METPO so downstream
consumers can filter trait records by mechanism axis using
METPO-native queries instead of TraitMech-internal LinkML enum
codes. Cohort is committed here in-repo only — not filed upstream
in this PR.

Cohort (proposals/metpo_traitmech_v1/):
- 3 top-level domain classes under METPO:1000000:
  - METPO:1007400 trait causal graph
  - METPO:1007401 trait causal node
  - METPO:1007402 trait causal edge
- 1 enum-parent under METPO:1007401:
  - METPO:1007410 trait causal node type
- 10 leaf classes under METPO:1007410, one per
  CausalNodeTypeEnum permissible value (METPO:1007411–1007420),
  e.g. causal-graph trait node, causal-graph pathway node,
  causal-graph environmental factor node (xref ENVO:01000254),
  causal-graph experimental factor node (xref EFO:0000001), etc.

Out of scope (documented in proposal.md):
- Scope A: no traitmech:NNNNNN synthetic IDs exist in corpus today.
- Scope B (causal-graph predicates): deferred until the
  predicate-grounding migration (#61) reduces the 191-label residual.
- 5 other LinkML enums (TraitCategoryEnum, TermKindEnum,
  SynonymTypeEnum, PriorityEnum, MappingStatusEnum) — workflow
  internals, not ontology axes.

Tooling:
- scripts/verify_metpo_proposal.py — column-count, header, parent
  integrity, subset tag, scope-A/C coverage. Wired as
  `just verify-proposal <cohort>`.
- scripts/robot_validate_proposal.py — `robot template → merge with
  metpo.owl → reason ELK`. Wired as
  `just robot-validate-proposal <cohort>`. Discovers robot via
  $ROBOT, $ROBOT_BIN, PATH, then ../kg-microbe/data/raw/robot.
- .gitignore: reports/robot/ (regenerable, dominated by re-serialized
  metpo.owl at ~500 KB per file).
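
The robot lookup order described above ($ROBOT, $ROBOT_BIN, PATH, then the sibling checkout) might look something like the sketch below. The function name `find_robot` and its injectable `env` and `which` parameters are hypothetical, added only to keep the example testable; whether the real script also checks that an environment-variable value points at an existing file is not stated, so this sketch does not.

```python
import os
import shutil
from pathlib import Path

# Last-resort location named in the commit message.
FALLBACK = Path("../kg-microbe/data/raw/robot")


def find_robot(env=None, which=shutil.which, fallback=FALLBACK):
    """Return the first robot candidate in priority order, or None."""
    env = os.environ if env is None else env
    for variable in ("ROBOT", "ROBOT_BIN"):
        candidate = env.get(variable)
        if candidate:
            return candidate  # explicit override wins
    on_path = which("robot")
    if on_path:
        return on_path
    if fallback.exists():
        return str(fallback)
    return None
```

Returning `None` rather than raising leaves it to the caller to decide whether a missing robot binary is a hard failure or a skipped check.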

Verification (run locally on this branch):
- `just verify-proposal metpo_traitmech_v1` → PASS, 0 failures.
- `just robot-validate-proposal metpo_traitmech_v1` → PASS,
  no UNSAT, ELK delta +6 axioms (the inferred subclass closure).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
realmarcin added a commit that referenced this pull request May 24, 2026
…al-only) (#65)

* Add METPO predicate proposal cohort metpo_traitmech_v2 (8 predicates)

Scope-B proposal: 8 new METPO object properties
(METPO:2007400-METPO:2007407) covering 128 of the 537 residual
causal-edge predicates left by the v1 grounding pass (#61).

Cohort (proposals/metpo_traitmech_v2/):
- manifests as       (METPO:2007400, 52 edges) — state → observable trait
- selects for        (METPO:2007401, 20 edges) — env condition → adapted trait
- feeds electrons into (METPO:2007402, 12 edges) — donor → transport chain
- transfers electrons to (METPO:2007403, 6 edges) — single-step redox
- fixed by           (METPO:2007404,  9 edges) — substrate → fixation pathway
- oxidized to        (METPO:2007405,  8 edges) — substrate → oxidized product
- challenges         (METPO:2007406,  9 edges) — stressor → tolerance trait
- mitigates          (METPO:2007407, 12 edges) — defense → stressor (paired)

Each candidate was checked against RO/Biolink first; rejections
documented per-row in proposal.md (e.g. biolink:manifestation_of has
range `disease` — too narrow; biolink:treats is clinical; etc.).

Subset tag: metpo_traitmech_2026_06. Domain = range = METPO:1007401
(trait causal node, minted in v1). ROBOT/ELK validates clean: delta
+6 axioms, no UNSAT (v1's METPO:1007401 resolves to unnamed external
IRI without v1 merged, which is fine — no error, just preserved
domain/range constraint).

Per-corpus impact (after re-running ground-predicates --apply with
the expanded 38-row mappings TSV):
- Edges grounded:        482 → 618 (+136)
- Edges residual:        537 → 401 (−136)
- Distinct labels:       191 → 181 (−10)

Also adds 2 RO mappings (controls, directs → RO:0002211 regulates)
that match the RO definition of regulation but were not in the v1
mapping cohort.
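
The impact figures can be cross-checked with a few lines of arithmetic. The per-predicate counts and the before/after totals below are copied from this message; the 8 edges attributed to the two RO mappings are inferred here as the difference between the +136 delta and the 128 edges covered by the proposed predicates, and are not stated directly.

```python
# Edge counts per proposed METPO predicate, as listed in the cohort above.
proposed = {
    "manifests as": 52,
    "selects for": 20,
    "feeds electrons into": 12,
    "transfers electrons to": 6,
    "fixed by": 9,
    "oxidized to": 8,
    "challenges": 9,
    "mitigates": 12,
}
covered_by_proposal = sum(proposed.values())

grounded_before, residual_before = 482, 537
grounded_after, residual_after = 618, 401

delta = grounded_after - grounded_before
# Inferred, not stated: edges picked up by "controls" / "directs".
inferred_ro_edges = delta - covered_by_proposal

total_before = grounded_before + residual_before
total_after = grounded_after + residual_after
```

The total stays constant across the pass, as expected when edges only move from the residual to the grounded bucket.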

NOT filed upstream in this PR (per user instruction). Cohort lives
in this repo only; upstream filing path documented in proposal.md.
Mapping TSV notes flag the 8 proposed CURIEs as
"proposed upstream in proposals/metpo_traitmech_v2" so reviewers
know they're pending METPO adoption.

Verified locally:
  - just verify-proposal metpo_traitmech_v2 → PASS (0 failures)
  - just robot-validate-proposal metpo_traitmech_v2 → PASS (ELK +6)
  - just validate-strict → 0 ERROR rows / 357 files

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address Copilot review on PR #65

Four fixes per Copilot inline comments:

- TSV row 3 (manifests as): swap definition_source from
  `copiotrophic.yaml` (which has no `manifests as` edge — only
  `selects for`, `supports`, etc.) to
  `nutrient_adaptation.yaml#nutrient_adaptation_life_history_axis`,
  which is the canonical graph where `manifests as` first appears.

- TSV row 9 (challenges): swap definition_source from
  `acidophilic.yaml` (no `challenges` edge — uses `selects for`) to
  `acidotolerant.yaml#acidotolerant_acid_stress_homeostasis`, which
  carries the `acidic_exposure challenges ...` edge directly.

- proposal.md context paragraph: correct grounded counts from the
  incorrect "648 of 1185" to the actual "618 of 1019", matching the
  Corpus Impact table. Verified via `uv run python <<<` count over
  data/traits/**/causal_graphs[].edges[] (total=1019, grounded=618,
  residual=401).

- proposal.md paired-predicate heading: rephrase "the only paired
  pair" → "the only paired predicate set" (removes the redundancy).

Verified: `just verify-proposal metpo_traitmech_v2` → PASS, 0 failures.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
realmarcin added a commit that referenced this pull request May 24, 2026
Adds the node-grounding pipeline, mirror of the predicate-grounding
work in #61. Fills empty `causal_graphs[].nodes[].grounding` from a
curated (label, node_type) → CURIE TSV.

This pass grounds 77 nodes across trait YAMLs (38% → 45% of
causal-graph nodes are now grounded; 1252 nodes total).

New machinery:
- scripts/ground_causal_nodes.py — walks data/traits/**/*.yaml,
  fills empty `grounding` from mappings/node_grounding.tsv,
  validates closed-mode before write, appends one CurationEvent per
  modified file, never overwrites existing groundings. Keyed on
  (label, node_type) since the same free-text label can refer to
  different ontology classes depending on node type (e.g. "terminal
  electron acceptor" as CHEMICAL vs MOLECULAR_FUNCTION).
- just ground-nodes recipe.

Hardening per the original Copilot review:
- load_mapping validates required headers (label, node_type,
  target_curie) up-front; raises ValueError with a helpful message
  if any are missing (instead of silently producing an empty
  mapping when DictReader returns None for missing keys).
- ground_nodes_in_doc returns a `grounded_keys` counter alongside
  the residual counter, so when a file fails validation the
  just-grounded nodes are re-added to the residual TSV. Without
  this, those nodes were invisible (removed from per-CURIE counts
  but not added to residual, even though the file is rejected and
  the nodes remain ungrounded on disk).
- reports/pipeline_writers_audit.tsv refreshed to include the new
  writer (4 → 5 rows).
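
The up-front header check could be sketched as below. The three required column names come from this message; everything else (reading from a file-like handle, skipping incomplete rows, the exact error wording) is an assumption made for illustration, and the `EX:` CURIEs are placeholders rather than real mappings.

```python
import csv
import io

REQUIRED_COLUMNS = ("label", "node_type", "target_curie")


def load_mapping(handle):
    """Read a (label, node_type) -> CURIE mapping from a TSV handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    present = reader.fieldnames or []
    missing = [column for column in REQUIRED_COLUMNS if column not in present]
    if missing:
        # Fail loudly instead of returning an empty mapping.
        raise ValueError(
            "mapping TSV is missing required column(s): " + ", ".join(missing)
        )
    mapping = {}
    for row in reader:
        label, node_type, curie = (
            (row.get(column) or "").strip() for column in REQUIRED_COLUMNS
        )
        if not (label and node_type and curie):
            continue  # assumed behavior: incomplete rows are skipped
        mapping[(label, node_type)] = curie
    return mapping


good_tsv = (
    "label\tnode_type\ttarget_curie\n"
    "terminal electron acceptor\tCHEMICAL\tEX:0000001\n"
    "terminal electron acceptor\tMOLECULAR_FUNCTION\tEX:0000002\n"
)
node_mapping = load_mapping(io.StringIO(good_tsv))
```

Keying on the `(label, node_type)` tuple is what lets the same label resolve to two different classes in the example.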

Initial mapping cohort (mappings/node_grounding.tsv, 39 rows):
- 14 CHEBI mappings for canonical metabolic chemicals (O2, CO2, CO,
  H2, CH4, methanol, NH3, NO3-, SO4(2-), S(2-), H+, Fe(2+), organic
  carbon, compatible solutes).
- 10 GO-BP mappings for canonical processes (peptidoglycan
  synthesis, methanogenesis, aerobic/anaerobic respiration,
  photosynthesis, N2 fixation, fermentation, C fixation, oxidative
  phosphorylation, cellular pH regulation, response to osmotic
  stress).
- 4 GO-CC mappings for canonical compartments (periplasmic space,
  outer membrane, plasma membrane, cytoplasm).
- 2 GO-MF mappings (kinase activity, oxidoreductase activity).
- 4 GO-BP/pathway mappings (ETC, photosynthetic ETC, Calvin-Benson,
  Wood-Ljungdahl).
- 2 PATO + 3 ENVO env-factor mappings (light intensity, decreased
  temperature, anaerobic + anoxic environment).

Residual: 688 nodes across 511 distinct (label, type) keys remain
ungrounded. See reports/node_grounding_residual.tsv. The largest
clusters are BIOLOGICAL_PROCESS abstractions (proton motive force,
biomass, membrane fluidity) and GENE_OR_PROTEIN families (MreB,
CRT enzymes, RuBisCO, FtsZ) — candidates for either a METPO
node-class proposal cohort or upstream UniProt/PRO grounding.

audit-writers TSV grows from 4 → 5 rows; the new script reports
appends_curation_history + has_write_safeguard +
validates_before_write all `yes` (matches the
ground_causal_predicates contract from #61).

Verified locally:
  - just ground-nodes (dry-run after --apply) → 0 additional groundings (idempotent)
  - header-missing test: TSV with bad headers raises ValueError naming the missing columns
  - just validate-strict → 0 ERROR rows / 357 files
  - just audit-writers → 5 writers, all wired into justfile

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
realmarcin added a commit that referenced this pull request May 24, 2026
* Add tests for grounding pipeline + audit scripts (+43 tests)

The grounding pipelines and audit scripts have been load-bearing
infrastructure for the last 7 PRs (#61, #66, #67, #69, #70 — all
of which rewrite causal-graph fields based on these scripts'
output). They had zero unit-test coverage. A silent regression in
idempotency, header validation, or self-suppression would not be
caught by validate-strict (which only checks per-record schema
conformance, not pipeline correctness).

Test counts:
  tests/test_ground_causal_predicates.py    9 tests
  tests/test_ground_causal_nodes.py        12 tests
  tests/test_validate_strict.py            11 tests
  tests/test_audit_writers.py              11 tests
  ---
  total new                                43 tests
  total suite                              54 tests (was 11)

Coverage highlights:

ground_causal_predicates.py:
- load_mapping: basic happy path, conflict detection (same label →
  different CURIEs raises ValueError), incomplete-row skipping,
  missing-file error.
- ground_edges_in_doc: idempotency (second pass = 0 changes),
  existing predicate_id never overwritten, residual counting for
  unmapped labels, empty/missing-predicate edges skipped.
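
The conflict-detection and incomplete-row behaviors listed for `load_mapping` might be implemented roughly as follows. The column names `label` and `target_curie` are assumed for the predicate TSV (only the node TSV's headers are named in these commits), and the function reads from a handle purely to keep the sketch self-contained.

```python
import csv
import io


def load_mapping(handle):
    """Read a label -> CURIE mapping; reject labels with two different CURIEs."""
    mapping = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        label = (row.get("label") or "").strip()  # assumed column name
        curie = (row.get("target_curie") or "").strip()  # assumed column name
        if not label or not curie:
            continue  # incomplete row: skipped, not an error
        previous = mapping.setdefault(label, curie)
        if previous != curie:
            raise ValueError(
                f"conflicting CURIEs for {label!r}: {previous} vs {curie}"
            )
    return mapping


aliases_tsv = (
    "label\ttarget_curie\n"
    "regulates\tRO:0002211\n"
    "controls\tRO:0002211\n"
    "half-filled row\t\n"
)
predicate_mapping = load_mapping(io.StringIO(aliases_tsv))
```

Several labels may share one CURIE (aliases); only one label pointing at two different CURIEs is treated as a conflict.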

ground_causal_nodes.py:
- All of the predicate suite plus:
- (label, node_type) keyed lookup — same label, different node_types
  map to different CURIEs without aliasing.
- Header validation (Copilot fix from PR #66): TSV with `nodetype`
  / `targetcurie` typo'd headers raises ValueError naming both
  missing columns.
- grounded_keys-on-validation-failure separability (Copilot fix
  from PR #66): caller can union residual + grounded_keys to
  recover the corpus-state residual after rolling back an invalid
  file write.
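
The residual-recovery step described in the last item can be illustrated with two counters. This is a sketch under assumptions: the helper name `fold_file_result` and the boolean `file_is_valid` flag are hypothetical, and the keys are example `(label, node_type)` pairs.

```python
from collections import Counter


def fold_file_result(corpus_residual, residual, grounded_keys, file_is_valid):
    """Accumulate one file's outcome into the corpus-wide residual."""
    corpus_residual.update(residual)  # unmapped nodes are always residual
    if not file_is_valid:
        # The write was rolled back, so nodes grounded in memory are
        # still ungrounded on disk and belong in the residual too.
        corpus_residual.update(grounded_keys)
    return corpus_residual


file_residual = Counter({("proton motive force", "BIOLOGICAL_PROCESS"): 2})
file_grounded = Counter({("terminal electron acceptor", "CHEMICAL"): 1})

rejected = fold_file_result(Counter(), file_residual, file_grounded, False)
accepted = fold_file_result(Counter(), file_residual, file_grounded, True)
```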

validate_strict.py:
- classify: parametrized over the 5 categories
  (unexpected_field, missing_required, enum_mismatch,
  pattern_mismatch, other) — the messages must match the actual
  jsonschema phrasings the validator emits.
- validate_one: clean record produces 0 errors; unknown field
  surfaces unexpected_field (the G01 gate behavior); missing
  required field surfaces missing_required; YAML parse error
  surfaces as yaml_parse_error category.
- iter_yaml_files: walks directories, filters .txt, picks up
  nested *.yaml.
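
A message-based classifier over the five categories could be sketched as below. The substrings are the standard phrasings emitted by the Python `jsonschema` package for `additionalProperties`, `required`, `enum`, and `pattern` failures; the order of checks and the function body are assumptions, not the contents of `validate_strict.py`.

```python
def classify(message):
    """Map a jsonschema error message to one of five coarse categories."""
    if "Additional properties are not allowed" in message:
        return "unexpected_field"
    if "is a required property" in message:
        return "missing_required"
    if "is not one of" in message:
        return "enum_mismatch"
    if "does not match" in message:
        return "pattern_mismatch"
    return "other"


samples = {
    "Additional properties are not allowed ('foo' was unexpected)": "unexpected_field",
    "'id' is a required property": "missing_required",
    "'MAYBE' is not one of ['YES', 'NO']": "enum_mismatch",
    "'abc' does not match '^[A-Z]+:[0-9]+$'": "pattern_mismatch",
    "3 is not of type 'string'": "other",
}
results = {message: classify(message) for message in samples}
```

Matching on message text is brittle by nature, which is why pinning the expected phrasings in parametrized tests is useful: a validator upgrade that rewords a message would fail the tests instead of silently landing in `other`.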

audit_writers.py:
- looks_like_yaml_writer: yaml.safe_dump / yaml.dump positive,
  bare .write_text negative, .write_text near .yaml hint positive,
  arbitrary code negative.
- audit: full-safeguards writer flagged yes/yes/yes/yes;
  no-safeguards writer flagged no/no/no; non-writer returns None;
  wired_into_just yes when justfile mentions the script stem.
- Self-suppression (Copilot fix from PR #64): audit_writers.py
  itself returns None even though its own source matches
  yaml.safe_dump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address Copilot review on PR #71

Add explicit `assert "b.yml" not in names` to
test_iter_yaml_files_walks_directory_and_filters — the prior test
documented the .yml-skipping behavior in a comment but never
asserted it, so a regression that started picking up .yml during
directory walks would have slipped through silently.

Also add test_iter_yaml_files_accepts_yml_file_passed_directly
to lock in the asymmetry that the previous test only hinted at:
iter_yaml_files() does accept .yml when passed as a file argument
(only the rglob('*.yaml') walk is .yaml-only).
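
The asymmetry being locked in can be reproduced with a small stand-in. The body of `iter_yaml_files` below is a guess that matches the described behavior (directories are walked with `rglob('*.yaml')`, while a `.yml` file passed directly is accepted); how the real function treats other file types passed directly is not stated, so dropping them here is an assumption.

```python
import tempfile
from pathlib import Path


def iter_yaml_files(paths):
    """Yield YAML files from a mix of directory and file arguments."""
    for raw in paths:
        path = Path(raw)
        if path.is_dir():
            # Directory walk is .yaml-only.
            yield from sorted(path.rglob("*.yaml"))
        elif path.suffix in {".yaml", ".yml"}:
            # Explicit file arguments may be .yaml or .yml.
            yield path


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "nested").mkdir()
    for name in ("a.yaml", "b.yml", "notes.txt", "nested/c.yaml"):
        (root / name).write_text("id: example\n")
    walked = sorted(p.name for p in iter_yaml_files([root]))
    direct = [p.name for p in iter_yaml_files([root / "b.yml"])]
```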

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Update audit_writers tests to match #75's tightened heuristic

PR #75 changed `looks_like_yaml_writer` to require that the
yaml-serializer call feed directly into write_text on the same
line (instead of the looser "any .write_text + any .yaml token"
heuristic, which produced false positives for scripts that only
READ trait YAMLs).

The pre-#75 test asserted that
`path.write_text(content)  # .yaml` counted as a YAML writer.
That returned True under the old heuristic and False under the
new (correct) one. Replace it with two tests that lock in the
new contract:

  test_looks_like_yaml_writer_write_text_of_yaml_dump
    Positive: write_text(yaml.safe_dump(...)) / write_text(yaml.dump(...))
    both count.

  test_looks_like_yaml_writer_write_text_of_json_is_false
    Negative: a script that reads *.yaml then writes JSON via
    write_text is NOT a YAML writer — this is the false-positive
    case #75 explicitly fixed for scripts/build_embedding_index.py
    and scripts/render_trait_pages.py.
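
The tightened contract can be approximated with a single regular expression. This is only a sketch of the same-line rule described above; the pattern and function body are assumptions, and any other signals the real `looks_like_yaml_writer` may use are not modeled.

```python
import re

# write_text( ... ) whose argument starts with a yaml serializer call.
_YAML_WRITE = re.compile(r"\.write_text\(\s*yaml\.(?:safe_dump|dump)\(")


def looks_like_yaml_writer(source):
    """True if any single line feeds a yaml dump straight into write_text."""
    return any(_YAML_WRITE.search(line) for line in source.splitlines())


writer_source = "path.write_text(yaml.safe_dump(doc, sort_keys=False))\n"
dump_source = "path.write_text(yaml.dump(doc))\n"
comment_only = "path.write_text(content)  # .yaml\n"
reader_source = (
    "docs = [yaml.safe_load(p.read_text()) for p in root.rglob('*.yaml')]\n"
    "out.write_text(json.dumps(docs))\n"
)
```

Under this rule the comment-only case and the read-YAML-write-JSON case both come out negative, matching the two replacement tests.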

Also rename test_looks_like_yaml_writer_write_text_without_yaml_hint_is_false
to test_looks_like_yaml_writer_write_text_plain_is_false since the
"yaml hint" phrasing was tied to the old heuristic.

56 tests pass (was 54; +2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>