Ground GENE_OR_PROTEIN nodes via kg-microbe UniProt transform (+42 nodes) #70

Merged: realmarcin merged 2 commits into main from ground-proteins-uniprot on May 24, 2026

@realmarcin

Summary

Uses kg-microbe's merged-kg_uniprot_nodes.tsv (≈2.2 GB, 19.8 M UniprotKB: rows) as the source-of-truth label index for TraitMech's ungrounded GENE_OR_PROTEIN nodes. Streams the file once, matches each TraitMech residual GENE_OR_PROTEIN label against UniProt names, and grounds 42 nodes via 27 representative UniProt CURIEs.

Per user request: "use the kgm uniprot transform for grounding and reference" — the path used is /Users/marcin/Documents/VIMSS/ontology/KG-Hub/KG-Microbe/kg-microbe/merged-kg_uniprot_nodes.tsv.

New tooling

scripts/match_uniprot_to_proteins.py — one-off matcher. Streams the kg-microbe UniProt nodes file, collects per-label candidates (cap 500/label), and picks a single representative per label via a two-tier selector:

  1. Tier 1 (best) — UniProt name (cleaned of trailing parens) equals the TraitMech label exactly. Among those, alphabetically-first CURIE for determinism.
  2. Tier 2 — name ends with " <label>" as the final whitespace token (e.g. "Polarized growth protein Scy" for label scy). Prefer SHORTEST such name (fewer modifier words like chaperone, maturation protein, assembly factor), alphabetic CURIE as tiebreaker.
  3. Otherwise — skip (too ambiguous to ground cleanly).

Also skips a hand-curated set of abstract-category labels that shouldn't be grounded to a specific UniProt entry: gene product, virulence factors, chaperone proteins, thermostable proteins, salinity-adaptation genes, cold-shock proteins, proton export pumps and antiporters, membrane transporters, gliding motility machinery, rod complex.
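
The two-tier selector and the skip list described above can be sketched roughly as follows. This is an illustration only, not the code in scripts/match_uniprot_to_proteins.py: the names `clean_name` and `pick_representative_sketch`, the `(curie, name)` candidate shape, the case-insensitive comparison, and the shortened `SKIP_LABELS` set are all assumptions.

```python
import re

# Hypothetical subset of the hand-curated abstract-category skip list.
SKIP_LABELS = {"gene product", "virulence factors", "chaperone proteins"}


def clean_name(name: str) -> str:
    """Strip trailing parenthesised qualifiers: 'Crescentin (CreS)' -> 'Crescentin'."""
    return re.sub(r"(\s*\([^()]*\))+\s*$", "", name).strip()


def pick_representative_sketch(label, candidates):
    """Return one (curie, name) pair for `label`, or None if too ambiguous.

    `candidates` is an iterable of (curie, uniprot_name) tuples.
    Names are compared case-insensitively, which is an assumption.
    """
    if label in SKIP_LABELS:
        return None
    target = label.lower()
    cleaned = [(curie, name, clean_name(name).lower()) for curie, name in candidates]

    # Tier 1: cleaned name equals the label; alphabetically-first CURIE wins.
    exact = [(curie, name) for curie, name, low in cleaned if low == target]
    if exact:
        return min(exact)

    # Tier 2: label is the final whitespace-delimited token; the shortest
    # cleaned name wins and the CURIE breaks ties.
    suffix = [(len(low), curie, name) for curie, name, low in cleaned
              if low.endswith(" " + target)]
    if suffix:
        _, curie, name = min(suffix)
        return curie, name

    # Otherwise: skip, too ambiguous to ground cleanly.
    return None
```

Sorting tuples keeps both tiers deterministic without a separate tiebreak step, which is one simple way to get the reproducibility the description asks for.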

Mapping additions

mappings/node_grounding.tsv grows 47 → 74 rows. 27 UniProt mappings — examples:

| TraitMech label | UniProt CURIE | UniProt name |
| --- | --- | --- |
| mreb | UniprotKB:A0A1B1UYY2 | MreB |
| ftsz | UniprotKB:C0LUM8 | FtsZ |
| diviva | UniprotKB:Q1IYG2 | DivIVA |
| atp synthase | UniprotKB:A0A415TT77 | ATP synthase |
| na+/h+ antiporter | UniprotKB:A0A068T423 | Na+/H+ antiporter |
| superoxide dismutase | UniprotKB:A0A009QPW9 | Superoxide dismutase (Cu-Zn) |
| methyl-coenzyme m reductase | UniprotKB:A0A099T5Q9 | Methyl-coenzyme M reductase |
| pqq-dependent methanol dehydrogenase | UniprotKB:A0A4U8YZA6 | PQQ-dependent methanol dehydrogenase (Alpha subunit) |
| crescentin | UniprotKB:A0A2N9AY16 | Crescentin (CreS) |
| pbp2b | UniprotKB:A0A892RPK7 | Penicillin-binding protein PBP2B |

Full per-label audit at reports/uniprot_match_candidates.tsv.

Corpus impact

| Metric | Before | After |
| --- | --- | --- |
| Nodes grounded | 622 | 664 (+42, 50% → 53%) |
| Nodes residual | 630 | 588 (−42) |
| Distinct (label, type) residuals | 503 | 476 (−27) |
| Mappings TSV (rows) | 47 | 74 (+27) |

Caveats

  • The 27 representatives are species-specific UniProt entries, not canonical family-level groundings. PRO (Protein Ontology) would be the cleaner target for protein families but no local PRO snapshot is available; this PR picks the representative-UniProt approach explicitly authorized by the user.
  • 41 of 93 GENE_OR_PROTEIN labels had at least one UniProt name match; 27 picked after the tier-1/2 filter; 14 had matches but failed the strict tier criteria (mostly chaperone / assembly factor / domain-containing protein candidates); 10 hand-skipped as abstract categories; 42 had no UniProt match at all (likely TraitMech-specific paraphrases or non-protein labels — salt-in strategy, weak/absent shape-determining cytoskeleton, etc.).
  • The matcher path is currently hardcoded to a local kg-microbe checkout; future runs need either an env var or a copy of merged-kg_uniprot_nodes.tsv into data/raw/. Left as-is for this PR since the script is one-off.
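
One way the hardcoded path in the last caveat could be made configurable is sketched below. The `--kg-uniprot-nodes` flag name follows the suggestion in the Copilot review; the `KG_UNIPROT_NODES` environment variable, the `data/raw/` default, and both function names are assumptions for illustration and do not exist in this PR.

```python
import argparse
import os
from pathlib import Path

# Hypothetical names: the env var and the default location are assumptions.
ENV_VAR = "KG_UNIPROT_NODES"
DEFAULT_RELATIVE = Path("data/raw/merged-kg_uniprot_nodes.tsv")


def resolve_nodes_path(argv=None, environ=None) -> Path:
    """Resolve the UniProt nodes TSV: CLI flag first, then env var, then default."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser()
    parser.add_argument("--kg-uniprot-nodes", type=Path, default=None)
    # parse_known_args lets the script keep its other flags (e.g. --apply).
    args, _unknown = parser.parse_known_args(argv)
    if args.kg_uniprot_nodes is not None:
        return args.kg_uniprot_nodes
    if environ.get(ENV_VAR):
        return Path(environ[ENV_VAR])
    return DEFAULT_RELATIVE


def require_file(path: Path) -> Path:
    """Fail early with a clear message instead of midway through a scan."""
    if not path.is_file():
        raise FileNotFoundError(
            f"UniProt nodes TSV not found: {path} "
            f"(pass --kg-uniprot-nodes or set {ENV_VAR})"
        )
    return path
```

Keeping the default relative to the repository means a missing file produces an explicit error on other machines and in CI rather than a path that only resolves on one checkout.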

Verified locally

$ uv run python scripts/match_uniprot_to_proteins.py --apply
  labels with at least one match: 41 / 93
  representatives picked: 27
  abstract-category labels skipped: 10
  appended 27 rows to mappings/node_grounding.tsv

$ just ground-nodes --apply
  files modifiable:    [N]
  nodes grounded:      42
  residual nodes:      588 across 476 distinct (label, type) keys

$ just ground-nodes        # second pass, idempotency check
  nodes grounded:      0

$ just validate-strict
  files scanned:      357
  files with ERROR:   0

Test plan

  • Idempotency: re-run of ground-nodes produces 0 additional groundings
  • validate-strict clean
  • Audit trail in reports/uniprot_match_candidates.tsv
  • CI re-runs validate-strict on YAML diff

🤖 Generated with Claude Code

…des)

Uses kg-microbe's merged-kg_uniprot_nodes.tsv (≈2.2 GB, 19.8M
UniprotKB rows) as the source-of-truth label index for TraitMech's
ungrounded GENE_OR_PROTEIN nodes. Streams the file once, matches
each TraitMech residual GENE_OR_PROTEIN label against UniProt
names, and grounds 42 nodes via 27 representative UniProt CURIEs.

New tooling:
- scripts/match_uniprot_to_proteins.py — one-off matcher. Streams
  the kg-microbe UniProt nodes file (path is currently hardcoded
  to the local checkout), collects per-label candidates up to a
  500-match cap, and picks a single representative per label.

Two-tier representative selector:
  1. Tier 1 (best) — UniProt name (cleaned of trailing parens)
     equals the TraitMech label exactly. Among those, pick the
     alphabetically-first CURIE for determinism.
  2. Tier 2 — name ends with " <label>" as the final token
     (e.g. "Polarized growth protein Scy" for label "scy"). Prefer
     SHORTEST such name (fewer modifier words like "chaperone",
     "maturation protein", "assembly factor"), alphabetic CURIE
     as tiebreaker.
  3. Otherwise — skip (too ambiguous to ground cleanly).

Also skips a hand-curated set of abstract-category labels that
shouldn't be grounded to a specific UniProt entry: gene product,
virulence factors, chaperone proteins, thermostable proteins,
salinity-adaptation genes, cold-shock proteins, proton export
pumps and antiporters, membrane transporters, gliding motility
machinery, rod complex.

Mapping additions (mappings/node_grounding.tsv, 47 → 74 rows):
  27 UniProt mappings — examples:
    mreb        → UniprotKB:A0A1B1UYY2  MreB
    ftsz        → UniprotKB:C0LUM8      FtsZ
    diviva      → UniprotKB:Q1IYG2      DivIVA
    atp synthase → UniprotKB:A0A415TT77 ATP synthase
    superoxide dismutase → UniprotKB:A0A009QPW9
    methyl-coenzyme m reductase → UniprotKB:A0A099T5Q9
    pqq-dependent methanol dehydrogenase → UniprotKB:A0A4U8YZA6

Per-corpus impact:
  Nodes grounded:   622 → 664 (+42, 50% → 53%)
  Nodes residual:   630 → 588 (−42)
  Distinct keys:    503 → 476 (−27)
  Mappings TSV:     47 → 74 rows

Match-audit trail saved at reports/uniprot_match_candidates.tsv —
93 GENE_OR_PROTEIN labels processed; 41 had at least one UniProt
match; 27 picked (after tier-1/2 filter); 14 had matches but failed
the strict tier criteria (mostly entries that pulled "chaperone" /
"assembly factor" / "domain-containing protein" candidates); 10
hand-skipped as abstract categories; 42 had no UniProt match at
all (likely TraitMech-specific paraphrases or non-protein labels).

The 27 representatives are species-specific UniProt entries — not
canonical family-level groundings (PRO would be the cleaner
target but no local PRO snapshot is available). Documented as
"representative UniProt entry" in the mapping notes column.

Verified locally:
  - just validate-strict → 0 ERROR rows / 357 files
  - just ground-nodes (dry-run after --apply) → 0 additional groundings (idempotent)
  - just audit-writers → script picked up; appends_curation_history=no (it's a one-off matcher, not a YAML-writer per se — it writes to mappings/ + reports/)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 24, 2026 05:06

Copilot AI left a comment

Pull request overview

This PR adds a one-off matcher that uses kg-microbe’s UniProt node export to select representative UniProt CURIEs for previously-ungrounded GENE_OR_PROTEIN causal-graph nodes, then updates the curated mapping table and applies the resulting groundings across affected trait YAMLs (with an audit report and updated residual report).

Changes:

  • Add scripts/match_uniprot_to_proteins.py to stream merged-kg_uniprot_nodes.tsv, generate candidate matches, and (optionally) append selected representatives to mappings/node_grounding.tsv.
  • Expand mappings/node_grounding.tsv with 27 new UniProt representative mappings and commit the audit output reports/uniprot_match_candidates.tsv.
  • Apply new grounding: CURIEs to GENE_OR_PROTEIN nodes across multiple trait YAML causal graphs; update reports/node_grounding_residual.tsv accordingly.

Reviewed changes

Copilot reviewed 35 out of 35 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| scripts/match_uniprot_to_proteins.py | New streaming matcher to find UniProt candidates and append representative mappings. |
| reports/uniprot_match_candidates.tsv | New audit TSV listing candidate counts and chosen representatives per label. |
| reports/node_grounding_residual.tsv | Updated residual report after applying new mappings/groundings. |
| mappings/node_grounding.tsv | Adds 27 GENE_OR_PROTEIN → UniProt representative mappings. |
| data/traits/physiology/methylotrophic.yaml | Adds UniProt groundings for methanol dehydrogenase nodes; appends curation event. |
| data/traits/physiology/methanotrophic.yaml | Adds UniProt groundings for methane monooxygenase / methanol dehydrogenase; appends curation event. |
| data/traits/physiology/hydrogenotrophic.yaml | Adds UniProt grounding for hydrogenase; appends curation event. |
| data/traits/physiology/chemolithotrophic.yaml | Adds UniProt grounding for ammonia monooxygenase; appends curation event. |
| data/traits/physiology/carboxydotrophic.yaml | Adds UniProt groundings for CODH / molybdenum hydroxylase; appends curation event. |
| data/traits/morphology/spirochete_shaped.yaml | Adds UniProt grounding for FlaB; appends curation event. |
| data/traits/morphology/rod_shaped.yaml | Adds UniProt groundings for MreB and FtsZ; appends curation event. |
| data/traits/morphology/ovoid_shaped.yaml | Adds UniProt grounding for DivIVA; appends curation event. |
| data/traits/morphology/oval_shaped.yaml | Adds UniProt grounding for PBP2b; appends curation event. |
| data/traits/morphology/motility.yaml | Adds UniProt grounding for type IV pilus; appends curation event. |
| data/traits/morphology/motile.yaml | Adds UniProt grounding for type IV pilus; appends curation event. |
| data/traits/morphology/helical_shaped.yaml | Adds UniProt grounding for CcmA; appends curation event. |
| data/traits/morphology/filament_shaped.yaml | Adds UniProt groundings for DivIVA and Scy; appends curation event. |
| data/traits/morphology/ellipsoidal.yaml | Adds UniProt groundings for PBP2b and DivIVA; appends curation event. |
| data/traits/morphology/crescent_shaped.yaml | Adds UniProt grounding for crescentin; appends curation event. |
| data/traits/morphology/coccobacillus_shaped.yaml | Adds UniProt grounding for MreB; appends curation event. |
| data/traits/morphology/cell_shape.yaml | Adds UniProt groundings for MreB, FtsZ, crescentin; appends curation event. |
| data/traits/morphology/brown_pigmented.yaml | Adds UniProt grounding for 4-hydroxyphenylpyruvate dioxygenase; appends curation event. |
| data/traits/morphology/bacillus_shaped.yaml | Adds UniProt groundings for MreB and FtsZ; appends curation event. |
| data/traits/metabolism/substrate_level_phosphorylation.yaml | Adds UniProt grounding for acetate kinase; appends curation event. |
| data/traits/metabolism/respiration.yaml | Adds UniProt grounding for ATP synthase; appends curation event. |
| data/traits/metabolism/methanogenesis.yaml | Adds UniProt grounding for methyl-coenzyme M reductase; appends curation event. |
| data/traits/metabolism/electron_transfer.yaml | Adds UniProt groundings for redox protein and c-type cytochrome; appends curation event. |
| data/traits/metabolism/aerobic_respiration.yaml | Adds UniProt groundings for cytochrome c oxidase and ATP synthase; appends curation event. |
| data/traits/environment/obligately_alkaphilic.yaml | Adds UniProt grounding for Na+/H+ antiporter; appends curation event. |
| data/traits/environment/neutrophilic.yaml | Adds UniProt grounding for cation/proton antiporter; appends curation event. |
| data/traits/environment/hyperthermophilic.yaml | Adds UniProt grounding for reverse gyrase; appends curation event. |
| data/traits/environment/facultatively_alkaphilic.yaml | Adds UniProt grounding for Na+/H+ antiporter; appends curation event. |
| data/traits/environment/alkaphilic.yaml | Adds UniProt grounding for Na+/H+ antiporter; appends curation event. |
| data/traits/environment/alkalotolerant.yaml | Adds UniProt grounding for cation/proton antiporter; appends curation event. |
| data/traits/environment/aerotolerant.yaml | Adds UniProt grounding for superoxide dismutase; appends curation event. |

Comments suppressed due to low confidence (5)

scripts/match_uniprot_to_proteins.py:52

  • KG_UNIPROT_NODES is hardcoded to a local absolute path ("/Users/marcin/..."), which makes this script non-runnable for other contributors and in CI. Please make the UniProt-nodes TSV path configurable (e.g., --kg-uniprot-nodes arg and/or env var), and keep the repo default either unset or relative to the repo (with a clear error if missing).

scripts/match_uniprot_to_proteins.py:98

  • build_regex() will compile a pattern that matches the empty string if labels is empty (because (?:|...) becomes (?:)), which would make regex.finditer(name) yield zero-length matches at every position and effectively hang the scan. Please guard against an empty labels list (e.g., return early in main() with a clear message, or raise in build_regex()).

scripts/match_uniprot_to_proteins.py:206

  • The match_count reported to reports/uniprot_match_candidates.tsv is len(cands), but cands is capped at MAX_MATCHES_PER_LABEL, so counts of 500 are ambiguous (could mean 500 total, or “>=500”). Please track total match occurrences separately from the stored candidate list (e.g., total_matches[label]), and write both to the report so ambiguity/skip thresholds can be reasoned about correctly.

scripts/match_uniprot_to_proteins.py:225

  • --apply appends to mappings/node_grounding.tsv unconditionally, so re-running the script will duplicate rows (and make diffs/noise harder to manage). Please load existing mappings first and skip (or update in-place) any (label, node_type) keys that already exist, and consider sorting/deduplicating output for deterministic reruns.

mappings/node_grounding.tsv:52

  • The per-row notes string claims the representative was selected via “name-ends-with-label + alphabetic-first CURIE”, but the actual selection logic described/implemented in the matcher includes an exact-match tier and a “shortest suffix-hit name” preference with CURIE only as a tiebreaker. Please update the notes text to accurately reflect the selection criteria (or reference the script/report) so the mapping provenance is not misleading.
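
The empty-pattern guard and the separate total-match counter requested above might look roughly like this. It is a sketch under assumptions, not the script's code: the `collect` function, the `(curie, name)` row shape, the word-boundary lookarounds, and the lowercase-label convention are hypothetical; only `build_regex`, `MAX_MATCHES_PER_LABEL`, and the cap of 500 come from the review text.

```python
import re
from collections import defaultdict

MAX_MATCHES_PER_LABEL = 500  # cap named in the review comment


def build_regex(labels):
    """Compile one alternation over all labels; refuse an empty label list."""
    labels = [lab for lab in labels if lab]
    if not labels:
        raise ValueError("no labels to match; refusing to build an empty pattern")
    # Longest first so a longer label wins over a label that is its prefix.
    ordered = sorted(labels, key=len, reverse=True)
    alternation = "|".join(re.escape(lab) for lab in ordered)
    return re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)


def collect(rows, labels):
    """Scan (curie, name) rows; return (capped candidates, uncapped totals).

    Labels are assumed to be lowercase, so matched text is lowercased
    before it is used as a dictionary key.
    """
    regex = build_regex(labels)
    stored = defaultdict(list)
    totals = defaultdict(int)
    for curie, name in rows:
        hits = {m.group(0).lower() for m in regex.finditer(name)}
        for label in hits:
            totals[label] += 1  # always counted, even past the cap
            if len(stored[label]) < MAX_MATCHES_PER_LABEL:
                stored[label].append((curie, name))
    return stored, totals
```

Writing both `len(stored[label])` and `totals[label]` to the report would let a reader tell "exactly 500 matches" apart from "500 or more".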

Comment thread scripts/match_uniprot_to_proteins.py Outdated
Comment thread scripts/match_uniprot_to_proteins.py Outdated
Comment thread mappings/node_grounding.tsv Outdated
Comment thread reports/uniprot_match_candidates.tsv Outdated
Comment thread data/traits/metabolism/aerobic_respiration.yaml Outdated
Addresses five Copilot findings: CURIE-prefix normalization across
the emitted artifacts, plus one case of docstring drift:

- CURIE prefix UniprotKB → UniProtKB across all emitted artifacts.
  The kg-microbe source data uses `UniprotKB:` (lowercase p) but
  the TraitMech LinkML schema declares `UniProtKB` (uppercase P)
  in its prefix-map. CURIE expansion is case-sensitive in many
  consumers, so the mismatch would break downstream IRI
  resolution.

  Updated:
    - mappings/node_grounding.tsv          27 rows
    - reports/uniprot_match_candidates.tsv 27 rows
    - data/traits/**/*.yaml                31 files (grounding values)

  scripts/match_uniprot_to_proteins.py now reads using the
  source-data spelling (`UniprotKB:` startswith filter, since
  that's what kg-microbe emits) and normalizes the CURIE in-place
  before storing it in the matches index, so downstream artifacts
  are written in the schema-canonical form.

- Docstring rewrite to match the implemented selection algorithm.
  The original docstring described a suffix-first + fallback-to-
  any-containing + skip-above-100 scheme, but `pick_representative`
  actually implements a strict tier-1 (exact match) + tier-2
  (suffix-token, shortest-name) + skip-otherwise scheme. Docstring
  now matches the code, and also documents the SKIP_LABELS
  blocklist and the CURIE-prefix normalization contract above.
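
The read-with-source-spelling, write-with-schema-spelling contract
described above can be sketched as follows. The function names and
the assumed column order (CURIE first, name second) are
illustrative assumptions, not the layout of the kg-microbe file or
the script's actual code.

```python
SOURCE_PREFIX = "UniprotKB:"     # spelling emitted by kg-microbe
CANONICAL_PREFIX = "UniProtKB:"  # spelling declared in the TraitMech schema


def normalize_curie(curie: str) -> str:
    """Rewrite the source prefix to the schema-canonical one; leave others alone."""
    if curie.startswith(SOURCE_PREFIX):
        return CANONICAL_PREFIX + curie[len(SOURCE_PREFIX):]
    return curie


def iter_uniprot_rows(lines):
    """Yield (canonical_curie, name) from tab-separated lines.

    Assumes the CURIE is the first column and the name the second;
    the real nodes file may order its columns differently.
    """
    for line in lines:
        curie, _, rest = line.rstrip("\n").partition("\t")
        if not curie.startswith(SOURCE_PREFIX):
            continue  # filter on the spelling the source actually uses
        name = rest.split("\t", 1)[0]
        yield normalize_curie(curie), name
```

Normalizing at read time means every downstream artifact sees only
the canonical spelling, so no writer needs its own fix-up step.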

Verified: just validate-strict → 0 ERROR rows / 357 files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@realmarcin realmarcin merged commit ab3592c into main May 24, 2026
1 check passed
@realmarcin realmarcin deleted the ground-proteins-uniprot branch May 24, 2026 08:12
realmarcin added a commit that referenced this pull request May 24, 2026
…0 nodes)

Adds 14 mappings across ENVIRONMENTAL_FACTOR (PATO + ENVO),
CHEMICAL (CHEBI), and PATHWAY (GO) to ground the top remaining
groundable labels in reports/node_grounding_residual.tsv after
#66 (39 base) + #69 (8 METPO) + #70 (27 UniProt) + #72 (biomass
retype).

Additions (mappings/node_grounding.tsv, 75 → 89 rows):

  ENVIRONMENTAL_FACTOR (5 rows, PATO + ENVO):
    acidic external ph        → PATO:0001428  acidic pH        (5 nodes)
    alkaline external ph      → PATO:0001429  alkaline pH      (5 nodes)
    near-neutral external ph  → PATO:0001432  neutral pH       (4 nodes)
    very high temperature     → PATO:0001637  extremely high temperature (2 nodes)
    high-salt environment     → ENVO:01000687 saline environment (2 nodes)

  CHEMICAL (3 rows, CHEBI):
    thiosulfate               → CHEBI:33542  thiosulfate(2-)        (2 nodes)
    electron donor            → CHEBI:17499  electron donor          (2 nodes)
    organic compound          → CHEBI:50860  organic molecular entity (6 nodes)

  PATHWAY (6 rows, GO):
    membrane electron transport chain → GO:0022900 ETC (3 nodes)
    electron transport chain          → GO:0022900     (1 node)
    electron transport system         → GO:0022900     (1 node)
    co2-fixation pathway              → GO:0015977 carbon fixation (3 nodes)
    autotrophic co2 fixation          → GO:0015977     (3 nodes)
    co2 fixation pathway              → GO:0015977     (1 node)

The PATHWAY additions collapse 6 distinct corpus-paraphrased
labels onto 2 GO terms, demonstrating that the
(label, node_type) mapping convention supports
multi-label-→-one-CURIE without conflict.
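
A small sketch of why several labels can share one CURIE without
conflict under a (label, node_type) key: a conflict only arises
when the same key is given two different CURIEs. The function
name `build_lookup` and the row tuples are hypothetical; the
labels and GO IDs are taken from the table above.

```python
def build_lookup(rows):
    """Build {(label, node_type): curie}; many keys may share one CURIE."""
    lookup = {}
    for label, node_type, curie in rows:
        key = (label.strip().lower(), node_type)
        if key in lookup and lookup[key] != curie:
            # Only a same-key, different-CURIE pair counts as a conflict.
            raise ValueError(f"conflicting CURIEs for {key}: {lookup[key]} vs {curie}")
        lookup[key] = curie
    return lookup


ROWS = [
    ("membrane electron transport chain", "PATHWAY", "GO:0022900"),
    ("electron transport chain", "PATHWAY", "GO:0022900"),
    ("electron transport system", "PATHWAY", "GO:0022900"),
    ("co2-fixation pathway", "PATHWAY", "GO:0015977"),
]
LOOKUP = build_lookup(ROWS)
```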

Per-corpus impact:
  Mapping TSV:      75 → 89 rows (+14)
  Nodes grounded:   ~704 (56%) → ~744 (59%)

Verified:
  - just ground-nodes --apply → 40 newly grounded
  - just ground-nodes (idempotency) → 0
  - just validate-strict → 0 ERROR rows / 357 files

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
realmarcin added a commit that referenced this pull request May 24, 2026
…0 nodes) (#74)

realmarcin added a commit that referenced this pull request May 24, 2026
The grounding pipelines and audit scripts have been load-bearing
infrastructure for the last 7 PRs (#61, #66, #67, #69, #70 — all
of which rewrite causal-graph fields based on these scripts'
output). They had zero unit-test coverage. A silent regression in
idempotency, header validation, or self-suppression would not be
caught by validate-strict (which only checks per-record schema
conformance, not pipeline correctness).

Test counts:
  tests/test_ground_causal_predicates.py    9 tests
  tests/test_ground_causal_nodes.py        12 tests
  tests/test_validate_strict.py            11 tests
  tests/test_audit_writers.py              11 tests
  ---
  total new                                43 tests
  total suite                              54 tests (was 11)

Coverage highlights:

ground_causal_predicates.py:
- load_mapping: basic happy path, conflict detection (same label →
  different CURIEs raises ValueError), incomplete-row skipping,
  missing-file error.
- ground_edges_in_doc: idempotency (second pass = 0 changes),
  existing predicate_id never overwritten, residual counting for
  unmapped labels, empty/missing-predicate edges skipped.
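
The behaviours listed for ground_edges_in_doc can be illustrated
with a minimal sketch. Only `predicate_id` is named in this
commit message; the `causal_graph` / `edges` / `predicate` field
names, the function name, and the `EX:` CURIEs are assumptions.

```python
def ground_edges_sketch(doc, mapping):
    """Fill predicate_id on edges in place; return (changed, residual_counts)."""
    changed = 0
    residual = {}
    for edge in doc.get("causal_graph", {}).get("edges", []):
        label = edge.get("predicate")
        if not label:
            continue  # empty or missing predicate: skipped entirely
        if edge.get("predicate_id"):
            continue  # an existing grounding is never overwritten
        curie = mapping.get(label)
        if curie is None:
            residual[label] = residual.get(label, 0) + 1
            continue
        edge["predicate_id"] = curie
        changed += 1
    return changed, residual


DOC = {"causal_graph": {"edges": [
    {"predicate": "enables"},
    {"predicate": "enables", "predicate_id": "EX:curated"},
    {"predicate": "unmapped verb"},
    {"predicate": ""},
]}}
MAPPING = {"enables": "EX:0000001"}  # placeholder CURIE for illustration
```

Because a grounded edge is skipped on sight, a second call over
the same document changes nothing, which is the idempotency
property the tests lock in.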

ground_causal_nodes.py:
- All of the predicate suite plus:
- (label, node_type) keyed lookup — same label, different node_types
  map to different CURIEs without aliasing.
- Header validation (Copilot fix from PR #66): TSV with `nodetype`
  / `targetcurie` typo'd headers raises ValueError naming both
  missing columns.
- grounded_keys-on-validation-failure separability (Copilot fix
  from PR #66): caller can union residual + grounded_keys to
  recover the corpus-state residual after rolling back an invalid
  file write.
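
A sketch of the header validation and keyed lookup described
above. The `node_type` and `target_curie` column names are
inferred from the typo'd headers mentioned in this message; the
`label` column, the function name, and the `EX:` CURIEs are
assumptions.

```python
import csv
import io

REQUIRED_COLUMNS = ("label", "node_type", "target_curie")


def load_node_mapping(handle):
    """Read a mapping TSV into {(label, node_type): curie}, validating headers."""
    reader = csv.DictReader(handle, delimiter="\t")
    present = reader.fieldnames or []
    missing = [col for col in REQUIRED_COLUMNS if col not in present]
    if missing:
        # Name every missing column so a typo'd header is easy to spot.
        raise ValueError("mapping TSV is missing columns: " + ", ".join(missing))
    mapping = {}
    for row in reader:
        label, node_type, curie = ((row[col] or "").strip() for col in REQUIRED_COLUMNS)
        if not (label and node_type and curie):
            continue  # incomplete rows are skipped
        mapping[(label, node_type)] = curie
    return mapping


GOOD = "label\tnode_type\ttarget_curie\nbiomass\tCHEMICAL\tEX:1\nbiomass\tPHENOTYPE\tEX:2\n"
BAD = "label\tnodetype\ttargetcurie\nbiomass\tCHEMICAL\tEX:1\n"
```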

validate_strict.py:
- classify: parametrized over the 5 categories
  (unexpected_field, missing_required, enum_mismatch,
  pattern_mismatch, other) — the messages must match the actual
  jsonschema phrasings the validator emits.
- validate_one: clean record produces 0 errors; unknown field
  surfaces unexpected_field (the G01 gate behavior); missing
  required field surfaces missing_required; YAML parse error
  surfaces as yaml_parse_error category.
- iter_yaml_files: walks directories, filters .txt, picks up
  nested *.yaml.
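
The message-based classification that the classify tests
exercise might be sketched like this. The five category names
come from this commit message; the substrings are the phrasings
jsonschema commonly emits for these failures, but they can vary
between versions, and the function name and rule table are
assumptions.

```python
import re

# Ordered rules: first matching pattern decides the category.
_RULES = [
    ("unexpected_field", re.compile(r"Additional properties are not allowed")),
    ("missing_required", re.compile(r"is a required property")),
    ("enum_mismatch", re.compile(r"is not one of")),
    ("pattern_mismatch", re.compile(r"does not match")),
]


def classify_sketch(message: str) -> str:
    """Map a validator error message onto one of the five categories."""
    for category, pattern in _RULES:
        if pattern.search(message):
            return category
    return "other"
```

Matching on message text is brittle by nature, which is why
the tests pin the exact phrasings the validator produces.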

audit_writers.py:
- looks_like_yaml_writer: yaml.safe_dump / yaml.dump positive,
  bare .write_text negative, .write_text near .yaml hint positive,
  arbitrary code negative.
- audit: full-safeguards writer flagged yes/yes/yes/yes;
  no-safeguards writer flagged no/no/no; non-writer returns None;
  wired_into_just yes when justfile mentions the script stem.
- Self-suppression (Copilot fix from PR #64): audit_writers.py
  itself returns None even though its own source matches
  yaml.safe_dump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
realmarcin added a commit that referenced this pull request May 24, 2026
* Add tests for grounding pipeline + audit scripts (+43 tests)

* Address Copilot review on PR #71

Add explicit `assert "b.yml" not in names` to
test_iter_yaml_files_walks_directory_and_filters — the prior test
documented the .yml-skipping behavior in a comment but never
asserted it, so a regression that started picking up .yml during
directory walks would have slipped through silently.

Also add test_iter_yaml_files_accepts_yml_file_passed_directly
to lock in the asymmetry that the previous test only hinted at:
iter_yaml_files() does accept .yml when passed as a file argument
(only the rglob('*.yaml') walk is .yaml-only).
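
The asymmetry described above can be sketched as follows. The
function name and the exact suffix check for direct file
arguments are assumptions; only the rglob('*.yaml') walk and
the acceptance of a directly passed .yml file come from this
message.

```python
from pathlib import Path


def iter_yaml_files_sketch(paths):
    """Yield YAML files: directories are walked for *.yaml only,
    while explicit file arguments may be .yaml or .yml."""
    for raw in paths:
        path = Path(raw)
        if path.is_dir():
            # Directory walk: .yml and .txt files are not picked up.
            yield from sorted(path.rglob("*.yaml"))
        elif path.suffix in {".yaml", ".yml"}:
            yield path
```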

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Update audit_writers tests to match #75's tightened heuristic

PR #75 changed `looks_like_yaml_writer` to require that the
yaml-serializer call feed directly into write_text on the same
line (instead of the looser "any .write_text + any .yaml token"
heuristic, which produced false positives for scripts that only
READ trait YAMLs).

The pre-#75 test asserted that
`path.write_text(content)  # .yaml` counted as a YAML writer.
That returned True under the old heuristic and False under the
new (correct) one. Replace it with two tests that lock in the
new contract:

  test_looks_like_yaml_writer_write_text_of_yaml_dump
    Positive: write_text(yaml.safe_dump(...)) / write_text(yaml.dump(...))
    both count.

  test_looks_like_yaml_writer_write_text_of_json_is_false
    Negative: a script that reads *.yaml then writes JSON via
    write_text is NOT a YAML writer — this is the false-positive
    case #75 explicitly fixed for scripts/build_embedding_index.py
    and scripts/render_trait_pages.py.
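
A rough sketch of the tightened same-line contract these two
tests lock in. The regex and function name are assumptions and
cover only the write_text case described here; the real
`looks_like_yaml_writer` in #75 may check more than this.

```python
import re

# write_text( must be fed directly by yaml.dump( or yaml.safe_dump(
# on the same line.
_WRITE_TEXT_OF_YAML_DUMP = re.compile(r"\.write_text\(\s*yaml\.(?:safe_)?dump\(")


def looks_like_yaml_writer_sketch(source: str) -> bool:
    """True if any single line passes a yaml dump straight into write_text."""
    return any(_WRITE_TEXT_OF_YAML_DUMP.search(line) for line in source.splitlines())
```

Checking line by line is what removes the old false positive:
a `.yaml` token on one line and a `write_text` call on another
no longer combine into a match.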

Also rename test_looks_like_yaml_writer_write_text_without_yaml_hint_is_false
to test_looks_like_yaml_writer_write_text_plain_is_false since the
"yaml hint" phrasing was tied to the old heuristic.

56 tests pass (was 54; +2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>