Ground GENE_OR_PROTEIN nodes via kg-microbe UniProt transform (+42 nodes) by realmarcin · Pull Request #70 · CultureBotAI/TraitMech

realmarcin · 2026-05-24T05:06:43Z

Summary

Uses kg-microbe's merged-kg_uniprot_nodes.tsv (≈2.2 GB, 19.8 M rows with the UniprotKB: prefix) as the source-of-truth label index for TraitMech's ungrounded GENE_OR_PROTEIN nodes. The script streams the file once, matches each residual TraitMech GENE_OR_PROTEIN label against UniProt names, and grounds 42 nodes via 27 representative UniProt CURIEs.

Per user request: "use the kgm uniprot transform for grounding and reference" — the path used is /Users/marcin/Documents/VIMSS/ontology/KG-Hub/KG-Microbe/kg-microbe/merged-kg_uniprot_nodes.tsv.

New tooling

scripts/match_uniprot_to_proteins.py — one-off matcher. Streams the kg-microbe UniProt nodes file, collects per-label candidates (cap 500/label), and picks a single representative per label via a two-tier selector:

Tier 1 (best) — UniProt name (cleaned of trailing parens) equals the TraitMech label exactly. Among those, alphabetically-first CURIE for determinism.
Tier 2 — name ends with " <label>" as the final whitespace token (e.g. "Polarized growth protein Scy" for label scy). Prefer SHORTEST such name (fewer modifier words like chaperone, maturation protein, assembly factor), alphabetic CURIE as tiebreaker.
Otherwise — skip (too ambiguous to ground cleanly).
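
The two tiers above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in scripts/match_uniprot_to_proteins.py: the function signature, the clean_name helper, the (CURIE, name) candidate shape, case-insensitive comparison, and measuring "shortest" by character count are all assumptions made for the sketch. The example identifiers (EX:...) are placeholders.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical candidate shape: (curie, uniprot_name).
Candidate = Tuple[str, str]


def clean_name(name: str) -> str:
    """Strip trailing parenthesised groups, e.g. 'Crescentin (CreS)' -> 'Crescentin'."""
    return re.sub(r"(\s*\([^()]*\))+\s*$", "", name).strip()


def pick_representative(label: str, candidates: List[Candidate]) -> Optional[Candidate]:
    """Return one representative for `label`, or None when neither tier applies."""
    label_lc = label.lower()

    # Tier 1: cleaned name equals the label (case-insensitive, an assumption).
    exact = [c for c in candidates if clean_name(c[1]).lower() == label_lc]
    if exact:
        return min(exact, key=lambda c: c[0])  # alphabetically-first CURIE

    # Tier 2: the label is the final whitespace-delimited token(s) of the name.
    suffix = [c for c in candidates if clean_name(c[1]).lower().endswith(" " + label_lc)]
    if suffix:
        # Shortest cleaned name wins; CURIE breaks ties.
        return min(suffix, key=lambda c: (len(clean_name(c[1])), c[0]))

    return None  # too ambiguous to ground


scy_candidates = [
    ("EX:2", "Polarized growth protein Scy"),
    ("EX:1", "Putative chaperone assembly factor Scy"),
]
print(pick_representative("scy", scy_candidates))
```

Sorting on a (length, CURIE) tuple keeps the choice deterministic across runs, which is the property the tier description asks for.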

Also skips a hand-curated set of abstract-category labels that shouldn't be grounded to a specific UniProt entry: gene product, virulence factors, chaperone proteins, thermostable proteins, salinity-adaptation genes, cold-shock proteins, proton export pumps and antiporters, membrane transporters, gliding motility machinery, rod complex.
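
A single streaming pass with a per-label cap and the skip list might look like the sketch below. The column names `id` and `name`, the collect_candidates function, and the substring-style matching are assumptions for illustration; only the 500-candidate cap, the `UniprotKB:` row filter, and the skipped labels come from this PR.

```python
import csv
import io
import re
from collections import defaultdict

MAX_MATCHES_PER_LABEL = 500
SKIP_LABELS = {
    "gene product", "virulence factors", "chaperone proteins",
    "thermostable proteins", "salinity-adaptation genes", "cold-shock proteins",
    "proton export pumps and antiporters", "membrane transporters",
    "gliding motility machinery", "rod complex",
}


def collect_candidates(lines, labels, cap=MAX_MATCHES_PER_LABEL):
    """One pass over a tab-separated node file; returns {label: [(curie, name), ...]}.

    `labels` are expected in lowercase; skipped labels are never searched for.
    """
    wanted = sorted(set(labels) - SKIP_LABELS, key=len, reverse=True)
    if not wanted:
        return {}
    pattern = re.compile("|".join(re.escape(lbl) for lbl in wanted), re.IGNORECASE)
    matches = defaultdict(list)
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        curie, name = row.get("id") or "", row.get("name") or ""
        if not curie.startswith("UniprotKB:"):
            continue  # only UniProt rows are candidates
        # A set avoids storing the same row twice for one label.
        for label in {m.group(0).lower() for m in pattern.finditer(name)}:
            if len(matches[label]) < cap:
                matches[label].append((curie, name))
    return dict(matches)


sample = (
    "id\tname\n"
    "UniprotKB:P1\tMreB\n"
    "NCBITaxon:1\tMreB\n"
    "UniprotKB:P2\tRod shape-determining protein MreB\n"
)
print(collect_candidates(io.StringIO(sample), ["mreb", "gene product"]))
```

Because the reader consumes rows lazily, memory use is bounded by the number of labels times the cap rather than by the size of the input file.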

Mapping additions

mappings/node_grounding.tsv grows 47 → 74 rows. 27 UniProt mappings — examples:

TraitMech label	UniProt CURIE	UniProt name
`mreb`	`UniprotKB:A0A1B1UYY2`	MreB
`ftsz`	`UniprotKB:C0LUM8`	FtsZ
`diviva`	`UniprotKB:Q1IYG2`	DivIVA
`atp synthase`	`UniprotKB:A0A415TT77`	ATP synthase
`na+/h+ antiporter`	`UniprotKB:A0A068T423`	Na+/H+ antiporter
`superoxide dismutase`	`UniprotKB:A0A009QPW9`	Superoxide dismutase (Cu-Zn)
`methyl-coenzyme m reductase`	`UniprotKB:A0A099T5Q9`	Methyl-coenzyme M reductase
`pqq-dependent methanol dehydrogenase`	`UniprotKB:A0A4U8YZA6`	PQQ-dependent methanol dehydrogenase (Alpha subunit)
`crescentin`	`UniprotKB:A0A2N9AY16`	Crescentin (CreS)
`pbp2b`	`UniprotKB:A0A892RPK7`	Penicillin-binding protein PBP2B

Full per-label audit at reports/uniprot_match_candidates.tsv.

Corpus impact

Metric	Before	After
Nodes grounded	622	664 (+42, 50% → 53%)
Nodes residual	630	588 (−42)
Distinct (label, type) residuals	503	476 (−27)
Mappings TSV (rows)	47	74 (+27)

Caveats

The 27 representatives are species-specific UniProt entries, not canonical family-level groundings. PRO (Protein Ontology) would be the cleaner target for protein families but no local PRO snapshot is available; this PR picks the representative-UniProt approach explicitly authorized by the user.
Of the 93 GENE_OR_PROTEIN labels, 41 had at least one UniProt name match: 27 were picked by the tier-1/2 filter and 14 failed the strict tier criteria (mostly chaperone / assembly factor / domain-containing protein candidates). Of the remaining 52, 10 were hand-skipped as abstract categories and 42 had no UniProt match at all (likely TraitMech-specific paraphrases or non-protein labels such as salt-in strategy or weak/absent shape-determining cytoskeleton).
The matcher path is currently hardcoded to a local kg-microbe checkout; future runs need either an env var or a copy of merged-kg_uniprot_nodes.tsv into data/raw/. Left as-is for this PR since the script is one-off.
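
One way the hardcoded path could be made configurable is sketched below. The resolve_nodes_path helper, the KG_UNIPROT_NODES environment variable name, and the precedence order (flag, then environment, then data/raw/) are assumptions for illustration; the script in this PR does not do this.

```python
import argparse
import os
from pathlib import Path

# Repo-relative fallback, matching the data/raw/ location mentioned above.
DEFAULT_NODES = Path("data/raw/merged-kg_uniprot_nodes.tsv")


def resolve_nodes_path(argv=None, env=None) -> Path:
    """Resolve the UniProt nodes TSV: CLI flag first, then env var, then default."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(description="path-resolution sketch")
    parser.add_argument("--kg-uniprot-nodes", type=Path, default=None)
    args, _unknown = parser.parse_known_args(argv)

    candidate = args.kg_uniprot_nodes or Path(env.get("KG_UNIPROT_NODES", DEFAULT_NODES))
    if not candidate.is_file():
        raise FileNotFoundError(
            f"UniProt nodes TSV not found at {candidate}; "
            "pass --kg-uniprot-nodes or set KG_UNIPROT_NODES"
        )
    return candidate
```

Failing with an explicit message when the file is missing keeps the script usable by other contributors without silently falling back to a machine-specific path.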

Verified locally

$ uv run python scripts/match_uniprot_to_proteins.py --apply
  labels with at least one match: 41 / 93
  representatives picked: 27
  abstract-category labels skipped: 10
  appended 27 rows to mappings/node_grounding.tsv

$ just ground-nodes --apply
  files modifiable:    [N]
  nodes grounded:      42
  residual nodes:      588 across 476 distinct (label, type) keys

$ just ground-nodes        # second pass, idempotency check
  nodes grounded:      0

$ just validate-strict
  files scanned:      357
  files with ERROR:   0

Test plan

Idempotency: re-run of ground-nodes produces 0 additional groundings
validate-strict clean
Audit trail in reports/uniprot_match_candidates.tsv
CI re-runs validate-strict on YAML diff
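
The idempotency property in the test plan can be illustrated with a small sketch. The document layout (a `nodes` list with `label`, `node_type`, and `grounding` keys) and the ground_nodes function are assumptions for illustration and may not match the real ground-nodes implementation.

```python
def ground_nodes(doc: dict, mapping: dict) -> int:
    """Attach a grounding CURIE to nodes that lack one; return the number changed."""
    changed = 0
    for node in doc.get("nodes", []):
        if node.get("grounding"):
            continue  # an existing grounding is never overwritten
        key = (str(node.get("label", "")).lower(), node.get("node_type"))
        curie = mapping.get(key)
        if curie:
            node["grounding"] = curie
            changed += 1
    return changed


mapping = {("mreb", "GENE_OR_PROTEIN"): "UniProtKB:A0A1B1UYY2"}
doc = {"nodes": [
    {"label": "MreB", "node_type": "GENE_OR_PROTEIN"},
    {"label": "salt-in strategy", "node_type": "GENE_OR_PROTEIN"},
]}
first = ground_nodes(doc, mapping)   # grounds the MreB node
second = ground_nodes(doc, mapping)  # nothing left to ground
print(first, second)
```

A second pass returning zero is the same check the dry-run after --apply performs at corpus scale.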

🤖 Generated with Claude Code

…des)

Uses kg-microbe's merged-kg_uniprot_nodes.tsv (≈2.2 GB, 19.8M UniprotKB rows) as the source-of-truth label index for TraitMech's ungrounded GENE_OR_PROTEIN nodes. Streams the file once, matches each TraitMech residual GENE_OR_PROTEIN label against UniProt names, and grounds 42 nodes via 27 representative UniProt CURIEs.

New tooling:
- scripts/match_uniprot_to_proteins.py: one-off matcher. Streams the kg-microbe UniProt nodes file (path is currently hardcoded to the local checkout), collects per-label candidates up to a 500-match cap, and picks a single representative per label.

Two-tier representative selector:
1. Tier 1 (best): UniProt name (cleaned of trailing parens) equals the TraitMech label exactly. Among those, pick the alphabetically-first CURIE for determinism.
2. Tier 2: name ends with " <label>" as the final token (e.g. "Polarized growth protein Scy" for label "scy"). Prefer SHORTEST such name (fewer modifier words like "chaperone", "maturation protein", "assembly factor"), alphabetic CURIE as tiebreaker.
3. Otherwise: skip (too ambiguous to ground cleanly).

Also skips a hand-curated set of abstract-category labels that shouldn't be grounded to a specific UniProt entry: gene product, virulence factors, chaperone proteins, thermostable proteins, salinity-adaptation genes, cold-shock proteins, proton export pumps and antiporters, membrane transporters, gliding motility machinery, rod complex.

Mapping additions (mappings/node_grounding.tsv, 47 → 74 rows): 27 UniProt mappings, examples:
  mreb → UniprotKB:A0A1B1UYY2 MreB
  ftsz → UniprotKB:C0LUM8 FtsZ
  diviva → UniprotKB:Q1IYG2 DivIVA
  atp synthase → UniprotKB:A0A415TT77 ATP synthase
  superoxide dismutase → UniprotKB:A0A009QPW9
  methyl-coenzyme m reductase → UniprotKB:A0A099T5Q9
  pqq-dependent methanol dehydrogenase → UniprotKB:A0A4U8YZA6

Per-corpus impact:
  Nodes grounded: 622 → 664 (+42, 50% → 53%)
  Nodes residual: 630 → 588 (−42)
  Distinct keys: 503 → 476 (−27)
  Mappings TSV: 47 → 74 rows

Match-audit trail saved at reports/uniprot_match_candidates.tsv: 93 GENE_OR_PROTEIN labels processed; 41 had at least one UniProt match; 27 picked (after tier-1/2 filter); 14 had matches but failed the strict tier criteria (mostly entries that pulled "chaperone" / "assembly factor" / "domain-containing protein" candidates); 10 hand-skipped as abstract categories; 42 had no UniProt match at all (likely TraitMech-specific paraphrases or non-protein labels).

The 27 representatives are species-specific UniProt entries, not canonical family-level groundings (PRO would be the cleaner target but no local PRO snapshot is available). Documented as "representative UniProt entry" in the mapping notes column.

Verified locally:
- just validate-strict → 0 ERROR rows / 357 files
- just ground-nodes (dry-run after --apply) → 0 additional groundings (idempotent)
- just audit-writers → script picked up; appends_curation_history=no (it's a one-off matcher, not a YAML-writer per se; it writes to mappings/ + reports/)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR adds a one-off matcher that uses kg-microbe’s UniProt node export to select representative UniProt CURIEs for previously-ungrounded GENE_OR_PROTEIN causal-graph nodes, then updates the curated mapping table and applies the resulting groundings across affected trait YAMLs (with an audit report and updated residual report).

Changes:

Add scripts/match_uniprot_to_proteins.py to stream merged-kg_uniprot_nodes.tsv, generate candidate matches, and (optionally) append selected representatives to mappings/node_grounding.tsv.
Expand mappings/node_grounding.tsv with 27 new UniProt representative mappings and commit the audit output reports/uniprot_match_candidates.tsv.
Apply new grounding: CURIEs to GENE_OR_PROTEIN nodes across multiple trait YAML causal graphs; update reports/node_grounding_residual.tsv accordingly.

Reviewed changes

Copilot reviewed 35 out of 35 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
scripts/match_uniprot_to_proteins.py	New streaming matcher to find UniProt candidates and append representative mappings.
reports/uniprot_match_candidates.tsv	New audit TSV listing candidate counts and chosen representatives per label.
reports/node_grounding_residual.tsv	Updated residual report after applying new mappings/groundings.
mappings/node_grounding.tsv	Adds 27 `GENE_OR_PROTEIN` → UniProt representative mappings.
data/traits/physiology/methylotrophic.yaml	Adds UniProt groundings for methanol dehydrogenase nodes; appends curation event.
data/traits/physiology/methanotrophic.yaml	Adds UniProt groundings for methane monooxygenase / methanol dehydrogenase; appends curation event.
data/traits/physiology/hydrogenotrophic.yaml	Adds UniProt grounding for hydrogenase; appends curation event.
data/traits/physiology/chemolithotrophic.yaml	Adds UniProt grounding for ammonia monooxygenase; appends curation event.
data/traits/physiology/carboxydotrophic.yaml	Adds UniProt groundings for CODH / molybdenum hydroxylase; appends curation event.
data/traits/morphology/spirochete_shaped.yaml	Adds UniProt grounding for FlaB; appends curation event.
data/traits/morphology/rod_shaped.yaml	Adds UniProt groundings for MreB and FtsZ; appends curation event.
data/traits/morphology/ovoid_shaped.yaml	Adds UniProt grounding for DivIVA; appends curation event.
data/traits/morphology/oval_shaped.yaml	Adds UniProt grounding for PBP2b; appends curation event.
data/traits/morphology/motility.yaml	Adds UniProt grounding for type IV pilus; appends curation event.
data/traits/morphology/motile.yaml	Adds UniProt grounding for type IV pilus; appends curation event.
data/traits/morphology/helical_shaped.yaml	Adds UniProt grounding for CcmA; appends curation event.
data/traits/morphology/filament_shaped.yaml	Adds UniProt groundings for DivIVA and Scy; appends curation event.
data/traits/morphology/ellipsoidal.yaml	Adds UniProt groundings for PBP2b and DivIVA; appends curation event.
data/traits/morphology/crescent_shaped.yaml	Adds UniProt grounding for crescentin; appends curation event.
data/traits/morphology/coccobacillus_shaped.yaml	Adds UniProt grounding for MreB; appends curation event.
data/traits/morphology/cell_shape.yaml	Adds UniProt groundings for MreB, FtsZ, crescentin; appends curation event.
data/traits/morphology/brown_pigmented.yaml	Adds UniProt grounding for 4-hydroxyphenylpyruvate dioxygenase; appends curation event.
data/traits/morphology/bacillus_shaped.yaml	Adds UniProt groundings for MreB and FtsZ; appends curation event.
data/traits/metabolism/substrate_level_phosphorylation.yaml	Adds UniProt grounding for acetate kinase; appends curation event.
data/traits/metabolism/respiration.yaml	Adds UniProt grounding for ATP synthase; appends curation event.
data/traits/metabolism/methanogenesis.yaml	Adds UniProt grounding for methyl-coenzyme M reductase; appends curation event.
data/traits/metabolism/electron_transfer.yaml	Adds UniProt groundings for redox protein and c-type cytochrome; appends curation event.
data/traits/metabolism/aerobic_respiration.yaml	Adds UniProt groundings for cytochrome c oxidase and ATP synthase; appends curation event.
data/traits/environment/obligately_alkaphilic.yaml	Adds UniProt grounding for Na+/H+ antiporter; appends curation event.
data/traits/environment/neutrophilic.yaml	Adds UniProt grounding for cation/proton antiporter; appends curation event.
data/traits/environment/hyperthermophilic.yaml	Adds UniProt grounding for reverse gyrase; appends curation event.
data/traits/environment/facultatively_alkaphilic.yaml	Adds UniProt grounding for Na+/H+ antiporter; appends curation event.
data/traits/environment/alkaphilic.yaml	Adds UniProt grounding for Na+/H+ antiporter; appends curation event.
data/traits/environment/alkalotolerant.yaml	Adds UniProt grounding for cation/proton antiporter; appends curation event.
data/traits/environment/aerotolerant.yaml	Adds UniProt grounding for superoxide dismutase; appends curation event.

Comments suppressed due to low confidence (5)

scripts/match_uniprot_to_proteins.py:52
KG_UNIPROT_NODES is hardcoded to a local absolute path ("/Users/marcin/..."), which makes this script non-runnable for other contributors and in CI. Please make the UniProt-nodes TSV path configurable (e.g., --kg-uniprot-nodes arg and/or env var), and keep the repo default either unset or relative to the repo (with a clear error if missing).
scripts/match_uniprot_to_proteins.py:98
build_regex() will compile a pattern that matches the empty string if labels is empty (because (?:|...) becomes (?:)), which would make regex.finditer(name) yield zero-length matches at every position and effectively hang the scan. Please guard against an empty labels list (e.g., return early in main() with a clear message, or raise in build_regex()).
scripts/match_uniprot_to_proteins.py:206
The match_count reported to reports/uniprot_match_candidates.tsv is len(cands), but cands is capped at MAX_MATCHES_PER_LABEL, so counts of 500 are ambiguous (could mean 500 total, or “>=500”). Please track total match occurrences separately from the stored candidate list (e.g., total_matches[label]), and write both to the report so ambiguity/skip thresholds can be reasoned about correctly.
scripts/match_uniprot_to_proteins.py:225
--apply appends to mappings/node_grounding.tsv unconditionally, so re-running the script will duplicate rows (and make diffs/noise harder to manage). Please load existing mappings first and skip (or update in-place) any (label, node_type) keys that already exist, and consider sorting/deduplicating output for deterministic reruns.
mappings/node_grounding.tsv:52
The per-row notes string claims the representative was selected via “name-ends-with-label + alphabetic-first CURIE”, but the actual selection logic described/implemented in the matcher includes an exact-match tier and a “shortest suffix-hit name” preference with CURIE only as a tiebreaker. Please update the notes text to accurately reflect the selection criteria (or reference the script/report) so the mapping provenance is not misleading.
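
The empty-labels concern raised for build_regex() can be illustrated with a short sketch. The build_regex body here is an assumption written for this example, not the script's actual implementation; it only shows why an empty alternation is worth guarding against and one way to do so.

```python
import re


def build_regex(labels):
    """Compile a case-insensitive alternation over non-empty labels.

    Raises ValueError instead of compiling a pattern that matches the empty string.
    """
    cleaned = sorted(
        {lbl.strip() for lbl in labels if lbl and lbl.strip()},
        key=len,
        reverse=True,  # longer labels first so they win over their own prefixes
    )
    if not cleaned:
        raise ValueError("no labels to match; refusing to compile an empty alternation")
    return re.compile("(?:" + "|".join(re.escape(lbl) for lbl in cleaned) + ")", re.IGNORECASE)


# An empty group matches at every position, including the end of the string.
empty_hits = list(re.compile("(?:)").finditer("abc"))
print(len(empty_hits))

hits = [m.group(0) for m in build_regex(["mreb", "ftsz"]).finditer("MreB and FtsZ")]
print(hits)
```

With millions of rows, a zero-length match at every character position would dominate the scan, so failing early with a clear message is the cheaper option.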

Five Copilot findings, all CURIE-prefix normalization + one docstring drift:

- CURIE prefix UniprotKB → UniProtKB across all emitted artifacts. The kg-microbe source data uses `UniprotKB:` (lowercase p) but the TraitMech LinkML schema declares `UniProtKB` (uppercase P) in its prefix-map. CURIE expansion is case-sensitive in many consumers, so the mismatch would break downstream IRI resolution. Updated:
  - mappings/node_grounding.tsv: 27 rows
  - reports/uniprot_match_candidates.tsv: 27 rows
  - data/traits/**/*.yaml: 31 files (grounding values)
  scripts/match_uniprot_to_proteins.py now reads using the source-data spelling (`UniprotKB:` startswith filter, since that's what kg-microbe emits) and normalizes the CURIE in-place before storing it in the matches index, so downstream artifacts are written in the schema-canonical form.
- Docstring rewrite to match the implemented selection algorithm. The original docstring described a suffix-first + fallback-to-any-containing + skip-above-100 scheme, but `pick_representative` actually implements a strict tier-1 (exact match) + tier-2 (suffix-token, shortest-name) + skip-otherwise scheme. Docstring now matches the code, and also documents the SKIP_LABELS blocklist and the CURIE-prefix normalization contract above.

Verified: just validate-strict → 0 ERROR rows / 357 files.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
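
The prefix normalization described in this commit amounts to a small string rewrite. The sketch below is an assumption about its shape (the normalize_curie name and constants are hypothetical); it filters on the source spelling and emits the schema spelling.

```python
SOURCE_PREFIX = "UniprotKB:"  # spelling emitted by kg-microbe
SCHEMA_PREFIX = "UniProtKB:"  # spelling declared in the TraitMech prefix map


def normalize_curie(curie: str) -> str:
    """Rewrite the kg-microbe prefix to the schema-canonical one; leave others untouched."""
    if curie.startswith(SOURCE_PREFIX):
        return SCHEMA_PREFIX + curie[len(SOURCE_PREFIX):]
    return curie


print(normalize_curie("UniprotKB:C0LUM8"))
```

Applying the function twice gives the same result as applying it once, so re-running the matcher over already-normalized artifacts would not alter them.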

…0 nodes)

Adds 14 mappings across ENVIRONMENTAL_FACTOR (PATO + ENVO), CHEMICAL (CHEBI), and PATHWAY (GO) to ground the top remaining groundable labels in reports/node_grounding_residual.tsv after #66 (39 base) + #69 (8 METPO) + #70 (27 UniProt) + #72 (biomass retype).

Additions (mappings/node_grounding.tsv, 75 → 89 rows):

ENVIRONMENTAL_FACTOR (5 rows, PATO + ENVO):
  acidic external ph → PATO:0001428 acidic pH (5 nodes)
  alkaline external ph → PATO:0001429 alkaline pH (5 nodes)
  near-neutral external ph → PATO:0001432 neutral pH (4 nodes)
  very high temperature → PATO:0001637 extremely high temperature (2 nodes)
  high-salt environment → ENVO:01000687 saline environment (2 nodes)

CHEMICAL (3 rows, CHEBI):
  thiosulfate → CHEBI:33542 thiosulfate(2-) (2 nodes)
  electron donor → CHEBI:17499 electron donor (2 nodes)
  organic compound → CHEBI:50860 organic molecular entity (6 nodes)

PATHWAY (6 rows, GO):
  membrane electron transport chain → GO:0022900 ETC (3 nodes)
  electron transport chain → GO:0022900 (1 node)
  electron transport system → GO:0022900 (1 node)
  co2-fixation pathway → GO:0015977 carbon fixation (3 nodes)
  autotrophic co2 fixation → GO:0015977 (3 nodes)
  co2 fixation pathway → GO:0015977 (1 node)

The PATHWAY additions collapse 6 distinct corpus-paraphrased labels onto 2 GO terms, demonstrating that the (label, node_type) mapping convention supports multi-label-→-one-CURIE without conflict.

Per-corpus impact:
  Mapping TSV: 75 → 89 rows (+14)
  Nodes grounded: ~704 (53%) → ~744 (59%)

Verified:
- just ground-nodes --apply → 40 newly grounded
- just ground-nodes (idempotency) → 0
- just validate-strict → 0 ERROR rows / 357 files

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
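
The (label, node_type) convention mentioned above can be sketched as a keyed lookup in which several labels may share a CURIE while one key may never map to two. The column names (`label`, `node_type`, `target_curie`) and this load_mapping body are assumptions for illustration, not the repository's loader.

```python
import csv
import io


def load_mapping(lines):
    """Build {(label, node_type): curie} from tab-separated rows."""
    mapping = {}
    for row in csv.DictReader(lines, delimiter="\t"):
        label = (row.get("label") or "").strip().lower()
        node_type = (row.get("node_type") or "").strip()
        curie = (row.get("target_curie") or "").strip()
        if not (label and node_type and curie):
            continue  # incomplete rows are skipped
        key = (label, node_type)
        if key in mapping and mapping[key] != curie:
            raise ValueError(f"conflicting CURIEs for {key}: {mapping[key]} vs {curie}")
        mapping[key] = curie
    return mapping


ROWS = (
    "label\tnode_type\ttarget_curie\n"
    "electron transport chain\tPATHWAY\tGO:0022900\n"
    "electron transport system\tPATHWAY\tGO:0022900\n"
    "co2 fixation pathway\tPATHWAY\tGO:0015977\n"
)
mapping = load_mapping(io.StringIO(ROWS))
print(len(mapping), sorted(set(mapping.values())))
```

Many-to-one is safe because the conflict check is on the key, not on the target: two paraphrased labels pointing at GO:0022900 are distinct keys.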

The grounding pipelines and audit scripts have been load-bearing infrastructure for the last 7 PRs (#61, #66, #67, #69, #70, all of which rewrite causal-graph fields based on these scripts' output). They had zero unit-test coverage. A silent regression in idempotency, header validation, or self-suppression would not be caught by validate-strict (which only checks per-record schema conformance, not pipeline correctness).

Test counts:
  tests/test_ground_causal_predicates.py    9 tests
  tests/test_ground_causal_nodes.py        12 tests
  tests/test_validate_strict.py            11 tests
  tests/test_audit_writers.py              11 tests
  ---
  total new                                43 tests
  total suite                              54 tests (was 11)

Coverage highlights:

ground_causal_predicates.py:
- load_mapping: basic happy path, conflict detection (same label → different CURIEs raises ValueError), incomplete-row skipping, missing-file error.
- ground_edges_in_doc: idempotency (second pass = 0 changes), existing predicate_id never overwritten, residual counting for unmapped labels, empty/missing-predicate edges skipped.

ground_causal_nodes.py:
- All of the predicate suite plus:
  - (label, node_type) keyed lookup: same label, different node_types map to different CURIEs without aliasing.
  - Header validation (Copilot fix from PR #66): TSV with `nodetype` / `targetcurie` typo'd headers raises ValueError naming both missing columns.
  - grounded_keys-on-validation-failure separability (Copilot fix from PR #66): caller can union residual + grounded_keys to recover the corpus-state residual after rolling back an invalid file write.

validate_strict.py:
- classify: parametrized over the 5 categories (unexpected_field, missing_required, enum_mismatch, pattern_mismatch, other); the messages must match the actual jsonschema phrasings the validator emits.
- validate_one: clean record produces 0 errors; unknown field surfaces unexpected_field (the G01 gate behavior); missing required field surfaces missing_required; YAML parse error surfaces as yaml_parse_error category.
- iter_yaml_files: walks directories, filters .txt, picks up nested *.yaml.

audit_writers.py:
- looks_like_yaml_writer: yaml.safe_dump / yaml.dump positive, bare .write_text negative, .write_text near .yaml hint positive, arbitrary code negative.
- audit: full-safeguards writer flagged yes/yes/yes/yes; no-safeguards writer flagged no/no/no; non-writer returns None; wired_into_just yes when justfile mentions the script stem.
- Self-suppression (Copilot fix from PR #64): audit_writers.py itself returns None even though its own source matches yaml.safe_dump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
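
The header-validation behaviour covered by these tests can be sketched as below. The required column names are inferred from the typo'd headers quoted above (`nodetype`, `targetcurie`), and `label` plus the check_header function are assumptions for illustration rather than the repository's code.

```python
import csv
import io

REQUIRED_COLUMNS = ("label", "node_type", "target_curie")  # assumed header names


def check_header(lines):
    """Return the header row, or raise ValueError naming every missing column."""
    header = next(csv.reader(lines, delimiter="\t"), [])
    missing = [col for col in REQUIRED_COLUMNS if col not in header]
    if missing:
        raise ValueError("mapping TSV is missing required column(s): " + ", ".join(missing))
    return header


try:
    check_header(io.StringIO("label\tnodetype\ttargetcurie\n"))
except ValueError as exc:
    message = str(exc)
    print(message)
```

Naming all missing columns in one error, rather than stopping at the first, lets a curator fix a typo'd header in a single edit.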

* Add tests for grounding pipeline + audit scripts (+43 tests)

* Address Copilot review on PR #71

  Add explicit `assert "b.yml" not in names` to test_iter_yaml_files_walks_directory_and_filters. The prior test documented the .yml-skipping behavior in a comment but never asserted it, so a regression that started picking up .yml during directory walks would have slipped through silently.

  Also add test_iter_yaml_files_accepts_yml_file_passed_directly to lock in the asymmetry that the previous test only hinted at: iter_yaml_files() does accept .yml when passed as a file argument (only the rglob('*.yaml') walk is .yaml-only).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Update audit_writers tests to match #75's tightened heuristic

  PR #75 changed `looks_like_yaml_writer` to require that the yaml-serializer call feed directly into write_text on the same line (instead of the looser "any .write_text + any .yaml token" heuristic, which produced false positives for scripts that only READ trait YAMLs).

  The pre-#75 test asserted that `path.write_text(content) # .yaml` counted as a YAML writer. That returned True under the old heuristic and False under the new (correct) one. Replace it with two tests that lock in the new contract:

  test_looks_like_yaml_writer_write_text_of_yaml_dump
    Positive: write_text(yaml.safe_dump(...)) / write_text(yaml.dump(...)) both count.
  test_looks_like_yaml_writer_write_text_of_json_is_false
    Negative: a script that reads *.yaml then writes JSON via write_text is NOT a YAML writer. This is the false-positive case #75 explicitly fixed for scripts/build_embedding_index.py and scripts/render_trait_pages.py.

  Also rename test_looks_like_yaml_writer_write_text_without_yaml_hint_is_false to test_looks_like_yaml_writer_write_text_plain_is_false since the "yaml hint" phrasing was tied to the old heuristic.

  56 tests pass (was 54; +2).

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
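
The tightened write_text rule from the last commit above could be approximated with a per-line regular expression. This sketch covers only the write_text branch as described; the function name write_text_feeds_yaml_dump and the pattern are assumptions, and the real looks_like_yaml_writer may apply additional checks.

```python
import re

# write_text( ... ) whose first argument is a yaml.safe_dump / yaml.dump call.
_WRITE_TEXT_OF_DUMP = re.compile(r"\.write_text\s*\(\s*yaml\.(?:safe_dump|dump)\s*\(")


def write_text_feeds_yaml_dump(source: str) -> bool:
    """True when some line passes a YAML serializer call directly to write_text."""
    return any(_WRITE_TEXT_OF_DUMP.search(line) for line in source.splitlines())


writer = "path.write_text(yaml.safe_dump(doc, sort_keys=False))\n"
reader = (
    "for p in root.rglob('*.yaml'):\n"
    "    docs.append(yaml.safe_load(p.read_text()))\n"
    "out.write_text(json.dumps(docs))\n"
)
print(write_text_feeds_yaml_dump(writer), write_text_feeds_yaml_dump(reader))
```

Requiring the serializer and write_text on the same line trades some recall (a dump stored in a variable first is missed) for fewer false positives on scripts that only read YAML.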

Copilot AI review requested due to automatic review settings May 24, 2026 05:06

Copilot started reviewing on behalf of realmarcin May 24, 2026 05:07 View session

Copilot AI reviewed May 24, 2026

This was referenced May 24, 2026

Tests for grounding pipeline + audit scripts (+43 tests) #71

Merged

Node grounding v2: PATO/CHEBI/GO/ENVO mapping expansion (+15 rows, +40 nodes) #74

Merged

realmarcin merged commit ab3592c into main May 24, 2026
1 check passed

realmarcin deleted the ground-proteins-uniprot branch May 24, 2026 08:12

realmarcin merged 2 commits into main from ground-proteins-uniprot
