Ground causal-graph predicates: 28-mapping cohort, 482 edges grounded #61
Merged
Adds the predicate-grounding pipeline used to populate
`causal_graphs[].edges[].predicate_id` from a curated label→CURIE TSV.
This pass grounds 482 edges across 212 trait YAMLs using 28
high-confidence mappings (METPO, RO, biolink, rdfs).
New machinery:
- `scripts/ground_causal_predicates.py` — walks `data/traits/**/*.yaml`,
fills empty `predicate_id` from `mappings/predicate_grounding.tsv`,
validates closed-mode before write, appends one CurationEvent per
modified file, never overwrites existing groundings.
- `scripts/check_biolink_coverage.py` — cross-checks applied mappings
and residual labels against the Biolink model
(`data/raw/biolink-model.yaml`, vendored to keep CI self-contained).
- `just ground-predicates` and `just check-biolink-coverage` recipes.
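The fill-but-never-overwrite behavior described above can be sketched roughly as follows. This is a minimal illustration, not the script itself: the function name `ground_edges_in_doc` appears later in this thread, but its signature, the `predicate` field name for the free-text label, and the return shape are assumptions made here.

```python
from collections import Counter
from typing import Any


def ground_edges_in_doc(doc: dict[str, Any], mapping: dict[str, str]) -> tuple[int, Counter]:
    """Fill empty predicate_id values in place.

    Returns (number of edges grounded, Counter of labels left ungrounded).
    The `predicate` field name for the free-text label is an assumption.
    """
    grounded = 0
    residual: Counter = Counter()
    for graph in doc.get("causal_graphs") or []:
        for edge in graph.get("edges") or []:
            label = edge.get("predicate")
            if not label:
                continue  # no label to look up
            if edge.get("predicate_id"):
                continue  # never overwrite an existing grounding
            curie = mapping.get(label)
            if curie is None:
                residual[label] += 1  # would feed the residual report
                continue
            edge["predicate_id"] = curie
            grounded += 1
    return grounded, residual


doc = {"causal_graphs": [{"edges": [
    {"predicate": "enables"},
    {"predicate": "regulates", "predicate_id": "RO:0002211"},
    {"predicate": "manifests as"},
]}]}
mapping = {"enables": "RO:0002327"}

first_pass, residual = ground_edges_in_doc(doc, mapping)
second_pass, _ = ground_edges_in_doc(doc, mapping)
print(first_pass, second_pass, dict(residual))  # 1 0 {'manifests as': 1}
```

Running the function twice on the same document shows the idempotency property: the second pass grounds nothing because every mappable edge already carries a `predicate_id`.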
Initial mapping cohort (`mappings/predicate_grounding.tsv`, 28 rows):
- 6 METPO ObjectProperty matches (produces, oxidizes, uses
carbon/electron-donor/electron-acceptor/energy-source).
- 4 RO matches (enables RO:0002327, contributes to RO:0002326,
regulates RO:0002211, depends on RO:0002502).
- 17 biolink slot matches (causes, catalyzes, associated_with,
located_in, participates_in, part_of, occurs_in, interacts_with,
develops_into, consumes, produces, encodes, plus three
located_in aliases — localized in/to, localizes to).
- 1 rdfs:subClassOf for `is a`, `specializes`, `example of`.
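For illustration, a label→CURIE TSV of this kind could be loaded along the following lines. The column names `label` and `target_curie` are assumptions (borrowed from the node-grounding TSV headers mentioned later in this thread), and the loader takes text rather than a path only to keep the sketch self-contained. The conflict check and incomplete-row skipping mirror behaviors that the later test commit attributes to `load_mapping`.

```python
import csv
import io


def load_mapping(tsv_text: str) -> dict[str, str]:
    """Parse label -> CURIE rows; the column names used here are assumed."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    mapping: dict[str, str] = {}
    for row in reader:
        label = (row.get("label") or "").strip()
        curie = (row.get("target_curie") or "").strip()
        if not label or not curie:
            continue  # skip incomplete rows
        previous = mapping.get(label)
        if previous is not None and previous != curie:
            raise ValueError(f"label {label!r} maps to both {previous} and {curie}")
        mapping[label] = curie
    return mapping


# Several labels may share one CURIE (aliases); one label may not have two CURIEs.
SAMPLE = (
    "label\ttarget_curie\n"
    "enables\tRO:0002327\n"
    "regulates\tRO:0002211\n"
    "localized in\tbiolink:located_in\n"
    "localizes to\tbiolink:located_in\n"
)
mapping = load_mapping(SAMPLE)
print(len(mapping))  # 4
```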
Residual: 537 edges across 191 distinct labels remain ungrounded.
See `reports/predicate_grounding_residual.tsv` for the ranked tail;
top residuals (`manifests as`, `supports`, `selects for`, `drives`)
are curator-paraphrased predicates without a clean RO/Biolink home
and are candidates for an upstream METPO predicate proposal.
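A ranked residual tail like the one referenced above can be produced with a simple count-and-sort. The sketch below uses toy data and assumed column names (`label`, `edge_count`); the actual layout of `reports/predicate_grounding_residual.tsv` is not shown in this PR.

```python
from collections import Counter


def rank_residual(labels: list[str]) -> list[tuple[str, int]]:
    """Order ungrounded labels by edge count (descending), then alphabetically."""
    counts = Counter(labels)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def to_tsv(ranked: list[tuple[str, int]]) -> str:
    """Render the ranking as TSV text; the header names are hypothetical."""
    lines = ["label\tedge_count"]
    lines += [f"{label}\t{count}" for label, count in ranked]
    return "\n".join(lines) + "\n"


# Toy input: one entry per ungrounded edge.
labels = ["manifests as"] * 3 + ["drives"] + ["supports"] * 2 + ["selects for"]
ranked = rank_residual(labels)
print(to_tsv(ranked))
```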
Includes audit-pass output from the audit-schema-gaps skill
(`reports/{gap_fix_backlog,schema_gap_audit,instance_validation_*,
pipeline_*}`). Corpus passes `just validate-strict` clean: 0 ERROR
rows across 357 files. The CI gate that locks this in is tracked
as G01 in `reports/gap_fix_backlog.md` and lands in a follow-up PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
This PR adds a predicate-grounding workflow for TraitMech causal graphs, populating `causal_graphs[].edges[].predicate_id` from a curated label→CURIE TSV and wiring the workflow into `just` targets, alongside audit/coverage reports.
Changes:
- Add predicate grounding/coverage tooling (new scripts + `just` recipes) driven by `mappings/predicate_grounding.tsv`.
- Apply predicate groundings across many trait YAMLs by adding `predicate_id` plus new `curation_history` events.
- Add audit/summary/report artifacts under `reports/` to capture writer-audit and validation snapshots.
Reviewed changes
Copilot reviewed 228 out of 229 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `mappings/predicate_grounding.tsv` | Adds the curated label→CURIE mapping cohort used for grounding. |
| `justfile` | Adds `ground-predicates` and `check-biolink-coverage` recipes. |
| `reports/pipeline_writers_audit.tsv` | Snapshot TSV of YAML-writer audit (currently appears incomplete vs new scripts). |
| `reports/pipeline_gap_audit.md` | Narrative audit of YAML-writing scripts and pipeline gaps (needs updates for new writer). |
| `reports/instance_validation_summary.md` | Summary of strict instance validation run. |
| `reports/instance_validation_failures.tsv` | Empty/placeholder failures TSV (header only). |
| `reports/gap_fix_backlog.tsv` | Backlog of pipeline/schema follow-ups. |
| `data/traits/**.yaml` | Adds `predicate_id` groundings and `GROUND_CAUSAL_PREDICATES` curation events across many traits. |
Comment on lines +1 to +5
path writes_yaml appends_curation_history has_write_safeguard validates_before_write wired_into_just
scripts/audit_writers.py yes yes yes yes yes
scripts/build_embedding_index.py yes no no no yes
scripts/render_trait_pages.py yes no yes no yes
scripts/seed_from_metpo.py yes yes yes no yes
Comment on lines +18 to +23
### `scripts/seed_from_metpo.py`: the only real trait-YAML writer
This is the entry point for new trait records. It uses the safer **opt-in** convention (`--apply` defaults off; bare invocation is dry-run) and appends `CurationEvent` entries when it writes, both of which are correct. It does **not** validate output against the schema before writing.
**Gap (P1):** add an in-process strict validation pass before each write, using the same `linkml.validator.Validator(closed=True)` configured in `scripts/validate_strict.py`. If a record fails, log + skip rather than abort the whole run, so one bad record doesn't poison a 357-file seed. Effort: M (refactor the writer loop to construct a per-process Validator and call it before `path.write_text`). This is the highest-leverage fix on the pipeline axis because the seeder is the *only* path producing new trait records.
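The log-and-skip pattern proposed in that gap note might look roughly like this. It is a sketch under stated assumptions: `write_validated` and `require_id` are hypothetical names, and a plain callable stands in for the LinkML validator so the example stays self-contained and does not guess at the validator's API.

```python
import logging
from pathlib import Path
from typing import Callable, Iterable

log = logging.getLogger("seed_sketch")


def write_validated(
    records: Iterable[tuple[Path, dict, str]],
    validate: Callable[[dict], list[str]],
    apply: bool = False,
) -> tuple[int, int]:
    """Validate each record before writing; skip failures instead of aborting.

    `validate` is a stand-in for a strict schema validator and returns a list
    of error messages (empty when the record is clean). Returns
    (accepted, skipped). With apply=False this is a dry run.
    """
    accepted = skipped = 0
    for path, record, text in records:
        errors = validate(record)
        if errors:
            log.warning("skipping %s: %s", path, "; ".join(errors))
            skipped += 1
            continue  # one bad record does not stop the run
        if apply:
            path.write_text(text, encoding="utf-8")
        accepted += 1
    return accepted, skipped


def require_id(record: dict) -> list[str]:
    """Toy validator: only checks that an `id` field is present."""
    return [] if "id" in record else ["missing required field: id"]
```

Keeping the validator as an injected callable means the same loop can be exercised in tests with a toy check and run in production with the strict validator.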
Comment on lines 135 to 137
- timestamp: '2026-05-09T00:00:00-07:00'
  curator: Codex
  action: CURATED_CAUSAL_GRAPH
Comment on lines 118 to 121
  llm_assisted: false
- timestamp: '2026-05-09T00:00:00-07:00'
  curator: Codex
  action: CURATED_CAUSAL_GRAPH
Comment on lines +9 to +14
| path | writes_yaml | curation_history | safeguard | validates_first | wired_into_just |
|---|---|---|---|---|---|
| `scripts/audit_writers.py` | yes | yes | yes | yes | yes |
| `scripts/build_embedding_index.py` | yes | no | no | no | yes |
| `scripts/render_trait_pages.py` | yes | no | yes | no | yes |
| `scripts/seed_from_metpo.py` | yes | yes | yes | no | yes |
realmarcin added a commit that referenced this pull request May 23, 2026
…#63)
Lifts the TraitMech causal-graph subsystem into METPO so downstream consumers can filter trait records by mechanism axis using METPO-native queries instead of TraitMech-internal LinkML enum codes. Cohort is committed here in-repo only; not filed upstream in this PR.
Cohort (proposals/metpo_traitmech_v1/):
- 3 top-level domain classes under METPO:1000000:
  - METPO:1007400 trait causal graph
  - METPO:1007401 trait causal node
  - METPO:1007402 trait causal edge
- 1 enum-parent under METPO:1007401:
  - METPO:1007410 trait causal node type
- 10 leaf classes under METPO:1007410, one per CausalNodeTypeEnum permissible value (METPO:1007411–1007420), e.g. causal-graph trait node, causal-graph pathway node, causal-graph environmental factor node (xref ENVO:01000254), causal-graph experimental factor node (xref EFO:0000001), etc.
Out of scope (documented in proposal.md):
- Scope A: no traitmech:NNNNNN synthetic IDs exist in corpus today.
- Scope B (causal-graph predicates): deferred until the predicate-grounding migration (#61) reduces the 191-label residual.
- 5 other LinkML enums (TraitCategoryEnum, TermKindEnum, SynonymTypeEnum, PriorityEnum, MappingStatusEnum): workflow internals, not ontology axes.
Tooling:
- scripts/verify_metpo_proposal.py: column-count, header, parent integrity, subset tag, scope-A/C coverage. Wired as `just verify-proposal <cohort>`.
- scripts/robot_validate_proposal.py: `robot template → merge with metpo.owl → reason ELK`. Wired as `just robot-validate-proposal <cohort>`. Discovers robot via $ROBOT, $ROBOT_BIN, PATH, then ../kg-microbe/data/raw/robot.
- .gitignore: reports/robot/ (regenerable, dominated by re-serialized metpo.owl at ~500 KB per file).
Verification (run locally on this branch):
- `just verify-proposal metpo_traitmech_v1` → PASS, 0 failures.
- `just robot-validate-proposal metpo_traitmech_v1` → PASS, no UNSAT, ELK delta +6 axioms (the inferred subclass closure).
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
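The robot discovery order named in that commit ($ROBOT, $ROBOT_BIN, PATH, then a sibling checkout) could be implemented along these lines. This is a sketch only: `find_robot` is a hypothetical name, and treating "found" as "the path is an existing file" is an assumption about how the real script decides.

```python
import os
import shutil
from pathlib import Path
from typing import Mapping, Optional

# Fallback location quoted in the commit message.
FALLBACK = Path("../kg-microbe/data/raw/robot")


def find_robot(env: Optional[Mapping[str, str]] = None,
               fallback: Path = FALLBACK) -> Optional[str]:
    """Return the first robot launcher found, or None.

    Search order: $ROBOT, $ROBOT_BIN, `robot` on PATH, then the fallback path.
    """
    env = os.environ if env is None else env
    for var in ("ROBOT", "ROBOT_BIN"):
        candidate = env.get(var)
        if candidate and Path(candidate).is_file():
            return candidate
    on_path = shutil.which("robot", path=env.get("PATH"))
    if on_path:
        return on_path
    if fallback.is_file():
        return str(fallback)
    return None
```

Passing `env` explicitly keeps the lookup testable without mutating the process environment.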
realmarcin added a commit that referenced this pull request May 24, 2026
…al-only) (#65)
* Add METPO predicate proposal cohort metpo_traitmech_v2 (8 predicates)
Scope-B proposal: 8 new METPO object properties (METPO:2007400-METPO:2007407) covering 128 of the 537 residual causal-edge predicates left by the v1 grounding pass (#61).
Cohort (proposals/metpo_traitmech_v2/):
- manifests as (METPO:2007400, 52 edges): state → observable trait
- selects for (METPO:2007401, 20 edges): env condition → adapted trait
- feeds electrons into (METPO:2007402, 12 edges): donor → transport chain
- transfers electrons to (METPO:2007403, 6 edges): single-step redox
- fixed by (METPO:2007404, 9 edges): substrate → fixation pathway
- oxidized to (METPO:2007405, 8 edges): substrate → oxidized product
- challenges (METPO:2007406, 9 edges): stressor → tolerance trait
- mitigates (METPO:2007407, 12 edges): defense → stressor (paired)
Each candidate was checked against RO/Biolink first; rejections documented per-row in proposal.md (e.g. biolink:manifestation_of has range `disease`, too narrow; biolink:treats is clinical; etc.).
Subset tag: metpo_traitmech_2026_06. Domain = range = METPO:1007401 (trait causal node, minted in v1). ROBOT/ELK validates clean: delta +6 axioms, no UNSAT (v1's METPO:1007401 resolves to unnamed external IRI without v1 merged, which is fine: no error, just preserved domain/range constraint).
Per-corpus impact (after re-running ground-predicates --apply with the expanded 38-row mappings TSV):
- Edges grounded: 482 → 618 (+136)
- Edges residual: 537 → 401 (−136)
- Distinct labels: 191 → 181 (−10)
Also adds 2 RO mappings (controls, directs → RO:0002211 regulates) that match the RO definition of regulation but were not in the v1 mapping cohort.
NOT filed upstream in this PR (per user instruction). Cohort lives in this repo only; upstream filing path documented in proposal.md. Mapping TSV notes flag the 8 proposed CURIEs as "proposed upstream in proposals/metpo_traitmech_v2" so reviewers know they're pending METPO adoption.
Verified locally:
- just verify-proposal metpo_traitmech_v2 → PASS (0 failures)
- just robot-validate-proposal metpo_traitmech_v2 → PASS (ELK +6)
- just validate-strict → 0 ERROR rows / 357 files
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Address Copilot review on PR #65
Three fixes per Copilot inline comments:
- TSV row 3 (manifests as): swap definition_source from `copiotrophic.yaml` (which has no `manifests as` edge, only `selects for`, `supports`, etc.) to `nutrient_adaptation.yaml#nutrient_adaptation_life_history_axis`, which is the canonical graph where `manifests as` first appears.
- TSV row 9 (challenges): swap definition_source from `acidophilic.yaml` (no `challenges` edge; uses `selects for`) to `acidotolerant.yaml#acidotolerant_acid_stress_homeostasis`, which carries the `acidic_exposure challenges ...` edge directly.
- proposal.md context paragraph: correct grounded counts from the incorrect "648 of 1185" to the actual "618 of 1019", matching the Corpus Impact table. Verified via `uv run python <<<` count over data/traits/**/causal_graphs[].edges[] (total=1019, grounded=618, residual=401).
- proposal.md paired-predicate heading: rephrase "the only paired pair" → "the only paired predicate set" (removes the redundancy).
Verified: `just verify-proposal metpo_traitmech_v2` → PASS, 0 failures.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
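The corpus-impact figures quoted in that commit are mutually consistent, which a few lines of arithmetic confirm. All inputs below are copied from the commit message; attributing the 8-edge remainder to the two added RO mappings (`controls`, `directs`) is an inference made here, not something the commit states.

```python
before = {"grounded": 482, "residual": 537}
after = {"grounded": 618, "residual": 401}

# Edge counts for the 8 proposed METPO predicates, in the order listed.
proposed_edges = [52, 20, 12, 6, 9, 8, 9, 12]

total_before = sum(before.values())
total_after = sum(after.values())
delta = after["grounded"] - before["grounded"]
from_proposals = sum(proposed_edges)
remainder = delta - from_proposals  # inferred: edges from the 2 extra RO mappings

print(total_before, total_after)         # 1019 1019
print(delta, from_proposals, remainder)  # 136 128 8
```

The edge total stays at 1019 across both passes, so the corrected "618 of 1019" figure matches the table.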
realmarcin added a commit that referenced this pull request May 24, 2026
Adds the node-grounding pipeline, mirror of the predicate-grounding work in #61. Fills empty `causal_graphs[].nodes[].grounding` from a curated (label, node_type) → CURIE TSV. This pass grounds 77 nodes across trait YAMLs (38% → 45% of causal-graph nodes are now grounded; 1252 nodes total).
New machinery:
- scripts/ground_causal_nodes.py: walks data/traits/**/*.yaml, fills empty `grounding` from mappings/node_grounding.tsv, validates closed-mode before write, appends one CurationEvent per modified file, never overwrites existing groundings. Keyed on (label, node_type) since the same free-text label can refer to different ontology classes depending on node type (e.g. "terminal electron acceptor" as CHEMICAL vs MOLECULAR_FUNCTION).
- just ground-nodes recipe.
Hardening per the original Copilot review:
- load_mapping validates required headers (label, node_type, target_curie) up-front; raises ValueError with a helpful message if any are missing (instead of silently producing an empty mapping when DictReader returns None for missing keys).
- ground_nodes_in_doc returns a `grounded_keys` counter alongside the residual counter, so when a file fails validation the just-grounded nodes are re-added to the residual TSV. Without this, those nodes were invisible (removed from per-CURIE counts but not added to residual, even though the file is rejected and the nodes remain ungrounded on disk).
- reports/pipeline_writers_audit.tsv refreshed to include the new writer (4 → 5 rows).
Initial mapping cohort (mappings/node_grounding.tsv, 39 rows):
- 14 CHEBI mappings for canonical metabolic chemicals (O2, CO2, CO, H2, CH4, methanol, NH3, NO3-, SO4(2-), S(2-), H+, Fe(2+), organic carbon, compatible solutes).
- 10 GO-BP mappings for canonical processes (peptidoglycan synthesis, methanogenesis, aerobic/anaerobic respiration, photosynthesis, N2 fixation, fermentation, C fixation, oxidative phosphorylation, cellular pH regulation, response to osmotic stress).
- 4 GO-CC mappings for canonical compartments (periplasmic space, outer membrane, plasma membrane, cytoplasm).
- 2 GO-MF mappings (kinase activity, oxidoreductase activity).
- 4 GO-BP/pathway mappings (ETC, photosynthetic ETC, Calvin-Benson, Wood-Ljungdahl).
- 2 PATO + 3 ENVO env-factor mappings (light intensity, decreased temperature, anaerobic + anoxic environment).
Residual: 688 nodes across 511 distinct (label, type) keys remain ungrounded. See reports/node_grounding_residual.tsv. The largest clusters are BIOLOGICAL_PROCESS abstractions (proton motive force, biomass, membrane fluidity) and GENE_OR_PROTEIN families (MreB, CRT enzymes, RuBisCO, FtsZ), which are candidates for either a METPO node-class proposal cohort or upstream UniProt/PRO grounding.
audit-writers TSV grows from 4 → 5 rows; the new script reports appends_curation_history + has_write_safeguard + validates_before_write all `yes` (matches the ground_causal_predicates contract from #61).
Verified locally:
- just ground-nodes (dry-run after --apply) → 0 additional groundings (idempotent)
- header-missing test: TSV with bad headers raises ValueError naming the missing columns
- just validate-strict → 0 ERROR rows / 357 files
- just audit-writers → 5 writers, all wired into justfile
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
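The header check and the (label, node_type) keying described in that commit can be sketched as below. The three required column names come from the commit message; the function name `load_node_mapping`, the text-based input, and the `EX:` CURIEs are placeholders invented for illustration and do not refer to real ontology terms.

```python
import csv
import io

REQUIRED = ("label", "node_type", "target_curie")


def load_node_mapping(tsv_text: str) -> dict[tuple[str, str], str]:
    """Parse a node-grounding TSV keyed on (label, node_type).

    Raises ValueError naming every missing required column, rather than
    silently returning an empty mapping.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    headers = reader.fieldnames or []
    missing = [name for name in REQUIRED if name not in headers]
    if missing:
        raise ValueError(f"mapping TSV is missing required column(s): {', '.join(missing)}")
    mapping: dict[tuple[str, str], str] = {}
    for row in reader:
        label = (row["label"] or "").strip()
        node_type = (row["node_type"] or "").strip()
        curie = (row["target_curie"] or "").strip()
        if label and node_type and curie:
            mapping[(label, node_type)] = curie
    return mapping


# Same label, different node types, different (placeholder) CURIEs.
SAMPLE = (
    "label\tnode_type\ttarget_curie\n"
    "terminal electron acceptor\tCHEMICAL\tEX:0000001\n"
    "terminal electron acceptor\tMOLECULAR_FUNCTION\tEX:0000002\n"
)
node_mapping = load_node_mapping(SAMPLE)
print(len(node_mapping))  # 2
```

Keying on the pair keeps the two "terminal electron acceptor" rows from aliasing each other, which a label-only dictionary would not.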
realmarcin added a commit that referenced this pull request May 24, 2026
* Add tests for grounding pipeline + audit scripts (+43 tests)
The grounding pipelines and audit scripts have been load-bearing infrastructure for the last 7 PRs (#61, #66, #67, #69, #70, all of which rewrite causal-graph fields based on these scripts' output). They had zero unit-test coverage. A silent regression in idempotency, header validation, or self-suppression would not be caught by validate-strict (which only checks per-record schema conformance, not pipeline correctness).
Test counts:
- tests/test_ground_causal_predicates.py: 9 tests
- tests/test_ground_causal_nodes.py: 12 tests
- tests/test_validate_strict.py: 11 tests
- tests/test_audit_writers.py: 11 tests
- total new: 43 tests
- total suite: 54 tests (was 11)
Coverage highlights:
ground_causal_predicates.py:
- load_mapping: basic happy path, conflict detection (same label → different CURIEs raises ValueError), incomplete-row skipping, missing-file error.
- ground_edges_in_doc: idempotency (second pass = 0 changes), existing predicate_id never overwritten, residual counting for unmapped labels, empty/missing-predicate edges skipped.
ground_causal_nodes.py:
- All of the predicate suite plus:
  - (label, node_type) keyed lookup: same label, different node_types map to different CURIEs without aliasing.
  - Header validation (Copilot fix from PR #66): TSV with `nodetype` / `targetcurie` typo'd headers raises ValueError naming both missing columns.
  - grounded_keys-on-validation-failure separability (Copilot fix from PR #66): caller can union residual + grounded_keys to recover the corpus-state residual after rolling back an invalid file write.
validate_strict.py:
- classify: parametrized over the 5 categories (unexpected_field, missing_required, enum_mismatch, pattern_mismatch, other); the messages must match the actual jsonschema phrasings the validator emits.
- validate_one: clean record produces 0 errors; unknown field surfaces unexpected_field (the G01 gate behavior); missing required field surfaces missing_required; YAML parse error surfaces as yaml_parse_error category.
- iter_yaml_files: walks directories, filters .txt, picks up nested *.yaml.
audit_writers.py:
- looks_like_yaml_writer: yaml.safe_dump / yaml.dump positive, bare .write_text negative, .write_text near .yaml hint positive, arbitrary code negative.
- audit: full-safeguards writer flagged yes/yes/yes/yes; no-safeguards writer flagged no/no/no; non-writer returns None; wired_into_just yes when justfile mentions the script stem.
- Self-suppression (Copilot fix from PR #64): audit_writers.py itself returns None even though its own source matches yaml.safe_dump.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Address Copilot review on PR #71
Add explicit `assert "b.yml" not in names` to test_iter_yaml_files_walks_directory_and_filters. The prior test documented the .yml-skipping behavior in a comment but never asserted it, so a regression that started picking up .yml during directory walks would have slipped through silently.
Also add test_iter_yaml_files_accepts_yml_file_passed_directly to lock in the asymmetry that the previous test only hinted at: iter_yaml_files() does accept .yml when passed as a file argument (only the rglob('*.yaml') walk is .yaml-only).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Update audit_writers tests to match #75's tightened heuristic
PR #75 changed `looks_like_yaml_writer` to require that the yaml-serializer call feed directly into write_text on the same line (instead of the looser "any .write_text + any .yaml token" heuristic, which produced false positives for scripts that only READ trait YAMLs).
The pre-#75 test asserted that `path.write_text(content) # .yaml` counted as a YAML writer. That returned True under the old heuristic and False under the new (correct) one. Replace it with two tests that lock in the new contract:
- test_looks_like_yaml_writer_write_text_of_yaml_dump. Positive: write_text(yaml.safe_dump(...)) / write_text(yaml.dump(...)) both count.
- test_looks_like_yaml_writer_write_text_of_json_is_false. Negative: a script that reads *.yaml then writes JSON via write_text is NOT a YAML writer. This is the false-positive case #75 explicitly fixed for scripts/build_embedding_index.py and scripts/render_trait_pages.py.
Also rename test_looks_like_yaml_writer_write_text_without_yaml_hint_is_false to test_looks_like_yaml_writer_write_text_plain_is_false since the "yaml hint" phrasing was tied to the old heuristic.
56 tests pass (was 54; +2).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
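The tightened same-line rule described for `looks_like_yaml_writer` can be approximated with a per-line regular expression. This sketch covers only the `write_text(yaml.dump(...))` / `write_text(yaml.safe_dump(...))` case named in the commit; whether the real heuristic also accepts other write forms is not stated here, so the pattern below is an assumption rather than the script's actual logic.

```python
import re

# A yaml serializer call passed directly as the first argument of write_text.
_YAML_INTO_WRITE_TEXT = re.compile(r"\.write_text\(\s*yaml\.(?:safe_)?dump\(")


def looks_like_yaml_writer(source: str) -> bool:
    """True if any single line feeds yaml.dump/safe_dump straight into write_text."""
    return any(_YAML_INTO_WRITE_TEXT.search(line) for line in source.splitlines())


writer = "path.write_text(yaml.safe_dump(doc, sort_keys=False))\n"
old_false_positive = "path.write_text(content)  # .yaml\n"
json_writer = (
    "doc = yaml.safe_load(Path('trait.yaml').read_text())\n"
    "Path('index.json').write_text(json.dumps(doc))\n"
)

print(looks_like_yaml_writer(writer))              # True
print(looks_like_yaml_writer(old_false_positive))  # False
print(looks_like_yaml_writer(json_writer))         # False
```

The third case is the read-YAML-then-write-JSON shape that the looser "any .write_text + any .yaml token" heuristic misclassified.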
Summary
Adds the predicate-grounding pipeline that fills `causal_graphs[].edges[].predicate_id` from a curated label→CURIE TSV. This pass grounds 482 edges across 212 trait YAMLs using 28 high-confidence mappings spanning METPO, RO, biolink, and rdfs.
Mapping cohort (`mappings/predicate_grounding.tsv`, 28 rows)
- 6 METPO ObjectProperty matches (`produces`, `oxidizes`, `uses carbon/electron-donor/electron-acceptor/energy-source`)
- 4 RO matches (`enables` RO:0002327, `contributes to` RO:0002326, `regulates` RO:0002211, `depends on` RO:0002502)
- 17 biolink slot matches (`causes`, `catalyzes`, `associated_with`, `located_in` + 3 aliases, `participates_in`, `part_of`, `occurs_in`, `interacts_with`, `develops_into`, `consumes`, `produces`, `encodes`, etc.)
- 1 `rdfs:subClassOf` covering `is a`, `specializes`, `example of`
New machinery
- `scripts/ground_causal_predicates.py`: idempotent; never overwrites existing groundings, validates closed-mode before write, appends one CurationEvent per modified file.
- `scripts/check_biolink_coverage.py`: cross-checks applied mappings + residual labels against `data/raw/biolink-model.yaml` (vendored, 499 KB).
- `just ground-predicates` and `just check-biolink-coverage` recipes.
Residual
537 edges across 191 distinct labels remain ungrounded. See `reports/predicate_grounding_residual.tsv` for the ranked tail; top residuals (`manifests as` 52, `supports` 26, `selects for` 20, `drives` 19) are curator-paraphrased predicates without a clean RO/Biolink home, and are candidates for an upstream METPO predicate proposal.
Audit snapshot
Includes audit-pass output from the `audit-schema-gaps` skill (`reports/{gap_fix_backlog,schema_gap_audit,instance_validation_*,pipeline_*}`). Corpus passes `just validate-strict` clean (0 ERROR rows / 357 files). The CI gate that locks this in is tracked as G01 in `reports/gap_fix_backlog.md` and lands in a follow-up PR.
Test plan
- `just validate-strict`: 0 ERROR rows / 357 files
- `just ground-predicates` (dry-run after `--apply`): reports 0 additional groundings (idempotent)
- `just check-biolink-coverage`: 28 applied mappings indexed, residual cross-checked
- `validate-strict` on the diff (gate ships in PR2)
🤖 Generated with Claude Code