Ground GENE_OR_PROTEIN nodes via kg-microbe UniProt transform (+42 nodes) #70
Merged
Uses kg-microbe's merged-kg_uniprot_nodes.tsv (≈2.2 GB, 19.8M
UniprotKB rows) as the source-of-truth label index for TraitMech's
ungrounded GENE_OR_PROTEIN nodes. Streams the file once, matches
each TraitMech residual GENE_OR_PROTEIN label against UniProt
names, and grounds 42 nodes via 27 representative UniProt CURIEs.
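The single-pass scan described above might look like the sketch below. It assumes the nodes file is a KGX-style TSV with `id` and `name` columns; the function name `scan_nodes` and the layout of the returned index are illustrative and not taken from the script.

```python
import csv
import re
from collections import defaultdict


def scan_nodes(handle, labels):
    """Stream a nodes TSV once and index (curie, name) hits per label.

    Assumes `id` and `name` columns (KGX-style); labels are lower-case.
    """
    if not labels:
        return {}
    # Longest labels first so the alternation prefers the most specific hit.
    ordered = sorted(labels, key=len, reverse=True)
    regex = re.compile("|".join(re.escape(lab) for lab in ordered), re.IGNORECASE)

    matches = defaultdict(list)
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        curie = row["id"]
        name = row.get("name") or ""
        # Only UniProt rows are of interest; kg-microbe spells the prefix this way.
        if not curie.startswith("UniprotKB:"):
            continue
        for hit in regex.finditer(name):
            matches[hit.group(0).lower()].append((curie, name))
    return matches
```

Because the reader yields one row at a time, memory use is bounded by the size of the match index rather than by the size of the file.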
New tooling:
- scripts/match_uniprot_to_proteins.py — one-off matcher. Streams
the kg-microbe UniProt nodes file (path is currently hardcoded
to the local checkout), collects per-label candidates up to a
500-match cap, and picks a single representative per label.
Two-tier representative selector:
1. Tier 1 (best) — UniProt name (cleaned of trailing parens)
equals the TraitMech label exactly. Among those, pick the
alphabetically-first CURIE for determinism.
2. Tier 2 — name ends with " <label>" as the final token
(e.g. "Polarized growth protein Scy" for label "scy"). Prefer
SHORTEST such name (fewer modifier words like "chaperone",
"maturation protein", "assembly factor"), alphabetic CURIE
as tiebreaker.
3. Otherwise — skip (too ambiguous to ground cleanly).
Also skips a hand-curated set of abstract-category labels that
shouldn't be grounded to a specific UniProt entry: gene product,
virulence factors, chaperone proteins, thermostable proteins,
salinity-adaptation genes, cold-shock proteins, proton export
pumps and antiporters, membrane transporters, gliding motility
machinery, rod complex.
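The selection rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script's code: `pick_representative` is named in the PR, but its signature, the `clean_name` helper, the `(curie, name)` candidate layout, case-insensitive comparison, and measuring "shortest" by character count are all assumptions. Only a subset of the skip list is shown.

```python
import re

# Abstract-category labels that are never grounded (subset of the PR's list).
SKIP_LABELS = {"gene product", "virulence factors", "rod complex"}


def clean_name(name):
    """Strip trailing parenthesised groups, e.g. 'MreB (Fragment)' -> 'MreB'."""
    return re.sub(r"(\s*\([^()]*\))+\s*$", "", name).strip()


def pick_representative(label, candidates):
    """Return one (curie, cleaned_name) pair for `label`, or None to skip.

    `candidates` is an iterable of (curie, name) pairs (hypothetical layout).
    """
    if label in SKIP_LABELS:
        return None
    wanted = label.lower()
    cleaned = [(curie, clean_name(name)) for curie, name in candidates]

    # Tier 1: cleaned name equals the label; alphabetically-first CURIE wins.
    exact = [c for c in cleaned if c[1].lower() == wanted]
    if exact:
        return min(exact, key=lambda c: c[0])

    # Tier 2: label is the final whitespace token; shortest name, then CURIE.
    suffix = [c for c in cleaned if c[1].lower().endswith(" " + wanted)]
    if suffix:
        return min(suffix, key=lambda c: (len(c[1]), c[0]))

    # Otherwise: too ambiguous to ground cleanly.
    return None
```

Sorting on a `(length, CURIE)` key keeps the choice deterministic across runs, which is what makes the later idempotency check meaningful.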
Mapping additions (mappings/node_grounding.tsv, 47 → 74 rows):
27 UniProt mappings — examples:
mreb → UniprotKB:A0A1B1UYY2 MreB
ftsz → UniprotKB:C0LUM8 FtsZ
diviva → UniprotKB:Q1IYG2 DivIVA
atp synthase → UniprotKB:A0A415TT77 ATP synthase
superoxide dismutase → UniprotKB:A0A009QPW9
methyl-coenzyme m reductase → UniprotKB:A0A099T5Q9
pqq-dependent methanol dehydrogenase → UniprotKB:A0A4U8YZA6
Per-corpus impact:
Nodes grounded: 622 → 664 (+42, 50% → 53%)
Nodes residual: 630 → 588 (−42)
Distinct keys: 503 → 476 (−27)
Mappings TSV: 47 → 74 rows
Match-audit trail saved at reports/uniprot_match_candidates.tsv —
93 GENE_OR_PROTEIN labels processed; 41 had at least one UniProt
match; 27 picked (after tier-1/2 filter); 14 had matches but failed
the strict tier criteria (mostly entries that pulled "chaperone" /
"assembly factor" / "domain-containing protein" candidates); 10
hand-skipped as abstract categories; 42 had no UniProt match at
all (likely TraitMech-specific paraphrases or non-protein labels).
The 27 representatives are species-specific UniProt entries — not
canonical family-level groundings (PRO would be the cleaner
target but no local PRO snapshot is available). Documented as
"representative UniProt entry" in the mapping notes column.
Verified locally:
- just validate-strict → 0 ERROR rows / 357 files
- just ground-nodes (dry-run after --apply) → 0 additional groundings (idempotent)
- just audit-writers → script picked up; appends_curation_history=no (it's a one-off matcher, not a YAML-writer per se — it writes to mappings/ + reports/)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
This PR adds a one-off matcher that uses kg-microbe’s UniProt node export to select representative UniProt CURIEs for previously-ungrounded GENE_OR_PROTEIN causal-graph nodes, then updates the curated mapping table and applies the resulting groundings across affected trait YAMLs (with an audit report and updated residual report).
Changes:
- Add `scripts/match_uniprot_to_proteins.py` to stream `merged-kg_uniprot_nodes.tsv`, generate candidate matches, and (optionally) append selected representatives to `mappings/node_grounding.tsv`.
- Expand `mappings/node_grounding.tsv` with 27 new UniProt representative mappings and commit the audit output `reports/uniprot_match_candidates.tsv`.
- Apply new `grounding:` CURIEs to `GENE_OR_PROTEIN` nodes across multiple trait YAML causal graphs; update `reports/node_grounding_residual.tsv` accordingly.
Reviewed changes
Copilot reviewed 35 out of 35 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| scripts/match_uniprot_to_proteins.py | New streaming matcher to find UniProt candidates and append representative mappings. |
| reports/uniprot_match_candidates.tsv | New audit TSV listing candidate counts and chosen representatives per label. |
| reports/node_grounding_residual.tsv | Updated residual report after applying new mappings/groundings. |
| mappings/node_grounding.tsv | Adds 27 GENE_OR_PROTEIN → UniProt representative mappings. |
| data/traits/physiology/methylotrophic.yaml | Adds UniProt groundings for methanol dehydrogenase nodes; appends curation event. |
| data/traits/physiology/methanotrophic.yaml | Adds UniProt groundings for methane monooxygenase / methanol dehydrogenase; appends curation event. |
| data/traits/physiology/hydrogenotrophic.yaml | Adds UniProt grounding for hydrogenase; appends curation event. |
| data/traits/physiology/chemolithotrophic.yaml | Adds UniProt grounding for ammonia monooxygenase; appends curation event. |
| data/traits/physiology/carboxydotrophic.yaml | Adds UniProt groundings for CODH / molybdenum hydroxylase; appends curation event. |
| data/traits/morphology/spirochete_shaped.yaml | Adds UniProt grounding for FlaB; appends curation event. |
| data/traits/morphology/rod_shaped.yaml | Adds UniProt groundings for MreB and FtsZ; appends curation event. |
| data/traits/morphology/ovoid_shaped.yaml | Adds UniProt grounding for DivIVA; appends curation event. |
| data/traits/morphology/oval_shaped.yaml | Adds UniProt grounding for PBP2b; appends curation event. |
| data/traits/morphology/motility.yaml | Adds UniProt grounding for type IV pilus; appends curation event. |
| data/traits/morphology/motile.yaml | Adds UniProt grounding for type IV pilus; appends curation event. |
| data/traits/morphology/helical_shaped.yaml | Adds UniProt grounding for CcmA; appends curation event. |
| data/traits/morphology/filament_shaped.yaml | Adds UniProt groundings for DivIVA and Scy; appends curation event. |
| data/traits/morphology/ellipsoidal.yaml | Adds UniProt groundings for PBP2b and DivIVA; appends curation event. |
| data/traits/morphology/crescent_shaped.yaml | Adds UniProt grounding for crescentin; appends curation event. |
| data/traits/morphology/coccobacillus_shaped.yaml | Adds UniProt grounding for MreB; appends curation event. |
| data/traits/morphology/cell_shape.yaml | Adds UniProt groundings for MreB, FtsZ, crescentin; appends curation event. |
| data/traits/morphology/brown_pigmented.yaml | Adds UniProt grounding for 4-hydroxyphenylpyruvate dioxygenase; appends curation event. |
| data/traits/morphology/bacillus_shaped.yaml | Adds UniProt groundings for MreB and FtsZ; appends curation event. |
| data/traits/metabolism/substrate_level_phosphorylation.yaml | Adds UniProt grounding for acetate kinase; appends curation event. |
| data/traits/metabolism/respiration.yaml | Adds UniProt grounding for ATP synthase; appends curation event. |
| data/traits/metabolism/methanogenesis.yaml | Adds UniProt grounding for methyl-coenzyme M reductase; appends curation event. |
| data/traits/metabolism/electron_transfer.yaml | Adds UniProt groundings for redox protein and c-type cytochrome; appends curation event. |
| data/traits/metabolism/aerobic_respiration.yaml | Adds UniProt groundings for cytochrome c oxidase and ATP synthase; appends curation event. |
| data/traits/environment/obligately_alkaphilic.yaml | Adds UniProt grounding for Na+/H+ antiporter; appends curation event. |
| data/traits/environment/neutrophilic.yaml | Adds UniProt grounding for cation/proton antiporter; appends curation event. |
| data/traits/environment/hyperthermophilic.yaml | Adds UniProt grounding for reverse gyrase; appends curation event. |
| data/traits/environment/facultatively_alkaphilic.yaml | Adds UniProt grounding for Na+/H+ antiporter; appends curation event. |
| data/traits/environment/alkaphilic.yaml | Adds UniProt grounding for Na+/H+ antiporter; appends curation event. |
| data/traits/environment/alkalotolerant.yaml | Adds UniProt grounding for cation/proton antiporter; appends curation event. |
| data/traits/environment/aerotolerant.yaml | Adds UniProt grounding for superoxide dismutase; appends curation event. |
Comments suppressed due to low confidence (5)
scripts/match_uniprot_to_proteins.py:52
- `KG_UNIPROT_NODES` is hardcoded to a local absolute path ("/Users/marcin/..."), which makes this script non-runnable for other contributors and in CI. Please make the UniProt-nodes TSV path configurable (e.g., `--kg-uniprot-nodes` arg and/or env var), and keep the repo default either unset or relative to the repo (with a clear error if missing).
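One way to act on this suggestion is sketched below. The `--kg-uniprot-nodes` flag comes from the comment; the `KG_UNIPROT_NODES` environment-variable name and the `resolve_nodes_path` helper are assumptions made for illustration.

```python
import argparse
import os
from pathlib import Path


def resolve_nodes_path(argv=None, env=None):
    """Resolve the UniProt nodes TSV from a CLI flag or an environment variable."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--kg-uniprot-nodes",
        default=env.get("KG_UNIPROT_NODES"),  # hypothetical env var name
        help="Path to merged-kg_uniprot_nodes.tsv",
    )
    args = parser.parse_args(argv)
    if not args.kg_uniprot_nodes:
        # No repo default: fail loudly instead of guessing a local checkout.
        parser.error("pass --kg-uniprot-nodes or set KG_UNIPROT_NODES")
    path = Path(args.kg_uniprot_nodes)
    if not path.is_file():
        parser.error(f"UniProt nodes file not found: {path}")
    return path
```

`parser.error` prints the message with usage text and exits with status 2, so a missing file surfaces as a clear CLI error rather than a traceback.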
scripts/match_uniprot_to_proteins.py:98
- `build_regex()` will compile a pattern that matches the empty string if `labels` is empty (because `(?:|...)` becomes `(?:)`), which would make `regex.finditer(name)` yield zero-length matches at every position and effectively hang the scan. Please guard against an empty `labels` list (e.g., return early in `main()` with a clear message, or raise in `build_regex()`).
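A guard of the kind requested could look like this. Only the function name `build_regex` comes from the review; the pattern shape, the longest-first ordering, and the case-insensitive flag are assumptions.

```python
import re


def build_regex(labels):
    """Compile one alternation over all labels.

    Raises ValueError for an empty list instead of compiling `(?:)`, which
    would match the empty string at every position of every name.
    """
    labels = [lab for lab in labels if lab]
    if not labels:
        raise ValueError("no labels to match; refusing to build an empty regex")
    ordered = sorted(labels, key=len, reverse=True)
    alternation = "|".join(re.escape(lab) for lab in ordered)
    return re.compile(rf"(?:{alternation})", re.IGNORECASE)
```

Raising inside `build_regex()` keeps the check next to the code that would misbehave, so any future caller gets the same protection as `main()`.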
scripts/match_uniprot_to_proteins.py:206
- The `match_count` reported to `reports/uniprot_match_candidates.tsv` is `len(cands)`, but `cands` is capped at `MAX_MATCHES_PER_LABEL`, so counts of 500 are ambiguous (could mean 500 total, or ">=500"). Please track total match occurrences separately from the stored candidate list (e.g., `total_matches[label]`), and write both to the report so ambiguity/skip thresholds can be reasoned about correctly.
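The separation the comment asks for can be sketched as below. `MAX_MATCHES_PER_LABEL` and `total_matches` are names from the review; the `collect` helper and the `(label, curie, name)` input layout are hypothetical.

```python
from collections import defaultdict

MAX_MATCHES_PER_LABEL = 500  # cap value stated in the PR description


def collect(matches, cap=MAX_MATCHES_PER_LABEL):
    """Store at most `cap` candidates per label while counting every match."""
    cands = defaultdict(list)
    total_matches = defaultdict(int)
    for label, curie, name in matches:
        total_matches[label] += 1  # counted even once the cap is reached
        if len(cands[label]) < cap:
            cands[label].append((curie, name))
    return cands, total_matches
```

Writing both `len(cands[label])` and `total_matches[label]` to the report would distinguish "exactly 500 matches" from "500 stored out of many more".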
scripts/match_uniprot_to_proteins.py:225
- `--apply` appends to `mappings/node_grounding.tsv` unconditionally, so re-running the script will duplicate rows (and make diffs/noise harder to manage). Please load existing mappings first and skip (or update in-place) any `(label, node_type)` keys that already exist, and consider sorting/deduplicating output for deterministic reruns.
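A rerun-safe append might be sketched as follows. The `(label, node_type)` key comes from the comment; the helper name `append_new_mappings`, the assumption that the TSV has a header row containing those two columns, and the assumption that the file already ends with a newline are illustrative.

```python
import csv
from pathlib import Path


def append_new_mappings(tsv_path, new_rows, key_fields=("label", "node_type")):
    """Append only rows whose key is not yet in the TSV; return how many were added."""
    path = Path(tsv_path)
    with path.open(newline="", encoding="utf-8") as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        fieldnames = reader.fieldnames
        existing = {tuple(row[k] for k in key_fields) for row in reader}

    fresh = []
    for row in new_rows:
        key = tuple(row[k] for k in key_fields)
        if key not in existing:
            existing.add(key)  # also dedupes within new_rows
            fresh.append(row)

    # Sort the new rows so repeated runs produce the same file.
    fresh.sort(key=lambda r: tuple(r[k] for k in key_fields))
    with path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(
            fh, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
        )
        writer.writerows(fresh)
    return len(fresh)
```

A second call with the same rows returns 0 and leaves the file unchanged, which is the idempotency property the review is after.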
mappings/node_grounding.tsv:52
- The per-row `notes` string claims the representative was selected via "name-ends-with-label + alphabetic-first CURIE", but the actual selection logic described/implemented in the matcher includes an exact-match tier and a "shortest suffix-hit name" preference with CURIE only as a tiebreaker. Please update the notes text to accurately reflect the selection criteria (or reference the script/report) so the mapping provenance is not misleading.
This was referenced May 24, 2026
Addresses five Copilot findings: CURIE-prefix normalization plus one
docstring drift.
- CURIE prefix UniprotKB → UniProtKB across all emitted artifacts.
The kg-microbe source data uses `UniprotKB:` (lowercase p) but
the TraitMech LinkML schema declares `UniProtKB` (uppercase P)
in its prefix-map. CURIE expansion is case-sensitive in many
consumers, so the mismatch would break downstream IRI
resolution.
Updated:
- mappings/node_grounding.tsv 27 rows
- reports/uniprot_match_candidates.tsv 27 rows
- data/traits/**/*.yaml 31 files (grounding values)
scripts/match_uniprot_to_proteins.py now reads using the
source-data spelling (`UniprotKB:` startswith filter, since
that's what kg-microbe emits) and normalizes the CURIE in-place
before storing it in the matches index, so downstream artifacts
are written in the schema-canonical form.
- Docstring rewrite to match the implemented selection algorithm.
The original docstring described a suffix-first + fallback-to-
any-containing + skip-above-100 scheme, but `pick_representative`
actually implements a strict tier-1 (exact match) + tier-2
(suffix-token, shortest-name) + skip-otherwise scheme. Docstring
now matches the code, and also documents the SKIP_LABELS
blocklist and the CURIE-prefix normalization contract above.
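The read-then-normalize contract described above can be sketched in a few lines. The two prefix spellings are taken from the commit message; the constant and function names are hypothetical.

```python
SOURCE_PREFIX = "UniprotKB:"     # spelling emitted by kg-microbe
CANONICAL_PREFIX = "UniProtKB:"  # spelling declared in the TraitMech schema


def normalize_curie(node_id):
    """Return the schema-canonical CURIE, or None for non-UniProt rows."""
    if not node_id.startswith(SOURCE_PREFIX):
        return None
    return CANONICAL_PREFIX + node_id[len(SOURCE_PREFIX):]
```

Filtering on the source spelling and rewriting before the value enters the matches index means every downstream artifact is written from already-normalized CURIEs.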
Verified: just validate-strict → 0 ERROR rows / 357 files.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
realmarcin
added a commit
that referenced
this pull request
May 24, 2026
…0 nodes) (#74)
Adds 14 mappings across ENVIRONMENTAL_FACTOR (PATO + ENVO), CHEMICAL (CHEBI), and PATHWAY (GO) to ground the top remaining groundable labels in reports/node_grounding_residual.tsv after #66 (39 base) + #69 (8 METPO) + #70 (27 UniProt) + #72 (biomass retype).
Additions (mappings/node_grounding.tsv, 75 → 89 rows):
ENVIRONMENTAL_FACTOR (5 rows, PATO + ENVO):
  acidic external ph → PATO:0001428 acidic pH (5 nodes)
  alkaline external ph → PATO:0001429 alkaline pH (5 nodes)
  near-neutral external ph → PATO:0001432 neutral pH (4 nodes)
  very high temperature → PATO:0001637 extremely high temperature (2 nodes)
  high-salt environment → ENVO:01000687 saline environment (2 nodes)
CHEMICAL (3 rows, CHEBI):
  thiosulfate → CHEBI:33542 thiosulfate(2-) (2 nodes)
  electron donor → CHEBI:17499 electron donor (2 nodes)
  organic compound → CHEBI:50860 organic molecular entity (6 nodes)
PATHWAY (6 rows, GO):
  membrane electron transport chain → GO:0022900 ETC (3 nodes)
  electron transport chain → GO:0022900 (1 node)
  electron transport system → GO:0022900 (1 node)
  co2-fixation pathway → GO:0015977 carbon fixation (3 nodes)
  autotrophic co2 fixation → GO:0015977 (3 nodes)
  co2 fixation pathway → GO:0015977 (1 node)
The PATHWAY additions collapse 6 distinct corpus-paraphrased labels onto 2 GO terms, demonstrating that the (label, node_type) mapping convention supports multi-label-→-one-CURIE without conflict.
Per-corpus impact:
  Mapping TSV: 75 → 89 rows (+14)
  Nodes grounded: ~704 (53%) → ~744 (59%)
Verified:
- just ground-nodes --apply → 40 newly grounded
- just ground-nodes (idempotency) → 0
- just validate-strict → 0 ERROR rows / 357 files
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
realmarcin
added a commit
that referenced
this pull request
May 24, 2026
* Add tests for grounding pipeline + audit scripts (+43 tests)
  The grounding pipelines and audit scripts have been load-bearing infrastructure for the last 7 PRs (#61, #66, #67, #69, #70, all of which rewrite causal-graph fields based on these scripts' output). They had zero unit-test coverage. A silent regression in idempotency, header validation, or self-suppression would not be caught by validate-strict (which only checks per-record schema conformance, not pipeline correctness).
  Test counts:
    tests/test_ground_causal_predicates.py   9 tests
    tests/test_ground_causal_nodes.py       12 tests
    tests/test_validate_strict.py           11 tests
    tests/test_audit_writers.py             11 tests
    ---
    total new                               43 tests
    total suite                             54 tests (was 11)
  Coverage highlights:
  ground_causal_predicates.py:
  - load_mapping: basic happy path, conflict detection (same label → different CURIEs raises ValueError), incomplete-row skipping, missing-file error.
  - ground_edges_in_doc: idempotency (second pass = 0 changes), existing predicate_id never overwritten, residual counting for unmapped labels, empty/missing-predicate edges skipped.
  ground_causal_nodes.py:
  - All of the predicate suite plus:
    - (label, node_type) keyed lookup: same label, different node_types map to different CURIEs without aliasing.
    - Header validation (Copilot fix from PR #66): TSV with `nodetype` / `targetcurie` typo'd headers raises ValueError naming both missing columns.
    - grounded_keys-on-validation-failure separability (Copilot fix from PR #66): caller can union residual + grounded_keys to recover the corpus-state residual after rolling back an invalid file write.
  validate_strict.py:
  - classify: parametrized over the 5 categories (unexpected_field, missing_required, enum_mismatch, pattern_mismatch, other); the messages must match the actual jsonschema phrasings the validator emits.
  - validate_one: clean record produces 0 errors; unknown field surfaces unexpected_field (the G01 gate behavior); missing required field surfaces missing_required; YAML parse error surfaces as yaml_parse_error category.
  - iter_yaml_files: walks directories, filters .txt, picks up nested *.yaml.
  audit_writers.py:
  - looks_like_yaml_writer: yaml.safe_dump / yaml.dump positive, bare .write_text negative, .write_text near .yaml hint positive, arbitrary code negative.
  - audit: full-safeguards writer flagged yes/yes/yes/yes; no-safeguards writer flagged no/no/no; non-writer returns None; wired_into_just yes when justfile mentions the script stem.
  - Self-suppression (Copilot fix from PR #64): audit_writers.py itself returns None even though its own source matches yaml.safe_dump.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Address Copilot review on PR #71
  Add explicit `assert "b.yml" not in names` to test_iter_yaml_files_walks_directory_and_filters; the prior test documented the .yml-skipping behavior in a comment but never asserted it, so a regression that started picking up .yml during directory walks would have slipped through silently.
  Also add test_iter_yaml_files_accepts_yml_file_passed_directly to lock in the asymmetry that the previous test only hinted at: iter_yaml_files() does accept .yml when passed as a file argument (only the rglob('*.yaml') walk is .yaml-only).
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Update audit_writers tests to match #75's tightened heuristic
  PR #75 changed `looks_like_yaml_writer` to require that the yaml-serializer call feed directly into write_text on the same line (instead of the looser "any .write_text + any .yaml token" heuristic, which produced false positives for scripts that only READ trait YAMLs).
  The pre-#75 test asserted that `path.write_text(content) # .yaml` counted as a YAML writer. That returned True under the old heuristic and False under the new (correct) one. Replace it with two tests that lock in the new contract:
    test_looks_like_yaml_writer_write_text_of_yaml_dump
      Positive: write_text(yaml.safe_dump(...)) / write_text(yaml.dump(...)) both count.
    test_looks_like_yaml_writer_write_text_of_json_is_false
      Negative: a script that reads *.yaml then writes JSON via write_text is NOT a YAML writer; this is the false-positive case #75 explicitly fixed for scripts/build_embedding_index.py and scripts/render_trait_pages.py.
  Also rename test_looks_like_yaml_writer_write_text_without_yaml_hint_is_false to test_looks_like_yaml_writer_write_text_plain_is_false since the "yaml hint" phrasing was tied to the old heuristic.
  56 tests pass (was 54; +2).
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Uses kg-microbe's `merged-kg_uniprot_nodes.tsv` (≈2.2 GB, 19.8 M `UniprotKB:` rows) as the source-of-truth label index for TraitMech's ungrounded `GENE_OR_PROTEIN` nodes. Streams the file once, matches each TraitMech residual GENE_OR_PROTEIN label against UniProt names, and grounds 42 nodes via 27 representative UniProt CURIEs.
Per user request: "use the kgm uniprot transform for grounding and reference". The path used is `/Users/marcin/Documents/VIMSS/ontology/KG-Hub/KG-Microbe/kg-microbe/merged-kg_uniprot_nodes.tsv`.
New tooling
`scripts/match_uniprot_to_proteins.py`: one-off matcher. Streams the kg-microbe UniProt nodes file, collects per-label candidates (cap 500/label), and picks a single representative per label via a two-tier selector:
1. Tier 1 (best): UniProt name (cleaned of trailing parens) equals the TraitMech label exactly. Among those, pick the alphabetically-first CURIE for determinism.
2. Tier 2: name ends with `" <label>"` as the final whitespace token (e.g. `"Polarized growth protein Scy"` for label `scy`). Prefer SHORTEST such name (fewer modifier words like chaperone, maturation protein, assembly factor), alphabetic CURIE as tiebreaker.
3. Otherwise: skip (too ambiguous to ground cleanly).
Also skips a hand-curated set of abstract-category labels that shouldn't be grounded to a specific UniProt entry: `gene product`, `virulence factors`, `chaperone proteins`, `thermostable proteins`, `salinity-adaptation genes`, `cold-shock proteins`, `proton export pumps and antiporters`, `membrane transporters`, `gliding motility machinery`, `rod complex`.
Mapping additions
`mappings/node_grounding.tsv` grows 47 → 74 rows. 27 UniProt mappings, for example:
- `mreb` → `UniprotKB:A0A1B1UYY2`
- `ftsz` → `UniprotKB:C0LUM8`
- `diviva` → `UniprotKB:Q1IYG2`
- `atp synthase` → `UniprotKB:A0A415TT77`
- `na+/h+ antiporter` → `UniprotKB:A0A068T423`
- `superoxide dismutase` → `UniprotKB:A0A009QPW9`
- `methyl-coenzyme m reductase` → `UniprotKB:A0A099T5Q9`
- `pqq-dependent methanol dehydrogenase` → `UniprotKB:A0A4U8YZA6`
- `crescentin` → `UniprotKB:A0A2N9AY16`
- `pbp2b` → `UniprotKB:A0A892RPK7`
Full per-label audit at `reports/uniprot_match_candidates.tsv`.
Corpus impact
Caveats
- … `salt-in strategy`, `weak/absent shape-determining cytoskeleton`, etc.).
- … `merged-kg_uniprot_nodes.tsv` into `data/raw/`. Left as-is for this PR since the script is one-off.
Verified locally
Test plan
- … `reports/uniprot_match_candidates.tsv`
🤖 Generated with Claude Code