Wetland #30 backfill, metals extractor bug fix, lint cleanup, cross-repo ID validator #80
Merged
…repo validator

Combines four follow-ups against #30 (cross-repo environmental linking) plus an unrelated lint cleanup, all of which build on each other and share the same test surface.

1. Wetland backfill (#30 Phase 5)

Apply the SPRUCE related_ingredients pattern to 6 more peatland and wetland communities (Stordalen Mire, Prairie Pothole, MUCC Freshwater Wetland, Asgard Wetland Soil, Coastal Forested Wetland, Wetland Oxygen-Sulfate GHG). Each entry uses CHEBI terms and evidence anchored to already-cached PubMed abstracts; no MediaIngredientMech IDs are minted yet.

2. Metals extractor bug fix + 65-file cleanup

metal_extraction.py used plain substring matching against 2-letter element symbols ('ti' for TITANIUM, 'au' for GOLD), which matched inside unrelated words ('characteristic', 'australia') and salted metals_present with TITANIUM in 56/67 metal-annotated YAMLs and GOLD in several more. Switched to non-alphanumeric-boundary regex matching (case-insensitive), with tests pinning the behavior.

scripts/clean_metals_inplace.py re-runs extraction and rewrites only the metals_present / rare_earth_elements_present / metal_relevance / metal_notes blocks via line-based replacement, preserving comments and unrelated formatting (unlike backfill_metals.py's yaml.dump path). Applied once across the corpus: 65 community YAMLs corrected.

3. Lint cleanup (just lint ruff/black)

178 pre-existing ruff errors -> 0. Removed T20 (print) from the ruff selection with rationale: src/communitymech/ ships CLI entry points that legitimately use print. The remaining 44 non-print errors were fixed inline (unused imports, raise-from chains, collapsible ifs, redundant list() calls, zip strict, line splits, import order in batch_reporter.py) or suppressed with a per-file E501 ignore for llm/prompts.py (long prompt strings) and targeted `# noqa` lines with comments for S301/S701/S704/S112 cases that are intentional within their internal-only contexts.

mypy still reports 256 pre-existing errors and is out of scope here.

4. Cross-repo ID validator (#30 Phase 3, local half)

New module communitymech.validators.cross_repo_ids with a pattern + existence checker, plus a CLI (scripts/validate_cross_repo_ids.py) and justfile entries (validate-cross-repo-ids, validate-cross-repo-ids-all). Sibling repo paths are opt-in via env or flags; when omitted, the validator emits info-level skip notices rather than silently passing. 10 new tests cover pattern, existence, and edge cases.

Test plan: just test (136 passed, 9 skipped), just validate-all (all 265 communities clean), ruff/black green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
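The substring false positive from item 2 and the boundary-anchored fix can be illustrated with a minimal sketch. This is not the repository's `_keyword_in_text` implementation, whose body is not shown in this thread; the function names `naive_match` and `keyword_in_text_sketch` are hypothetical, and the only behavior assumed is what the commit message states: case-insensitive matching delimited by non-alphanumeric characters.

```python
import re


def naive_match(keyword: str, text: str) -> bool:
    """Old behavior as described: plain case-insensitive substring test."""
    return keyword.lower() in text.lower()


def keyword_in_text_sketch(keyword: str, text: str) -> bool:
    """Hypothetical stand-in for the fixed matcher.

    The keyword only counts when the characters on both sides are
    non-alphanumeric (or the string edge), checked case-insensitively.
    """
    pattern = rf"(?<![A-Za-z0-9]){re.escape(keyword)}(?![A-Za-z0-9])"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


# 'ti' hides inside 'characteristic'; 'au' hides inside 'Australia'.
print(naive_match("ti", "characteristic growth kinetics"))             # substring hit
print(keyword_in_text_sketch("ti", "characteristic growth kinetics"))  # no boundary hit
print(keyword_in_text_sketch("au", "leaching of Au(III) from ore"))    # genuine symbol
```

Lookarounds are used instead of `\b` because `\b` treats the underscore as a word character, whereas the commit message describes the boundary as non-alphanumeric.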
Pull request overview
Bundles four follow-ups to #30: adds related_ingredients to 6 wetland/peatland community YAMLs (SPRUCE pattern); fixes the metal/REE keyword extractor's false-positive substring matching by anchoring on non-alphanumeric boundaries and re-cleans 65 community YAMLs via a new in-place script; introduces a cross-repo ID validator module, CLI, justfile targets, and tests; and performs a broad ruff/black lint cleanup including dropping the T20 rule and adding targeted # noqa justifications.
Changes:
- Wetland `related_ingredients` backfill across 6 community YAMLs using cached PMID-anchored evidence.
- Metal/REE extractor bug fix (`_keyword_in_text` boundary regex) plus mass YAML cleanup via `scripts/clean_metals_inplace.py`.
- New `communitymech.validators.cross_repo_ids` module, `scripts/validate_cross_repo_ids.py`, justfile entries, docs, and 10 tests; plus repo-wide ruff/black lint cleanup.
Reviewed changes
Copilot reviewed 92 out of 92 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `src/communitymech/metal_extraction.py` | Adds `_keyword_in_text` boundary matcher and routes all keyword checks through it. |
| `scripts/clean_metals_inplace.py` | New script that re-extracts metals and rewrites only the `metals_present`/`rare_earth_elements_present`/`metal_relevance`/`metal_notes` blocks. |
| `src/communitymech/validators/cross_repo_ids.py` | New pattern + opt-in existence validator for `culturemech_id` / `mediaingredientmech_id`. |
| `scripts/validate_cross_repo_ids.py` | CLI wrapping the cross-repo validator, with env-var sibling-repo configuration. |
| `tests/test_metal_extraction.py`, `tests/test_cross_repo_ids.py` | New unit tests pinning the keyword-boundary fix and the cross-repo validator behavior. |
| `kb/communities/*Wetland*.yaml`, `kb/communities/SPRUCE*`… etc. (6 files) | Adds `related_ingredients` blocks with CHEBI terms and verbatim PMID-snippet evidence. |
| 65 × `kb/communities/*.yaml` | Auto-rewritten `metals_present`/`rare_earth_elements_present`/`metal_relevance`/`metal_notes` per the new extractor. |
| `src/communitymech/network/auditor.py`, `network/validators.py`, `network/batch_reporter.py`, `network/llm_repair.py`, `llm/anthropic_client.py`, `literature.py`, `cli.py`, `visualization/umap_generator.py`, `render_community_pages.py`, `embedding/loader.py`, `utils/id_utils.py`, `uniprot_reference_proteomes.py` | Lint cleanup: line splits, `raise … from`, SIM simplifications, `zip(..., strict=False)`, targeted `# noqa` with rationale comments. |
| `pyproject.toml`, `justfile`, `docs/cross_repo_linking.md` | Ruff `select` drops T20, adds per-file E501 ignore for prompts; justfile gains cross-repo-id targets; docs document the validator. |
Copilot flagged two serious bugs in scripts/clean_metals_inplace.py from the prior commit:

1. _replace_scalar rewrote metal_notes by substituting only the first physical line of the YAML key. When the existing value spanned multiple lines (as PyYAML's folded scalars often do), the indented continuation lines were left orphaned and silently re-folded by the parser into the new value. This produced strings like "...(context-validated) measurements; ...(context-validated)" and, on Ngawha, merged curator prose about mercury cycling into the auto-generated note.

2. The script unconditionally overwrote metal_notes and metal_relevance and removed any metals_present entries the (newly fixed) extractor wouldn't infer. That clobbered curator-authored values (Ngawha's MERCURY + curator note, Oak Ridge's NICKEL/COBALT/ZINC, Bayan Obo notes, etc.): entries the extractor cannot derive but that curators decided to keep.

Reverted all 65 YAMLs the prior commit touched, then rewrote the script to be surgical:

- Touches only metals_present. Never reads or writes metal_relevance or metal_notes, which sidesteps the multi-line scalar bug entirely and preserves curator metadata.
- Removes only entries whose extractor keyword list contains a known ambiguous short symbol (`ti`/`au`/`pd`) AND whose unambiguous tokens (full element name, charged ionic forms) do not appear anywhere else in the file as word-bounded tokens. Anything else is kept, including curator-added entries the extractor couldn't have inferred.
- Never adds metals. Surprising additions (e.g., Trichodesmium IRON via newly-correct CHEBI tier-1 matching) are out of scope; running `scripts/backfill_metals.py --dry-run` surfaces them for separate curator review.

Result: 56 files (down from 65); each diff is a 1-2 line removal of TITANIUM and/or GOLD. Ngawha MERCURY, Oak Ridge metals, and all curator metal_notes are preserved verbatim. 136 tests pass, all 265 communities validate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
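The removal rule in the rewritten script (drop an entry only when its keyword list includes an ambiguous short symbol and none of its unambiguous tokens appear elsewhere in the file) can be sketched roughly as follows. This is an illustration under assumptions, not the code in `scripts/clean_metals_inplace.py`: the names `should_remove`, `has_token`, and the example keyword lists are hypothetical, and the caller is assumed to pass the file text with the `metals_present` block itself already excluded.

```python
import re

# Ambiguous short symbols named in the commit message.
AMBIGUOUS_SYMBOLS = {"ti", "au", "pd"}


def has_token(token: str, text: str) -> bool:
    """True if `token` occurs delimited by non-alphanumeric characters."""
    pattern = rf"(?<![A-Za-z0-9]){re.escape(token)}(?![A-Za-z0-9])"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def should_remove(keywords: list[str], text_outside_metals_block: str) -> bool:
    """Decide whether one metals_present entry looks like a substring artifact.

    `keywords` is the (hypothetical) extractor keyword list for that metal.
    """
    lowered = [k.lower() for k in keywords]
    if not any(k in AMBIGUOUS_SYMBOLS for k in lowered):
        # The entry cannot have come from the short-symbol bug: keep it.
        return False
    unambiguous = [k for k in lowered if k not in AMBIGUOUS_SYMBOLS]
    # Keep the entry if any unambiguous token genuinely appears in the file.
    return not any(has_token(k, text_outside_metals_block) for k in unambiguous)


# Hypothetical keyword lists, for illustration only.
titanium = ["ti", "titanium"]
mercury = ["hg", "mercury"]

print(should_remove(titanium, "Isolates showed characteristic acid tolerance."))
print(should_remove(titanium, "Titanium dioxide nanoparticles were added."))
print(should_remove(mercury, "Geothermal spring community."))
```

Because the function only ever returns a removal decision, it cannot add metals or touch other keys, which mirrors the "never adds" and "touches only metals_present" constraints described above.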
This was referenced May 24, 2026
realmarcin added a commit that referenced this pull request May 24, 2026
* #30 backfill batch 2: 4 metals + 3 gut/rhizosphere communities

Continues the SPRUCE/wetland dogfood pattern from PRs #79/#80/#81. Each entry uses CHEBI terms with snippets taken verbatim from cached PMID/DOI abstracts; no cross-repo IDs (MIM IDs haven't been minted).

AMD/biomining/REE (4 of 16 remaining):

| Community | Ingredients | Source |
|---|---|---|
| Cyprus_Copper_Sulphide_Bioleaching_Consortium | chalcopyrite (Cu(II) surrogate), chalcocite (Cu(I) sulfide), iron(2+) | PMID:41381092 |
| Ferroplasma_Leptospirillum_Syntrophy | iron(2+), pyrite | PMID:16104851 |
| Iberian_Pit_Lake_Stratified_Community | sulfate, iron(2+) | PMID:23840525 |
| Ewaste_Bioleaching_Consortium | glycine (10 g/L cyanide substrate), hydrogen cyanide (gold lixiviant) | PMID:26704063 |

Gut/rhizosphere (3 of ~13 remaining):

| Community | Ingredients | Source |
|---|---|---|
| Bacteroides_Eubacterium_Gnotobiotic_Gut_Model | acetate, butyrate, host-derived mucin glycans | PMID:19321416 |
| Brachypodium_Young_Root_Rhizosphere_EcoFAB_Community | root exudates, labile root carbon | PMID:37280433 |
| ORNL_PMI_Populus_PD10_SynCom | glucose (minimal-medium axis) | PMID:33995895 |

#30 related_ingredients adoption: 12/265 -> 19/265.

Test plan: just test (136 passed), all 7 modified files validate clean against the schema.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address Copilot review on #83

Five findings, all valid:

1. Cyprus chalcopyrite was mapped to CHEBI:30074 / "copper(2+)", which is wrong on both axes. Updated to CHEBI:50885 / "chalcopyrite", the mapping the repo already uses (Copper_Biomining_Heap_Leach metabolites).
2. Ewaste cyanide entry's `chebi_term.label` said "hydrogen cyanide" but CHEBI:17514's canonical label is "cyanide". Aligned label.
3. Ewaste cyanide entry's snippet ("This gold complexing agent was used…") did not literally mention cyanide. Replaced with the more direct adjacent abstract sentence ("cyanide-producing heterotrophic Pseudomonas fluorescens and Pseudomonas putida were used") and moved the gold-complexing context into the explanation field.
4. Iberian Pit Lake relevance text described an Fe(II)/Fe(III) cycle across the chemocline but only iron(2+) was listed. Added a separate iron(3+) related_ingredient with its own snippet anchoring the bottom-layer iron-reducing guild (Acidiphilium / Ferroplasma / Acidithiobacillus ferrooxidans in reducing mode); split the original Fe(II) relevance text to reference only the oxidising guild.
5. Ewaste "gold-mobilisation" -> "gold-mobilization" for spelling consistency with the rest of the repo (American spelling).

136 tests still pass; all 3 modified YAMLs validate clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Four follow-ups against #30 (cross-repo environmental linking) plus a related lint cleanup, bundled because they share test surface.

1. #30 Phase 5: wetland backfill (6 communities)
Applies the SPRUCE `related_ingredients` pattern (introduced in #79) to 6 more peatland/wetland communities. All entries use CHEBI terms and snippets taken verbatim from already-cached PubMed abstracts. No `mediaingredientmech_id` values are set (none have been minted in MIM yet; same dogfood pattern as SPRUCE).

2. Metals extractor bug fix + 65-file cleanup
- `src/communitymech/metal_extraction.py` used plain substring matching against 2-letter element symbols (`"ti"` for TITANIUM, `"au"` for GOLD). This matched inside unrelated words (`characteristic`, `kinetic`, `australia`, `auto`…), salting `metals_present` with TITANIUM in 56/67 metal-annotated YAMLs and GOLD in several more.
- Switched to case-insensitive matching on non-alphanumeric boundaries, with the behavior pinned by tests in `tests/test_metal_extraction.py`.
- `scripts/clean_metals_inplace.py` re-runs extraction and rewrites only the `metals_present`/`rare_earth_elements_present`/`metal_relevance`/`metal_notes` blocks via line-based regex replacement, preserving comments, key order, and unrelated whitespace (unlike `backfill_metals.py`'s `yaml.dump` path).

3. Lint cleanup (`just lint`: ruff/black green)
178 pre-existing ruff errors → 0:
- Removed `T20` (print) from the ruff `select`. Rationale: `src/communitymech/` ships CLI entry points (`cli.py`, `render_*`, `export/*`, `embedding/*`, `validators`) that legitimately use `print` for user output; per-call `# noqa: T201` is louder than the rule is worth.
- Fixed inline: unused imports, `raise … from e`, SIM102/SIM103/SIM108 simplifications, C414 redundant `list()` inside `sorted()`, B905 `zip(strict=False)`, E501 line splits (auditor, validators, umap_generator, cli), E402 import order in `batch_reporter.py`.
- Per-file E501 ignore for `src/communitymech/llm/prompts.py` (long prompt strings should not be re-wrapped).
- Targeted `# noqa` with WHY-comments for S301 (pickle from internal cache only), S701 (jinja2 autoescape would break JSON-in-script), S704 (markupsafe.Markup on curator-supplied Mermaid body), S112 (intentional skip-on-parse-fail in id_utils).

`just lint` is still not all-green because mypy still reports 256 pre-existing errors (yaml stub missing, implicit Optionals, `Console = None` reassignment). Out of scope for this PR; flagging as separate tech debt.

4. Cross-repo ID existence validator (#30 Phase 3, local half)
New module `communitymech.validators.cross_repo_ids` with a two-stage validator:
- Pattern: `culturemech_id` matches `CultureMech:NNNNNN`, `mediaingredientmech_id` matches `MediaIngredientMech:NNNNNN`. Always runs.
- Existence: when sibling repo paths are configured (via flags or the `COMMUNITYMECH_SIBLING_REPOS` env var), each ID is looked up in the partner repo's `kb/` dir. Opt-in: with no path configured, the validator emits info-level skip notices rather than silently passing.

CLI: `scripts/validate_cross_repo_ids.py`. Justfile entries: `validate-cross-repo-ids FILE`, `validate-cross-repo-ids-all`. 10 new tests in `tests/test_cross_repo_ids.py` cover patterns, both repos configured, neither configured, malformed IDs, edge cases.

Test plan
- `just test`: 136 passed, 9 skipped (was 121; +5 metal_extraction, +10 cross_repo_ids)
- `just validate-all`: all 265 community YAMLs validate against schema
- `just format`: clean
- `uv run ruff check src/ tests/`: clean (was 178 errors → 0)
- `uv run black --check src/ tests/`: clean (49 files)
- `just validate-cross-repo-ids-all`: all clean (no cross-repo IDs to existence-check yet, by design)

🤖 Generated with Claude Code
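A rough sketch of the two-stage check described in section 4 may help reviewers picture the behavior. It is not the contents of `communitymech.validators.cross_repo_ids`: the function name `check_id`, the `(level, message)` return shape, the reading of `NNNNNN` as exactly six digits, and the "search every YAML under the sibling `kb/` dir" lookup are all assumptions made for illustration.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: NNNNNN in the PR description means exactly six digits.
ID_PATTERNS = {
    "culturemech_id": re.compile(r"CultureMech:\d{6}"),
    "mediaingredientmech_id": re.compile(r"MediaIngredientMech:\d{6}"),
}


def check_id(field: str, value: str, sibling_kb: Optional[Path] = None) -> tuple[str, str]:
    """Return a (level, message) pair for one cross-repo ID.

    Stage 1 (pattern) always runs. Stage 2 (existence) runs only when a
    sibling kb/ directory is supplied; otherwise an info-level skip notice
    is returned instead of a silent pass.
    """
    if not ID_PATTERNS[field].fullmatch(value):
        return ("error", f"{field}={value!r} does not match the expected pattern")
    if sibling_kb is None:
        return ("info", f"{field}={value!r}: existence check skipped (no sibling repo configured)")
    # Hypothetical lookup strategy: the ID must appear in some YAML under kb/.
    for path in sibling_kb.rglob("*.yaml"):
        if value in path.read_text(encoding="utf-8"):
            return ("ok", f"{field}={value!r} found in {path.name}")
    return ("error", f"{field}={value!r} not found under {sibling_kb}")


print(check_id("culturemech_id", "CultureMech:000042"))
print(check_id("culturemech_id", "CultureMech:42"))
```

Returning an explicit `info` level when no sibling path is given keeps "not checked" distinguishable from "checked and passed", which is the opt-in behavior the description calls out.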