Wetland #30 backfill, metals extractor bug fix, lint cleanup, cross-repo ID validator by realmarcin · Pull Request #80 · CultureBotAI/CommunityMech

realmarcin · 2026-05-24T04:35:01Z

Summary

Four follow-ups against #30 (cross-repo environmental linking) plus a related lint cleanup, bundled because they share test surface.

1. #30 Phase 5: wetland backfill (6 communities)

Applies the SPRUCE related_ingredients pattern (introduced in #79) to 6 more peatland/wetland communities:

| Community | Ingredients added | Source PMID |
|---|---|---|
| Stordalen Mire Methylotrophic | methanol, methylamines, acetate | PMID:38063415 |
| Prairie Pothole Sulfur/Carbon Virus | sulfate, methanol, ethanol, propan-2-ol | PMID:30086797 |
| MUCC Freshwater Wetland Methane Network | methane, methylated compounds (methanol) | PMID:39843444 |
| Asgard Wetland Soil Methanogenesis-Substrate | acetate, formate, dihydrogen | PMID:39085194 |
| Coastal Forested Wetland Seawater-Ion | sulfate, seawater ions (NaCl), methane | PMID:38628812 |
| Wetland Oxygen-Sulfate GHG | sulfate, dioxygen, lactate, hydrogen sulfide | PMID:38961111 |

All entries use CHEBI terms and snippets taken verbatim from already-cached PubMed abstracts. No mediaingredientmech_id values are set, because none have been minted in MIM yet (same dogfood pattern as SPRUCE).

2. Metals extractor bug fix + 65-file cleanup

src/communitymech/metal_extraction.py used plain substring matching against 2-letter element symbols ("ti" for TITANIUM, "au" for GOLD). This matched inside unrelated words (characteristic, kinetic, australia, auto…), salting metals_present with TITANIUM in 56/67 metal-annotated YAMLs and GOLD in several more.

Fixed: switched to non-alphanumeric-boundary regex matching (case-insensitive). 5 new tests pin the behavior in tests/test_metal_extraction.py.
New scripts/clean_metals_inplace.py re-runs extraction and rewrites only the metals_present / rare_earth_elements_present / metal_relevance / metal_notes blocks via line-based regex replacement — preserving comments, key order, and unrelated whitespace (unlike backfill_metals.py's yaml.dump path).
Applied across the corpus: 65 community YAMLs corrected (most lose TITANIUM; a few also lose GOLD; one gains IRON that the old buggy path missed). Diff is uniformly subtractive in the affected lines except for one new tier-1 detection.
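
The false-positive mechanism and the boundary fix can be sketched as follows. This is a minimal illustration, not the code in metal_extraction.py: the function names `naive_match` and `keyword_in_text_sketch` are hypothetical, and the exact regex used by the real `_keyword_in_text` is not shown on this page. Lookarounds on `[A-Za-z0-9]` are used here instead of `\b` on the assumption that some keywords end in non-word characters (for example charged ionic forms), where `\b` would not match.

```python
import re


def naive_match(keyword: str, text: str) -> bool:
    """Old behavior (sketch): plain case-insensitive substring search."""
    return keyword.lower() in text.lower()


def keyword_in_text_sketch(keyword: str, text: str) -> bool:
    """Match `keyword` only when it is not glued to a letter or digit.

    The lookbehind/lookahead reject a match that is immediately preceded
    or followed by an alphanumeric character, so "ti" no longer matches
    inside "kinetic", while "Ti," or "Ti4+" style tokens still match.
    """
    pattern = rf"(?<![A-Za-z0-9]){re.escape(keyword)}(?![A-Za-z0-9])"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


abstract = "Kinetic characteristics of an Australian isolate."
print(naive_match("ti", abstract))             # substring hit inside "Kinetic"
print(keyword_in_text_sketch("ti", abstract))  # no standalone "ti" token
print(keyword_in_text_sketch("ti", "Alloys of Ti, Fe and Ni"))
```

The same matcher handles keywords with trailing punctuation, such as `fe2+`, because the keyword is passed through `re.escape` and the boundary test looks only at the characters on either side of it.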

3. Lint cleanup (`just lint`: ruff/black green)

178 pre-existing ruff errors → 0:

Removed T20 (print) from the ruff select. Rationale: src/communitymech/ ships CLI entry points (cli.py, render_*, export/*, embedding/*, validators) that legitimately use print for user output; per-call # noqa: T201 is louder than the rule is worth.
Fixed the 44 remaining errors inline: F401 unused rich imports, B904 raise … from e, SIM102/SIM103/SIM108 simplifications, C414 redundant list() inside sorted(), B905 zip(strict=False), E501 line splits (auditor, validators, umap_generator, cli), E402 import order in batch_reporter.py.
Added a per-file E501 ignore for src/communitymech/llm/prompts.py (long prompt strings should not be re-wrapped).
Added targeted # noqa with WHY-comments for S301 (pickle from internal cache only), S701 (jinja2 autoescape would break JSON-in-script), S704 (markupsafe.Markup on curator-supplied Mermaid body), S112 (intentional skip-on-parse-fail in id_utils).

just lint is still not all-green because mypy still reports 256 pre-existing errors (yaml stub missing, implicit Optionals, Console = None reassignment). Out of scope for this PR — flagging as separate tech debt.

4. Cross-repo ID existence validator (#30 Phase 3, local half)

New module communitymech.validators.cross_repo_ids with a two-stage validator:

Pattern check — culturemech_id matches CultureMech:NNNNNN, mediaingredientmech_id matches MediaIngredientMech:NNNNNN. Always runs.
Existence check — when sibling-repo paths are configured (via flag or COMMUNITYMECH_SIBLING_REPOS env var), each ID is looked up in the partner repo's kb/ dir. Opt-in: with no path configured, the validator emits info-level skip notices rather than silently passing.

CLI: scripts/validate_cross_repo_ids.py. Justfile entries: validate-cross-repo-ids FILE, validate-cross-repo-ids-all. 10 new tests in tests/test_cross_repo_ids.py cover patterns, both repos configured, neither configured, malformed IDs, edge cases.
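
A minimal sketch of the two-stage logic described above, under stated assumptions: `NNNNNN` is read as exactly six digits, the existence check is modeled as a text search through `*.yaml` files under the sibling repo's kb/ directory, and the `check_id` function with its `(level, message)` return shape is hypothetical. The real module's API and lookup strategy are not shown on this page.

```python
import re
from pathlib import Path

# Assumption: "NNNNNN" means exactly six digits.
ID_PATTERNS = {
    "culturemech_id": re.compile(r"CultureMech:\d{6}"),
    "mediaingredientmech_id": re.compile(r"MediaIngredientMech:\d{6}"),
}


def check_id(field, value, sibling_kb=None):
    """Return a (level, message) pair for one cross-repo ID.

    Stage 1 (pattern) always runs. Stage 2 (existence) runs only when
    `sibling_kb` points at the partner repo's kb/ directory; otherwise
    an info-level skip notice is returned instead of a silent pass.
    """
    if not ID_PATTERNS[field].fullmatch(value):
        return ("error", f"{field}={value!r} does not match the expected pattern")
    if sibling_kb is None:
        return ("info", f"{field}={value!r}: existence check skipped (no sibling repo configured)")
    # Assumption: an ID "exists" if some YAML file under kb/ mentions it.
    found = any(
        value in path.read_text(encoding="utf-8")
        for path in Path(sibling_kb).rglob("*.yaml")
    )
    if found:
        return ("ok", f"{field}={value!r} found in {sibling_kb}")
    return ("error", f"{field}={value!r} not found in {sibling_kb}")


print(check_id("culturemech_id", "CultureMech:000123"))
print(check_id("culturemech_id", "culturemech:123"))
```

Returning an explicit info-level notice when no sibling path is configured makes a skipped check visible in the output, which is the property the PR description calls out.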

Test plan

just test: 136 passed, 9 skipped (was 121; +5 in metal_extraction, +10 in cross_repo_ids)
just validate-all — all 265 community YAMLs validate against schema
just format — clean
uv run ruff check src/ tests/ — clean (was 178 errors → 0)
uv run black --check src/ tests/ — clean (49 files)
just validate-cross-repo-ids-all — all clean (no cross-repo IDs to existence-check yet, by design)

🤖 Generated with Claude Code

…repo validator

Combines four follow-ups against #30 (cross-repo environmental linking) plus an unrelated lint cleanup, all of which build on each other and share the same test surface.

1. Wetland backfill (#30 Phase 5)

Apply the SPRUCE related_ingredients pattern to 6 more peatland and wetland communities (Stordalen Mire, Prairie Pothole, MUCC Freshwater Wetland, Asgard Wetland Soil, Coastal Forested Wetland, Wetland Oxygen-Sulfate GHG). Each entry uses CHEBI terms and evidence anchored to already-cached PubMed abstracts; no MediaIngredientMech IDs are minted yet.

2. Metals extractor bug fix + 65-file cleanup

metal_extraction.py used plain substring matching against 2-letter element symbols ('ti' for TITANIUM, 'au' for GOLD), which matched inside unrelated words ('characteristic', 'australia') and salted metals_present with TITANIUM in 56/67 metal-annotated YAMLs and GOLD in several more. Switched to non-alphanumeric-boundary regex matching (case-insensitive), with tests pinning the behavior.

scripts/clean_metals_inplace.py re-runs extraction and rewrites only the metals_present / rare_earth_elements_present / metal_relevance / metal_notes blocks via line-based replacement, preserving comments and unrelated formatting (unlike backfill_metals.py's yaml.dump path). Applied once across the corpus: 65 community YAMLs corrected.

3. Lint cleanup (just lint ruff/black)

178 pre-existing ruff errors -> 0. Removed T20 (print) from the ruff selection with rationale: src/communitymech/ ships CLI entry points that legitimately use print. The remaining 44 non-print errors were fixed inline (unused imports, raise-from chains, collapsible ifs, redundant list() calls, zip strict, line splits, import order in batch_reporter.py) or suppressed with a per-file E501 ignore for llm/prompts.py (long prompt strings) and targeted `# noqa` lines with comments for S301/S701/S704/S112 cases that are intentional within their internal-only contexts.

mypy still reports 256 pre-existing errors and is out of scope here.

4. Cross-repo ID validator (#30 Phase 3, local half)

New module communitymech.validators.cross_repo_ids with a pattern + existence checker, plus a CLI (scripts/validate_cross_repo_ids.py) and justfile entries (validate-cross-repo-ids, validate-cross-repo-ids-all). Sibling repo paths are opt-in via env or flags; when omitted, the validator emits info-level skip notices rather than silently passing. 10 new tests cover pattern, existence, and edge cases.

Test plan: just test (136 passed, 9 skipped), just validate-all (all 265 communities clean), ruff/black green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Bundles four follow-ups to #30: adds related_ingredients to 6 wetland/peatland community YAMLs (SPRUCE pattern); fixes the metal/REE keyword extractor's false-positive substring matching by anchoring on non-alphanumeric boundaries and re-cleans 65 community YAMLs via a new in-place script; introduces a cross-repo ID validator module, CLI, justfile targets, and tests; and performs a broad ruff/black lint cleanup including dropping the T20 rule and adding targeted # noqa justifications.

Changes:

Wetland related_ingredients backfill across 6 community YAMLs using cached PMID-anchored evidence.
Metal/REE extractor bug fix (_keyword_in_text boundary regex) plus mass YAML cleanup via scripts/clean_metals_inplace.py.
New communitymech.validators.cross_repo_ids module, scripts/validate_cross_repo_ids.py, justfile entries, docs, and 10 tests; plus repo-wide ruff/black lint cleanup.

Reviewed changes

Copilot reviewed 92 out of 92 changed files in this pull request and generated 2 comments.


| File | Description |
|---|---|
| `src/communitymech/metal_extraction.py` | Adds `_keyword_in_text` boundary matcher and routes all keyword checks through it. |
| `scripts/clean_metals_inplace.py` | New script that re-extracts metals and rewrites only the `metals_present`/`rare_earth_elements_present`/`metal_relevance`/`metal_notes` blocks. |
| `src/communitymech/validators/cross_repo_ids.py` | New pattern + opt-in existence validator for `culturemech_id` / `mediaingredientmech_id`. |
| `scripts/validate_cross_repo_ids.py` | CLI wrapping the cross-repo validator, with env-var sibling-repo configuration. |
| `tests/test_metal_extraction.py`, `tests/test_cross_repo_ids.py` | New unit tests pinning the keyword-boundary fix and the cross-repo validator behavior. |
| `kb/communities/Wetland.yaml`, `kb/communities/SPRUCE*…` etc. (6 files) | Adds `related_ingredients` blocks with CHEBI terms and verbatim PMID-snippet evidence. |
| 65 × `kb/communities/*.yaml` | Auto-rewritten `metals_present`/`rare_earth_elements_present`/`metal_relevance`/`metal_notes` per the new extractor. |
| `src/communitymech/network/auditor.py`, `network/validators.py`, `network/batch_reporter.py`, `network/llm_repair.py`, `llm/anthropic_client.py`, `literature.py`, `cli.py`, `visualization/umap_generator.py`, `render_community_pages.py`, `embedding/loader.py`, `utils/id_utils.py`, `uniprot_reference_proteomes.py` | Lint cleanup: line splits, `raise … from`, SIM simplifications, `zip(..., strict=False)`, targeted `# noqa` with rationale comments. |
| `pyproject.toml`, `justfile`, `docs/cross_repo_linking.md` | Ruff `select` drops `T20`, adds per-file E501 ignore for prompts; justfile gains cross-repo-id targets; docs document the validator. |


Copilot flagged two serious bugs in scripts/clean_metals_inplace.py from the prior commit:

1. _replace_scalar rewrote metal_notes by substituting only the first physical line of the YAML key. When the existing value spanned multiple lines (as PyYAML's folded scalars often do), the indented continuation lines were left orphaned and silently re-folded by the parser into the new value, producing strings like "...(context-validated) measurements; ...(context-validated)" and, on Ngawha, merging curator prose about mercury cycling into the auto-generated note.
2. The script unconditionally overwrote metal_notes and metal_relevance and removed any metals_present entries the (newly fixed) extractor wouldn't infer. That clobbered curator-authored values (Ngawha's MERCURY + curator note, Oak Ridge's NICKEL/COBALT/ZINC, Bayan Obo notes, etc.): entries the extractor cannot derive but that are curator decisions to keep.

Reverted all 65 YAMLs the prior commit touched, then rewrote the script to be surgical:

- Touches only metals_present. Never reads or writes metal_relevance or metal_notes, which sidesteps the multi-line scalar bug entirely and preserves curator metadata.
- Removes only entries whose extractor keyword list contains a known ambiguous short symbol (`ti`/`au`/`pd`) AND whose unambiguous tokens (full element name, charged ionic forms) do not appear anywhere else in the file as word-bounded tokens. Anything else is kept, including curator-added entries the extractor couldn't have inferred.
- Never adds metals. Surprising additions (e.g., Trichodesmium IRON via newly-correct CHEBI tier-1 matching) are out of scope; running `scripts/backfill_metals.py --dry-run` surfaces them for separate curator review.

Result: 56 files (down from 65), each diff is a 1-2 line removal of TITANIUM and/or GOLD. Ngawha MERCURY, Oak Ridge metals, all curator metal_notes preserved verbatim. 136 tests pass, all 265 communities validate.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
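
The removal rule in the rewritten script can be sketched roughly as below. This is an illustration under assumptions, not the script itself: the `KEYWORDS` table is a made-up stand-in for the extractor's real keyword lists, the helper names are hypothetical, and `other_text` is assumed to be the file content with the metals_present block itself excluded (so an entry's own label does not count as evidence for keeping it).

```python
import re

# Short symbols the commit message names as ambiguous.
AMBIGUOUS_SYMBOLS = {"ti", "au", "pd"}

# Hypothetical keyword lists; the extractor's real lists are not shown here.
KEYWORDS = {
    "TITANIUM": ["ti", "titanium", "ti4+"],
    "GOLD": ["au", "gold", "au3+"],
    "MERCURY": ["mercury", "hg2+"],
}


def has_token(token: str, text: str) -> bool:
    """True if `token` occurs in `text` without adjacent letters or digits."""
    pattern = rf"(?<![A-Za-z0-9]){re.escape(token)}(?![A-Za-z0-9])"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def should_remove(metal: str, other_text: str) -> bool:
    """Remove only if the metal has an ambiguous symbol and no other support."""
    keywords = KEYWORDS.get(metal, [])
    if not any(k in AMBIGUOUS_SYMBOLS for k in keywords):
        return False  # unambiguous or curator-added entry: always kept
    unambiguous = [k for k in keywords if k not in AMBIGUOUS_SYMBOLS]
    return not any(has_token(k, other_text) for k in unambiguous)


def prune(metals, other_text):
    """Return `metals` minus likely false positives; never adds entries."""
    return [m for m in metals if not should_remove(m, other_text)]


text = "Kinetic characteristics of an Australian isolate; mercury cycling noted."
print(prune(["TITANIUM", "GOLD", "MERCURY", "NICKEL"], text))
```

Because the rule only ever filters the existing list, an entry with no known keyword list (NICKEL in this sketch) is treated as a curator decision and left in place.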

* #30 backfill batch 2: 4 metals + 3 gut/rhizosphere communities

Continues the SPRUCE/wetland dogfood pattern from PRs #79/#80/#81. Each entry uses CHEBI terms with snippets taken verbatim from cached PMID/DOI abstracts; no cross-repo IDs (MIM IDs haven't been minted).

AMD/biomining/REE (4 of 16 remaining):

| Community | Ingredients | Source |
|---|---|---|
| Cyprus_Copper_Sulphide_Bioleaching_Consortium | chalcopyrite (Cu(II) surrogate), chalcocite (Cu(I) sulfide), iron(2+) | PMID:41381092 |
| Ferroplasma_Leptospirillum_Syntrophy | iron(2+), pyrite | PMID:16104851 |
| Iberian_Pit_Lake_Stratified_Community | sulfate, iron(2+) | PMID:23840525 |
| Ewaste_Bioleaching_Consortium | glycine (10 g/L cyanide substrate), hydrogen cyanide (gold lixiviant) | PMID:26704063 |

Gut/rhizosphere (3 of ~13 remaining):

| Community | Ingredients | Source |
|---|---|---|
| Bacteroides_Eubacterium_Gnotobiotic_Gut_Model | acetate, butyrate, host-derived mucin glycans | PMID:19321416 |
| Brachypodium_Young_Root_Rhizosphere_EcoFAB_Community | root exudates, labile root carbon | PMID:37280433 |
| ORNL_PMI_Populus_PD10_SynCom | glucose (minimal-medium axis) | PMID:33995895 |

#30 related_ingredients adoption: 12/265 -> 19/265.

Test plan: just test (136 passed), all 7 modified files validate clean against the schema.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address Copilot review on #83

Five findings, all valid:

1. Cyprus chalcopyrite was mapped to CHEBI:30074 / "copper(2+)", which is wrong on both axes. Updated to CHEBI:50885 / "chalcopyrite", the mapping the repo already uses (Copper_Biomining_Heap_Leach metabolites).
2. Ewaste cyanide entry's `chebi_term.label` said "hydrogen cyanide" but CHEBI:17514's canonical label is "cyanide". Aligned label.
3. Ewaste cyanide entry's snippet ("This gold complexing agent was used…") did not literally mention cyanide. Replaced with the more direct adjacent abstract sentence ("cyanide-producing heterotrophic Pseudomonas fluorescens and Pseudomonas putida were used") and moved the gold-complexing context into the explanation field.
4. Iberian Pit Lake relevance text described an Fe(II)/Fe(III) cycle across the chemocline but only iron(2+) was listed. Added a separate iron(3+) related_ingredient with its own snippet anchoring the bottom-layer iron-reducing guild (Acidiphilium / Ferroplasma / Acidithiobacillus ferrooxidans in reducing mode); split the original Fe(II) relevance text to reference only the oxidising guild.
5. Ewaste "gold-mobilisation" -> "gold-mobilization" for spelling consistency with the rest of the repo (American spelling).

136 tests still pass; all 3 modified YAMLs validate clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 24, 2026 04:35

Copilot started reviewing on behalf of realmarcin May 24, 2026 04:35 View session

Copilot AI reviewed May 24, 2026


Comment thread scripts/clean_metals_inplace.py Outdated

Comment thread scripts/clean_metals_inplace.py Outdated

realmarcin merged commit 7fed895 into main May 24, 2026

realmarcin deleted the wetland-backfill-metals-ruff-validator branch May 24, 2026 04:54

This was referenced May 24, 2026

Mypy green + 5 community backfills + keyword_in_text consolidation #81

Merged

#30 backfill batch 2: 4 metals + 3 gut/rhizosphere communities #83

Merged

realmarcin merged 2 commits into main from wetland-backfill-metals-ruff-validator

2 participants
