Wetland #30 backfill, metals extractor bug fix, lint cleanup, cross-repo ID validator #80

Merged
realmarcin merged 2 commits into main from wetland-backfill-metals-ruff-validator
May 24, 2026

Conversation

@realmarcin
Contributor

Summary

Four follow-ups against #30 (cross-repo environmental linking) plus a related lint cleanup, bundled because they share test surface.

1. #30 Phase 5: wetland backfill (6 communities)

Applies the SPRUCE related_ingredients pattern (introduced in #79) to 6 more peatland/wetland communities:

| Community | Ingredients added | Source PMID |
|---|---|---|
| Stordalen Mire Methylotrophic | methanol, methylamines, acetate | PMID:38063415 |
| Prairie Pothole Sulfur/Carbon Virus | sulfate, methanol, ethanol, propan-2-ol | PMID:30086797 |
| MUCC Freshwater Wetland Methane Network | methane, methylated compounds (methanol) | PMID:39843444 |
| Asgard Wetland Soil Methanogenesis-Substrate | acetate, formate, dihydrogen | PMID:39085194 |
| Coastal Forested Wetland Seawater-Ion | sulfate, seawater ions (NaCl), methane | PMID:38628812 |
| Wetland Oxygen-Sulfate GHG | sulfate, dioxygen, lactate, hydrogen sulfide | PMID:38961111 |

All entries use CHEBI terms and snippets taken verbatim from already-cached PubMed abstracts. No mediaingredientmech_id values (none have been minted in MIM yet — same dogfood pattern as SPRUCE).

2. Metals extractor bug fix + 65-file cleanup

src/communitymech/metal_extraction.py used plain substring matching against 2-letter element symbols ("ti" for TITANIUM, "au" for GOLD). This matched inside unrelated words (characteristic, kinetic, australia, auto…), salting metals_present with TITANIUM in 56/67 metal-annotated YAMLs and GOLD in several more.

  • Fixed: switched to non-alphanumeric-boundary regex matching (case-insensitive). 5 new tests pin the behavior in tests/test_metal_extraction.py.
  • New scripts/clean_metals_inplace.py re-runs extraction and rewrites only the metals_present / rare_earth_elements_present / metal_relevance / metal_notes blocks via line-based regex replacement — preserving comments, key order, and unrelated whitespace (unlike backfill_metals.py's yaml.dump path).
  • Applied across the corpus: 65 community YAMLs corrected (most lose TITANIUM; a few also lose GOLD; one gains IRON that the old buggy path missed). Diff is uniformly subtractive in the affected lines except for one new tier-1 detection.
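The boundary rule described in the first bullet can be sketched roughly as follows. This is an illustrative reconstruction, not the code in metal_extraction.py: the function name `keyword_in_text` and the exact lookaround pattern are assumptions, chosen to show why a two-letter symbol stops matching inside longer words.

```python
import re


def keyword_in_text(keyword: str, text: str) -> bool:
    """Return True if `keyword` occurs in `text` as a standalone token.

    Hypothetical helper: a match counts only when the characters on
    both sides of the keyword are non-alphanumeric (or the string edge).
    """
    pattern = (
        r"(?<![A-Za-z0-9])"    # no letter/digit immediately before
        + re.escape(keyword)   # the keyword itself, taken literally
        + r"(?![A-Za-z0-9])"   # no letter/digit immediately after
    )
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


# Plain substring matching is what produced the false positives.
old_hit = "ti" in "characteristic kinetics".lower()          # True (bug)
new_hit = keyword_in_text("ti", "characteristic kinetics")   # False
real_hit = keyword_in_text("ti", "Ti-doped oxide surface")   # True
```

Explicit lookarounds are used here instead of `\b` because `\b` does not fire after a keyword that ends in a non-word character (for example a charged form such as `Fe2+` followed by a space), whereas the lookarounds only inspect the neighboring characters.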

3. Lint cleanup (just lint: ruff/black green)

178 pre-existing ruff errors → 0:

  • Removed T20 (print) from the ruff select. Rationale: src/communitymech/ ships CLI entry points (cli.py, render_*, export/*, embedding/*, validators) that legitimately use print for user output; per-call # noqa: T201 is louder than the rule is worth.
  • Fixed the 44 remaining errors inline: F401 unused rich imports, B904 raise … from e, SIM102/SIM103/SIM108 simplifications, C414 redundant list() inside sorted(), B905 zip(strict=False), E501 line splits (auditor, validators, umap_generator, cli), E402 import order in batch_reporter.py.
  • Added a per-file E501 ignore for src/communitymech/llm/prompts.py (long prompt strings should not be re-wrapped).
  • Added targeted # noqa with WHY-comments for S301 (pickle from internal cache only), S701 (jinja2 autoescape would break JSON-in-script), S704 (markupsafe.Markup on curator-supplied Mermaid body), S112 (intentional skip-on-parse-fail in id_utils).

just lint is still not all-green because mypy still reports 256 pre-existing errors (yaml stub missing, implicit Optionals, Console = None reassignment). Out of scope for this PR — flagging as separate tech debt.

4. Cross-repo ID existence validator (#30 Phase 3, local half)

New module communitymech.validators.cross_repo_ids with a two-stage validator:

  1. Pattern check: culturemech_id matches CultureMech:NNNNNN, mediaingredientmech_id matches MediaIngredientMech:NNNNNN. Always runs.
  2. Existence check: when sibling-repo paths are configured (via flag or COMMUNITYMECH_SIBLING_REPOS env var), each ID is looked up in the partner repo's kb/ dir. Opt-in: with no path configured, the validator emits info-level skip notices rather than silently passing.

CLI: scripts/validate_cross_repo_ids.py. Justfile entries: validate-cross-repo-ids FILE, validate-cross-repo-ids-all. 10 new tests in tests/test_cross_repo_ids.py cover patterns, both repos configured, neither configured, malformed IDs, edge cases.
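A minimal sketch of the two-stage flow, under stated assumptions: `check_id` and its return shape are hypothetical, NNNNNN is read as exactly six digits, and "looked up in the partner repo's kb/ dir" is approximated by scanning YAML files for the ID string. The real module may resolve IDs differently.

```python
import re
import tempfile
from pathlib import Path

# Assumed patterns: the prefix followed by exactly six digits.
ID_PATTERNS = {
    "culturemech_id": re.compile(r"CultureMech:\d{6}"),
    "mediaingredientmech_id": re.compile(r"MediaIngredientMech:\d{6}"),
}


def check_id(field, value, sibling_kb=None):
    """Return a list of (level, message) findings for one ID value."""
    # Stage 1: pattern check, always runs.
    if not ID_PATTERNS[field].fullmatch(value):
        return [("error", f"{field}: {value!r} does not match the expected pattern")]
    # Stage 2: existence check, opt-in. Skipping is reported, not silent.
    if sibling_kb is None:
        return [("info", f"{field}: existence check skipped (no sibling repo configured)")]
    # Assumption: an ID "exists" if any YAML file under kb/ mentions it.
    for path in sorted(Path(sibling_kb).rglob("*.yaml")):
        if value in path.read_text(encoding="utf-8"):
            return []
    return [("error", f"{field}: {value} not found under {sibling_kb}")]


# Usage against a throwaway sibling kb/ directory.
with tempfile.TemporaryDirectory() as tmp:
    kb = Path(tmp) / "kb"
    kb.mkdir()
    (kb / "medium.yaml").write_text("id: MediaIngredientMech:000001\n", encoding="utf-8")
    present = check_id("mediaingredientmech_id", "MediaIngredientMech:000001", kb)
    missing = check_id("mediaingredientmech_id", "MediaIngredientMech:000002", kb)

skipped = check_id("culturemech_id", "CultureMech:000123")
malformed = check_id("culturemech_id", "CultureMech:12")
```

Returning an info-level finding when no sibling path is configured keeps a "clean" run distinguishable from a run in which nothing was actually checked.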

Test plan

  • just test — 136 passed, 9 skipped (was 121 passed; +5 in metal_extraction, +10 in cross_repo_ids)
  • just validate-all — all 265 community YAMLs validate against schema
  • just format — clean
  • uv run ruff check src/ tests/ — clean (was 178 errors → 0)
  • uv run black --check src/ tests/ — clean (49 files)
  • just validate-cross-repo-ids-all — all clean (no cross-repo IDs to existence-check yet, by design)

🤖 Generated with Claude Code

…repo validator

Combines four follow-ups against #30 (cross-repo environmental linking)
plus an unrelated lint cleanup, all of which build on each other and
share the same test surface.

1. Wetland backfill (#30 Phase 5)
   Apply the SPRUCE related_ingredients pattern to 6 more peatland and
   wetland communities (Stordalen Mire, Prairie Pothole, MUCC Freshwater
   Wetland, Asgard Wetland Soil, Coastal Forested Wetland, Wetland
   Oxygen-Sulfate GHG). Each entry uses CHEBI terms and evidence
   anchored to already-cached PubMed abstracts; no MediaIngredientMech
   IDs are minted yet.

2. Metals extractor bug fix + 65-file cleanup
   metal_extraction.py used plain substring matching against 2-letter
   element symbols ('ti' for TITANIUM, 'au' for GOLD), which matched
   inside unrelated words ('characteristic', 'australia') and salted
   metals_present with TITANIUM in 56/67 metal-annotated YAMLs and
   GOLD in several more. Switched to non-alphanumeric-boundary regex
   matching (case-insensitive), with tests pinning the behavior.
   scripts/clean_metals_inplace.py re-runs extraction and rewrites only
   the metals_present / rare_earth_elements_present / metal_relevance /
   metal_notes blocks via line-based replacement, preserving comments
   and unrelated formatting (unlike backfill_metals.py's yaml.dump
   path). Applied once across the corpus: 65 community YAMLs corrected.

3. Lint cleanup (just lint ruff/black)
   178 pre-existing ruff errors -> 0. Removed T20 (print) from the
   ruff selection with rationale: src/communitymech/ ships CLI entry
   points that legitimately use print. The remaining 44 non-print
   errors were fixed inline (unused imports, raise-from chains,
   collapsible ifs, redundant list() calls, zip strict, line splits,
   import order in batch_reporter.py) or suppressed with a per-file
   E501 ignore for llm/prompts.py (long prompt strings) and targeted
   `# noqa` lines with comments for S301/S701/S704/S112 cases that
   are intentional within their internal-only contexts. mypy still
   reports 256 pre-existing errors and is out of scope here.

4. Cross-repo ID validator (#30 Phase 3, local half)
   New module communitymech.validators.cross_repo_ids with a
   pattern + existence checker, plus a CLI
   (scripts/validate_cross_repo_ids.py) and justfile entries
   (validate-cross-repo-ids, validate-cross-repo-ids-all). Sibling
   repo paths are opt-in via env or flags; when omitted, the
   validator emits info-level skip notices rather than silently
   passing. 10 new tests cover pattern, existence, and edge cases.

Test plan: just test (136 passed, 9 skipped), just validate-all (all
265 communities clean), ruff/black green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 24, 2026 04:35

Copilot AI left a comment

Pull request overview

Bundles four follow-ups to #30: adds related_ingredients to 6 wetland/peatland community YAMLs (SPRUCE pattern); fixes the metal/REE keyword extractor's false-positive substring matching by anchoring on non-alphanumeric boundaries and re-cleans 65 community YAMLs via a new in-place script; introduces a cross-repo ID validator module, CLI, justfile targets, and tests; and performs a broad ruff/black lint cleanup including dropping the T20 rule and adding targeted # noqa justifications.

Changes:

  • Wetland related_ingredients backfill across 6 community YAMLs using cached PMID-anchored evidence.
  • Metal/REE extractor bug fix (_keyword_in_text boundary regex) plus mass YAML cleanup via scripts/clean_metals_inplace.py.
  • New communitymech.validators.cross_repo_ids module, scripts/validate_cross_repo_ids.py, justfile entries, docs, and 10 tests; plus repo-wide ruff/black lint cleanup.

Reviewed changes

Copilot reviewed 92 out of 92 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| src/communitymech/metal_extraction.py | Adds _keyword_in_text boundary matcher and routes all keyword checks through it. |
| scripts/clean_metals_inplace.py | New script that re-extracts metals and rewrites only the metals_present/rare_earth_elements_present/metal_relevance/metal_notes blocks. |
| src/communitymech/validators/cross_repo_ids.py | New pattern + opt-in existence validator for culturemech_id / mediaingredientmech_id. |
| scripts/validate_cross_repo_ids.py | CLI wrapping the cross-repo validator, with env-var sibling-repo configuration. |
| tests/test_metal_extraction.py, tests/test_cross_repo_ids.py | New unit tests pinning the keyword-boundary fix and the cross-repo validator behavior. |
| kb/communities/*Wetland*.yaml, kb/communities/SPRUCE*… etc. (6 files) | Adds related_ingredients blocks with CHEBI terms and verbatim PMID-snippet evidence. |
| 65 × kb/communities/*.yaml | Auto-rewritten metals_present/rare_earth_elements_present/metal_relevance/metal_notes per the new extractor. |
| src/communitymech/network/auditor.py, network/validators.py, network/batch_reporter.py, network/llm_repair.py, llm/anthropic_client.py, literature.py, cli.py, visualization/umap_generator.py, render_community_pages.py, embedding/loader.py, utils/id_utils.py, uniprot_reference_proteomes.py | Lint cleanup: line splits, raise … from, SIM simplifications, zip(..., strict=False), targeted # noqa with rationale comments. |
| pyproject.toml, justfile, docs/cross_repo_linking.md | Ruff select drops T20, adds per-file E501 ignore for prompts; justfile gains cross-repo-id targets; docs document the validator. |


Comment thread scripts/clean_metals_inplace.py Outdated
Comment thread scripts/clean_metals_inplace.py Outdated
Copilot flagged two serious bugs in scripts/clean_metals_inplace.py
from the prior commit:

1. _replace_scalar rewrote metal_notes by substituting only the first
   physical line of the YAML key. When the existing value spanned
   multiple lines (as PyYAML's folded scalars often do), the indented
   continuation lines were left orphaned and silently re-folded by
   the parser into the new value — producing strings like
   "...(context-validated) measurements; ...(context-validated)" and,
   on Ngawha, merging curator prose about mercury cycling into the
   auto-generated note.
2. The script unconditionally overwrote metal_notes and metal_relevance
   and removed any metals_present entries the (newly fixed) extractor
   wouldn't infer. That clobbered curator-authored values (Ngawha's
   MERCURY + curator note, Oak Ridge's NICKEL/COBALT/ZINC, Bayan Obo
   notes, etc.) — entries the extractor cannot derive but that are
   curator decisions to keep.
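
The first failure mode can be reproduced without a YAML parser. The sketch below is an assumption-laden illustration, not the original `_replace_scalar`: both function names are hypothetical, and the key is assumed to sit at column 0 with indented continuation lines.

```python
import re


def replace_first_line_only(text, key, new_value):
    """Buggy approach: rewrites only the physical line that holds the key."""
    return re.sub(
        rf"^{re.escape(key)}:.*$",
        lambda m: f"{key}: {new_value}",
        text,
        count=1,
        flags=re.MULTILINE,
    )


def replace_whole_scalar(text, key, new_value):
    """Also consumes the indented continuation lines that follow the key."""
    return re.sub(
        rf"^{re.escape(key)}:.*(?:\n[ \t]+\S.*)*",
        lambda m: f"{key}: {new_value}",
        text,
        count=1,
        flags=re.MULTILINE,
    )


doc = (
    "name: Example\n"
    "metal_notes: old note first line\n"
    "  continued old note\n"
    "other: 1\n"
)
buggy = replace_first_line_only(doc, "metal_notes", "NEW")
fixed = replace_whole_scalar(doc, "metal_notes", "NEW")
# In `buggy`, "  continued old note" survives under the new value, and a
# YAML parser folds it into the scalar as a plain multi-line continuation.
```

The rewritten script avoids the problem altogether by not touching metal_notes; the block-aware variant is shown only to make the orphaned-line mechanism concrete.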

Reverted all 65 YAMLs the prior commit touched, then rewrote the script
to be surgical:

- Touches only metals_present. Never reads or writes metal_relevance
  or metal_notes, which sidesteps the multi-line scalar bug entirely
  and preserves curator metadata.
- Removes only entries whose extractor keyword list contains a known
  ambiguous short symbol (`ti`/`au`/`pd`) AND whose unambiguous tokens
  (full element name, charged ionic forms) do not appear anywhere
  else in the file as word-bounded tokens. Anything else is kept,
  including curator-added entries the extractor couldn't have inferred.
- Never adds metals. Surprising additions (e.g., Trichodesmium IRON
  via newly-correct CHEBI tier-1 matching) are out of scope; running
  `scripts/backfill_metals.py --dry-run` surfaces them for separate
  curator review.
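
The removal rule in the second bullet can be sketched as a single predicate. Everything here is hypothetical naming (`AMBIGUOUS_SYMBOLS`, `has_token`, `should_remove`), and the caller is assumed to pass the file text with the metals_present block itself excluded, since that block would otherwise always contain the element name.

```python
import re

# The short symbols named in the commit message as known-ambiguous.
AMBIGUOUS_SYMBOLS = {"ti", "au", "pd"}


def has_token(token, text):
    """True if `token` appears in `text` bounded by non-alphanumerics."""
    pattern = r"(?<![A-Za-z0-9])" + re.escape(token) + r"(?![A-Za-z0-9])"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def should_remove(keywords, text_outside_block):
    """Decide whether one metals_present entry is a false positive.

    `keywords` is the extractor's keyword list for that metal.
    """
    # Condition 1: the keyword list must contain an ambiguous short symbol.
    if not any(k.lower() in AMBIGUOUS_SYMBOLS for k in keywords):
        return False  # e.g. curator-added MERCURY is never touched
    # Condition 2: no unambiguous token may appear anywhere else in the file.
    unambiguous = [k for k in keywords if k.lower() not in AMBIGUOUS_SYMBOLS]
    return not any(has_token(k, text_outside_block) for k in unambiguous)


drop_ti = should_remove(["titanium", "ti"], "characteristic growth kinetics")
keep_ti = should_remove(["titanium", "ti"], "cells grown on titanium dioxide")
keep_hg = should_remove(["mercury", "hg"], "no metal terms in this text")
```

Because the predicate can only return True for entries tied to an ambiguous symbol, anything the extractor could not have produced through that path is kept by construction.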

Result: 56 files (down from 65), each diff is a 1-2 line removal of
TITANIUM and/or GOLD. Ngawha MERCURY, Oak Ridge metals, all curator
metal_notes preserved verbatim. 136 tests pass, all 265 communities
validate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@realmarcin realmarcin merged commit 7fed895 into main May 24, 2026
@realmarcin realmarcin deleted the wetland-backfill-metals-ruff-validator branch May 24, 2026 04:54
realmarcin added a commit that referenced this pull request May 24, 2026
* #30 backfill batch 2: 4 metals + 3 gut/rhizosphere communities

Continues the SPRUCE/wetland dogfood pattern from PRs #79/#80/#81. Each
entry uses CHEBI terms with snippets taken verbatim from cached
PMID/DOI abstracts; no cross-repo IDs (MIM IDs haven't been minted).

AMD/biomining/REE (4 of 16 remaining):

| Community | Ingredients | Source |
|---|---|---|
| Cyprus_Copper_Sulphide_Bioleaching_Consortium | chalcopyrite (Cu(II) surrogate), chalcocite (Cu(I) sulfide), iron(2+) | PMID:41381092 |
| Ferroplasma_Leptospirillum_Syntrophy | iron(2+), pyrite | PMID:16104851 |
| Iberian_Pit_Lake_Stratified_Community | sulfate, iron(2+) | PMID:23840525 |
| Ewaste_Bioleaching_Consortium | glycine (10 g/L cyanide substrate), hydrogen cyanide (gold lixiviant) | PMID:26704063 |

Gut/rhizosphere (3 of ~13 remaining):

| Community | Ingredients | Source |
|---|---|---|
| Bacteroides_Eubacterium_Gnotobiotic_Gut_Model | acetate, butyrate, host-derived mucin glycans | PMID:19321416 |
| Brachypodium_Young_Root_Rhizosphere_EcoFAB_Community | root exudates, labile root carbon | PMID:37280433 |
| ORNL_PMI_Populus_PD10_SynCom | glucose (minimal-medium axis) | PMID:33995895 |

#30 related_ingredients adoption: 12/265 -> 19/265.

Test plan: just test (136 passed), all 7 modified files validate
clean against the schema.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address Copilot review on #83

Five findings, all valid:

1. Cyprus chalcopyrite was mapped to CHEBI:30074 / "copper(2+)", which
   is wrong on both axes. Updated to CHEBI:50885 / "chalcopyrite" —
   the mapping the repo already uses (Copper_Biomining_Heap_Leach
   metabolites).

2. Ewaste cyanide entry's `chebi_term.label` said "hydrogen cyanide"
   but CHEBI:17514's canonical label is "cyanide". Aligned label.

3. Ewaste cyanide entry's snippet ("This gold complexing agent was
   used…") did not literally mention cyanide. Replaced with the more
   direct adjacent abstract sentence ("cyanide-producing heterotrophic
   Pseudomonas fluorescens and Pseudomonas putida were used") and
   moved the gold-complexing context into the explanation field.

4. Iberian Pit Lake relevance text described an Fe(II)/Fe(III) cycle
   across the chemocline but only iron(2+) was listed. Added a
   separate iron(3+) related_ingredient with its own snippet
   anchoring the bottom-layer iron-reducing guild
   (Acidiphilium / Ferroplasma / Acidithiobacillus ferrooxidans in
   reducing mode); split the original Fe(II) relevance text to
   reference only the oxidising guild.

5. Ewaste "gold-mobilisation" -> "gold-mobilization" for spelling
   consistency with the rest of the repo (American spelling).

136 tests still pass; all 3 modified YAMLs validate clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>