Backfill source_environment from CommunityMech ENVO data (closes #2) #28
Merged
The SourceEnvironmentDescriptor schema and source_environment slot landed in dbe26e8, but no recipes populated them. Issue #2's core goal of cross-repo environmental linking only matters once recipes actually carry ENVO terms.

scripts/backfill_source_environment.py walks the sibling ../CommunityMech repo (skipping tests/ and examples/) for community YAMLs that carry both `environment_term` (ENVO-grounded) and a `culturemech_id` back-reference. Each such pair is a curator-vetted assertion that the linked recipe targets that environment, so we populate it directly on the recipe. Dedup is by (recipe_id, envo_id); when multiple communities point to the same recipe with the same ENVO id, we take the first observation's preferred_term/notes.

This pass touched 17 recipes and added 22 SourceEnvironmentDescriptor entries. Some recipes are used by multiple communities targeting different environments: CultureMech:000423 alone gets 4 entries spanning anaerobic-digester / soil / acid-mine-drainage / sulfate-free anaerobic. Each touched recipe carries a CurationEvent for provenance. The script is idempotent: re-running adds nothing new because the dedup-by-ENVO-id guard catches existing entries.

Inferring source_environment for the rest of the corpus (15,810 recipes) from recipe metadata is a separate enrichment problem (NLP / keyword mining over names + applications + descriptions) and out of scope here.

just validate-strict: 0 ERROR rows / 15,827 records (unchanged).

Closes #2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
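The dedup and idempotency behavior described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/backfill_source_environment.py: the function names and the flat dict keys (`recipe_id`, `envo_id`, `preferred_term`) are hypothetical stand-ins for whatever structure the script actually uses.

```python
def dedupe_observations(observations):
    """Keep the first observation seen for each (recipe_id, envo_id) pair.

    `observations` is an iterable of dicts with hypothetical keys
    "recipe_id" and "envo_id"; later duplicates are dropped.
    """
    first_seen = {}
    for obs in observations:
        key = (obs["recipe_id"], obs["envo_id"])
        # setdefault only stores the value when the key is new,
        # so the first observation wins.
        first_seen.setdefault(key, obs)
    return list(first_seen.values())


def merge_new_entries(existing, candidates):
    """Append candidates whose ENVO id is not already on the recipe.

    Mutates `existing` in place and returns the number of entries added,
    so a second run over the same candidates returns 0.
    """
    present = {entry["envo_id"] for entry in existing}
    added = 0
    for cand in candidates:
        if cand["envo_id"] not in present:
            existing.append(cand)
            present.add(cand["envo_id"])
            added += 1
    return added
```

Keying the guard on the ENVO id rather than on the whole entry is what makes a re-run a no-op even if a curator later edits the free-text fields of an existing entry.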
Pull request overview
This PR backfills the recently-added source_environment field on CultureMech media recipes using ENVO-grounded environment_term assertions curated in the sibling CommunityMech repository, and records provenance via curation_history.
Changes:
- Added `scripts/backfill_source_environment.py` to scan CommunityMech YAMLs and append `source_environment` descriptors to linked CultureMech recipes.
- Updated 17 normalized recipe YAMLs to include `source_environment` entries plus a `BACKFILLED_SOURCE_ENVIRONMENT` curation event.
- Introduced some YAML reflow/formatting changes (line wrapping) as a consequence of re-dumping touched files.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| scripts/backfill_source_environment.py | New backfill script to harvest CommunityMech environment_term + culturemech_id links and write source_environment + provenance. |
| data/normalized_yaml/specialized/Nitrogen_free_plant_nutrient_solution_for_soybean_growth.yaml | Adds source_environment (rhizosphere) and curation event. |
| data/normalized_yaml/specialized/nitrogen_free_b_d_medium_for_lotus_japonicus_growth.yaml | Adds source_environment (rhizosphere) and curation event. |
| data/normalized_yaml/specialized/Modified_Freshwater_Medium_for_DIET_Coculture.yaml | Adds source_environment (sediment) and curation event. |
| data/normalized_yaml/specialized/Modified_DSM_120_Medium_for_DIET_Coculture.yaml | Adds source_environment (sediment) and curation event. |
| data/normalized_yaml/specialized/Half_strength_Murashige_Skoog_medium_for_Arabidopsis_growth.yaml | Adds source_environment (rhizosphere) and curation event. |
| data/normalized_yaml/specialized/Glycerol_Fermentation_Medium_for_DIET_Coculture.yaml | Adds source_environment (laboratory culture) and curation event. |
| data/normalized_yaml/bacterial/tryptic_soy_broth.yaml | Adds source_environment (rhizosphere) + reflows some YAML text. |
| data/normalized_yaml/bacterial/tryptic_soy_agar.yaml | Adds source_environment (rhizosphere) + reflows some YAML text. |
| data/normalized_yaml/bacterial/r2a_agar.yaml | Adds source_environment (rhizosphere) and curation event. |
| data/normalized_yaml/bacterial/PCS_FP_medium_for_thermophilic_cellulose_degradation.yaml | Adds source_environment (compost) and curation event. |
| data/normalized_yaml/bacterial/Nitrogen_Free_Medium_for_Leptospirillum_ferrodiazotrophum.yaml | Adds source_environment (acid mine drainage) and curation event. |
| data/normalized_yaml/bacterial/mineral_medium.yaml | Adds 4 source_environment entries and curation event. |
| data/normalized_yaml/bacterial/mannitol_agar.yaml | Adds source_environment (rhizosphere) and curation event. |
| data/normalized_yaml/bacterial/luria_bertani_lb_medium.yaml | Adds source_environment (rhizosphere) + reflows some YAML text. |
| data/normalized_yaml/bacterial/9k_medium.yaml | Adds 2 source_environment entries and curation event. |
| data/normalized_yaml/algae/f_2.yaml | Adds source_environment (marine environment) and curation event. |
| data/normalized_yaml/algae/CCAP_TAP Medium.yaml | Adds 2 source_environment entries and curation event. |
Five Copilot findings, all fixed at script + data level:

1. **scripts/backfill_source_environment.py:63**: the `path.read_text()` call site only caught yaml.YAMLError; it now also catches OSError (permissions / disk) and UnicodeDecodeError (non-UTF-8 content) so a multi-thousand-file scan can't crash mid-run.
2. **Inconsistent term.label across recipes for the same ENVO CURIE** (ENVO:01001405 was labeled "laboratory bioreactor" / "laboratory culture" / "laboratory environment" across the three communities that referenced it). The Term.label slot is intended to be the *canonical* ontology label; propagating the curator-supplied label from CommunityMech meant downstream queries would treat the same ENVO id as three different terms. Fix: stop emitting term.label at all. Downstream consumers should resolve the canonical label from ENVO directly. (Resolves both inline comments on mineral_medium.yaml and CCAP_TAP Medium.yaml.)
3. **preferred_term fell back to ""** when CommunityMech omitted the field. It now falls back to ENVO label → ENVO id → skip (never an empty string).
4. **merge_into_recipe() assumed a list**. LinkML dataclasses accept a single dict for multivalued slots; the function now normalizes None to [] and a single dict to [dict] before processing.

Re-running the script on the 17 already-backfilled records strips the 22 stale labels in place (idempotent: no new env entries are added, since all (recipe, ENVO id) pairs were already present). validate-strict remains at 0 ERROR rows / 15,827 records.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
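The defensive fixes listed in this commit message might look roughly like the sketch below. It is an illustration under stated assumptions, not the merged code: the helper names (`read_text_or_none`, `as_entry_list`, `choose_preferred_term`) are hypothetical, and the YAML parsing step (and its yaml.YAMLError handling) is left out so the example runs on the standard library alone.

```python
from pathlib import Path


def read_text_or_none(path):
    """Return the file's text, or None if it cannot be read as UTF-8.

    Catching OSError and UnicodeDecodeError keeps one bad file from
    aborting a scan over many files.
    """
    try:
        return Path(path).read_text(encoding="utf-8")
    except (OSError, UnicodeDecodeError):
        return None


def as_entry_list(value):
    """Normalize a multivalued slot: None -> [], single dict -> [dict]."""
    if value is None:
        return []
    if isinstance(value, dict):
        return [value]
    return list(value)


def choose_preferred_term(curated_term, envo_label, envo_id):
    """Fallback chain: curated term, then ENVO label, then ENVO id.

    Returns None when all three are missing or blank, which the caller
    can treat as "skip this entry" instead of writing an empty string.
    """
    for candidate in (curated_term, envo_label, envo_id):
        if isinstance(candidate, str) and candidate.strip():
            return candidate.strip()
    return None
```

For example, `choose_preferred_term("", None, "ENVO:01001405")` falls through to the ENVO id, and `as_entry_list({"a": 1})` wraps the single dict in a list so the merge loop can treat every recipe uniformly.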
Summary
Issue #2 added the `SourceEnvironmentDescriptor` schema + `source_environment` slot (commit `dbe26e8ce`) but no recipes populated it. This PR ships `scripts/backfill_source_environment.py` and applies it: 17 recipes get 22 ENVO-grounded environment entries pulled directly from CommunityMech's curated `environment_term` data.

Method
The script walks `../CommunityMech` (excluding `tests/` and `examples/`) for community YAMLs that carry both `environment_term` (ENVO-grounded) and a `culturemech_id` back-reference. Each such pair is a curator-vetted assertion that the linked recipe targets that environment, so it's safe to populate `source_environment` directly. Dedup is by `(recipe_id, envo_id)`; some recipes are touched by multiple communities at different environments (e.g., `CultureMech:000423` gets 4 entries: anaerobic digester / soil / acid mine drainage / sulfate-free anaerobic).

Each touched recipe carries a `CurationEvent` via `record_curation_event()` for provenance. The script is idempotent: re-runs add nothing because the dedup guard catches existing entries.

Out of scope
Inferring `source_environment` for the remaining 15,810 recipes from recipe metadata (names, applications, descriptions) is an NLP/keyword-mining enrichment problem, left as separate future work.

Validation
`just validate-strict`: 0 ERROR rows across 15,827 records (unchanged).

Test plan
- `just validate-strict` clean
- `source_environment:` list

Closes #2.
🤖 Generated with Claude Code