Backfill source_environment from CommunityMech ENVO data (closes #2) #28

Merged
realmarcin merged 2 commits into main from backfill-source-environment on May 24, 2026

Conversation

@realmarcin
Contributor

Summary

Issue #2 added the SourceEnvironmentDescriptor schema + source_environment slot (commit dbe26e8ce) but no recipes populated it. This PR ships scripts/backfill_source_environment.py and applies it: 17 recipes get 22 ENVO-grounded environment entries pulled directly from CommunityMech's curated environment_term data.

Method

The script walks ../CommunityMech (excluding tests/ and examples/) for community YAMLs that carry both environment_term (ENVO-grounded) and a culturemech_id back-reference. Each such pair is a curator-vetted assertion that the linked recipe targets that environment, so it's safe to populate source_environment directly. Dedup is by (recipe_id, envo_id); some recipes are touched by multiple communities at different environments (e.g., CultureMech:000423 gets 4 entries: anaerobic digester / soil / acid mine drainage / sulfate-free anaerobic).
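The harvesting and dedup step described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: it assumes each community record has already been parsed into a flat dict with a top-level `culturemech_id` and an `environment_term` of the shape `{"preferred_term": ..., "notes": ..., "term": {"id": "ENVO:..."}}`. The function name `collect_environment_links` and that record layout are hypothetical.

```python
from typing import Iterable


def collect_environment_links(communities: Iterable[dict]) -> dict:
    """Map (recipe_id, envo_id) -> descriptor, keeping the first observation.

    Assumed (hypothetical) record layout: a top-level `culturemech_id`
    and an `environment_term` dict holding `preferred_term`, `notes`,
    and a nested `term` with an ENVO `id`.
    """
    links = {}
    for community in communities:
        recipe_id = community.get("culturemech_id")
        env = community.get("environment_term") or {}
        envo_id = (env.get("term") or {}).get("id")
        if not recipe_id or not envo_id:
            # Only records carrying BOTH the ENVO term and the
            # back-reference count as a curator-vetted link.
            continue
        # setdefault keeps the first observation for a given pair.
        links.setdefault(
            (recipe_id, envo_id),
            {
                "preferred_term": env.get("preferred_term"),
                "notes": env.get("notes"),
                "term": {"id": envo_id},
            },
        )
    return links


if __name__ == "__main__":
    # Illustrative records only; the recipe ID is made up.
    demo = [
        {"culturemech_id": "CultureMech:EXAMPLE",
         "environment_term": {"preferred_term": "laboratory bioreactor",
                              "term": {"id": "ENVO:01001405"}}},
        {"culturemech_id": "CultureMech:EXAMPLE",
         "environment_term": {"preferred_term": "laboratory culture",
                              "term": {"id": "ENVO:01001405"}}},
    ]
    for key, descriptor in collect_environment_links(demo).items():
        print(key, descriptor["preferred_term"])
```

Keying on the pair rather than on the recipe alone is what lets one recipe collect several environments while still collapsing repeated assertions of the same ENVO id.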

Each touched recipe carries a CurationEvent via record_curation_event() for provenance. Script is idempotent — re-runs add nothing because the dedup guard catches existing entries.
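The idempotency claim rests on a dedup guard of roughly the following shape. This is a sketch under stated assumptions rather than the script's implementation: the recipe is a plain dict, `source_environment` (if present) is already a list, and provenance is a simple dict appended to `curation_history` as a stand-in for whatever `record_curation_event()` actually writes. The name `merge_descriptors` is hypothetical.

```python
def merge_descriptors(recipe: dict, descriptors: list) -> int:
    """Append descriptors whose ENVO id is not already on the recipe.

    Returns the number of entries added, so a second run over the same
    input returns 0 and leaves the recipe untouched.
    """
    existing = recipe.setdefault("source_environment", [])
    seen = {(entry.get("term") or {}).get("id") for entry in existing}
    added = 0
    for descriptor in descriptors:
        envo_id = descriptor["term"]["id"]
        if envo_id in seen:
            continue  # dedup guard: this ENVO id is already recorded
        existing.append(dict(descriptor))  # shallow copy avoids aliasing
        seen.add(envo_id)
        added += 1
    if added:
        # Stand-in for the real provenance call; only written when the
        # recipe actually changed, so re-runs add no extra events.
        recipe.setdefault("curation_history", []).append(
            {"action": "BACKFILLED_SOURCE_ENVIRONMENT", "added": added}
        )
    return added
```

Returning the count makes the "re-run touches 0 recipes" check in the test plan a direct assertion on the return value.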

Out of scope

Inferring source_environment for the remaining 15,810 recipes from recipe metadata (names, applications, descriptions) is an NLP/keyword-mining enrichment problem — separate future work.

Validation

just validate-strict: 0 ERROR rows across 15,827 records (unchanged).

Test plan

  • Dry-run reports 17 recipes / 22 env entries to add
  • Apply produces matching numbers
  • Re-run after apply touches 0 recipes (idempotent)
  • just validate-strict clean
  • Sample record shows correctly structured source_environment: list

Closes #2.

🤖 Generated with Claude Code

The SourceEnvironmentDescriptor schema and source_environment slot
landed in dbe26e8 but no recipes populated them — issue #2's core
goal of cross-repo environmental linking only matters once recipes
actually carry ENVO terms.

scripts/backfill_source_environment.py walks the sibling
../CommunityMech repo (skipping tests/ and examples/) for community
YAMLs that carry both `environment_term` (ENVO-grounded) and a
`culturemech_id` back-reference. Each such pair is a curator-vetted
assertion that the linked recipe targets that environment, so we
populate it directly on the recipe. Dedup is by (recipe_id, envo_id);
when multiple communities point to the same recipe with the same ENVO
id we take the first observation's preferred_term/notes.

This pass touched 17 recipes and added 22 SourceEnvironmentDescriptor
entries (some recipes are used by multiple communities targeting
different environments — CultureMech:000423 alone gets 4 entries
spanning anaerobic-digester / soil / acid-mine-drainage / sulfate-free
anaerobic). Each touched recipe carries a CurationEvent for provenance.

The script is idempotent — re-running adds nothing new because the
dedup-by-ENVO-id guard catches existing entries.

Inferring source_environment for the rest of the corpus (15,810
recipes) from recipe metadata is a separate enrichment problem (NLP /
keyword mining over names + applications + descriptions) and out of
scope here.

just validate-strict: 0 ERROR rows / 15,827 records (unchanged).

Closes #2.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 24, 2026 04:39

Copilot AI left a comment

Pull request overview

This PR backfills the recently-added source_environment field on CultureMech media recipes using ENVO-grounded environment_term assertions curated in the sibling CommunityMech repository, and records provenance via curation_history.

Changes:

  • Added scripts/backfill_source_environment.py to scan CommunityMech YAMLs and append source_environment descriptors to linked CultureMech recipes.
  • Updated 17 normalized recipe YAMLs to include source_environment entries plus a BACKFILLED_SOURCE_ENVIRONMENT curation event.
  • Introduced some YAML reflow/formatting changes (line wrapping) as a consequence of re-dumping touched files.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 5 comments.

Show a summary per file
| File | Description |
| --- | --- |
| `scripts/backfill_source_environment.py` | New backfill script to harvest CommunityMech environment_term + culturemech_id links and write source_environment + provenance. |
| `data/normalized_yaml/specialized/Nitrogen_free_plant_nutrient_solution_for_soybean_growth.yaml` | Adds source_environment (rhizosphere) and curation event. |
| `data/normalized_yaml/specialized/nitrogen_free_b_d_medium_for_lotus_japonicus_growth.yaml` | Adds source_environment (rhizosphere) and curation event. |
| `data/normalized_yaml/specialized/Modified_Freshwater_Medium_for_DIET_Coculture.yaml` | Adds source_environment (sediment) and curation event. |
| `data/normalized_yaml/specialized/Modified_DSM_120_Medium_for_DIET_Coculture.yaml` | Adds source_environment (sediment) and curation event. |
| `data/normalized_yaml/specialized/Half_strength_Murashige_Skoog_medium_for_Arabidopsis_growth.yaml` | Adds source_environment (rhizosphere) and curation event. |
| `data/normalized_yaml/specialized/Glycerol_Fermentation_Medium_for_DIET_Coculture.yaml` | Adds source_environment (laboratory culture) and curation event. |
| `data/normalized_yaml/bacterial/tryptic_soy_broth.yaml` | Adds source_environment (rhizosphere) + reflows some YAML text. |
| `data/normalized_yaml/bacterial/tryptic_soy_agar.yaml` | Adds source_environment (rhizosphere) + reflows some YAML text. |
| `data/normalized_yaml/bacterial/r2a_agar.yaml` | Adds source_environment (rhizosphere) and curation event. |
| `data/normalized_yaml/bacterial/PCS_FP_medium_for_thermophilic_cellulose_degradation.yaml` | Adds source_environment (compost) and curation event. |
| `data/normalized_yaml/bacterial/Nitrogen_Free_Medium_for_Leptospirillum_ferrodiazotrophum.yaml` | Adds source_environment (acid mine drainage) and curation event. |
| `data/normalized_yaml/bacterial/mineral_medium.yaml` | Adds 4 source_environment entries and curation event. |
| `data/normalized_yaml/bacterial/mannitol_agar.yaml` | Adds source_environment (rhizosphere) and curation event. |
| `data/normalized_yaml/bacterial/luria_bertani_lb_medium.yaml` | Adds source_environment (rhizosphere) + reflows some YAML text. |
| `data/normalized_yaml/bacterial/9k_medium.yaml` | Adds 2 source_environment entries and curation event. |
| `data/normalized_yaml/algae/f_2.yaml` | Adds source_environment (marine environment) and curation event. |
| `data/normalized_yaml/algae/CCAP_TAP Medium.yaml` | Adds 2 source_environment entries and curation event. |

Comment thread scripts/backfill_source_environment.py
Comment thread data/normalized_yaml/bacterial/mineral_medium.yaml Outdated
Comment thread data/normalized_yaml/algae/CCAP_TAP Medium.yaml Outdated
Comment thread scripts/backfill_source_environment.py Outdated
Comment thread scripts/backfill_source_environment.py Outdated
Five Copilot findings, all fixed at script + data level (four fixes
below; item 2 resolves two of the inline comments):

1. **scripts/backfill_source_environment.py:63** — `path.read_text()`
   only caught yaml.YAMLError; now also catches OSError (permissions /
   disk) and UnicodeDecodeError (non-UTF-8 content) so a multi-thousand
   file scan can't crash mid-run.

2. **Inconsistent term.label across recipes for the same ENVO CURIE**
   (ENVO:01001405 was labeled "laboratory bioreactor" / "laboratory
   culture" / "laboratory environment" across the three communities
   that referenced it). The Term.label slot is intended to be the
   *canonical* ontology label — propagating the curator-supplied
   label from CommunityMech meant downstream queries would treat the
   same ENVO id as three different terms. Fix: stop emitting term.label
   at all. Downstream consumers should resolve the canonical label from
   ENVO directly. (Resolves both inline comments on mineral_medium.yaml
   and CCAP_TAP Medium.yaml.)

3. **preferred_term fell back to ""** when CommunityMech omitted the
   field. Now falls back to ENVO label → ENVO id → skip (never empty
   string).

4. **merge_into_recipe() assumed a list**. LinkML dataclasses accept a
   single dict for multivalued slots; the function now normalizes
   None → [] and dict → [dict] (an existing list is kept as-is) before
   processing.
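Fixes 3 and 4 can be illustrated with two small helpers. This is a sketch under assumptions, not the merged code: the helper names `as_list` and `choose_preferred_term` are hypothetical, and the source record is assumed to look like `{"preferred_term": ..., "term": {"id": ..., "label": ...}}`.

```python
from typing import Optional


def as_list(value) -> list:
    """Normalize a multivalued slot: None -> [], dict -> [dict], list kept."""
    if value is None:
        return []
    if isinstance(value, dict):
        return [value]
    return list(value)


def choose_preferred_term(env: dict) -> Optional[str]:
    """Fallback chain: preferred_term, then ENVO label, then ENVO id.

    Returns None when nothing usable is found, so the caller can skip
    the entry instead of writing an empty string.
    """
    term = env.get("term") or {}
    for candidate in (env.get("preferred_term"), term.get("label"), term.get("id")):
        if isinstance(candidate, str) and candidate.strip():
            return candidate.strip()
    return None
```

Returning `None` rather than `""` pushes the skip decision to the call site, which matches the "never empty string" rule stated in fix 3.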

Re-running the script on the 17 already-backfilled records strips the
22 stale labels in place (idempotent: no new env entries added since
all (recipe, ENVO id) pairs were already present). validate-strict
remains at 0 ERROR rows / 15,827 records.
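The in-place label cleanup on re-run might look roughly like the sketch below. It assumes each `source_environment` entry holds a nested `term` dict with `id` and, on stale records, `label`; the function name `strip_term_labels` is hypothetical and not taken from the script.

```python
def strip_term_labels(recipe: dict) -> int:
    """Remove term.label from every source_environment entry in place.

    Returns the number of labels removed; a second call returns 0,
    which is what makes the cleanup safe to re-run.
    """
    removed = 0
    for entry in recipe.get("source_environment") or []:
        term = entry.get("term")
        if isinstance(term, dict) and "label" in term:
            del term["label"]  # the ENVO id stays; only the label goes
            removed += 1
    return removed
```

Only the label key is dropped, so the (recipe, ENVO id) pairs that the dedup guard relies on are unchanged.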

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@realmarcin realmarcin merged commit f7cdfcf into main May 24, 2026
1 check passed
@realmarcin realmarcin deleted the backfill-source-environment branch May 24, 2026 04:50

Development

Successfully merging this pull request may close these issues.

Add source_environment field to media schema for cross-repo environmental linking

2 participants