Add oral caries and dysbiosis community models #32

Merged: realmarcin merged 1 commit into main from add-oral-caries-syncoms on May 5, 2026

@cmungall
Collaborator

Summary

This PR adds oral microbiome exemplars focused on dysbiosis and caries, plus a small schema extension to classify them cleanly.

Included communities:

  • Early Dental Biofilm Five-Species Model
  • Defined Multispecies Enamel Caries Model
  • Streptococcus mutans - Candida albicans ECC Biofilm Model
  • Streptococcus mutans - Selenomonas sputigena ECC Pathobiont Model
  • Streptococcus mutans - Veillonella parvula Adult Severe Caries Model

Schema/datamodel changes:

  • Add ORAL to CommunityCategoryEnum
  • Regenerate the LinkML Python datamodel

Evidence support:

  • Added cached PubMed abstracts for PMIDs 21966490, 23446436, 24566629, 37217495, and 39345197 used by reference validation

Validation

Passed:

  • just validate for all 5 new community YAMLs
  • just validate-terms for all 5 new community YAMLs
  • just validate-references for all 5 new community YAMLs

Tests

just test is not clean on this branch, but the failures appear pre-existing and unrelated to this change set:

  • 61 passed
  • 9 failed in tests/test_llm_client.py
  • failure mode: missing anthropic package / tests attempting to patch anthropic.Anthropic
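
The failure mode above (tests patching `anthropic.Anthropic` when the package is not installed) is commonly handled by skipping those tests when the optional dependency is absent. The following is a minimal sketch of that idea using only the standard library; the helper `has_module` and the test class name are hypothetical and are not taken from `tests/test_llm_client.py`.

```python
import importlib.util
import unittest
from unittest import mock


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found in this environment."""
    return importlib.util.find_spec(name) is not None


# Skip the whole class when the optional dependency is missing, rather than
# letting each test fail while trying to patch anthropic.Anthropic.
@unittest.skipUnless(has_module("anthropic"), "anthropic package not installed")
class TestLLMClientSketch(unittest.TestCase):  # hypothetical test class
    def test_client_can_be_patched(self):
        with mock.patch("anthropic.Anthropic") as fake_client:
            fake_client.return_value.model = "stub"
            import anthropic

            # The patched constructor returns the configured stub object.
            self.assertEqual(anthropic.Anthropic().model, "stub")
```

In a pytest-based suite, `pytest.importorskip("anthropic")` at the top of the test module would have a similar effect.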

Copilot AI review requested due to automatic review settings April 15, 2026 01:56

Copilot AI left a comment

Pull request overview

Adds a new ORAL community category and introduces five curated oral microbiome/dental biofilm community exemplars (caries/dysbiosis), along with cached PubMed abstracts to support reference validation.

Changes:

  • Extend CommunityCategoryEnum with ORAL and regenerate the LinkML Python datamodel.
  • Add 5 new oral community YAML records under kb/communities/.
  • Add cached PubMed abstract text files for the 5 supporting PMIDs under references_cache/.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/communitymech/schema/communitymech.yaml | Adds ORAL to CommunityCategoryEnum in the LinkML schema. |
| src/communitymech/datamodel/communitymech.py | Regenerated datamodel to include the new enum permissible value. |
| kb/communities/Early_Dental_Biofilm_FiveSpecies.yaml | New engineered five-species early dental biofilm model exemplar. |
| kb/communities/Defined_Multispecies_Enamel_Caries_Model.yaml | New defined multispecies enamel caries model exemplar. |
| kb/communities/SMutans_CAlbicans_ECC_Biofilm.yaml | New S. mutans–C. albicans ECC dual-species biofilm exemplar. |
| kb/communities/SMutans_SSputigena_ECC_Pathobiont.yaml | New S. mutans–S. sputigena ECC pathobiont exemplar. |
| kb/communities/SMutans_VParvula_ASC_Biofilm.yaml | New S. mutans–V. parvula adult severe caries exemplar. |
| references_cache/pmid_21966490.txt | Cached abstract used by reference validation for the 5-species model. |
| references_cache/pmid_23446436.txt | Cached abstract used by reference validation for the enamel caries model. |
| references_cache/pmid_24566629.txt | Cached abstract used by reference validation for the S. mutans–C. albicans model. |
| references_cache/pmid_37217495.txt | Cached abstract used by reference validation for the S. mutans–S. sputigena model. |
| references_cache/pmid_39345197.txt | Cached abstract used by reference validation for the S. mutans–V. parvula model. |

for metagenomic analysis. The acidification, aciduricity, oxidative stress
tolerance, and gtf (glucosyltransferase) gene expression of S. mutans cocultured
with V. parvula which was identified as ASC-related dominant bacterium. The
biofilm formation and extracellular exopolysaccharide (EPS) synthesis of

@realmarcin force-pushed the add-oral-caries-syncoms branch from 808700c to 59cad73 on May 5, 2026 07:24
@realmarcin merged commit 711fc37 into main on May 5, 2026 (1 check passed)
@realmarcin deleted the add-oral-caries-syncoms branch on May 5, 2026 07:40
realmarcin added a commit that referenced this pull request May 25, 2026
…rict + write_validated_community + record_curation_event + audit_writers) (#84)

* Port audit machinery from CultureMech: schema extension + validate_strict + write_validated_community + record_curation_event + audit_writers

Brings CommunityMech to parity with the audit-machinery ports recently
landed in CultureMech (source), MediaIngredientMech (#32), and TraitMech
(#76). CommunityMech is the last sibling; the lift is larger than MIM /
TraitMech because the schema did not yet define CurationEvent or
curation_history.

Schema additions (additive, no migration needed):
- New CurationEvent class with timestamp / curator / action / changes /
  llm_assisted attributes, mirroring the shape used by sibling Mech repos
  so cross-repo tooling reads curation events uniformly.
- New curation_history slot on MicrobialCommunity, multivalued + inlined
  + optional. Existing community YAMLs continue to validate without
  modification.
- src/communitymech/datamodel/communitymech.py regenerated (just gen-python).
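
As a rough illustration of the shape described above, the two schema additions could look like the following plain dataclasses. This is only a sketch: the attribute names come from the commit message, but the field types, defaults, and which attributes are required are assumptions, and the real classes are generated by LinkML rather than written by hand.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CurationEvent:
    """One entry in a community's curation history (types are assumed)."""

    timestamp: str                 # e.g. "2026-05-25T12:00:00Z"
    curator: str                   # person or tool that made the change
    action: str                    # e.g. "LINK_GROWTH_MEDIA"
    changes: Optional[str] = None  # free-text description (assumed type)
    llm_assisted: bool = False


@dataclass
class MicrobialCommunity:
    """Reduced stand-in for the root class; only the new slot is shown."""

    name: str
    # Optional and multivalued: records without a history remain valid.
    curation_history: List[CurationEvent] = field(default_factory=list)
```

Because the slot defaults to an empty list, a record created without any history still constructs cleanly, which mirrors the "no migration needed" point above.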

New helpers:
- src/communitymech/validation/write_validated.py — write_validated_community()
  refuses to dump a MicrobialCommunity that fails closed-schema LinkML
  validation; raises ValidationFailedError. Single-root-class schema so
  no target_class routing needed. Default yaml opts match the repo's
  existing emission convention (default_flow_style=False, sort_keys=False,
  allow_unicode=True, width=120, indent=2) so existing files roundtrip
  byte-identically.
- src/communitymech/curate/curation_event.py — record_curation_event() is
  the standard helper for appending a CurationEvent to
  doc['curation_history']. Schema-aligned signature; whole-second + Z
  suffix timestamps; skip_if_recent support for idempotent re-runs.
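
The timestamp and idempotency behavior described for `record_curation_event()` can be sketched as below. The signature, the `now` parameter, and the meaning of `skip_if_recent` (a time window within which an identical curator/action pair is not recorded again) are assumptions made for illustration; the actual helper in `curation_event.py` may differ.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def format_timestamp(moment: datetime) -> str:
    """Render a datetime as UTC, truncated to whole seconds, with a Z suffix."""
    return moment.astimezone(timezone.utc).replace(microsecond=0).strftime(TIMESTAMP_FORMAT)


def record_curation_event(
    doc: dict,
    curator: str,
    action: str,
    changes: str,
    llm_assisted: bool = False,
    skip_if_recent: Optional[timedelta] = None,  # assumed semantics, see lead-in
    now: Optional[datetime] = None,              # injectable clock for testing
) -> bool:
    """Append an event to doc['curation_history']; return False if skipped."""
    now = now or datetime.now(timezone.utc)
    history = doc.setdefault("curation_history", [])

    if skip_if_recent is not None and history:
        last = history[-1]
        last_time = datetime.strptime(last["timestamp"], TIMESTAMP_FORMAT).replace(
            tzinfo=timezone.utc
        )
        same_event = last["action"] == action and last["curator"] == curator
        if same_event and now - last_time <= skip_if_recent:
            return False  # a re-run within the window does not add a duplicate

    history.append(
        {
            "timestamp": format_timestamp(now),
            "curator": curator,
            "action": action,
            "changes": changes,
            "llm_assisted": llm_assisted,
        }
    )
    return True
```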

New scripts:
- scripts/validate_strict.py — strict closed-schema parallel walk of
  kb/communities/ (with backups/ + snapshots/ excluded). Emits
  reports/instance_validation_failures.tsv categorized by error class,
  exits non-zero on ERROR. Strictly stronger than the per-file
  linkml-validate loop in just validate-all (open-mode, swallows
  exit codes).
- scripts/audit_writers.py — inventory of every YAML-writing module under
  scripts/ + src/communitymech/, flags whether each script validates
  before writing and appends a curation_history event.
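
The walk-and-report pattern described for `validate_strict.py` can be sketched roughly as follows. This version is sequential rather than parallel, treats every reported row as an ERROR, and takes the validator as a caller-supplied function; the function names, the TSV column names, and the validator signature are all assumptions, not the script's actual interface.

```python
import csv
import os
from pathlib import Path
from typing import Callable, Iterator, List, Tuple

EXCLUDED_DIRS = {"backups", "snapshots"}

# A validator returns (error_class, message) pairs; empty means the file is valid.
Validator = Callable[[Path], List[Tuple[str, str]]]


def iter_community_files(root: Path) -> Iterator[Path]:
    """Yield YAML files under root, pruning excluded directories in place."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames[:] = sorted(d for d in dirnames if d not in EXCLUDED_DIRS)
        for name in sorted(filenames):
            if name.endswith(".yaml"):
                yield Path(dirpath) / name


def run_strict(root: Path, validate: Validator, report: Path) -> int:
    """Validate every file, write a TSV report, and return a process exit code."""
    rows = []
    for path in iter_community_files(root):
        for error_class, message in validate(path):
            rows.append((str(path), error_class, message))

    report.parent.mkdir(parents=True, exist_ok=True)
    with report.open("w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["file", "error_class", "message"])
        writer.writerows(rows)

    return 1 if rows else 0  # non-zero when any error row was recorded
```

A caller would typically pass the return value to `sys.exit()` so that a `just` recipe or CI job fails when errors are present.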

Writer conversions (5 of ~15):
- scripts/add_community_ids.py (action=ASSIGN_COMMUNITY_ID; also gained a
  --dry-run safeguard it lacked before)
- scripts/apply_pmc_conversions.py (action=CONVERT_PMC_TO_PMID)
- scripts/fix_network_integrity.py (action=FIX_NETWORK_INTEGRITY)
- scripts/link_growth_media.py (action=LINK_GROWTH_MEDIA)
- src/communitymech/network/llm_repair.py (action=LLM_REPAIR_APPLIED,
  llm_assisted=True)

Each one was wrapped in try/except ValidationFailedError on the write
call so one bad record can't kill a batch run. Existing CLI surfaces
preserved.
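
The per-record error handling described above might look like this in a batch loop. It is a sketch only: `ValidationFailedError` here is a local stand-in for the exception named in the commit message, and `write_all` and the writer callback are hypothetical names.

```python
from typing import Callable, Dict, Iterable, List, Tuple


class ValidationFailedError(Exception):
    """Local stand-in for the exception raised by the validating writer."""


def write_all(
    records: Iterable[Tuple[str, dict]],
    write_validated: Callable[[str, dict], None],
) -> Tuple[List[str], Dict[str, str]]:
    """Write each record; collect failures instead of aborting the batch."""
    written: List[str] = []
    failed: Dict[str, str] = {}
    for path, doc in records:
        try:
            write_validated(path, doc)
        except ValidationFailedError as exc:
            # One invalid record is logged and skipped; the loop continues.
            failed[path] = str(exc)
            continue
        written.append(path)
    return written, failed
```

Returning the failures alongside the successes lets the caller decide whether a partially successful batch should still exit non-zero.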

Justfile:
- New validate-strict + audit-writers recipes.
- qc composite extended to include validate-strict.

Baseline:
- just validate-strict — 265 files, 0 ERROR rows (clean).
- just audit-writers — 15 writers; 5 now validate before write + append
  curation_history. The other 10 are flagged in the TSV as future-work
  conversions (apply_strain_designations, apply_taxonomy_corrections,
  apply_suggested_fixes / suggested_snippets, backfill_metals,
  batch_snippet_fixer, clean_metals_inplace, curate_evidence_with_pdfs,
  enhance_strain_data, fix_invalid_snippets, fix_reference_formats,
  intelligent_snippet_fixer, etc.) — converting them follows the same
  pattern as the 5 above.
- pytest tests/ — 136 passed, 9 skipped.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address Copilot review on PR #84

5 findings, all real and addressed:

- scripts/apply_pmc_conversions.py + scripts/link_growth_media.py (both
  process_single_community and process_all_communities paths): all three
  code paths rename the source file to a `.bak` backup before writing the
  validated result. Previously, if write_validated_community raised
  ValidationFailedError, the handler only logged and continued, leaving
  the original path missing on disk (only the .bak existed). Now restore
  the backup on validation failure before logging.

- scripts/audit_writers.py: replace the substring check for
  `wired_into_just` with a per-line check that ignores comments and
  requires a word-boundary match on the full filename. The previous
  check produced a false positive when a justfile comment merely
  mentioned the filename: for example, write_validated.py matched the
  justfile comment referencing write_validated_community(). Drops the
  wired-into-just count from 3 (with false positives) to 1 (genuine:
  link_growth_media).

- scripts/add_community_ids.py: guard against running on already-IDed
  YAMLs. The previous flow started from `{"id": community_id}` and then
  called `.update(data)` on it, which silently retained the source
  file's existing id while the curation event still recorded
  "Assigned id=<new>", a misleading audit entry. Skip such files with an
  explanatory log line instead.
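
Two of the fixes listed above lend themselves to a short illustration: the comment-aware, word-boundary filename check and the guard against reassigning an existing id. The sketch below uses hypothetical function names and simplified inputs; it ignores only whole-line comments (not trailing ones) and is not the code from `audit_writers.py` or `add_community_ids.py`.

```python
import re
from typing import Optional


def wired_into_just(justfile_text: str, filename: str) -> bool:
    """True if a non-comment justfile line mentions the full filename."""
    pattern = re.compile(r"\b" + re.escape(filename) + r"\b")
    for line in justfile_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # a mention inside a comment does not count
        if pattern.search(stripped):
            return True
    return False


def assign_id(data: dict, community_id: str) -> Optional[dict]:
    """Return a copy with the new id first, or None if an id already exists."""
    if data.get("id"):
        return None  # already IDed: skip instead of recording a misleading event
    return {"id": community_id, **data}


# The pitfall being guarded against: update() lets the existing id win.
merged = {"id": "NEW"}
merged.update({"id": "OLD", "name": "example"})
# merged["id"] is now "OLD", even though "NEW" was assigned first.
```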

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>