Close audit-writers blind-spot for text-templated writers + lock in growth_media id patterns by realmarcin · Pull Request #85 · CultureBotAI/CommunityMech

realmarcin · 2026-05-26T01:14:28Z

Summary

Two small follow-ups to the schema-gap audit landed in #84:

F7 (high) — scripts/audit_writers.py was blind to scripts/clean_metals_inplace.py, which mutates 149 community YAMLs via regex/line-edit substitutions + path.write_text(string). The auditor's _WRITE_TEXT_OF_YAML regex only matched path.write_text(yaml.dump(...)), so a high-risk writer with no validation, no curation trail, and no yaml-roundtrip safety was invisible. Broaden the heuristic so it now surfaces. (Does not convert the script — that's a separate, larger change.)
Fix Regenerate HTML for all 61 communities including new EcoFAB #5 (low) — GrowthMedia.culturemech_id and GrowthMediaComponent.media_ingredient_mech_id had no pattern: declared, while the sibling RelatedMedia.culturemech_id / RelatedIngredient.mediaingredientmech_id slots do. Live data already conforms to the standard ^CultureMech:\d{6}$ / ^MediaIngredientMech:\d{6}$ — just lock it in.

What changed

1. `scripts/audit_writers.py` — extended `looks_like_yaml_writer`

New branch in the detector:

A script that references the kb/communities path literal and round-trips the same path variable through .read_text() and .write_text(...) is a community-YAML writer (text-templated in-place mutation).

Calibration:

Across all scripts/ + src/communitymech/ only one new script matches: clean_metals_inplace.py.
src/communitymech/visualization/umap_generator.py and src/communitymech/render_community_pages.py reference kb/communities and call .write_text(...) but write HTML output to a different out_path — no same-variable round-trip, so they don't match.
src/communitymech/literature.py and src/communitymech/utils/media_linker.py have same-variable read_text / write_text round-trips but operate on cache files (no kb/communities reference) — they don't match either.
audit_writers.py itself round-trips path and references kb/communities in its regex source, but is already self-excluded via path.resolve() == Path(__file__).resolve().

After-state audit row for the now-surfaced writer:

scripts/clean_metals_inplace.py  yes  no  yes  no  no

(writes_yaml=yes, appends_curation_history=no, has_write_safeguard=yes — it does have --dry-run, validates_before_write=no, wired_into_just=no.)

2. `src/communitymech/schema/communitymech.yaml` — patterns added

GrowthMediaComponent:
  attributes:
    media_ingredient_mech_id:
      ...
      pattern: "^MediaIngredientMech:\\d{6}$"

GrowthMedia:
  attributes:
    culturemech_id:
      ...
      pattern: "^CultureMech:\\d{6}$"

Schema regenerated via just gen-python; the four compiled regexes are now in src/communitymech/datamodel/communitymech.py (two new lines for the GrowthMedia* slots, in addition to the two already present for RelatedMedia/RelatedIngredient).

3. `reports/pipeline_writers_audit.tsv` — refreshed

Now 16 rows (was 15). The added row is clean_metals_inplace.py.

Before/after

check	before	after
audit_writers TSV rows	15	16
`clean_metals_inplace.py` surfaced	no	yes
`validate_strict` ERROR rows	0 / 265	0 / 265
`pytest tests/`	136 pass / 9 skip	136 pass / 9 skip
`ruff check scripts/audit_writers.py`	clean	clean

Deferred

Converting scripts/clean_metals_inplace.py itself to a yaml.safe_load → mutate-dict → write_validated_community flow is not in this PR. That change is more invasive (149-file regex surgery to load/dump round-trip, with quoting / float-rendering / comment-preservation pitfalls). Now that the audit surfaces it, it's tracked in the standard backlog. From the audit's remaining un-gated writers, the next priority candidates are:

intelligent_snippet_fixer.py (no safeguard, no validation, no curation)
enhance_strain_data.py (no safeguard, no validation, no curation)
add_evidence_source.py (safeguarded but no validation, no curation)
clean_metals_inplace.py (this PR's new finding — --dry-run already in place)

Test plan

uv run python scripts/audit_writers.py | grep clean_metals_inplace returns a row
No new false-positives surfaced in the audit TSV (rows = 15 + 1 = 16; the new row is the target)
uv run python scripts/validate_strict.py --quiet → 0 ERROR rows / 265 files
uv run pytest tests/ -q --no-cov → 136 passed, 9 skipped
uv run ruff check scripts/audit_writers.py → clean
Regenerated datamodel/communitymech.py contains four ^CultureMech:\d{6}$ / ^MediaIngredientMech:\d{6}$ compiled patterns

Two small follow-ups to the recent schema-gap audit (#84): - F7: scripts/audit_writers.py's writer-detection heuristic only caught yaml.dump / yaml.safe_dump / path.write_text(yaml.dump(...)) patterns. scripts/clean_metals_inplace.py mutates 149 community YAMLs via regex/line-edit substitutions + path.write_text(string), bypassing the detector entirely. Broaden the heuristic: a file that references the kb/communities path literal AND round-trips the same path variable through .read_text() and .write_text(...) is also a community-YAML writer. clean_metals_inplace.py is the only script that matches this pattern today; it now surfaces in the audit TSV (correctly flagged validates_before_write=no, appends_curation_history=no — slated for conversion as separate work). - Fix #5: GrowthMedia.culturemech_id and GrowthMediaComponent.media_ingredient_mech_id had no pattern declared, while the sibling RelatedMedia.culturemech_id / RelatedIngredient.mediaingredientmech_id slots do. Live data already conforms to the standard ^CultureMech:\d{6}$ / ^MediaIngredientMech:\d{6}$ format — just lock it in to catch future drift at write time. Schema regenerated via `just gen-python`; `validate_strict.py` still reports 0 ERROR rows across 265 files. This PR does NOT convert clean_metals_inplace.py itself — that's a separate, more invasive change (regex surgery → load/dump roundtrip across 149 files). The audit-blind-spot fix above ensures the next audit cycle surfaces it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR tightens audit coverage for community-YAML “writers” that mutate YAML via raw text round-trips (read_text → write_text) and strengthens schema validation by enforcing ID regex patterns for GrowthMedia-related cross-repo identifiers.

Changes:

Expanded scripts/audit_writers.py writer detection to include in-place text-templated mutations targeting kb/communities (catches clean_metals_inplace.py).
Added pattern: constraints for GrowthMedia.culturemech_id and GrowthMediaComponent.media_ingredient_mech_id in the LinkML schema and regenerated the Python datamodel.
Refreshed reports/pipeline_writers_audit.tsv to include the newly detected writer.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File	Description
`src/communitymech/schema/communitymech.yaml`	Adds regex patterns to lock in GrowthMedia* ID formats consistent with existing related-* slots.
`src/communitymech/datamodel/communitymech.py`	Regenerated datamodel to include compiled regex patterns for the updated slots.
`scripts/audit_writers.py`	Broadens writer heuristics to detect community YAML mutations done via text round-tripping.
`reports/pipeline_writers_audit.tsv`	Updates the audit report to include `scripts/clean_metals_inplace.py`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

* Convert clean_metals_inplace.py to load/dump roundtrip with write_validated_community Previously this script mutated 149 community YAMLs in place via regex / line-edit surgery on the raw text, then wrote the result with path.write_text(string). The recent schema-gap audit (PR #85) flagged it as the highest-risk writer in the repo: no validation gate, no curation_history entry on changed files, and the audit detector itself was blind to it until the previous PR broadened the heuristic to catch read_text/write_text round-trips against kb/communities/. Now: load through yaml.safe_load, run the same cleaning logic on the parsed dict, append a CLEAN_METAL_FIELDS CurationEvent, and write back via write_validated_community (with backup-restore on validation failure, matching the apply_pmc_conversions / link_growth_media pattern from PR #85). The --dry-run default + --apply opt-in flag matches the existing repo convention. Cleaning semantics preserved verbatim: the same AMBIGUOUS_METAL_KEYWORDS table is consulted (TITANIUM / GOLD / PALLADIUM), and the same word-bounded evidence search runs over the document minus its metals_present block — implemented now by yaml.safe_dump'ing a shallow copy of the doc with metals_present stripped, which keyword_in_text scans identically (its regex anchors on non-alphanumeric boundaries, so YAML structural characters do not change match outcomes). Smoke-tested against a /tmp copy of the corpus: - Baseline corpus (already cleaned by a prior buggy-extractor run): reports 0/265 files would change, as expected. - With an injected false-positive (TITANIUM in a doc with no titanium evidence): correctly removed in dry-run preview and in --apply, with a schema-valid curation_history event appended and backup unlinked on success. Audit row before: yes no yes no no Audit row after: yes yes yes yes no (writes_yaml / appends_curation_history / has_write_safeguard / validates_before_write / wired_into_just — wired_into_just stays no; this script is a one-shot cleanup, not a recurring pipeline step.) Baselines remain green: scripts/validate_strict.py reports 0 errors across 265 files; pytest tests/ 136 passed / 9 skipped; ruff clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Address Copilot review on PR #86 Two findings: - The backup/restore flow only caught ValidationFailedError. If any other exception fired after path.rename(backup) (disk error, permission denied, unexpected serializer fault), the original would be left as .yaml.bak_metals with no .yaml. Catch BaseException too: restore the backup before re-raising so the corpus is never left with a missing canonical file. - _evidence_text serializes via yaml.safe_dump, which drops comments and can reflow scalars — strictly not a perfect raw-text equivalent. Document the caveat: the keywords we search for live in YAML content (description, evidence snippet, ecological_state), never in comments, so the on-disk behavior is identical for the corpus today. If a future workflow starts encoding evidence in YAML comments, the function should switch to path.read_text() with a line-strip filter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…d helpers Brings CommunityMech writer coverage from 5/16 to 8/16. Continues the pattern established in PR #84 and refined by PR #85 (restore-on-failure backup handling): every script that mutates a community YAML loads through yaml.safe_load, mutates the dict, records a CurationEvent via record_curation_event(), and writes back via write_validated_community() which gates on closed-schema LinkML validation. Converted scripts: - scripts/intelligent_snippet_fixer.py (LLM-driven snippet repair; llm_assisted=True; action=FIX_SNIPPETS_LLM). Uses skip_if_recent=True on the curation event so a session of auto-approved fixes collapses into a single trail entry instead of one per snippet. The existing .yaml.bak_intelligent backup created at session start by shutil.copy2 remains the user-visible safety net. - scripts/enhance_strain_data.py (strain-ID enrichment; action=ENHANCE_STRAIN_DATA). Previously the script extracted strain data but only emitted a copy-paste snippets file; this PR adds an --apply mode that writes strain_designation entries back into matching taxonomy[*] entries via write_validated_community(). Default behavior preserves the historical extract-only flow (no kb/communities writes without --apply). --overwrite controls whether to replace existing curator-authored strain_designation values. - scripts/add_evidence_source.py (evidence_source enum backfill; action=BACKFILL_EVIDENCE_SOURCE). Uses the backup-then-rename pattern from PR #85 — the original is moved to .yaml.bak_source before the validated write; on ValidationFailedError the backup is renamed back in place so the batch loop can continue without leaving a half-written community on disk. Each per-record loop continues on ValidationFailedError so one bad file can't kill the batch. CLI surfaces (--auto, --interactive, --dry-run, --auto-approve, --file, --apply, etc.) preserved. After-state: scripts/audit_writers.py reports 8/16 writers gated (was 5/16). The remaining un-converted writers (apply_strain_designations, apply_taxonomy_corrections, backfill_metals, clean_metals_inplace, fix_reference_formats, plus a handful of smaller src/ writers) follow the same conversion pattern; left as future work to keep this PR focused. Note: scripts/add_evidence_source.py and scripts/intelligent_snippet_fixer.py import communitymech.literature_enhanced (a pre-existing module that does not currently exist in the repo) at module top-level. This PR does not introduce or fix that — the scripts have always failed at the import step when invoked from CLI. Out of scope for this conversion; tracked separately. Baseline (unchanged): - validate_strict: 0 ERROR rows / 265 files - pytest tests/: 136 passed, 9 skipped Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ance_strain_data + add_evidence_source) (#87) * Convert 3 more CommunityMech writers to use shared validate-and-record helpers Brings CommunityMech writer coverage from 5/16 to 8/16. Continues the pattern established in PR #84 and refined by PR #85 (restore-on-failure backup handling): every script that mutates a community YAML loads through yaml.safe_load, mutates the dict, records a CurationEvent via record_curation_event(), and writes back via write_validated_community() which gates on closed-schema LinkML validation. Converted scripts: - scripts/intelligent_snippet_fixer.py (LLM-driven snippet repair; llm_assisted=True; action=FIX_SNIPPETS_LLM). Uses skip_if_recent=True on the curation event so a session of auto-approved fixes collapses into a single trail entry instead of one per snippet. The existing .yaml.bak_intelligent backup created at session start by shutil.copy2 remains the user-visible safety net. - scripts/enhance_strain_data.py (strain-ID enrichment; action=ENHANCE_STRAIN_DATA). Previously the script extracted strain data but only emitted a copy-paste snippets file; this PR adds an --apply mode that writes strain_designation entries back into matching taxonomy[*] entries via write_validated_community(). Default behavior preserves the historical extract-only flow (no kb/communities writes without --apply). --overwrite controls whether to replace existing curator-authored strain_designation values. - scripts/add_evidence_source.py (evidence_source enum backfill; action=BACKFILL_EVIDENCE_SOURCE). Uses the backup-then-rename pattern from PR #85 — the original is moved to .yaml.bak_source before the validated write; on ValidationFailedError the backup is renamed back in place so the batch loop can continue without leaving a half-written community on disk. Each per-record loop continues on ValidationFailedError so one bad file can't kill the batch. CLI surfaces (--auto, --interactive, --dry-run, --auto-approve, --file, --apply, etc.) preserved. After-state: scripts/audit_writers.py reports 8/16 writers gated (was 5/16). The remaining un-converted writers (apply_strain_designations, apply_taxonomy_corrections, backfill_metals, clean_metals_inplace, fix_reference_formats, plus a handful of smaller src/ writers) follow the same conversion pattern; left as future work to keep this PR focused. Note: scripts/add_evidence_source.py and scripts/intelligent_snippet_fixer.py import communitymech.literature_enhanced (a pre-existing module that does not currently exist in the repo) at module top-level. This PR does not introduce or fix that — the scripts have always failed at the import step when invoked from CLI. Out of scope for this conversion; tracked separately. Baseline (unchanged): - validate_strict: 0 ERROR rows / 265 files - pytest tests/: 136 passed, 9 skipped Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Refresh audit TSV after rebase onto #86 PR #86 (Convert clean_metals_inplace.py) just merged into main. After rebasing, scripts/clean_metals_inplace.py is now gated, so the appends_curation_history / validates_before_write columns for it flip to yes. Re-running scripts/audit_writers.py produces a 1-row delta; commit it so the report reflects the actual post-rebase state. Combined post-merge: 9/16 appends_curation_history, 10/16 validates_before_write (was 5/16 and 6/16 respectively at the start of this PR series). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 26, 2026 01:14

Copilot started reviewing on behalf of realmarcin May 26, 2026 01:14 View session

Copilot AI reviewed May 26, 2026

View reviewed changes

realmarcin merged commit 9d863f8 into main May 26, 2026
1 check passed

realmarcin deleted the audit/detector-and-patterns branch May 26, 2026 01:25

realmarcin mentioned this pull request May 26, 2026

Convert clean_metals_inplace.py to validated yaml roundtrip #86

Merged

6 tasks

realmarcin mentioned this pull request May 26, 2026

Convert 3 more CommunityMech writers (intelligent_snippet_fixer + enhance_strain_data + add_evidence_source) #87

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Close audit-writers blind-spot for text-templated writers + lock in growth_media id patterns#85

Close audit-writers blind-spot for text-templated writers + lock in growth_media id patterns#85
realmarcin merged 1 commit into
mainfrom
audit/detector-and-patterns

realmarcin commented May 26, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

realmarcin commented May 26, 2026

Summary

What changed

1. scripts/audit_writers.py — extended looks_like_yaml_writer

2. src/communitymech/schema/communitymech.yaml — patterns added

3. reports/pipeline_writers_audit.tsv — refreshed

Before/after

Deferred

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

1. `scripts/audit_writers.py` — extended `looks_like_yaml_writer`

2. `src/communitymech/schema/communitymech.yaml` — patterns added

3. `reports/pipeline_writers_audit.tsv` — refreshed