Close audit-writers blind-spot for text-templated writers + lock in growth_media id patterns#85
Merged
Merged
Conversation
Two small follow-ups to the recent schema-gap audit (#84): - F7: scripts/audit_writers.py's writer-detection heuristic only caught yaml.dump / yaml.safe_dump / path.write_text(yaml.dump(...)) patterns. scripts/clean_metals_inplace.py mutates 149 community YAMLs via regex/line-edit substitutions + path.write_text(string), bypassing the detector entirely. Broaden the heuristic: a file that references the kb/communities path literal AND round-trips the same path variable through .read_text() and .write_text(...) is also a community-YAML writer. clean_metals_inplace.py is the only script that matches this pattern today; it now surfaces in the audit TSV (correctly flagged validates_before_write=no, appends_curation_history=no — slated for conversion as separate work). - Fix #5: GrowthMedia.culturemech_id and GrowthMediaComponent.media_ingredient_mech_id had no pattern declared, while the sibling RelatedMedia.culturemech_id / RelatedIngredient.mediaingredientmech_id slots do. Live data already conforms to the standard ^CultureMech:\d{6}$ / ^MediaIngredientMech:\d{6}$ format — just lock it in to catch future drift at write time. Schema regenerated via `just gen-python`; `validate_strict.py` still reports 0 ERROR rows across 265 files. This PR does NOT convert clean_metals_inplace.py itself — that's a separate, more invasive change (regex surgery → load/dump roundtrip across 149 files). The audit-blind-spot fix above ensures the next audit cycle surfaces it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR tightens audit coverage for community-YAML “writers” that mutate YAML via raw text round-trips (read_text → write_text) and strengthens schema validation by enforcing ID regex patterns for GrowthMedia-related cross-repo identifiers.
Changes:
- Expanded
scripts/audit_writers.pywriter detection to include in-place text-templated mutations targetingkb/communities(catchesclean_metals_inplace.py). - Added
pattern:constraints forGrowthMedia.culturemech_idandGrowthMediaComponent.media_ingredient_mech_idin the LinkML schema and regenerated the Python datamodel. - Refreshed
reports/pipeline_writers_audit.tsvto include the newly detected writer.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
src/communitymech/schema/communitymech.yaml |
Adds regex patterns to lock in GrowthMedia* ID formats consistent with existing related-* slots. |
src/communitymech/datamodel/communitymech.py |
Regenerated datamodel to include compiled regex patterns for the updated slots. |
scripts/audit_writers.py |
Broadens writer heuristics to detect community YAML mutations done via text round-tripping. |
reports/pipeline_writers_audit.tsv |
Updates the audit report to include scripts/clean_metals_inplace.py. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
6 tasks
realmarcin
added a commit
that referenced
this pull request
May 26, 2026
* Convert clean_metals_inplace.py to load/dump roundtrip with write_validated_community Previously this script mutated 149 community YAMLs in place via regex / line-edit surgery on the raw text, then wrote the result with path.write_text(string). The recent schema-gap audit (PR #85) flagged it as the highest-risk writer in the repo: no validation gate, no curation_history entry on changed files, and the audit detector itself was blind to it until the previous PR broadened the heuristic to catch read_text/write_text round-trips against kb/communities/. Now: load through yaml.safe_load, run the same cleaning logic on the parsed dict, append a CLEAN_METAL_FIELDS CurationEvent, and write back via write_validated_community (with backup-restore on validation failure, matching the apply_pmc_conversions / link_growth_media pattern from PR #85). The --dry-run default + --apply opt-in flag matches the existing repo convention. Cleaning semantics preserved verbatim: the same AMBIGUOUS_METAL_KEYWORDS table is consulted (TITANIUM / GOLD / PALLADIUM), and the same word-bounded evidence search runs over the document minus its metals_present block — implemented now by yaml.safe_dump'ing a shallow copy of the doc with metals_present stripped, which keyword_in_text scans identically (its regex anchors on non-alphanumeric boundaries, so YAML structural characters do not change match outcomes). Smoke-tested against a /tmp copy of the corpus: - Baseline corpus (already cleaned by a prior buggy-extractor run): reports 0/265 files would change, as expected. - With an injected false-positive (TITANIUM in a doc with no titanium evidence): correctly removed in dry-run preview and in --apply, with a schema-valid curation_history event appended and backup unlinked on success. Audit row before: yes no yes no no Audit row after: yes yes yes yes no (writes_yaml / appends_curation_history / has_write_safeguard / validates_before_write / wired_into_just — wired_into_just stays no; this script is a one-shot cleanup, not a recurring pipeline step.) Baselines remain green: scripts/validate_strict.py reports 0 errors across 265 files; pytest tests/ 136 passed / 9 skipped; ruff clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Address Copilot review on PR #86 Two findings: - The backup/restore flow only caught ValidationFailedError. If any other exception fired after path.rename(backup) (disk error, permission denied, unexpected serializer fault), the original would be left as .yaml.bak_metals with no .yaml. Catch BaseException too: restore the backup before re-raising so the corpus is never left with a missing canonical file. - _evidence_text serializes via yaml.safe_dump, which drops comments and can reflow scalars — strictly not a perfect raw-text equivalent. Document the caveat: the keywords we search for live in YAML content (description, evidence snippet, ecological_state), never in comments, so the on-disk behavior is identical for the corpus today. If a future workflow starts encoding evidence in YAML comments, the function should switch to path.read_text() with a line-strip filter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6 tasks
realmarcin
added a commit
that referenced
this pull request
May 26, 2026
…d helpers Brings CommunityMech writer coverage from 5/16 to 8/16. Continues the pattern established in PR #84 and refined by PR #85 (restore-on-failure backup handling): every script that mutates a community YAML loads through yaml.safe_load, mutates the dict, records a CurationEvent via record_curation_event(), and writes back via write_validated_community() which gates on closed-schema LinkML validation. Converted scripts: - scripts/intelligent_snippet_fixer.py (LLM-driven snippet repair; llm_assisted=True; action=FIX_SNIPPETS_LLM). Uses skip_if_recent=True on the curation event so a session of auto-approved fixes collapses into a single trail entry instead of one per snippet. The existing .yaml.bak_intelligent backup created at session start by shutil.copy2 remains the user-visible safety net. - scripts/enhance_strain_data.py (strain-ID enrichment; action=ENHANCE_STRAIN_DATA). Previously the script extracted strain data but only emitted a copy-paste snippets file; this PR adds an --apply mode that writes strain_designation entries back into matching taxonomy[*] entries via write_validated_community(). Default behavior preserves the historical extract-only flow (no kb/communities writes without --apply). --overwrite controls whether to replace existing curator-authored strain_designation values. - scripts/add_evidence_source.py (evidence_source enum backfill; action=BACKFILL_EVIDENCE_SOURCE). Uses the backup-then-rename pattern from PR #85 — the original is moved to .yaml.bak_source before the validated write; on ValidationFailedError the backup is renamed back in place so the batch loop can continue without leaving a half-written community on disk. Each per-record loop continues on ValidationFailedError so one bad file can't kill the batch. CLI surfaces (--auto, --interactive, --dry-run, --auto-approve, --file, --apply, etc.) preserved. After-state: scripts/audit_writers.py reports 8/16 writers gated (was 5/16). The remaining un-converted writers (apply_strain_designations, apply_taxonomy_corrections, backfill_metals, clean_metals_inplace, fix_reference_formats, plus a handful of smaller src/ writers) follow the same conversion pattern; left as future work to keep this PR focused. Note: scripts/add_evidence_source.py and scripts/intelligent_snippet_fixer.py import communitymech.literature_enhanced (a pre-existing module that does not currently exist in the repo) at module top-level. This PR does not introduce or fix that — the scripts have always failed at the import step when invoked from CLI. Out of scope for this conversion; tracked separately. Baseline (unchanged): - validate_strict: 0 ERROR rows / 265 files - pytest tests/: 136 passed, 9 skipped Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
realmarcin
added a commit
that referenced
this pull request
May 26, 2026
…ance_strain_data + add_evidence_source) (#87) * Convert 3 more CommunityMech writers to use shared validate-and-record helpers Brings CommunityMech writer coverage from 5/16 to 8/16. Continues the pattern established in PR #84 and refined by PR #85 (restore-on-failure backup handling): every script that mutates a community YAML loads through yaml.safe_load, mutates the dict, records a CurationEvent via record_curation_event(), and writes back via write_validated_community() which gates on closed-schema LinkML validation. Converted scripts: - scripts/intelligent_snippet_fixer.py (LLM-driven snippet repair; llm_assisted=True; action=FIX_SNIPPETS_LLM). Uses skip_if_recent=True on the curation event so a session of auto-approved fixes collapses into a single trail entry instead of one per snippet. The existing .yaml.bak_intelligent backup created at session start by shutil.copy2 remains the user-visible safety net. - scripts/enhance_strain_data.py (strain-ID enrichment; action=ENHANCE_STRAIN_DATA). Previously the script extracted strain data but only emitted a copy-paste snippets file; this PR adds an --apply mode that writes strain_designation entries back into matching taxonomy[*] entries via write_validated_community(). Default behavior preserves the historical extract-only flow (no kb/communities writes without --apply). --overwrite controls whether to replace existing curator-authored strain_designation values. - scripts/add_evidence_source.py (evidence_source enum backfill; action=BACKFILL_EVIDENCE_SOURCE). Uses the backup-then-rename pattern from PR #85 — the original is moved to .yaml.bak_source before the validated write; on ValidationFailedError the backup is renamed back in place so the batch loop can continue without leaving a half-written community on disk. Each per-record loop continues on ValidationFailedError so one bad file can't kill the batch. CLI surfaces (--auto, --interactive, --dry-run, --auto-approve, --file, --apply, etc.) preserved. After-state: scripts/audit_writers.py reports 8/16 writers gated (was 5/16). The remaining un-converted writers (apply_strain_designations, apply_taxonomy_corrections, backfill_metals, clean_metals_inplace, fix_reference_formats, plus a handful of smaller src/ writers) follow the same conversion pattern; left as future work to keep this PR focused. Note: scripts/add_evidence_source.py and scripts/intelligent_snippet_fixer.py import communitymech.literature_enhanced (a pre-existing module that does not currently exist in the repo) at module top-level. This PR does not introduce or fix that — the scripts have always failed at the import step when invoked from CLI. Out of scope for this conversion; tracked separately. Baseline (unchanged): - validate_strict: 0 ERROR rows / 265 files - pytest tests/: 136 passed, 9 skipped Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Refresh audit TSV after rebase onto #86 PR #86 (Convert clean_metals_inplace.py) just merged into main. After rebasing, scripts/clean_metals_inplace.py is now gated, so the appends_curation_history / validates_before_write columns for it flip to yes. Re-running scripts/audit_writers.py produces a 1-row delta; commit it so the report reflects the actual post-rebase state. Combined post-merge: 9/16 appends_curation_history, 10/16 validates_before_write (was 5/16 and 6/16 respectively at the start of this PR series). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two small follow-ups to the schema-gap audit landed in #84:
scripts/audit_writers.pywas blind toscripts/clean_metals_inplace.py, which mutates 149 community YAMLs via regex/line-edit substitutions +path.write_text(string). The auditor's_WRITE_TEXT_OF_YAMLregex only matchedpath.write_text(yaml.dump(...)), so a high-risk writer with no validation, no curation trail, and noyaml-roundtrip safety was invisible. Broaden the heuristic so it now surfaces. (Does not convert the script — that's a separate, larger change.)GrowthMedia.culturemech_idandGrowthMediaComponent.media_ingredient_mech_idhad nopattern:declared, while the siblingRelatedMedia.culturemech_id/RelatedIngredient.mediaingredientmech_idslots do. Live data already conforms to the standard^CultureMech:\d{6}$/^MediaIngredientMech:\d{6}$— just lock it in.What changed
1.
scripts/audit_writers.py— extendedlooks_like_yaml_writerNew branch in the detector:
Calibration:
scripts/+src/communitymech/only one new script matches:clean_metals_inplace.py.src/communitymech/visualization/umap_generator.pyandsrc/communitymech/render_community_pages.pyreferencekb/communitiesand call.write_text(...)but write HTML output to a differentout_path— no same-variable round-trip, so they don't match.src/communitymech/literature.pyandsrc/communitymech/utils/media_linker.pyhave same-variableread_text/write_textround-trips but operate on cache files (nokb/communitiesreference) — they don't match either.audit_writers.pyitself round-tripspathand referenceskb/communitiesin its regex source, but is already self-excluded viapath.resolve() == Path(__file__).resolve().After-state audit row for the now-surfaced writer:
(
writes_yaml=yes,appends_curation_history=no,has_write_safeguard=yes— it does have--dry-run,validates_before_write=no,wired_into_just=no.)2.
src/communitymech/schema/communitymech.yaml— patterns addedSchema regenerated via
just gen-python; the four compiled regexes are now insrc/communitymech/datamodel/communitymech.py(two new lines for the GrowthMedia* slots, in addition to the two already present for RelatedMedia/RelatedIngredient).3.
reports/pipeline_writers_audit.tsv— refreshedNow 16 rows (was 15). The added row is
clean_metals_inplace.py.Before/after
clean_metals_inplace.pysurfacedvalidate_strictERROR rowspytest tests/ruff check scripts/audit_writers.pyDeferred
Converting
scripts/clean_metals_inplace.pyitself to ayaml.safe_load→ mutate-dict →write_validated_communityflow is not in this PR. That change is more invasive (149-file regex surgery to load/dump round-trip, with quoting / float-rendering / comment-preservation pitfalls). Now that the audit surfaces it, it's tracked in the standard backlog. From the audit's remaining un-gated writers, the next priority candidates are:intelligent_snippet_fixer.py(no safeguard, no validation, no curation)enhance_strain_data.py(no safeguard, no validation, no curation)add_evidence_source.py(safeguarded but no validation, no curation)clean_metals_inplace.py(this PR's new finding —--dry-runalready in place)Test plan
uv run python scripts/audit_writers.py | grep clean_metals_inplacereturns a rowuv run python scripts/validate_strict.py --quiet→ 0 ERROR rows / 265 filesuv run pytest tests/ -q --no-cov→ 136 passed, 9 skippeduv run ruff check scripts/audit_writers.py→ cleandatamodel/communitymech.pycontains four^CultureMech:\d{6}$/^MediaIngredientMech:\d{6}$compiled patterns