Skip to content

Close audit-writers blind-spot for text-templated writers + lock in growth_media id patterns#85

Merged
realmarcin merged 1 commit into
mainfrom
audit/detector-and-patterns
May 26, 2026
Merged

Close audit-writers blind-spot for text-templated writers + lock in growth_media id patterns#85
realmarcin merged 1 commit into
mainfrom
audit/detector-and-patterns

Conversation

@realmarcin
Copy link
Copy Markdown
Contributor

Summary

Two small follow-ups to the schema-gap audit landed in #84:

  • F7 (high)scripts/audit_writers.py was blind to scripts/clean_metals_inplace.py, which mutates 149 community YAMLs via regex/line-edit substitutions + path.write_text(string). The auditor's _WRITE_TEXT_OF_YAML regex only matched path.write_text(yaml.dump(...)), so a high-risk writer with no validation, no curation trail, and no yaml-roundtrip safety was invisible. Broaden the heuristic so it now surfaces. (Does not convert the script — that's a separate, larger change.)
  • Fix Regenerate HTML for all 61 communities including new EcoFAB #5 (low)GrowthMedia.culturemech_id and GrowthMediaComponent.media_ingredient_mech_id had no pattern: declared, while the sibling RelatedMedia.culturemech_id / RelatedIngredient.mediaingredientmech_id slots do. Live data already conforms to the standard ^CultureMech:\d{6}$ / ^MediaIngredientMech:\d{6}$ — just lock it in.

What changed

1. scripts/audit_writers.py — extended looks_like_yaml_writer

New branch in the detector:

A script that references the kb/communities path literal and round-trips the same path variable through .read_text() and .write_text(...) is a community-YAML writer (text-templated in-place mutation).

Calibration:

  • Across all scripts/ + src/communitymech/ only one new script matches: clean_metals_inplace.py.
  • src/communitymech/visualization/umap_generator.py and src/communitymech/render_community_pages.py reference kb/communities and call .write_text(...) but write HTML output to a different out_path — no same-variable round-trip, so they don't match.
  • src/communitymech/literature.py and src/communitymech/utils/media_linker.py have same-variable read_text / write_text round-trips but operate on cache files (no kb/communities reference) — they don't match either.
  • audit_writers.py itself round-trips path and references kb/communities in its regex source, but is already self-excluded via path.resolve() == Path(__file__).resolve().

After-state audit row for the now-surfaced writer:

scripts/clean_metals_inplace.py  yes  no  yes  no  no

(writes_yaml=yes, appends_curation_history=no, has_write_safeguard=yes — it does have --dry-run, validates_before_write=no, wired_into_just=no.)

2. src/communitymech/schema/communitymech.yaml — patterns added

GrowthMediaComponent:
  attributes:
    media_ingredient_mech_id:
      ...
      pattern: "^MediaIngredientMech:\\d{6}$"

GrowthMedia:
  attributes:
    culturemech_id:
      ...
      pattern: "^CultureMech:\\d{6}$"

Schema regenerated via just gen-python; the four compiled regexes are now in src/communitymech/datamodel/communitymech.py (two new lines for the GrowthMedia* slots, in addition to the two already present for RelatedMedia/RelatedIngredient).

3. reports/pipeline_writers_audit.tsv — refreshed

Now 16 rows (was 15). The added row is clean_metals_inplace.py.

Before/after

check before after
audit_writers TSV rows 15 16
clean_metals_inplace.py surfaced no yes
validate_strict ERROR rows 0 / 265 0 / 265
pytest tests/ 136 pass / 9 skip 136 pass / 9 skip
ruff check scripts/audit_writers.py clean clean

Deferred

Converting scripts/clean_metals_inplace.py itself to a yaml.safe_load → mutate-dict → write_validated_community flow is not in this PR. That change is more invasive (149-file regex surgery to load/dump round-trip, with quoting / float-rendering / comment-preservation pitfalls). Now that the audit surfaces it, it's tracked in the standard backlog. From the audit's remaining un-gated writers, the next priority candidates are:

  • intelligent_snippet_fixer.py (no safeguard, no validation, no curation)
  • enhance_strain_data.py (no safeguard, no validation, no curation)
  • add_evidence_source.py (safeguarded but no validation, no curation)
  • clean_metals_inplace.py (this PR's new finding — --dry-run already in place)

Test plan

  • uv run python scripts/audit_writers.py | grep clean_metals_inplace returns a row
  • No new false-positives surfaced in the audit TSV (rows = 15 + 1 = 16; the new row is the target)
  • uv run python scripts/validate_strict.py --quiet → 0 ERROR rows / 265 files
  • uv run pytest tests/ -q --no-cov → 136 passed, 9 skipped
  • uv run ruff check scripts/audit_writers.py → clean
  • Regenerated datamodel/communitymech.py contains four ^CultureMech:\d{6}$ / ^MediaIngredientMech:\d{6}$ compiled patterns

Two small follow-ups to the recent schema-gap audit (#84):

- F7: scripts/audit_writers.py's writer-detection heuristic only caught
  yaml.dump / yaml.safe_dump / path.write_text(yaml.dump(...)) patterns.
  scripts/clean_metals_inplace.py mutates 149 community YAMLs via
  regex/line-edit substitutions + path.write_text(string), bypassing
  the detector entirely. Broaden the heuristic: a file that references
  the kb/communities path literal AND round-trips the same path
  variable through .read_text() and .write_text(...) is also a
  community-YAML writer. clean_metals_inplace.py is the only script
  that matches this pattern today; it now surfaces in the audit TSV
  (correctly flagged validates_before_write=no,
  appends_curation_history=no — slated for conversion as separate
  work).

- Fix #5: GrowthMedia.culturemech_id and
  GrowthMediaComponent.media_ingredient_mech_id had no pattern
  declared, while the sibling RelatedMedia.culturemech_id /
  RelatedIngredient.mediaingredientmech_id slots do. Live data
  already conforms to the standard ^CultureMech:\d{6}$ /
  ^MediaIngredientMech:\d{6}$ format — just lock it in to catch
  future drift at write time. Schema regenerated via `just gen-python`;
  `validate_strict.py` still reports 0 ERROR rows across 265 files.

This PR does NOT convert clean_metals_inplace.py itself — that's a
separate, more invasive change (regex surgery → load/dump roundtrip
across 149 files). The audit-blind-spot fix above ensures the next
audit cycle surfaces it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 26, 2026 01:14
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR tightens audit coverage for community-YAML “writers” that mutate YAML via raw text round-trips (read_text → write_text) and strengthens schema validation by enforcing ID regex patterns for GrowthMedia-related cross-repo identifiers.

Changes:

  • Expanded scripts/audit_writers.py writer detection to include in-place text-templated mutations targeting kb/communities (catches clean_metals_inplace.py).
  • Added pattern: constraints for GrowthMedia.culturemech_id and GrowthMediaComponent.media_ingredient_mech_id in the LinkML schema and regenerated the Python datamodel.
  • Refreshed reports/pipeline_writers_audit.tsv to include the newly detected writer.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description
src/communitymech/schema/communitymech.yaml Adds regex patterns to lock in GrowthMedia* ID formats consistent with existing related-* slots.
src/communitymech/datamodel/communitymech.py Regenerated datamodel to include compiled regex patterns for the updated slots.
scripts/audit_writers.py Broadens writer heuristics to detect community YAML mutations done via text round-tripping.
reports/pipeline_writers_audit.tsv Updates the audit report to include scripts/clean_metals_inplace.py.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@realmarcin realmarcin merged commit 9d863f8 into main May 26, 2026
1 check passed
@realmarcin realmarcin deleted the audit/detector-and-patterns branch May 26, 2026 01:25
realmarcin added a commit that referenced this pull request May 26, 2026
* Convert clean_metals_inplace.py to load/dump roundtrip with write_validated_community

Previously this script mutated 149 community YAMLs in place via regex /
line-edit surgery on the raw text, then wrote the result with
path.write_text(string). The recent schema-gap audit (PR #85) flagged
it as the highest-risk writer in the repo: no validation gate, no
curation_history entry on changed files, and the audit detector itself
was blind to it until the previous PR broadened the heuristic to catch
read_text/write_text round-trips against kb/communities/.

Now: load through yaml.safe_load, run the same cleaning logic on the
parsed dict, append a CLEAN_METAL_FIELDS CurationEvent, and write back
via write_validated_community (with backup-restore on validation
failure, matching the apply_pmc_conversions / link_growth_media pattern
from PR #85). The --dry-run default + --apply opt-in flag matches the
existing repo convention.

Cleaning semantics preserved verbatim: the same AMBIGUOUS_METAL_KEYWORDS
table is consulted (TITANIUM / GOLD / PALLADIUM), and the same
word-bounded evidence search runs over the document minus its
metals_present block — implemented now by yaml.safe_dump'ing a shallow
copy of the doc with metals_present stripped, which keyword_in_text
scans identically (its regex anchors on non-alphanumeric boundaries, so
YAML structural characters do not change match outcomes).

Smoke-tested against a /tmp copy of the corpus:
- Baseline corpus (already cleaned by a prior buggy-extractor run):
  reports 0/265 files would change, as expected.
- With an injected false-positive (TITANIUM in a doc with no titanium
  evidence): correctly removed in dry-run preview and in --apply, with
  a schema-valid curation_history event appended and backup unlinked
  on success.

Audit row before:  yes  no   yes  no   no
Audit row after:   yes  yes  yes  yes  no
(writes_yaml / appends_curation_history / has_write_safeguard /
validates_before_write / wired_into_just — wired_into_just stays no;
this script is a one-shot cleanup, not a recurring pipeline step.)

Baselines remain green: scripts/validate_strict.py reports 0 errors
across 265 files; pytest tests/ 136 passed / 9 skipped; ruff clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Address Copilot review on PR #86

Two findings:

- The backup/restore flow only caught ValidationFailedError. If any
  other exception fired after path.rename(backup) (disk error,
  permission denied, unexpected serializer fault), the original would
  be left as .yaml.bak_metals with no .yaml. Catch BaseException too:
  restore the backup before re-raising so the corpus is never left
  with a missing canonical file.

- _evidence_text serializes via yaml.safe_dump, which drops comments
  and can reflow scalars — strictly not a perfect raw-text equivalent.
  Document the caveat: the keywords we search for live in YAML
  content (description, evidence snippet, ecological_state), never in
  comments, so the on-disk behavior is identical for the corpus today.
  If a future workflow starts encoding evidence in YAML comments, the
  function should switch to path.read_text() with a line-strip filter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
realmarcin added a commit that referenced this pull request May 26, 2026
…d helpers

Brings CommunityMech writer coverage from 5/16 to 8/16. Continues the
pattern established in PR #84 and refined by PR #85 (restore-on-failure
backup handling): every script that mutates a community YAML loads
through yaml.safe_load, mutates the dict, records a CurationEvent via
record_curation_event(), and writes back via write_validated_community()
which gates on closed-schema LinkML validation.

Converted scripts:
- scripts/intelligent_snippet_fixer.py (LLM-driven snippet repair;
  llm_assisted=True; action=FIX_SNIPPETS_LLM). Uses skip_if_recent=True
  on the curation event so a session of auto-approved fixes collapses
  into a single trail entry instead of one per snippet. The existing
  .yaml.bak_intelligent backup created at session start by
  shutil.copy2 remains the user-visible safety net.
- scripts/enhance_strain_data.py (strain-ID enrichment;
  action=ENHANCE_STRAIN_DATA). Previously the script extracted strain
  data but only emitted a copy-paste snippets file; this PR adds an
  --apply mode that writes strain_designation entries back into
  matching taxonomy[*] entries via write_validated_community(). Default
  behavior preserves the historical extract-only flow (no kb/communities
  writes without --apply). --overwrite controls whether to replace
  existing curator-authored strain_designation values.
- scripts/add_evidence_source.py (evidence_source enum backfill;
  action=BACKFILL_EVIDENCE_SOURCE). Uses the backup-then-rename pattern
  from PR #85 — the original is moved to .yaml.bak_source before the
  validated write; on ValidationFailedError the backup is renamed back
  in place so the batch loop can continue without leaving a half-written
  community on disk.

Each per-record loop continues on ValidationFailedError so one bad file
can't kill the batch. CLI surfaces (--auto, --interactive, --dry-run,
--auto-approve, --file, --apply, etc.) preserved.

After-state: scripts/audit_writers.py reports 8/16 writers gated
(was 5/16). The remaining un-converted writers (apply_strain_designations,
apply_taxonomy_corrections, backfill_metals, clean_metals_inplace,
fix_reference_formats, plus a handful of smaller src/ writers) follow
the same conversion pattern; left as future work to keep this PR focused.

Note: scripts/add_evidence_source.py and scripts/intelligent_snippet_fixer.py
import communitymech.literature_enhanced (a pre-existing module that does
not currently exist in the repo) at module top-level. This PR does not
introduce or fix that — the scripts have always failed at the import step
when invoked from CLI. Out of scope for this conversion; tracked separately.

Baseline (unchanged):
- validate_strict: 0 ERROR rows / 265 files
- pytest tests/: 136 passed, 9 skipped

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
realmarcin added a commit that referenced this pull request May 26, 2026
…ance_strain_data + add_evidence_source) (#87)

* Convert 3 more CommunityMech writers to use shared validate-and-record helpers

Brings CommunityMech writer coverage from 5/16 to 8/16. Continues the
pattern established in PR #84 and refined by PR #85 (restore-on-failure
backup handling): every script that mutates a community YAML loads
through yaml.safe_load, mutates the dict, records a CurationEvent via
record_curation_event(), and writes back via write_validated_community()
which gates on closed-schema LinkML validation.

Converted scripts:
- scripts/intelligent_snippet_fixer.py (LLM-driven snippet repair;
  llm_assisted=True; action=FIX_SNIPPETS_LLM). Uses skip_if_recent=True
  on the curation event so a session of auto-approved fixes collapses
  into a single trail entry instead of one per snippet. The existing
  .yaml.bak_intelligent backup created at session start by
  shutil.copy2 remains the user-visible safety net.
- scripts/enhance_strain_data.py (strain-ID enrichment;
  action=ENHANCE_STRAIN_DATA). Previously the script extracted strain
  data but only emitted a copy-paste snippets file; this PR adds an
  --apply mode that writes strain_designation entries back into
  matching taxonomy[*] entries via write_validated_community(). Default
  behavior preserves the historical extract-only flow (no kb/communities
  writes without --apply). --overwrite controls whether to replace
  existing curator-authored strain_designation values.
- scripts/add_evidence_source.py (evidence_source enum backfill;
  action=BACKFILL_EVIDENCE_SOURCE). Uses the backup-then-rename pattern
  from PR #85 — the original is moved to .yaml.bak_source before the
  validated write; on ValidationFailedError the backup is renamed back
  in place so the batch loop can continue without leaving a half-written
  community on disk.

Each per-record loop continues on ValidationFailedError so one bad file
can't kill the batch. CLI surfaces (--auto, --interactive, --dry-run,
--auto-approve, --file, --apply, etc.) preserved.

After-state: scripts/audit_writers.py reports 8/16 writers gated
(was 5/16). The remaining un-converted writers (apply_strain_designations,
apply_taxonomy_corrections, backfill_metals, clean_metals_inplace,
fix_reference_formats, plus a handful of smaller src/ writers) follow
the same conversion pattern; left as future work to keep this PR focused.

Note: scripts/add_evidence_source.py and scripts/intelligent_snippet_fixer.py
import communitymech.literature_enhanced (a pre-existing module that does
not currently exist in the repo) at module top-level. This PR does not
introduce or fix that — the scripts have always failed at the import step
when invoked from CLI. Out of scope for this conversion; tracked separately.

Baseline (unchanged):
- validate_strict: 0 ERROR rows / 265 files
- pytest tests/: 136 passed, 9 skipped

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Refresh audit TSV after rebase onto #86

PR #86 (Convert clean_metals_inplace.py) just merged into main. After
rebasing, scripts/clean_metals_inplace.py is now gated, so the
appends_curation_history / validates_before_write columns for it flip
to yes. Re-running scripts/audit_writers.py produces a 1-row delta;
commit it so the report reflects the actual post-rebase state.

Combined post-merge: 9/16 appends_curation_history, 10/16
validates_before_write (was 5/16 and 6/16 respectively at the start
of this PR series).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants