fix(credibility): M1 slice — validator cross-checks, invocation-graded inspect, gate hardening, honest schemas + docs #8
An empty visible set made _all_passed([]) vacuously True, so decide([], [green_holdout]) wrongly certified Succeeded. Add a symmetric guard mirroring the empty-holdout case: no visible checks means nothing was optimized against, so the gate is NotReady.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
examples/coverage-repair ships no receipts trail, but WORKFLOW.md,
README.md, and the CHANGELOG 0.3.4 note asserted it "records receipts at
.loop/receipts/*.jsonl". Soften the two example docs to describe the
mechanism ("a live run appends receipts to ...; this frozen example ships
the contract artifacts, not a receipts trail") and add a dated ## Errata
to CHANGELOG correcting the claim without rewriting 0.3.4 history.
Add scripts/test_docs_claims.py: a behavioral guard that flags any
present-tense "records receipts"/"receipts land" assertion adjacent to the
.loop/receipts glob and requires the referenced example to actually ship
receipt files (changelog history exonerated by a receipts Errata).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
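A guard of this kind might look roughly like the sketch below. It simplifies "adjacent" to "on the same line" and omits the changelog Errata exemption; every function name here is hypothetical and not taken from `scripts/test_docs_claims.py`.

```python
import re
from pathlib import Path

# Present-tense claim phrases quoted in the commit message.
_CLAIM = re.compile(r"\b(records receipts|receipts land)\b", re.IGNORECASE)
_RECEIPTS_PATH = ".loop/receipts"


def receipts_claims(text):
    """Return lines that assert receipts on the same line as the receipts path."""
    return [
        line
        for line in text.splitlines()
        if _CLAIM.search(line) and _RECEIPTS_PATH in line
    ]


def example_ships_receipts(example_dir):
    """True when the example directory actually contains receipt files."""
    receipts = Path(example_dir) / ".loop" / "receipts"
    return receipts.is_dir() and any(receipts.glob("*.jsonl"))


def unsupported_claims(text, example_dir):
    """Claims are acceptable only when the example ships a receipts trail."""
    if example_ships_receipts(example_dir):
        return []
    return receipts_claims(text)
```

The softened wording ("a live run appends receipts to ...") passes because it does not match either claim phrase.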
The collection-shape and severity-mapping self-checks only catch two known shapes; a diff inserting `return False` into _is_gate_path's body rewrote the decision logic itself and certified clean:true. Add a diff-layer invariant: any hunk touching anticheat_scan.py that adds or removes a non-comment, non-blank line is a scanner_self_edit finding (high -> FailedUnverifiable). Cosmetic-only edits stay clean. Docstrings are deliberately not exempted: a triple-quote-state heuristic would itself be a bypass vector, and a false positive on maintenance is correct, because scanner maintenance should get human eyes.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
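The diff-layer invariant could be sketched as follows, assuming git-format unified diffs where each file section starts with `diff --git`. The function names are hypothetical and the real `anticheat_scan.py` parser may differ.

```python
def _is_cosmetic(body):
    # Blank lines and comment-only lines are the only cosmetic edits.
    stripped = body.strip()
    return not stripped or stripped.startswith("#")


def scanner_self_edit(diff_text, scanner="anticheat_scan.py"):
    """True if the diff adds or removes a non-cosmetic line in the scanner."""
    in_scanner = False
    in_hunk = False
    for line in diff_text.splitlines():
        if line.startswith("diff --git "):
            # A new file section begins: reset per-file state.
            in_scanner = False
            in_hunk = False
            continue
        if not in_hunk and line.startswith(("--- ", "+++ ")):
            # File header: compare the basename of either side to the scanner.
            path = line[4:].strip()
            if path.rsplit("/", 1)[-1] == scanner:
                in_scanner = True
            continue
        if line.startswith("@@"):
            in_hunk = True
            continue
        if in_hunk and in_scanner and line.startswith(("+", "-")):
            if not _is_cosmetic(line[1:]):
                return True
    return False
```

Docstring lines are treated like any other non-comment line, matching the stated rationale for not tracking triple-quote state.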
The false-completion-defense check credited a self-asserted
`false_completion: false` terminal flag, a `verifier_gaming` manifest
key, prose mentions, or an unreferenced gate script file — all claims a
loop makes about itself, not evidence. A temp dir with only
`.loop/terminal_state.json={"false_completion":false}` earned full
defense credit.
Grade the credit instead:
- invoked (full): a scripts/verify-* gate invokes a holdout/anti-cheat
gate on an executable line, or RUNLOG/.loop/receipts records a run.
- wired (half): a gate script exists and is referenced from the verify
surface but no run is recorded.
- none (zero): a bare terminal flag, prose, or unreferenced script.
examples/coverage-repair loses its self-asserted credit (90/strong →
76/ok) and gains an honest, actionable gap; README snippet updated to
the real output.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
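The three-level grading can be summarized in a few lines. In this sketch the three boolean inputs stand in for whatever evidence collection `scripts/inspect_loop.py` actually performs, and the numeric fractions are an illustrative reading of "full", "half", and "zero".

```python
def defense_grade(gate_invoked, run_recorded, gate_referenced):
    """Map evidence signals to (label, credit fraction).

    Self-asserted signals (terminal flag, manifest key, prose) are
    deliberately not parameters, so they cannot earn credit.
    """
    if gate_invoked or run_recorded:
        return "invoked", 1.0
    if gate_referenced:
        return "wired", 0.5
    return "none", 0.0
```

A loop with only `.loop/terminal_state.json` present has no evidence to pass in, so it grades as `("none", 0.0)`.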
state.json.tmpl and terminal_state.json.tmpl emitted schema_version: "1.0" while contract.py checks the schema key against loop-engineer/state@1 and loop-engineer/terminal@1; terminal_state.json.tmpl was a wholesale-obsolete shape missing criteria_met/false_completion/evidence/state. Rewrite both to the validator's real field names. Replace the STUB-marked verify scripts with real, dependency-free minimal gates so a fresh scaffold passes the product's own doctor.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Add python -m loop scaffold <dir>: copies templates/ into the standard repo-OS
layout with every {{PLACEHOLDER}} filled by an honest, valid default (goal
"REPLACE: one-line goal", empty-but-valid structures, project name from the target
dir), resolving templates/ relative to the package root so it works from an
editable install. It never writes terminal_state.json (written once at loop end)
and refuses to overwrite an existing contract dir.
Make validate_contract treat a missing terminal_state.json as valid-in-flight
when state.json declares terminal_state: null, so a fresh scaffold passes doctor
unedited; a state that names a terminal with the file missing still flags.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
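The valid-in-flight rule might be expressed as below. The function name and the `missing_terminal_state` issue code are hypothetical, and treating an absent `terminal_state` key as an issue is an assumption; the commit only specifies the explicit `null` case.

```python
import json
from pathlib import Path


def terminal_presence_issues(loop_dir):
    """Flag a missing terminal_state.json unless state.json says the loop is in flight."""
    loop_dir = Path(loop_dir)
    state = json.loads((loop_dir / "state.json").read_text())
    if (loop_dir / "terminal_state.json").exists():
        return []
    # An explicit null means the loop has not ended yet: valid in flight.
    if "terminal_state" in state and state["terminal_state"] is None:
        return []
    return ["missing_terminal_state"]
```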
A Succeeded terminal is the loop's strongest claim, but _validate_terminal accepted one with false_completion=true or with no met criterion. Add a cross-field check (runs in both validation modes) that emits a contradictory_terminal issue naming exactly what contradicts.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
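A cross-field check along these lines is sketched below. The field names `state`, `false_completion`, and `criteria_met` and the issue code `contradictory_terminal` come from the PR; the issue dictionary layout and the assumption that `criteria_met` is a mapping or list of booleans are illustrative only.

```python
def contradictory_terminal(terminal):
    """Return issues for a Succeeded terminal that contradicts itself."""
    if terminal.get("state") != "Succeeded":
        return []
    reasons = []
    if terminal.get("false_completion") is not False:
        reasons.append("false_completion is not false")
    criteria = terminal.get("criteria_met") or {}
    # Accept either a name -> bool mapping or a plain list of booleans.
    values = criteria.values() if isinstance(criteria, dict) else criteria
    if not any(v is True for v in values):
        reasons.append("no criteria_met entry is true")
    return [{"code": "contradictory_terminal", "detail": r} for r in reasons]
```

Each issue names one contradiction, so a terminal that fails both conditions yields two entries.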
…e (M1-SCHEMAS)
schemas/*.json were never loaded; doctor emitted schemas_checked implying
validation that never happened. Now: when jsonschema is importable, validate
the manifest/state/tasks/terminal artifacts against schemas/*.json (resolved
relative to the package repo root); otherwise fall back to the stdlib
structural hand checks. The report gains validation_mode ("jsonschema" |
"structural-fallback") stating what actually ran.
Reconcile the schema files with the real shipped contracts (examples/coverage-repair,
roadmap/v1.0) so both pass in BOTH modes -- narrowing over-required fields to
optional and widening a few types, each documented in the schema description.
Cross-field rules JSON Schema cannot express (terminal contradiction, task id
uniqueness, evidence-before-done) run in both modes.
Add a schemas extra (jsonschema) to pyproject and jsonschema to the CI install
so CI exercises the real-validation path.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
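The two validation paths and the honest `validation_mode` field can be sketched as follows. `jsonschema.Draft7Validator` and `iter_errors` are real library calls, but the choice of draft, the report layout, and the fallback that only checks `required` keys are assumptions; the actual structural hand checks in `loop/contract.py` are richer.

```python
def validate_artifact(instance, schema):
    """Validate with jsonschema when importable, else fall back to a structural check."""
    try:
        import jsonschema
    except ImportError:
        # Fallback: only verify that required top-level keys are present.
        missing = [k for k in schema.get("required", []) if k not in instance]
        return {
            "validation_mode": "structural-fallback",
            "issues": [f"missing required field: {k}" for k in missing],
        }
    validator = jsonschema.Draft7Validator(schema)
    return {
        "validation_mode": "jsonschema",
        "issues": [error.message for error in validator.iter_errors(instance)],
    }
```

Either way the report states which path ran, so a reader never has to infer whether schema validation happened.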
…e root .loop to validator shape
Scaffold rendered TASKS.json task status "todo", but tasks.schema.json's enum is [pending, active, blocked, done, abandoned], which is enforced via the jsonschema validation path in loop/contract.py. Emit "pending" instead so a fresh scaffold passes doctor unedited both with and without jsonschema.
Also migrate the repo-root .loop/ v0.3 dogfood contract in place to the validator's shapes (schema fields on state/tasks/terminal; terminal now carries state/criteria_met/evidence/false_completion; tasks map to done + real evidence) so `python -m loop doctor .loop` exits ok:true in both validation modes. (.loop/ is gitignored run telemetry, so the migration lives in the working tree.)
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
💡 Codex Review
Reviewed commit: 63c993deae
    target = m.group("b") or pending_minus
    in_self = bool(target) and _basename(target) == "anticheat_scan.py"
Flag scanner renames as self-edits
When a diff only renames scripts/anticheat_scan.py (for example to scripts/_disabled_scan.py), this logic keys self-edit detection solely off the +++ b/... path and then sees no added/removed hunk lines, while parse_changed_files() still excludes the old scanner basename from gate-tampering via _SELF_FILES. That means a pure scanner rename returns clean: true and can disable the anti-cheat gate without human review; please treat rename-from/rename-to metadata involving the scanner as a scanner_self_edit.
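One way to act on this suggestion is to scan git's extended diff headers, which emit `rename from <path>` and `rename to <path>` lines for renames. The sketch below is an assumption about how such a check could look, not the fix that landed; the function name is hypothetical and quoted paths with unusual characters are not handled.

```python
def scanner_renamed(diff_text, scanner="anticheat_scan.py"):
    """True if rename metadata in a git diff involves the scanner file."""
    for line in diff_text.splitlines():
        for prefix in ("rename from ", "rename to "):
            if line.startswith(prefix):
                path = line[len(prefix):].strip()
                # Compare basenames so any directory layout is covered.
                if path.rsplit("/", 1)[-1] == scanner:
                    return True
    return False
```

A pure rename has no hunk lines, so this check has to run independently of any added/removed-line scan.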
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        continue
    if any(token in stripped for token in _GATE_TOKENS):
Require real gate commands before invoked credit
When a verify-* script merely prints or documents a gate name on a non-comment line, such as echo "TODO: run holdout_gate.py", this substring check awards full false-completion defense (invoked) credit and can raise the inspector verdict to strong even though no holdout/anti-cheat gate ran. Since the new scoring is supposed to be based on invocation evidence rather than claims, please distinguish actual command invocations from echo/assignment/prose lines before granting full credit.
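A stricter check could tokenize each line and require the gate script to be the command itself, as in the sketch below. The script names come from this PR; the interpreter list and the function name are assumptions, and the sketch deliberately ignores pipelines, `&&` chains, environment-variable prefixes, and `python -m` invocations.

```python
import shlex

_GATE_SCRIPTS = ("holdout_gate.py", "anticheat_scan.py")
_INTERPRETERS = {"python", "python3", "bash", "sh"}


def invokes_gate(line):
    """True only when the line's command is a gate script, not a mention of one."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return False
    try:
        tokens = shlex.split(stripped, comments=True)
    except ValueError:
        # Unbalanced quotes: cannot be confident this is an invocation.
        return False
    if not tokens:
        return False
    # Skip one leading interpreter such as python3 or bash.
    index = 1 if tokens[0].rsplit("/", 1)[-1] in _INTERPRETERS else 0
    if index >= len(tokens):
        return False
    return tokens[index].rsplit("/", 1)[-1] in _GATE_SCRIPTS
```

Because `echo "TODO: run holdout_gate.py"` tokenizes to `echo` plus one quoted argument, it no longer counts as an invocation.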
Pull request overview
This PR hardens the “credibility slice” of the loop contract tooling by making success/verification claims mechanically enforceable: it tightens validator semantics (including optional real JSON-Schema validation), makes the inspector grade false-completion defense based on invocation evidence, hardens holdout/anti-cheat gates, adds a deterministic scaffold command, and corrects docs to avoid overstating shipped receipts.
Changes:
- Add optional jsonschema-backed validation with an explicit `validation_mode`, plus stronger cross-field/cross-task enforcement in the core contract validator.
- Update the inspector and gates to prevent self-asserted success/defense signals from receiving credit without evidence.
- Introduce a deterministic `python -m loop scaffold <dir>` path with templates aligned to the enforced contract shape, and add tests guarding docs honesty.
Reviewed changes
Copilot reviewed 26 out of 27 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| templates/verify-full.sh | Adjust verify-full template to compose verify-fast and remove stub markers. |
| templates/verify-fast.sh | Add a real fast check (contract files present) and remove stub markers. |
| templates/terminal_state.json.tmpl | Update terminal template to the new schema/key shape. |
| templates/state.json.tmpl | Update state template to use schema key (v1 shape). |
| scripts/test_scaffold.py | Add scaffold regression tests (doctor-clean output, layout, CLI). |
| scripts/test_loop_contract_core.py | Add tests for validation_mode, jsonschema enforcement, and terminal contradiction rules. |
| scripts/test_inspect_loop.py | Add tests for invocation-graded false-completion defense scoring. |
| scripts/test_holdout_gate.py | Add test for empty visible set returning NotReady. |
| scripts/test_docs_claims.py | Add guard test preventing docs from claiming shipped receipts that don’t exist. |
| scripts/test_anticheat_scan.py | Expand tests to require scanner self-edits be flagged for human review (non-cosmetic). |
| scripts/inspect_loop.py | Implement invocation/wiring/none grading for false-completion defense; update scoring output. |
| scripts/holdout_gate.py | Make empty visible set return NotReady (cannot certify). |
| scripts/anticheat_scan.py | Add scanner self-edit detection for non-cosmetic edits to the scanner source. |
| schemas/terminal.schema.json | Reconcile terminal schema required fields with real contracts; clarify description. |
| schemas/tasks.schema.json | Broaden task evidence type to allow arrays; clarify description. |
| schemas/state.schema.json | Narrow required fields and broaden types to match shipped contracts; clarify description. |
| schemas/manifest.schema.json | Broaden permissions item type; clarify description. |
| README.md | Document validation_mode and updated inspect scoring for the example. |
| pyproject.toml | Add [schemas] extra (jsonschema) and update optional-deps documentation comments. |
| loop/scaffold.py | Add scaffold implementation (template rendering + verify script installation). |
| loop/contract.py | Add contradiction checks, jsonschema validation mode, schema loading, and in-flight terminal handling. |
| loop/main.py | Add scaffold CLI subcommand and update usage. |
| examples/coverage-repair/WORKFLOW.md | Reword receipts language to mechanism description (no shipped receipts claim). |
| examples/coverage-repair/README.md | Reword receipts language to mechanism description (no shipped receipts claim). |
| CHANGELOG.md | Add Errata entry correcting prior receipts claim. |
| .gitignore | Ignore review/ and roadmap/ workbench directories. |
| .github/workflows/ci.yml | Install jsonschema in CI to exercise jsonschema-mode tests. |
    def _gate_run_recorded(paths) -> bool:
        """RUNLOG.md / .loop/receipts/*.jsonl record an actual gate run."""
        texts = [_read_text(paths.runlog)]
        receipts = paths.loop_dir / "receipts"
        if receipts.is_dir():
            texts.extend(_read_text(p) for p in sorted(receipts.glob("*.jsonl")))
        for text in texts:
            for line in text.splitlines():
                low = line.lower()
                if any(token in low for token in _GATE_TOKENS):
                    return True
                if ("holdout" in low or "anticheat" in low or "anti-cheat" in low) and any(
                    word in low for word in _GATE_RUN_WORDS
                ):
                    return True
        return False
    def _validation_mode() -> str:
        try:
            import jsonschema  # type: ignore  # noqa: F401
        except Exception:
            return "structural-fallback"
        return "jsonschema"

    def scaffold(target: str | Path) -> dict[str, Any]:
        """Write a fresh, doctor-clean repo-OS contract into ``target``.

        Refuses to overwrite an existing contract dir (a live loop owns its state).
        """

        target = Path(target)
        if target.exists() and _has_existing_contract(target):
            raise FileExistsError(f"contract already exists at {target}")
    # The core is pure-stdlib. Two optional extras enrich validation when present:
    #   yaml: PyYAML parses the manifest; absent, loop/contract.py falls back to
    #         a stdlib subset parser.
    #   schemas: jsonschema runs real JSON-Schema validation against schemas/*.json;
    #         absent, loop/contract.py falls back to structural hand checks.
    # So `pip install -e .` pulls in zero third-party runtime dependencies.
    [project.optional-dependencies]
    yaml = ["pyyaml>=6"]
    schemas = ["jsonschema>=4"]
…oval) Provenance note added: review/ and roadmap/ workbenches stay untracked; pointers are maintainer-facing.
M1 credibility slice of the v1.0 launch cut-line: make every "the loop proves its work" claim mechanically true, and stop the toolkit from accepting self-asserted success anywhere.
What changed
Validator (G1 + SCHEMAS): `loop/contract.py`
- A `Succeeded` terminal now requires `false_completion: false` AND at least one true `criteria_met` entry; contradictory terminals emit a doctor issue.
- Artifacts are validated against `schemas/*.json` via the new optional `[schemas]` extra (`pip install -e ".[schemas]"`); the report now carries an honest `validation_mode` field (`jsonschema` vs `structural-fallback`) instead of implying schema validation that wasn't running. A field-agreement test pins every schema `required` field to actual enforcement.

Inspector (INSPECT): `scripts/inspect_loop.py`
- False-completion defense credit is graded `invoked` (full) / `wired` (half) / `none` (zero). A bare `false_completion: false` flag or prose mention earns nothing.

Gates (G2 + G3)
- `holdout_gate`: an empty visible set now returns `NotReady` (symmetric with empty holdout); nothing was optimized against, so nothing can be certified.
- `anticheat_scan`: any non-cosmetic self-edit to the scanner's own source is flagged high for human review, closing the one-line `return False` self-neuter hole (conservative diff-layer invariant; documented rationale for not special-casing docstrings).

Scaffold + templates (TEMPLATES + SCAFFOLD)
- `python3 -m loop scaffold <dir>`: renders `templates/` with valid defaults; output passes `doctor` unedited (pinned test, both dependency modes).
- Templates rewritten to the validator's real field names (`schema` key, real terminal shape); scaffold task status emits the schema-valid `pending`.

Docs honesty (RECEIPTS)
- `examples/coverage-repair` and CHANGELOG no longer assert a receipts trail the frozen example doesn't ship; corrected via a CHANGELOG Errata section (history intact) and guarded by a new test.

Proof (archived in the launch loop workbench)
- `verify-full: PASS` (plugin validate --strict, doctor on the example and the dogfood `.loop`, self_eval 13/13, frontmatter 9/9).

One integration repair was needed after merging the six independently-built clusters (scaffold status enum vs the now-enforced schema enum): `63c993d`.