
fix(credibility): M1 slice — validator cross-checks, invocation-graded inspect, gate hardening, honest schemas + docs#8

Merged
SollanSystems merged 16 commits into main from launch/m1-credibility on Jul 2, 2026

Conversation

@SollanSystems

Owner

M1 credibility slice of the v1.0 launch cut-line: make every "the loop proves its work" claim mechanically true, and stop the toolkit from accepting self-asserted success anywhere.

What changed

Validator (G1 + SCHEMAS): loop/contract.py

  • A Succeeded terminal now requires false_completion: false AND at least one true criteria_met entry; contradictory terminals emit a doctor issue.
  • Real JSON-Schema validation against schemas/*.json via the new optional [schemas] extra (pip install -e ".[schemas]"); the report now carries an honest validation_mode field (jsonschema vs structural-fallback) instead of implying schema validation that wasn't running. A field-agreement test pins every schema required field to actual enforcement.

Inspector (INSPECT): scripts/inspect_loop.py

  • False-completion-defense credit is now graded on invocation evidence (a verify-script line or recorded RUNLOG/receipt run), not claims: invoked (full) / wired (half) / none (zero). A bare false_completion: false flag or prose mention earns nothing.
  • Consequence honestly documented: the flagship example's inspect score drops 90→76 ("strong"→"ok") until M2 wires a real gate invocation into it.

Gates (G2 + G3)

  • holdout_gate: an empty visible set now returns NotReady (symmetric with empty holdout) — nothing was optimized against, so nothing can be certified.
  • anticheat_scan: any non-cosmetic self-edit to the scanner's own source is flagged high for human review, closing the one-line return False self-neuter hole (conservative diff-layer invariant; documented rationale for not special-casing docstrings).

Scaffold + templates (TEMPLATES + SCAFFOLD)

  • New deterministic python3 -m loop scaffold <dir>: renders templates/ with valid defaults; output passes doctor unedited (pinned test, both dependency modes).
  • Templates aligned with the validator's real shape (schema key, real terminal shape); scaffold task status emits the schema-valid pending.

Docs honesty (RECEIPTS)

  • examples/coverage-repair and CHANGELOG no longer assert a receipts trail the frozen example doesn't ship; corrected via a CHANGELOG Errata section (history intact) and guarded by a new test.

Proof (archived in the launch loop workbench)

  • Mechanical fail-before/pass-after: pre-M1 tree + only the six test files → 7 FAILED; this branch → 7 passed. Every pinned test exercises new behavior, none assert the status quo.
  • Full suite: 118 passed + 2 skipped (stdlib-only) / 120 passed (with jsonschema).
  • verify-full: PASS — plugin validate --strict, doctor on the example and the dogfood .loop, self_eval 13/13, frontmatter 9/9.

One integration repair was needed after merging the six independently-built clusters (scaffold status enum vs the now-enforced schema enum): 63c993d.

SollanSystems and others added 15 commits July 1, 2026 21:20
An empty visible set made _all_passed([]) vacuously True, so
decide([], [green_holdout]) wrongly certified Succeeded. Add a
symmetric guard mirroring the empty-holdout case: no visible
checks means nothing was optimized against, so the gate is NotReady.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
examples/coverage-repair ships no receipts trail, but WORKFLOW.md,
README.md, and the CHANGELOG 0.3.4 note asserted it "records receipts at
.loop/receipts/*.jsonl". Soften the two example docs to describe the
mechanism ("a live run appends receipts to ...; this frozen example ships
the contract artifacts, not a receipts trail") and add a dated ## Errata
to CHANGELOG correcting the claim without rewriting 0.3.4 history.

Add scripts/test_docs_claims.py: a behavioral guard that flags any
present-tense "records receipts"/"receipts land" assertion adjacent to the
.loop/receipts glob and requires the referenced example to actually ship
receipt files (changelog history exonerated by a receipts Errata).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
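
A rough sketch of the kind of guard this commit describes, under simplifying assumptions: "adjacent" is reduced to "on the same line", the Errata exoneration for changelog history is omitted, and the function name `overclaims` is hypothetical rather than taken from scripts/test_docs_claims.py.

```python
import re
from pathlib import Path

# Present-tense assertions that a receipts trail exists.
_CLAIM = re.compile(r"\b(records receipts|receipts land)\b", re.IGNORECASE)
_RECEIPTS_GLOB = ".loop/receipts"


def overclaims(doc_text: str, example_dir: Path) -> list[str]:
    """Return doc lines asserting a receipts trail the example does not ship."""
    receipts_dir = example_dir / ".loop" / "receipts"
    ships_receipts = (
        receipts_dir.is_dir()
        and next(receipts_dir.glob("*.jsonl"), None) is not None
    )
    if ships_receipts:
        return []  # the claim is backed by real files
    return [
        line
        for line in doc_text.splitlines()
        if _CLAIM.search(line) and _RECEIPTS_GLOB in line
    ]
```

A mechanism description such as "a live run appends receipts to .loop/receipts/*.jsonl" passes, because it does not match the present-tense claim pattern.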
The collection-shape and severity-mapping self-checks only catch two
known shapes; a diff inserting `return False` into _is_gate_path's body
rewrote the decision logic itself and certified clean:true. Add a
diff-layer invariant: any hunk touching anticheat_scan.py that adds or
removes a non-comment, non-blank line is a scanner_self_edit finding
(high -> FailedUnverifiable). Cosmetic-only edits stay clean. Docstrings
are deliberately not exempted — a triple-quote-state heuristic would
itself be a bypass vector, and a false-positive on maintenance is
correct: scanner maintenance should get human eyes.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
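
The diff-layer invariant can be illustrated with a small unified-diff walker. This is a sketch under stated assumptions, not the scanner's implementation: the function name `scanner_self_edit` and the hunk-counting approach are illustrative, and only plain `---`/`+++`/`@@` headers are handled.

```python
import re

_HUNK = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")


def _header_path(line: str) -> str:
    # "+++ b/scripts/x.py" -> "b/scripts/x.py"
    return line[4:].split("\t")[0].strip()


def scanner_self_edit(diff_text: str, scanner: str = "anticheat_scan.py") -> bool:
    """True if the diff adds or removes a non-comment, non-blank scanner line."""
    in_scanner = False
    pending_minus = ""
    old_left = new_left = 0  # lines still expected in the current hunk
    for line in diff_text.splitlines():
        if old_left <= 0 and new_left <= 0:
            # Outside a hunk: only file headers and hunk headers matter.
            if line.startswith("--- "):
                pending_minus = _header_path(line)
            elif line.startswith("+++ "):
                target = _header_path(line)
                if target == "/dev/null":  # deletion: fall back to the old path
                    target = pending_minus
                in_scanner = target.rsplit("/", 1)[-1] == scanner
            else:
                m = _HUNK.match(line)
                if m:
                    old_left = int(m.group(1) or 1)
                    new_left = int(m.group(2) or 1)
            continue
        tag, body = line[:1], line[1:]
        if tag == "\\":  # "\ No newline at end of file"
            continue
        if tag == "+":
            new_left -= 1
        elif tag == "-":
            old_left -= 1
        else:  # context line counts against both sides
            old_left -= 1
            new_left -= 1
            continue
        text = body.strip()
        # Docstrings are deliberately not exempted, matching the rationale above.
        if in_scanner and text and not text.startswith("#"):
            return True
    return False
```

Counting hunk lines from the `@@` header avoids mistaking a removed line that begins with `-- ` for a file header.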
The false-completion-defense check credited a self-asserted
`false_completion: false` terminal flag, a `verifier_gaming` manifest
key, prose mentions, or an unreferenced gate script file — all claims a
loop makes about itself, not evidence. A temp dir with only
`.loop/terminal_state.json={"false_completion":false}` earned full
defense credit.

Grade the credit instead:
- invoked (full): a scripts/verify-* gate invokes a holdout/anti-cheat
  gate on an executable line, or RUNLOG/.loop/receipts records a run.
- wired (half): a gate script exists and is referenced from the verify
  surface but no run is recorded.
- none (zero): a bare terminal flag, prose, or unreferenced script.

examples/coverage-repair loses its self-asserted credit (90/strong →
76/ok) and gains an honest, actionable gap; README snippet updated to
the real output.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
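
The three-level grading above reduces to a small decision function. The sketch below is illustrative only: the `Evidence` fields, the function name `defense_credit`, and the 10-point maximum are assumptions, not the inspector's real names or weights.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    verify_invokes_gate: bool = False  # a scripts/verify-* line runs a gate
    run_recorded: bool = False         # RUNLOG or .loop/receipts records a run
    gate_referenced: bool = False      # gate script exists and is referenced
    self_asserted_flag: bool = False   # false_completion: false; ignored on purpose


def defense_credit(ev: Evidence, max_points: float = 10.0) -> tuple[str, float]:
    """Grade false-completion defense on invocation evidence, not claims."""
    if ev.verify_invokes_gate or ev.run_recorded:
        return "invoked", max_points
    if ev.gate_referenced:
        return "wired", max_points / 2
    # A bare terminal flag, prose, or an unreferenced script earns nothing.
    return "none", 0.0
```

The self-asserted flag is carried in the record but never read by the grader, which is the point of the change.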
state.json.tmpl and terminal_state.json.tmpl emitted schema_version: "1.0"
while contract.py checks the schema key against loop-engineer/state@1 and
loop-engineer/terminal@1; terminal_state.json.tmpl was a wholesale-obsolete
shape missing criteria_met/false_completion/evidence/state. Rewrite both to the
validator's real field names. Replace the STUB-marked verify scripts with real,
dependency-free minimal gates so a fresh scaffold passes the product's own doctor.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Add python -m loop scaffold <dir>: copies templates/ into the standard repo-OS
layout with every {{PLACEHOLDER}} filled by an honest, valid default (goal
REPLACE: one-line goal, empty-but-valid structures, project name from the target
dir), resolving templates/ relative to the package root so it works from an
editable install. It never writes terminal_state.json (written once at loop end)
and refuses to overwrite an existing contract dir.
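
The placeholder-filling step might look roughly like this. It is a sketch under assumptions: the placeholder names `PROJECT_NAME` and `GOAL`, the one-line template, and the `render` helper are hypothetical, and the real loop/scaffold.py may differ.

```python
import json
import re

_PLACEHOLDER = re.compile(r"\{\{([A-Z0-9_]+)\}\}")


def render(template: str, defaults: dict[str, str]) -> str:
    """Fill every {{PLACEHOLDER}}; fail loudly if one has no default."""
    def fill(match: re.Match) -> str:
        name = match.group(1)
        if name not in defaults:
            raise KeyError(f"no default for placeholder {name}")
        return defaults[name]

    return _PLACEHOLDER.sub(fill, template)


# Hypothetical template and defaults, for illustration only.
TEMPLATE = '{"project": "{{PROJECT_NAME}}", "goal": "{{GOAL}}", "terminal_state": null}'
rendered = render(TEMPLATE, {"PROJECT_NAME": "demo", "GOAL": "REPLACE: one-line goal"})
state = json.loads(rendered)  # the rendered output is valid JSON
```

Raising on an unknown placeholder keeps the output deterministic: a template can never ship with an unfilled `{{...}}` marker.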

Make validate_contract treat a missing terminal_state.json as valid-in-flight
when state.json declares terminal_state: null, so a fresh scaffold passes doctor
unedited; a state that names a terminal with the file missing still flags.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
A Succeeded terminal is the loop's strongest claim, but _validate_terminal
accepted one with false_completion=true or with no met criterion. Add a
cross-field check (runs in both validation modes) that emits a
contradictory_terminal issue naming exactly what contradicts.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
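
A minimal sketch of the cross-field rule, assuming the terminal is a dict whose `state` key holds the terminal name and whose `criteria_met` is either a mapping of criterion to boolean or a list of booleans. The function name and message strings are illustrative, not those of `_validate_terminal`.

```python
from typing import Any


def contradictory_terminal(terminal: dict[str, Any]) -> list[str]:
    """Name each way a Succeeded terminal contradicts itself."""
    if terminal.get("state") != "Succeeded":
        return []  # the rule only constrains the strongest claim
    contradictions = []
    if terminal.get("false_completion") is not False:
        contradictions.append("Succeeded requires false_completion: false")
    criteria = terminal.get("criteria_met") or []
    values = criteria.values() if isinstance(criteria, dict) else criteria
    if not any(v is True for v in values):
        contradictions.append("Succeeded requires at least one true criteria_met entry")
    return contradictions
```

Because the check needs no schema library, it can run in both validation modes.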
…e (M1-SCHEMAS)

schemas/*.json were never loaded; doctor emitted schemas_checked implying
validation that never happened. Now: when jsonschema is importable, validate
the manifest/state/tasks/terminal artifacts against schemas/*.json (resolved
relative to the package repo root); otherwise fall back to the stdlib
structural hand checks. The report gains validation_mode ("jsonschema" |
"structural-fallback") stating what actually ran.

Reconcile the schema files with the real shipped contracts (examples/coverage-repair,
roadmap/v1.0) so both pass in BOTH modes -- narrowing over-required fields to
optional and widening a few types, each documented in the schema description.
Cross-field rules JSON Schema cannot express (terminal contradiction, task id
uniqueness, evidence-before-done) run in both modes.

Add a schemas extra (jsonschema) to pyproject and jsonschema to the CI install
so CI exercises the real-validation path.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
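
The two-mode behavior can be sketched as a single function that reports which path actually ran. This is an illustration, not loop/contract.py: the `validate` name and report shape are assumptions, and the fallback here checks only `required` keys, which is far less than the real structural hand checks.

```python
from typing import Any


def validate(instance: dict[str, Any], schema: dict[str, Any]) -> dict[str, Any]:
    """Validate with jsonschema when importable, else a structural fallback."""
    try:
        from jsonschema import Draft202012Validator
    except Exception:
        # Fallback: enforce only the schema's required keys.
        missing = [k for k in schema.get("required", []) if k not in instance]
        return {
            "validation_mode": "structural-fallback",
            "issues": [f"missing required field: {k}" for k in missing],
        }
    errors = Draft202012Validator(schema).iter_errors(instance)
    return {"validation_mode": "jsonschema", "issues": [e.message for e in errors]}
```

Either way the report states the mode, so a reader never has to infer whether real schema validation happened.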
…e root .loop to validator shape

Scaffold rendered TASKS.json task status "todo", but tasks.schema.json's
enum is [pending, active, blocked, done, abandoned] — enforced via the
jsonschema validation path in loop/contract.py. Emit "pending" instead so
a fresh scaffold passes doctor unedited both with and without jsonschema.

Also migrate the repo-root .loop/ v0.3 dogfood contract in place to the
validator's shapes (schema fields on state/tasks/terminal; terminal now
carries state/criteria_met/evidence/false_completion; tasks map to done +
real evidence) so `python -m loop doctor .loop` exits ok:true in both
validation modes. (.loop/ is gitignored run telemetry, so the migration
lives in the working tree.)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings July 2, 2026 17:00

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 63c993deae


Comment thread scripts/anticheat_scan.py
Comment on lines +331 to +332
target = m.group("b") or pending_minus
in_self = bool(target) and _basename(target) == "anticheat_scan.py"


[P1] Flag scanner renames as self-edits

When a diff only renames scripts/anticheat_scan.py (for example to scripts/_disabled_scan.py), this logic keys self-edit detection solely off the +++ b/... path and then sees no added/removed hunk lines, while parse_changed_files() still excludes the old scanner basename from gate-tampering via _SELF_FILES. That means a pure scanner rename returns clean: true and can disable the anti-cheat gate without human review; please treat rename-from/rename-to metadata involving the scanner as a scanner_self_edit.
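
One way the suggested check could look, as a sketch: git's extended diff headers include `rename from <path>` and `rename to <path>` lines, so a scan for those lines involving the scanner's basename would catch a pure rename. The function name `scanner_renamed` is hypothetical, and quoted paths are not handled.

```python
def scanner_renamed(diff_text: str, scanner: str = "anticheat_scan.py") -> bool:
    """True if git rename metadata moves the scanner file in either direction."""
    for line in diff_text.splitlines():
        for prefix in ("rename from ", "rename to "):
            # Hunk lines start with ' ', '+' or '-', so they cannot match here.
            if line.startswith(prefix):
                path = line[len(prefix):].strip()
                if path.rsplit("/", 1)[-1] == scanner:
                    return True
    return False
```

A hit would then be reported as a `scanner_self_edit` finding, as the review comment requests.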


Comment thread scripts/inspect_loop.py
stripped = line.strip()
if not stripped or stripped.startswith("#"):
    continue
if any(token in stripped for token in _GATE_TOKENS):


[P2] Require real gate commands before invoked credit

When a verify-* script merely prints or documents a gate name on a non-comment line, such as echo "TODO: run holdout_gate.py", this substring check awards full false-completion defense (invoked) credit and can raise the inspector verdict to strong even though no holdout/anti-cheat gate ran. Since the new scoring is supposed to be based on invocation evidence rather than claims, please distinguish actual command invocations from echo/assignment/prose lines before granting full credit.
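
A possible shape for that distinction, sketched with the standard library's `shlex`: tokenize the line and grant credit only when the command word, or the script passed to an interpreter, is a gate script. The script list, interpreter set, and `invokes_gate` name are assumptions; the contents of `_GATE_TOKENS` are not shown in this PR.

```python
import shlex

# Hypothetical values for illustration.
_GATE_SCRIPTS = ("holdout_gate.py", "anticheat_scan.py")
_INTERPRETERS = {"python", "python3", "bash", "sh"}


def invokes_gate(line: str) -> bool:
    """True only when the line's command actually runs a gate script."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return False
    try:
        words = shlex.split(stripped, comments=True)
    except ValueError:  # unbalanced quotes: treat as not an invocation
        return False
    if not words:
        return False
    candidates = [words[0]]
    if words[0].rsplit("/", 1)[-1] in _INTERPRETERS:
        # The script is the first non-option argument after the interpreter.
        candidates = [w for w in words[1:] if not w.startswith("-")][:1]
    return any(c.rsplit("/", 1)[-1] in _GATE_SCRIPTS for c in candidates)
```

With this check, `echo "TODO: run holdout_gate.py"` and `GATE=holdout_gate.py` earn nothing, while `python3 scripts/holdout_gate.py` counts as an invocation.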


Copilot AI left a comment


Pull request overview

This PR hardens the “credibility slice” of the loop contract tooling by making success/verification claims mechanically enforceable: it tightens validator semantics (including optional real JSON-Schema validation), makes the inspector grade false-completion defense based on invocation evidence, hardens holdout/anti-cheat gates, adds a deterministic scaffold command, and corrects docs to avoid overstating shipped receipts.

Changes:

  • Add optional jsonschema-backed validation with an explicit validation_mode, plus stronger cross-field/cross-task enforcement in the core contract validator.
  • Update the inspector and gates to prevent self-asserted success/defense signals from receiving credit without evidence.
  • Introduce a deterministic python -m loop scaffold <dir> path with templates aligned to the enforced contract shape, and add tests guarding docs honesty.

Reviewed changes

Copilot reviewed 26 out of 27 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| templates/verify-full.sh | Adjust verify-full template to compose verify-fast and remove stub markers. |
| templates/verify-fast.sh | Add a real fast check (contract files present) and remove stub markers. |
| templates/terminal_state.json.tmpl | Update terminal template to the new schema/key shape. |
| templates/state.json.tmpl | Update state template to use schema key (v1 shape). |
| scripts/test_scaffold.py | Add scaffold regression tests (doctor-clean output, layout, CLI). |
| scripts/test_loop_contract_core.py | Add tests for validation_mode, jsonschema enforcement, and terminal contradiction rules. |
| scripts/test_inspect_loop.py | Add tests for invocation-graded false-completion defense scoring. |
| scripts/test_holdout_gate.py | Add test for empty visible set returning NotReady. |
| scripts/test_docs_claims.py | Add guard test preventing docs from claiming shipped receipts that don’t exist. |
| scripts/test_anticheat_scan.py | Expand tests to require scanner self-edits be flagged for human review (non-cosmetic). |
| scripts/inspect_loop.py | Implement invocation/wiring/none grading for false-completion defense; update scoring output. |
| scripts/holdout_gate.py | Make empty visible set return NotReady (cannot certify). |
| scripts/anticheat_scan.py | Add scanner self-edit detection for non-cosmetic edits to the scanner source. |
| schemas/terminal.schema.json | Reconcile terminal schema required fields with real contracts; clarify description. |
| schemas/tasks.schema.json | Broaden task evidence type to allow arrays; clarify description. |
| schemas/state.schema.json | Narrow required fields and broaden types to match shipped contracts; clarify description. |
| schemas/manifest.schema.json | Broaden permissions item type; clarify description. |
| README.md | Document validation_mode and updated inspect scoring for the example. |
| pyproject.toml | Add [schemas] extra (jsonschema) and update optional-deps documentation comments. |
| loop/scaffold.py | Add scaffold implementation (template rendering + verify script installation). |
| loop/contract.py | Add contradiction checks, jsonschema validation mode, schema loading, and in-flight terminal handling. |
| loop/main.py | Add scaffold CLI subcommand and update usage. |
| examples/coverage-repair/WORKFLOW.md | Reword receipts language to mechanism description (no shipped receipts claim). |
| examples/coverage-repair/README.md | Reword receipts language to mechanism description (no shipped receipts claim). |
| CHANGELOG.md | Add Errata entry correcting prior receipts claim. |
| .gitignore | Ignore review/ and roadmap/ workbench directories. |
| .github/workflows/ci.yml | Install jsonschema in CI to exercise jsonschema-mode tests. |


Comment thread scripts/inspect_loop.py
Comment on lines +190 to +205
def _gate_run_recorded(paths) -> bool:
    """RUNLOG.md / .loop/receipts/*.jsonl record an actual gate run."""
    texts = [_read_text(paths.runlog)]
    receipts = paths.loop_dir / "receipts"
    if receipts.is_dir():
        texts.extend(_read_text(p) for p in sorted(receipts.glob("*.jsonl")))
    for text in texts:
        for line in text.splitlines():
            low = line.lower()
            if any(token in low for token in _GATE_TOKENS):
                return True
            if ("holdout" in low or "anticheat" in low or "anti-cheat" in low) and any(
                word in low for word in _GATE_RUN_WORDS
            ):
                return True
    return False
Comment thread loop/contract.py
Comment on lines +276 to +282
def _validation_mode() -> str:
    try:
        import jsonschema  # type: ignore  # noqa: F401
    except Exception:
        return "structural-fallback"
    return "jsonschema"

Comment thread loop/scaffold.py
Comment on lines +101 to +109
def scaffold(target: str | Path) -> dict[str, Any]:
    """Write a fresh, doctor-clean repo-OS contract into ``target``.

    Refuses to overwrite an existing contract dir (a live loop owns its state).
    """

    target = Path(target)
    if target.exists() and _has_existing_contract(target):
        raise FileExistsError(f"contract already exists at {target}")
Comment thread pyproject.toml
Comment on lines +15 to +23
# The core is pure-stdlib. Two optional extras enrich validation when present:
# yaml — PyYAML parses the manifest; absent, loop/contract.py falls back to
# a stdlib subset parser.
# schemas — jsonschema runs real JSON-Schema validation against schemas/*.json;
# absent, loop/contract.py falls back to structural hand checks.
# So `pip install -e .` pulls in zero third-party runtime dependencies.
[project.optional-dependencies]
yaml = ["pyyaml>=6"]
schemas = ["jsonschema>=4"]
…oval)

Provenance note added: review/ and roadmap/ workbenches stay untracked; pointers are maintainer-facing.
@SollanSystems SollanSystems merged commit 70408d6 into main Jul 2, 2026
4 checks passed
@SollanSystems SollanSystems deleted the launch/m1-credibility branch July 2, 2026 18:57