Releases: SollanSystems/loop-engineer
loop-engineer v0.6.0 — metrics real
"Metrics real" — false-completion-rate and repair-productivity graduate from claims to derivations, red-teamed before release.
loop metrics <loop-dir> derives FCR and RP from a loop's real on-disk evidence — RUNLOG claims × verify bundles, held-out verdicts, repair records, receipts — never from agent narration. FCR is computed two independent ways and disagreement is surfaced, not resolved; unmatched success claims fail closed. Every number ships with a provenance block naming its input files.
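The two-derivation idea can be sketched as follows. This is a minimal illustration, not the tool's implementation: it assumes FCR is the fraction of success claims that lack backing evidence, and the Claim type, the function names, and the shapes of green_bundles and heldout_pass are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    # Hypothetical shape of one RUNLOG success claim.
    task_id: str
    iteration: int


def fcr_from_bundles(claims, green_bundles):
    """Method A: a claim is false unless a green verify bundle matches it.

    green_bundles is a set of (task_id, iteration) pairs. A claim with no
    matching bundle counts as a false completion (fail closed).
    """
    if not claims:
        return None  # vacuous run: no rate to report
    false = sum(1 for c in claims if (c.task_id, c.iteration) not in green_bundles)
    return false / len(claims)


def fcr_from_verdicts(claims, heldout_pass):
    """Method B: a claim is false unless its task's held-out verdict passed.

    heldout_pass maps task_id -> bool; a missing verdict is treated as a fail.
    """
    if not claims:
        return None
    false = sum(1 for c in claims if not heldout_pass.get(c.task_id, False))
    return false / len(claims)


def derive_fcr(claims, green_bundles, heldout_pass):
    """Report both derivations and whether they agree; never pick a winner."""
    a = fcr_from_bundles(claims, green_bundles)
    b = fcr_from_verdicts(claims, heldout_pass)
    return {"by_bundles": a, "by_verdicts": b, "agree": a == b}
```

Returning both values plus an agree flag, rather than averaging or preferring one, mirrors the "surfaced, not resolved" behaviour described above.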
loop metrics --baseline publishes docs/metrics-baseline.json and refuses over anything not genuinely gate-backed: a structurally-valid held-out verdict artifact is mandatory, rejected/unanchored repair records block it, disagreeing FCR methods block it, and a vacuous zero-claim run cannot baseline.
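As a rough sketch of the refusal logic, the four blocking conditions listed above could be collected like this. The report keys are hypothetical stand-ins, not the real metrics schema.

```python
def baseline_blockers(report):
    """Return the reasons a baseline must be refused; an empty list means publishable.

    report is a plain dict with hypothetical keys, not the tool's real schema.
    Missing keys are read pessimistically so the check fails closed.
    """
    blockers = []
    if not report.get("heldout_verdict_valid", False):
        blockers.append("missing or structurally invalid held-out verdict")
    if report.get("rejected_repairs", 0) or report.get("unanchored_repairs", 0):
        blockers.append("rejected or unanchored repair records")
    if not report.get("fcr_methods_agree", False):
        blockers.append("FCR methods disagree")
    if report.get("claim_count", 0) == 0:
        blockers.append("vacuous run: zero claims")
    return blockers
```

Collecting every blocker instead of stopping at the first one lets a refused baseline explain itself in a single pass.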
Published baseline (gate-backed examples/coverage-repair): FCR 0.0 · RP 1.0 — reproducible with python3 -m loop metrics examples/coverage-repair; a test binds the README literals to the committed JSON.
Honesty invariants:
- productive is recomputed from each record's own evidence, never trusted; repair records must anchor to a same-task red→green verify-bundle pair.
- Claim semantics are outcome-class aware: completion-class claims (task_passed/succeeded/terminal) require every attached bundle green, no exceptions; progress-class (advanced) tolerates a red intermediate only if the same task goes green in a strictly later iteration.
- Canonical record schemas (loop-engineer/repair@1, loop-engineer/rollout@1) end the two-shapes-called-"the repair record" ambiguity.
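The outcome-class rule in the second invariant can be sketched as below. The outcome names come from the notes above; the function name, the attached and history shapes, and the choice to fail closed on a claim with no bundles are assumptions made for illustration.

```python
COMPLETION_CLASS = {"task_passed", "succeeded", "terminal"}
PROGRESS_CLASS = {"advanced"}


def claim_is_backed(outcome, task_id, iteration, attached, history):
    """Decide whether a claim is backed by its verify bundles.

    attached: list of bools, one per bundle attached to this claim (True = green).
    history:  list of (task_id, iteration, green) tuples for the whole loop.
    Both shapes are hypothetical.
    """
    if not attached:
        return False  # assumption: a claim with no evidence fails closed
    if outcome in COMPLETION_CLASS:
        # Completion-class: every attached bundle must be green, no exceptions.
        return all(attached)
    if outcome in PROGRESS_CLASS:
        if all(attached):
            return True
        # A red intermediate is tolerated only if the SAME task goes green
        # in a STRICTLY later iteration.
        return any(t == task_id and i > iteration and green for t, i, green in history)
    return False  # unknown outcome class: fail closed
```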
Red-teamed before merge: two adversarial review rounds confirmed 17 issues in the metrics implementation — including a --baseline that would have published a clean FCR over a run its own gate flagged — all fixed and pinned as regression tests. Documented residual: a committed verdict is evidence, not proof; tamper detection belongs to the anti-cheat layer.
Also: loop console script (pip install -e .), doctor reports validated record schemas, 217-test suite (+67).
Full details in CHANGELOG.md.
loop-engineer v0.5.0 — credibility enforcement + first screen
The two pre-launch milestones of the v1.0 roadmap, landed together.
Enforce the wedge — false-completion defense is now enforced by validators, not asserted by docs:
- A Succeeded terminal no longer validates with false_completion: true or an empty/false criteria_met; the inspector grades false-completion defense on invocation evidence, never a self-asserted flag (#8).
- The held-out gate returns NotReady on an empty visible set; the anti-cheat scanner detects edits that neuter its own gate-decision functions and reports gate tampering with a distinct exit code (#8).
- examples/coverage-repair runs end-to-end through the real held-out gate: its false_completion: false is backed by a committed gate verdict, not a hand-set flag (#9).
- The repo's own live contract passes its own gate: python3 -m loop doctor .loop → ok (#8).
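A validator for the first rule might look roughly like this. It is a sketch only: the record is a plain dict, the terminal_state key name is an assumption, and only the false_completion and criteria_met checks named above are modelled.

```python
def validate_terminal(terminal):
    """Return a list of errors for a terminal record (hypothetical dict shape)."""
    errors = []
    if terminal.get("terminal_state") != "Succeeded":
        return errors  # this sketch only checks the Succeeded case
    if terminal.get("false_completion") is True:
        errors.append("Succeeded with false_completion: true")
    # Empty list, empty dict, False, and a missing key are all rejected.
    if not terminal.get("criteria_met"):
        errors.append("Succeeded with empty or false criteria_met")
    return errors
```

Under these assumptions, a record that merely asserts success without naming any satisfied criteria is rejected rather than waved through.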
First screen — a stranger gets it in 10 seconds, scores a loop in 30:
- README rebuilt: tagline, concrete failure modes, zero-install first command, stack diagram, comparison table (#7, #12).
- Weak→strong demo filmed live on the real tools (docs/demo.gif): a self-asserted DIY loop scores 0/weak, the gate-backed example 90/strong (#13).
- New loop scaffold command + JSON Schemas for the contract artifacts; promised templates shipped (#8, #11).
- CLI polish: --help/--version, distinct errors, explicit exit codes (#10).
Plus: the v1.0 master roadmap and four strategic design specs are now committed under docs/superpowers/ (#15).
Version jumps 0.3.4 → 0.5.0 to match the roadmap's milestone numbering; there is no 0.4.x tag.
Full details in CHANGELOG.md.
v0.3.4 — dogfood-driven hardening + portable core
Dogfood-driven hardening: ran loop-inspector + loop-runtime-monitor against 9 real
on-disk loops (foreign and in-house). The tools had been built and tested only against
this suite's own well-formed loops, so first contact with foreign/edge-case inputs exposed
six defects — all fixed here under TDD, each pinned by a regression test.
Fixed
- (P1) inspect_loop no longer crashes on a malformed manifest.yaml. read_manifest
  (loop/contract.py) ran yaml.safe_load without a guard (the one read path missing the
  json.JSONDecodeError guard every JSON read already had), so a malformed manifest in an
  untrusted/foreign loop dir killed the inspector with a traceback instead of returning a
  report. It now fails safe to {}, fixing the crash for inspect_loop, validate_contract,
  and doctor_report at once.
- inspect_loop now scores SPEC.md / WORKFLOW.md / TASKS.json dual-location (.loop/
  ∪ workspace root), like manifest/state already resolved. Previously SPEC/WORKFLOW were
  hard-coded to the workspace root, so a loop whose contract lives under .loop/ (including
  loop-engineer's own repo) was falsely scored as having "no success criteria" / "no
  independent verification." Scores on substance, not on where the file sits.
- inspect_loop recognizes a single-file loop-contract.md as a contract-owned source
  for success criteria, approval gates, plan-then-execute, and terminal-state coverage: a
  committed minimal-contract loop that names all 7 terminal states is no longer scored 0/7.
- runtime_monitor is terminal-state-aware. It now reads terminal_state / state == "terminal"
  and reports recommendation: "done" (surfacing the terminal state) instead of
  advising continue on a loop that has already finished.
- runtime_monitor no longer reports an unparseable RUNLOG as healthy. A non-empty
  RUNLOG that yields zero parseable iteration records now returns status: "degraded" /
  recommendation: "replan" (with evidence) instead of the benign ok/continue/[] that
  was byte-identical to a healthy loop, making the silent inertness of stall/repair-churn
  detection on prose RUNLOGs visible.
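The fail-safe read described in the first fix can be sketched as follows. This is not the code in loop/contract.py: the function name is hypothetical, and for brevity the sketch returns {} when PyYAML is absent, whereas the real core is described as staying pure-stdlib with PyYAML only as an optional extra.

```python
from pathlib import Path

try:
    import yaml  # optional extra; this sketch degrades to {} without it
except ImportError:
    yaml = None


def read_manifest_safe(path):
    """Return the parsed manifest mapping, or {} if it is missing or malformed."""
    if yaml is None:
        return {}
    try:
        data = yaml.safe_load(Path(path).read_text(encoding="utf-8"))
    except (OSError, UnicodeDecodeError, yaml.YAMLError):
        # Unreadable or malformed input must not crash the caller.
        return {}
    # A manifest that parses to a scalar or a list is also treated as empty.
    return data if isinstance(data, dict) else {}
```

Because every caller receives a dict, a report can still be produced for an untrusted loop directory, with the missing manifest fields showing up as gaps instead of a traceback.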
Changed
- Removed the unreferenced broad-substring corpus scoring path from scripts/inspect_loop.py
  (_gather_corpus, _walk_bounded, _evaluate_checks, _terminal_states_covered), dead
  code since the keyword-stuffing fix replaced it with the typed-contract path. Corrected
  loop-inspector/SKILL.md and reference/patterns.md §4 to describe the actual named,
  typed, dual-located contract file set the inspector reads, rather than a "reads any foreign
  harness shape semantically" claim the implementation never honored.
Added
- pyproject.toml: the portable core is now installable with pip install -e .
  (optional pip install -e ".[yaml]" for faster manifest parsing), so
  python3 -m loop doctor|inspect <workspace> runs from any directory rather than only the
  repo root. The core stays pure-stdlib; PyYAML remains an optional extra. A new
  test_docs_version check pins the pyproject.toml version to .claude-plugin/plugin.json.
Documentation
- README: the Portable validator / inspector section documents the editable install for
  running outside the repo root; the 30-second inspect demo now shows the full
  target/present/gaps report; the doctor block notes the omitted paths object;
  validate/verify are documented as doctor aliases; terminal_state.json is noted as
  resolving in either .loop/ or the workspace root.
- examples/coverage-repair records receipts at the canonical .loop/receipts/*.jsonl (was the
  stale pre-decoupling .gsd/audit/receipts/ path, inconsistent with the example's own .loop/
  layout).
- loop-runtime-monitor/SKILL.md frames its position generically ("vs a loop-driving operator")
  instead of naming a private plugin agent.
Erratum (2026-06-30): the Documentation note above overstated examples/coverage-repair — the frozen example ships contract artifacts, not a receipts trail. Corrected in the CHANGELOG Errata section; as of the M2 launch slice the example is fully runnable with a committed real holdout-gate verdict.
v0.3.3 — citation fixes
Changed
- Citation accuracy: corrected three over-reaching attributions to real sources
(no citations removed, no IDs changed). The "A/B trigger policy / cost-benefit
knob" and "cuts wasted edits" are reframed as this suite's own design choices
rather than PreFlect (arXiv 2602.07187) findings — PreFlect reflects on every
plan unconditionally and reports no edit-efficiency metric. The "repo-native
run-ledger over a vendor eval UI" is attributed to this suite as its answer to
the open challenge posed by Code as Agent Harness (arXiv 2605.18747), not as
that paper's claim.
Fixed
- Standalone scripts now resolve the loop package when run by path. The
  documented invocations python3 scripts/runtime_monitor.py <loop> and
  python3 scripts/inspect_loop.py <loop> put scripts/ on sys.path (not the
  repo root), so the sibling loop package was unimportable and the scripts
  silently used their degraded fallbacks: runtime_monitor reported
  missing RUNLOG.md on the canonical .loop/RUNLOG.md layout, and
  inspect_loop could not read plan_then_execute from .loop/manifest.yaml.
  Both scripts now self-bootstrap the repo root onto sys.path before importing
  loop.*, matching python -m loop behaviour. The bug was invisible to CI
  because python -m pytest already places the repo root on sys.path; added
  by-path subprocess regression tests that reproduce the real standalone call.
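A self-bootstrap of this kind is usually a few lines at the top of the script. The sketch below assumes the script sits one directory below the repo root (as scripts/ does here); the helper name is hypothetical and the project's actual code may differ.

```python
import sys
from pathlib import Path


def bootstrap_repo_root(script_file):
    """Put the parent of the script's directory on sys.path and return it.

    When a script is run by path, Python puts the script's own directory on
    sys.path, not its parent, so sibling packages are not importable.
    """
    repo_root = str(Path(script_file).resolve().parent.parent)
    if repo_root not in sys.path:
        sys.path.insert(0, repo_root)
    return repo_root


# In a script under scripts/, call this before any import of the sibling package:
# bootstrap_repo_root(__file__)
```

Inserting at position 0 makes the checkout's own package win over any installed copy, which may or may not be the desired precedence in a given setup.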
v0.3.2 — public cut + BYO decoupling
Loop Contract Core plus a public open-source readiness pass: every skill now runs
on the bundled portable core with no private tooling, and the repo ships CI and
standard community files.
Added
- Loop Contract Core. The portable loop/ package with
  python3 -m loop doctor|validate|verify|inspect, shared workspace/.loop
  path resolution, and JSON schemas for manifest@1, state@1, tasks@1, and
  terminal@1.
- Generic receipt schema (schemas/receipt.schema.json, receipt@1): an
  engine-neutral dispatch/cost record at .loop/receipts/*.jsonl so the flywheel,
  evals, and runtime-monitor compute routing + cost metrics without any private
  telemetry.
- byo-default structural check (the 13th self-eval check): fails if any
  skill depends on an unbundled tool without also naming the bundled default path.
- Continuous integration (.github/workflows/ci.yml): runs the frontmatter,
  self-eval, pytest, compile, JSON-validity, and quickstart-smoke gates on Python
  3.10 / 3.11 / 3.12.
- Community files: CONTRIBUTING, CODE_OF_CONDUCT, SECURITY, and issue/PR
  templates.
- Discoverability metadata in plugin.json (homepage, repository, keywords).
Changed
- Bring-your-own-verifier decoupling. Skills and reference docs now default to
  the bundled gate (scripts/verify-fast → verify-full, python3 -m loop verify)
  and .loop/receipts/*.jsonl. /verify-slice, /verify-milestone, .gsd/
  receipts, model_routing.py / workflow_routing.py, Harmony, and Hermes are now
  documented as optional integrations / example realizations, never requirements.
- Install is now claude plugin marketplace add SollanSystems/loop-engineer;
  the marketplace is renamed from loop-engineer-local to loop-engineer.
- .claude-plugin/plugin.json version 0.3.1 → 0.3.2.
Fixed
- scripts/inspect_loop.py now scores contract-owned artifacts instead of broad
  README/prose keyword matches; plan_then_execute: false no longer receives
  credit by substring.
- scripts/runtime_monitor.py now resolves canonical root RUNLOG.md, returns
  structured reports for partial loop state, and avoids cross-task repair-churn
  false positives.
- scripts/benchmark_harness.py rejects duplicate task ids before computing A/B
  metrics.
- scripts/anticheat_scan.py flags semantic self-weakening of safety ranking or
  downgrade mapping as FailedSafety.
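The duplicate-id rejection in the benchmark harness item can be illustrated with a short pre-check. The task_id key and the function name are assumptions; the real harness input format is not shown in these notes.

```python
from collections import Counter


def reject_duplicate_task_ids(records):
    """Raise ValueError if any task id appears more than once; else return records.

    records is a list of dicts with a hypothetical "task_id" key.
    """
    counts = Counter(r["task_id"] for r in records)
    dupes = sorted(task for task, n in counts.items() if n > 1)
    if dupes:
        raise ValueError(f"duplicate task ids: {dupes}")
    return records
```

Running this before any aggregation matters because a duplicated task would otherwise be counted twice in a rate and skew the A/B comparison without any visible error.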
v0.3.1 — inspector + runtime monitor
Adversarial-fix milestone. The v0.3.0 release closed two false-POSITIVE classes
in the anti-cheat scanner; a GPT-5.5/xhigh codex challenge over the v0.3.0 diff
then found the blind side — evasion paths the scanner failed to flag, plus
boundary-validation gaps in three harness scripts. This patch closes them.
Fixed
Anti-cheat scanner false-negatives (P1.1–P1.5) — scripts/anticheat_scan.py
- Scoped self-exclusion (P1.1). A scanner self-edit that empties or shrinks
  DEFAULT_GATE_PATHS / _ADDED_LINE_SIGNATURES is now graded critical
  (FailedSafety); additive and comment-only self-edits stay clean. Removed
  entries are compared semantically, so a reorder or reformat does not flag.
- Delete + rename evidence (P1.2). parse_changed_files now also captures
  gate files that are deleted (+++ /dev/null) or renamed
  (rename from / rename to); both of Codex's exact exploit diffs now return
  clean:false.
- verify-* gate coverage (P1.3). Gate-path matching now covers
  verify-fast / verify-full / verify-safety; tampering one to bypass it is
  flagged.
- Broader tautology detection (P1.4). Identical-operand assertions (a literal
  or an identifier compared against itself) and always-true unittest calls now
  downgrade to FailedUnverifiable; honest asserts with distinct operands stay clean.
- Path-shaped hidden-answer names (P1.5). Trajectory reads of held-out /
  hold_out / answer-key / golden / expected-output paths are flagged, while a
  plain assert result == expected stays clean.
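The identical-operand case from P1.4 can be illustrated with Python's ast module. This is a sketch of the general technique, not the scanner's code: the function name is hypothetical, it covers only assert statements with a single == or is comparison, and it does not handle the always-true unittest calls mentioned above.

```python
import ast


def tautological_asserts(source):
    """Return line numbers of asserts that compare a literal or name to itself."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assert):
            continue
        test = node.test
        if (
            isinstance(test, ast.Compare)
            and len(test.ops) == 1
            and isinstance(test.ops[0], (ast.Eq, ast.Is))
            and isinstance(test.left, (ast.Name, ast.Constant))
            # ast.dump omits line/column info by default, so two structurally
            # identical operands produce the same string.
            and ast.dump(test.left) == ast.dump(test.comparators[0])
        ):
            hits.append(node.lineno)
    return sorted(hits)
```

Comparing parsed nodes rather than raw text means whitespace differences do not hide a tautology, while an honest assert with distinct operands is left alone.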
Boundary validation (P1.6, P2.1–P2.4)
- scripts/benchmark_harness.py: compare() raises on a mismatched A/B
  task-set instead of reporting a silent delta; non-bool claimed_done /
  verification_passed and out-of-range repair / criteria counts are rejected.
- scripts/runtime_monitor.py: robust score parsing for 1e-3, negatives, and
  malformed input (no crash); tests pin the exact intervention per scenario.
- scripts/inspect_loop.py: bounded shallow walk with a per-file read cap
  replaces the unbounded full-tree traversal.
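The first boundary check can be sketched as below. The claimed_done and verification_passed field names come from the notes above; everything else (the compare_arms name, the dict-of-records input, and the pass-rate delta it returns) is a hypothetical stand-in for the real compare().

```python
def validate_record(task_id, rec):
    """Reject records whose flags are not real booleans (e.g. 1, "true", None)."""
    for key in ("claimed_done", "verification_passed"):
        if not isinstance(rec.get(key), bool):
            raise ValueError(f"{task_id}: {key} must be a bool")


def compare_arms(arm_a, arm_b):
    """Return pass-rate(B) minus pass-rate(A) over identical task sets.

    arm_a and arm_b map task_id -> record dict. Mismatched or empty task sets
    raise instead of producing a delta over different denominators.
    """
    if set(arm_a) != set(arm_b):
        raise ValueError(f"A/B task sets differ: {sorted(set(arm_a) ^ set(arm_b))}")
    if not arm_a:
        raise ValueError("empty task set")
    for arm in (arm_a, arm_b):
        for task_id, rec in arm.items():
            validate_record(task_id, rec)

    def pass_rate(arm):
        return sum(r["verification_passed"] for r in arm.values()) / len(arm)

    return pass_rate(arm_b) - pass_rate(arm_a)
```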
Changed (P2.5)
- README.md: present-tense install note corrected to "all 9 skills".
- .claude-plugin/plugin.json: version 0.3.0 → 0.3.1.
Credits
- The false-negative and boundary findings came from the GPT-5.5/xhigh
  codex adversarial review over the v0.3.0 release diff.