Releases · SollanSystems/loop-engineer

Release list

loop-engineer v0.6.0 — metrics real Latest

SollanSystems released this 04 Jul 01:16

loop-engineer--v0.6.0

dc9d9cc

"Metrics real" — false-completion-rate and repair-productivity graduate from claims to derivations, red-teamed before release.

loop metrics <loop-dir> derives FCR and RP from a loop's real on-disk evidence — RUNLOG claims × verify bundles, held-out verdicts, repair records, receipts — never from agent narration. FCR is computed two independent ways and disagreement is surfaced, not resolved; unmatched success claims fail closed. Every number ships with a provenance block naming its input files.
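As a rough illustration of the two-method idea, the sketch below computes a false-completion rate once from verify bundles and once from held-out verdicts, then reports both values instead of picking one. The `Claim` type, both method definitions, and every field name are assumptions made for this example; they are not the actual `loop metrics` implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Claim:
    """A success claim taken from a RUNLOG (hypothetical shape)."""
    task_id: str
    iteration: int


def fcr_from_bundles(claims, green_bundles) -> Optional[float]:
    """Assumed method A: a claim is false unless a green bundle matches it."""
    if not claims:
        return None  # vacuous run: there is no rate to report
    false = sum(1 for c in claims if (c.task_id, c.iteration) not in green_bundles)
    return false / len(claims)


def fcr_from_verdicts(claims, verdicts) -> Optional[float]:
    """Assumed method B: a claim is false unless its task has a passing verdict."""
    if not claims:
        return None
    # A missing verdict counts as false, so unmatched claims fail closed.
    false = sum(1 for c in claims if verdicts.get(c.task_id) is not True)
    return false / len(claims)


def derive_fcr(claims, green_bundles, verdicts) -> dict:
    """Report both values and whether they agree; never pick a winner."""
    a = fcr_from_bundles(claims, green_bundles)
    b = fcr_from_verdicts(claims, verdicts)
    return {"by_bundles": a, "by_verdicts": b, "agree": a == b}


claims = [Claim("t1", 1), Claim("t2", 2)]
report = derive_fcr(
    claims,
    green_bundles={("t1", 1)},            # t2 has no matching green bundle
    verdicts={"t1": True, "t2": True},    # but both tasks pass the held-out check
)
print(report)
```

In this made-up run the two methods disagree, and the report carries that disagreement forward rather than averaging or choosing between the values.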

loop metrics --baseline publishes docs/metrics-baseline.json and refuses to publish for any run that is not genuinely gate-backed: a structurally valid held-out verdict artifact is mandatory, rejected or unanchored repair records block it, disagreeing FCR methods block it, and a vacuous zero-claim run cannot be baselined.
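The refusal conditions listed above can be pictured as a list of blockers that must be empty before anything is written. This is only a sketch: the `report` dictionary and all of its keys are hypothetical names chosen for the example, not the tool's real data model.

```python
def baseline_blockers(report: dict) -> list:
    """Return reasons a baseline must not be published (empty list means OK).

    Every key read from `report` is a hypothetical name for this sketch.
    """
    blockers = []
    if report.get("heldout_verdict_valid") is not True:
        blockers.append("no structurally valid held-out verdict artifact")
    if report.get("rejected_repairs", 0) or report.get("unanchored_repairs", 0):
        blockers.append("rejected or unanchored repair records")
    if report.get("fcr_methods_agree") is not True:
        blockers.append("FCR methods disagree")
    if report.get("claim_count", 0) == 0:
        blockers.append("vacuous run: zero claims")
    return blockers


clean = {
    "heldout_verdict_valid": True,
    "rejected_repairs": 0,
    "unanchored_repairs": 0,
    "fcr_methods_agree": True,
    "claim_count": 4,
}
print(baseline_blockers(clean))                       # no blockers
print(baseline_blockers({**clean, "claim_count": 0}))  # vacuous run is blocked
```

Missing keys are treated as blocking here, which mirrors the fail-closed stance the release notes describe.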

Published baseline (gate-backed examples/coverage-repair): FCR 0.0 · RP 1.0 — reproducible with python3 -m loop metrics examples/coverage-repair; a test binds the README literals to the committed JSON.

Honesty invariants:

productive is recomputed from each record's own evidence, never trusted; repair records must anchor to a same-task red→green verify-bundle pair.
Claim semantics are outcome-class aware: completion-class claims (task_passed/succeeded/terminal) require every attached bundle green, no exceptions; progress-class (advanced) tolerates a red intermediate only if the same task goes green in a strictly later iteration.
Canonical record schemas (loop-engineer/repair@1, loop-engineer/rollout@1) end the two-shapes-called-"the repair record" ambiguity.
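The outcome-class rule above can be sketched as a small predicate. The class names come from the release notes; the function signature and the way bundles and history are represented are assumptions made for illustration only.

```python
COMPLETION_CLASS = {"task_passed", "succeeded", "terminal"}
PROGRESS_CLASS = {"advanced"}


def claim_holds(outcome, iteration, attached, task_history):
    """Decide whether a claim is backed by its evidence (illustrative sketch).

    attached:      list of bools, True for each green bundle attached to the claim
    task_history:  list of (iteration, is_green) pairs for the same task
    """
    if not attached:
        return False  # no evidence at all: fail closed
    if outcome in COMPLETION_CLASS:
        # Completion-class claims need every attached bundle green.
        return all(attached)
    if outcome in PROGRESS_CLASS:
        if all(attached):
            return True
        # A red intermediate is tolerated only if the same task goes green
        # in a strictly later iteration.
        return any(green and it > iteration for it, green in task_history)
    return False  # unknown outcome classes are not credited


print(claim_holds("succeeded", 3, [True, False], [(4, True)]))   # red bundle blocks it
print(claim_holds("advanced", 3, [False], [(3, False), (4, True)]))
print(claim_holds("advanced", 3, [False], [(3, True)]))          # same iteration is not later
```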

Red-teamed before merge: two adversarial review rounds confirmed 17 issues in the metrics implementation — including a --baseline that would have published a clean FCR over a run its own gate flagged — all fixed and pinned as regression tests. Documented residual: a committed verdict is evidence, not proof; tamper detection belongs to the anti-cheat layer.

Also: loop console script (pip install -e .), doctor reports validated record schemas, 217-test suite (+67).

Full details in CHANGELOG.md.

loop-engineer v0.5.0 — credibility enforcement + first screen

SollanSystems released this 03 Jul 18:48

loop-engineer--v0.5.0

58da373

The two pre-launch milestones of the v1.0 roadmap, landed together.

Enforce the wedge — false-completion defense is now enforced by validators, not asserted by docs:

A Succeeded terminal no longer validates with false_completion: true or an empty/false criteria_met; the inspector grades false-completion defense on invocation evidence, never a self-asserted flag (#8).
The held-out gate returns NotReady on an empty visible set; the anti-cheat scanner detects edits that neuter its own gate-decision functions and reports gate tampering with a distinct exit code (#8).
examples/coverage-repair runs end-to-end through the real held-out gate — its false_completion: false is backed by a committed gate verdict, not a hand-set flag (#9).
The repo's own live contract passes its own gate: python3 -m loop doctor .loop → ok (#8).
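A minimal sketch of the first rule in this list, assuming a terminal record is a plain dictionary. The key names `terminal_state`, `false_completion`, and `criteria_met` appear in the release notes, but the exact record shape and the error strings here are guesses made for the example.

```python
def validate_succeeded_terminal(terminal: dict) -> list:
    """Return validation errors for a Succeeded terminal (illustrative only)."""
    errors = []
    if terminal.get("terminal_state") != "Succeeded":
        return errors  # this sketch only covers the Succeeded case
    if terminal.get("false_completion") is True:
        errors.append("Succeeded with false_completion: true")
    # Covers a missing, empty, or false criteria_met value.
    if not terminal.get("criteria_met"):
        errors.append("Succeeded with empty or false criteria_met")
    return errors


bad = {"terminal_state": "Succeeded", "false_completion": True, "criteria_met": []}
good = {
    "terminal_state": "Succeeded",
    "false_completion": False,
    "criteria_met": ["coverage above threshold"],  # made-up criterion
}
print(validate_succeeded_terminal(bad))
print(validate_succeeded_terminal(good))
```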

First screen — a stranger gets it in 10 seconds, scores a loop in 30:

README rebuilt: tagline, concrete failure modes, zero-install first command, stack diagram, comparison table (#7, #12).
Weak→strong demo filmed live on the real tools: docs/demo.gif — a self-asserted DIY loop scores 0/weak, the gate-backed example 90/strong (#13).
New loop scaffold command + JSON Schemas for the contract artifacts; promised templates shipped (#8, #11).
CLI polish: --help/--version, distinct errors, explicit exit codes (#10).

Plus: the v1.0 master roadmap and four strategic design specs are now committed under docs/superpowers/ (#15).

Version jumps 0.3.4 → 0.5.0 to match the roadmap's milestone numbering; there is no 0.4.x tag.

Full details in CHANGELOG.md.

v0.3.4 — dogfood-driven hardening + portable core

SollanSystems released this 03 Jul 16:58

loop-engineer--v0.3.4

3a7c152

Dogfood-driven hardening: ran loop-inspector + loop-runtime-monitor against 9 real
on-disk loops (foreign and in-house). The tools had been built and tested only against
this suite's own well-formed loops, so first contact with foreign/edge-case inputs exposed
six defects — all fixed here under TDD, each pinned by a regression test.

Fixed

(P1) inspect_loop no longer crashes on a malformed manifest.yaml. read_manifest
(loop/contract.py) ran yaml.safe_load without a guard — the one read path missing the
json.JSONDecodeError guard every JSON read already had — so a malformed manifest in an
untrusted/foreign loop dir killed the inspector with a traceback instead of returning a
report. It now fails safe to {}, fixing the crash for inspect_loop, validate_contract,
and doctor_report at once.
inspect_loop now scores SPEC.md / WORKFLOW.md / TASKS.json dual-location (.loop/
∪ workspace root), like manifest/state already resolved. Previously SPEC/WORKFLOW were
hard-coded to the workspace root, so a loop whose contract lives under .loop/ (including
loop-engineer's own repo) was falsely scored as having "no success criteria" / "no
independent verification." Scores on substance, not on where the file sits.
inspect_loop recognizes a single-file loop-contract.md as a contract-owned source
for success criteria, approval gates, plan-then-execute, and terminal-state coverage — a
committed minimal-contract loop that names all 7 terminal states is no longer scored 0/7.
runtime_monitor is terminal-state-aware. It now reads terminal_state / state == "terminal" and reports recommendation: "done" (surfacing the terminal state) instead of
advising continue on a loop that has already finished.
runtime_monitor no longer reports an unparseable RUNLOG as healthy. A non-empty
RUNLOG that yields zero parseable iteration records now returns status: "degraded" /
recommendation: "replan" (with evidence) instead of the benign ok/continue/[] that
was byte-identical to a healthy loop — making the silent inertness of stall/repair-churn
detection on prose RUNLOGs visible.
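To make the last fix concrete, here is a sketch of the "non-empty but unparseable" check. It assumes, purely for illustration, that iteration records are single-line JSON objects carrying an `iteration` key; the real RUNLOG format and the real `runtime_monitor` code may differ.

```python
import json


def parse_iteration_records(text: str) -> list:
    """Collect lines that parse as JSON objects with an 'iteration' key (assumed format)."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "iteration" in obj:
            records.append(obj)
    return records


def monitor(text: str) -> dict:
    """Report 'degraded' when a non-empty RUNLOG yields zero parseable records."""
    records = parse_iteration_records(text)
    if text.strip() and not records:
        return {
            "status": "degraded",
            "recommendation": "replan",
            "evidence": ["RUNLOG is non-empty but yielded 0 parseable iteration records"],
        }
    # Stall and repair-churn analysis would run on `records` here;
    # handling of a completely empty RUNLOG is out of scope for this sketch.
    return {"status": "ok", "recommendation": "continue", "evidence": []}


prose = "Iteration 3: fixed the flaky test, everything looks good now.\n"
structured = '{"iteration": 1, "score": 0.4}\n'
print(monitor(prose)["status"])
print(monitor(structured)["status"])
```

The point of the distinct `degraded` result is that a prose-only log no longer produces output identical to a healthy loop.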

Changed

Removed the unreferenced broad-substring corpus scoring path from scripts/inspect_loop.py
(_gather_corpus, _walk_bounded, _evaluate_checks, _terminal_states_covered) — dead
code since the keyword-stuffing fix replaced it with the typed-contract path. Corrected
loop-inspector/SKILL.md and reference/patterns.md §4 to describe the actual named,
typed, dual-located contract file set the inspector reads, rather than a "reads any foreign
harness shape semantically" claim the implementation never honored.

Added

pyproject.toml — the portable core is now installable with pip install -e .
(optional pip install -e ".[yaml]" for faster manifest parsing), so
python3 -m loop doctor|inspect <workspace> runs from any directory rather than only the
repo root. The core stays pure-stdlib; PyYAML remains an optional extra. A new
test_docs_version check pins the pyproject.toml version to .claude-plugin/plugin.json.
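A version-pinning check of this kind can be sketched as below. The helper names are hypothetical, the inputs are passed as text so the example stays self-contained, and a regular expression stands in for a TOML parser so it also runs on Python 3.10; the actual `test_docs_version` test may be written differently.

```python
import json
import re


def pyproject_version(pyproject_text: str) -> str:
    """Read the first top-level `version = "..."` line (simplified, not a TOML parser)."""
    match = re.search(r'^version\s*=\s*"([^"]+)"', pyproject_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no version field found in pyproject text")
    return match.group(1)


def plugin_version(plugin_json_text: str) -> str:
    """Read the `version` field from plugin.json content."""
    return json.loads(plugin_json_text)["version"]


# Illustrative file contents; the project name is a placeholder.
pyproject_text = '[project]\nname = "example"\nversion = "0.3.4"\n'
plugin_json_text = '{"name": "example", "version": "0.3.4"}'

versions_match = pyproject_version(pyproject_text) == plugin_version(plugin_json_text)
print(versions_match)
```

In a real test the two strings would be read from `pyproject.toml` and `.claude-plugin/plugin.json`, and a mismatch would fail the suite.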

Documentation

README: the Portable validator / inspector section documents the editable install for
running outside the repo root; the 30-second inspect demo now shows the full
target / present / gaps report; the doctor block notes the omitted paths object;
validate / verify are documented as doctor aliases; terminal_state.json is noted as
resolving in either .loop/ or the workspace root.
examples/coverage-repair records receipts at the canonical .loop/receipts/*.jsonl (was the
stale pre-decoupling .gsd/audit/receipts/ path, inconsistent with the example's own .loop/
layout).
loop-runtime-monitor/SKILL.md frames its position generically ("vs a loop-driving operator")
instead of naming a private plugin agent.

Erratum (2026-06-30): the Documentation note above overstated examples/coverage-repair — the frozen example ships contract artifacts, not a receipts trail. Corrected in the CHANGELOG Errata section; as of the M2 launch slice the example is fully runnable with a committed real holdout-gate verdict.

v0.3.3 — citation fixes

SollanSystems released this 03 Jul 16:58

loop-engineer--v0.3.3

88d12dd

Changed

Citation accuracy: corrected three over-reaching attributions to real sources
(no citations removed, no IDs changed). The "A/B trigger policy / cost-benefit
knob" and "cuts wasted edits" are reframed as this suite's own design choices
rather than PreFlect (arXiv 2602.07187) findings — PreFlect reflects on every
plan unconditionally and reports no edit-efficiency metric. The "repo-native
run-ledger over a vendor eval UI" is attributed to this suite as its answer to
the open challenge posed by Code as Agent Harness (arXiv 2605.18747), not as
that paper's claim.

Fixed

Standalone scripts now resolve the loop package when run by path. The
documented invocations python3 scripts/runtime_monitor.py <loop> and
python3 scripts/inspect_loop.py <loop> put scripts/ on sys.path (not the
repo root), so the sibling loop package was unimportable and the scripts
silently used their degraded fallbacks — runtime_monitor reported
missing RUNLOG.md on the canonical .loop/RUNLOG.md layout, and
inspect_loop could not read plan_then_execute from .loop/manifest.yaml.
Both scripts now self-bootstrap the repo root onto sys.path before importing
loop.*, matching python -m loop behaviour. The bug was invisible to CI
because python -m pytest already places the repo root on sys.path; added
by-path subprocess regression tests that reproduce the real standalone call.
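The self-bootstrap described above usually amounts to a few lines at the top of each script. The sketch below shows one common way to do it, assuming the script lives one directory below the repo root; the helper name `bootstrap_repo_root` is hypothetical and the project's actual code may differ.

```python
import sys
from pathlib import Path


def bootstrap_repo_root(script_file: str) -> Path:
    """Put the parent of the script's directory (assumed repo root) on sys.path."""
    repo_root = Path(script_file).resolve().parent.parent
    root = str(repo_root)
    if root not in sys.path:
        # Insert at the front so the sibling package wins over any installed copy.
        sys.path.insert(0, root)
    return repo_root


# Typical use at the top of scripts/<name>.py, before importing the package:
#     bootstrap_repo_root(__file__)
#     import loop
```

Running a file by path puts only that file's directory on `sys.path`, which is why the sibling package is invisible without a step like this, while `python -m` invocations from the repo root already see it.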

v0.3.2 — public cut + BYO decoupling

SollanSystems released this 03 Jul 16:58

loop-engineer--v0.3.2

6d8bbbd

Loop Contract Core plus a public open-source readiness pass: every skill now runs
on the bundled portable core with no private tooling, and the repo ships CI and
standard community files.

Added

Loop Contract Core. The portable loop/ package with
python3 -m loop doctor|validate|verify|inspect, shared workspace/.loop
path resolution, and JSON schemas for manifest@1, state@1, tasks@1, and
terminal@1.
Generic receipt schema (schemas/receipt.schema.json, receipt@1) — an
engine-neutral dispatch/cost record at .loop/receipts/*.jsonl so the flywheel,
evals, and runtime-monitor compute routing + cost metrics without any private
telemetry.
byo-default structural check (the 13th self-eval check) — fails if any
skill depends on an unbundled tool without also naming the bundled default path.
Continuous integration (.github/workflows/ci.yml) — runs the frontmatter,
self-eval, pytest, compile, JSON-validity, and quickstart-smoke gates on Python
3.10 / 3.11 / 3.12.
Community files — CONTRIBUTING, CODE_OF_CONDUCT, SECURITY, and issue/PR
templates.
Discoverability metadata in plugin.json (homepage, repository, keywords).

Changed

Bring-your-own-verifier decoupling. Skills and reference docs now default to
the bundled gate (scripts/verify-fast → verify-full, python3 -m loop verify)
and .loop/receipts/*.jsonl. /verify-slice, /verify-milestone, .gsd/
receipts, model_routing.py / workflow_routing.py, Harmony, and Hermes are now
documented as optional integrations / example realizations, never requirements.
Install is now claude plugin marketplace add SollanSystems/loop-engineer;
the marketplace is renamed from loop-engineer-local to loop-engineer.
.claude-plugin/plugin.json version 0.3.1 → 0.3.2.

Fixed

scripts/inspect_loop.py now scores contract-owned artifacts instead of broad
README/prose keyword matches; plan_then_execute: false no longer receives
credit by substring.
scripts/runtime_monitor.py now resolves canonical root RUNLOG.md, returns
structured reports for partial loop state, and avoids cross-task repair-churn
false positives.
scripts/benchmark_harness.py rejects duplicate task ids before computing A/B
metrics.
scripts/anticheat_scan.py flags semantic self-weakening of safety ranking or
downgrade mapping as FailedSafety.
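The first fix in this list is easiest to see side by side. The two functions below are illustrative stand-ins, not the inspector's code: one credits a keyword wherever it appears, the other reads the typed value from an already-parsed manifest.

```python
def credit_by_substring(manifest_text: str) -> bool:
    """Old-style check (sketch): any mention of the key earns credit."""
    return "plan_then_execute" in manifest_text


def credit_by_value(manifest: dict) -> bool:
    """Typed check (sketch): credit only when the parsed value is literally True."""
    return manifest.get("plan_then_execute") is True


text = "plan_then_execute: false"
parsed = {"plan_then_execute": False}

print(credit_by_substring(text))  # credits a loop that explicitly opted out
print(credit_by_value(parsed))    # does not
```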

v0.3.1 — inspector + runtime monitor

SollanSystems released this 03 Jul 16:58

loop-engineer--v0.3.1

1793a47

Adversarial-fix milestone. The v0.3.0 release closed two false-POSITIVE classes
in the anti-cheat scanner; a GPT-5.5/xhigh codex challenge over the v0.3.0 diff
then found the blind side — evasion paths the scanner failed to flag, plus
boundary-validation gaps in three harness scripts. This patch closes them.

Fixed

Anti-cheat scanner false-negatives (P1.1–P1.5) — scripts/anticheat_scan.py

Scoped self-exclusion (P1.1). A scanner self-edit that empties or shrinks
DEFAULT_GATE_PATHS / _ADDED_LINE_SIGNATURES is now graded critical
(FailedSafety); additive and comment-only self-edits stay clean. Removed
entries are compared semantically, so a reorder or reformat does not flag.
Delete + rename evidence (P1.2). parse_changed_files now also captures
gate files that are deleted (+++ /dev/null) or renamed
(rename from/rename to); both of Codex's exact exploit diffs now return
clean:false.
verify-* gate coverage (P1.3). Gate-path matching now covers
verify-fast / verify-full / verify-safety; tampering one to bypass it is
flagged.
Broader tautology detection (P1.4). Identical-operand assertions (a literal
or an identifier compared against itself) and always-true unittest calls now
downgrade to FailedUnverifiable; honest asserts with distinct operands stay clean.
Path-shaped hidden-answer names (P1.5). Trajectory reads of held-out /
hold_out / answer-key / golden / expected-output paths are flagged, while a
plain assert result == expected stays clean.
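One way to detect the identical-operand case from P1.4 is to walk the syntax tree with Python's standard `ast` module. This is a sketch of the general technique, covering only bare `assert` statements with `==` or `is`; it is not the scanner's actual rule set and it ignores the unittest-call case.

```python
import ast


def tautological_asserts(source: str) -> list:
    """Return line numbers of asserts that compare a name or literal with itself."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Assert):
            continue
        test = node.test
        if not (isinstance(test, ast.Compare) and len(test.ops) == 1):
            continue
        if not isinstance(test.ops[0], (ast.Eq, ast.Is)):
            continue
        left, right = test.left, test.comparators[0]
        # ast.dump omits line and column info by default, so two structurally
        # identical operands produce the same string.
        if isinstance(left, (ast.Name, ast.Constant)) and ast.dump(left) == ast.dump(right):
            hits.append(node.lineno)
    return sorted(hits)


sample = (
    "assert x == x\n"
    "assert result == expected\n"
    "assert 1 == 1\n"
)
print(tautological_asserts(sample))
```

The honest `assert result == expected` on line 2 is left alone because its operands differ.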

Boundary validation (P1.6, P2.1–P2.4)

scripts/benchmark_harness.py — compare() raises on a mismatched A/B
task-set instead of reporting a silent delta; non-bool claimed_done /
verification_passed and out-of-range repair / criteria counts are rejected.
scripts/runtime_monitor.py — robust score parsing for 1e-3, negatives, and
malformed input (no crash); tests pin the exact intervention per scenario.
scripts/inspect_loop.py — bounded shallow walk with a per-file read cap
replaces the unbounded full-tree traversal.
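The first boundary check, refusing to compare mismatched A/B task sets, might look roughly like this. The input shape (a mapping from task id to a boolean pass flag) and the returned delta are simplifications assumed for the example, not the real `compare()` signature.

```python
def compare(arm_a: dict, arm_b: dict) -> float:
    """Return pass-rate delta (B minus A) over identical task sets (sketch).

    Each arm maps task_id -> verification_passed (must be a bool).
    """
    if set(arm_a) != set(arm_b):
        only_a = sorted(set(arm_a) - set(arm_b))
        only_b = sorted(set(arm_b) - set(arm_a))
        raise ValueError(f"task sets differ: only in A {only_a}, only in B {only_b}")
    if not arm_a:
        raise ValueError("no tasks to compare")
    for name, arm in (("A", arm_a), ("B", arm_b)):
        for task_id, passed in arm.items():
            # Reject truthy stand-ins such as 1 or "yes".
            if not isinstance(passed, bool):
                raise TypeError(f"arm {name}, task {task_id}: expected bool, got {type(passed).__name__}")
    n = len(arm_a)
    return sum(arm_b.values()) / n - sum(arm_a.values()) / n


delta = compare({"t1": True, "t2": False}, {"t1": True, "t2": True})
print(delta)
```

Raising on a mismatch keeps a missing or extra task from quietly shifting the reported delta.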

Changed (P2.5)

README.md — present-tense install note corrected to "all 9 skills".
.claude-plugin/plugin.json — version 0.3.0 → 0.3.1.

Credits

The false-negative and boundary findings came from the GPT-5.5/xhigh
codex adversarial review over the v0.3.0 release diff.
