v0.41.0

@github-actions github-actions released this 07 Jun 09:30

v0.41.0 — Run it anywhere, and trust what it ships

A week ago, ooo auto learned to finish the job on its own. This release makes
that autonomy something you can actually rely on: it runs on one more runtime,
it refuses to start building until the goal is unambiguous, and the verdict that
decides "is this actually done?" can no longer be gamed.

The headline

Autonomy is only worth as much as the trust behind it. v0.40.0 closed the loop —
goal in, product out. v0.41.0 spends its week hardening the two ends of that
loop and widening the floor it runs on.

  • Run it anywhere. Pi joins Claude, Codex, Gemini, OpenCode, Goose, and
    Copilot as a first-class runtime. Ouroboros stays the workflow engine; the
    runtime is a swappable kernel. Installing it got more reliable, and every
    default model pin now lives in one place.
  • Trust what it ships — at the input. The Socratic interview no longer thinks
    alone. At every ambiguity milestone it convenes a panel — a researcher, a
    contrarian, a simplifier — to surface hidden assumptions before the question
    reaches you. And ooo auto will not start building until the Seed is genuinely
    low-ambiguity and passes QA.
  • Trust what it ships — at the output. The verifier's verdict is now typed,
    audited, and routed by an explicit admission policy. A test that really ran but
    reported the wrong evidence form is no longer smeared as "fabrication," and a
    faked clean run still doesn't pass.

🖥️ Run it anywhere — the Agent OS gets a new kernel

Pi is now a first-class Ouroboros runtime. Ouroboros owns the workflow engine,
Seed decomposition, checkpointing, evaluation handoff, and ooo skill dispatch;
for each runtime task it shells out to pi --mode json and normalizes Pi's JSONL
events into Ouroboros AgentMessage values. As the new runtime guide puts it:
"Pi is an Ouroboros runtime" means the runtime is selectable — not that Pi is
imported into Ouroboros.
That is the whole Agent OS thesis in one sentence.

  • PiLLMAdapter for --llm-backend pi; pi / pi_cli registered as LLM- and
    interview-driver-capable in the backend registry and provider factory (#1326)
  • Pi backend-aware default-model normalization — default --llm-backend pi uses
    Pi's own backend default instead of forwarding an Anthropic model name (#1326)
  • Align the Pi runtime with documented JSON mode (#1321)
  • Report malformed Pi runtime events as a typed ProviderError instead of
    failing opaquely (#1325)
  • Wire the Pi runtime setup surface — ouroboros setup --runtime pi installs the
    managed Pi bridge (5c674c1)
  • Opt-in native Pi CLI smoke test for end-to-end confidence (#1329)
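
As a rough illustration of the normalization step described above, the sketch below reads JSONL lines, maps each one to a message value, and raises a typed error when a line is malformed. The `AgentMessage` fields, the `type` and `text` event keys, and the `normalize_pi_events` name are assumptions made for this example; Pi's real event schema and the actual Ouroboros types are not reproduced here.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator


class ProviderError(Exception):
    """Typed error for malformed runtime output (hypothetical stand-in)."""


@dataclass(frozen=True)
class AgentMessage:
    # Hypothetical fields; the real AgentMessage may differ.
    kind: str
    content: str


def normalize_pi_events(lines: Iterable[str]) -> Iterator[AgentMessage]:
    """Turn JSONL lines into AgentMessage values, failing loudly on bad input."""
    for lineno, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines between events
        try:
            event = json.loads(raw)
        except json.JSONDecodeError as exc:
            raise ProviderError(f"line {lineno}: invalid JSON ({exc.msg})") from exc
        if not isinstance(event, dict) or "type" not in event:
            raise ProviderError(f"line {lineno}: event has no 'type' field")
        yield AgentMessage(kind=str(event["type"]), content=str(event.get("text", "")))
```

The design point is that a bad line surfaces as one well-defined exception carrying the line number, rather than as an unrelated `KeyError` or a silent skip further down the pipeline.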

Installing and updating got more trustworthy. The week's two same-day releases
surfaced real install-path risk; the two changes below close it.

  • Run setup with the freshly installed ouroboros binary, not a stale one
    left on PATH (#1345)
  • Installer UX improvements; pipx/pip install paths now preserve existing PATH
    precedence (#1343)

One source of truth for model pins. The same default model strings were
hand-copied across three layers, so Opus had silently been frozen at 4.6 since
February. They now live in a single _model_defaults.py.

  • Centralize every default Claude model pin into one source of truth
    (_model_defaults.py) and pin exact snapshots rather than the "default"
    sentinel, so evaluation/consensus grading stays reproducible. Net move: the
    Opus reasoning tier → 4.8 (interview, seed, ontology, evaluation, consensus
    advocate); the Sonnet judgment tier (qa_model) stays pinned at 4.6,
    retiring the dated claude-sonnet-4-20250514 (#1324, #1323)
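
A single-source-of-truth module of this kind might look roughly like the sketch below. Only the file name `_model_defaults.py` and the `claude-sonnet-4-6` string come from these notes; the constant names, the role keys, the lookup helper, and the Opus placeholder are assumptions for illustration, since the exact Opus snapshot identifier is not stated here.

```python
# Sketch of a single-source-of-truth module (contents are illustrative).
from types import MappingProxyType

# Placeholder: the exact Opus 4.8 snapshot string is not given in these notes.
OPUS_REASONING_MODEL = "<opus-4.8-snapshot-id>"
# Taken from the changelog entry for #1323.
SONNET_JUDGMENT_MODEL = "claude-sonnet-4-6"

# Hypothetical role -> model table; every other layer would read from here
# instead of carrying its own copy of the string.
DEFAULT_MODELS = MappingProxyType({
    "interview": OPUS_REASONING_MODEL,
    "seed": OPUS_REASONING_MODEL,
    "ontology": OPUS_REASONING_MODEL,
    "evaluation": OPUS_REASONING_MODEL,
    "consensus_advocate": OPUS_REASONING_MODEL,
    "qa_model": SONNET_JUDGMENT_MODEL,
})


def default_model_for(role: str) -> str:
    """Look up the pinned default; unknown roles fail instead of guessing."""
    try:
        return DEFAULT_MODELS[role]
    except KeyError:
        raise KeyError(f"no default model pinned for role {role!r}") from None
```

Because the table is read-only and holds exact snapshot strings, bumping a tier is a one-line change, and a role cannot drift onto a stale pin in some other layer.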

Roadmap, in the open. A point-in-time AgentOS issue-sequencing graph
(Track A / B / C) is now published so you can see which merged PRs resolved which
roadmap tracks. #961 remains the canonical roadmap SSOT (#1293).

Housekeeping. Prune unused optional packages (#1301); pin typer before the
vendored click to stabilize resolution (#1300).


🧠 Trust what it ships — at the input: the interview stops thinking alone

Ouroboros has always opened with a single questioner. Now that questioner has a
panel. Milestone lateral review is promoted from a non-blocking advisory to a
required lightweight subagent pass at exactly the moments hidden assumptions
start to bite.

  • When an interview crosses an ambiguity milestone — initial → progress,
    progress → refined, refined → ready — the main session dispatches
    ouroboros_lateral_think with researcher, contrarian, and simplifier
    personas (adding architect when the answer changes system shape or ownership)
    before answering or asking the returned question (9d229c4)
  • This is the supported "deep research style" interview experience: multiple
    perspectives visibly help, while the final prompt stays easy to answer. Results
    are folded into 2–3 concrete options or one recommended draft — not dumped
    as a report
  • Lateral review also fires whenever the main session would otherwise compress a
    user's free-text into a decision, or when the question is about tradeoffs,
    priorities, non-goals, risk, success criteria, or rollout
  • run_lateral_review is now a declared interview capability, with per-runtime
    capability/instruction artifacts wired in (9d229c4)
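
The milestone trigger in the first bullet can be sketched as a small lookup. This is a minimal illustration, assuming a hypothetical `personas_for_review` helper and a boolean flag for "the answer changes system shape or ownership"; it covers only the milestone case, not the free-text or tradeoff triggers, and it does not call `ouroboros_lateral_think` itself.

```python
from enum import Enum
from typing import Tuple


class Milestone(Enum):
    INITIAL = "initial"
    PROGRESS = "progress"
    REFINED = "refined"
    READY = "ready"


# The three transitions named in the notes above.
MILESTONE_TRANSITIONS = {
    (Milestone.INITIAL, Milestone.PROGRESS),
    (Milestone.PROGRESS, Milestone.REFINED),
    (Milestone.REFINED, Milestone.READY),
}

BASE_PERSONAS = ("researcher", "contrarian", "simplifier")


def personas_for_review(
    before: Milestone, after: Milestone, changes_system_shape: bool = False
) -> Tuple[str, ...]:
    """Return the personas to dispatch, or an empty tuple if no milestone is crossed."""
    if (before, after) not in MILESTONE_TRANSITIONS:
        return ()
    if changes_system_shape:
        # The architect joins only when system shape or ownership is affected.
        return BASE_PERSONAS + ("architect",)
    return BASE_PERSONAS
```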

ooo auto won't build something underspecified. The interview no longer
closes on ledger completeness alone.

  • Gate auto runs on backend-confirmed low ambiguity (≤ 0.20) plus a pre-run
    Seed QA pass for both the MCP and CLI entrypoints; QA findings feed back into
    bounded Seed-repair attempts before blocking, so failures are actionable and
    resumable (#1302)
  • Normalize natural worktree-policy names (e.g. create_isolated_worktree → always)
    and fail fast when complete_product=true is paired with a too-short timeout,
    instead of burning the budget in the interview and blocking late (#1305)
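
The gate in #1302 can be pictured as the following sketch: check the ambiguity score first, then run Seed QA, and feed any findings into a bounded number of repair attempts before blocking. The 0.20 threshold is from the notes above; the function and parameter names, the `GateResult` shape, and the default of two repair attempts are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List

AMBIGUITY_THRESHOLD = 0.20  # from the release notes: gate at <= 0.20


@dataclass
class GateResult:
    allowed: bool
    reason: str
    repair_attempts: int = 0


def gate_auto_run(
    ambiguity: float,
    seed: dict,
    run_qa: Callable[[dict], List[str]],       # returns findings; empty means pass
    repair_seed: Callable[[dict, List[str]], dict],
    max_repairs: int = 2,                      # assumed bound, not from the source
) -> GateResult:
    """Allow a run only if ambiguity is low and Seed QA passes within the repair budget."""
    if ambiguity > AMBIGUITY_THRESHOLD:
        return GateResult(False, f"ambiguity {ambiguity:.2f} exceeds {AMBIGUITY_THRESHOLD:.2f}")
    attempts = 0
    findings = run_qa(seed)
    while findings and attempts < max_repairs:
        seed = repair_seed(seed, findings)  # QA findings drive the repair
        attempts += 1
        findings = run_qa(seed)
    if findings:
        return GateResult(False, "Seed QA still failing: " + "; ".join(findings), attempts)
    return GateResult(True, "ok", attempts)
```

Returning the reason and the attempt count, rather than a bare boolean, is what makes a blocked run actionable: the caller can show what failed and resume from the repaired Seed.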

🛡️ Trust what it ships — at the output: a verdict you can't game

The more autonomous the loop, the more its "done" has to mean done. This release
makes the verifier's decision typed, auditable, and policy-routed (RFC #814,
Verdict Envelope v1).

  • Promote TraceGuard verdict admission into VerifierVerdict: H1 verifier
    output now carries a typed status, evidence refs, and a retry_admission, and
    ACCEPT / RETRY / REDISPATCH / ESCALATE_MODEL / ESCALATE_HUMAN / BLOCK decisions
    are persisted on atomic typed-evidence events (#1330)
    • Benchmark fixtures:
      • accepted → ACCEPT
      • missing evidence → EVIDENCE_MISSING / RETRY
      • semantic miss → SCOPE_CREEP / REDISPATCH
      • repeated fabrication → FABRICATION_SUSPECTED / ESCALATE_MODEL
  • Prefer the verifier's retry-admission policy (H7): re-run the same leaf only
    when retry_admission=RETRY; honor intentional divergence between
    failure_class and retry_admission (e.g. FABRICATION_SUSPECTED +
    REDISPATCH) instead of inferring policy from the failure class alone (#1331)
  • Classify masked test evidence fairly (#1292): a transcript that clearly ran
    the test command but masked its status behind an output filter (… | tail) is
    now EVIDENCE_FORM_MISMATCH — retryable, with actionable feedback (e.g. add
    set -o pipefail) — rather than FABRICATION_SUSPECTED. The #1208 guard holds:
    unprotected output-filter pipelines still don't prove a clean commands_run
    claim. The verifier's evidence boundary is now codified in docs so core stays
    language- and runner-agnostic
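
To illustrate the H7 rule above, here is a minimal sketch of routing on the verifier's explicit `retry_admission` instead of inferring policy from `failure_class`. The admission names and failure-class strings are taken from these notes; the dataclass layout, the `status` values, and the `next_action` helper with its return strings are assumptions, not the actual Verdict Envelope v1 schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Admission(Enum):
    ACCEPT = "ACCEPT"
    RETRY = "RETRY"
    REDISPATCH = "REDISPATCH"
    ESCALATE_MODEL = "ESCALATE_MODEL"
    ESCALATE_HUMAN = "ESCALATE_HUMAN"
    BLOCK = "BLOCK"


@dataclass(frozen=True)
class VerifierVerdict:
    # Field names follow the prose above; the real schema may differ.
    status: str
    failure_class: Optional[str]
    evidence_refs: Tuple[str, ...]
    retry_admission: Admission


def next_action(verdict: VerifierVerdict) -> str:
    """Route on the explicit admission; failure_class is deliberately not consulted."""
    if verdict.retry_admission is Admission.RETRY:
        return "rerun_same_leaf"  # the only case that re-runs the same leaf
    if verdict.retry_admission is Admission.ACCEPT:
        return "accept"
    return verdict.retry_admission.value.lower()


# Intentional divergence: a fabrication-class failure that the verifier
# nonetheless admits for redispatch is redispatched, not escalated.
verdict = VerifierVerdict("failed", "FABRICATION_SUSPECTED", (), Admission.REDISPATCH)
print(next_action(verdict))  # prints "redispatch"
```

On the masked-evidence bullet: in a POSIX-style shell, the exit status of a pipeline such as `test-command | tail` is that of the last command, so a failing test run can still exit 0 unless `set -o pipefail` is in effect. That is why such a transcript shows the test ran but cannot by itself prove a clean result.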

What's Changed

Runtimes & Agent OS

  • feat(providers): add Pi LLM adapter (#1326)
  • fix(pi): align runtime with documented JSON mode (#1321)
  • fix(pi): report malformed runtime events (#1325)
  • fix(setup): wire Pi runtime setup surface (5c674c1)
  • test(orchestrator): add opt-in Pi CLI smoke test (#1329)
  • fix(installer): prefer freshly installed ouroboros for setup (#1345)
  • feat(installer): improve install script UX (#1343)
  • refactor(config): centralize Claude model pins into a single source of truth (align to 4.8) (#1324)
  • fix(config): replace retiring qa_model default with claude-sonnet-4-6 (#1323)
  • chore(deps): prune unused optional packages (#1301)
  • fix(deps): pin typer before vendored click (#1300)
  • fix(opencode): cover Windows cleanup review blockers (#1320)
  • fix(goose): keep LLM completion calls profile-free (#1303)
  • fix(run): guard home dir in _detect_project_root_from_seed_path (#1313)

Interview (the philosophy layer)

  • feat(interview): dispatch lateral review at milestones (9d229c4)
  • fix(auto): gate runs on low-ambiguity seed QA (#1302)
  • Harden ooo auto policy aliases and timeout preflight (#1305)

Verifier & harness integrity

  • feat(harness): promote TraceGuard verdict admission (#1330, refs #814)
  • fix(h7): prefer verifier retry admission policy (#1331)
  • fix(orchestrator): classify masked test evidence forms (#1292, refs #1234)

Docs

  • docs(providers): document Pi provider surfaces (#1327)
  • docs(runtime): fix shipped backend wording (#1332)
  • docs(agentos): add issue sequencing graph snapshot (#1293)
  • Verdict Envelope v1 RFC, verifier-evidence-policy, runtime-capability-matrix,
    Pi runtime guide, and contributing/key-patterns updates

What's Changed

  • fix(orchestrator): classify masked test evidence forms by @Q00 in #1292
  • docs(agentos): add issue sequencing graph snapshot by @Q00 in #1293
  • fix(deps): pin typer before vendored click by @Q00 in #1300
  • chore(deps): prune unused optional packages by @Q00 in #1301
  • fix(goose): keep LLM completion calls profile-free by @mdc2122 in #1303
  • fix(run): guard home dir in _detect_project_root_from_seed_path by @kenlin8827 in #1313
  • fix(opencode): cover Windows cleanup review blockers by @shaun0927 in #1320
  • fix(pi): align runtime with documented JSON mode by @shaun0927 in #1321
  • fix(auto): gate runs on low-ambiguity seed QA by @Q00 in #1302
  • Harden ooo auto policy aliases and timeout preflight by @shaun0927 in #1305
  • fix(config): replace retiring qa_model default with claude-sonnet-4-6 by @shaun0927 in #1323
  • fix(pi): report malformed runtime events by @Q00 in #1325
  • refactor(config): centralize Claude model pins into a single source of truth (align to 4.8) by @shaun0927 in #1324
  • feat(providers): add Pi LLM adapter by @Q00 in #1326
  • feat(harness): promote TraceGuard verdict admission by @Q00 in #1330
  • fix(h7): prefer verifier retry admission policy by @Q00 in #1331
  • docs(providers): document Pi provider surfaces by @Q00 in #1327
  • test(orchestrator): add opt-in Pi CLI smoke test by @Q00 in #1329
  • docs(runtime): fix shipped backend wording by @Q00 in #1332
  • fix(setup): wire Pi runtime setup surface by @Q00 in #1333
  • feat(interview): dispatch lateral review at milestones by @Q00 in #1334
  • feat(installer): improve install script UX by @Q00 in #1343
  • fix(installer): prefer freshly installed ouroboros for setup by @Q00 in #1345

New Contributors

Full Changelog: v0.40.1...v0.41.0