Ship agents that don't go off the rails.
A drop-in MCP supervisor that gates every agent action with a verdict — auto / supervised / escalate — and a per-category trust score that rises and falls based on observed outcomes, so risky categories stay held back while proven ones run unattended. The safety floor is a pure deterministic function: irreversibility (wire money, delete prod, post publicly) always escalates, even when the supervisor LLM is 99% confident it's fine.
> [!IMPORTANT]
> 2026-05-07 — Source-available preview (Phase 1).
> The code is here for evaluation, transparency, and validation. Pull requests and public issues are not yet accepted — see CONTRIBUTING.md for the phased OSS plan and SECURITY.md for vulnerability reports.
LLM-agnostic. Swap Anthropic, OpenAI, Ollama, or any OpenAI-compatible endpoint (Groq, Together, vLLM, llama.cpp) with one env var. Persists to SQLite, writes auto-execute actions to a filesystem outbox, queues risky ones for human approval.
This is the reference implementation of YC RFS #04 (Company Brain) and #12 (Software for Agents): machine-native paths, permissions, recovery.
```bash
npm install -g @reaves-labs/agent-os
```

```bash
# Submit an action — supervisor decides verdict
agent-os submit send_email "Email Jane the Q3 deck" --irreversibility=external

# Outcome reporting moves the trust score for the category
agent-os outcome <actionId> true --capability=1 --quality=0.85 --impact=0.7

# Inspect state (recent actions, pending approvals, trust scores)
agent-os status

# Or run as an MCP server on stdio
agent-os serve
```

The "supervisor" is whichever LLM you want grading proposed actions and returning verdicts. agent-os doesn't care which — it's a decision-maker role, not a brand. Pick the one whose latency, cost, and quality fit the risk profile of your agent:
- Ollama — free, local, no network. Best default for development and for any agent supervising irreversible actions where you don't want prompts leaving the machine.
- Anthropic — strongest verdict quality at production cost. Best when the supervisor's "why" matters because a human will read it.
- OpenAI — fast and cheap with `gpt-4o-mini`. Best for high-volume, low-stakes supervision.
- Generic — anything OpenAI-compatible (Groq for ~10× faster inference, vLLM for self-hosted, Together / llama.cpp for whatever fits your infra).
```bash
export AGENT_OS_SUPERVISOR=ollama     # free local (default fallback)
export AGENT_OS_SUPERVISOR=anthropic  # ANTHROPIC_API_KEY
export AGENT_OS_SUPERVISOR=openai     # OPENAI_API_KEY
export AGENT_OS_SUPERVISOR=generic    # AGENT_OS_BASE_URL + AGENT_OS_API_KEY
```

If you don't set the env var, agent-os auto-detects in order: `ANTHROPIC_API_KEY` → `OPENAI_API_KEY` → Ollama at `localhost:11434`.
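The fallback order is simple enough to sketch. This is an illustration of the documented behavior, not the package's actual API — the function name is hypothetical:

```ts
// Illustrative only: the documented detection order.
// agent-os does this internally; this is not its published API.
function detectSupervisor(env = process.env): string {
  if (env.AGENT_OS_SUPERVISOR) return env.AGENT_OS_SUPERVISOR; // explicit override wins
  if (env.ANTHROPIC_API_KEY) return "anthropic";
  if (env.OPENAI_API_KEY) return "openai";
  return "ollama"; // assumes a local server at localhost:11434
}
```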
```json
{
  "mcpServers": {
    "agent-os": {
      "command": "agent-os-mcp",
      "env": {
        "AGENT_OS_SUPERVISOR": "ollama"
      }
    }
  }
}
```

📖 Full 5-minute walkthrough: examples/claude-code/README.md — install → wire up → submit a real action → watch the trust score evolve.
Same shape, in `~/.cursor/mcp.json`.
`agent-os-mcp` runs as a stdio MCP server — point any compliant client at it.
Honest comparison with no hand-waving: docs/COMPARE.md. Tells you exactly when to pick agent-os and when to pick the alternatives.
These are the four things your agent can call. Two are part of every turn (`submit_action` + `record_outcome`); two are situational (`get_routing` when you want a routing suggestion, `recover` when an action fails).
Your agent describes a proposed action and gets back a verdict. agent-os
routes it to the right role, asks the supervisor to grade it, applies the
irreversibility floor, persists the decision, and writes the action to
either the outbox (auto) or the approval queue (supervised / escalate).
| input | type | description |
|---|---|---|
| `agentId` | string | who's calling |
| `category` | string | trust scores accrue per category |
| `action` | string | plain-English description |
| `irreversibility` | `reversible` \| `external` \| `irreversible` | safety floor |
| `context` | string? | optional extra context |

Returns: `{ actionId, verdict, score, why, routedTo, effectPath, trustScoreAfter }`.
- `auto` → file written to `~/.agent-os/outbox/<role>/` — execute it
- `supervised` or `escalate` → file queued at `~/.agent-os/queues/approval/` — wait for human approval
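A minimal outbox consumer, as a sketch. The per-file schema isn't documented here, so treat the parsed shape as an assumption and check the actual files agent-os writes:

```ts
import { readdir, readFile, unlink } from "node:fs/promises";
import { join } from "node:path";
import { homedir } from "node:os";

// Drain one role's outbox: execute each auto-approved action, then ack
// by deleting the file. The JSON shape inside each file is assumed.
async function drainOutbox(role: string, execute: (action: unknown) => Promise<void>) {
  const dir = join(homedir(), ".agent-os", "outbox", role);
  for (const name of await readdir(dir)) {
    const file = join(dir, name);
    const action = JSON.parse(await readFile(file, "utf8"));
    await execute(action); // your side effect — send the email, run the job, etc.
    await unlink(file);    // remove so the action isn't executed twice
  }
}
```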
After the action actually runs, your agent reports the result. agent-os
computes worth = capability × quality × impact and updates the trust
score for that category. Without outcomes, trust never moves — so the
agent never earns auto-execute, and proven categories never get rewarded.
Always pair submit_action with record_outcome.
| input | type | description |
|---|---|---|
| `actionId` | string | from `submit_action` |
| `success` | boolean | did the execution succeed? |
| `capability` | 0..1 | did the output exist? |
| `quality` | 0..1 | was it good? |
| `impact` | 0..1 | did it produce value? |
| `notes` | string? | freeform |

Returns: `{ worth, trustScoreBefore, trustScoreAfter, category }`.
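The worth arithmetic is a straight product of the three axes, so a worked example is one line each:

```ts
// worth = capability × quality × impact, each in [0..1]
const worth = (capability: number, quality: number, impact: number) =>
  capability * quality * impact;

worth(1, 0.85, 0.7); // ≈ 0.595 — the CLI example in the quick start
worth(1, 0.9, 0.7);  // ≈ 0.63  — the library example further down
```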
Hand it a free-text description; get back a role recommendation
(writer, code, analyst, etc.) plus a confidence. Useful when your
agent needs to dispatch work but doesn't already know which sub-agent
should own it. Most callers don't need this — submit_action does its
own routing internally.
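The response is small. An illustrative shape, based on the description above — field names are assumptions, not the published schema:

```ts
// Illustrative only — not the published schema.
type RoutingSuggestion = {
  role: string;       // e.g. "writer", "code", "analyst"
  confidence: number; // 0..1
};
```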
Submit a failed actionId plus the error. The supervisor returns a
structured plan — retry with different inputs, rollback what happened,
or escalate to a human. Designed so your agent doesn't have to invent a
recovery strategy on the fly.
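An illustrative shape for the plan, covering the three strategies named above — again, field names are assumptions rather than the published schema:

```ts
// Illustrative only — field names are assumptions, not the published schema.
type RecoveryPlan =
  | { strategy: "retry"; inputs: Record<string, unknown> } // retry with different inputs
  | { strategy: "rollback"; steps: string[] }              // undo what already happened
  | { strategy: "escalate"; reason: string };              // hand off to a human
```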
If you'd rather embed agent-os directly in your own Node code than run it as an MCP server, the entire surface is one class. Same supervision, same trust scores, same safety floor — no MCP server process required.
```ts
import { AgentOS } from "@reaves-labs/agent-os";

const os = new AgentOS({
  supervisor: { backend: "ollama", model: "llama3.2" },
  workdir: "./.agent-os",
});

// Your agent proposes an action — gets back a verdict
const verdict = await os.submit({
  agentId: "trader-7",
  category: "place_trade",
  action: "Open $200 BTC long at market",
  irreversibility: "irreversible",
});
// verdict.verdict === "escalate" (irreversibility floor — even if
// the supervisor LLM said "auto/0.99", the gate forces escalation)

// After the action actually runs (or doesn't), report back
await os.recordOutcome({
  actionId: verdict.actionId,
  success: true,
  capability: 1, quality: 0.9, impact: 0.7,
});
```

This is the same code path the MCP server uses internally. Pick MCP for multi-agent / cross-language setups; pick the library for tightly coupled single-process Node agents.
Three things decide whether an action runs unattended, gets held for review, or gets bounced back: how hard the action is to undo, how confident the LLM is that it's safe right now, and how trustworthy this kind of action has been historically.
1. The safety floor — how hard the action is to undo. Every action is tagged with one of three reversibility levels:
- `reversible` — file write, draft email, scratch directory ops; anything you could Ctrl-Z
- `external` — anything visible to the outside world (post a tweet, send an email, charge a card)
- `irreversible` — wire money, delete production, fire someone
Irreversible actions always escalate, no matter how confident the
LLM is. External actions get bumped from auto down to supervised
even if the LLM says they look fine. This is the load-bearing safety
property: the LLM is an advisor, not the decision-maker. Confidence
cannot override category.
2. The verdict — LLM's read on this specific request. The
supervisor returns a verdict (auto / supervised / escalate) and
a confidence score (0–1). Below 0.5, we escalate even reversible
actions. Between 0.5 and 0.8, the strongest verdict we trust is
"supervised." At 0.8 and above, we accept the LLM's verdict directly
(but the irreversibility floor still applies — an irreversible action
at 0.99 still escalates).
3. Trust score — has this kind of action been trustworthy
before? For each category (send_email, wire_money, commit_pr,
etc.) we track how well past actions of that kind turned out. New
categories start at low trust (~0.06) — you earn auto-execute by
recording good outcomes. Bad outcomes pull the score down; old
outcomes decay, so a category that was reliable six months ago and
flaky now is treated as flaky.
The complete decision rule, in code:
irreversibility="irreversible" → always "escalate"
irreversibility="external" + verdict="auto" → downgrade to "supervised"
score < 0.5 → "escalate"
score < 0.8 → "supervised"
score >= 0.8 → whatever the LLM said
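The same rule as a runnable sketch — a paraphrase, not the shipped implementation. Per the prose above, the 0.5–0.8 band caps the verdict at "supervised" rather than upgrading an "escalate":

```ts
type Verdict = "auto" | "supervised" | "escalate";
type Irreversibility = "reversible" | "external" | "irreversible";

function finalVerdict(irr: Irreversibility, llm: Verdict, score: number): Verdict {
  if (irr === "irreversible") return "escalate";            // floor: confidence can't override
  let v: Verdict = llm;
  if (irr === "external" && v === "auto") v = "supervised"; // external never runs unattended
  if (score < 0.5) return "escalate";                       // low confidence escalates everything
  if (score < 0.8) return v === "escalate" ? "escalate" : "supervised"; // cap at supervised
  return v;                                                 // >= 0.8: accept the LLM's verdict
}
```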
For the math: the trust score is the lower confidence bound of a Beta-Bernoulli posterior over per-category outcomes, with an (α=2, β=2) uniform-ish prior and time-decayed evidence (50-observation half-life). The variance penalty uses z=1.96 (a 95% LCB). See src/store.ts for the exact update rule and test/e2e.test.mjs for the behavioral invariants the math has to satisfy (conservative start, LCB asymptote, decay-driven drop, uncertainty penalty).
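A compact sketch of that update, assuming worth is folded in as fractional Bernoulli evidence and decay multiplies the existing pseudo-counts (the exact rule lives in src/store.ts). The fresh-category LCB works out to ≈ 0.06, matching the starting trust quoted above:

```ts
const Z = 1.96;                       // 95% lower confidence bound
const DECAY = Math.pow(0.5, 1 / 50);  // 50-observation half-life

interface Trust { alpha: number; beta: number }
const fresh = (): Trust => ({ alpha: 2, beta: 2 }); // (α=2, β=2) prior

// Fold in one outcome's worth ∈ [0,1] as fractional evidence,
// decaying old pseudo-counts first so stale history fades.
function record(t: Trust, worth: number): Trust {
  return { alpha: t.alpha * DECAY + worth, beta: t.beta * DECAY + (1 - worth) };
}

// Score = posterior mean minus z standard deviations (the LCB).
function trustScore({ alpha, beta }: Trust): number {
  const n = alpha + beta;
  const mean = alpha / n;
  const sd = Math.sqrt((alpha * beta) / (n * n * (n + 1)));
  return Math.max(0, mean - Z * sd);
}

trustScore(fresh()); // ≈ 0.06 — conservative start for a new category
```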
Most guardrails libraries (NeMo Guardrails, Guardrails AI, Llama Guard) make a pass/fail decision per call against rules you wrote up front. That works — but it's stateless: every call is graded fresh, with no memory of how this kind of action has gone before, and the LLM sits inside the safety decision (which means a confident-but-wrong LLM can talk itself past the rule). agent-os is built around the opposite shape:
| Guardrails libs | agent-os |
|---|---|
| Static rules per call | Per-category trust scores that compound |
| Pass/fail at inference time | Three-way verdict (auto / supervised / escalate) |
| No memory across actions | SQLite memory of every action and outcome |
| No notion of irreversibility | Irreversibility floor: irreversible always escalates |
| Raw errors on failure | Structured recovery plans |
Many production stacks use both: guardrails libraries to validate the content of an LLM's output ("is this JSON valid, does it leak PII"); agent-os to gate whether the resulting action runs ("we've seen 50 bad outcomes from this category lately — hold this one for review").
See docs/COMPARE.md for the full head-to-head.
MIT © Reaves Labs and Learning, LLC