FinOps
Organ: metabolism — every token has a cost, every cost has an arm, and the organism stops burning when the cap is reached.
Octorato is an open-source agent operating system that is billed per token, run by one operator across multiple isolated client workspaces (Architecture). FinOps is the discipline that answers a single, unforgiving question: when a token is spent, which client incurred it — and have they run out of budget?
This page is the cost-governance reference. It documents the shipped pipeline end to end: how every action is traced and tagged with the arm that incurred it, how those traces roll up to per-client USD, how a statistical watchdog catches runaway spend, how a budget cap halts an agent mid-run, and how the brain's own list-price estimate is reconciled against Anthropic's actual invoice.
Everything described here is local-first and SaaS-free. The ledger lives on the operator's filesystem. No third-party FinOps platform ever sees a client's name or token counts.
A consultant running Claude Code across several engagements through one API key faces a problem that traditional cloud FinOps never had to solve at this granularity: the unit of cost is a token, and tokens are fungible across clients. Anthropic's invoice tells you the org spent N dollars. It does not tell you that Client A's refactor marathon was 70% of it and Client B barely touched the agent.
Without attribution you cannot do three things every consultant must do:
- Bill honestly. "I spent roughly this much on you this month" is not a number you can defend. A per-arm USD rollup, derived from the actual token usage of sessions run inside that client's repository, is.
- Cap spend per client. A budget is meaningless if it is org-wide. Client A's generous allowance should not let an agent quietly torch Client B's tight one.
- Detect runaways before the invoice arrives. A cost spike caught at 24h is recoverable. One discovered on next month's bill is a write-off.
The architectural answer is the same one the Architecture gives for isolation: every trace event carries the arm that incurred it. Attribution is not bolted on after the fact — it is captured at the moment the action happens, from the working directory the agent was standing in. That is the foundation the entire pipeline is built on. See Architecture for how arm isolation works and why the arm tag is trustworthy.
┌─────────────────────────────────────────────┐
8 hook points ───▶ │ trace-hook.py (capture, per event) │
(per skill / │ → ~/.claude/traces/YYYY-MM-DD.jsonl │
agent / 4D phase) │ every record tagged with `arm` │
└───────────────────┬─────────────────────────┘
│
native session JSONLs │ trace records
(~/.claude/projects/.../*.jsonl) │
│ ▼
▼ ┌───────────────────────────┐
┌──────────────────────┐ │ watchdog.py (cost_spike) │
│ skill-cost-profiler │───▶ │ z-score, 30d baseline │──▶ GitHub issue
│ per-arm USD rollup │ └───────────────────────────┘
│ (_pricing.py) │
└──────────┬───────────┘ ┌───────────────────────────┐
│ │ budget-check.py │
├────────────────▶│ budgets.yaml → exit 2 │──▶ PreToolUse HARD-STOP
│ └───────────────────────────┘
▼
┌──────────────────────┐ ┌───────────────────────────┐
│ brain-digest.py │◀────│ anthropic-analytics-pull │
│ morning report │ │ Admin API → analytics/ │
│ estimated vs billed │ └───────────────────────────┘
└──────────────────────┘
Five stages, each independently shippable, each reading the substrate the one before it produced:
| Stage | Component | Role |
|---|---|---|
| 1. Capture | `trace-hook.py` (8 hook points) | Emit a trace record per skill / agent / 4D phase, tagged with the arm |
| 2. Roll up | `skill-cost-profiler.py` + `_pricing.py` | Convert tokens → USD, aggregate per arm and per skill |
| 3. Detect | `watchdog.py` (`cost_spike`) | z-score current 24h vs 30d baseline; open an issue on a spike |
| 4. Cap | `budget-check.py` + `budgets.yaml` | Hard-stop a tool invocation when an arm burns its cap (exit 2) |
| 5. Reconcile | `anthropic-analytics-pull.py` + `brain-digest.py` | Estimated (list price) vs billed (Anthropic Admin API) |
The single thread running through all five is the arm tag and the _pricing.py USD table — one source of truth for "who" and one for "how much," shared by every layer so the digest, the watchdog, and the budget cap can never disagree.
scripts/trace-hook.py is the observability capture surface. It reads a Claude Code hook event from stdin and appends one JSONL record to ~/.claude/traces/YYYY-MM-DD.jsonl. It is best-effort by contract: any malformed input or internal error is silently swallowed (return 0) so the hook can never block the underlying tool call. A FinOps ledger that breaks the agent is worse than no ledger.
| `event` | Emitted when | Builder |
|---|---|---|
| `skill_fire` | PostToolUse on the Skill tool | `_build_skill_fire` |
| `agent_activate` | PostToolUse on the Agent tool | `_build_agent_activate` |
| `phase_boundary` | invoked with `--phase <name>` from a lifecycle hook | `_build_phase_boundary` |
Every record shares a strict schema (schemas/trace-event.schema.json, v1.0):
```json
{
  "schemaVersion": "1.0",
  "ts": "2026-05-24T15:04:01.234Z",
  "event": "skill_fire",
  "name": "querymaster-postgresql",
  "task_id": "<sha1(session_id), 40 hex chars>",
  "arm": "client-x",
  "duration_ms": null,
  "tokens": {"input": 1234, "output": 567},
  "status": "ok",
  "error": null
}
```

The `arm` field is derived from the agent's working directory:

```python
import os
import re

# Captures the directory segment immediately after .../Documents/github/
ARM_PATH_RE = re.compile(r"/Documents/github/([^/]+)(/|$)")

def _arm_from_cwd() -> str | None:
    m = ARM_PATH_RE.search(os.getcwd())
    return m.group(1) if m else None
```

The arm is the directory segment immediately after `github/` in the agent's current working directory. Stand the agent inside a client's repository and every action it takes (every skill, every subagent, every 4D phase) is automatically stamped with that client. The operator does nothing; attribution is a consequence of where the work happens, which is exactly the Architecture invariant. When the CWD does not match (e.g. the operator's bare home directory), the function returns `None`, which is written as `null` and bucketed downstream as `home` or `__unknown__`.
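As a quick illustration of the regex above, the following sketch applies the same pattern to explicit paths instead of the live working directory. The helper name `arm_from_path` and the sample paths are hypothetical, introduced only for this example.

```python
import re
from typing import Optional

# Same pattern as in the snippet above.
ARM_PATH_RE = re.compile(r"/Documents/github/([^/]+)(/|$)")

def arm_from_path(path: str) -> Optional[str]:
    """Hypothetical helper: like _arm_from_cwd(), but takes the path as an argument."""
    m = ARM_PATH_RE.search(path)
    return m.group(1) if m else None

# A subdirectory of a client repo still resolves to the client.
print(arm_from_path("/home/alice/Documents/github/client-x/src/api"))
# The repo root resolves to the same arm.
print(arm_from_path("/home/alice/Documents/github/client-x"))
# No github/ segment: no arm.
print(arm_from_path("/home/alice"))
```

Because the capture group stops at the first `/`, nesting depth inside the repository does not change the attribution.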
task_id is SHA-1(session_id) — exactly 40 hex chars, matching the schema. Autonomous turns with no session id get an ad-hoc UUID v4. This id is what later lets the profiler join a skill's token cost to its 3D-Diligent success count for ROI.
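The id rule above can be sketched in a few lines. This is a minimal illustration, not the code in `trace-hook.py`: the function name `task_id_for` is hypothetical, and returning the UUID in its standard string form is an assumption.

```python
import hashlib
import uuid
from typing import Optional

def task_id_for(session_id: Optional[str]) -> str:
    """Hypothetical helper: stable id for a session, ad-hoc id otherwise."""
    if session_id:
        # SHA-1 hex digest is always 40 hex characters.
        return hashlib.sha1(session_id.encode("utf-8")).hexdigest()
    # Assumption: autonomous turns get a fresh UUID v4 in standard string form.
    return str(uuid.uuid4())
```

Hashing keeps the id deterministic, so every event from the same session joins on the same key without storing the raw session id.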
The phases of The-4D-Paradigm are themselves traced. `trace-hook.py --phase <name>` accepts the six valid phases (describe, delegate, gate, execute, diligent, disclose); invoked without the flag, it records the two tool-completion captures (Skill, Agent). Together these are the eight hook points, wired in ~/.claude/settings.json:
| # | Lifecycle hook | Matcher | Invocation | Records |
|---|---|---|---|---|
| 1 | UserPromptSubmit | (none) | `trace-hook.py --phase describe` | `phase_boundary`: describe |
| 2 | UserPromptSubmit | (none) | `trace-hook.py --phase delegate` | `phase_boundary`: delegate |
| 3 | PreToolUse | `Write\|Edit` | `trace-hook.py --phase gate` | `phase_boundary`: gate |
| 4 | PostToolUse | `Write\|Edit` | `trace-hook.py --phase execute` | `phase_boundary`: execute |
| 5 | PostToolUse | `Write\|Edit` | `trace-hook.py --phase diligent` | `phase_boundary`: diligent |
| 6 | Stop | (none) | `trace-hook.py --phase disclose` | `phase_boundary`: disclose |
| 7 | PostToolUse | `Skill` | `trace-hook.py` (no flag) | `skill_fire` |
| 8 | PostToolUse | `Agent` | `trace-hook.py` (no flag) | `agent_activate` |
This maps the full Describe → Delegate → Gate → Execute → Diligent → Disclose nervous-system flow plus the two tool-fire signals onto a continuous trace. The diligent phase_boundary records (status: ok) are reused downstream as the ROI denominator: tokens spent per successful diligence.
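To make the ROI denominator concrete, here is a simplified sketch that counts successful `diligent` records and divides a token total by that count. It aggregates per arm for brevity (the profiler is described as joining on `task_id`), and it assumes the phase name is stored in the record's `name` field; both function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

def successful_diligence_by_arm(records: Iterable[dict]) -> Counter:
    """Count phase_boundary/diligent records with status 'ok', per arm."""
    counts: Counter = Counter()
    for rec in records:
        if (rec.get("event") == "phase_boundary"
                and rec.get("name") == "diligent"   # assumption: phase lives in "name"
                and rec.get("status") == "ok"):
            counts[rec.get("arm") or "__unknown__"] += 1
    return counts

def tokens_per_success(tokens_by_arm: Dict[str, int],
                       successes: Counter) -> Dict[str, Optional[float]]:
    """Tokens spent per successful diligence; None when there were no successes."""
    return {
        arm: (tokens / successes[arm] if successes[arm] else None)
        for arm, tokens in tokens_by_arm.items()
    }
```

Returning `None` rather than dividing by zero keeps arms with spend but no completed diligence visible in the output.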
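Storage contract. One file per UTC day under ~/.claude/traces/, append-only. Each record is written with a single POSIX O_APPEND write, which always lands at the current end of the file; for the small (sub-4096-byte) records written here, concurrent appends do not interleave in practice, so no file locking is needed. The directory is gitignored; traces carry arm names and absolute paths and must never reach the public brain repo. Full layout in docs/architecture/trace-storage.md.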
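The best-effort and storage contracts can be combined in one short sketch: one `O_APPEND` write per record into the current UTC day's file, with every error swallowed. This is an illustration under assumptions, not the shipped hook; the function name `append_trace` and the `0o600` file mode are choices made for this example.

```python
import json
import os
from datetime import datetime, timezone
from pathlib import Path

def append_trace(record: dict, traces_dir: Path) -> int:
    """Append one JSONL record to <traces_dir>/YYYY-MM-DD.jsonl; always return 0."""
    try:
        traces_dir.mkdir(parents=True, exist_ok=True)
        day = datetime.now(timezone.utc).strftime("%Y-%m-%d")
        line = (json.dumps(record, separators=(",", ":")) + "\n").encode("utf-8")
        # O_APPEND positions the write at the current end of file.
        fd = os.open(traces_dir / f"{day}.jsonl",
                     os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
        try:
            os.write(fd, line)  # a single write call per record
        finally:
            os.close(fd)
    except Exception:
        # Best-effort by contract: a failed trace must never block the tool call.
        pass
    return 0
```

Serializing the full line first and issuing one `os.write` keeps each record contiguous in the file.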
scripts/skill-cost-profiler.py turns raw usage into money. It does not read the trace files for token counts — it reads the native Claude Code session JSONLs under ~/.claude/projects/<sanitized-cwd>/<session-uuid>.jsonl, which carry the authoritative per-turn usage block (input, output, cache-creation, cache-read). The trace files supply the ROI denominator; the session logs supply the cost.
Claude Code stores each session under a directory whose name is the absolute CWD with / replaced by -. The profiler's _arm_from_session_path mirrors trace-hook.py's CWD logic against that encoding:
- `-home-<user>-Documents-github-<arm>` → the segment after `github` is the arm
- a path containing `--claude` → the `.claude` brain bucket
- bare `-home-<user>` → `home`
- anything else → `__unknown__`
So the same client identity is recovered whether you look at a trace event or a session log — the two halves of the ledger agree by construction.
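The bucket rules above could be sketched as follows. This is not the real `_arm_from_session_path`: the function name, the marker string, and the choice to return everything after the marker are assumptions. Because `/` is encoded as `-`, a hyphenated arm name cannot be told apart from a nested directory in the encoded form, so this sketch keeps the whole remainder.

```python
import re

GITHUB_MARKER = "-Documents-github-"  # encoded form of /Documents/github/

def arm_from_session_dir(dirname: str) -> str:
    """Hypothetical helper: map a sanitized session directory name to a bucket."""
    if GITHUB_MARKER in dirname:
        rest = dirname.split(GITHUB_MARKER, 1)[1]
        # Assumption: keep the full remainder, since hyphens are ambiguous.
        return rest or "__unknown__"
    if "--claude" in dirname:
        return ".claude"
    if re.fullmatch(r"-home-[^-]+", dirname):
        return "home"
    return "__unknown__"
```

The checks run in the same order as the list, so a client repository always wins over the generic buckets.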
For each assistant turn the profiler:
- extracts the rich `usage` (input + cache-creation + cache-read summed as "input-incl-cache", plus output);
- adds it to the per-arm rollup (sessions, turns, input, output, cache-read, cache-write);
- finds any `Skill` `tool_use` blocks in the turn. If the turn fired M skills, the turn's tokens split equally across them (remainder to the first). Turns with no skill land in the `__conversation__` bucket.
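The split rule described above (equal shares, remainder to the first skill, `__conversation__` when no skill fired) can be illustrated with integer division. The function name `split_tokens` is hypothetical.

```python
from typing import Dict, List

def split_tokens(total: int, skills: List[str]) -> Dict[str, int]:
    """Split a turn's tokens equally across its skills; remainder to the first."""
    if not skills:
        return {"__conversation__": total}
    share, remainder = divmod(total, len(skills))
    out: Dict[str, int] = {}
    for i, skill in enumerate(skills):
        # The first skill absorbs the remainder so the shares sum to the total.
        out[skill] = out.get(skill, 0) + share + (remainder if i == 0 else 0)
    return out
```

For example, 10 tokens across three skills gives the first skill 4 and the others 3 each, so no token is lost to rounding.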
scripts/_pricing.py is the single USD authority shared by the profiler, the watchdog's cost-spike check, the budget checker, and the digest. It holds Anthropic list price per 1M tokens per model, with four rates each:
| Rate | Meaning |
|---|---|
| `in` | input tokens |
| `out` | output tokens |
| `cache_w` | cache creation: 1.25× input (writing the cache costs more) |
| `cache_r` | cache read: ~10% of input (a cache hit is cheap; high cache-read volume is good news) |
Unknown model names fall back to a family heuristic (opus / haiku / else sonnet) so cost reporting never goes dark when Anthropic ships a new model before the dict is updated. _normalize_model_name strips [1m]-style suffixes and trailing 8-digit date stamps to find the canonical key.
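A minimal sketch of the normalization and fallback just described, assuming the suffix is a trailing bracketed tag and the date stamp is a trailing `-YYYYMMDD`. The function names are illustrative and the regexes are guesses at the behavior, not copies of `_normalize_model_name`.

```python
import re

def normalize_model_name(name: str) -> str:
    """Strip a trailing [..] suffix, then a trailing 8-digit date stamp."""
    name = re.sub(r"\[[^\]]*\]$", "", name.strip())
    return re.sub(r"-\d{8}$", "", name)

def family_fallback(name: str) -> str:
    """Pick a pricing family for an unknown model: opus, haiku, else sonnet."""
    lowered = name.lower()
    if "opus" in lowered:
        return "opus"
    if "haiku" in lowered:
        return "haiku"
    return "sonnet"
```

Defaulting to the middle family means an unrecognized model is still priced, rather than silently reported as zero.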
Billing semantics note (from `_pricing.py`): on a Claude Max/Pro subscription the USD figure is "what this would cost at list price": value extracted, not money owed. Synthetic harness messages (`model='<synthetic>'`) are $0.
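The arithmetic behind a list-price estimate is a weighted sum over the four rates. The sketch below uses placeholder rates (3.00 in, 15.00 out per 1M tokens) chosen only to make the numbers easy to follow; they are not the values in `_pricing.py`, and the function names and the exact 10% cache-read factor are assumptions.

```python
from typing import Dict

def with_cache_rates(base: Dict[str, float]) -> Dict[str, float]:
    """Derive cache rates from the input rate: 1.25x to write, ~10% to read."""
    return {**base, "cache_w": base["in"] * 1.25, "cache_r": base["in"] * 0.10}

def usd_estimate(model: str, usage: Dict[str, int], rates: Dict[str, float]) -> float:
    """List-price USD for one usage block; rates are USD per 1M tokens."""
    if model == "<synthetic>":
        return 0.0  # synthetic harness messages cost nothing
    total = sum(usage.get(k, 0) * rates[k] for k in ("in", "out", "cache_w", "cache_r"))
    return total / 1_000_000

# Placeholder rates, for illustration only.
rates = with_cache_rates({"in": 3.00, "out": 15.00})
print(usd_estimate("example-model", {"in": 1_000_000, "out": 1_000_000}, rates))
```

With these placeholder rates, one million input tokens plus one million output tokens comes to 18 USD.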
```bash
python3 ~/.claude/scripts/skill-cost-profiler.py --days 30          # markdown
python3 ~/.claude/scripts/skill-cost-profiler.py --days 30 --json   # machine-readable
```

The markdown report leads with the FinOps headline, Cost by arm, then the per-skill breakdown:
# Cost by arm (client) — last 30d
| Arm | Sessions | Turns | Total tokens | USD (list price) |
|------------|---------:|------:|-------------:|-----------------:|
| `client-x` | 42 | 1,203 | 18,400,221 | $XXX.XX |
| `home` | 15 | 310 | 3,100,004 | $XX.XX |
| **TOTAL** | | | | $XXX.XX |
The --json form ({"by_skill": [...], "by_arm": [...]}) is the API contract consumed by budget-check.py and watchdog.py. Each by_arm row exposes usd_estimate — the field every downstream stage reads.
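A downstream consumer of that contract only needs the `by_arm` rows. The sketch below assumes each row identifies its arm in a `name` field, which the text does not confirm; `usd_estimate` is the documented field. The function name is hypothetical.

```python
import json
from typing import Dict

def usd_by_arm(profiler_json: str) -> Dict[str, float]:
    """Map arm -> usd_estimate from the profiler's --json output."""
    report = json.loads(profiler_json)
    return {
        row["name"]: float(row["usd_estimate"])  # "name" is an assumed key
        for row in report.get("by_arm", [])
    }

sample = '{"by_skill": [], "by_arm": [{"name": "client-x", "usd_estimate": 12.5}]}'
print(usd_by_arm(sample))
```

Reading only `usd_estimate` keeps every consumer on the same number the profiler computed, instead of re-deriving cost from token counts.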
scripts/watchdog.py is the anomaly detector. It computes three classes — cliff_drop (a skill stopped firing), quality_drop (success rate fell >2σ), and the FinOps-relevant cost_spike — and, in --execute mode, opens a deduplicated GitHub issue.
The check (_detect_cost_spikes) calls skill-cost-profiler.py --json twice — once for the last 24h, once for the 30-day baseline — and compares per (bucket, name) where bucket is by_arm or by_skill:
```
expected_per_day = baseline_total_30d / 30
z = (observed_24h - expected_per_day) / sqrt(expected_per_day)   # Poisson z-score
```

An anomaly fires only when both hold:

- `z ≥ 2.0` (`SIGMA_THRESHOLD`): today is statistically far above the daily baseline, and
- `observed_24h ≥ 100_000` tokens (`MIN_TOKENS_FOR_COST_SPIKE`): the noise floor.
The floor is the design lesson: below ~100k tokens/day the stddev comparison is hypersensitive — a 2-turn task can sit 2σ above a near-zero baseline and trigger a meaningless alert. Brand-new arms/skills with expected_per_day < 1 are skipped entirely (no baseline → no σ).
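Put together, the z-score, the noise floor, and the no-baseline skip fit in one small function. This is a sketch of the rule as described, not the body of `_detect_cost_spikes`; the function name `cost_spike_z` is hypothetical, while the two constants use the names and values given above.

```python
import math
from typing import Optional

SIGMA_THRESHOLD = 2.0
MIN_TOKENS_FOR_COST_SPIKE = 100_000

def cost_spike_z(observed_24h: float, baseline_total_30d: float) -> Optional[float]:
    """Return the z-score if the last 24h counts as a cost spike, else None."""
    expected_per_day = baseline_total_30d / 30
    if expected_per_day < 1:
        return None  # brand-new arm or skill: no baseline, no sigma
    z = (observed_24h - expected_per_day) / math.sqrt(expected_per_day)
    if z >= SIGMA_THRESHOLD and observed_24h >= MIN_TOKENS_FOR_COST_SPIKE:
        return z
    return None
```

An arm that normally burns 100,000 tokens a day and burns 500,000 today is flagged; an arm that jumps from about 1 token a day to 50 has a large z-score but stays under the floor and is ignored.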
Because the spike check runs per-arm, a runaway is attributed to the client who caused it. The issue body carries the 24h token count, the projected daily baseline, the z-score, and the ~USD 24h figure straight from _pricing.py.
```bash
python3 ~/.claude/scripts/watchdog.py             # dry-run: print the report
python3 ~/.claude/scripts/watchdog.py --execute   # open GitHub issues
```

Dedup + suppression state lives under ~/.claude/watchdog/ (gitignored): one issue per name per day, 14-day suppression after the operator dismisses an issue. Suggested cron: `0 14 * * * python3 ~/.claude/scripts/watchdog.py --execute`.
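The dedup and suppression rule can be expressed as a pure function over two pieces of state. How the watchdog actually stores that state on disk is not described here, so the in-memory shapes below (a set of opened `(name, day)` pairs and a map of dismissal dates) and the function name are assumptions, as is treating the 14th day as still suppressed.

```python
from datetime import date, timedelta
from typing import Dict, Set, Tuple

SUPPRESS_DAYS = 14

def should_open_issue(name: str, today: date,
                      opened: Set[Tuple[str, date]],
                      dismissed: Dict[str, date]) -> bool:
    """One issue per name per day; stay quiet for 14 days after a dismissal."""
    if (name, today) in opened:
        return False  # already opened today
    dismissed_on = dismissed.get(name)
    if dismissed_on is not None and today - dismissed_on <= timedelta(days=SUPPRESS_DAYS):
        return False  # operator dismissed this recently
    return True
```

Keeping the decision free of I/O makes it easy to test the alerting policy separately from the GitHub call.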
The watchdog alerts. It does not stop anything. Stopping is Stage 4's job.
This is the stage a CFO recognizes: the difference between "we have telemetry" and "we will not run a tool that burns through this client's budget."
The config lives at ~/.claude/budgets.yaml and is gitignored — it encodes per-client pricing decisions and possibly client identifiers, so it never enters the public brain. (A budgets.json fallback is read when PyYAML is unavailable.)
```yaml
budgets:
  - arm: client-x
    monthly_usd_cap: 200.00
    action_on_breach: hard_stop   # alert | warn | hard_stop
    grace_pct: 110                # allow 10% overage before hard_stop fires
  - arm: home                     # operator's tinkering bucket
    monthly_usd_cap: 50.00
    action_on_breach: alert       # log only, never blocks
default:                          # applies to any arm not listed
  monthly_usd_cap: 100.00
  action_on_breach: warn
  grace_pct: 120
```

| Action | Trigger | Effect |
|---|---|---|
| `alert` | spend ≥ cap | tomorrow's digest shows the breach. No execution change. |
| `warn` | spend ≥ cap | digest warning for the operator. Does not halt. |
| `hard_stop` | spend ≥ cap × grace_pct/100 | PreToolUse hook refuses the tool. Operator must edit `budgets.yaml` to proceed. |
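Loading the config and resolving an arm's budget might look like the sketch below. It assumes `default` is a top-level key next to `budgets`, and it falls back to `budgets.json` both when PyYAML is missing and when no YAML file exists (the second case is an addition for this example). The function names are hypothetical; `yaml.safe_load` is the standard PyYAML call.

```python
import json
from pathlib import Path

def load_budgets(config_dir: Path) -> dict:
    """Read budgets.yaml when PyYAML is importable, else budgets.json."""
    yaml_path = config_dir / "budgets.yaml"
    try:
        import yaml  # PyYAML is optional
    except ImportError:
        yaml = None
    if yaml is not None and yaml_path.exists():
        return yaml.safe_load(yaml_path.read_text(encoding="utf-8")) or {}
    return json.loads((config_dir / "budgets.json").read_text(encoding="utf-8"))

def budget_for(arm: str, config: dict) -> dict:
    """Return the arm's own entry, or the default for arms not listed."""
    for entry in config.get("budgets", []):
        if entry.get("arm") == arm:
            return entry
    return config.get("default", {})
```

Falling through to `default` is what lets a brand-new client repository be capped before anyone writes an entry for it.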
_month_to_date_usd_by_arm() runs skill-cost-profiler.py --days <day-of-month> --json (e.g. on the 24th, --days 24) and reads each by_arm[].usd_estimate. Month-to-date, per arm, in the same USD that every other stage uses. Caps apply even to arms with zero observed spend yet — so a tool dispatched late in the month can't sneak over the line on its first run.
evaluate() returns a structured verdict (OK / WARN / HARD_STOP). The exit code is the contract:
| Exit | Status | Caller obligation |
|---|---|---|
| `0` | `OK` or `WARN` | proceed (a WARN is for the operator's eyes only) |
| `1` | config error (malformed YAML) | proceed; misconfig must not block work |
| `2` | `HARD_STOP` | caller MUST refuse the tool invocation |
HARD_STOP fires only when an arm's spend reaches cap × grace_pct/100 and that arm's action_on_breach == hard_stop. The halt_reason string spells out the arithmetic, e.g. "arm 'client-x' burned $221.40 (grace $220.00 = cap $200.00 × 110%) — refusing tool."
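The verdict logic and its exit-code mapping can be sketched as below. This is an illustration, not the shipped `evaluate()`: the signature, the `Verdict` dataclass, the default `grace_pct` of 100, and the choice to report any breach below the hard-stop threshold as `WARN` are assumptions. The halt reason follows the shape of the example string above.

```python
from dataclasses import dataclass

EXIT_CODES = {"OK": 0, "WARN": 0, "HARD_STOP": 2}

@dataclass
class Verdict:
    status: str
    halt_reason: str = ""

def evaluate(arm: str, spend: float, cap: float,
             action: str, grace_pct: float = 100.0) -> Verdict:
    """Classify month-to-date spend for one arm against its budget."""
    grace_limit = cap * grace_pct / 100
    if action == "hard_stop" and spend >= grace_limit:
        reason = (f"arm '{arm}' burned ${spend:.2f} "
                  f"(grace ${grace_limit:.2f} = cap ${cap:.2f} × {grace_pct:.0f}%); "
                  "refusing tool.")
        return Verdict("HARD_STOP", reason)
    if spend >= cap:
        return Verdict("WARN")  # assumption: alert and warn both surface as WARN
    return Verdict("OK")

verdict = evaluate("client-x", 221.40, 200.00, "hard_stop", 110)
print(verdict.status, EXIT_CODES[verdict.status])
```

Only the combination of a `hard_stop` action and spend at or above the grace limit produces exit code 2; every other path lets the tool run.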
budget-check.py is wired as a PreToolUse[Agent] hook and registered as rule FLOW.budget-halt in registry/rules.yaml. This means brain_doctor asserts its presence on every ai-push; a push whose hook is missing or whose rule entry is absent is blocked by .githooks/pre-push. The budget halt is not advisory prose. It is a first-class registered mechanism, the same way RULE #1 itself is wired.
In ~/.claude/settings.json, gate the expensive tools (Agent, subagent dispatch, browser automation) on the checker:
```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Agent",
        "hooks": [
          { "type": "command",
            "command": "python3 ~/.claude/scripts/budget-check.py --tool Agent" }
        ]
      }
    ]
  }
}
```

When the hook exits 2, Claude Code refuses to invoke the tool and surfaces the halt reason. By design there is no `--force` flag on the checker: the only way through is to edit budgets.yaml (raise the cap, or flip hard_stop → warn), making every override a durable, auditable config change rather than a transient keystroke. The checker is built to run on every invocation (target <200ms), so it stays out of the way until it matters. Full operator playbook: the finops-budget-policy skill.
Anti-patterns to avoid: hard_stop on the home bucket (you'll lock yourself out of routine exploration — use warn); grace_pct: 100 (no grace — latency, cache races, and rounding make a non-zero buffer operationally necessary); and having the agent parse a WARN and self-throttle (WARN is for humans; only exit-2 halts).
Stages 1–4 run on the brain's estimated USD — list-price math over token counts. That is good enough to cap spend and catch spikes, but it is not what Anthropic actually charges (enterprise discounts, exact rate card, cache classification). Stage 5 pulls the vendor truth and reconciles.
scripts/anthropic-analytics-pull.py calls the Anthropic Admin API GET /v1/organizations/usage_report and writes the rows verbatim to ~/.claude/analytics/anthropic-<YYYY-MM-DD>.jsonl (one row per line; the directory is gitignored — it holds vendor billing data).
```bash
python3 ~/.claude/scripts/anthropic-analytics-pull.py --dry-run   # show URL + env state
python3 ~/.claude/scripts/anthropic-analytics-pull.py --days 7    # write 7 days of rows
```

Soft-fail by design. It reads ANTHROPIC_ADMIN_API_KEY (an Admin-scope key, distinct from the per-user API key; ANTHROPIC_ORG_ID optional). If the key is unset it prints "not configured" and exits 0, and the rest of the FinOps pipeline keeps running on estimates. On Pro/Max-only accounts this stage is simply dormant.
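The soft-fail behavior amounts to a guard at the top of the entry point. In this sketch the actual Admin API request is abstracted behind a `pull` callable supplied by the caller, since its details are outside this page; the function name `run_pull` and that callable are hypothetical.

```python
from typing import Callable, Mapping

def run_pull(env: Mapping[str, str], pull: Callable[[str], None]) -> int:
    """Return a process exit code; a missing key is not an error."""
    key = env.get("ANTHROPIC_ADMIN_API_KEY")
    if not key:
        print("not configured")
        return 0  # soft-fail: the rest of the pipeline runs on estimates
    pull(key)     # hypothetical callable that performs the real request
    return 0
```

In a script this would be called as `run_pull(os.environ, pull_fn)`; passing the environment in explicitly makes the unset-key path easy to test.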
brain-digest.py (section_anthropic_reconciliation) reads the latest analytics/anthropic-*.jsonl, totals the billed USD, and renders against the estimate:
**Anthropic billed (vendor truth)** vs **Estimated (Octorato)** — 24h
- Estimated (list-price math): $XX.XX
- Billed (Anthropic Admin API): $XX.XX
- Delta: +X.X% ⚠ drift > 20%
- Source: analytics/anthropic-2026-05-24.jsonl (N rows)
A delta over 20% is flagged — it almost always means a stale _pricing.py (Anthropic changed rates) or a cache-read classification mismatch. The Admin pull is an after-the-fact truth-table, not a brake pedal: a runaway burning thousands in an hour shows up on tomorrow's pull, not in time to stop it. Stopping is Stage 4's job; this stage tells you whether your estimates were honest. Full setup: the anthropic-enterprise-analytics skill. Suggested cron (30 min after the digest, so it lands on the next morning's report): 30 15 * * * python3 ~/.claude/scripts/anthropic-analytics-pull.py.
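The drift check itself is simple percentage arithmetic. The sketch below measures the delta relative to the estimate and flags it in either direction; both choices are assumptions, as are the function names.

```python
from typing import Optional

DRIFT_THRESHOLD_PCT = 20.0

def reconciliation_delta(estimated_usd: float, billed_usd: float) -> Optional[float]:
    """Percent by which the billed amount differs from the estimate."""
    if estimated_usd <= 0:
        return None  # nothing to compare against
    return (billed_usd - estimated_usd) / estimated_usd * 100

def is_drift(delta_pct: Optional[float]) -> bool:
    """True when the delta exceeds the 20% threshold in either direction."""
    return delta_pct is not None and abs(delta_pct) > DRIFT_THRESHOLD_PCT
```

An estimate of 100 USD against a bill of 125 USD is a +25% delta and is flagged; a bill of 110 USD is +10% and is not.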
scripts/brain-digest.py is the morning report that assembles every FinOps surface into one place. It runs on the operator's machine, not in CI: it reads local session data, so it must run where that data lives. Scheduling is handled by a systemd --user timer installed by scripts/install-observability-timer.py; the timer carries Persistent=true, which recovers a run missed while the machine was asleep.
- Cost by arm (from the profiler): the billable rollup.
- Budget burn this month (from `budget-check.py`): each configured arm's MTD spend vs cap with ✓ / ⚠ / 🛑 markers; a 🛑 means a hard-stop is active on at least one arm.
- Cost spikes (from the watchdog): any 24h anomaly.
- Estimated vs billed (from the analytics pull): the reconciliation row.
One report, every morning, answering: what did each client cost, is anyone near their cap, did anything spike, and were my estimates right?
| Invariant | Why |
|---|---|
| Trace capture never blocks the agent | `trace-hook.py` swallows all errors and returns 0. A broken ledger must never break the work. |
| `_pricing.py` is the only USD authority | Profiler, watchdog, budget cap, and digest all import it, so they can never disagree on cost. |
| The arm tag comes from the CWD | Attribution is a free consequence of Architecture […] |
| Estimate runs without the vendor API | Budget caps and spikes work on Pro/Max with no Admin key. Reconciliation is purely additive. |
| Override is config, not a flag | No `--force` on `budget-check.py`. The only way past a hard-stop is an auditable `budgets.yaml` edit. |
| Cost-bearing files are gitignored | `traces/`, `analytics/`, `budgets.yaml`, `watchdog/` all stay local. The public brain never learns a client name or token count. |
Octorato's brain is open-source; its git history is public forever. None of the FinOps machinery leaks a client. Arm names live only in gitignored runtime files (traces/, analytics/, budgets.yaml); the committed scripts contain only the mechanism — regexes, z-score math, a list-price table, exit codes. The same self-publicity rule that governs Self-Growth governs the ledger: distill the technique into the public brain, keep the client's numbers on the laptop.
- Architecture: the octopus model, arm isolation, and why the `arm` tag is trustworthy
- The-4D-Paradigm: the six-phase lifecycle the trace hooks capture
- Self-Growth: the daily LLM-gated loop whose near-zero cost this envelope makes viable
- `finops-budget-policy`: operator playbook for budget caps
- `anthropic-enterprise-analytics`: operator playbook for vendor reconciliation
Octorato — open-source AI Agent OS · productized at dataqbs.com Built by Carlos Carrillo — data & AI engineering. Open to work & collaboration. Found this useful? 💜 Sponsor / Donate · MIT licensed · Report a security issue