
Add quality signals (outcome inference + one-shot rate) (closes #6)#53

Merged
willwashburn merged 4 commits into feat/waste-patterns-issue-11 from feat/quality-signal-issue-6
Apr 24, 2026

Conversation

@willwashburn (Member) commented Apr 24, 2026

Summary

  • Resolves issue Design: outcome / quality signal for 'same output, less spend' comparisons #6 (outcome/quality signal design) by adopting the two signals the discussion converged on: outcome inference (agentsview) and one-shot rate (codeburn). They're orthogonal — each catches a different failure mode — and together form the MVP quality axis that separates "cheaper model would have worked" from "cheaper model would have broken everything."
  • Implements both in packages/analyze/src/quality.ts, wires into burn summary --quality as the first consumer.

Stacked on #52 — this PR uses ToolCall.isError (added in #52) for trailing-failure-streak detection. Merge #52 first, then this rebases onto main cleanly.

Closes #6.

Design decision (the actual resolution of issue #6)

After the discussion on the issue, the candidate ranking settled on:

  • (a) Explicit tagging — optional enhancement, not primary
  • (b) Git-state scrape — future option
  • (c) Benchmark harness — out of scope
  • (d) Do nothing — rejected (forfeits the differentiator)
  • (e) Automatic adherence detection (prism) — rejected (doesn't scale beyond hand-coded per-rule checkers)
  • (f) Outcome inference (agentsview) — primary
  • (g) One-shot rate (codeburn) — primary

(f) + (g) is the MVP. This PR lands both as concrete implementations, which closes the design issue.

Why this pair

| Signal | Captures | Needs content? | Granularity |
| --- | --- | --- | --- |
| Outcome inference | abandoned / give-up sessions | Optional (degrades gracefully) | Per-session |
| One-shot rate | retry-heavy / stuck sessions | No | Per-turn → per-session |

The second row matters: a session can be completed with a low one-shot rate, which is exactly the "Sonnet nailed it first try" vs "Haiku got there after 3 retries" distinction that drives model-choice decisions. Neither signal alone catches this; both together do.
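As a minimal sketch of how a per-session one-shot rate could be computed, assuming illustrative `TurnRecord`/`ToolCall` shapes and edit-tool names (not the actual `@relayburn/analyze` API):

```typescript
// Hypothetical shapes for illustration; the real types live in
// @relayburn/analyze and may differ.
interface ToolCall {
  name: string;
  isError: boolean;
}
interface TurnRecord {
  toolCalls: ToolCall[];
}

// Assumed set of tools that count as "edit" tools.
const EDIT_TOOLS = new Set(["Edit", "Write", "MultiEdit"]);

// An "edit turn" is any turn containing at least one edit-tool call;
// it is "one-shot" when none of those edit calls errored.
function oneShotRate(turns: TurnRecord[]): number | null {
  let editTurns = 0;
  let oneShotTurns = 0;
  for (const turn of turns) {
    const edits = turn.toolCalls.filter((c) => EDIT_TOOLS.has(c.name));
    if (edits.length === 0) continue; // not an edit turn
    editTurns++;
    if (edits.every((c) => !c.isError)) oneShotTurns++;
  }
  // Sessions with no edit turns have no defined one-shot rate.
  return editTurns === 0 ? null : oneShotTurns / editTurns;
}
```

Because this only inspects tool-call metadata, it works even when message content is stored hash-only.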

Hard constraints honored

  • No prompt storage required. Outcome inference works from turn metadata; last-assistant-text is used only to downgrade confidence, never required.
  • Nothing persists to the ledger — both signals are computed lazily at query time so upgrading the rules later doesn't require a rebuild.
  • Confidence is explicit on every outcome classification so downstream consumers can filter low-confidence signals rather than treat them as noise.
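For instance, an explicit-confidence classification might be shaped like this (field names, reason tags, and the filter helper are assumptions for illustration, not the real `quality.ts` API):

```typescript
// Illustrative sketch only; the actual types in
// packages/analyze/src/quality.ts may differ.
type Outcome = "completed" | "abandoned" | "errored" | "unknown";
type Confidence = "high" | "medium" | "low";

interface SessionOutcome {
  outcome: Outcome;
  confidence: Confidence;
  reason: string; // e.g. "trailing-failure-streak" (hypothetical tag)
}

const RANK: Record<Confidence, number> = { low: 0, medium: 1, high: 2 };

// Downstream consumers filter out low-confidence classifications
// rather than treating them as noise.
function atLeast(
  outcomes: SessionOutcome[],
  floor: Confidence,
): SessionOutcome[] {
  return outcomes.filter((o) => RANK[o.confidence] >= RANK[floor]);
}
```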

What's not in this PR

  • Explicit user-tagging UI (burn tag --outcome). Design demoted it to optional enhancement; worth a follow-up issue once the auto-classifier has real usage data.
  • Git-state scrape (b). Future option, separate PR if demand warrants.

Test plan

  • pnpm run test:ts — 238 pass (13 new quality tests covering each outcome reason + one-shot edge cases)
  • pnpm dev:cli summary --quality --since 7d against real sessions — outputs sensible rollup (51 completed / 8 abandoned / 0 errored / 10 unknown, 99.7% one-shot rate across 1265 edit turns in my personal ledger)

🤖 Generated with Claude Code



Resolves the design question in #6 by adopting the two signals that
converged in the issue discussion: outcome inference (from agentsview) and
one-shot rate (from codeburn). These are orthogonal — each catches a
different failure mode — and together form the MVP quality axis that
distinguishes burn's "same output, less spend" question from every other
usage tracker.

- Outcome inference: classifies each session as completed | abandoned |
  errored | unknown with high/medium/low confidence, using turn count,
  ending role, trailing failure streak, recency, and (optional) last
  assistant text for give-up phrase detection.
- One-shot rate: per-session `oneShotTurns / editTurns` — robust in
  hash-only content mode since it needs only tool-call patterns.
- Both computed lazily in `@relayburn/analyze`. Nothing persists to the
  ledger — upgrading rules later doesn't require a rebuild. No prompt
  storage required (the design's hard constraint).
- Wired into `burn summary --quality` as the first consumer.

Closes #6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@barryollama left a comment

LGTM overall. Clean implementation of the two orthogonal signals from #6. One minor suggestion inline.

Comment thread on packages/analyze/src/quality.ts (marked as resolved)

willwashburn and others added 3 commits April 24, 2026 06:56
Treat sessions whose turns lack a stopReason (e.g. Codex) as completed/low-confidence rather than misclassifying them as user-abandoned. Add 'unknown-ending' outcome, return early from inferOutcome for unknown endings, and make endingRole return 'unknown' when stopReason is undefined so trailing failure-streak detection still works. Add tests to cover the new classification and failure-streak detection for sources without stopReason.
- Extend single-exchange to messageCount <= 2 so a true one-turn "user
  asks, assistant answers" session is classified completed/medium
  instead of too-short/unknown. TurnRecord counts assistant turns only;
  the prior `=== 2` gate only caught tool-mediated round trips.
- Add three give-up phrases observed in real Claude/Codex sessions.
- Parallelize content sidecar reads in `burn summary --quality` with
  a concurrency cap of 8 so large ledgers (many sessions, many ENOENT
  paths) don't serialize I/O.
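The capped-parallelism pattern described above can be sketched as a generic worker-pool map (a sketch of the concurrency-cap idea only; the real helper in burn may be named and shaped differently):

```typescript
// Run `fn` over `items` with at most `limit` calls in flight at once.
// Generic illustration of a concurrency cap; not the actual burn code.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // shared cursor; safe because JS runs single-threaded
  const worker = async () => {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  };
  // Spawn up to `limit` workers that drain the shared cursor.
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker),
  );
  return results;
}
```

With `limit = 8`, a ledger with many sessions issues at most eight sidecar reads concurrently instead of serializing them.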

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds [Unreleased] entries for PR #53 to the root, analyze, and cli
changelogs covering the outcome inference + one-shot rate module and
the new `burn summary --quality` flag. Notes the Codex
unknown-stopReason handling and the concurrency cap on sidecar reads.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@willwashburn willwashburn merged commit febf40f into feat/waste-patterns-issue-11 Apr 24, 2026
@willwashburn willwashburn deleted the feat/quality-signal-issue-6 branch April 24, 2026 11:26