Add quality signals (outcome inference + one-shot rate) (closes #6) #53
Merged
willwashburn merged 4 commits into feat/waste-patterns-issue-11 on Apr 24, 2026
Conversation
Resolves the design question in #6 by adopting the two signals that converged in the issue discussion: outcome inference (from agentsview) and one-shot rate (from codeburn). These are orthogonal (each catches a different failure mode) and together form the MVP quality axis that distinguishes burn's "same output, less spend" question from every other usage tracker.

- Outcome inference: classifies each session as completed | abandoned | errored | unknown with high/medium/low confidence, using turn count, ending role, trailing failure streak, recency, and (optional) last assistant text for give-up phrase detection.
- One-shot rate: per-session `oneShotTurns / editTurns`, robust in hash-only content mode since it needs only tool-call patterns.
- Both computed lazily in `@relayburn/analyze`. Nothing persists to the ledger; upgrading rules later doesn't require a rebuild. No prompt storage required (the design's hard constraint).
- Wired into `burn summary --quality` as the first consumer.

Closes #6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
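To make the `oneShotTurns / editTurns` ratio concrete, here is a minimal sketch that works from tool-call patterns only. The shapes `ToolCallSketch` and `TurnSketch`, the edit-tool names, and the retry heuristic are assumptions for illustration; only `isError` is a field this PR actually names, and the real rules in `quality.ts` may differ.

```typescript
// Hypothetical shapes; only `isError` is named in this PR.
interface ToolCallSketch {
  name: string; // e.g. "Edit", "Bash"
  isError: boolean;
}

interface TurnSketch {
  toolCalls: ToolCallSketch[];
}

// Assumption: these tool names count as edits.
const EDIT_TOOLS = new Set(["Edit", "Write", "MultiEdit"]);

function isEditTurn(turn: TurnSketch): boolean {
  return turn.toolCalls.some((call) => EDIT_TOOLS.has(call.name));
}

interface OneShotStats {
  editTurns: number;
  oneShotTurns: number;
  rate: number | null; // null when the session has no edit turns
}

// Assumption: an edit turn is "one-shot" unless the turn right after it
// contains a failing tool call (a stand-in for "the edit needed a retry").
function oneShotRateSketch(turns: TurnSketch[]): OneShotStats {
  let editTurns = 0;
  let oneShotTurns = 0;
  turns.forEach((turn, i) => {
    if (!isEditTurn(turn)) return;
    editTurns++;
    const next = turns[i + 1];
    const needsRetry =
      next !== undefined && next.toolCalls.some((call) => call.isError);
    if (!needsRetry) oneShotTurns++;
  });
  return {
    editTurns,
    oneShotTurns,
    rate: editTurns === 0 ? null : oneShotTurns / editTurns,
  };
}
```

Because the function looks only at tool names and error flags, it needs no prompt or response text, which is what keeps it usable in hash-only content mode.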
barryollama approved these changes on Apr 24, 2026 and left a comment:
LGTM overall. Clean implementation of the two orthogonal signals from #6. One minor suggestion inline.
Treat sessions whose turns lack a stopReason (e.g. Codex) as completed/low-confidence rather than misclassifying them as user-abandoned. Add 'unknown-ending' outcome, return early from inferOutcome for unknown endings, and make endingRole return 'unknown' when stopReason is undefined so trailing failure-streak detection still works. Add tests to cover the new classification and failure-streak detection for sources without stopReason.
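The ordering described in this commit can be sketched as follows: check the trailing failure streak first, then return early with completed/low when the ending is unknown. This is a sketch under assumptions, not the code in `quality.ts`: the `Sketch` types, the streak threshold, the `stopReason` values, the confidence levels other than completed/low, and the reason strings other than `unknown-ending`, `too-short`, and `user-abandoned` are all hypothetical.

```typescript
type Outcome = "completed" | "abandoned" | "errored" | "unknown";
type Confidence = "high" | "medium" | "low";
type EndingRole = "assistant" | "user" | "unknown";

// Hypothetical shapes; the PR names `stopReason` and `isError` only.
interface ToolCallSketch {
  isError: boolean;
}
interface TurnSketch {
  stopReason?: string; // undefined for sources that do not record it (e.g. Codex)
  toolCalls: ToolCallSketch[];
}
interface OutcomeSketch {
  outcome: Outcome;
  confidence: Confidence;
  reason: string;
}

// Assumption: "end_turn" means the assistant finished on its own; any other
// defined value means the session stopped while the user side still owed input.
function endingRoleSketch(last: TurnSketch | undefined): EndingRole {
  if (last === undefined || last.stopReason === undefined) return "unknown";
  return last.stopReason === "end_turn" ? "assistant" : "user";
}

// Counts trailing turns whose tool calls all failed. Does not read stopReason,
// so it also works for sources that never set it.
function trailingFailureStreak(turns: TurnSketch[]): number {
  let streak = 0;
  for (let i = turns.length - 1; i >= 0; i--) {
    const turn = turns[i];
    if (turn === undefined || turn.toolCalls.length === 0) break;
    if (!turn.toolCalls.every((call) => call.isError)) break;
    streak++;
  }
  return streak;
}

const FAILURE_STREAK_THRESHOLD = 3; // hypothetical cutoff

function inferOutcomeSketch(turns: TurnSketch[]): OutcomeSketch {
  if (turns.length === 0) {
    return { outcome: "unknown", confidence: "low", reason: "too-short" };
  }
  if (trailingFailureStreak(turns) >= FAILURE_STREAK_THRESHOLD) {
    return { outcome: "errored", confidence: "medium", reason: "trailing-failures" };
  }
  const role = endingRoleSketch(turns[turns.length - 1]);
  if (role === "unknown") {
    // Early return: no stopReason, so do not guess "user-abandoned".
    return { outcome: "completed", confidence: "low", reason: "unknown-ending" };
  }
  if (role === "assistant") {
    return { outcome: "completed", confidence: "high", reason: "assistant-ended" };
  }
  return { outcome: "abandoned", confidence: "medium", reason: "user-abandoned" };
}
```

Running the streak check before the unknown-ending return is what lets a source without `stopReason` still be classified as errored when its last turns all failed.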
- Extend single-exchange to messageCount <= 2 so a true one-turn "user asks, assistant answers" session is classified completed/medium instead of too-short/unknown. TurnRecord counts assistant turns only; the prior `=== 2` gate only caught tool-mediated round trips.
- Add three give-up phrases observed in real Claude/Codex sessions.
- Parallelize content sidecar reads in `burn summary --quality` with a concurrency cap of 8 so large ledgers (many sessions, many ENOENT paths) don't serialize I/O.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
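One common way to get a concurrency cap like the one above is a small worker pool over a shared index. The helper below is a generic sketch, not the code in this PR; `mapWithConcurrency` and `SIDECAR_READ_CONCURRENCY` are hypothetical names.

```typescript
// The cap named in the commit message; the constant name is hypothetical.
const SIDECAR_READ_CONCURRENCY = 8;

// Runs `fn` over `items` with at most `limit` calls in flight at once.
// Results keep the input order regardless of completion order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0; // shared cursor; safe because JS runs one callback at a time

  async function worker(): Promise<void> {
    while (true) {
      const i = next++;
      if (i >= items.length) return;
      results[i] = await fn(items[i] as T, i);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

For sidecar reads, `fn` would wrap the file read and turn a missing file (ENOENT) into `null` rather than rejecting, so one absent sidecar does not fail the whole batch; a call would then look like `mapWithConcurrency(paths, SIDECAR_READ_CONCURRENCY, readOrNull)`, where `readOrNull` is a hypothetical reader.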
Adds [Unreleased] entries for PR #53 to the root, analyze, and cli changelogs covering the outcome inference + one-shot rate module and the new `burn summary --quality` flag. Notes the Codex unknown-stopReason handling and the concurrency cap on sidecar reads. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Adds the quality signals (outcome inference + one-shot rate) in `packages/analyze/src/quality.ts`, wires into `burn summary --quality` as the first consumer.
Stacked on #52: this PR uses `ToolCall.isError` (added in #52) for trailing-failure-streak detection. Merge #52 first, then this rebases onto main cleanly.
Closes #6.
Design decision (the actual resolution of issue #6)
After the discussion on the issue, the candidate ranking settled on:
(f) + (g) is the MVP. This PR lands both as concrete implementations, which closes the design issue.
Why this pair
The second row matters: a session can be `completed` with a low one-shot rate, which is exactly the "Sonnet nailed it first try" vs "Haiku got there after 3 retries" distinction that drives model-choice decisions. Neither signal alone catches this; both together do.
Hard constraints honored
What's not in this PR
burn tag --outcome). Design demoted it to optional enhancement; worth a follow-up issue once the auto-classifier has real usage data.
Test plan
- `pnpm run test:ts`: 238 pass (13 new quality tests covering each outcome reason + one-shot edge cases)
- `pnpm dev:cli summary --quality --since 7d` against real sessions: outputs sensible rollup (51 completed / 8 abandoned / 0 errored / 10 unknown, 99.7% one-shot rate across 1265 edit turns in my personal ledger)

🤖 Generated with Claude Code