
Add quality signals (outcome inference + one-shot rate) (closes #6)#53

Merged
willwashburn merged 4 commits into feat/waste-patterns-issue-11 from feat/quality-signal-issue-6
Apr 24, 2026

Conversation

@willwashburn (Member) commented Apr 24, 2026

Summary

  • Resolves issue Design: outcome / quality signal for 'same output, less spend' comparisons #6 (outcome/quality signal design) by adopting the two signals the discussion converged on: outcome inference (agentsview) and one-shot rate (codeburn). They're orthogonal — each catches a different failure mode — and together form the MVP quality axis that separates "cheaper model would have worked" from "cheaper model would have broken everything."
  • Implements both in packages/analyze/src/quality.ts, wires into burn summary --quality as the first consumer.

Stacked on #52 — this PR uses ToolCall.isError (added in #52) for trailing-failure-streak detection. Merge #52 first, then this rebases onto main cleanly.

Closes #6.

Design decision (the actual resolution of issue #6)

After the discussion on the issue, the candidate ranking settled on:

  • (a) Explicit tagging — optional enhancement, not primary
  • (b) Git-state scrape — future option
  • (c) Benchmark harness — out of scope
  • (d) Do nothing — rejected (forfeits the differentiator)
  • (e) Automatic adherence detection (prism) — rejected (doesn't scale beyond hand-coded per-rule checkers)
  • (f) Outcome inference (agentsview) — primary
  • (g) One-shot rate (codeburn) — primary

(f) + (g) is the MVP. This PR lands both as concrete implementations, which closes the design issue.

Why this pair

| Signal | Captures | Needs content? | Granularity |
| --- | --- | --- | --- |
| Outcome inference | abandoned / give-up sessions | Optional (degrades gracefully) | Per-session |
| One-shot rate | retry-heavy / stuck sessions | No | Per-turn → per-session |

The second row matters: a session can be completed with a low one-shot rate, which is exactly the "Sonnet nailed it first try" vs "Haiku got there after 3 retries" distinction that drives model-choice decisions. Neither signal alone catches this; both together do.
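As a minimal sketch of how a per-session one-shot rate could be computed, assuming illustrative `TurnRecord`/`ToolCall` shapes and edit-tool names (not the actual `@relayburn/analyze` API):

```typescript
// Hypothetical shapes for illustration; the real types live in
// @relayburn/analyze and may differ.
interface ToolCall {
  name: string;
  isError: boolean;
}
interface TurnRecord {
  toolCalls: ToolCall[];
}

// Assumed set of tools that count as "edit" tools.
const EDIT_TOOLS = new Set(["Edit", "Write", "MultiEdit"]);

// An "edit turn" is any turn containing at least one edit-tool call;
// it is "one-shot" when none of those edit calls errored.
function oneShotRate(turns: TurnRecord[]): number | null {
  let editTurns = 0;
  let oneShotTurns = 0;
  for (const turn of turns) {
    const edits = turn.toolCalls.filter((c) => EDIT_TOOLS.has(c.name));
    if (edits.length === 0) continue; // not an edit turn
    editTurns++;
    if (edits.every((c) => !c.isError)) oneShotTurns++;
  }
  // Sessions with no edit turns have no defined one-shot rate.
  return editTurns === 0 ? null : oneShotTurns / editTurns;
}
```

Because this only inspects tool-call metadata, it works even when message content is stored hash-only.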

Hard constraints honored

  • No prompt storage required. Outcome inference works from turn metadata; last-assistant-text is used only to downgrade confidence, never required.
  • Nothing persists to the ledger — both signals are computed lazily at query time so upgrading the rules later doesn't require a rebuild.
  • Confidence is explicit on every outcome classification so downstream consumers can filter low-confidence signals rather than treat them as noise.
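For instance, an explicit-confidence classification might be shaped like this (field names, reason tags, and the filter helper are assumptions for illustration, not the real `quality.ts` API):

```typescript
// Illustrative sketch only; the actual types in
// packages/analyze/src/quality.ts may differ.
type Outcome = "completed" | "abandoned" | "errored" | "unknown";
type Confidence = "high" | "medium" | "low";

interface SessionOutcome {
  outcome: Outcome;
  confidence: Confidence;
  reason: string; // e.g. "trailing-failure-streak" (hypothetical tag)
}

const RANK: Record<Confidence, number> = { low: 0, medium: 1, high: 2 };

// Downstream consumers filter out low-confidence classifications
// rather than treating them as noise.
function atLeast(
  outcomes: SessionOutcome[],
  floor: Confidence,
): SessionOutcome[] {
  return outcomes.filter((o) => RANK[o.confidence] >= RANK[floor]);
}
```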

What's not in this PR

  • Explicit user-tagging UI (burn tag --outcome). Design demoted it to optional enhancement; worth a follow-up issue once the auto-classifier has real usage data.
  • Git-state scrape (b). Future option, separate PR if demand warrants.

Test plan

  • pnpm run test:ts — 238 pass (13 new quality tests covering each outcome reason + one-shot edge cases)
  • pnpm dev:cli summary --quality --since 7d against real sessions — outputs sensible rollup (51 completed / 8 abandoned / 0 errored / 10 unknown, 99.7% one-shot rate across 1265 edit turns in my personal ledger)

🤖 Generated with Claude Code



Resolves the design question in #6 by adopting the two signals that
converged in the issue discussion: outcome inference (from agentsview) and
one-shot rate (from codeburn). These are orthogonal — each catches a
different failure mode — and together form the MVP quality axis that
distinguishes burn's "same output, less spend" question from every other
usage tracker.

- Outcome inference: classifies each session as completed | abandoned |
  errored | unknown with high/medium/low confidence, using turn count,
  ending role, trailing failure streak, recency, and (optional) last
  assistant text for give-up phrase detection.
- One-shot rate: per-session `oneShotTurns / editTurns` — robust in
  hash-only content mode since it needs only tool-call patterns.
- Both computed lazily in `@relayburn/analyze`. Nothing persists to the
  ledger — upgrading rules later doesn't require a rebuild. No prompt
  storage required (the design's hard constraint).
- Wired into `burn summary --quality` as the first consumer.

Closes #6.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@barryollama left a comment

LGTM overall. Clean implementation of the two orthogonal signals from #6. One minor suggestion inline.

Comment thread on packages/analyze/src/quality.ts (marked as resolved)

willwashburn and others added 3 commits April 24, 2026 06:56
Treat sessions whose turns lack a stopReason (e.g. Codex) as completed/low-confidence rather than misclassifying them as user-abandoned. Add 'unknown-ending' outcome, return early from inferOutcome for unknown endings, and make endingRole return 'unknown' when stopReason is undefined so trailing failure-streak detection still works. Add tests to cover the new classification and failure-streak detection for sources without stopReason.
- Extend single-exchange to messageCount <= 2 so a true one-turn "user
  asks, assistant answers" session is classified completed/medium
  instead of too-short/unknown. TurnRecord counts assistant turns only;
  the prior `=== 2` gate only caught tool-mediated round trips.
- Add three give-up phrases observed in real Claude/Codex sessions.
- Parallelize content sidecar reads in `burn summary --quality` with
  a concurrency cap of 8 so large ledgers (many sessions, many ENOENT
  paths) don't serialize I/O.
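The capped-parallelism pattern described above can be sketched as a generic worker-pool map (a sketch of the concurrency-cap idea only; the real helper in burn may be named and shaped differently):

```typescript
// Run `fn` over `items` with at most `limit` calls in flight at once.
// Generic illustration of a concurrency cap; not the actual burn code.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // shared cursor; safe because JS runs single-threaded
  const worker = async () => {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  };
  // Spawn up to `limit` workers that drain the shared cursor.
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker),
  );
  return results;
}
```

With `limit = 8`, a ledger with many sessions issues at most eight sidecar reads concurrently instead of serializing them.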

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds [Unreleased] entries for PR #53 to the root, analyze, and cli
changelogs covering the outcome inference + one-shot rate module and
the new `burn summary --quality` flag. Notes the Codex
unknown-stopReason handling and the concurrency cap on sidecar reads.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@willwashburn willwashburn merged commit febf40f into feat/waste-patterns-issue-11 Apr 24, 2026
@willwashburn willwashburn deleted the feat/quality-signal-issue-6 branch April 24, 2026 11:26