
Design: outcome / quality signal for 'same output, less spend' comparisons #6

@willwashburn

Description

This is a design issue, not an implementation ticket.

Before building, we need a decision.

Problem

Burn's meta-goal is answering "would the same work have cost less with a different model / harness / tool choice?" Spend data alone can't answer this — we also need a notion of whether the cheaper alternative produced an equivalent result. Otherwise "Haiku would have cost 5× less" is useless if Haiku would have produced broken code.

This is the moat versus every other usage tracker (TokenTracker, ccusage, Cursor stats). None of them have a quality axis. Whichever signal we pick, the data model must accommodate it from the start so it doesn't become a v2 retrofit.

Candidate signals

(a) Explicit outcome tag at session end

burn tag --outcome good|bad|meh [--note "…"]
burn claude --tag workflow=refactor -- <claude args>
  # on exit, prompts: "how did this go? [good/bad/meh/skip]"
  • Pros: simplest, most reliable when present, user controls the semantics.
  • Cons: relies on discipline. Realistic capture rate probably 20–40% without prompts. Even with prompts, users will skip during flow.

(b) Git-state scrape

Record git HEAD, working-tree status, and test-suite state (npm test, pytest, etc. — whatever's there) at session start and end. Later: check whether the session's commits were reverted within N days, whether a PR containing them merged, whether tests still pass on HEAD.

  • Pros: automatic. Captures the real-world "did this stick" signal.
  • Cons: test harness detection is language/project-specific. Revert-within-N-days requires a delayed check (cron or on-demand). Won't capture quality for non-code tasks.
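For (b), the start/end capture itself is cheap; the hard parts listed above (test detection, delayed revert checks) sit on top of something like this sketch. Names are illustrative, and it deliberately records only repo state, never content, to respect the prompt-storage constraint:

```python
import subprocess

def git_snapshot(repo: str = ".") -> dict:
    """Capture HEAD and working-tree dirtiness at session start/end (sketch)."""
    def run(*args: str) -> str:
        return subprocess.run(
            ["git", "-C", repo, *args],
            capture_output=True, text=True, check=True,
        ).stdout.strip()

    return {
        "head": run("rev-parse", "HEAD"),
        # --porcelain is stable, script-safe output; any line means a dirty tree
        "dirty": bool(run("status", "--porcelain")),
    }
```

Diffing the start and end snapshots gives "did this session produce commits"; the revert-within-N-days check would re-read the repo later and look for the recorded `head` in `git log`.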

(c) Benchmark harness

Out of scope for v0.1 but worth considering in the data model. Re-run the same prompt across models with automated scoring. Requires storing prompts — which we explicitly don't do today for privacy.

(d) Do nothing

Report cost and let the user compare quality by hand. Simplest, ships fastest, but forfeits the differentiator.

Hard constraint

Burn does not store prompts or responses — privacy is a deliberate design choice. Whatever signal we pick must not require storing them. This rules out (c) in its naive form; any benchmarking would need explicit opt-in with its own storage story.

Asks of this issue

  1. Pick one (or a hybrid — (a)+(b) is the obvious pairing).
  2. Shape the data model now so TurnRecord / the ledger can carry an outcome without a schema migration later. Candidate: a sidecar outcomes.jsonl indexed by (sessionId, workflowId?, projectKey) with { signal, value, capturedAt, confidence? }.
  3. File implementation sub-issue(s).
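The sidecar candidate in (2) could be as small as an append-only JSONL writer. This is a sketch of that shape only (field names follow the candidate above; nothing here is committed schema), but it shows why the sidecar avoids a migration: TurnRecord and the ledger stay untouched, and new signal types are just new `signal` values:

```python
import json, time
from pathlib import Path

def record_outcome(path, session_id, value, signal="user_tag",
                   workflow_id=None, project_key=None, confidence=None):
    """Append one outcome record to the outcomes.jsonl sidecar (sketch)."""
    rec = {
        "sessionId": session_id,
        "workflowId": workflow_id,    # optional join keys back to the ledger
        "projectKey": project_key,
        "signal": signal,             # e.g. "user_tag" for (a), "git_state" for (b)
        "value": value,               # e.g. "good" / "meh", or "reverted"
        "capturedAt": int(time.time()),
    }
    if confidence is not None:
        rec["confidence"] = confidence  # only automatic signals would set this
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(rec) + "\n")
    return rec
```

Reading side joins on `(sessionId, workflowId?, projectKey)`; absent records simply mean "no quality signal for this session", which the comparison report has to tolerate either way.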

Decision criteria

  • How many real sessions will actually produce an outcome signal? (If (a) alone captures < 20%, it's probably not worth the UX cost.)
  • Does the signal survive being wrong? (Test suites lie; user tags are subjective; revert rate has false positives.)
  • How much does it cost to ignore this for now and bolt on later?

Meta: the cheapest path may be (a) as the MVP with (b) as a later automatic augment — but worth arguing through before committing.
