
Design: outcome / quality signal for 'same output, less spend' comparisons #6

@willwashburn

Description

This is a design issue, not an implementation ticket.

Before building, we need a decision.

Problem

Burn's meta-goal is answering "would the same work have cost less with a different model / harness / tool choice?" Spend data alone can't answer this — we also need a notion of whether the cheaper alternative produced an equivalent result. Otherwise "Haiku would have cost 5× less" is useless if Haiku would have produced broken code.

This is the moat versus every other usage tracker (TokenTracker, ccusage, Cursor stats). None of them have a quality axis. Whichever signal we pick, the data model must accommodate it from the start so it doesn't become a v2 retrofit.

Candidate signals

(a) Explicit outcome tag at session end

burn tag --outcome good|bad|meh [--note "…"]
burn claude --tag workflow=refactor -- <claude args>
  # on exit, prompts: "how did this go? [good/bad/meh/skip]"
  • Pros: simplest, most reliable when present, user controls the semantics.
  • Cons: relies on discipline. Realistic capture rate probably 20–40% without prompts. Even with prompts, users will skip during flow.

(b) Git-state scrape

Record git HEAD, working-tree status, and test-suite state (npm test, pytest, etc. — whatever's there) at session start and end. Later: check whether the session's commits were reverted within N days, whether a PR containing them merged, whether tests still pass on HEAD.

  • Pros: automatic. Captures the real-world "did this stick" signal.
  • Cons: test harness detection is language/project-specific. Revert-within-N-days requires a delayed check (cron or on-demand). Won't capture quality for non-code tasks.
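For (b), the start/end capture itself is cheap; the hard parts listed above (test detection, delayed revert checks) sit on top of something like this sketch. Names are illustrative, and it deliberately records only repo state, never content, to respect the prompt-storage constraint:

```python
import subprocess

def git_snapshot(repo: str = ".") -> dict:
    """Capture HEAD and working-tree dirtiness at session start/end (sketch)."""
    def run(*args: str) -> str:
        return subprocess.run(
            ["git", "-C", repo, *args],
            capture_output=True, text=True, check=True,
        ).stdout.strip()

    return {
        "head": run("rev-parse", "HEAD"),
        # --porcelain is stable, script-safe output; any line means a dirty tree
        "dirty": bool(run("status", "--porcelain")),
    }
```

Diffing the start and end snapshots gives "did this session produce commits"; the revert-within-N-days check would re-read the repo later and look for the recorded `head` in `git log`.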

(c) Benchmark harness

Out of scope for v0.1 but worth considering in the data model. Re-run the same prompt across models with automated scoring. Requires storing prompts — which we explicitly don't do today for privacy.

(d) Do nothing

Report cost and let the user compare quality by hand. Simplest, ships fastest, but forfeits the differentiator.

Hard constraint

Burn does not store prompts or responses — privacy is a deliberate design choice. Whatever signal we pick must not require storing them. This rules out (c) in its naive form; any benchmarking would need explicit opt-in with its own storage story.

Asks of this issue

  1. Pick one (or a hybrid — (a)+(b) is the obvious pairing).
  2. Shape the data model now so TurnRecord / the ledger can carry an outcome without a schema migration later. Candidate: a sidecar outcomes.jsonl indexed by (sessionId, workflowId?, projectKey) with { signal, value, capturedAt, confidence? }.
  3. File implementation sub-issue(s).
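The sidecar candidate in (2) could be as small as an append-only JSONL writer. This is a sketch of that shape only (field names follow the candidate above; nothing here is committed schema), but it shows why the sidecar avoids a migration: TurnRecord and the ledger stay untouched, and new signal types are just new `signal` values:

```python
import json, time
from pathlib import Path

def record_outcome(path, session_id, value, signal="user_tag",
                   workflow_id=None, project_key=None, confidence=None):
    """Append one outcome record to the outcomes.jsonl sidecar (sketch)."""
    rec = {
        "sessionId": session_id,
        "workflowId": workflow_id,    # optional join keys back to the ledger
        "projectKey": project_key,
        "signal": signal,             # e.g. "user_tag" for (a), "git_state" for (b)
        "value": value,               # e.g. "good" / "meh", or "reverted"
        "capturedAt": int(time.time()),
    }
    if confidence is not None:
        rec["confidence"] = confidence  # only automatic signals would set this
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(rec) + "\n")
    return rec
```

Reading side joins on `(sessionId, workflowId?, projectKey)`; absent records simply mean "no quality signal for this session", which the comparison report has to tolerate either way.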

Decision criteria

  • How many real sessions will actually produce an outcome signal? (If (a) alone captures < 20%, it's probably not worth the UX cost.)
  • Does the signal survive being wrong? (Test suites lie; user tags are subjective; revert rate has false positives.)
  • How much does it cost to ignore this for now and bolt on later?

Meta: the cheapest path may be (a) as the MVP with (b) as a later automatic augment — but worth arguing through before committing.
