Skip to content

summary: expose turn outcome (stop_reason) — end_turn / max_tokens / refusal / pause #437

@willwashburn

Description

@willwashburn

Context

Every Anthropic completion carries a stop_reason: end_turn, max_tokens, pause_turn, stop_sequence, tool_use, refusal. Burn ingests this data but doesn't surface it. That means common failure modes are invisible:

  • max_tokens truncations — the model ran out of output budget. Often the cause of "agent stopped mid-edit" symptoms.
  • refusal — content policy stopped the response. Useful to count for prompt-quality audits.
  • pause_turn — interleaved-thinking pause. Useful for distinguishing real completions from in-flight ones.

Prior art: agent-profiler. Source: ui/src/components/conversation/transforms.ts, deriveTurnOutcome. Reads assistant stop_reason events and yields one of end_turn | max_tokens | pause_turn | refusal | silent.

Proposal

  1. Reader: extract stop_reason from the assistant row of each inference and store it on the inference (or turn) record.
  2. Summary: add a one-line outcome breakdown to burn summary:
    Turn outcomes: 142 end_turn, 3 max_tokens, 1 refusal, 0 pause
    
  3. Per-turn: surface in burn hotspots --explain and in any future per-turn detail verb.
  4. SDK: add stop_reason: Option<StopReason> to the inference / turn struct so it's queryable.

Implementation sketch

pub enum StopReason { EndTurn, MaxTokens, PauseTurn, StopSequence, ToolUse, Refusal, Silent }

Silent covers the case where an inference's row was written but no stop_reason is present (mid-write, sidechain, etc.). The current row already lands somewhere via the reader — just preserve the field.

Open questions

  1. Codex equivalent. Codex rollouts have their own outcome semantics; map to the same enum or keep a CodexStopReason parallel? First-cut: extend the enum or use String-typed raw_stop_reason for non-Anthropic harnesses.
  2. Aggregation level. Per-inference or per-turn? Per-inference is more correct (multi-inference turns can have different outcomes). Roll up to per-turn as max-severity for summary.
  3. Refusal context. Just count, or also surface the user prompt that triggered it? Probably out of scope here — counting is the cheap win.

Acceptance

  • stop_reason parsed and stored at ingest for Claude rows.
  • burn summary shows outcome counts.
  • SDK exposes stop_reason on inference/turn records.
  • Fixture with a max_tokens turn surfaces it in summary output.
  • Codex behavior documented (mapped or raw-passthrough).

References

  • agent-profiler: ui/src/components/conversation/transforms.ts deriveTurnOutcome.
  • Anthropic docs: stop_reason values.
  • Related: span-tree foundation (stop_reason is a natural attribute on the Turn root span).

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions