Derived analytics archive: materialize the ledger into a local queryable store #40

@willwashburn


Context

Burn's canonical data path today is the append-only JSONL ledger plus ad-hoc query-time folding. That is the right write path, but it is already the wrong read path for the backlog that now exists.

Agentsview's main architectural lesson is not its UI. It is that a local derived archive unlocks everything else: fast queries, richer joins, durable computed signals, and eventually search.

This issue is for that derived archive.

Non-goal

Do not replace ledger.jsonl as the source of truth.

The JSONL ledger should remain the canonical append-only event spine. The archive is a materialized read model built from it.

Why #4 is not enough

#4 covers incremental cursors, dedup, and canonical project keys. That improves ingest, but it does not solve the core read-path problem:

  • every non-trivial query still has to scan the ledger
  • stamps still have to be folded repeatedly
  • cross-turn and cross-session joins still happen in memory
  • future features will keep re-implementing the same grouping logic

This issue is the missing second half: a stable, local analytics store.

Proposed shape

Add a derived SQLite database at:

  • ~/.relayburn/archive.sqlite

Built incrementally from the ledger and rebuildable from scratch.

Canonical principle

  • ledger.jsonl is the authoritative event log
  • archive.sqlite is disposable / rebuildable
  • if the archive is missing or corrupted, burn archive rebuild recreates it from the ledger

Schema (MVP)

The exact table names can change, but the archive needs at least these logical entities.

1. sessions

One row per (source, session_id).

Suggested fields:

  • source
  • session_id
  • project
  • project_key
  • workflow_id
  • harness
  • started_at
  • ended_at
  • turn_count
  • message_count
  • model_set_json
  • relationship_type (root / continuation / fork / subagent once that exists)
  • parent_session_id
  • has_content
  • fidelity flags from the coverage issue
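As a sketch only, the sessions table could look like the following. Table and column names follow the suggested fields above but are not final; `fidelity_json` is a placeholder for however the coverage issue ends up encoding its flags.

```python
import sqlite3

# Hypothetical DDL for the sessions table; names are illustrative.
SESSIONS_DDL = """
CREATE TABLE IF NOT EXISTS sessions (
    source            TEXT NOT NULL,
    session_id        TEXT NOT NULL,
    project           TEXT,
    project_key       TEXT,
    workflow_id       TEXT,
    harness           TEXT,
    started_at        TEXT,
    ended_at          TEXT,
    turn_count        INTEGER NOT NULL DEFAULT 0,
    message_count     INTEGER NOT NULL DEFAULT 0,
    model_set_json    TEXT,
    relationship_type TEXT,   -- root / continuation / fork / subagent
    parent_session_id TEXT,
    has_content       INTEGER NOT NULL DEFAULT 0,
    fidelity_json     TEXT,   -- fidelity flags from the coverage issue
    PRIMARY KEY (source, session_id)
);
"""

con = sqlite3.connect(":memory:")  # real path: ~/.relayburn/archive.sqlite
con.executescript(SESSIONS_DDL)
```

The composite primary key makes re-materializing the same ledger events an idempotent upsert rather than a duplicate row.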

2. turns

One row per normalized turn.

Suggested fields:

  • source
  • session_id
  • message_id
  • turn_index
  • ts
  • model
  • activity
  • stop_reason
  • input_tokens
  • output_tokens
  • reasoning_tokens
  • cache_read_tokens
  • cache_create_5m_tokens
  • cache_create_1h_tokens
  • cost_input_usd
  • cost_output_usd
  • cost_reasoning_usd
  • cost_cache_read_usd
  • cost_cache_create_usd
  • cost_total_usd
  • materialized enrichment columns such as workflow_id, agent_id, persona, tier
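An abridged sketch of the turns table (several of the token and cost columns listed above are elided for brevity; all names are assumptions):

```python
import sqlite3

# Hypothetical, abridged DDL for the turns table.
TURNS_DDL = """
CREATE TABLE IF NOT EXISTS turns (
    source            TEXT NOT NULL,
    session_id        TEXT NOT NULL,
    message_id        TEXT NOT NULL,
    turn_index        INTEGER,
    ts                TEXT,
    model             TEXT,
    activity          TEXT,
    stop_reason       TEXT,
    input_tokens      INTEGER,
    output_tokens     INTEGER,
    reasoning_tokens  INTEGER,
    cache_read_tokens INTEGER,
    cost_total_usd    REAL,
    -- materialized enrichment columns
    workflow_id       TEXT,
    agent_id          TEXT,
    persona           TEXT,
    tier              TEXT,
    PRIMARY KEY (source, session_id, message_id)
);
CREATE INDEX IF NOT EXISTS turns_ts    ON turns (ts);
CREATE INDEX IF NOT EXISTS turns_model ON turns (model);
"""

con = sqlite3.connect(":memory:")
con.executescript(TURNS_DDL)
```

Indexing `ts` and `model` up front is what makes the grouped queries in later sections cheap.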

3. tool_calls

One row per tool call attached to a turn.

Suggested fields:

  • source
  • session_id
  • message_id
  • call_index
  • tool_use_id
  • tool_name
  • target
  • args_hash
  • optional normalized category
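One way `args_hash` could be computed, so that logically identical calls hash identically regardless of key order (the canonical-JSON choice here is an assumption, not a decision):

```python
import hashlib
import json

def args_hash(args: dict) -> str:
    """Stable hash of tool-call arguments: sha256 over canonical JSON
    (sorted keys, no insignificant whitespace)."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Hashing rather than storing raw args keeps the archive useful even when no content sidecar exists.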

4. tool_result_events

Do not block archive work on full content storage, but reserve the table now.

Suggested fields:

  • source
  • session_id
  • message_id
  • tool_use_id
  • call_index
  • event_index
  • status
  • content_length
  • content_hash
  • subagent_session_id
  • agent_id
  • event_source (tool_result, subagent_notification, etc.)
  • ts

This table is the bridge to the passive-reader execution-graph issue.

5. archive_state

Track incremental build progress.

Suggested fields:

  • ledger_offset_bytes
  • ledger_mtime_ms
  • archive_version
  • last_compacted_at
  • last_rebuild_at
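A single-row bookkeeping table is enough here. A minimal sketch, with an `id = 1` check constraint enforcing the single row (names as above, all provisional):

```python
import sqlite3

# Sketch of the archive_state table plus a loader that initializes it.
STATE_DDL = """
CREATE TABLE IF NOT EXISTS archive_state (
    id                  INTEGER PRIMARY KEY CHECK (id = 1),
    ledger_offset_bytes INTEGER NOT NULL DEFAULT 0,
    ledger_mtime_ms     INTEGER NOT NULL DEFAULT 0,
    archive_version     INTEGER NOT NULL,
    last_compacted_at   TEXT,
    last_rebuild_at     TEXT
);
"""

def load_state(con: sqlite3.Connection, version: int = 1) -> sqlite3.Row:
    """Create archive_state if needed and return the single state row."""
    con.executescript(STATE_DDL)
    con.execute(
        "INSERT OR IGNORE INTO archive_state (id, archive_version) VALUES (1, ?)",
        (version,),
    )
    con.row_factory = sqlite3.Row
    return con.execute("SELECT * FROM archive_state WHERE id = 1").fetchone()
```

`archive_version` is also what lets a future schema bump trigger an automatic rebuild instead of a migration.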

Query model

Commands should read from the archive by default once it exists.

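For example, burn summary becomes a grouped SQL query instead of a full ledger scan. A sketch against the assumed turns schema (names illustrative):

```python
import sqlite3

# Illustrative read path: per-model totals straight from the archive.
con = sqlite3.connect(":memory:")  # real path: ~/.relayburn/archive.sqlite
con.execute("CREATE TABLE turns (model TEXT, output_tokens INTEGER, cost_total_usd REAL)")
con.executemany(
    "INSERT INTO turns VALUES (?, ?, ?)",
    [("opus", 1200, 0.09), ("sonnet", 800, 0.01), ("opus", 400, 0.03)],
)
rows = con.execute(
    """
    SELECT model, SUM(output_tokens), ROUND(SUM(cost_total_usd), 2)
    FROM turns
    GROUP BY model
    ORDER BY 3 DESC
    """
).fetchall()
```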

Build / maintenance commands

Add a small archive command group.

burn archive build
burn archive rebuild
burn archive status
burn archive vacuum

Behavior:

  • build applies any ledger tail not yet materialized
  • rebuild drops and recreates the archive from the ledger
  • status reports schema version, row counts, last sync point, and file sizes
  • vacuum runs SQLite maintenance if needed
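A sketch of what `burn archive status` might gather; it assumes the archive_state table exists, and treats tables that have not been materialized yet as `None` rather than zero:

```python
import os
import sqlite3

def archive_status(con: sqlite3.Connection, db_path: str) -> dict:
    """Collect schema version, row counts, sync point, and file size."""
    counts = {}
    for t in ("sessions", "turns", "tool_calls", "tool_result_events"):
        exists = con.execute(
            "SELECT 1 FROM sqlite_master WHERE type='table' AND name=?", (t,)
        ).fetchone()
        counts[t] = (
            con.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0] if exists else None
        )
    version, offset = con.execute(
        "SELECT archive_version, ledger_offset_bytes FROM archive_state"
    ).fetchone()
    return {
        "schema_version": version,
        "ledger_offset_bytes": offset,
        "row_counts": counts,
        "file_size_bytes": os.path.getsize(db_path) if os.path.exists(db_path) else 0,
    }
```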

Incremental update strategy

Prefer append-friendly materialization keyed off the ledger position.

High-level flow:

  1. Read archive state
  2. Seek the ledger from the last processed byte offset
  3. Parse only new lines
  4. Upsert session / turn / tool rows
  5. Advance archive state only after transaction success

The archive should be safe to rebuild from zero at any time.

Relationship to #33

#33 proposes a content sidecar. That is compatible with this issue.

Two acceptable architectures:

  1. Content stays in sidecar JSONL, and the archive stores only content hashes / lengths / foreign keys
  2. Content is optionally copied into archive tables when content.store=full

Either is fine. The important thing is that the archive does not require raw content to be useful.

Relationship to coverage / fidelity

The archive should not blur the distinctions between:

  • zero tokens
  • missing tokens
  • aggregate-only sources
  • cost-only sources
  • fully attributed per-turn sources

This issue should consume the fidelity metadata from the dedicated coverage issue rather than inventing its own ad hoc null semantics.

Acceptance

  • ledger.jsonl remains canonical; deleting archive.sqlite and running burn archive rebuild fully recreates it.
  • burn summary can execute against the archive without scanning the entire ledger.
  • burn compare (#38: model comparison by observed activity category) and burn plans (#39: plan-based monthly quota tracking, complementing #5 block forecasting) can be implemented as SQL-style grouped queries rather than ad hoc in-memory scans.
  • Materialized enrichment columns mean commands no longer have to fold all stamps on every query.
  • Archive rebuild is deterministic: rebuilding twice from the same ledger yields the same row counts and primary keys.
  • The archive can coexist with no content sidecar, hash-only sidecar, or full sidecar.
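The determinism criterion is directly checkable. A minimal sketch, where the two connections stand in for archives produced by two runs of `burn archive rebuild` over the same ledger:

```python
import sqlite3

def same_archive(db_a: sqlite3.Connection, db_b: sqlite3.Connection) -> bool:
    """Compare two rebuilds on turn primary keys (extend per table as needed)."""
    q = "SELECT source, session_id, message_id FROM turns ORDER BY 1, 2, 3"
    return db_a.execute(q).fetchall() == db_b.execute(q).fetchall()
```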

Depends on

  • #4 for stable ingest identity and canonical project keys
  • coverage/fidelity issue to define how missing / partial data is represented
  • execution-graph issue to define richer relationship and tool-result event rows

Unblocks

  • #26 MCP self-query without full-ledger scans
  • #38 compare as a real query, not a giant in-memory reducer
  • #39 monthly plan tracking with cheap rolling aggregates
  • future search / analytics without committing to a giant application surface

Priority

High. This is the architectural gap between burn as an event collector and burn as a usable local analytics system.
