feat(etl): Wave 2B — 5 mart builders (daily, session, project, provider_day, model_day)#74

Merged
0bserver07 merged 1 commit into main from feat/etl-marts on May 5, 2026
Conversation

@0bserver07
Owner

Summary

Wave 2B of the three-layer ETL refactor (see docs/specs/etl-architecture.md). Ships five MartBuilder subclasses under stackunderflow/etl/marts/ — indexed read-side rollups derived from usage_events:

  • DailyMartBuilder: (day, project_id, provider, model, speed) rollup
  • SessionMartBuilder: one row per session with lifetime aggregates, an is_one_shot flag, and primary_model
  • ProjectMartBuilder: one row per project with lifetime totals
  • ProviderDayMartBuilder: (day, provider) rollup for the by-provider chart
  • ModelDayMartBuilder: (day, model, speed) rollup for compare-across-agents

Each builder is watermarked (mart_watermark.last_event_id per mart, independently tracked) and idempotent. Two refresh patterns:

  • Additive marts (daily, provider_day, model_day): INSERT ... ON CONFLICT DO UPDATE adds incremental events into existing rows. After the additive upsert, COUNT(DISTINCT session_id) columns are recomputed for affected keys via a follow-up UPDATE — DISTINCT counts don't sum across refresh windows without double-counting sessions that span the boundary.
  • Per-entity marts (session, project): INSERT OR REPLACE over a subquery that re-aggregates from scratch for affected session_id / project_id values. New events invalidate prior per-entity aggregates, so the row gets rewritten in full.
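The additive pattern can be sketched against an in-memory SQLite database. This is a minimal illustration, not the code in this PR: the `provider_day_mart` table name, its columns, the trimmed-down `usage_events` schema, and the `refresh_provider_day` helper are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE usage_events (
    id INTEGER PRIMARY KEY, day TEXT, provider TEXT, session_id TEXT, cost REAL
);
-- Hypothetical mart schema; the real columns may differ.
CREATE TABLE provider_day_mart (
    day TEXT, provider TEXT, event_count INTEGER, cost REAL, session_count INTEGER,
    PRIMARY KEY (day, provider)
);
""")


def refresh_provider_day(conn, since_event_id):
    """Fold events with id > since_event_id into the mart; return the new watermark."""
    # Step 1: additive upsert. Counts and costs sum safely across windows.
    conn.execute(
        """
        INSERT INTO provider_day_mart (day, provider, event_count, cost, session_count)
        SELECT day, provider, COUNT(*), SUM(cost), 0
        FROM usage_events
        WHERE id > ?
        GROUP BY day, provider
        ON CONFLICT(day, provider) DO UPDATE SET
            event_count = event_count + excluded.event_count,
            cost = cost + excluded.cost
        """,
        (since_event_id,),
    )
    # Step 2: recompute the DISTINCT count from all events, but only for
    # keys touched by this window.
    conn.execute(
        """
        UPDATE provider_day_mart
        SET session_count = (
            SELECT COUNT(DISTINCT e.session_id)
            FROM usage_events e
            WHERE e.day = provider_day_mart.day
              AND e.provider = provider_day_mart.provider
        )
        WHERE EXISTS (
            SELECT 1 FROM usage_events n
            WHERE n.id > ?
              AND n.day = provider_day_mart.day
              AND n.provider = provider_day_mart.provider
        )
        """,
        (since_event_id,),
    )
    return conn.execute("SELECT COALESCE(MAX(id), 0) FROM usage_events").fetchone()[0]


insert = "INSERT INTO usage_events VALUES (?, ?, ?, ?, ?)"

# Window 1: two sessions.
conn.executemany(insert, [(1, "2026-05-01", "p", "s1", 1.0), (2, "2026-05-01", "p", "s2", 2.0)])
watermark = refresh_provider_day(conn, 0)

# Window 2: session s2 spans the boundary, s3 is new.
conn.executemany(insert, [(3, "2026-05-01", "p", "s2", 4.0), (4, "2026-05-01", "p", "s3", 0.5)])
watermark = refresh_provider_day(conn, watermark)

row = conn.execute(
    "SELECT event_count, cost, session_count FROM provider_day_mart"
).fetchone()
```

Summing per-window distinct counts would report 4 sessions here (2 + 2), because `s2` appears in both windows. Recomputing from `usage_events` for the affected key gives 3.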

rebuild_from_scratch() runs DELETE FROM <mart> and then refresh(conn, since_event_id=0): it empties the mart and backfills it in full, ending in the same state as an incremental run.
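The equivalence between incremental refresh and a full rebuild can be checked with a small sketch. The `model_day_mart` schema below is an assumption and omits the `speed` column for brevity. The free functions `refresh` and `rebuild_from_scratch` only mimic the call shapes named above; they are not the builder methods from this PR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE usage_events (id INTEGER PRIMARY KEY, day TEXT, model TEXT, cost REAL);
-- Hypothetical mart schema for illustration only.
CREATE TABLE model_day_mart (
    day TEXT, model TEXT, event_count INTEGER, cost REAL,
    PRIMARY KEY (day, model)
);
""")


def refresh(conn, since_event_id=0):
    """Additively fold events with id > since_event_id into the mart."""
    conn.execute(
        """
        INSERT INTO model_day_mart (day, model, event_count, cost)
        SELECT day, model, COUNT(*), SUM(cost)
        FROM usage_events
        WHERE id > ?
        GROUP BY day, model
        ON CONFLICT(day, model) DO UPDATE SET
            event_count = event_count + excluded.event_count,
            cost = cost + excluded.cost
        """,
        (since_event_id,),
    )


def rebuild_from_scratch(conn):
    """Empty the mart, then backfill from the first event."""
    conn.execute("DELETE FROM model_day_mart")
    refresh(conn, since_event_id=0)


def snapshot(conn):
    return conn.execute(
        "SELECT day, model, event_count, cost FROM model_day_mart ORDER BY day, model"
    ).fetchall()


insert = "INSERT INTO usage_events VALUES (?, ?, ?, ?)"

# Two incremental windows.
conn.executemany(insert, [(1, "2026-05-01", "m-a", 1.0), (2, "2026-05-01", "m-b", 2.0)])
refresh(conn, since_event_id=0)
conn.executemany(insert, [(3, "2026-05-01", "m-a", 0.5), (4, "2026-05-02", "m-a", 4.0)])
refresh(conn, since_event_id=2)
incremental = snapshot(conn)

# Full rebuild over the same events.
rebuild_from_scratch(conn)
rebuilt = snapshot(conn)
```

The DELETE matters: calling `refresh(conn, since_event_id=0)` on a populated additive mart without clearing it first would add every event a second time.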

The Wave 1 foundation pieces (Normalizer/MartBuilder ABCs, marts/normalize registries, watermark helpers, v006_etl_layer.sql schema migration) are scaffolded in this branch so the marts can be exercised end-to-end before Wave 1 lands. They mirror the spec contract; Wave 1's PR will reconcile any textual overlap when it merges first.
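As a rough picture of how an ABC plus a registry might fit together, here is a hedged sketch. Only the names `MartBuilder`, `ProjectMartBuilder`, `refresh`, and `rebuild_from_scratch` come from this description. The registry dict, the `register_mart` decorator, the `name`/`table` attributes, the return value of `refresh`, and the `project_mart` schema are all hypothetical, and the class body is a stand-in rather than the shipped implementation.

```python
import sqlite3
from abc import ABC, abstractmethod

# Hypothetical registry: mart name -> builder class.
MART_REGISTRY = {}


def register_mart(cls):
    """Class decorator that records a builder under its `name`."""
    MART_REGISTRY[cls.name] = cls
    return cls


class MartBuilder(ABC):
    name: str   # assumed attribute: registry key
    table: str  # assumed attribute: backing table

    @abstractmethod
    def refresh(self, conn, since_event_id=0):
        """Fold events with id > since_event_id into the mart."""

    def rebuild_from_scratch(self, conn):
        # Empty the mart, then backfill from the first event.
        conn.execute(f"DELETE FROM {self.table}")
        return self.refresh(conn, since_event_id=0)


@register_mart
class ProjectMartBuilder(MartBuilder):
    name = "project"
    table = "project_mart"

    def refresh(self, conn, since_event_id=0):
        # Per-entity pattern: rewrite the full row for every project
        # that received at least one new event.
        conn.execute(
            """
            INSERT OR REPLACE INTO project_mart (project_id, event_count, cost)
            SELECT project_id, COUNT(*), SUM(cost)
            FROM usage_events
            WHERE project_id IN (SELECT project_id FROM usage_events WHERE id > ?)
            GROUP BY project_id
            """,
            (since_event_id,),
        )
        # Assumed convention: return the highest event id seen.
        return conn.execute("SELECT COALESCE(MAX(id), 0) FROM usage_events").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE usage_events (id INTEGER PRIMARY KEY, project_id TEXT, cost REAL);
CREATE TABLE project_mart (project_id TEXT PRIMARY KEY, event_count INTEGER, cost REAL);
""")
conn.executemany(
    "INSERT INTO usage_events VALUES (?, ?, ?)",
    [(1, "p1", 1.0), (2, "p2", 2.0), (3, "p1", 0.5)],
)

builder = MART_REGISTRY["project"]()
new_watermark = builder.refresh(conn)
rows = conn.execute(
    "SELECT project_id, event_count, cost FROM project_mart ORDER BY project_id"
).fetchall()
```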

Notes

  • No version bump — still v0.6.1.
  • Migration file is v006_etl_layer.sql (not v004_* as the spec text says): two migrations (v004 synthetic-models cleanup, v005 cursor-workspace redistribute) shipped between the spec being written and this PR.
  • Marts only depend on the usage_events table contents — no provider-specific logic, no import from stackunderflow.etl.normalize.
  • Test count: 1341 → 1374 (+33 new tests, 2 skipped unchanged).

Test plan

  • Per-mart unit tests — tests/stackunderflow/etl/marts/test_<mart>.py (8 daily, 4 provider_day, 4 model_day, 7 session, 5 project)
  • Integration test — 100 synthetic events spanning 3 days × 2 providers × 3 models, full pipeline + cost conservation + watermark contract + two-window-incremental matches one-shot rebuild
  • pytest tests/ -q clean: 1374 passed, 2 skipped
  • ruff check stackunderflow/etl/ clean
  • ruff format stackunderflow/etl/ --check clean
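The cost-conservation part of the integration test can be pictured as follows. This is a sketch in the same spirit, not the test from this PR: the event generator, the placeholder provider and model names, the cost values, and the simplified mart tables are all assumptions.

```python
import itertools
import math
import sqlite3

DAYS = ["2026-05-01", "2026-05-02", "2026-05-03"]
PROVIDERS = ["provider-a", "provider-b"]       # placeholder names
MODELS = ["model-x", "model-y", "model-z"]     # placeholder names


def synthetic_events(n=100):
    """Cycle through every (day, provider, model) combination until n events exist."""
    combos = itertools.cycle(itertools.product(DAYS, PROVIDERS, MODELS))
    return [
        (event_id, day, provider, model, 0.01 * event_id)
        for event_id, (day, provider, model) in zip(range(1, n + 1), combos)
    ]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE usage_events "
    "(id INTEGER PRIMARY KEY, day TEXT, provider TEXT, model TEXT, cost REAL)"
)
events = synthetic_events()
conn.executemany("INSERT INTO usage_events VALUES (?, ?, ?, ?, ?)", events)

# Simplified stand-ins for two of the marts, built in one shot.
conn.execute(
    "CREATE TABLE provider_day_mart AS "
    "SELECT day, provider, SUM(cost) AS cost FROM usage_events GROUP BY day, provider"
)
conn.execute(
    "CREATE TABLE model_day_mart AS "
    "SELECT day, model, SUM(cost) AS cost FROM usage_events GROUP BY day, model"
)


def total_cost(conn, table):
    return conn.execute(f"SELECT SUM(cost) FROM {table}").fetchone()[0]


source_total = total_cost(conn, "usage_events")
# Conservation: each mart must account for exactly the cost in usage_events.
# math.isclose absorbs floating-point differences from summation order.
conserved = all(
    math.isclose(total_cost(conn, table), source_total)
    for table in ("provider_day_mart", "model_day_mart")
)
```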

…er_day, model_day)

Wave 2B of the three-layer ETL pipeline (see docs/specs/etl-architecture.md).
Five MartBuilder subclasses under stackunderflow/etl/marts/, each watermarked
and idempotent.
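Per-mart watermarking could look roughly like the sketch below. The `mart_watermark` table and its `last_event_id` column are named in the PR description. The `mart_name` key column and the three helper functions are assumptions for illustration and may not match the real watermark helpers.

```python
import sqlite3


def ensure_watermark_table(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mart_watermark ("
        " mart_name TEXT PRIMARY KEY,"            # assumed key column
        " last_event_id INTEGER NOT NULL DEFAULT 0)"
    )


def get_watermark(conn, mart_name):
    """Return the last event id folded into this mart, or 0 if never refreshed."""
    row = conn.execute(
        "SELECT last_event_id FROM mart_watermark WHERE mart_name = ?",
        (mart_name,),
    ).fetchone()
    return row[0] if row else 0


def set_watermark(conn, mart_name, last_event_id):
    """Upsert the watermark so each mart advances independently."""
    conn.execute(
        "INSERT INTO mart_watermark (mart_name, last_event_id) VALUES (?, ?)"
        " ON CONFLICT(mart_name) DO UPDATE SET last_event_id = excluded.last_event_id",
        (mart_name, last_event_id),
    )


conn = sqlite3.connect(":memory:")
ensure_watermark_table(conn)
set_watermark(conn, "daily", 42)
set_watermark(conn, "session", 17)
set_watermark(conn, "daily", 50)  # advancing one mart leaves the others alone

daily_wm = get_watermark(conn, "daily")
session_wm = get_watermark(conn, "session")
project_wm = get_watermark(conn, "project")  # never refreshed
```

Defaulting a missing watermark to 0 means the first refresh of a new mart behaves like a full backfill.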

Additive marts (daily, provider_day, model_day) refresh via INSERT ... ON
CONFLICT DO UPDATE so re-runs after partial failure self-heal. Their
COUNT(DISTINCT session_id) columns are recomputed for affected keys after
the additive upsert — DISTINCT counts don't sum across refresh windows
without double-counting sessions that span the boundary.

Per-entity marts (session, project) use INSERT OR REPLACE over a
re-aggregated subquery so a new event for an existing session/project
invalidates the prior aggregate and recomputes from all events.
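A per-entity refresh for sessions might be sketched like this. The `session_mart` schema and the `refresh_sessions` helper are assumptions. In particular, `primary_model` is computed here as the model with the most events in the session, which is a guess at its meaning rather than the definition used in this PR, and `is_one_shot` is left out because its definition is not given here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE usage_events (id INTEGER PRIMARY KEY, session_id TEXT, model TEXT, cost REAL);
-- Hypothetical mart schema for illustration only.
CREATE TABLE session_mart (
    session_id TEXT PRIMARY KEY, event_count INTEGER, cost REAL, primary_model TEXT
);
""")


def refresh_sessions(conn, since_event_id=0):
    """Rewrite the full row for every session that has at least one new event."""
    conn.execute(
        """
        INSERT OR REPLACE INTO session_mart (session_id, event_count, cost, primary_model)
        SELECT
            e.session_id,
            COUNT(*),
            SUM(e.cost),
            -- Assumed definition: most frequent model, ties broken by name.
            (SELECT m.model FROM usage_events m
             WHERE m.session_id = e.session_id
             GROUP BY m.model
             ORDER BY COUNT(*) DESC, m.model
             LIMIT 1)
        FROM usage_events e
        WHERE e.session_id IN (SELECT session_id FROM usage_events WHERE id > ?)
        GROUP BY e.session_id
        """,
        (since_event_id,),
    )


insert = "INSERT INTO usage_events VALUES (?, ?, ?, ?)"

# Window 1: one event per session.
conn.executemany(insert, [(1, "s1", "m-a", 1.0), (2, "s2", "m-b", 2.0)])
refresh_sessions(conn, since_event_id=0)

# Window 2: only s1 gets new events, and its dominant model changes.
conn.executemany(insert, [(3, "s1", "m-b", 0.5), (4, "s1", "m-b", 0.25)])
refresh_sessions(conn, since_event_id=2)

rows = conn.execute(
    "SELECT session_id, event_count, cost, primary_model FROM session_mart ORDER BY session_id"
).fetchall()
```

The filter selects affected sessions by new event id, but the aggregate runs over all of their events, so a value like `primary_model` that cannot be updated additively still comes out right.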

rebuild_from_scratch() drops + repopulates each mart from scratch for
the --rebuild path. Same final state as a clean incremental run.

Foundation pieces (Normalizer/MartBuilder ABCs, marts/normalize registries,
watermark helpers, v006_etl_layer.sql migration) are scaffolded here so
the marts can be exercised end-to-end before Wave 1 lands. They mirror
the spec contract exactly; Wave 1's PR will reconcile any textual
overlap when it merges.

33 new tests in tests/stackunderflow/etl/marts/ (8 daily, 4 provider_day,
4 model_day, 7 session, 5 project, 5 integration). Total suite:
1341 → 1374 passing.

No version bump (still v0.6.1).
@0bserver07 0bserver07 merged commit cc4ce29 into main May 5, 2026
7 of 9 checks passed
@0bserver07 0bserver07 deleted the feat/etl-marts branch May 5, 2026 04:36
0bserver07 added a commit that referenced this pull request May 6, 2026
Bumps to 0.7.0. Consolidates the [Unreleased] CHANGELOG entries from
the 11 ETL PRs (#72, #73, #74, #75, #76, #79, #81, #80, #78, #77, #82)
into a single [0.7.0] section.

New: docs/HANDOFF.md — state-of-the-codebase walkthrough for incoming
agents. Architecture map, recent history, key gotchas, what's left,
files-to-read-first.

End-state on the maintainer's real store:
  150,337 usage_events
  Marts populated and watermarks in sync
  Dashboard cold-load 2.5s → <50ms warm
  Watcher 155ms end-to-end source-file-write → dashboard-data-fresh

1598 backend tests passing, 2 skipped, 11 deselected (slow suite).
Frontend typecheck + build clean.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
0bserver07 added a commit that referenced this pull request May 20, 2026
…er_day, model_day) (#74)

0bserver07 added a commit that referenced this pull request May 20, 2026