feat(etl): Wave 4B — backfill streams messages → usage_events; ingest writer auto-normalizes#80

Merged: 0bserver07 merged 1 commit into main from feat/etl-backfill-real on May 6, 2026


@0bserver07
Owner

Summary

Wave 4B fills in the body of the skeleton whose shape Wave 1 locked. stackunderflow etl backfill now populates usage_events against real data, and the ingest writer gains a per-record normalize+insert hook so newly ingested messages create events automatically, without a backfill pass.

  • Backfill orchestrator (stackunderflow/etl/backfill.py) streams the messages → sessions → projects join in 5K-row chunks, one transaction per chunk so a partial failure leaves a recoverable state. Each row dispatches to its provider's registered normalizer (Wave 2A); events insert via INSERT OR IGNORE against uniq_events_msg, so re-runs are idempotent. force=True wipes events + watermarks and rebuilds every mart via MartBuilder.rebuild_from_scratch before the normalize pass starts fresh.
  • Ingest writer hook (stackunderflow/ingest/writer.py) runs the matching normalizer over the messages a per-file ingest just inserted, in the same transaction. The shared normalize_and_insert_event() helper is now the single source of truth for the events-row INSERT SQL — both the writer hook and the backfill orchestrator route through it. Marts auto-refresh via refresh_all_marts() after the per-file commit.
  • CLI polish (stackunderflow/cli.py) — etl backfill body gains a tqdm progress bar (falls back to periodic log lines when tqdm is absent), improved final summary, and clearer help text now that the command is no longer a no-op.
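
The tqdm fallback in the CLI bullet could look roughly like the sketch below. It is an illustration under assumptions, not the code in stackunderflow/cli.py: the `progress` helper, its parameters, and the log message are all hypothetical.

```python
import logging
from typing import Iterable, Iterator, TypeVar

try:  # tqdm is treated as an optional dependency
    from tqdm import tqdm
except ImportError:
    tqdm = None

T = TypeVar("T")
log = logging.getLogger("backfill-sketch")  # hypothetical logger name


def progress(items: Iterable[T], total: int, log_every: int = 10,
             force_plain: bool = False) -> Iterator[T]:
    """Yield items unchanged; show a bar if tqdm is importable, else log."""
    if tqdm is not None and not force_plain:
        yield from tqdm(items, total=total, unit="chunk")
        return
    for done, item in enumerate(items, start=1):
        yield item
        # Periodic log line once the consumer has finished with the item.
        if done % log_every == 0 or done == total:
            log.info("backfill progress: %d/%d chunks", done, total)
```

Wrapping the chunk iterator this way keeps the backfill loop itself unaware of which progress style is active.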

Smoke test (user's real store)

Real store: 247,278 messages, 0 events before this PR.

| Run | Events Inserted | Skipped Dup | Duration |
|---|---|---|---|
| First | 150,337 | 0 | 225.78s |
| Re-run | 0 | 150,337 | 29.25s |

Marts after first run: 940 daily, 841 session, 151 project, 146 provider_day, 184 model_day. Total cost: $64,974.36 — exactly equal to SUM(daily_mart.cost_usd). Watermarks for all 5 marts at 150,337.
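
The cost reconciliation above (events total equal to SUM(daily_mart.cost_usd)) can be reproduced on a toy store. This is a minimal sketch assuming a simplified schema: only the table names usage_events and daily_mart and the cost_usd column come from this PR; the other columns and both helper functions are hypothetical.

```python
import math
import sqlite3


def make_store() -> sqlite3.Connection:
    """In-memory stand-in; the column layout is assumed, not the real schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE usage_events (id INTEGER PRIMARY KEY, ts TEXT, cost_usd REAL);
        CREATE TABLE daily_mart (day TEXT PRIMARY KEY, cost_usd REAL, events INTEGER);
        """
    )
    return conn


def rebuild_daily_mart(conn: sqlite3.Connection) -> None:
    """Recompute the per-day rollup from the events table in one transaction."""
    with conn:
        conn.execute("DELETE FROM daily_mart")
        conn.execute(
            "INSERT INTO daily_mart (day, cost_usd, events) "
            "SELECT date(ts), SUM(cost_usd), COUNT(*) "
            "FROM usage_events GROUP BY date(ts)"
        )


def totals_match(conn: sqlite3.Connection) -> bool:
    """Compare the event-level cost total with the mart-level total."""
    (events_total,) = conn.execute(
        "SELECT COALESCE(SUM(cost_usd), 0) FROM usage_events"
    ).fetchone()
    (mart_total,) = conn.execute(
        "SELECT COALESCE(SUM(cost_usd), 0) FROM daily_mart"
    ).fetchone()
    # Float sums can differ in the last bits depending on grouping order.
    return math.isclose(events_total, mart_total, abs_tol=1e-6)
```

A mismatch after a refresh would point at events the mart has not absorbed yet, or at rows counted twice.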

Test plan

  • pytest tests/ -q clean — 1487 passed, 2 skipped (was 1474; +13 new tests across test_backfill_real.py and test_writer_normalizes.py)
  • Existing test_backfill.py (Wave 1 shape contract) still passes — orchestrator surface unchanged
  • Existing test_writer.py (5 tests) still passes — message-insert path unchanged
  • End-to-end smoke test against ~/.stackunderflow/store.db: events_inserted ≈ 150K out of 247K messages (assistant rows only); re-run is fully idempotent
  • Wave 3A hot-path routes (/api/projects, /api/dashboard-data, /api/cost-data) now read from populated marts instead of falling through to aggregator
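
The last item describes a mart-first read with the aggregator as fallback. How the real routes decide to fall through is not stated in this PR, so the sketch below simply assumes "use the mart if it has rows". The `project_mart` table layout and the `project_costs` function are hypothetical.

```python
import sqlite3
from typing import List, Tuple


def make_store() -> sqlite3.Connection:
    """In-memory stand-in with an assumed, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE messages (
            id INTEGER PRIMARY KEY, project TEXT, role TEXT, cost_usd REAL
        );
        CREATE TABLE project_mart (project TEXT PRIMARY KEY, cost_usd REAL);
        """
    )
    return conn


def project_costs(conn: sqlite3.Connection) -> Tuple[str, List[tuple]]:
    """Return (source, rows): the precomputed mart when populated,
    otherwise a full aggregation over the raw messages."""
    rows = conn.execute(
        "SELECT project, cost_usd FROM project_mart ORDER BY project"
    ).fetchall()
    if rows:
        return "mart", rows
    rows = conn.execute(
        "SELECT project, SUM(cost_usd) FROM messages "
        "WHERE role = 'assistant' GROUP BY project ORDER BY project"
    ).fetchall()
    return "aggregator", rows
```

With an empty mart every request pays for the GROUP BY over all messages, which is the situation described before this PR.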

🤖 Generated with Claude Code

… writer auto-normalizes

Wave 1 shipped the orchestrator skeleton; Waves 2A/2B/2C registered the
normalizers, marts, and watcher. Until now `stackunderflow etl backfill`
remained a no-op against real data: 247K messages in the user's store,
0 rows in `usage_events`, and every Wave-3A mart read fell through to
the aggregator path.

This wave wires the body:

* `stackunderflow/etl/backfill.py` streams the `messages → sessions →
  projects` join in 5K-row chunks (one transaction per chunk so a
  partial failure leaves a recoverable state). Each row dispatches to
  its provider's registered normalizer; events insert via INSERT OR
  IGNORE against `uniq_events_msg`, so re-runs are idempotent. After
  events are written, `refresh_all_marts()` advances every mart's
  watermark. `force=True` wipes events + watermarks and rebuilds every
  mart via `MartBuilder.rebuild_from_scratch` before the normalize
  pass starts fresh.

* `stackunderflow/ingest/writer.py` adds a per-record normalize+insert
  hook so newly-ingested messages auto-create events in the same
  transaction that wrote the messages rows. The shared
  `normalize_and_insert_event()` helper is the single source of truth
  for the events-row INSERT SQL — both the writer hook and the
  backfill orchestrator route through it. After the per-file commit,
  the writer calls `refresh_all_marts()` so marts stay current
  without needing the watcher cycle.

* `stackunderflow/cli.py`'s `etl backfill` body gains a `tqdm`
  progress bar (falls back to periodic log lines when tqdm is absent),
  improved final summary, and clearer help text now that the command
  is no longer a no-op.
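
The writer hook in the second bullet can be sketched as follows. This is not the code in `stackunderflow/ingest/writer.py`: the schema columns, the `NORMALIZERS` registry, and the `insert_event` / `ingest_records` functions are hypothetical stand-ins, and the real signature of `normalize_and_insert_event()` is not shown in this PR. Only the index name `uniq_events_msg` and the described behaviours (same transaction, unregistered provider is a no-op, a failing normalizer does not fail the ingest) come from the text.

```python
import sqlite3
from typing import Callable, Dict, Iterable, Optional

Normalizer = Callable[[dict], Optional[dict]]
NORMALIZERS: Dict[str, Normalizer] = {}  # hypothetical provider registry


def make_store() -> sqlite3.Connection:
    """In-memory stand-in with an assumed, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE messages (id INTEGER PRIMARY KEY, role TEXT, cost_usd REAL);
        CREATE TABLE usage_events (
            id INTEGER PRIMARY KEY, message_id INTEGER NOT NULL, cost_usd REAL
        );
        CREATE UNIQUE INDEX uniq_events_msg ON usage_events (message_id);
        """
    )
    return conn


def assistant_only(record: dict) -> Optional[dict]:
    """Toy normalizer: only assistant rows produce an event."""
    if record["role"] != "assistant":
        return None
    return {"cost_usd": record["cost_usd"]}


NORMALIZERS["demo"] = assistant_only


def insert_event(conn, provider: str, message_id: int, record: dict) -> bool:
    """Single place that owns the events INSERT; True if a row was created."""
    normalizer = NORMALIZERS.get(provider)
    if normalizer is None:
        return False  # unregistered provider: no-op
    try:
        event = normalizer(record)
    except Exception:
        return False  # a bad record must not fail the whole ingest
    if event is None:
        return False
    cur = conn.execute(
        "INSERT OR IGNORE INTO usage_events (message_id, cost_usd) VALUES (?, ?)",
        (message_id, event["cost_usd"]),
    )
    return cur.rowcount == 1


def ingest_records(conn, provider: str, records: Iterable[dict]) -> int:
    """Insert messages and their events in one transaction; return events created."""
    created = 0
    with conn:  # commits on success, rolls back if a message insert raises
        for record in records:
            cur = conn.execute(
                "INSERT INTO messages (role, cost_usd) VALUES (?, ?)",
                (record["role"], record.get("cost_usd")),
            )
            if insert_event(conn, provider, cur.lastrowid, record):
                created += 1
    return created
```

Routing both the writer and the backfill through one insert function keeps the SQL in a single place, which is the design choice the bullet describes.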

Smoke test on the user's real store (247,278 messages):
  - first run: 150,337 events inserted in 225s; 940 daily, 841
    session, 151 project, 146 provider_day, 184 model_day rows;
    total cost $64,974.36 (matches daily_mart total exactly)
  - re-run: 0 inserted, 150,337 skipped duplicate, 29s — fully
    idempotent
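
The first-run / re-run split above follows from chunked inserts with INSERT OR IGNORE against a unique index. The sketch below shows that mechanism on a toy store. It assumes a simplified schema and an assistant-only filter inlined where the real code dispatches to a provider normalizer; `make_store` and `backfill` are hypothetical and only the index name `uniq_events_msg` and the 5K chunk size come from this commit message.

```python
import sqlite3
from typing import Tuple

CHUNK_SIZE = 5_000  # chunk size stated in the commit message


def make_store() -> sqlite3.Connection:
    """In-memory stand-in with an assumed, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE messages (id INTEGER PRIMARY KEY, role TEXT, cost_usd REAL);
        CREATE TABLE usage_events (
            id INTEGER PRIMARY KEY, message_id INTEGER NOT NULL, cost_usd REAL
        );
        CREATE UNIQUE INDEX uniq_events_msg ON usage_events (message_id);
        """
    )
    return conn


def backfill(conn: sqlite3.Connection, chunk_size: int = CHUNK_SIZE) -> Tuple[int, int]:
    """Walk messages in id order; return (inserted, skipped_duplicate)."""
    inserted = skipped = 0
    last_id = 0
    while True:
        # Keyset pagination: fetch the next chunk fully before writing.
        rows = conn.execute(
            "SELECT id, role, cost_usd FROM messages "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            break
        with conn:  # one transaction per chunk
            for msg_id, role, cost in rows:
                if role != "assistant":
                    continue  # stand-in for the normalizer's filtering
                cur = conn.execute(
                    "INSERT OR IGNORE INTO usage_events (message_id, cost_usd) "
                    "VALUES (?, ?)",
                    (msg_id, cost),
                )
                # rowcount is 0 when the unique index suppressed the insert.
                if cur.rowcount == 1:
                    inserted += 1
                else:
                    skipped += 1
        last_id = rows[-1][0]
    return inserted, skipped
```

Because each chunk commits on its own, a crash partway through leaves the earlier chunks in place, and the next run reports them as skipped duplicates instead of inserting them again.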

Test count: 1474 → 1487 (+13). 7 new in
`tests/stackunderflow/etl/test_backfill_real.py` (events per
provider, idempotent re-run, force rebuild end-state, mart watermark
advance, all-five-marts populate, daily-mart cost matches events,
partial-state recovery), 6 new in
`tests/stackunderflow/ingest/test_writer_normalizes.py` (5-record
ingest yields 5 events, daily mart populates, idempotent re-ingest,
unregistered-provider no-op, role-filter, normalize-failure-doesn't-
fail-ingest).
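
Two of the new tests cover watermark advance and the force-rebuild end state. A watermark-driven refresh can be sketched as below, assuming the watermark is the highest event id already folded into the mart. The `mart_watermarks` table, the column layout, and both functions are hypothetical; the sketch covers only the mart half of `force=True` and does not wipe events.

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    """In-memory stand-in with an assumed, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE usage_events (id INTEGER PRIMARY KEY, ts TEXT, cost_usd REAL);
        CREATE TABLE daily_mart (day TEXT PRIMARY KEY, cost_usd REAL NOT NULL);
        CREATE TABLE mart_watermarks (
            mart TEXT PRIMARY KEY, last_event_id INTEGER NOT NULL
        );
        INSERT INTO mart_watermarks VALUES ('daily', 0);
        """
    )
    return conn


def refresh_daily_mart(conn: sqlite3.Connection) -> int:
    """Fold events newer than the watermark into the mart; return new watermark."""
    with conn:
        (watermark,) = conn.execute(
            "SELECT last_event_id FROM mart_watermarks WHERE mart = 'daily'"
        ).fetchone()
        rows = conn.execute(
            "SELECT date(ts), SUM(cost_usd), MAX(id) FROM usage_events "
            "WHERE id > ? GROUP BY date(ts)",
            (watermark,),
        ).fetchall()
        for day, cost, max_id in rows:
            # Upsert: add the new events' cost to any existing row for the day.
            conn.execute(
                "INSERT INTO daily_mart (day, cost_usd) VALUES (?, ?) "
                "ON CONFLICT(day) DO UPDATE "
                "SET cost_usd = daily_mart.cost_usd + excluded.cost_usd",
                (day, cost),
            )
            watermark = max(watermark, max_id)
        conn.execute(
            "UPDATE mart_watermarks SET last_event_id = ? WHERE mart = 'daily'",
            (watermark,),
        )
    return watermark


def rebuild_daily_mart_from_scratch(conn: sqlite3.Connection) -> int:
    """Wipe the mart, reset its watermark, then replay every event."""
    with conn:
        conn.execute("DELETE FROM daily_mart")
        conn.execute(
            "UPDATE mart_watermarks SET last_event_id = 0 WHERE mart = 'daily'"
        )
    return refresh_daily_mart(conn)
```

Calling the refresh twice in a row changes nothing the second time, since no event id exceeds the stored watermark.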

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
0bserver07 force-pushed the feat/etl-backfill-real branch from 3b1e9ed to 9aee3ae on May 6, 2026, 21:11
0bserver07 merged commit 9598ba7 into main on May 6, 2026
9 checks passed
0bserver07 deleted the feat/etl-backfill-real branch on May 6, 2026, 21:13
0bserver07 added a commit that referenced this pull request May 6, 2026
Bumps to 0.7.0. Consolidates the [Unreleased] CHANGELOG entries from
the 11 ETL PRs (#72, #73, #74, #75, #76, #79, #81, #80, #78, #77, #82)
into a single [0.7.0] section.

New: docs/HANDOFF.md — state-of-the-codebase walkthrough for incoming
agents. Architecture map, recent history, key gotchas, what's left,
files-to-read-first.

End-state on the maintainer's real store:
  150,337 usage_events
  Marts populated and watermarks in sync
  Dashboard cold-load 2.5s → <50ms warm
  Watcher 155ms end-to-end source-file-write → dashboard-data-fresh

1598 backend tests passing, 2 skipped, 11 deselected (slow suite).
Frontend typecheck + build clean.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
0bserver07 added a commit that referenced this pull request May 20, 2026
… writer auto-normalizes (#80)

0bserver07 added a commit that referenced this pull request May 20, 2026
Bumps to 0.7.0. Consolidates the [Unreleased] CHANGELOG entries from
the 11 ETL PRs (#72, #73, #74, #75, #76, #79, #81, #80, #78, #77, #82)
into a single [0.7.0] section.
