feat(etl): Wave 4B — backfill streams messages → usage_events; ingest writer auto-normalizes #80
Merged
… writer auto-normalizes
Wave 1 shipped the orchestrator skeleton; Waves 2A/2B/2C registered the
normalizers, marts, and watcher. Until now `stackunderflow etl backfill`
remained a no-op against real data: 247K messages in the user's store,
0 rows in `usage_events`, every Wave-3A mart-read fell through to the
aggregator path.
This wave wires the body:
* `stackunderflow/etl/backfill.py` streams the `messages → sessions →
projects` join in 5K-row chunks (one transaction per chunk so a
partial failure leaves a recoverable state). Each row dispatches to
its provider's registered normalizer; events insert via INSERT OR
IGNORE against `uniq_events_msg`, so re-runs are idempotent. After
events are written, `refresh_all_marts()` advances every mart's
watermark. `force=True` wipes events + watermarks and rebuilds every
mart via `MartBuilder.rebuild_from_scratch` before the normalize
pass starts fresh.
* `stackunderflow/ingest/writer.py` adds a per-record normalize+insert
hook so newly-ingested messages auto-create events in the same
transaction that wrote the messages rows. The shared
`normalize_and_insert_event()` helper is the single source of truth
for the events-row INSERT SQL — both the writer hook and the
backfill orchestrator route through it. After the per-file commit,
the writer calls `refresh_all_marts()` so marts stay current
without needing the watcher cycle.
* `stackunderflow/cli.py`'s `etl backfill` body gains a `tqdm`
progress bar (falls back to periodic log lines when tqdm is absent),
improved final summary, and clearer help text now that the command
is no longer a no-op.
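The chunked, idempotent pass described in the first bullet can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the code in `backfill.py`: the column layout, the `Normalizer` signature, the keyset paging on `messages.id`, and the `backfill()` function itself are hypothetical. Only the table names, the `uniq_events_msg` index name, the 5K chunk size, and the use of `INSERT OR IGNORE` come from the description above.

```python
import sqlite3
from typing import Callable, Dict, Optional, Tuple

# Hypothetical normalizer signature: joined row -> event dict, or None to skip.
Normalizer = Callable[[sqlite3.Row], Optional[dict]]

# Hypothetical minimal schema; the real tables carry more columns.
SCHEMA = """
CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sessions (id INTEGER PRIMARY KEY, project_id INTEGER, provider TEXT);
CREATE TABLE messages (id INTEGER PRIMARY KEY, session_id INTEGER, role TEXT, cost_usd REAL);
CREATE TABLE usage_events (message_id INTEGER, provider TEXT, project TEXT, cost_usd REAL);
CREATE UNIQUE INDEX uniq_events_msg ON usage_events (message_id);
"""

JOIN_SQL = """
SELECT m.id, m.role, m.cost_usd, s.provider, p.name AS project
FROM messages m
JOIN sessions s ON s.id = m.session_id
JOIN projects p ON p.id = s.project_id
WHERE m.id > ?
ORDER BY m.id
LIMIT ?
"""

INSERT_SQL = """
INSERT OR IGNORE INTO usage_events (message_id, provider, project, cost_usd)
VALUES (:message_id, :provider, :project, :cost_usd)
"""


def backfill(
    conn: sqlite3.Connection,
    normalizers: Dict[str, Normalizer],
    chunk_size: int = 5000,
) -> Tuple[int, int]:
    """Return (inserted, skipped_duplicate) after streaming every message once."""
    conn.row_factory = sqlite3.Row
    inserted = skipped = 0
    last_id = 0
    while True:
        rows = conn.execute(JOIN_SQL, (last_id, chunk_size)).fetchall()
        if not rows:
            break
        # One transaction per chunk: a failure rolls back only this chunk,
        # and earlier chunks stay committed.
        with conn:
            for row in rows:
                normalize = normalizers.get(row["provider"])
                event = normalize(row) if normalize else None
                if event is None:
                    continue
                cur = conn.execute(INSERT_SQL, event)
                # rowcount is 0 when the unique index made SQLite ignore the row.
                if cur.rowcount == 1:
                    inserted += 1
                else:
                    skipped += 1
        last_id = rows[-1]["id"]
    return inserted, skipped
```

Because the unique index rejects a second event for the same message, calling `backfill()` again over the same store reports zero inserts and counts every row as a skipped duplicate, which is the behaviour the smoke test checks for.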
Smoke test on the user's real store (247,278 messages):
- first run: 150,337 events inserted in 225s; 940 daily, 841
session, 151 project, 146 provider_day, 184 model_day rows;
total cost $64,974.36 (matches daily_mart total exactly)
- re-run: 0 inserted, 150,337 skipped duplicate, 29s — fully
idempotent
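The "matches daily_mart total exactly" check can be expressed as a small reconciliation query. The sketch below is an assumption-laden illustration: the `ts` column, the `day` column, and the helper names `rebuild_daily_mart`, `cost_totals`, and `costs_reconcile` are hypothetical; only `usage_events`, `daily_mart`, and `cost_usd` are named in this PR.

```python
import math
import sqlite3
from typing import Tuple

# Hypothetical minimal schema; the real tables have more columns.
SCHEMA = """
CREATE TABLE usage_events (message_id INTEGER PRIMARY KEY, ts TEXT, cost_usd REAL);
CREATE TABLE daily_mart (day TEXT PRIMARY KEY, cost_usd REAL);
"""


def rebuild_daily_mart(conn: sqlite3.Connection) -> None:
    """Recompute the per-day cost rollup from scratch (stand-in for a mart build)."""
    with conn:
        conn.execute("DELETE FROM daily_mart")
        conn.execute(
            "INSERT INTO daily_mart (day, cost_usd) "
            "SELECT date(ts), SUM(cost_usd) FROM usage_events GROUP BY date(ts)"
        )


def cost_totals(conn: sqlite3.Connection) -> Tuple[float, float]:
    """Return (total cost over events, total cost over the daily mart)."""
    events = conn.execute(
        "SELECT COALESCE(SUM(cost_usd), 0) FROM usage_events"
    ).fetchone()[0]
    mart = conn.execute(
        "SELECT COALESCE(SUM(cost_usd), 0) FROM daily_mart"
    ).fetchone()[0]
    return events, mart


def costs_reconcile(conn: sqlite3.Connection, abs_tol: float = 1e-6) -> bool:
    """True when the mart total agrees with the event total within abs_tol."""
    events, mart = cost_totals(conn)
    return math.isclose(events, mart, abs_tol=abs_tol)
```

A tolerance is used because floating-point sums grouped by day need not be bit-identical to a single flat sum; a store that keeps costs as integers could compare for strict equality instead.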
Test count: 1474 → 1487 (+13). 7 new in
`tests/stackunderflow/etl/test_backfill_real.py` (events per
provider, idempotent re-run, force rebuild end-state, mart watermark
advance, all-five-marts populate, daily-mart cost matches events,
partial-state recovery), 6 new in
`tests/stackunderflow/ingest/test_writer_normalizes.py` (5-record
ingest yields 5 events, daily mart populates, idempotent re-ingest,
unregistered-provider no-op, role-filter,
normalize-failure-doesn't-fail-ingest).
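The writer-hook behaviours these tests name (events written in the same transaction as the messages rows, an unregistered provider being a no-op, a role filter, and a normalizer failure not failing the ingest) might look like the sketch below. Everything here is hypothetical: the schema, the `ingest_file()` function, the `assistant_only` normalizer, and the choice of `assistant` as the only role that yields events are assumptions for illustration, not the contents of `writer.py`.

```python
import logging
import sqlite3
from typing import Callable, Dict, Iterable, Optional

log = logging.getLogger("ingest_sketch")

# Hypothetical minimal schema; raw_cost is text so a bad value can be stored.
SCHEMA = """
CREATE TABLE messages (id INTEGER PRIMARY KEY, role TEXT, raw_cost TEXT);
CREATE TABLE usage_events (message_id INTEGER, cost_usd REAL);
CREATE UNIQUE INDEX uniq_events_msg ON usage_events (message_id);
"""

# Hypothetical normalizer signature: (message id, raw record) -> event or None.
Normalizer = Callable[[int, dict], Optional[dict]]


def assistant_only(message_id: int, rec: dict) -> Optional[dict]:
    """Example normalizer: only assistant messages become events (assumed rule)."""
    if rec["role"] != "assistant":
        return None
    return {"message_id": message_id, "cost_usd": float(rec["cost"])}


def ingest_file(
    conn: sqlite3.Connection,
    provider: str,
    records: Iterable[dict],
    normalizers: Dict[str, Normalizer],
) -> int:
    """Insert message rows and their events in one transaction; return event count."""
    normalize = normalizers.get(provider)  # None -> provider has no normalizer
    events = 0
    with conn:  # messages and events commit (or roll back) together
        for rec in records:
            cur = conn.execute(
                "INSERT INTO messages (role, raw_cost) VALUES (?, ?)",
                (rec["role"], rec["cost"]),
            )
            if normalize is None:
                continue
            try:
                event = normalize(cur.lastrowid, rec)
            except Exception:
                # A normalizer bug must not lose the message row.
                log.warning("normalizer failed for message %s", cur.lastrowid)
                continue
            if event is None:
                continue
            events += conn.execute(
                "INSERT OR IGNORE INTO usage_events (message_id, cost_usd) "
                "VALUES (:message_id, :cost_usd)",
                event,
            ).rowcount
    return events
```

Catching the exception per record keeps the message insert path independent of normalizer quality; the trade-off is that a failed record leaves a message without an event until a later backfill pass picks it up.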
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
3b1e9ed to
9aee3ae
0bserver07
added a commit
that referenced
this pull request
May 6, 2026
Bumps to 0.7.0. Consolidates the [Unreleased] CHANGELOG entries from the 11 ETL PRs (#72, #73, #74, #75, #76, #79, #81, #80, #78, #77, #82) into a single [0.7.0] section.
New: docs/HANDOFF.md, a state-of-the-codebase walkthrough for incoming agents. Architecture map, recent history, key gotchas, what's left, files-to-read-first.
End-state on the maintainer's real store:
- 150,337 usage_events
- Marts populated and watermarks in sync
- Dashboard cold-load 2.5s → <50ms warm
- Watcher 155ms end-to-end source-file-write → dashboard-data-fresh
1598 backend tests passing, 2 skipped, 11 deselected (slow suite). Frontend typecheck + build clean.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
0bserver07
added a commit
that referenced
this pull request
May 20, 2026
Summary
Wave 4B fills in the body Wave 1 skeleton-locked.
`stackunderflow etl backfill` now actually populates `usage_events` against real data; the ingest writer gains a per-record normalize+insert hook so newly-ingested messages auto-create events without needing a backfill pass.
* `stackunderflow/etl/backfill.py` streams the `messages → sessions → projects` join in 5K-row chunks, one transaction per chunk so a partial failure leaves a recoverable state. Each row dispatches to its provider's registered normalizer (Wave 2A); events insert via `INSERT OR IGNORE` against `uniq_events_msg`, so re-runs are idempotent. `force=True` wipes events + watermarks and rebuilds every mart via `MartBuilder.rebuild_from_scratch` before the normalize pass starts fresh.
* `stackunderflow/ingest/writer.py` runs the matching normalizer over the messages a per-file ingest just inserted, in the same transaction. The shared `normalize_and_insert_event()` helper is now the single source of truth for the events-row INSERT SQL: both the writer hook and the backfill orchestrator route through it. Marts auto-refresh via `refresh_all_marts()` after the per-file commit.
* `stackunderflow/cli.py`: the `etl backfill` body gains a `tqdm` progress bar (falls back to periodic log lines when tqdm is absent), improved final summary, and clearer help text now that the command is no longer a no-op.
Smoke test (user's real store)
Real store: 247,278 messages, 0 events before this PR.
Marts after first run: 940 daily, 841 session, 151 project, 146 provider_day, 184 model_day. Total cost: $64,974.36, exactly equal to `SUM(daily_mart.cost_usd)`. Watermarks for all 5 marts at 150,337.
Test plan
- `pytest tests/ -q` clean: 1487 passed, 2 skipped (was 1474; +13 new tests across `test_backfill_real.py` and `test_writer_normalizes.py`)
- `test_backfill.py` (Wave 1 shape contract) still passes: orchestrator surface unchanged
- `test_writer.py` (5 tests) still passes: message-insert path unchanged
- `/api/projects`, `/api/dashboard-data`, `/api/cost-data` now read from populated marts instead of falling through to aggregator
🤖 Generated with Claude Code